发现Youtube上有一个视频叫Evolution of clusterProfiler, 是Landon Wilkins用Gource做的。于是我也来玩一下,看一下自己这几年码代码的过程。
As an author of several Bioconductor packages. I found many questions from users are quite annoying. Some of them never use google and they are reluctant to read vignettes.
Step 1: make sure you are using the latest release
I found many peoples are using out-of-date packages. When they got an issue of an out-dated package, they never check whether the issue still exists in latest release.
We are happy to announce that ggtree
supports interactive tree annotation/manipulation by implementing an identify
method. Users can click on a node to highlight a clade, to label or rotate it etc.
Here is an example of highlighting clades using geom_hilight
with identify
:
不记得是什么时候知道统计之都的,但我记得最早知道的是太云,因为我用了他写的corrplot包。后来统计之都最早接触的也是太云,他给我写邮件问我能不能帮忙校对《ggplot2:数据分析与图形艺术》,从此开始和太云变成了网友。
我在暨大的时候,太云曾经邀请我去China-R会议做报告,但我觉得自己没什么好分享的,GOSemSim这个包是硕士的时候做的,不好去讲之前做的东西。而当时我写的另一个包clusterProfiler,纯粹是因为大量做富集分析的工具都是针对模式生物,而我们实验室有做各种细菌;另外有一些工具,背景设置是有问题的。自己实现一个包,不受别人的限制。即便是这个包现在受到了一定的认可,比如BioC 3.3中有个debrowser的包使用了clusterProfiler,而在BioC 3.4中又有个新包bioCancer也使用了clusterProfiler;再比如这次在北京,有好几个参会的人员在茶歇时问了clusterProfiler的问题。但始终觉得这只是个实用性的包而已,算法是别人的,而且已经比较老了,类似的工具简直就是成百上千。所以也是不好意思拿出来讲的。所以我拒绝了太云的邀请,一直也没有参加China-R的会议。
今年是第九届China-R会议,这次会议规模很大,有22个分会场,超过100个演讲嘉宾,参会人数超过4000人。这一次刚好有个Bioconductor的分会场,Matt写信给我,说我写过几个Bioconductor包,他本人喜欢我的ChIPseeker包,问我能否在会上分享与Bioconductor包相关的经验。这是Bioconductor在中国的首秀,我欣然接受,当然也是因为这两年我写了ChIPseeker和ggtree,我自己觉得还拿得出手🙈。
Today is my birthday and it happened to be the release day of Bioconductor 3.3. It’s again the time to reflect what I’ve done in the past year.
bitr_kegg
clusterProfiler
can convert biological IDs using OrgDb
object via the bitr
function. Now I implemented another function, bitr_kegg
for converting IDs through KEGG API.
library(clusterProfiler)
data(gcSample)
hg <- gcSample[[1]]
head(hg)
## [1] "4597" "7111" "5266" "2175" "755" "23046"
eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')
## Warning in bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid",
## organism = "hsa"): 3.7% of input gene IDs are fail to map...
head(eg2np)
## kegg ncbi-proteinid
## 1 8326 NP_003499
## 2 58487 NP_001034707
## 3 139081 NP_619647
## 4 59272 NP_068576
## 5 993 NP_001780
## 6 2676 NP_001487
np2up <- bitr_kegg(eg2np[,2], fromType='ncbi-proteinid', toType='uniprot', organism='hsa')
head(np2up)
## ncbi-proteinid uniprot
## 1 NP_005457 O75586
## 2 NP_005792 P41567
## 3 NP_005792 Q6IAV3
## 4 NP_037536 Q13421
## 5 NP_006054 O60662
## 6 NP_001092002 O95398
The ID type (both fromType & toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is entrezgene
ID for eukaryote species and Locus
ID for prokaryotes.
KEGG MODULE is a collection of manually defined functional units, called KEGG modules and identified by the M numbers, used for annotation and biological interpretation of sequenced genomes. There are four types of KEGG modules:
- pathway modules – representing tight functional units in KEGG metabolic pathway maps, such as M00002 (Glycolysis, core module involving three-carbon compounds)
- structural complexes – often forming molecular machineries, such as M00072 (Oligosaccharyltransferase)
- functional sets – for other types of essential sets, such as M00360 (Aminoacyl-tRNA synthases, prokaryotes)
- signature modules – as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin)