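I found a video on YouTube called Evolution of clusterProfiler, made by Landon Wilkins with Gource. So I decided to try it myself and look back at how I have been writing code over the past few years.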
Today is my birthday, and it happens to be the release day of Bioconductor 3.3. It’s once again time to reflect on what I’ve done in the past year.
bitr_kegg
clusterProfiler can convert biological IDs using an OrgDb object via the bitr function. I have now implemented another function, bitr_kegg, for converting IDs through the KEGG API.
library(clusterProfiler)
data(gcSample)
hg <- gcSample[[1]]
head(hg)
## [1] "4597" "7111" "5266" "2175" "755" "23046"
eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')
## Warning in bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid",
## organism = "hsa"): 3.7% of input gene IDs are fail to map...
head(eg2np)
##     kegg ncbi-proteinid
## 1   8326      NP_003499
## 2  58487   NP_001034707
## 3 139081      NP_619647
## 4  59272      NP_068576
## 5    993      NP_001780
## 6   2676      NP_001487
np2up <- bitr_kegg(eg2np[,2], fromType='ncbi-proteinid', toType='uniprot', organism='hsa')
head(np2up)
##   ncbi-proteinid uniprot
## 1      NP_005457  O75586
## 2      NP_005792  P41567
## 3      NP_005792  Q6IAV3
## 4      NP_037536  Q13421
## 5      NP_006054  O60662
## 6   NP_001092002  O95398
The ID type (both fromType and toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. ‘kegg’ is the primary ID used in the KEGG database. KEGG takes its data from NCBI, so as a rule of thumb the ‘kegg’ ID is the entrezgene ID for eukaryotic species and the Locus ID for prokaryotes.
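The two conversions above can be chained into a single KEGG-to-UniProt table by joining on the shared ncbi-proteinid column. Here is a minimal sketch using base R merge(); the toy data frames only stand in for eg2np and np2up, and every ID in them is a made-up placeholder rather than a real mapping.

```r
# Toy stand-ins for the data frames returned by bitr_kegg();
# the IDs are placeholders, not real KEGG/NCBI/UniProt mappings.
eg2np <- data.frame(
  kegg = c("g1", "g2", "g3"),
  `ncbi-proteinid` = c("NP_a", "NP_b", "NP_c"),
  check.names = FALSE, stringsAsFactors = FALSE
)
np2up <- data.frame(
  `ncbi-proteinid` = c("NP_a", "NP_b", "NP_b"),
  uniprot = c("U1", "U2", "U3"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Inner join on the shared column; one protein ID may map to
# several UniProt accessions, so rows can multiply.
eg2up <- merge(eg2np, np2up, by = "ncbi-proteinid")
eg2up <- eg2up[, c("kegg", "ncbi-proteinid", "uniprot")]

# Input IDs that dropped out because no UniProt match was found.
unmapped <- setdiff(eg2np$kegg, eg2up$kegg)
```

An inner join silently drops IDs that fail at either step, so it is worth checking `unmapped` before using the result downstream.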
KEGG MODULE is a collection of manually defined functional units, called KEGG modules and identified by the M numbers, used for annotation and biological interpretation of sequenced genomes. There are four types of KEGG modules:
- pathway modules – representing tight functional units in KEGG metabolic pathway maps, such as M00002 (Glycolysis, core module involving three-carbon compounds)
- structural complexes – often forming molecular machineries, such as M00072 (Oligosaccharyltransferase)
- functional sets – for other types of essential sets, such as M00360 (Aminoacyl-tRNA synthases, prokaryotes)
- signature modules – as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin)
clusterProfiler supports over-representation test and gene set enrichment analysis of Gene Ontology. It supports GO annotation from OrgDb object, GMT file and user’s own data.
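For readers new to the over-representation test: it asks whether a term annotates more genes in the input list than expected by chance, which is commonly computed as a hypergeometric tail probability. The sketch below uses base R only; the counts are made up for illustration, and it is not meant as a trace of clusterProfiler's internals.

```r
# Made-up counts for illustration only.
N <- 20000  # background genes that have any annotation
M <- 200    # background genes annotated to the term of interest
n <- 500    # genes in the input list
k <- 15     # input genes annotated to the term

# P(X >= k) when drawing n genes without replacement.
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table.
tab <- matrix(c(k, M - k, n - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

With these counts about 5 annotated genes are expected by chance (500 × 200 / 20000), so observing 15 gives a small p-value.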
support many species
In the GitHub version of clusterProfiler, the enrichGO and gseGO functions drop the organism parameter and add an OrgDb parameter instead, so that any species with an OrgDb object available can be analyzed in clusterProfiler. Bioconductor already provides OrgDb packages for about 20 species (see http://bioconductor.org/packages/release/BiocViews.html#___OrgDb), and users can build an OrgDb via AnnotationHub.
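As a rough sketch of what the OrgDb parameter makes possible, the wrapper below runs enrichGO against whichever OrgDb annotation package is named. `go_enrich_any_species` is a hypothetical helper written for this example, not part of clusterProfiler, and it assumes the gene IDs are of the type enrichGO expects by default.

```r
# Hypothetical wrapper: GO enrichment for any species with an OrgDb package.
# Returns NULL (with a message) when the needed packages are not installed.
go_enrich_any_species <- function(genes, orgdb_pkg = "org.Hs.eg.db", ont = "CC") {
  if (!requireNamespace("clusterProfiler", quietly = TRUE) ||
      !requireNamespace(orgdb_pkg, quietly = TRUE)) {
    message("clusterProfiler or ", orgdb_pkg, " is not installed; skipping.")
    return(invisible(NULL))
  }
  # OrgDb packages export an object named after the package itself.
  orgdb <- getExportedValue(orgdb_pkg, orgdb_pkg)
  clusterProfiler::enrichGO(gene = genes, OrgDb = orgdb, ont = ont)
}
```

Switching species is then a matter of changing one argument, for example `go_enrich_any_species(my_genes, "org.Mm.eg.db")` for mouse, where `my_genes` is a placeholder for your own ID vector.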
Thanks to @mevers for raising the issue with me and for his efforts in benchmarking clusterProfiler.
He pointed out two issues:
- outputs from gseGO and GSEA-P overlap poorly.
- p-values from gseGO are generally smaller and don’t show a lot of variation.
For GSEA analysis, we have two inputs, a ranked gene list and gene set collections.
First of all, the gene set collections are very different. The GMT file used in his test is c5.cc.v5.0.symbols.gmt, which is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.
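One way to see how different two gene set collections are is to load both and measure their overlap. The sketch below parses GMT-formatted lines (set name, description, then genes, all tab-separated) in base R and computes a Jaccard index; `parse_gmt` and `jaccard` are helper names invented for this example, and the toy sets are made up.

```r
# Parse GMT lines: name <TAB> description <TAB> gene1 <TAB> gene2 ...
parse_gmt <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  sets <- lapply(fields, function(x) x[-(1:2)])
  names(sets) <- vapply(fields, `[`, character(1), 1)
  sets
}

# Jaccard index: shared elements divided by all distinct elements.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Made-up toy collections standing in for two gene set sources.
small <- parse_gmt("SET_A\tna\tg1\tg2\tg3")
full  <- parse_gmt(c("SET_A\tna\tg1\tg2\tg3\tg4",
                     "SET_B\tna\tg5\tg6",
                     "SET_C\tna\tg7\tg8"))

# Overlap of the collections by set name, then gene overlap for a shared set.
set_overlap  <- jaccard(names(small), names(full))
gene_overlap <- jaccard(small$SET_A, full$SET_A)
```

For a real file, the lines would come from something like `readLines("c5.cc.v5.0.symbols.gmt")`; a low `set_overlap` already explains much of the disagreement before any statistics are compared.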
To simplify an enriched GO result, we can use a slim version of GO and analyze it with the enricher function.
Another strategy is to use GOSemSim to calculate the similarity of GO terms and remove highly similar terms, keeping one representative term. To make this feature available to clusterProfiler users, I developed a simplify method to reduce redundant GO terms in the output of the enrichGO function.
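library(clusterProfiler)
# geneList: named vector of fold changes shipped with DOSE
data(geneList, package="DOSE")
# keep genes whose absolute fold change exceeds 2
de <- names(geneList)[abs(geneList) > 2]
# over-representation test on GO Biological Process
# (newer clusterProfiler versions also require an OrgDb argument here)
bp <- enrichGO(de, ont="BP")
# plot the enriched terms as a network
enrichMap(bp)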
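The redundancy-removal strategy described above can be illustrated with a greedy filter over a term-by-term similarity matrix. This is only a sketch under stated assumptions: `remove_redundant` is a hypothetical helper, the 0.7 cutoff is an arbitrary choice for the example, and the similarity values and p-values are made up rather than computed by GOSemSim. It is not presented as the implementation of the simplify method.

```r
# Greedy redundancy removal: visit terms from most to least significant and
# keep a term only if it is not too similar to any term already kept.
remove_redundant <- function(sim, pvalue, cutoff = 0.7) {
  stopifnot(identical(rownames(sim), colnames(sim)),
            identical(rownames(sim), names(pvalue)))
  kept <- character(0)
  for (term in names(sort(pvalue))) {
    if (length(kept) == 0 || all(sim[term, kept] < cutoff)) {
      kept <- c(kept, term)
    }
  }
  kept
}

# Made-up similarity matrix and p-values for three placeholder terms.
terms <- c("T1", "T2", "T3")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.3,
                0.2, 0.3, 1.0),
              nrow = 3, dimnames = list(terms, terms))
pvalue <- c(T1 = 0.001, T2 = 0.0005, T3 = 0.01)

representatives <- remove_redundant(sim, pvalue)
```

Here T1 and T2 are highly similar, so only the more significant of the two (T2) survives, while the unrelated T3 is kept as its own representative.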