KEGG enrichment analysis with latest online data using clusterProfiler
KEGG.db is not updated since 2012. The data is now pretty old, but many of the Bioconductor packages still using it for KEGG annotation and enrichment analysis. As pointed out in ‘Are there too many biological databases’, there is a problem that many out of date biological databases often don’t get offline. This issue also exists in web-server or software that using out-of-date data. For example, the WEGO web-server stopped updating GO annotation data since 2009, and WEGO still online with many people using it. The biological story may changed totally if using a recently updated data. Seriously, We should keep an eye on this issue.
Now enrichKEGG function is reloaded with a new parameter use_internal_data. This parameter is by default setting to FALSE, and enrichKEGG function will download the latest KEGG data for enrichment analysis. If the parameter use_internal_data is explicitly setting to TRUE, it will use the KEGG.db which is still supported but not recommended. With this new feature, supported species is unlimited if only there are KEGG annotations available in KEGG database. You can access the full list of species supported by KEGG via: http://www.genome.jp/kegg/catalog/org_list.html Now the organism parameter in enrichKEGG should be abbreviation of academic name, for example ‘hsa’ for human and ‘mmu’ for mouse. It accepts any species listed in http://www.genome.jp/kegg/catalog/org_list.html. In the current release version of clusterProfiler (in Bioconductor 3.0), enrichKEGG supports about 20 species, and the organism parameter accept common name of species, for instance “human” and “mouse”. For these previously supported species, common name is also supported. So that you script is still working with new version of clusterProfiler. For other species, common name is not supported, since I don’t want to maintain such a long mapping list with many species have no common name available and it may also introduce unexpected bugs.
Example 1: Using online KEGG annotation
library(DOSE)
data(geneList)
de <- names(geneList)[geneList > 1]
library(clusterProfiler)
kk <- enrichKEGG(de, organism="hsa", pvalueCutoff=0.05, pAdjustMethod="BH",
qvalueCutoff=0.1, readable=TRUE)
head(summary(kk))
> head(summary(kk))
ID Description GeneRatio BgRatio
hsa04110 hsa04110 Cell cycle 31/247 124/6861
hsa03030 hsa03030 DNA replication 9/247 36/6861
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 25/247 265/6861
hsa04114 hsa04114 Oocyte meiosis 14/247 113/6861
hsa04115 hsa04115 p53 signaling pathway 10/247 68/6861
hsa04062 hsa04062 Chemokine signaling pathway 18/247 189/6861
pvalue p.adjust qvalue
hsa04110 2.280256e-18 9.349050e-17 5.280593e-17
hsa03030 3.527197e-06 7.230753e-05 4.084123e-05
hsa04060 8.404037e-06 1.148552e-04 6.487326e-05
hsa04114 4.827484e-05 4.948171e-04 2.794859e-04
hsa04115 1.406946e-04 9.801620e-04 5.536217e-04
hsa04062 1.434383e-04 9.801620e-04 5.536217e-04
geneID
hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1/MCM2/CDC25A/CDC6/PLK1/BUB1B/ESPL1/CCNE1/ORC6/ORC1/CCNE2/MCM6/MCM4/DBF4/SKP2/CDC25B/BUB1/MYC/PCNA/E2F1/CDKN2A
hsa03030 MCM5/MCM2/MCM6/MCM4/FEN1/RFC4/PCNA/RNASEH2A/DNA2
hsa04060 CXCL10/CXCL13/CXCL11/CXCL9/CCL18/IL1R2/CCL8/CXCL3/CCL20/IL12RB2/CXCL8/TNFRSF11A/CCL5/CXCR6/IL2RA/CCR1/CCL2/IL2RG/CCL4/CCR8/CCR7/GDF5/IL24/LTB/IL7
hsa04114 CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA/CCNB1/PTTG1/PLK1/ESPL1/CCNE1/CCNE2/BUB1/FBXO5
hsa04115 CCNB2/RRM2/CDK1/CHEK1/CCNB1/GTSE1/CCNE1/CCNE2/SERPINB5/CDKN2A
hsa04062 CXCL10/CXCL13/CXCL11/CXCL9/CCL18/CCL8/CXCL3/CCL20/CXCL8/CCL5/CXCR6/CCR1/STAT1/CCL2/CCL4/HCK/CCR8/CCR7
Count
hsa04110 31
hsa03030 9
hsa04060 25
hsa04114 14
hsa04115 10
hsa04062 18
>
In the KEGG.db, there are only 5894 human genes annotated. With current online data, the number of annotated gene increase to 6861 as shown above and of course, p-values changed. User should pay attention to another issue that readable parameter is only available for those species that has an annotation db. For example, for human we use org.Hs.eg.db for mapping gene ID to Symbol.
Example 2: enrichment analysis of species which are not previously supported
Here, I use a gene list of Streptococcus pneumoniae D39 to demonstrate using enrichKEGG with species that are not supported previously.
> gene
[1] "SPD_0065" "SPD_0071" "SPD_0293" "SPD_0295" "SPD_0296" "SPD_0297"
[7] "SPD_0327" "SPD_0426" "SPD_0427" "SPD_0428" "SPD_0559" "SPD_0560"
[13] "SPD_0561" "SPD_0562" "SPD_0580" "SPD_0789" "SPD_1046" "SPD_1047"
[19] "SPD_1048" "SPD_1050" "SPD_1051" "SPD_1052" "SPD_1053" "SPD_1057"
[25] "SPD_1326" "SPD_1432" "SPD_1534" "SPD_1582" "SPD_1612" "SPD_1613"
[31] "SPD_1633" "SPD_1634" "SPD_1648" "SPD_1678" "SPD_1919"
> spdKEGG = enrichKEGG(gene, organism="spd")
> summary(spdKEGG)
ID Description GeneRatio BgRatio
spd00052 spd00052 Galactose metabolism 35/35 35/752
spd02060 spd02060 Phosphotransferase system (PTS) 12/35 47/752
spd01100 spd01100 Metabolic pathways 28/35 341/752
spd00520 spd00520 Amino sugar and nucleotide sugar metabolism 9/35 43/752
pvalue p.adjust qvalue
spd00052 4.961477e-61 2.480739e-60 5.222608e-61
spd02060 2.470177e-07 6.175443e-07 1.300093e-07
spd01100 1.958319e-05 3.263866e-05 6.871296e-06
spd00520 6.534975e-05 8.168718e-05 1.719730e-05
geneID
spd00052 SPD_0065/SPD_0071/SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0327/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1678/SPD_1919
spd02060 SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0426/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_1047/SPD_1048/SPD_1057
spd01100 SPD_0071/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
spd00520 SPD_0580/SPD_1326/SPD_1432/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
Count
spd00052 35
spd02060 12
spd01100 28
spd00520 9
Summary
To summarize, clusterProfiler supports downloading the latest KEGG annotation for enrichment analysis and it supports all species that have KEGG annotation available in KEGG database.
Citation
Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.