KEGG enrichment analysis with latest online data using clusterProfiler
KEGG.db is not updated since 2012. The data is now pretty old, but many of the Bioconductor packages still using it for KEGG annotation and enrichment analysis. As pointed out in ‘Are there too many biological databases', there is a problem that many out of date biological databases often don’t get offline. This issue also exists in web-server or software that using out-of-date data. For example, the WEGO web-server stopped updating GO annotation data since 2009, and WEGO still online with many people using it. The biological story may changed totally if using a recently updated data. Seriously, We should keep an eye on this issue.
Now enrichKEGG function is reloaded with a new parameter use_internal_data. This parameter is by default setting to FALSE, and enrichKEGG function will download the latest KEGG data for enrichment analysis. If the parameter use_internal_data is explicitly setting to TRUE, it will use the KEGG.db which is still supported but not recommended. With this new feature, supported species is unlimited if only there are KEGG annotations available in KEGG database. You can access the full list of species supported by KEGG via: http://www.genome.jp/kegg/catalog/org_list.html Now the organism parameter in enrichKEGG should be abbreviation of academic name, for example ‘hsa’ for human and ‘mmu’ for mouse. It accepts any species listed in http://www.genome.jp/kegg/catalog/org_list.html. In the current release version of clusterProfiler (in Bioconductor 3.0), enrichKEGG supports about 20 species, and the organism parameter accept common name of species, for instance “human” and “mouse”. For these previously supported species, common name is also supported. So that you script is still working with new version of clusterProfiler. For other species, common name is not supported, since I don’t want to maintain such a long mapping list with many species have no common name available and it may also introduce unexpected bugs.
Example 1: Using online KEGG annotation
library(DOSE) data(geneList) de <- names(geneList)[geneList > 1] library(clusterProfiler) kk <- enrichKEGG(de, organism="hsa", pvalueCutoff=0.05, pAdjustMethod="BH", qvalueCutoff=0.1, readable=TRUE) head(summary(kk))
> head(summary(kk)) ID Description GeneRatio BgRatio hsa04110 hsa04110 Cell cycle 31/247 124/6861 hsa03030 hsa03030 DNA replication 9/247 36/6861 hsa04060 hsa04060 Cytokine-cytokine receptor interaction 25/247 265/6861 hsa04114 hsa04114 Oocyte meiosis 14/247 113/6861 hsa04115 hsa04115 p53 signaling pathway 10/247 68/6861 hsa04062 hsa04062 Chemokine signaling pathway 18/247 189/6861 pvalue p.adjust qvalue hsa04110 2.280256e-18 9.349050e-17 5.280593e-17 hsa03030 3.527197e-06 7.230753e-05 4.084123e-05 hsa04060 8.404037e-06 1.148552e-04 6.487326e-05 hsa04114 4.827484e-05 4.948171e-04 2.794859e-04 hsa04115 1.406946e-04 9.801620e-04 5.536217e-04 hsa04062 1.434383e-04 9.801620e-04 5.536217e-04 geneID hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1/MCM2/CDC25A/CDC6/PLK1/BUB1B/ESPL1/CCNE1/ORC6/ORC1/CCNE2/MCM6/MCM4/DBF4/SKP2/CDC25B/BUB1/MYC/PCNA/E2F1/CDKN2A hsa03030 MCM5/MCM2/MCM6/MCM4/FEN1/RFC4/PCNA/RNASEH2A/DNA2 hsa04060 CXCL10/CXCL13/CXCL11/CXCL9/CCL18/IL1R2/CCL8/CXCL3/CCL20/IL12RB2/CXCL8/TNFRSF11A/CCL5/CXCR6/IL2RA/CCR1/CCL2/IL2RG/CCL4/CCR8/CCR7/GDF5/IL24/LTB/IL7 hsa04114 CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA/CCNB1/PTTG1/PLK1/ESPL1/CCNE1/CCNE2/BUB1/FBXO5 hsa04115 CCNB2/RRM2/CDK1/CHEK1/CCNB1/GTSE1/CCNE1/CCNE2/SERPINB5/CDKN2A hsa04062 CXCL10/CXCL13/CXCL11/CXCL9/CCL18/CCL8/CXCL3/CCL20/CXCL8/CCL5/CXCR6/CCR1/STAT1/CCL2/CCL4/HCK/CCR8/CCR7 Count hsa04110 31 hsa03030 9 hsa04060 25 hsa04114 14 hsa04115 10 hsa04062 18 >
In the KEGG.db, there are only 5894 human genes annotated. With current online data, the number of annotated gene increase to 6861 as shown above and of course, p-values changed. User should pay attention to another issue that readable parameter is only available for those species that has an annotation db. For example, for human we use org.Hs.eg.db for mapping gene ID to Symbol.
Example 2: enrichment analysis of species which are not previously supported
Here, I use a gene list of Streptococcus pneumoniae D39 to demonstrate using enrichKEGG with species that are not supported previously.
> gene  "SPD_0065" "SPD_0071" "SPD_0293" "SPD_0295" "SPD_0296" "SPD_0297"  "SPD_0327" "SPD_0426" "SPD_0427" "SPD_0428" "SPD_0559" "SPD_0560"  "SPD_0561" "SPD_0562" "SPD_0580" "SPD_0789" "SPD_1046" "SPD_1047"  "SPD_1048" "SPD_1050" "SPD_1051" "SPD_1052" "SPD_1053" "SPD_1057"  "SPD_1326" "SPD_1432" "SPD_1534" "SPD_1582" "SPD_1612" "SPD_1613"  "SPD_1633" "SPD_1634" "SPD_1648" "SPD_1678" "SPD_1919" > spdKEGG = enrichKEGG(gene, organism="spd") > summary(spdKEGG) ID Description GeneRatio BgRatio spd00052 spd00052 Galactose metabolism 35/35 35/752 spd02060 spd02060 Phosphotransferase system (PTS) 12/35 47/752 spd01100 spd01100 Metabolic pathways 28/35 341/752 spd00520 spd00520 Amino sugar and nucleotide sugar metabolism 9/35 43/752 pvalue p.adjust qvalue spd00052 4.961477e-61 2.480739e-60 5.222608e-61 spd02060 2.470177e-07 6.175443e-07 1.300093e-07 spd01100 1.958319e-05 3.263866e-05 6.871296e-06 spd00520 6.534975e-05 8.168718e-05 1.719730e-05 geneID spd00052 SPD_0065/SPD_0071/SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0327/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1678/SPD_1919 spd02060 SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0426/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_1047/SPD_1048/SPD_1057 spd01100 SPD_0071/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919 spd00520 SPD_0580/SPD_1326/SPD_1432/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919 Count spd00052 35 spd02060 12 spd01100 28 spd00520 9
To summarize, clusterProfiler supports downloading the latest KEGG annotation for enrichment analysis and it supports all species that have KEGG annotation available in KEGG database.
Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.