KEGG enrichment analysis with latest online data using clusterProfiler

KEGG.db is not updated since 2012. The data is now pretty old, but many of the Bioconductor packages still using it for KEGG annotation and enrichment analysis. As pointed out in ‘Are there too many biological databases’, there is a problem that many out of date biological databases often don’t get offline. This issue also exists in web-server or software that using out-of-date data. For example, the WEGO web-server stopped updating GO annotation data since 2009, and WEGO still online with many people using it. The biological story may changed totally if using a recently updated data. Seriously, We should keep an eye on this issue.

Now enrichKEGG function is reloaded with a new parameter use_internal_data. This parameter is by default setting to FALSE, and enrichKEGG function will download the latest KEGG data for enrichment analysis. If the parameter use_internal_data is explicitly setting to TRUE, it will use the KEGG.db which is still supported but not recommended. With this new feature, supported species is unlimited if only there are KEGG annotations available in KEGG database. You can access the full list of species supported by KEGG via: http://www.genome.jp/kegg/catalog/org_list.html Now the organism parameter in enrichKEGG should be abbreviation of academic name, for example ‘hsa’ for human and ‘mmu’ for mouse. It accepts any species listed in http://www.genome.jp/kegg/catalog/org_list.html. In the current release version of clusterProfiler (in Bioconductor 3.0), enrichKEGG supports about 20 species, and the organism parameter accept common name of species, for instance “human” and “mouse”. For these previously supported species, common name is also supported. So that you script is still working with new version of clusterProfiler. For other species, common name is not supported, since I don’t want to maintain such a long mapping list with many species have no common name available and it may also introduce unexpected bugs.

Example 1: Using online KEGG annotation

library(DOSE)
data(geneList)
de <- names(geneList)[geneList > 1]

library(clusterProfiler)
kk <- enrichKEGG(de, organism="hsa", pvalueCutoff=0.05, pAdjustMethod="BH", 
                 qvalueCutoff=0.1, readable=TRUE)
head(summary(kk))

> head(summary(kk))
               ID                            Description GeneRatio  BgRatio
hsa04110 hsa04110                             Cell cycle    31/247 124/6861
hsa03030 hsa03030                        DNA replication     9/247  36/6861
hsa04060 hsa04060 Cytokine-cytokine receptor interaction    25/247 265/6861
hsa04114 hsa04114                         Oocyte meiosis    14/247 113/6861
hsa04115 hsa04115                  p53 signaling pathway    10/247  68/6861
hsa04062 hsa04062            Chemokine signaling pathway    18/247 189/6861
               pvalue     p.adjust       qvalue
hsa04110 2.280256e-18 9.349050e-17 5.280593e-17
hsa03030 3.527197e-06 7.230753e-05 4.084123e-05
hsa04060 8.404037e-06 1.148552e-04 6.487326e-05
hsa04114 4.827484e-05 4.948171e-04 2.794859e-04
hsa04115 1.406946e-04 9.801620e-04 5.536217e-04
hsa04062 1.434383e-04 9.801620e-04 5.536217e-04
                                                                                                                                                                              geneID
hsa04110 CDC45/CDC20/CCNB2/CCNA2/CDK1/MAD2L1/TTK/CHEK1/CCNB1/MCM5/PTTG1/MCM2/CDC25A/CDC6/PLK1/BUB1B/ESPL1/CCNE1/ORC6/ORC1/CCNE2/MCM6/MCM4/DBF4/SKP2/CDC25B/BUB1/MYC/PCNA/E2F1/CDKN2A
hsa03030                                                                                                                            MCM5/MCM2/MCM6/MCM4/FEN1/RFC4/PCNA/RNASEH2A/DNA2
hsa04060                           CXCL10/CXCL13/CXCL11/CXCL9/CCL18/IL1R2/CCL8/CXCL3/CCL20/IL12RB2/CXCL8/TNFRSF11A/CCL5/CXCR6/IL2RA/CCR1/CCL2/IL2RG/CCL4/CCR8/CCR7/GDF5/IL24/LTB/IL7
hsa04114                                                                                          CDC20/CCNB2/CDK1/MAD2L1/CALML5/AURKA/CCNB1/PTTG1/PLK1/ESPL1/CCNE1/CCNE2/BUB1/FBXO5
hsa04115                                                                                                               CCNB2/RRM2/CDK1/CHEK1/CCNB1/GTSE1/CCNE1/CCNE2/SERPINB5/CDKN2A
hsa04062                                                                       CXCL10/CXCL13/CXCL11/CXCL9/CCL18/CCL8/CXCL3/CCL20/CXCL8/CCL5/CXCR6/CCR1/STAT1/CCL2/CCL4/HCK/CCR8/CCR7
         Count
hsa04110    31
hsa03030     9
hsa04060    25
hsa04114    14
hsa04115    10
hsa04062    18
>

In the KEGG.db, there are only 5894 human genes annotated. With current online data, the number of annotated gene increase to 6861 as shown above and of course, p-values changed. User should pay attention to another issue that readable parameter is only available for those species that has an annotation db. For example, for human we use org.Hs.eg.db for mapping gene ID to Symbol.

Example 2: enrichment analysis of species which are not previously supported

Here, I use a gene list of Streptococcus pneumoniae D39 to demonstrate using enrichKEGG with species that are not supported previously.

> gene
 [1] "SPD_0065" "SPD_0071" "SPD_0293" "SPD_0295" "SPD_0296" "SPD_0297"
 [7] "SPD_0327" "SPD_0426" "SPD_0427" "SPD_0428" "SPD_0559" "SPD_0560"
[13] "SPD_0561" "SPD_0562" "SPD_0580" "SPD_0789" "SPD_1046" "SPD_1047"
[19] "SPD_1048" "SPD_1050" "SPD_1051" "SPD_1052" "SPD_1053" "SPD_1057"
[25] "SPD_1326" "SPD_1432" "SPD_1534" "SPD_1582" "SPD_1612" "SPD_1613"
[31] "SPD_1633" "SPD_1634" "SPD_1648" "SPD_1678" "SPD_1919"
> spdKEGG = enrichKEGG(gene, organism="spd")
> summary(spdKEGG)
               ID                                 Description GeneRatio BgRatio
spd00052 spd00052                        Galactose metabolism     35/35  35/752
spd02060 spd02060             Phosphotransferase system (PTS)     12/35  47/752
spd01100 spd01100                          Metabolic pathways     28/35 341/752
spd00520 spd00520 Amino sugar and nucleotide sugar metabolism      9/35  43/752
               pvalue     p.adjust       qvalue
spd00052 4.961477e-61 2.480739e-60 5.222608e-61
spd02060 2.470177e-07 6.175443e-07 1.300093e-07
spd01100 1.958319e-05 3.263866e-05 6.871296e-06
spd00520 6.534975e-05 8.168718e-05 1.719730e-05
                                                                                                                                                                                                                                                                                                                             geneID
spd00052 SPD_0065/SPD_0071/SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0327/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1678/SPD_1919
spd02060                                                                                                                                                                                                                SPD_0293/SPD_0295/SPD_0296/SPD_0297/SPD_0426/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_1047/SPD_1048/SPD_1057
spd01100                                                                SPD_0071/SPD_0426/SPD_0427/SPD_0428/SPD_0559/SPD_0560/SPD_0561/SPD_0562/SPD_0580/SPD_0789/SPD_1046/SPD_1047/SPD_1048/SPD_1050/SPD_1051/SPD_1052/SPD_1053/SPD_1057/SPD_1326/SPD_1432/SPD_1534/SPD_1582/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
spd00520                                                                                                                                                                                                                                           SPD_0580/SPD_1326/SPD_1432/SPD_1612/SPD_1613/SPD_1633/SPD_1634/SPD_1648/SPD_1919
         Count
spd00052    35
spd02060    12
spd01100    28
spd00520     9

Summary

To summarize, clusterProfiler supports downloading the latest KEGG annotation for enrichment analysis and it supports all species that have KEGG annotation available in KEGG database.

Citation

Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

KEGG enrichment analysis with latest online data using clusterProfiler

Example 1: Using online KEGG annotation

Example 2: enrichment analysis of species which are not previously supported

Summary

Citation

Guangchuang Yu