use clusterProfiler as an universal enrichment analysis tool
clusterProfiler supports enrichment analysis of both hypergeometric test and gene set enrichment analysis. It internally supports Gene Ontology analysis of about 20 species, Kyoto Encyclopedia of Genes and Genomes (KEGG) with all species that have annotation available in KEGG database, DAVID annotation (only hypergeometric test supported), Disease Ontology and Network of Cancer Genes (via DOSE for human) and Reactome Pathway (via ReactomePA for several species). This is still not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (eg GO via blastgo and KEGG via KAAS), unsupported ontology/pathway or customized annotation.
clusterProfiler provides enricher function for hypergeometric test and GSEA function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene and TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.
Here I used DisGeNET which contains 381056 associations, between 16666 genes and 13172 diseases, as an example to demonstrate the usage of enricher and GSEA functions.
> require(DOSE)
> require(clusterProfiler)
> data(geneList)
> deg <- names(geneList)[abs(geneList)>2]
## downloaded from http://www.disgenet.org/ds/DisGeNET/results/all_gene_disease_associations.tar.gz
> gda <- read.delim("all_gene_disease_associations.txt")
> dim(gda)
[1] 415681 9
> head(gda)
geneId geneSymbol geneName diseaseId
1 540 ATP7B ATPase, Cu++ transporting, beta polypeptide umls:C0019202
2 540 ATP7B ATPase, Cu++ transporting, beta polypeptide umls:C0019202
3 540 ATP7B ATPase, Cu++ transporting, beta polypeptide umls:C0019202
4 4160 MC4R melanocortin 4 receptor umls:C0028754
5 4160 MC4R melanocortin 4 receptor umls:C0028754
6 3667 IRS1 insulin receptor substrate 1 umls:C0011860
diseaseName score NumberOfPubmeds associationType
1 Hepatolenticular Degeneration 0.9898764 229 GeneticVariation
2 Hepatolenticular Degeneration 0.9898764 229 Biomarker
3 Hepatolenticular Degeneration 0.9898764 229 AlteredExpression
4 Obesity 0.9400000 234 GeneticVariation
5 Obesity 0.9400000 234 Biomarker
6 Diabetes Mellitus, Type 2 0.9077659 106 GeneticVariation
source
1 UNIPROT, CTD_human, MGD, RGD, GAD, LHGDN, BeFree
2 UNIPROT, CTD_human, MGD, RGD, GAD, LHGDN, BeFree
3 UNIPROT, CTD_human, MGD, RGD, GAD, LHGDN, BeFree
4 UNIPROT, CTD_human, MGD, RGD, GAD, BeFree
5 UNIPROT, CTD_human, MGD, RGD, GAD, BeFree
6 UNIPROT, CTD_human, MGD, RGD, GAD, BeFree
>
> disease2gene=gda[, c("diseaseId", "geneId")]
> disease2name=gda[, c("diseaseId", "diseaseName")]
> x = enricher(deg, TERM2GENE=disease2gene, TERM2NAME=disease2name)
> head(summary(x))
ID Description GeneRatio BgRatio
umls:C0006142 umls:C0006142 Malignant neoplasm breast 89/196 3324/16666
umls:C0006826 umls:C0006826 NEOPLASM MALIGNANT 113/196 5102/16666
umls:C0678222 umls:C0678222 Breast Carcinoma 83/196 3072/16666
umls:C1458156 umls:C1458156 Recurrent Malignant Neoplasm 47/196 1158/16666
umls:C1458155 umls:C1458155 Breast Neoplasms 60/196 1981/16666
umls:C0600139 umls:C0600139 prostate carcinoma 56/196 2033/16666
pvalue p.adjust qvalue
umls:C0006142 4.934860e-16 9.845045e-13 8.249008e-13
umls:C0006826 3.071791e-15 3.064112e-12 2.567371e-12
umls:C0678222 5.894907e-15 3.920113e-12 3.284601e-12
umls:C1458156 3.230248e-14 1.611086e-11 1.349904e-11
umls:C1458155 1.724317e-12 6.880024e-10 5.764664e-10
umls:C0600139 5.025005e-10 1.670814e-07 1.399949e-07
geneID
umls:C0006142 4312/10874/991/6280/2305/9493/1062/4605/9833/9133/6279/10403/8685/597/7153/6278/259266/3627/27074/6241/7368/11065/55355/9582/55872/51203/3669/22974/10460/10563/4751/820/27338/890/983/4085/6362/9837/5918/332/3832/6286/5163/2146/3002/50852/7272/2568/9212/51659/1111/9055/4321/3620/6790/891/4174/9232/9370/1602/4629/771/3117/80129/23158/125/1311/5174/4250/2167/652/4036/4137/8839/2066/4693/3158/3169/1408/9547/2922/10647/5304/8614/2625/7802/9/5241/10551
umls:C0006826 4312/10874/55388/991/6280/2305/1062/3868/4605/9833/6279/10403/597/7153/6278/79733/3627/27074/6241/55165/9787/7368/11065/9582/55872/51203/3669/83461/22974/10460/10563/4751/6373/8140/1844/4283/27299/27338/890/9415/983/10232/4085/6362/5080/5918/81620/332/3832/6286/5163/2146/3002/50852/7272/2568/2842/9212/8715/1111/9055/3833/4321/11182/10112/3902/3620/3887/51514/6790/4521/891/8544/24137/10578/9232/9370/1602/3708/9122/10699/4629/771/3117/23158/125/4958/1311/2018/1308/4250/652/80736/4036/8839/2066/4693/3169/1408/9547/2922/1524/1580/10647/5304/8614/2625/7802/11122/9/5241/10551/4969
umls:C0678222 4312/10874/6280/2305/4605/9833/9133/6279/10403/8685/597/7153/6278/259266/3627/27074/6241/55165/11065/55355/9582/55872/51203/3669/10563/4751/820/27338/890/983/4085/6362/9837/5918/332/6286/5163/2146/3002/50852/7272/2568/9212/1111/9055/4321/3620/6790/891/9232/9370/1602/3708/4629/771/3117/23158/125/1311/5174/2532/4250/2167/652/4036/4137/8839/2066/4693/3158/3169/9547/2922/1524/10647/5304/8614/2625/7021/7802/9/5241/10551
umls:C1458156 4312/991/6280/6279/8685/7153/6278/259266/3627/27074/6241/55165/9787/3669/22974/983/10232/4085/5080/332/2146/3002/50852/2568/9212/4321/3620/3887/6790/891/4174/9232/3708/4629/771/3117/23158/730/2018/4036/2066/9547/2625/9/5241/10551/57758
umls:C1458155 4312/10874/2305/4605/9833/9133/10403/597/7153/6278/259266/3627/11065/9582/3669/10563/8140/820/1844/27338/890/9415/983/81620/332/2146/3002/2568/9212/51659/1111/9055/11182/3620/51514/6790/4521/891/8544/9232/9370/1602/771/23158/125/4250/2167/652/8839/2066/3169/10647/5304/8614/2625/7021/9/5241/10551/57758
umls:C0600139 4312/6280/2305/4605/9833/6279/7153/3627/6241/55165/11065/22974/10563/8140/1844/890/983/5080/5918/6286/5163/2146/9212/1111/4321/3620/6790/891/8544/9232/9370/8857/1602/3708/23090/4629/3117/23158/2532/2167/652/80736/4036/3169/9547/2922/11283/1524/5304/8614/2625/79689/11122/9/5241/10551
Count
umls:C0006142 89
umls:C0006826 113
umls:C0678222 83
umls:C1458156 47
umls:C1458155 60
umls:C0600139 56
>
> barplot(x)
> y = GSEA(geneList, TERM2GENE=disease2gene, TERM2NAME=disease2name)
> head(summary(y))
ID Description setSize enrichmentScore
umls:C0001418 umls:C0001418 Adenocarcinoma 1224 0.2431172
umls:C0001973 umls:C0001973 Alcoholism 548 -0.3213479
umls:C0002171 umls:C0002171 Alopecia Areata 74 0.5122605
umls:C0003864 umls:C0003864 Arthritis 447 0.3140411
umls:C0003872 umls:C0003872 Arthritis, Psoriatic 122 0.4311280
umls:C0003873 umls:C0003873 Arthritis, Rheumatoid 1186 0.2855594
pvalue p.adjust qvalues
umls:C0001418 0 0 0
umls:C0001973 0 0 0
umls:C0002171 0 0 0
umls:C0003864 0 0 0
umls:C0003872 0 0 0
umls:C0003873 0 0 0
> gseaplot(y, "umls:C0003872")
Citation
Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.