He pointed out two issues:
For GSEA analysis, we have two inputs, a ranked gene list and gene set collections.
First of all, the gene set collections are very different. The GMT file used in his test is c5.cc.v5.0.symbols.gmt, which is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.
Another strategy is to use GOSemSim to calculate the similarity of GO terms and remove highly similar terms by keeping one representative term. To make this feature available to clusterProfiler users, I developed a simplify method that removes redundant GO terms from the output of the enrichGO function.
require(clusterProfiler)
data(geneList, package="DOSE")
de <- names(geneList)[abs(geneList) > 2]
bp <- enrichGO(de, ont="BP")
enrichMap(bp)
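To make the similarity-based strategy described above more concrete, here is a minimal sketch in base R. It is not the clusterProfiler implementation: the function name `reduce_redundant_terms`, the toy term IDs, the similarity values, and the 0.7 cutoff are all assumptions chosen for illustration, and the similarity matrix is assumed to be precomputed (for example with GOSemSim).

```r
# Illustrative only: greedy redundancy removal over a precomputed
# similarity matrix. Names and values are hypothetical, not package API.
reduce_redundant_terms <- function(sim, pvalue, cutoff = 0.7) {
  stopifnot(is.matrix(sim),
            identical(rownames(sim), colnames(sim)),
            all(names(pvalue) %in% rownames(sim)))
  # Visit terms from most to least significant.
  ordered_terms <- names(sort(pvalue))
  kept <- character(0)
  for (term in ordered_terms) {
    # Keep a term only if it is not too similar to any term already kept.
    if (length(kept) == 0 || all(sim[term, kept] <= cutoff)) {
      kept <- c(kept, term)
    }
  }
  kept
}

# Toy input: GO:A and GO:B are highly similar, GO:C is distinct.
terms <- c("GO:A", "GO:B", "GO:C")
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.3,
                0.2, 0.3, 1.0),
              nrow = 3, dimnames = list(terms, terms))
pvalue <- c("GO:A" = 0.001, "GO:B" = 0.0005, "GO:C" = 0.01)

representatives <- reduce_redundant_terms(sim, pvalue)
print(representatives)
```

In this toy input, GO:A is dropped because it is too similar to the more significant GO:B, while GO:C is kept as a separate representative.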
I think it would be interesting to incorporate seq2gene with clusterProfiler. However, it fails to run because it calls an absolute path to the Python installation on the author's computer.
clusterProfiler provides the enricher function for the hypergeometric test and the GSEA function for gene set enrichment analysis; both are designed to accept user-defined annotation. They accept two additional parameters, TERM2GENE and TERM2NAME. As the parameter names indicate, TERM2GENE is a data.frame whose first column is the term ID and whose second column is the corresponding mapped gene, and TERM2NAME is a data.frame whose first column is the term ID and whose second column is the corresponding term name. TERM2NAME is optional.
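As a rough illustration of the two data.frame layouts described above, the following sketch builds a toy annotation and passes it to enricher. The term IDs, term names, and gene IDs are made up. The cutoffs are relaxed and minGSSize is lowered only because the toy gene sets are tiny; these settings are assumptions for the example, not recommendations. The call is guarded so the snippet still runs when clusterProfiler is not installed.

```r
# Hypothetical toy annotation: column 1 is the term ID, column 2 the gene.
term2gene <- data.frame(
  term = c("SET1", "SET1", "SET1", "SET2", "SET2"),
  gene = c("g1", "g2", "g3", "g3", "g4"),
  stringsAsFactors = FALSE
)

# Optional mapping: column 1 is the term ID, column 2 the term name.
term2name <- data.frame(
  term = c("SET1", "SET2"),
  name = c("toy pathway one", "toy pathway two"),
  stringsAsFactors = FALSE
)

# Made-up genes of interest.
genes_of_interest <- c("g1", "g2", "g3")

if (requireNamespace("clusterProfiler", quietly = TRUE)) {
  res <- clusterProfiler::enricher(
    genes_of_interest,
    TERM2GENE    = term2gene,
    TERM2NAME    = term2name,
    minGSSize    = 1,  # toy sets are far smaller than the default minimum
    pvalueCutoff = 1,  # relaxed so the toy result is not filtered out
    qvalueCutoff = 1
  )
  print(res)
}
```

The GSEA function takes the same TERM2GENE and TERM2NAME arguments, but expects a ranked, named numeric vector instead of a plain vector of gene IDs.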
Some users told me that they may want to use DAVID in some circumstances. I think it may be a good idea for clusterProfiler to support DAVID, so that DAVID users can use the visualization functions provided by clusterProfiler.
require(DOSE)
require(clusterProfiler)
data(geneList)
gene = names(geneList)[abs(geneList) > 2]
david = enrichDAVID(gene = gene, idType="ENTREZ_GENE_ID", listType="Gene", annotation="KEGG_PATHWAY")

> summary(david)
               ID            Description GeneRatio  BgRatio       pvalue
hsa04110 hsa04110             Cell cycle     11/68 125/5085 4.254437e-06
hsa04114 hsa04114         Oocyte meiosis     10/68 110/5085 1.119764e-05
hsa03320 hsa03320 PPAR signaling pathway      7/68  69/5085 2.606715e-04
             p.adjust qvalue                                             geneID
hsa04110 0.0003998379     NA 9133/4174/890/991/1111/891/7272/8318/4085/983/9232
hsa04114 0.0005261534     NA    9133/5241/51806/3708/991/891/4085/983/9232/6790
hsa03320 0.0081354974     NA                 4312/2167/5346/5105/3158/9370/9415
         Count
hsa04110    11
hsa04114    10
hsa03320     7
Note that only 5085 human genes are annotated by KEGG in this result; this is because the DAVID data are out of date.
The enrichKEGG function now has a new parameter, use_internal_data. It defaults to FALSE, in which case enrichKEGG downloads the latest KEGG data for enrichment analysis. If use_internal_data is explicitly set to TRUE, the function uses KEGG.db, which is still supported but not recommended. With this new feature, any species is supported as long as KEGG annotations are available for it in the KEGG database. You can access the full list of species supported by KEGG via: http://www.genome.jp/kegg/catalog/org_list.html
The organism parameter in enrichKEGG should now be the abbreviation of the scientific name, for example ‘hsa’ for human and ‘mmu’ for mouse. It accepts any species listed in http://www.genome.jp/kegg/catalog/org_list.html.
In the current release version of clusterProfiler (in Bioconductor 3.0), enrichKEGG supports about 20 species, and the organism parameter accepts the common name of a species, for instance “human” and “mouse”. For these previously supported species, the common name is still accepted, so your scripts will keep working with the new version of clusterProfiler. For other species, the common name is not supported: I don’t want to maintain such a long mapping list when many species have no common name available, and doing so could also introduce unexpected bugs.
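The organism handling described above can be pictured with a small sketch. This is not how clusterProfiler implements it: `legacy_names`, `resolve_organism`, and `run_kegg` are hypothetical names, and the lookup table holds only the two examples mentioned in the text. The wrapper around enrichKEGG is defined but not called, since calling it needs clusterProfiler and network access to KEGG.

```r
# Hypothetical lookup for previously supported common names; only the two
# examples from the text are listed here.
legacy_names <- c(human = "hsa", mouse = "mmu")

# Map a legacy common name to its KEGG abbreviation; pass anything else
# through unchanged, assuming it is already a KEGG organism abbreviation.
resolve_organism <- function(organism) {
  if (organism %in% names(legacy_names)) {
    legacy_names[[organism]]
  } else {
    organism
  }
}

# Hypothetical wrapper: runs the enrichment against freshly downloaded
# KEGG data. Defined only; calling it requires clusterProfiler and a
# network connection.
run_kegg <- function(gene, organism) {
  clusterProfiler::enrichKEGG(
    gene,
    organism          = resolve_organism(organism),
    use_internal_data = FALSE
  )
}

print(resolve_organism("human"))
print(resolve_organism("mmu"))
```

With a vector of Entrez gene IDs in `gene`, a call such as `run_kegg(gene, "human")` would be expected to behave like `enrichKEGG(gene, organism = "hsa")`.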