Comparison of clusterProfiler and GSEA-P
Thanks @mevers for raising the issue to me and his efforts in benchmarking clusterProfiler.
He pointed out two issues:
- outputs from gseGO and GSEA-P are poorly overlap.
- pvalues from gseGO are generally smaller and don’t show a lot of variation
For GSEA analysis, we have two inputs, a ranked gene list and gene set collections.
First of all, the gene set collections are very different. The GMT file used in his test is c5.cc.v5.0.symbols.gmt, which is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.
For instance, with his gene list as input, clusterProfiler annotates 195 genes as ribosome, while GSEA-P (using c5.cc.v5.0.symbols.gmt) only annotates 38 genes.
As the gene set collections is so different, I don’t believe the comparison can produce any valuable results.
The first step should be extending clusterProfiler to support using GMT file as gene set annotation, thereafter we can use identical input (both gene list and gene sets) and then benchmarking will be valuable for detecting issues that exclusively attributed to the implementation of GSEA algorithm.
clusterProfiler supports GMT file
Currently clusterProfiler supports user’s own annotations via enricher and GSEA functions which require users provide their own annotation in a data.frame. This is a general interface for using user’s own annotation.
To support GMT file, we only need a function, read.gmt, to parse GMT file and output a data.frame that is suitable for enricher and GSEA. Now this function is available in devel branch (in BioC 3.3 or github) of clusterProfiler.
As @mevers used c5.cc, I also use it for further comparison. I have packed the file into clusterProfiler, so that users can use it for testing/practice.
> gmtfile <- system.file("extdata", "c5.cc.v5.0.entrez.gmt",
package="clusterProfiler")
## only 207K.
## It's indeed a tiny subset of CC.
> file.size(gmtfile)/1000
[1] 207.608
> c5 <- read.gmt(gmtfile)
hypergeometric test with GMT annotation
> require(clusterProfiler)
> data(geneList, package="DOSE")
> de <- names(geneList)[abs(geneList) > 2]
> head(de)
[1] "4312" "8318" "10874" "55143" "55388" "991"
> x <- enricher(de, TERM2GENE=c5)
## omit some columns to make it more readable
> head(summary(x)[, c(1, 3:7)], 3)
ID GeneRatio BgRatio
SPINDLE SPINDLE 11/82 39/5270
MICROTUBULE_CYTOSKELETON MICROTUBULE_CYTOSKELETON 16/82 152/5270
CYTOSKELETAL_PART CYTOSKELETAL_PART 15/82 235/5270
pvalue p.adjust qvalue
SPINDLE 7.667674e-12 6.594200e-10 5.327016e-10
MICROTUBULE_CYTOSKELETON 8.449298e-10 3.633198e-08 2.935019e-08
CYTOSKELETAL_PART 2.414879e-06 6.623386e-05 5.350593e-05
gene set enrichment analysis with GMT annotation
> y <- GSEA(geneList, TERM2GENE=c5)
> head(summary(y)[, -c(1,2)], 2)
setSize enrichmentScore NES pvalue
EXTRACELLULAR_REGION 401 -0.3860230 -1.694322 0.001237624
EXTRACELLULAR_REGION_PART 310 -0.4101043 -1.765775 0.001269036
p.adjust qvalues
EXTRACELLULAR_REGION 0.03047874 0.02316228
EXTRACELLULAR_REGION_PART 0.03047874 0.02316228
Comparison of clusterProfiler and GSEA-P
Now with read.gmt, we can compare clusterProfiler and GSEA-P with the same input.
First of all, I export geneList to a rnk file.
data(geneList, package="DOSE")
d=data.frame(gene=names(geneList), FC=geneList)
write.table(d, row.names=F,col.names=F, quote=F, file="geneList.rnk", sep="\t")
And run GSEA-P with the following parameters:
producer_class xtools.gsea.GseaPreranked
producer_timestamp 1445941169480
param collapse false
param plot_top_x 20
param rnk /Users/guangchuangyu/Downloads/geneList.rnk
param norm meandiv
param scoring_scheme weighted
param make_sets true
param mode Max_probe
param gmx /Users/guangchuangyu/Downloads/c5.cc.v5.0.entrez.gmt
param gui false
param rpt_label my_analysis
param help false
param out /Users/guangchuangyu/gsea_home/output/oct27
param include_only_symbols true
param set_min 15
param nperm 1000
param rnd_seed timestamp
param zip_report false
param set_max 500
Re-run clusterProfiler::GSEA with pvalueCutoff=1.
g <- read.delim("gsea_report_for_na_neg_1445941169480.xls")
xx <- GSEA(geneList, TERM2GENE=c5, nPerm=1000, pvalueCutoff=1)
gy = merge(g, summary(xx), by.x="NAME", by.y="ID")
ggplot(gy, aes(NOM.p.val, pvalue)) + geom_point() + xlim(0, 1) + ylim(0, 1)
Now the comparison tells! clusterProfiler indeed produce smaller pvalues. As I said in why clusterProfiler fails, in general software produce more conservative result is more trustable. This is indeed an issue I couldn’t omit.
This issue is attributed to DOSE package, which serves as a backend of both clusterProfiler and ReactomePA.
In DOSE, we calculate the pvalue in the following way:
For each geneList, we calculate the observed ES, and then perform permutation to generate a null ES distribution. pvalue = (sum(ES >= permES)+1)/(nPerm+1), for greater side as an example, is calculated.
fixed bug of DOSE
Eventually I figured out that the way we calculate pvalue is not correct. As presented in Subramaniam et al, the distribution of ES is bimodal. Positive and negative ES values should be separated when calculating pvalues.
After I changed the source code, the pvalues generated by clusterProfiler::GSEA and GSEA-P are almost identical.
This bug was fixed in both release (>=2.8.1) and devel (>=2.9.1) branches. The fixed will affect DOSE, clusterProfiler and ReactomePA.
other concerns
There are other differences between clusterProfiler and GSEA-P.
clusterProfiler never produce pvalue=0
For calculating pvalues, GSEA-P may produce pvalue=0, while clusterProfiler never produce pvalue=0 since we add pseudocount, 1, in both numerator and demoninator. Some software may not accept pvalue=0, for example if you want to use topGO to visualize enrichment with GO topology, the result can’t contain any pvalue=0.
clusterProfiler test whole gene set collections
> g=read.delim("gsea_report_for_na_neg_1445941169480.xls")
> dim(g)
[1] 56 12
> yy = summary(xx)
> yy = yy[with(yy, setSize >= 15 & setSize <=500),]
> dim(yy)
[1] 152 8
> all(g$NAME %in% yy$ID)
[1] TRUE
GSEA-P didn’t filter result by pvalues, as it reported pvalues ranging from 0 to 1.
We cut our result by setSize in $[15, 500]$ as default parameter of GSEA-P.
clusterProfiler tests all the gene sets while GSEA-P only tests a subset.
If I have some free time, I will figure out how they select gene sets to test. If it’s reasonable, we can add a parameter for clusterProfiler users to switch between two modes (full sets or subset). This is still an open issue/question. If you have any idea, don’t hesitate to let me know. :)
Citation
Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.