why clusterProfiler fails

@kaji331 compared cluserProfiler with GeneAnswers and found that clusterProfiler gives larger p values.

It eventually came out that he passed the input gene as numeric vector, which was supposed to be character and he used an old version of clusterProfiler which didn’t use as.character to coerce the input

But his comment forces me to test it.

require(GeneAnswers)
data('humanGeneInput')
y <- geneAnswersBuilder(humanGeneInput, 'org.Hs.eg.db', 
                        categoryType='KEGG', testType='hyperG', 
                        pvalueT=0.1, geneExpressionProfile=humanExpr, 
                        verbose=FALSE)
yy <- y@enrichmentInfo

require(clusterProfiler)
x <- enrichKEGG(humanGeneInput$GeneID, pvalueCutoff=0.2, 
                qvalueCutoff=0.2, minGSSize=1)
xx <- summary(x)

id <- sub("hsa", "", xx$ID)
idx <- id %in% rownames(yy)

p.clusterProfiler <- xx$pvalue[idx]
p.GeneAnswers <- yy[id[idx],]$"p value"
> cor(p.clusterProfiler, p.GeneAnswers)
[1] 0.9996165
> p.clusterProfiler - p.GeneAnswers
 [1]  1.029789e-04 -3.588252e-05 -4.623010e-05  1.079117e-04 -1.075746e-04
 [6] -1.077398e-04 -3.774637e-04 -2.849278e-04 -4.197993e-04  7.588155e-04
[11] -3.702141e-04  2.314721e-03 -5.695641e-04 -5.940830e-04 -4.923697e-04
[16] -5.560738e-04 -5.884079e-04  2.011138e-03

Here, I used the dataset, humanGeneInput, provided by GeneAnswers. There are 19 pathways have p values below 0.1 by GeneAnswers and 18 pathways have p values below 0.1 by clusterProfiler. 18 of them are the same and p values are highly correlated with very small differences. Enrichment analyses are tested by geometric model, I believe all packages do it in the same way. The difference should be attributed to annotation file and I found background ratio are different.

> xx[1,]
               ID Description GeneRatio  BgRatio     pvalue   p.adjust
hsa04110 hsa04110  Cell cycle      6/51 128/5894 0.00075424 0.02191169
             qvalue                      geneID Count
hsa04110 0.01683117 894/595/4175/6502/4173/9134     6
> yy[1,]
      genes in Category percent in the observed List percent in the genome
04110                 6                    0.1176471            0.02112436
      fold of overrepresents odds ratio      p value
04110                5.56926   6.441808 0.0006512612


> allEG = unique(unlist(as.list(org.Hs.egPATH2EG)))
> length(get("04110", org.Hs.egPATH2EG))
[1] 124
> length(allEG)
[1] 5870
> length(get("04110", org.Hs.egPATH2EG))/length(allEG)
[1] 0.02112436

In the above example, the BgRatio of clusterProfiler is 128/5894, slightly larger than 0.02112436 calculated by GeneAnswers. This is why clusterProfiler gives a p value larger than GeneAnswers (In this particular pathway of this particular example, clusterProfiler not always produce larger p values). The annotation file used by clusterProfiler is KEGG.db, while GeneAnswers uses org.Hs.eg.db. This is the reason of the difference and it’s not a big deal.

It’s not a good idea to give the background ratio in decimal format as GeneAnswers does. We have no idea of the number of total genes in background with ratio like 0.02112436. Some software packages may use the gene number of the whole genome as background. If the total gene number is setting to 18000, the background ratio turn out to be 124/18000=0.006888889. With 0.006888889, it’s hard to detect the background gene number was set to a larger number. In this scenario, all the p values will become very small, and many false positive were reported.

clusterProfiler and GeneAnswers estimate the total genes in the right way, but there are many packages may setting it to all the genes, no matter they have annotation or not. For instance, some software may setting the background to 20,000 when testing human genes.

I still remember in 2007, I used princeton’s web server, http://go.princeton.edu/cgi-bin/GOTermFinder, to test GO enrichment. I found the background gene number was set to a very large number and all the p values were extremely small. When writing this blog post, I submit a job to this web server and found the gene number of human is setting to 45758. After I found this, I never use this web server again. The result is extremely “significant”! The web sever has an option, “Enter the number of products estimated for your organism”. This is ridiculous. It’s hard to estimate for ordinary users and it is impossible to estimate it correctly since we don’t know the annotation file in used. The default number, 45758 for human, is too large, result in many false positive.

Input a gene list as background is useful, for instance we may use all genes that are detected in microarray experiment or all proteins identified by MS as background. This is an effective way to eliminate bias of using the whole genome. Use the whole genome is not mean to use the total number of gene in the genome as background. By default, KEGG enrichment analyses of clusterProfiler and GeneAnswers are using the whole genome, that is about 5900 genes for human, which are all genes have pathway annotation in the whole genome. Others without annotation are abandoned. This itself is a bias actually, but it’s the realistic. If we include all genes no matter they have annotation or not, and setting 20000 or even more as background of human genes, definitely we will get significant result! But it is man-made significant.

As I replied to @kaji331, if p values obtained by different software are different, trust the one produce larger p values. Many user will choose a software give a more “significant” result, but trust me, you are making type I error. Conservatively, always choose the one give you larger p-values if you know nothing about the calculation. Software that never “fail” is untrustable.

Citation

Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

why clusterProfiler fails

Citation

Guangchuang Yu