For GSEA analysis, we are familar with the above figure which shows the running enrichment score. But for most of the software, it lack of visualization method to summarize the whole enrichment result.
I am using dotplot() to visualize results from enrichGO(), enrichDO(), enricher() and compareCluster() in clusterProfiler R package. When specifying showCategory, I get the right number of categories except with the results of compareCluser().
In my case, I use compareCluster() on a list of 3 elements:
str(ClusterList) List of 3 $ All : chr [1:1450] "89886" "29923" "100132891" "101410536" ... $ g1 : chr [1:858] "89886" "29923" "100132891" "101410536" ... $ g2: chr [1:592] "5325" "170691" "29953" "283392" ... CompareGO_BP=compareCluster(ClusterList, fun="enrichGO", pvalueCutoff=0.01, pAdjustMethod="BH", OrgDb=org.Hs.eg.db,ont="BP",readable=T) dotplot(CompareGO_BP, showCategory=10, title="GO - Biological Process")
I ask for 10 categories, but I get 15 categories in All, 8 categories in g1 and 12 categories in g2. None of the categories, neither the sum of the categories are 10…
Is the option showCategory working in the case of comparison? Am I missing something here?
And which categories precisely will it plot? the most significant whatever my 3 cases or the most significant of each case?
The question was posted in Bioconductor support site. It seems quite confusing and I think I need to write a post to clarify it.
Leading edge analysis reports
Tags to indicate the percentage of genes contributing to the enrichment score,
List to indicate where in the list the enrichment score is attained and
Signal for enrichment signal strength.
It would also be very interesting to get the core enriched genes that contribute to the enrichment.
clusterProfiler can convert biological IDs using
OrgDb object via the
bitr function. Now I implemented another function,
bitr_kegg for converting IDs through KEGG API.
library(clusterProfiler) data(gcSample) hg <- gcSample[] head(hg) ##  "4597" "7111" "5266" "2175" "755" "23046" eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa') ## Warning in bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid", ## organism = "hsa"): 3.7% of input gene IDs are fail to map... head(eg2np) ## kegg ncbi-proteinid ## 1 8326 NP_003499 ## 2 58487 NP_001034707 ## 3 139081 NP_619647 ## 4 59272 NP_068576 ## 5 993 NP_001780 ## 6 2676 NP_001487 np2up <- bitr_kegg(eg2np[,2], fromType='ncbi-proteinid', toType='uniprot', organism='hsa') head(np2up) ## ncbi-proteinid uniprot ## 1 NP_005457 O75586 ## 2 NP_005792 P41567 ## 3 NP_005792 Q6IAV3 ## 4 NP_037536 Q13421 ## 5 NP_006054 O60662 ## 6 NP_001092002 O95398
The ID type (both fromType & toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. The ‘kegg’ is the primary ID used in KEGG database. The data source of KEGG was from NCBI. A rule of thumb for the ‘kegg’ ID is
entrezgene ID for eukaryote species and
Locus ID for prokaryotes.
KEGG MODULE is a collection of manually defined functional units, called KEGG modules and identified by the M numbers, used for annotation and biological interpretation of sequenced genomes. There are four types of KEGG modules:
- pathway modules – representing tight functional units in KEGG metabolic pathway maps, such as M00002 (Glycolysis, core module involving three-carbon compounds)
- structural complexes – often forming molecular machineries, such as M00072 (Oligosaccharyltransferase)
- functional sets – for other types of essential sets, such as M00360 (Aminoacyl-tRNA synthases, prokaryotes)
- signature modules – as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin)
clusterProfiler supports over-representation test and gene set
enrichment analysis of Gene Ontology. It supports GO annotation from
OrgDb object, GMT file and user’s own data.
In github version of clusterProfiler,
removed the parameter organism and add another parameter OrgDb, so
that any species that have
OrgDb object available can be analyzed in
clusterProfiler. Bioconductor have already provide OrgDb for about
20 species, see
users can build
He pointed out two issues:
For GSEA analysis, we have two inputs, a ranked gene list and gene set collections.
First of all, the gene set collections are very different. The GMT file used in his test is c5.cc.v5.0.symbols.gmt, which is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.
To simplify enriched GO result, we can use slim version of GO and use enricher function to analyze.
Another strategy is to use GOSemSim to calculate similarity of GO terms and remove those highly similar terms by keeping one representative term. To make this feature available to clusterProfiler users, I develop a simplify method to reduce redundant GO terms from output of enrichGO function.
require(clusterProfiler) data(geneList, package="DOSE") de <- names(geneList)[abs(geneList) > 2] bp <- enrichGO(de, ont="BP") enrichMap(bp)