General Information
https://odinlister.sdu.dk/fagbesk/internkode/BMB209/en
When: Jan. 25, 2019
Where: University of Southern Denmark, Odense
https://guangchuangyu.github.io/pathway-analysis-workshop/
Guangchuang Yu, Ph.D. is faculty in the School of Basic Medical Sciences at the Southern Medical University.
Workshop Description
This workshop gives an overview of commonly used methods for enrichment analysis of gene expression data with regard to functional gene sets and pathways, and introduces pathway analysis using clusterProfiler. The clusterProfiler package was designed by considering the supports of multiple ontology/pathway annotations, up-to-date gene annotation, multiple organisms, user customized annotation data and comparative analysis. The workshop will provide reproducible examples of pathway analysis, along with visualization and exploration of results.
The material of this workshop is hosted on github, with source file provided.
Pre-requisites
- Basic knowledge of R syntax
- Familiar with high-throughput gene expression data as obtained with microarray and RNA-seq
- Familiarity with the concept of differential expression analysis (with e.g. limma, edgeR, DESeq2)
Workshop Participation
Execution of example code on practice
R/Bioconductor packages used
- clusterProfiler
- cowplot
- DOSE
- dplyr
- enrichplot
- ggplot2
- GSEABase
- magrittr
- msigdbr
- org.Hs.eg.db
Using the following command to install these packages:
Terminology
Gene sets and pathway
Gene sets are simple lists of usually functional related genes without further specification of relationships between genes.
Pathways can be interpreted as specific gene sets, typically representing a group of genes that work together in a biological process.
Gene Ontology (GO)
Gene Ontology defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:
- MF: Molecular Function
- molecular activities of gene products
- CC: Cellular Component
- where gene products are active
- BP: Biological Process
- pathways and larger processes made up of the activities of multiple gene products
GO terms are organized in a directed acyclic graph, where edge between the terms represent parent-child relationship.
Kyoto Encyclopedia of Genes and Genomes (KEGG)
KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided in 7 broad categories: metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development.
Other gene sets
GO and KEGG are most frequently used for the functional analysis. They are typically the first choice because their long-standing curation and availability for a wide range of species.
Other gene sets including but not limited to Disease Ontology (DO), Disease Gene Network (DisGeNET), wikiPathways, Molecular Signatures Database (MSigDb).
Over Representation Analysis
Over Representation Analysis (ORA) (Boyle et al. 2004) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).
The p-value can be calculated by hypergeometric distribution.
\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}\)
In this equation, N
is the total number of genes in the background distribution, M
is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest, n
is the size of the list of genes of interest and k
is the number of genes within that list which are annotated to the gene set. The background distribution by default is all the genes that have annotation. P-values should be adjusted for multiple comparison.
Example: Microarray study, in which 11,812 genes have been tested for differential expression between two sample conditions and 202 genes were found DE.
Among the DE genes, 25 are annotated to a specific functional gene set, which contains in total 262 genes. This setup corresponds to a 2x2 contingency table:
d <- matrix(c(25, 237, 177, 11323),
nrow = 2,
dimnames = list(c("DE", "Not DE"),
c("In GS", "Not in GS")))
d
## In GS Not in GS
## DE 25 177
## Not DE 237 11323
whether the overlap of 25 genes are significantly over represented in the gene set can be assessed using hypergeometric distribution. This corresponding to a one-sided version of Fisher’s exact test.
##
## Fisher's Exact Test for Count Data
##
## data: d
## p-value = 2.815e-12
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 4.502824 Inf
## sample estimates:
## odds ratio
## 6.744738
Gene Set Enrichment Analysis
A major limitation of ORA is that it restricts analysis to DEGs, excluding the vast majority of genes not satisfying the chosen significance threshold.
This is resolved by gene set enrichment analysis (GSEA) (Subramanian et al. 2005). GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
Genes are ranked based on their phenotypes. Given a priori defined set of gene S (e.g., genes sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (_L_) or primarily found at the top or bottom.
There are two key elements of the GSEA method:
- Calculation of an Enrichment Score.
- The enrichment score (_ES_) represent the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic.
- Esimation of Significance Level of ES.
- The p-value of the ES is calculated using permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permutated data, which generate a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
Input data
For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with DESeq2 package).
For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList
which was derived from R
package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList
dataset.
The geneList
contains three features:
- numeric vector: fold change or other type of numerical variable
- named vector: every number was named by the corresponding gene ID
- sorted vector: number should be sorted in decreasing order
Suppose you are importing your own data from a csv file and the file contains two columns, one for gene ID (no duplicated allowed) and another one for fold change, you can prepare your own geneList
via the following command:
d <- read.csv(your_csv_file)
## assume that 1st column is ID
## 2nd column is fold change
## feature 1: numeric vector
geneList <- d[,2]
## feature 2: named vector
names(geneList) <- as.character(d[,1])
## feature 3: decreasing order
geneList <- sort(geneList, decreasing = TRUE)
We can load the sample data into R via:
## 4312 8318 10874 55143 55388 991
## 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857
Suppose we define fold change greater than 2 as DEGs:
## [1] "4312" "8318" "10874" "55143" "55388" "991"
clusterProfiler as an universal enrichment analysis tool
clusterProfiler (Yu et al. 2012) internally supports enrichment analysis using different ontology and pathways, including GO, DO, KEGG and Reactome pathway. This is not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via blastgo and KEGG via KAAS), unsupported ontology/pathway or customized annotation.
clusterProfiler provides enricher
function for hypergeometric test and GSEA
function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE
and TERM2NAME
. As indicated in the parameter names, TERM2GENE
is a data.frame with first column of term ID and second column of corresponding mapped gene, while TERM2NAME
is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME
is optional.
Here using clusterProfiler to analyze wiki pathways and molecular signatures database as examples.
WikiPathways analysis
WikiPathways is a continuously updated pathway database curated by a community of researchers and pathway enthusiasts. WikiPathways produces monthly releases of gmt files for supported organisms at data.wikipathways.org. Download the appropriate gmt file and then generate TERM2GENE
and TERM2NAME
to use enricher
and GSEA
functions.
library(magrittr)
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(ont, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp <- enricher(gene, TERM2GENE = wpid2gene, TERM2NAME = wpid2name)
head(ewp)
## ID
## WP2446 WP2446
## WP2361 WP2361
## WP179 WP179
## WP3942 WP3942
## WP4240 WP4240
## WP2328 WP2328
## Description
## WP2446 Retinoblastoma (RB) in Cancer
## WP2361 Gastric Cancer Network 1
## WP179 Cell Cycle
## WP3942 PPAR signaling pathway
## WP4240 Regulation of sister chromatid separation at the metaphase-anaphase transition
## WP2328 Allograft Rejection
## GeneRatio BgRatio pvalue p.adjust qvalue
## WP2446 11/95 88/6249 6.801697e-08 1.054263e-05 9.450779e-06
## WP2361 6/95 29/6249 3.772735e-06 2.923870e-04 2.621058e-04
## WP179 10/95 122/6249 1.384549e-05 7.153503e-04 6.412648e-04
## WP3942 7/95 67/6249 6.210513e-05 2.406574e-03 2.157336e-03
## WP4240 4/95 16/6249 7.931988e-05 2.458916e-03 2.204258e-03
## WP2328 7/95 90/6249 4.016758e-04 1.037663e-02 9.301967e-03
## geneID Count
## WP2446 8318/9133/7153/6241/890/983/81620/7272/1111/891/24137 11
## WP2361 4605/7153/11065/22974/6286/6790 6
## WP179 8318/991/9133/890/983/7272/1111/891/4174/9232 10
## WP3942 4312/9415/9370/5105/2167/3158/5346 7
## WP4240 991/1062/4085/9232 4
## WP2328 10563/6373/4283/3002/10578/3117/730 7
## ID Description setSize
## WP3932 WP3932 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway 281
## WP306 WP306 Focal Adhesion 188
## WP236 WP236 Adipogenesis 125
## WP474 WP474 Endochondral Ossification 59
## WP2911 WP2911 miRNA targets in ECM and membrane receptors 21
## WP2361 WP2361 Gastric Cancer Network 1 23
## enrichmentScore NES pvalue p.adjust qvalues rank
## WP3932 -0.3903410 -1.659561 0.001307190 0.03278986 0.02707857 1994
## WP306 -0.4230017 -1.717471 0.001418440 0.03278986 0.02707857 2221
## WP236 -0.4387536 -1.689787 0.001529052 0.03278986 0.02707857 2301
## WP474 -0.5111776 -1.752637 0.001633987 0.03278986 0.02707857 2142
## WP2911 -0.6953475 -1.946647 0.001715266 0.03278986 0.02707857 2629
## WP2361 0.8368760 2.547453 0.002309469 0.03278986 0.02707857 854
## leading_edge
## WP3932 tags=26%, list=16%, signal=22%
## WP306 tags=28%, list=18%, signal=23%
## WP236 tags=34%, list=18%, signal=28%
## WP474 tags=47%, list=17%, signal=40%
## WP2911 tags=76%, list=21%, signal=60%
## WP2361 tags=52%, list=7%, signal=49%
## core_enrichment
## WP3932 7059/51719/8660/5563/5295/6794/1288/7010/3910/3371/3082/3791/1301/1027/90993/3643/1129/1975/7450/3685/2034/1942/2149/1280/4804/3675/2261/7248/2246/4803/2259/3912/1902/2308/1278/1277/81617/2846/2057/2247/1281/50509/1290/55970/5618/7058/10161/56034/3693/4254/6720/3480/5159/3991/1289/1292/3908/2690/3909/8817/3551/2791/63923/3913/3667/3679/7060/3479/80310/1311/1101/3169
## WP306 5595/5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/5753/1277/10398/55742/50509/1290/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/3679/7060/3479/10451/80310/1311/1101
## WP236 8321/5925/8609/1499/2908/4088/8660/3399/6778/8648/6258/2662/5914/6776/2034/196/8204/4692/4208/2308/5468/6695/4023/1592/7350/81029/3952/1675/5618/6720/3991/2487/3667/3572/3479/6424/9370/5105/652/5346/2625
## WP474 4921/51719/860/5745/1028/1280/2261/4256/7042/4208/4322/8100/7078/2247/11096/85477/3480/2690/8817/5167/2737/5744/2487/5327/1300/1473/3479
## WP2911 3672/7057/3915/3910/1291/1278/1293/1281/50509/1290/7058/3693/1289/1292/3913
## WP2361 4605/7153/11065/22974/6286/6790/1894/56992/4173/1063/9585/8607
You may want to convert the gene IDs to gene symbols, which can be done by setReadable
function.
library(org.Hs.eg.db)
ewp <- setReadable(ewp, org.Hs.eg.db, keyType = "ENTREZID")
ewp2 <- setReadable(ewp2, org.Hs.eg.db, keyType = "ENTREZID")
head(ewp)
## ID
## WP2446 WP2446
## WP2361 WP2361
## WP179 WP179
## WP3942 WP3942
## WP4240 WP4240
## WP2328 WP2328
## Description
## WP2446 Retinoblastoma (RB) in Cancer
## WP2361 Gastric Cancer Network 1
## WP179 Cell Cycle
## WP3942 PPAR signaling pathway
## WP4240 Regulation of sister chromatid separation at the metaphase-anaphase transition
## WP2328 Allograft Rejection
## GeneRatio BgRatio pvalue p.adjust qvalue
## WP2446 11/95 88/6249 6.801697e-08 1.054263e-05 9.450779e-06
## WP2361 6/95 29/6249 3.772735e-06 2.923870e-04 2.621058e-04
## WP179 10/95 122/6249 1.384549e-05 7.153503e-04 6.412648e-04
## WP3942 7/95 67/6249 6.210513e-05 2.406574e-03 2.157336e-03
## WP4240 4/95 16/6249 7.931988e-05 2.458916e-03 2.204258e-03
## WP2328 7/95 90/6249 4.016758e-04 1.037663e-02 9.301967e-03
## geneID Count
## WP2446 CDC45/CCNB2/TOP2A/RRM2/CCNA2/CDK1/CDT1/TTK/CHEK1/CCNB1/KIF4A 11
## WP2361 MYBL2/TOP2A/UBE2C/TPX2/S100P/AURKA 6
## WP179 CDC45/CDC20/CCNB2/CCNA2/CDK1/TTK/CHEK1/CCNB1/MCM5/PTTG1 10
## WP3942 MMP1/FADS2/ADIPOQ/PCK1/FABP4/HMGCS2/PLIN1 7
## WP4240 CDC20/CENPE/MAD2L1/PTTG1 4
## WP2328 CXCL13/CXCL11/CXCL9/GZMB/GNLY/HLA-DQA1/C7 7
## ID Description setSize
## WP3932 WP3932 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway 281
## WP306 WP306 Focal Adhesion 188
## WP236 WP236 Adipogenesis 125
## WP474 WP474 Endochondral Ossification 59
## WP2911 WP2911 miRNA targets in ECM and membrane receptors 21
## WP2361 WP2361 Gastric Cancer Network 1 23
## enrichmentScore NES pvalue p.adjust qvalues rank
## WP3932 -0.3903410 -1.659561 0.001307190 0.03278986 0.02707857 1994
## WP306 -0.4230017 -1.717471 0.001418440 0.03278986 0.02707857 2221
## WP236 -0.4387536 -1.689787 0.001529052 0.03278986 0.02707857 2301
## WP474 -0.5111776 -1.752637 0.001633987 0.03278986 0.02707857 2142
## WP2911 -0.6953475 -1.946647 0.001715266 0.03278986 0.02707857 2629
## WP2361 0.8368760 2.547453 0.002309469 0.03278986 0.02707857 854
## leading_edge
## WP3932 tags=26%, list=16%, signal=22%
## WP306 tags=28%, list=18%, signal=23%
## WP236 tags=34%, list=18%, signal=28%
## WP474 tags=47%, list=17%, signal=40%
## WP2911 tags=76%, list=21%, signal=60%
## WP2361 tags=52%, list=7%, signal=49%
## core_enrichment
## WP3932 THBS3/CAB39/IRS2/PRKAA2/PIK3R1/STK11/COL4A6/TEK/LAMA4/TNC/HGF/KDR/COL11A1/CDKN1B/CREB3L1/INSR/CHRM2/EIF4B/VWF/ITGAV/EPAS1/EFNA1/F2R/COL2A1/NGFR/ITGA3/FGFR3/TSC1/FGF1/NGF/FGF14/LAMB1/LPAR1/FOXO1/COL1A2/COL1A1/CAB39L/LPAR4/EPOR/FGF2/COL3A1/COL5A3/COL5A2/GNG12/PRLR/THBS2/LPAR6/PDGFC/ITGB5/KITLG/SREBF1/IGF1R/PDGFRB/LIPE/COL5A1/COL6A2/LAMA2/GHR/LAMA3/FGF18/IKBKB/GNG11/TNN/LAMB2/IRS1/ITGA7/THBS4/IGF1/PDGFD/COMP/CHAD/FOXA1
## WP306 MAPK3/PGF/VEGFC/CTNNB1/MYL5/TLN2/THBS3/PIK3R1/COL4A6/PIP5K1C/LAMA4/TNC/HGF/ARHGAP5/KDR/VWF/BCL2/ITGAV/COL2A1/ITGA3/CCND1/FLNC/LAMB1/DOCK1/COL1A2/PTK6/COL1A1/MYL9/PARVA/COL5A3/COL5A2/FLNB/THBS2/SHC2/PDGFC/ITGB5/IGF1R/PDGFRB/CAV1/COL6A2/LAMA2/LAMA3/TNN/LAMB2/ITGA7/THBS4/IGF1/VAV3/PDGFD/COMP/CHAD
## WP236 FZD1/RB1/KLF7/CTNNB1/NR3C1/SMAD3/IRS2/ID3/STAT6/NCOA1/RXRG/GDF10/RARA/STAT5A/EPAS1/AHR/NRIP1/NDN/MEF2C/FOXO1/PPARG/SPOCK1/LPL/CYP26A1/UCP1/WNT5B/LEP/CFD/PRLR/SREBF1/LIPE/FRZB/IRS1/IL6ST/IGF1/SFRP4/ADIPOQ/PCK1/BMP4/PLIN1/GATA3
## WP474 DDR2/CAB39/RUNX2/PTH1R/CDKN1C/COL2A1/FGFR3/MGP/TGFB2/MEF2C/MMP13/IFT88/TIMP3/FGF2/ADAMTS5/SCIN/IGF1R/GHR/FGF18/ENPP1/GLI3/PTHLH/FRZB/PLAT/COL10A1/CST5/IGF1
## WP2911 ITGA1/THBS1/LAMC1/LAMA4/COL6A1/COL1A2/COL6A3/COL3A1/COL5A3/COL5A2/THBS2/ITGB5/COL5A1/COL6A2/LAMB2
## WP2361 MYBL2/TOP2A/UBE2C/TPX2/S100P/AURKA/ECT2/KIF15/MCM4/CENPF/KIF20B/RUVBL1
As an alternative to manually downloading gmt files, install the rWikiPathways package to gain scripting access to the latest gmt files using the downloadPathwayArchive
function.
MSigDb analysis
Molecular Signatures Database contains 8 major collections:
- H: hallmark gene sets
- C1: positional gene sets
- C2: curated gene sets
- C3: motif gene sets
- C4: computational gene sets
- C5: GO gene sets
- C6: oncogenic signatures
- C7: immunologic signatures
Users can download GMT files from Broad Institute and use read.gmt
to parse the file to be used in enricher()
and GSEA()
.
There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler.
It supports several specices:
## [1] "Bos taurus" "Caenorhabditis elegans"
## [3] "Canis lupus familiaris" "Danio rerio"
## [5] "Drosophila melanogaster" "Gallus gallus"
## [7] "Homo sapiens" "Mus musculus"
## [9] "Rattus norvegicus" "Saccharomyces cerevisiae"
## [11] "Sus scrofa"
We can retrieve all human gene sets:
## gs_name gs_id gs_cat gs_subcat human_gene_symbol species_name
## 1 AAACCAC_MIR140 M12609 C3 MIR ABCC4 Homo sapiens
## 2 AAACCAC_MIR140 M12609 C3 MIR ACTN4 Homo sapiens
## entrez_gene gene_symbol sources
## 1 10257 ABCC4 <NA>
## 2 81 ACTN4 <NA>
Or specific collection. Here we use C6, oncogenic gene sets as an example:
m_t2g <- msigdbr(species = "Homo sapiens", category = "C6") %>%
dplyr::select(gs_name, entrez_gene)
head(m_t2g)
## # A tibble: 6 x 2
## gs_name entrez_gene
## <chr> <int>
## 1 AKT_UP_MTOR_DN.V1_DN 25864
## 2 AKT_UP_MTOR_DN.V1_DN 95
## 3 AKT_UP_MTOR_DN.V1_DN 137872
## 4 AKT_UP_MTOR_DN.V1_DN 134
## 5 AKT_UP_MTOR_DN.V1_DN 55326
## 6 AKT_UP_MTOR_DN.V1_DN 271
## ID Description
## RPS14_DN.V1_DN RPS14_DN.V1_DN RPS14_DN.V1_DN
## GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP
## VEGF_A_UP.V1_DN VEGF_A_UP.V1_DN VEGF_A_UP.V1_DN
## PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_DN PRC2_EZH2_UP.V1_DN
## CSR_LATE_UP.V1_UP CSR_LATE_UP.V1_UP CSR_LATE_UP.V1_UP
## E2F1_UP.V1_UP E2F1_UP.V1_UP E2F1_UP.V1_UP
## GeneRatio BgRatio pvalue p.adjust
## RPS14_DN.V1_DN 22/186 187/11250 4.072878e-13 6.923892e-11
## GCNP_SHH_UP_LATE.V1_UP 16/186 183/11250 5.657195e-08 4.808615e-06
## VEGF_A_UP.V1_DN 15/186 193/11250 6.903522e-07 3.911996e-05
## PRC2_EZH2_UP.V1_DN 14/186 195/11250 4.197805e-06 1.784067e-04
## CSR_LATE_UP.V1_UP 12/186 172/11250 2.766080e-05 9.404672e-04
## E2F1_UP.V1_UP 12/186 189/11250 6.963494e-05 1.972990e-03
## qvalue
## RPS14_DN.V1_DN 5.616284e-11
## GCNP_SHH_UP_LATE.V1_UP 3.900487e-06
## VEGF_A_UP.V1_DN 3.173198e-05
## PRC2_EZH2_UP.V1_DN 1.447138e-04
## CSR_LATE_UP.V1_UP 7.628557e-04
## E2F1_UP.V1_UP 1.600382e-03
## geneID
## RPS14_DN.V1_DN 10874/55388/991/9493/1062/4605/9133/23397/79733/9787/55872/83461/54821/51659/9319/9055/10112/4174/5105/2532/7021/79901
## GCNP_SHH_UP_LATE.V1_UP 55388/7153/79733/6241/9787/51203/983/9212/1111/9319/9055/3833/6790/4174/3169/1580
## VEGF_A_UP.V1_DN 8318/9493/1062/9133/10403/6241/9787/4085/332/3832/7272/891/23362/2167/10234
## PRC2_EZH2_UP.V1_DN 8318/55388/4605/23397/9787/55355/10460/81620/2146/7272/9212/11182/3887/24137
## CSR_LATE_UP.V1_UP 55143/2305/4605/6241/11065/55872/983/332/2146/51659/9319/1580
## E2F1_UP.V1_UP 55388/7153/23397/79733/9787/2146/2842/9212/8208/1111/9055/3833
## Count
## RPS14_DN.V1_DN 22
## GCNP_SHH_UP_LATE.V1_UP 16
## VEGF_A_UP.V1_DN 15
## PRC2_EZH2_UP.V1_DN 14
## CSR_LATE_UP.V1_UP 12
## E2F1_UP.V1_UP 12
## ID Description setSize enrichmentScore
## MTOR_UP.N4.V1_DN MTOR_UP.N4.V1_DN MTOR_UP.N4.V1_DN 181 -0.4806248
## ATF2_S_UP.V1_DN ATF2_S_UP.V1_DN ATF2_S_UP.V1_DN 182 -0.4456885
## IL15_UP.V1_DN IL15_UP.V1_DN IL15_UP.V1_DN 182 -0.4687839
## ATF2_UP.V1_DN ATF2_UP.V1_DN ATF2_UP.V1_DN 178 -0.5029082
## IL2_UP.V1_DN IL2_UP.V1_DN IL2_UP.V1_DN 183 -0.4237496
## LEF1_UP.V1_DN LEF1_UP.V1_DN LEF1_UP.V1_DN 188 -0.4938672
## NES pvalue p.adjust qvalues rank
## MTOR_UP.N4.V1_DN -1.955524 0.001364256 0.01538462 0.008225692 3033
## ATF2_S_UP.V1_DN -1.815562 0.001366120 0.01538462 0.008225692 2531
## IL15_UP.V1_DN -1.909644 0.001366120 0.01538462 0.008225692 2436
## ATF2_UP.V1_DN -2.037247 0.001377410 0.01538462 0.008225692 2055
## IL2_UP.V1_DN -1.724007 0.001377410 0.01538462 0.008225692 2774
## LEF1_UP.V1_DN -2.008503 0.001381215 0.01538462 0.008225692 1705
## leading_edge
## MTOR_UP.N4.V1_DN tags=51%, list=24%, signal=39%
## ATF2_S_UP.V1_DN tags=39%, list=20%, signal=32%
## IL15_UP.V1_DN tags=40%, list=19%, signal=33%
## ATF2_UP.V1_DN tags=38%, list=16%, signal=32%
## IL2_UP.V1_DN tags=39%, list=22%, signal=31%
## LEF1_UP.V1_DN tags=36%, list=14%, signal=32%
## core_enrichment
## MTOR_UP.N4.V1_DN 10778/1571/5890/55512/64131/90806/534/4335/3280/7169/54916/10223/26959/51351/3204/466/9975/84820/1998/63920/283232/9648/54737/3005/55634/29896/3476/9710/57337/1203/56985/26054/10023/23177/85459/51111/55251/7716/54014/1901/51185/7552/7561/51170/645644/8334/55450/81627/7733/84159/5334/57326/10224/11226/10252/25901/55758/9665/10795/79800/57730/25949/10516/4311/79684/911/81698/7741/373/9742/51809/582/57396/7832/1657/80035/10628/901/23492/1195/55733/220972/7704/150759/51306/55268/26137/114327/54585/643314/1602
## ATF2_S_UP.V1_DN 53616/55667/51279/4642/7130/7139/2/710/80352/4608/11145/881/2252/3399/22846/10687/12/1397/1301/3554/2619/26011/78987/9284/1306/6586/9627/4054/54813/115207/55890/26249/4803/11075/3400/79789/8527/2353/10580/79987/80760/5493/90865/3485/3751/2202/2199/10610/1047/1294/23492/3908/5364/658/5350/2121/2331/4330/9737/4982/57088/8490/9358/6424/1307/3708/125/1311/56521/10351
## IL15_UP.V1_DN 64798/26959/6588/3784/777/113791/79981/64759/2535/957/6123/8038/51285/126353/55357/1470/2669/1028/4222/7450/225/1942/9976/29068/5592/54996/10272/6019/55779/80823/80215/5334/9639/57326/6403/26115/54453/57126/151011/57644/1831/51339/6591/55204/1393/1272/57608/79999/653483/9886/10279/55805/1047/10628/901/64123/27244/116039/79921/7041/54361/79843/6857/2239/6947/23541/55857/1811/3357/5104/56967/284266
## ATF2_UP.V1_DN 9738/2252/8660/10687/5357/80201/12/22998/3554/2619/23556/8840/9284/399491/57419/5737/6586/9627/8863/10085/115207/55890/23015/629/3212/3400/50650/8527/2353/79987/80760/9668/1012/10614/5493/8553/23126/90865/3485/3751/2202/57212/2199/10610/3075/6387/22834/1909/10129/57161/5350/2121/2331/4330/2952/4982/57088/9358/6424/6038/51313/1311/56521/10351/5304/4239/79901
## IL2_UP.V1_DN 10858/5333/2977/79874/10014/7773/64798/26959/6588/579/777/5158/23361/9915/5087/64759/2535/6123/7010/51285/126353/80201/79841/1028/2322/4222/947/9187/225/9976/5592/54996/3202/55779/23148/5334/8605/57326/26115/54453/55766/64927/6591/55204/9940/57608/653483/9886/10279/55805/10628/901/9891/7102/27244/79921/4674/8292/5125/79843/6857/79940/23541/3357/50853/9863/79932/90362/55663/4250
## LEF1_UP.V1_DN 22998/10325/23221/79170/9187/4300/51479/25956/7552/8644/11162/7168/1490/54463/65084/25837/80303/1809/3397/4208/11223/79762/6451/481/54795/51626/8991/5950/5627/64699/6414/10103/221078/9961/5874/56898/90865/57235/83989/5002/6653/56034/55314/85458/956/54502/4832/5783/1363/53832/54985/80221/84786/214/50853/51149/57088/5157/2947/79932/7031/6505/9338/25924/80736/3169/10551
We can test with other collections, for example, using C3 to test whether the genes are up/down-regulated by sharing specific motif.
m_t2g <- msigdbr(species = "Homo sapiens", category = "C3") %>%
dplyr::select(gs_name, entrez_gene)
head(m_t2g)
## # A tibble: 6 x 2
## gs_name entrez_gene
## <chr> <int>
## 1 AAACCAC_MIR140 10257
## 2 AAACCAC_MIR140 81
## 3 AAACCAC_MIR140 90
## 4 AAACCAC_MIR140 8754
## 5 AAACCAC_MIR140 11096
## 6 AAACCAC_MIR140 177
## ID
## TGACATY_UNKNOWN TGACATY_UNKNOWN
## WTGAAAT_UNKNOWN WTGAAAT_UNKNOWN
## GGATTA_PITX2_Q2 GGATTA_PITX2_Q2
## TGCCTTA_MIR124A TGCCTTA_MIR124A
## CAGTATT_MIR200B_MIR200C_MIR429 CAGTATT_MIR200B_MIR200C_MIR429
## ACTGTGA_MIR27A_MIR27B ACTGTGA_MIR27A_MIR27B
## Description setSize
## TGACATY_UNKNOWN TGACATY_UNKNOWN 497
## WTGAAAT_UNKNOWN WTGAAAT_UNKNOWN 458
## GGATTA_PITX2_Q2 GGATTA_PITX2_Q2 449
## TGCCTTA_MIR124A TGCCTTA_MIR124A 436
## CAGTATT_MIR200B_MIR200C_MIR429 CAGTATT_MIR200B_MIR200C_MIR429 383
## ACTGTGA_MIR27A_MIR27B ACTGTGA_MIR27A_MIR27B 364
## enrichmentScore NES pvalue
## TGACATY_UNKNOWN -0.3193965 -1.429544 0.001206273
## WTGAAAT_UNKNOWN -0.3214586 -1.430030 0.001216545
## GGATTA_PITX2_Q2 -0.3345811 -1.484699 0.001223990
## TGCCTTA_MIR124A -0.3066390 -1.359231 0.001234568
## CAGTATT_MIR200B_MIR200C_MIR429 -0.3564722 -1.564565 0.001248439
## ACTGTGA_MIR27A_MIR27B -0.3119216 -1.363876 0.001254705
## p.adjust qvalues rank
## TGACATY_UNKNOWN 0.02011846 0.01386739 2961
## WTGAAAT_UNKNOWN 0.02011846 0.01386739 1523
## GGATTA_PITX2_Q2 0.02011846 0.01386739 2464
## TGCCTTA_MIR124A 0.02011846 0.01386739 3042
## CAGTATT_MIR200B_MIR200C_MIR429 0.02011846 0.01386739 3186
## ACTGTGA_MIR27A_MIR27B 0.02011846 0.01386739 3087
## leading_edge
## TGACATY_UNKNOWN tags=30%, list=24%, signal=24%
## WTGAAAT_UNKNOWN tags=18%, list=12%, signal=17%
## GGATTA_PITX2_Q2 tags=26%, list=20%, signal=22%
## TGCCTTA_MIR124A tags=31%, list=24%, signal=24%
## CAGTATT_MIR200B_MIR200C_MIR429 tags=34%, list=25%, signal=26%
## ACTGTGA_MIR27A_MIR27B tags=30%, list=25%, signal=23%
## core_enrichment
## TGACATY_UNKNOWN 800/5168/55691/5789/11080/6400/55512/1112/9604/29760/4638/3218/27314/2065/4286/9459/4026/7026/4335/2263/53353/54329/7082/10446/5930/8567/51334/10174/8626/687/3321/83478/6588/5441/8929/7881/9846/23328/8924/220988/146057/55909/4661/29116/23405/94134/56171/63898/4131/4776/80000/627/9748/2908/11138/23767/10290/5295/1288/9445/10231/29/3373/2669/5191/4211/5813/25945/7716/596/4300/2078/3931/8633/9627/2872/3213/80021/2261/11194/3223/4212/8848/5837/6764/23037/57326/7106/590/7042/4208/55084/50650/80052/6450/323/481/8436/2353/10370/862/9120/727/79899/23452/1293/10580/60481/54838/55204/5166/2845/81603/6876/8522/1592/5577/3489/6310/2202/29951/27239/55821/9499/1501/57161/6571/4035/8821/5744/90627/4922/55800/4223/10472/23303/1264/54587/9590/9576/185/4675/576/1287/9358/3479/5348/776/771/79689
## WTGAAAT_UNKNOWN 2823/80762/50486/4628/9736/324/80070/51585/9627/8082/11211/4885/64112/7482/5334/79645/4734/80059/4208/11075/55084/2012/6709/6003/55629/8829/1581/54843/2353/862/727/10516/10580/1848/55204/5166/89795/2845/79776/9886/761/116496/4488/11096/10468/23677/9201/3075/6925/4121/443/55184/57332/5205/10129/2819/57161/51306/55812/3249/54361/54970/5744/26960/10186/10472/388677/4330/53829/26137/9079/8404/36/3572/3479/1602/23090/81563/2167/8614/57758/4969
## GGATTA_PITX2_Q2 311/3199/4604/57556/8019/3321/57835/8929/6671/1760/10401/4430/9497/5530/6532/54894/23528/5228/5087/2295/6453/56171/3232/4921/3005/221037/56912/10265/10521/10497/23767/8322/37/23429/6561/4303/57082/845/94/51447/9922/6258/23243/2571/10253/4602/29119/55068/126393/7402/56675/5624/7716/7779/596/4300/57804/79923/1280/100506658/9627/3675/26037/2261/2804/65084/11030/3216/57326/140890/23255/5396/9854/1907/54843/57037/862/23452/55152/1848/8835/89795/57007/64221/5136/7227/10324/55273/6310/2202/56034/956/54502/51474/4254/6925/6097/744/443/23446/4306/4915/4487/25803/4081/4223/10472/388677/55107/347902/3708/10742/51313/4857/730/4036/3169
## TGCCTTA_MIR124A 10160/6431/8473/23061/10579/9695/6667/55691/9647/29946/6509/55652/10848/1910/23411/4286/4026/11018/4641/29966/9706/2263/6990/8325/54329/89797/10014/783/51222/5141/84196/23499/311/55752/7003/57556/3915/26959/23189/5884/23328/9781/107/9341/9497/23001/4784/6416/9759/29116/23405/6867/1718/29956/4131/2013/60468/64418/5791/2908/10424/163486/860/2004/7320/9256/23621/29/10253/129642/22998/182/55082/54014/81562/4300/51479/51164/10217/80762/27303/4628/9627/196/171586/10979/115207/26035/23022/79026/23037/91694/10016/140890/25901/5396/9649/79589/2042/323/2494/22905/2620/60481/1848/1272/89795/7227/55830/54681/1657/4337/29994/6444/26018/51157/5138/63895/65055/220972/23428/81493/2239/214/65982/2891/3679/219654/9254/55103/25924/81563
## CAGTATT_MIR200B_MIR200C_MIR429 31/23291/5179/57092/9969/9865/8473/9644/10395/6383/5569/1392/4008/9647/11041/8613/6405/4884/8491/9706/2263/6567/53353/64766/5310/546/5128/55145/54469/200576/8434/7003/687/3915/1456/3321/10499/23189/26060/55578/1998/51421/22864/10484/10140/23001/4661/253782/9807/83660/95681/1960/2908/51719/2186/22846/108/9710/80205/2104/9522/7320/23243/23258/4602/129642/2669/51439/7716/54014/23047/5099/2823/3340/1942/7552/1843/100506658/8204/10979/79618/23015/8848/4692/5334/57515/26249/6563/23122/6196/4077/2494/10370/3953/51339/79365/54838/89795/8076/5577/3751/6310/1009/29994/10769/1501/57509/3899/2273/23116/4908/27244/775/5783/65055/4035/23414/2737/5744/6857/55800/10186/3131/9828/6935/4982/219654/1602/3708/4857
## ACTGTGA_MIR27A_MIR27B 80212/9865/10274/85021/6667/5569/23129/1112/4440/4724/4297/3218/2354/1856/7423/766/90/89797/783/9945/57556/30845/54778/64925/22878/10106/9781/5139/3092/6416/4661/26118/25900/9648/94134/23314/80000/162427/221037/56674/51719/825/490/23767/23390/8648/23429/4124/10231/10513/22902/5914/10253/27347/3643/55082/23047/9252/27076/9315/4804/8646/8204/2872/3213/3223/4885/7248/4212/9900/91694/4734/140890/167227/22849/387263/2308/80267/51280/1003/2494/5468/373/900/55799/23389/51809/79776/5863/55970/7832/1009/10161/10769/6594/1195/1909/6857/27123/23261/3667/219654/27124/10451/771/4857/81563/4148/11122
Gene Ontology Analysis
GO enrichment analysis is internally supported via enrichGO
(for ORA) and gseGO
(for GSEA). These two functions combine with OrgDb
as background annotation. There are 19 OrgDb
for 19 species respectively available via the Bioconductor proejct.
For example, org.Hs.eg.db contains genome wide annotation for human.
Here are the examples of using enrichGO
and gseGO
for GO enrichment analysis:
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
ego <- enrichGO(gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
Instead of using whole genome annotation as background, users can specify background by the universe
parameter. For example, all genes that are actually expressed in an experiment.
## ID Description GeneRatio
## GO:0005819 GO:0005819 spindle 25/202
## GO:0005876 GO:0005876 spindle microtubule 12/202
## GO:0000779 GO:0000779 condensed chromosome, centromeric region 15/202
## GO:0072686 GO:0072686 mitotic spindle 14/202
## GO:0000775 GO:0000775 chromosome, centromeric region 18/202
## GO:0000776 GO:0000776 kinetochore 15/202
## BgRatio pvalue p.adjust qvalue
## GO:0005819 262/11812 2.569986e-12 8.146855e-10 7.520590e-10
## GO:0005876 45/11812 7.912518e-12 1.254134e-09 1.157726e-09
## GO:0000779 91/11812 3.254384e-11 3.438799e-09 3.174452e-09
## GO:0072686 84/11812 1.292642e-10 9.903693e-09 9.142376e-09
## GO:0000775 156/11812 1.562097e-10 9.903693e-09 9.142376e-09
## GO:0000776 106/11812 3.073022e-10 1.623580e-08 1.498772e-08
## geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876 CENPE/SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000779 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/AURKA/CCNB1
## GO:0072686 KIF23/CENPE/ASPM/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/AURKB/KIFC1/KIF18B/AURKA
## GO:0000775 CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
## GO:0000776 CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
## Count
## GO:0005819 25
## GO:0005876 12
## GO:0000779 15
## GO:0072686 14
## GO:0000775 18
## GO:0000776 15
Similary, we can run GSEA test using gseGO
function:
ego2 <- gseGO(geneList = geneList,
OrgDb = org.Hs.eg.db,
ont = "CC",
nPerm = 1000,
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
GO analysis for non-model organisms
Both enrichGO
and gseGO
functions require an OrgDb
object as background annotation. For organisms that don’t have OrgDb
provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb
available, users can obtain GO annotation from other sources, e.g. from biomaRt or Blast2GO. Then using enricher
or GSEA
function to analyze, similar to the examples using wikiPathways and MSigDB. Another solution is to create OrgDb
by your own using AnnotationForge package.
Reducing GO redundancy
GO is organized in parent-child structure, thus a parent term can be overlap with a large proportion with all its child terms. This can result in redundant findings. To solve this issue, clusterProfiler implement simplify
method to reduce redundant GO terms from the outputs of enrichGO
and gseGO
. The function internally called GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and remove those highly similar terms by keeping one representative term.
KEGG Analysis
Most of the tools for KEGG analysis use out-dated KEGG annotation. clusterProfiler supports downloading latest online version of KEGG data for enrichment analysis. It supports KEGG Pathway and Module.
## ID Description GeneRatio
## hsa04110 hsa04110 Cell cycle 11/90
## hsa04114 hsa04114 Oocyte meiosis 10/90
## hsa04218 hsa04218 Cellular senescence 10/90
## hsa03320 hsa03320 PPAR signaling pathway 7/90
## hsa04914 hsa04914 Progesterone-mediated oocyte maturation 7/90
## BgRatio pvalue p.adjust qvalue
## hsa04110 124/7485 2.271800e-07 4.384573e-05 4.304463e-05
## hsa04114 125/7485 2.175154e-06 2.099024e-04 2.060673e-04
## hsa04218 160/7485 1.977884e-05 1.272439e-03 1.249190e-03
## hsa03320 74/7485 2.687347e-05 1.296645e-03 1.272954e-03
## hsa04914 99/7485 1.743529e-04 6.730023e-03 6.607059e-03
## geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232 11
## hsa04114 991/9133/983/4085/51806/6790/891/9232/3708/5241 10
## hsa04218 2305/4605/9133/890/983/51806/1111/891/776/3708 10
## hsa03320 4312/9415/9370/5105/2167/3158/5346 7
## hsa04914 9133/890/983/4085/6790/891/5241 7
## ID Description setSize
## hsa04510 hsa04510 Focal adhesion 188
## hsa03030 hsa03030 DNA replication 33
## hsa05340 hsa05340 Primary immunodeficiency 34
## hsa03050 hsa03050 Proteasome 42
## hsa03008 hsa03008 Ribosome biogenesis in eukaryotes 55
## hsa01230 hsa01230 Biosynthesis of amino acids 63
## enrichmentScore NES pvalue p.adjust qvalues
## hsa04510 -0.4188582 -1.705358 0.0001396258 0.004927116 0.00369513
## hsa03030 0.7227680 2.345351 0.0002590003 0.004927116 0.00369513
## hsa05340 0.6229650 2.039151 0.0002593361 0.004927116 0.00369513
## hsa03050 0.7097403 2.439212 0.0002650411 0.004927116 0.00369513
## hsa03008 0.5737068 2.083286 0.0002690342 0.004927116 0.00369513
## hsa01230 0.5504052 2.063442 0.0002730003 0.004927116 0.00369513
## rank leading_edge
## hsa04510 2183 tags=27%, list=17%, signal=23%
## hsa03030 1905 tags=64%, list=15%, signal=54%
## hsa05340 1452 tags=41%, list=12%, signal=36%
## hsa03050 2516 tags=67%, list=20%, signal=53%
## hsa03008 4001 tags=75%, list=32%, signal=51%
## hsa01230 1750 tags=40%, list=14%, signal=34%
## core_enrichment
## hsa04510 5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
## hsa03030 4174/4171/4175/4173/2237/5984/5111/10535/1763/5427/23649/4176/5982/5557/5558/4172/5424/5983/5425/54107/6119
## hsa05340 29851/6890/3932/3561/3575/915/959/925/5788/916/100/7374/6891/930
## hsa03050 5688/5709/5698/5693/3458/5713/11047/5721/5691/5685/5690/5684/5686/5695/10213/23198/7979/5699/5714/5702/5708/5692/5704/5683/5694/5718/51371/5682
## hsa03008 23560/54913/10799/55131/27341/5822/10436/10248/29107/3692/54433/29889/81691/5901/10775/10557/55505/1460/23160/6949/10940/4809/2091/10556/51096/55651/92856/55813/7514/55226/56000/25996/10885/55127/4931/51068/1459/10171/55916/55341/9790
## hsa01230 29968/26227/875/445/5214/440/65263/6472/7086/5723/3418/7167/586/2597/5230/2023/5223/5831/6888/50/5315/3419/10993/2805/22934
It supports all species that have KEGG annotation data available in the KEGG database. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.
KEGG Orthology (KO) Database is also supported by specifying organism = "ko"
.
Disease, Reactome and MeSH
clusterProfiler also supports Disease Ontology, Reactome Pathway and Medical Subject Headings (MeSH) via its sub-packages, DOSE (Yu et al. 2015), ReactomePA (Yu and He 2016) and meshes (Yu 2018). Please refer to the packages’ online vignettes for more details.
Functional enrichment of genomic region
Next-generation sequencing is also widely applied for detecting variable and regulatory regions, e.g. single nucleotide polymorphisms, copy number variations and transcription factor binding sites.
ChIPseeker provides a function, seq2gene
, for linking genomic regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. After changing genomic region into gene information, we can apply clusterProfiler to test whether the genomic regions of interest is associated with particular functions and pathways.
Please refer to ChIPseeker vignette for more details.
Visualization
The visualization methods were implemented on enrichplot package.
Dot plot
The rich factors indicate the ratio of the number of DEGs mapped to a certain pathway to the total number of genes mapped to this pathway. Greater rich factor means greater intensiveness.
Gene-Concept Network
In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, enrichplot provides cnetplot
function to extract the complex association. The cnetplot
depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network. GSEA result is also supported with only core enriched genes displayed.
Heatmap-like functional classification
The heatplot
is similar to cnetplot
, while displaying the relationships as a heatmap. The gene-concept network may become too complicated if user want to show a large number significant terms. The heatplot
can simplify the result and more easy to identify expression patterns.
Enrichment Map
Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional module.
The emapplot
function supports results obtained from hypergeometric test and gene set enrichment analysis.
ridgeline plot for expression distribution of GSEA result
The ridgeplot
will visualize expression distributions of core enriched genes for GSEA enriched categories. It helps users to interpret up/down-regulated pathways.
running score and preranked list of GSEA result
Running score and preranked list are traditional methods for visualizing GSEA result. The enrichplot package supports both of them to visualize the distribution of the gene set and the enrichment score.
Another method to plot GSEA result is the gseaplot2
function:
The gseaplot2
also supports multile gene sets to be displayed on the same figure:
The gsearank
function plot the ranked list of genes belong to the specific gene set.
Multiple gene sets can be aligned using cowplot
:
library(ggplot2)
library(cowplot)
pp <- lapply(1:3, function(i) {
anno <- gk[i, c("NES", "p.adjust")]
lab <- paste0(names(anno), "=", round(anno, 3), collapse="\n")
es <- gk[i, "enrichmentScore"]
x <- ifelse(es < 0, 0, length(geneList) * .8)
gsearank(gk, i, gk[i, 2]) + xlab(NULL) +ylab(NULL) +
annotate("text", x, es * .5, label = lab,
hjust=0, vjust=0, size = 4) + xlim(0, 12500)
})
plot_grid(plotlist=pp, ncol=1)
Biological theme comparison
clusterProfiler provides compareCluster
function to automatically calculate enriched functional categories among multiple gene clusters that obtained from different experimental conditions.
## X1 X2 X3 X4 X5 X6 X7 X8
## 216 805 392 838 929 585 582 237
## $X1
## [1] "4597" "7111" "5266" "2175" "755" "23046"
##
## $X2
## [1] "23450" "5160" "7126" "26118" "8452" "3675"
##
## $X3
## [1] "894" "7057" "22906" "3339" "10449" "6566"
##
## $X4
## [1] "5573" "7453" "5245" "23450" "6500" "4926"
##
## $X5
## [1] "5982" "7318" "6352" "2101" "8882" "7803"
##
## $X6
## [1] "5337" "9295" "4035" "811" "23365" "4629"
##
## $X7
## [1] "2621" "2665" "5690" "3608" "3550" "533"
##
## $X8
## [1] "2665" "4735" "1327" "3192" "5573" "9528"
## Cluster ID Description GeneRatio
## 1 X2 hsa04110 Cell cycle 18/359
## 2 X2 hsa05169 Epstein-Barr virus infection 23/359
## 3 X2 hsa05340 Primary immunodeficiency 8/359
## 4 X2 hsa04064 NF-kappa B signaling pathway 13/359
## 5 X2 hsa05166 Human T-cell leukemia virus 1 infection 22/359
## 6 X3 hsa04512 ECM-receptor interaction 9/173
## BgRatio pvalue p.adjust qvalue
## 1 124/7485 2.267654e-05 0.006530844 0.006158471
## 2 201/7485 8.648461e-05 0.012453784 0.011743700
## 3 37/7485 2.938634e-04 0.028210882 0.026602366
## 4 95/7485 5.596671e-04 0.040296031 0.037998450
## 5 219/7485 7.769278e-04 0.044751044 0.042199449
## 6 82/7485 1.051835e-04 0.026401058 0.024579723
## geneID
## 1 991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173
## 2 4067/3383/7128/1869/890/1871/578/864/637/9641/6891/355/9134/5971/916/956/6850/7187/3551/919/4734/958/6772
## 3 100/6891/3932/973/916/925/958/64421
## 4 4067/3383/7128/3932/5971/4050/6850/7187/3551/10892/5588/330/958
## 5 3383/991/1869/890/1871/113/701/9700/3932/9134/5971/916/4487/3600/1029/3551/8498/4488/11200/4776/4214/958
## 6 7057/3339/1299/3695/1101/3679/3910/3696/3693
## Count
## 1 18
## 2 23
## 3 8
## 4 13
## 5 22
## 6 9
Formula interface of compareCluster
compareCluster
also supports passing a formula (the code to support formula has been contributed by Giovanni Dall’Olio) of type \(Entrez \sim group\) or \(Entrez \sim group + othergroup\).
mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"
formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enricher",
TERM2GENE=wpid2gene, TERM2NAME = wpid2name)
head(formula_res)
## Cluster group othergroup ID
## 1 downregulated.A downregulated A WP474
## 2 downregulated.A downregulated A WP3932
## 3 downregulated.A downregulated A WP306
## 4 downregulated.A downregulated A WP2911
## 5 downregulated.B downregulated B WP3942
## 6 downregulated.B downregulated B WP2895
## Description GeneRatio BgRatio
## 1 Endochondral Ossification 13/280 65/6249
## 2 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway 29/280 309/6249
## 3 Focal Adhesion 21/280 202/6249
## 4 miRNA targets in ECM and membrane receptors 8/280 43/6249
## 5 PPAR signaling pathway 5/38 67/6249
## 6 Differentiation of white and brown adipocyte 3/38 25/6249
## pvalue p.adjust qvalue
## 1 4.480512e-06 0.001200777 0.001174366
## 2 1.081666e-04 0.014494322 0.014175515
## 3 2.504461e-04 0.022373181 0.021881077
## 4 5.418905e-04 0.036306661 0.035508086
## 5 4.650188e-05 0.003859656 0.003573302
## 6 4.351306e-04 0.018057919 0.016718175
## geneID
## 1 11096/85477/3480/2690/8817/5167/2737/5744/2487/5327/1300/1473/3479
## 2 1281/50509/1290/55970/5618/7058/10161/56034/3693/4254/6720/3480/5159/3991/1289/1292/3908/2690/3909/8817/3551/2791/63923/3913/3667/3679/7060/3479/80310
## 3 55742/50509/1290/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/3679/7060/3479/10451/80310
## 4 1281/50509/1290/7058/3693/1289/1292/3913
## 5 9370/5105/2167/3158/5346
## 6 9370/23090/652
## Count
## 1 13
## 2 29
## 3 21
## 4 8
## 5 5
## 6 3
Summary
Using ORA, we need to first classify genes as interest or not interest (e.g. DE or not DE), which is rather arbitrary, and the independence assumption between genes is unrealistic. However, the simplicity and general applicability of ORA is unmet by other methods. For example, GSEA requires the expression data as input, which is not available from other experimental types.
There is another strategy, topology-based method, to perform functional analysis. However, network for many species is missing or at least incomplete, restricting the application of this method.
All methods have their own benefits and limitations and no single method is best suited for all applications.
An increasing concern upon the quality of gene annotation has raised an alarm in biomedical research. A previous study (Wadi et al. 2016) reported that about 42% of the tools were outdated by more than five years and functional significance were severely underestimated with only 26% of biological processes or pathways were captured compare to using up-to-date annotation.
Doing pathway analysis, we need to bear in mind that applying multiple methods can be beneficial, and pay attention to the release date of the background annotation.
clusterProfiler provides two most widely used methods, ORA and GSEA, and up-to-date annotation. All species annotated in KEGG database can be analyzed and the data is queried online (most updated). OrgDb
that released within Bioconductor was updated biannual, and can be queried also online via AnnotationHub. In addition, clusterProfiler is a general tool and user can provide their own annotation to analyze their data. For instance users can download the latest wikiPathway, MSigDb, GO, etc.
References
Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. “GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18): 3710–5. https://doi.org/10.1093/bioinformatics/bth456.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.
Wadi, Lina, Mona Meyer, Joel Weiser, Lincoln D. Stein, and Jüri Reimand. 2016. “Impact of Outdated Gene Annotations on Pathway Enrichment Analysis.” Nature Methods 13 (9): 705–6. https://doi.org/10.1038/nmeth.3963.
Yu, Guangchuang. 2018. “Using Meshes for MeSH Term Enrichment and Semantic Analyses.” Bioinformatics 34 (21): 3766–7. https://doi.org/10.1093/bioinformatics/bty410.
Yu, Guangchuang, and Qing-Yu He. 2016. “ReactomePA: An R/Bioconductor Package for Reactome Pathway Analysis and Visualization.” Molecular BioSystems 12 (2): 477–79. https://doi.org/10.1039/C5MB00663E.
Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among Go Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.
Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.