BMB209 SDU 2019 - Pathway Analysis Workshop

General Information

https://odinlister.sdu.dk/fagbesk/internkode/BMB209/en

When: Jan. 25, 2019

Where: University of Southern Denmark, Odense

https://guangchuangyu.github.io/pathway-analysis-workshop/

Instructor: Guangchuang Yu

Guangchuang Yu, Ph.D. is faculty in the School of Basic Medical Sciences at the Southern Medical University.

Workshop Description

This workshop gives an overview of commonly used methods for enrichment analysis of gene expression data with regard to functional gene sets and pathways, and introduces pathway analysis using clusterProfiler. The clusterProfiler package was designed by considering the supports of multiple ontology/pathway annotations, up-to-date gene annotation, multiple organisms, user customized annotation data and comparative analysis. The workshop will provide reproducible examples of pathway analysis, along with visualization and exploration of results.

The material of this workshop is hosted on github, with source file provided.

Pre-requisites

Basic knowledge of R syntax
Familiar with high-throughput gene expression data as obtained with microarray and RNA-seq
Familiarity with the concept of differential expression analysis (with e.g. limma, edgeR, DESeq2)

Workshop Participation

Execution of example code on practice

R/Bioconductor packages used

clusterProfiler
cowplot
DOSE
dplyr
enrichplot
ggplot2
GSEABase
magrittr
msigdbr
org.Hs.eg.db

Using the following command to install these packages:

install.packages(c("BiocManager", "cowplot", "dplyr", "ggplot2", "magrittr", "msigdbr"))
BiocManager::install(c("clusterProfiler", "DOSE", "enrichplot", "GSEABase", "org.Hs.eg.db"))

Terminology

Gene sets and pathway

Gene sets are simple lists of usually functional related genes without further specification of relationships between genes.

Pathways can be interpreted as specific gene sets, typically representing a group of genes that work together in a biological process.

Gene Ontology (GO)

Gene Ontology defines concepts/classes used to describe gene function, and relationships between these concepts. It classifies functions along three aspects:

MF: Molecular Function
- molecular activities of gene products
CC: Cellular Component
- where gene products are active
BP: Biological Process
- pathways and larger processes made up of the activities of multiple gene products

GO terms are organized in a directed acyclic graph, where edge between the terms represent parent-child relationship.

Kyoto Encyclopedia of Genes and Genomes (KEGG)

KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided in 7 broad categories: metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development.

Other gene sets

GO and KEGG are most frequently used for the functional analysis. They are typically the first choice because their long-standing curation and availability for a wide range of species.

Other gene sets including but not limited to Disease Ontology (DO), Disease Gene Network (DisGeNET), wikiPathways, Molecular Signatures Database (MSigDb).

Over Representation Analysis

Over Representation Analysis (ORA) (Boyle et al. 2004) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

The p-value can be calculated by hypergeometric distribution.

\(p = 1 - \displaystyle\sum_{i = 0}^{k-1}\frac{{M \choose i}{{N-M} \choose {n-i}}} {{N \choose n}}\)

In this equation, N is the total number of genes in the background distribution, M is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest, n is the size of the list of genes of interest and k is the number of genes within that list which are annotated to the gene set. The background distribution by default is all the genes that have annotation. P-values should be adjusted for multiple comparison.

Example: Microarray study, in which 11,812 genes have been tested for differential expression between two sample conditions and 202 genes were found DE.

Among the DE genes, 25 are annotated to a specific functional gene set, which contains in total 262 genes. This setup corresponds to a 2x2 contingency table:

d <-  matrix(c(25, 237, 177, 11323), 
          nrow = 2, 
          dimnames = list(c("DE", "Not DE"),
                        c("In GS", "Not in GS")))
d

##        In GS Not in GS
## DE        25       177
## Not DE   237     11323

whether the overlap of 25 genes are significantly over represented in the gene set can be assessed using hypergeometric distribution. This corresponding to a one-sided version of Fisher’s exact test.

fisher.test(d, alternative = "greater")

## 
##  Fisher's Exact Test for Count Data
## 
## data:  d
## p-value = 2.815e-12
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  4.502824      Inf
## sample estimates:
## odds ratio 
##   6.744738

Gene Set Enrichment Analysis

A major limitation of ORA is that it restricts analysis to DEGs, excluding the vast majority of genes not satisfying the chosen significance threshold.

This is resolved by gene set enrichment analysis (GSEA) (Subramanian et al. 2005). GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.

Genes are ranked based on their phenotypes. Given a priori defined set of gene S (e.g., genes sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (_L_) or primarily found at the top or bottom.

There are two key elements of the GSEA method:

Calculation of an Enrichment Score.
- The enrichment score (_ES_) represent the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic.
Esimation of Significance Level of ES.
- The p-value of the ES is calculated using permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permutated data, which generate a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.

Input data

For over representation analysis, all we need is a gene vector, that is a vector of gene IDs. These gene IDs can be obtained by differential expression analysis (e.g. with DESeq2 package).

For gene set enrichment analysis, we need a ranked list of genes. DOSE provides an example dataset geneList which was derived from R package breastCancerMAINZ that contained 200 samples, including 29 samples in grade I, 136 samples in grade II and 35 samples in grade III. We computed the ratios of geometric means of grade III samples versus geometric means of grade I samples. Logarithm of these ratios (base 2) were stored in geneList dataset.

The geneList contains three features:

numeric vector: fold change or other type of numerical variable
named vector: every number was named by the corresponding gene ID
sorted vector: number should be sorted in decreasing order

Suppose you are importing your own data from a csv file and the file contains two columns, one for gene ID (no duplicated allowed) and another one for fold change, you can prepare your own geneList via the following command:

d <- read.csv(your_csv_file)
## assume that 1st column is ID
## 2nd column is fold change

## feature 1: numeric vector
geneList <- d[,2]

## feature 2: named vector
names(geneList) <- as.character(d[,1])

## feature 3: decreasing order
geneList <- sort(geneList, decreasing = TRUE)

We can load the sample data into R via:

data(geneList, package="DOSE")
head(geneList)

##     4312     8318    10874    55143    55388      991 
## 4.572613 4.514594 4.418218 4.144075 3.876258 3.677857

Suppose we define fold change greater than 2 as DEGs:

gene <- names(geneList)[abs(geneList) > 2]
head(gene)

## [1] "4312"  "8318"  "10874" "55143" "55388" "991"

clusterProfiler as an universal enrichment analysis tool

clusterProfiler (Yu et al. 2012) internally supports enrichment analysis using different ontology and pathways, including GO, DO, KEGG and Reactome pathway. This is not enough for users may want to analyze their data with unsupported organisms, slim version of GO, novel functional annotation (e.g. GO via blastgo and KEGG via KAAS), unsupported ontology/pathway or customized annotation.

clusterProfiler provides enricher function for hypergeometric test and GSEA function for gene set enrichment analysis that are designed to accept user defined annotation. They accept two additional parameters TERM2GENE and TERM2NAME. As indicated in the parameter names, TERM2GENE is a data.frame with first column of term ID and second column of corresponding mapped gene, while TERM2NAME is a data.frame with first column of term ID and second column of corresponding term name. TERM2NAME is optional.

Here using clusterProfiler to analyze wiki pathways and molecular signatures database as examples.

WikiPathways analysis

WikiPathways is a continuously updated pathway database curated by a community of researchers and pathway enthusiasts. WikiPathways produces monthly releases of gmt files for supported organisms at data.wikipathways.org. Download the appropriate gmt file and then generate TERM2GENE and TERM2NAME to use enricher and GSEA functions.

library(magrittr)
library(clusterProfiler)

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(ont, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME

ewp <- enricher(gene, TERM2GENE = wpid2gene, TERM2NAME = wpid2name)
head(ewp)

##            ID
## WP2446 WP2446
## WP2361 WP2361
## WP179   WP179
## WP3942 WP3942
## WP4240 WP4240
## WP2328 WP2328
##                                                                           Description
## WP2446                                                  Retinoblastoma (RB) in Cancer
## WP2361                                                       Gastric Cancer Network 1
## WP179                                                                      Cell Cycle
## WP3942                                                         PPAR signaling pathway
## WP4240 Regulation of sister chromatid separation at the metaphase-anaphase transition
## WP2328                                                            Allograft Rejection
##        GeneRatio  BgRatio       pvalue     p.adjust       qvalue
## WP2446     11/95  88/6249 6.801697e-08 1.054263e-05 9.450779e-06
## WP2361      6/95  29/6249 3.772735e-06 2.923870e-04 2.621058e-04
## WP179      10/95 122/6249 1.384549e-05 7.153503e-04 6.412648e-04
## WP3942      7/95  67/6249 6.210513e-05 2.406574e-03 2.157336e-03
## WP4240      4/95  16/6249 7.931988e-05 2.458916e-03 2.204258e-03
## WP2328      7/95  90/6249 4.016758e-04 1.037663e-02 9.301967e-03
##                                                       geneID Count
## WP2446 8318/9133/7153/6241/890/983/81620/7272/1111/891/24137    11
## WP2361                       4605/7153/11065/22974/6286/6790     6
## WP179          8318/991/9133/890/983/7272/1111/891/4174/9232    10
## WP3942                    4312/9415/9370/5105/2167/3158/5346     7
## WP4240                                    991/1062/4085/9232     4
## WP2328                   10563/6373/4283/3002/10578/3117/730     7

ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
head(ewp2)

##            ID                                    Description setSize
## WP3932 WP3932 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway     281
## WP306   WP306                                 Focal Adhesion     188
## WP236   WP236                                   Adipogenesis     125
## WP474   WP474                      Endochondral Ossification      59
## WP2911 WP2911    miRNA targets in ECM and membrane receptors      21
## WP2361 WP2361                       Gastric Cancer Network 1      23
##        enrichmentScore       NES      pvalue   p.adjust    qvalues rank
## WP3932      -0.3903410 -1.659561 0.001307190 0.03278986 0.02707857 1994
## WP306       -0.4230017 -1.717471 0.001418440 0.03278986 0.02707857 2221
## WP236       -0.4387536 -1.689787 0.001529052 0.03278986 0.02707857 2301
## WP474       -0.5111776 -1.752637 0.001633987 0.03278986 0.02707857 2142
## WP2911      -0.6953475 -1.946647 0.001715266 0.03278986 0.02707857 2629
## WP2361       0.8368760  2.547453 0.002309469 0.03278986 0.02707857  854
##                          leading_edge
## WP3932 tags=26%, list=16%, signal=22%
## WP306  tags=28%, list=18%, signal=23%
## WP236  tags=34%, list=18%, signal=28%
## WP474  tags=47%, list=17%, signal=40%
## WP2911 tags=76%, list=21%, signal=60%
## WP2361  tags=52%, list=7%, signal=49%
##                                                                                                                                                                                                                                                                                                                                                                         core_enrichment
## WP3932 7059/51719/8660/5563/5295/6794/1288/7010/3910/3371/3082/3791/1301/1027/90993/3643/1129/1975/7450/3685/2034/1942/2149/1280/4804/3675/2261/7248/2246/4803/2259/3912/1902/2308/1278/1277/81617/2846/2057/2247/1281/50509/1290/55970/5618/7058/10161/56034/3693/4254/6720/3480/5159/3991/1289/1292/3908/2690/3909/8817/3551/2791/63923/3913/3667/3679/7060/3479/80310/1311/1101/3169
## WP306                                                                                                              5595/5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/5753/1277/10398/55742/50509/1290/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/3679/7060/3479/10451/80310/1311/1101
## WP236                                                                                                                                                                       8321/5925/8609/1499/2908/4088/8660/3399/6778/8648/6258/2662/5914/6776/2034/196/8204/4692/4208/2308/5468/6695/4023/1592/7350/81029/3952/1675/5618/6720/3991/2487/3667/3572/3479/6424/9370/5105/652/5346/2625
## WP474                                                                                                                                                                                                                                          4921/51719/860/5745/1028/1280/2261/4256/7042/4208/4322/8100/7078/2247/11096/85477/3480/2690/8817/5167/2737/5744/2487/5327/1300/1473/3479
## WP2911                                                                                                                                                                                                                                                                                                      3672/7057/3915/3910/1291/1278/1293/1281/50509/1290/7058/3693/1289/1292/3913
## WP2361                                                                                                                                                                                                                                                                                                                   4605/7153/11065/22974/6286/6790/1894/56992/4173/1063/9585/8607

You may want to convert the gene IDs to gene symbols, which can be done by setReadable function.

library(org.Hs.eg.db)
ewp <- setReadable(ewp, org.Hs.eg.db, keyType = "ENTREZID")
ewp2 <- setReadable(ewp2, org.Hs.eg.db, keyType = "ENTREZID")
head(ewp)

##            ID
## WP2446 WP2446
## WP2361 WP2361
## WP179   WP179
## WP3942 WP3942
## WP4240 WP4240
## WP2328 WP2328
##                                                                           Description
## WP2446                                                  Retinoblastoma (RB) in Cancer
## WP2361                                                       Gastric Cancer Network 1
## WP179                                                                      Cell Cycle
## WP3942                                                         PPAR signaling pathway
## WP4240 Regulation of sister chromatid separation at the metaphase-anaphase transition
## WP2328                                                            Allograft Rejection
##        GeneRatio  BgRatio       pvalue     p.adjust       qvalue
## WP2446     11/95  88/6249 6.801697e-08 1.054263e-05 9.450779e-06
## WP2361      6/95  29/6249 3.772735e-06 2.923870e-04 2.621058e-04
## WP179      10/95 122/6249 1.384549e-05 7.153503e-04 6.412648e-04
## WP3942      7/95  67/6249 6.210513e-05 2.406574e-03 2.157336e-03
## WP4240      4/95  16/6249 7.931988e-05 2.458916e-03 2.204258e-03
## WP2328      7/95  90/6249 4.016758e-04 1.037663e-02 9.301967e-03
##                                                              geneID Count
## WP2446 CDC45/CCNB2/TOP2A/RRM2/CCNA2/CDK1/CDT1/TTK/CHEK1/CCNB1/KIF4A    11
## WP2361                           MYBL2/TOP2A/UBE2C/TPX2/S100P/AURKA     6
## WP179       CDC45/CDC20/CCNB2/CCNA2/CDK1/TTK/CHEK1/CCNB1/MCM5/PTTG1    10
## WP3942                    MMP1/FADS2/ADIPOQ/PCK1/FABP4/HMGCS2/PLIN1     7
## WP4240                                     CDC20/CENPE/MAD2L1/PTTG1     4
## WP2328                    CXCL13/CXCL11/CXCL9/GZMB/GNLY/HLA-DQA1/C7     7

head(ewp2)

##            ID                                    Description setSize
## WP3932 WP3932 Focal Adhesion-PI3K-Akt-mTOR-signaling pathway     281
## WP306   WP306                                 Focal Adhesion     188
## WP236   WP236                                   Adipogenesis     125
## WP474   WP474                      Endochondral Ossification      59
## WP2911 WP2911    miRNA targets in ECM and membrane receptors      21
## WP2361 WP2361                       Gastric Cancer Network 1      23
##        enrichmentScore       NES      pvalue   p.adjust    qvalues rank
## WP3932      -0.3903410 -1.659561 0.001307190 0.03278986 0.02707857 1994
## WP306       -0.4230017 -1.717471 0.001418440 0.03278986 0.02707857 2221
## WP236       -0.4387536 -1.689787 0.001529052 0.03278986 0.02707857 2301
## WP474       -0.5111776 -1.752637 0.001633987 0.03278986 0.02707857 2142
## WP2911      -0.6953475 -1.946647 0.001715266 0.03278986 0.02707857 2629
## WP2361       0.8368760  2.547453 0.002309469 0.03278986 0.02707857  854
##                          leading_edge
## WP3932 tags=26%, list=16%, signal=22%
## WP306  tags=28%, list=18%, signal=23%
## WP236  tags=34%, list=18%, signal=28%
## WP474  tags=47%, list=17%, signal=40%
## WP2911 tags=76%, list=21%, signal=60%
## WP2361  tags=52%, list=7%, signal=49%
##                                                                                                                                                                                                                                                                                                                                                                                                                            core_enrichment
## WP3932 THBS3/CAB39/IRS2/PRKAA2/PIK3R1/STK11/COL4A6/TEK/LAMA4/TNC/HGF/KDR/COL11A1/CDKN1B/CREB3L1/INSR/CHRM2/EIF4B/VWF/ITGAV/EPAS1/EFNA1/F2R/COL2A1/NGFR/ITGA3/FGFR3/TSC1/FGF1/NGF/FGF14/LAMB1/LPAR1/FOXO1/COL1A2/COL1A1/CAB39L/LPAR4/EPOR/FGF2/COL3A1/COL5A3/COL5A2/GNG12/PRLR/THBS2/LPAR6/PDGFC/ITGB5/KITLG/SREBF1/IGF1R/PDGFRB/LIPE/COL5A1/COL6A2/LAMA2/GHR/LAMA3/FGF18/IKBKB/GNG11/TNN/LAMB2/IRS1/ITGA7/THBS4/IGF1/PDGFD/COMP/CHAD/FOXA1
## WP306                                                                                                                               MAPK3/PGF/VEGFC/CTNNB1/MYL5/TLN2/THBS3/PIK3R1/COL4A6/PIP5K1C/LAMA4/TNC/HGF/ARHGAP5/KDR/VWF/BCL2/ITGAV/COL2A1/ITGA3/CCND1/FLNC/LAMB1/DOCK1/COL1A2/PTK6/COL1A1/MYL9/PARVA/COL5A3/COL5A2/FLNB/THBS2/SHC2/PDGFC/ITGB5/IGF1R/PDGFRB/CAV1/COL6A2/LAMA2/LAMA3/TNN/LAMB2/ITGA7/THBS4/IGF1/VAV3/PDGFD/COMP/CHAD
## WP236                                                                                                                                                                                                    FZD1/RB1/KLF7/CTNNB1/NR3C1/SMAD3/IRS2/ID3/STAT6/NCOA1/RXRG/GDF10/RARA/STAT5A/EPAS1/AHR/NRIP1/NDN/MEF2C/FOXO1/PPARG/SPOCK1/LPL/CYP26A1/UCP1/WNT5B/LEP/CFD/PRLR/SREBF1/LIPE/FRZB/IRS1/IL6ST/IGF1/SFRP4/ADIPOQ/PCK1/BMP4/PLIN1/GATA3
## WP474                                                                                                                                                                                                                                                                          DDR2/CAB39/RUNX2/PTH1R/CDKN1C/COL2A1/FGFR3/MGP/TGFB2/MEF2C/MMP13/IFT88/TIMP3/FGF2/ADAMTS5/SCIN/IGF1R/GHR/FGF18/ENPP1/GLI3/PTHLH/FRZB/PLAT/COL10A1/CST5/IGF1
## WP2911                                                                                                                                                                                                                                                                                                                                   ITGA1/THBS1/LAMC1/LAMA4/COL6A1/COL1A2/COL6A3/COL3A1/COL5A3/COL5A2/THBS2/ITGB5/COL5A1/COL6A2/LAMB2
## WP2361                                                                                                                                                                                                                                                                                                                                                              MYBL2/TOP2A/UBE2C/TPX2/S100P/AURKA/ECT2/KIF15/MCM4/CENPF/KIF20B/RUVBL1

As an alternative to manually downloading gmt files, install the rWikiPathways package to gain scripting access to the latest gmt files using the downloadPathwayArchive function.

MSigDb analysis

Molecular Signatures Database contains 8 major collections:

H: hallmark gene sets
C1: positional gene sets
C2: curated gene sets
C3: motif gene sets
C4: computational gene sets
C5: GO gene sets
C6: oncogenic signatures
C7: immunologic signatures

Users can download GMT files from Broad Institute and use read.gmt to parse the file to be used in enricher() and GSEA().

There is an R package, msigdbr, that already packed the MSigDB gene sets in tidy data format that can be used directly with clusterProfiler.

It supports several specices:

library(msigdbr)
msigdbr_show_species()

##  [1] "Bos taurus"               "Caenorhabditis elegans"  
##  [3] "Canis lupus familiaris"   "Danio rerio"             
##  [5] "Drosophila melanogaster"  "Gallus gallus"           
##  [7] "Homo sapiens"             "Mus musculus"            
##  [9] "Rattus norvegicus"        "Saccharomyces cerevisiae"
## [11] "Sus scrofa"

We can retrieve all human gene sets:

m_df <- msigdbr(species = "Homo sapiens")
head(m_df, 2) %>% as.data.frame

##          gs_name  gs_id gs_cat gs_subcat human_gene_symbol species_name
## 1 AAACCAC_MIR140 M12609     C3       MIR             ABCC4 Homo sapiens
## 2 AAACCAC_MIR140 M12609     C3       MIR             ACTN4 Homo sapiens
##   entrez_gene gene_symbol sources
## 1       10257       ABCC4    <NA>
## 2          81       ACTN4    <NA>

Or specific collection. Here we use C6, oncogenic gene sets as an example:

m_t2g <- msigdbr(species = "Homo sapiens", category = "C6") %>% 
  dplyr::select(gs_name, entrez_gene)
head(m_t2g)

## # A tibble: 6 x 2
##   gs_name              entrez_gene
##   <chr>                      <int>
## 1 AKT_UP_MTOR_DN.V1_DN       25864
## 2 AKT_UP_MTOR_DN.V1_DN          95
## 3 AKT_UP_MTOR_DN.V1_DN      137872
## 4 AKT_UP_MTOR_DN.V1_DN         134
## 5 AKT_UP_MTOR_DN.V1_DN       55326
## 6 AKT_UP_MTOR_DN.V1_DN         271

em <- enricher(gene, TERM2GENE=m_t2g)
em2 <- GSEA(geneList, TERM2GENE = m_t2g)
head(em)

##                                            ID            Description
## RPS14_DN.V1_DN                 RPS14_DN.V1_DN         RPS14_DN.V1_DN
## GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP GCNP_SHH_UP_LATE.V1_UP
## VEGF_A_UP.V1_DN               VEGF_A_UP.V1_DN        VEGF_A_UP.V1_DN
## PRC2_EZH2_UP.V1_DN         PRC2_EZH2_UP.V1_DN     PRC2_EZH2_UP.V1_DN
## CSR_LATE_UP.V1_UP           CSR_LATE_UP.V1_UP      CSR_LATE_UP.V1_UP
## E2F1_UP.V1_UP                   E2F1_UP.V1_UP          E2F1_UP.V1_UP
##                        GeneRatio   BgRatio       pvalue     p.adjust
## RPS14_DN.V1_DN            22/186 187/11250 4.072878e-13 6.923892e-11
## GCNP_SHH_UP_LATE.V1_UP    16/186 183/11250 5.657195e-08 4.808615e-06
## VEGF_A_UP.V1_DN           15/186 193/11250 6.903522e-07 3.911996e-05
## PRC2_EZH2_UP.V1_DN        14/186 195/11250 4.197805e-06 1.784067e-04
## CSR_LATE_UP.V1_UP         12/186 172/11250 2.766080e-05 9.404672e-04
## E2F1_UP.V1_UP             12/186 189/11250 6.963494e-05 1.972990e-03
##                              qvalue
## RPS14_DN.V1_DN         5.616284e-11
## GCNP_SHH_UP_LATE.V1_UP 3.900487e-06
## VEGF_A_UP.V1_DN        3.173198e-05
## PRC2_EZH2_UP.V1_DN     1.447138e-04
## CSR_LATE_UP.V1_UP      7.628557e-04
## E2F1_UP.V1_UP          1.600382e-03
##                                                                                                                                        geneID
## RPS14_DN.V1_DN         10874/55388/991/9493/1062/4605/9133/23397/79733/9787/55872/83461/54821/51659/9319/9055/10112/4174/5105/2532/7021/79901
## GCNP_SHH_UP_LATE.V1_UP                                      55388/7153/79733/6241/9787/51203/983/9212/1111/9319/9055/3833/6790/4174/3169/1580
## VEGF_A_UP.V1_DN                                                   8318/9493/1062/9133/10403/6241/9787/4085/332/3832/7272/891/23362/2167/10234
## PRC2_EZH2_UP.V1_DN                                               8318/55388/4605/23397/9787/55355/10460/81620/2146/7272/9212/11182/3887/24137
## CSR_LATE_UP.V1_UP                                                               55143/2305/4605/6241/11065/55872/983/332/2146/51659/9319/1580
## E2F1_UP.V1_UP                                                                  55388/7153/23397/79733/9787/2146/2842/9212/8208/1111/9055/3833
##                        Count
## RPS14_DN.V1_DN            22
## GCNP_SHH_UP_LATE.V1_UP    16
## VEGF_A_UP.V1_DN           15
## PRC2_EZH2_UP.V1_DN        14
## CSR_LATE_UP.V1_UP         12
## E2F1_UP.V1_UP             12

head(em2)

##                                ID      Description setSize enrichmentScore
## MTOR_UP.N4.V1_DN MTOR_UP.N4.V1_DN MTOR_UP.N4.V1_DN     181      -0.4806248
## ATF2_S_UP.V1_DN   ATF2_S_UP.V1_DN  ATF2_S_UP.V1_DN     182      -0.4456885
## IL15_UP.V1_DN       IL15_UP.V1_DN    IL15_UP.V1_DN     182      -0.4687839
## ATF2_UP.V1_DN       ATF2_UP.V1_DN    ATF2_UP.V1_DN     178      -0.5029082
## IL2_UP.V1_DN         IL2_UP.V1_DN     IL2_UP.V1_DN     183      -0.4237496
## LEF1_UP.V1_DN       LEF1_UP.V1_DN    LEF1_UP.V1_DN     188      -0.4938672
##                        NES      pvalue   p.adjust     qvalues rank
## MTOR_UP.N4.V1_DN -1.955524 0.001364256 0.01538462 0.008225692 3033
## ATF2_S_UP.V1_DN  -1.815562 0.001366120 0.01538462 0.008225692 2531
## IL15_UP.V1_DN    -1.909644 0.001366120 0.01538462 0.008225692 2436
## ATF2_UP.V1_DN    -2.037247 0.001377410 0.01538462 0.008225692 2055
## IL2_UP.V1_DN     -1.724007 0.001377410 0.01538462 0.008225692 2774
## LEF1_UP.V1_DN    -2.008503 0.001381215 0.01538462 0.008225692 1705
##                                    leading_edge
## MTOR_UP.N4.V1_DN tags=51%, list=24%, signal=39%
## ATF2_S_UP.V1_DN  tags=39%, list=20%, signal=32%
## IL15_UP.V1_DN    tags=40%, list=19%, signal=33%
## ATF2_UP.V1_DN    tags=38%, list=16%, signal=32%
## IL2_UP.V1_DN     tags=39%, list=22%, signal=31%
## LEF1_UP.V1_DN    tags=36%, list=14%, signal=32%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 core_enrichment
## MTOR_UP.N4.V1_DN 10778/1571/5890/55512/64131/90806/534/4335/3280/7169/54916/10223/26959/51351/3204/466/9975/84820/1998/63920/283232/9648/54737/3005/55634/29896/3476/9710/57337/1203/56985/26054/10023/23177/85459/51111/55251/7716/54014/1901/51185/7552/7561/51170/645644/8334/55450/81627/7733/84159/5334/57326/10224/11226/10252/25901/55758/9665/10795/79800/57730/25949/10516/4311/79684/911/81698/7741/373/9742/51809/582/57396/7832/1657/80035/10628/901/23492/1195/55733/220972/7704/150759/51306/55268/26137/114327/54585/643314/1602
## ATF2_S_UP.V1_DN                                                                                                                                                   53616/55667/51279/4642/7130/7139/2/710/80352/4608/11145/881/2252/3399/22846/10687/12/1397/1301/3554/2619/26011/78987/9284/1306/6586/9627/4054/54813/115207/55890/26249/4803/11075/3400/79789/8527/2353/10580/79987/80760/5493/90865/3485/3751/2202/2199/10610/1047/1294/23492/3908/5364/658/5350/2121/2331/4330/9737/4982/57088/8490/9358/6424/1307/3708/125/1311/56521/10351
## IL15_UP.V1_DN                                                                                                                   64798/26959/6588/3784/777/113791/79981/64759/2535/957/6123/8038/51285/126353/55357/1470/2669/1028/4222/7450/225/1942/9976/29068/5592/54996/10272/6019/55779/80823/80215/5334/9639/57326/6403/26115/54453/57126/151011/57644/1831/51339/6591/55204/1393/1272/57608/79999/653483/9886/10279/55805/1047/10628/901/64123/27244/116039/79921/7041/54361/79843/6857/2239/6947/23541/55857/1811/3357/5104/56967/284266
## ATF2_UP.V1_DN                                                                                                                                                           9738/2252/8660/10687/5357/80201/12/22998/3554/2619/23556/8840/9284/399491/57419/5737/6586/9627/8863/10085/115207/55890/23015/629/3212/3400/50650/8527/2353/79987/80760/9668/1012/10614/5493/8553/23126/90865/3485/3751/2202/57212/2199/10610/3075/6387/22834/1909/10129/57161/5350/2121/2331/4330/2952/4982/57088/9358/6424/6038/51313/1311/56521/10351/5304/4239/79901
## IL2_UP.V1_DN                                                                                                                                       10858/5333/2977/79874/10014/7773/64798/26959/6588/579/777/5158/23361/9915/5087/64759/2535/6123/7010/51285/126353/80201/79841/1028/2322/4222/947/9187/225/9976/5592/54996/3202/55779/23148/5334/8605/57326/26115/54453/55766/64927/6591/55204/9940/57608/653483/9886/10279/55805/10628/901/9891/7102/27244/79921/4674/8292/5125/79843/6857/79940/23541/3357/50853/9863/79932/90362/55663/4250
## LEF1_UP.V1_DN                                                                                                                                                 22998/10325/23221/79170/9187/4300/51479/25956/7552/8644/11162/7168/1490/54463/65084/25837/80303/1809/3397/4208/11223/79762/6451/481/54795/51626/8991/5950/5627/64699/6414/10103/221078/9961/5874/56898/90865/57235/83989/5002/6653/56034/55314/85458/956/54502/4832/5783/1363/53832/54985/80221/84786/214/50853/51149/57088/5157/2947/79932/7031/6505/9338/25924/80736/3169/10551

We can test with other collections, for example, using C3 to test whether the genes are up/down-regulated by sharing specific motif.

m_t2g <- msigdbr(species = "Homo sapiens", category = "C3") %>% 
  dplyr::select(gs_name, entrez_gene)
head(m_t2g)

## # A tibble: 6 x 2
##   gs_name        entrez_gene
##   <chr>                <int>
## 1 AAACCAC_MIR140       10257
## 2 AAACCAC_MIR140          81
## 3 AAACCAC_MIR140          90
## 4 AAACCAC_MIR140        8754
## 5 AAACCAC_MIR140       11096
## 6 AAACCAC_MIR140         177

em3 <- GSEA(geneList, TERM2GENE = m_t2g)
head(em3)

##                                                            ID
## TGACATY_UNKNOWN                               TGACATY_UNKNOWN
## WTGAAAT_UNKNOWN                               WTGAAAT_UNKNOWN
## GGATTA_PITX2_Q2                               GGATTA_PITX2_Q2
## TGCCTTA_MIR124A                               TGCCTTA_MIR124A
## CAGTATT_MIR200B_MIR200C_MIR429 CAGTATT_MIR200B_MIR200C_MIR429
## ACTGTGA_MIR27A_MIR27B                   ACTGTGA_MIR27A_MIR27B
##                                                   Description setSize
## TGACATY_UNKNOWN                               TGACATY_UNKNOWN     497
## WTGAAAT_UNKNOWN                               WTGAAAT_UNKNOWN     458
## GGATTA_PITX2_Q2                               GGATTA_PITX2_Q2     449
## TGCCTTA_MIR124A                               TGCCTTA_MIR124A     436
## CAGTATT_MIR200B_MIR200C_MIR429 CAGTATT_MIR200B_MIR200C_MIR429     383
## ACTGTGA_MIR27A_MIR27B                   ACTGTGA_MIR27A_MIR27B     364
##                                enrichmentScore       NES      pvalue
## TGACATY_UNKNOWN                     -0.3193965 -1.429544 0.001206273
## WTGAAAT_UNKNOWN                     -0.3214586 -1.430030 0.001216545
## GGATTA_PITX2_Q2                     -0.3345811 -1.484699 0.001223990
## TGCCTTA_MIR124A                     -0.3066390 -1.359231 0.001234568
## CAGTATT_MIR200B_MIR200C_MIR429      -0.3564722 -1.564565 0.001248439
## ACTGTGA_MIR27A_MIR27B               -0.3119216 -1.363876 0.001254705
##                                  p.adjust    qvalues rank
## TGACATY_UNKNOWN                0.02011846 0.01386739 2961
## WTGAAAT_UNKNOWN                0.02011846 0.01386739 1523
## GGATTA_PITX2_Q2                0.02011846 0.01386739 2464
## TGCCTTA_MIR124A                0.02011846 0.01386739 3042
## CAGTATT_MIR200B_MIR200C_MIR429 0.02011846 0.01386739 3186
## ACTGTGA_MIR27A_MIR27B          0.02011846 0.01386739 3087
##                                                  leading_edge
## TGACATY_UNKNOWN                tags=30%, list=24%, signal=24%
## WTGAAAT_UNKNOWN                tags=18%, list=12%, signal=17%
## GGATTA_PITX2_Q2                tags=26%, list=20%, signal=22%
## TGCCTTA_MIR124A                tags=31%, list=24%, signal=24%
## CAGTATT_MIR200B_MIR200C_MIR429 tags=34%, list=25%, signal=26%
## ACTGTGA_MIR27A_MIR27B          tags=30%, list=25%, signal=23%
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    core_enrichment
## TGACATY_UNKNOWN                800/5168/55691/5789/11080/6400/55512/1112/9604/29760/4638/3218/27314/2065/4286/9459/4026/7026/4335/2263/53353/54329/7082/10446/5930/8567/51334/10174/8626/687/3321/83478/6588/5441/8929/7881/9846/23328/8924/220988/146057/55909/4661/29116/23405/94134/56171/63898/4131/4776/80000/627/9748/2908/11138/23767/10290/5295/1288/9445/10231/29/3373/2669/5191/4211/5813/25945/7716/596/4300/2078/3931/8633/9627/2872/3213/80021/2261/11194/3223/4212/8848/5837/6764/23037/57326/7106/590/7042/4208/55084/50650/80052/6450/323/481/8436/2353/10370/862/9120/727/79899/23452/1293/10580/60481/54838/55204/5166/2845/81603/6876/8522/1592/5577/3489/6310/2202/29951/27239/55821/9499/1501/57161/6571/4035/8821/5744/90627/4922/55800/4223/10472/23303/1264/54587/9590/9576/185/4675/576/1287/9358/3479/5348/776/771/79689
## WTGAAAT_UNKNOWN                                                                                                                                                                                                                                                                                                                                                                         2823/80762/50486/4628/9736/324/80070/51585/9627/8082/11211/4885/64112/7482/5334/79645/4734/80059/4208/11075/55084/2012/6709/6003/55629/8829/1581/54843/2353/862/727/10516/10580/1848/55204/5166/89795/2845/79776/9886/761/116496/4488/11096/10468/23677/9201/3075/6925/4121/443/55184/57332/5205/10129/2819/57161/51306/55812/3249/54361/54970/5744/26960/10186/10472/388677/4330/53829/26137/9079/8404/36/3572/3479/1602/23090/81563/2167/8614/57758/4969
## GGATTA_PITX2_Q2                                                                                                                                                                            311/3199/4604/57556/8019/3321/57835/8929/6671/1760/10401/4430/9497/5530/6532/54894/23528/5228/5087/2295/6453/56171/3232/4921/3005/221037/56912/10265/10521/10497/23767/8322/37/23429/6561/4303/57082/845/94/51447/9922/6258/23243/2571/10253/4602/29119/55068/126393/7402/56675/5624/7716/7779/596/4300/57804/79923/1280/100506658/9627/3675/26037/2261/2804/65084/11030/3216/57326/140890/23255/5396/9854/1907/54843/57037/862/23452/55152/1848/8835/89795/57007/64221/5136/7227/10324/55273/6310/2202/56034/956/54502/51474/4254/6925/6097/744/443/23446/4306/4915/4487/25803/4081/4223/10472/388677/55107/347902/3708/10742/51313/4857/730/4036/3169
## TGCCTTA_MIR124A                                                                            10160/6431/8473/23061/10579/9695/6667/55691/9647/29946/6509/55652/10848/1910/23411/4286/4026/11018/4641/29966/9706/2263/6990/8325/54329/89797/10014/783/51222/5141/84196/23499/311/55752/7003/57556/3915/26959/23189/5884/23328/9781/107/9341/9497/23001/4784/6416/9759/29116/23405/6867/1718/29956/4131/2013/60468/64418/5791/2908/10424/163486/860/2004/7320/9256/23621/29/10253/129642/22998/182/55082/54014/81562/4300/51479/51164/10217/80762/27303/4628/9627/196/171586/10979/115207/26035/23022/79026/23037/91694/10016/140890/25901/5396/9649/79589/2042/323/2494/22905/2620/60481/1848/1272/89795/7227/55830/54681/1657/4337/29994/6444/26018/51157/5138/63895/65055/220972/23428/81493/2239/214/65982/2891/3679/219654/9254/55103/25924/81563
## CAGTATT_MIR200B_MIR200C_MIR429                                                                                     31/23291/5179/57092/9969/9865/8473/9644/10395/6383/5569/1392/4008/9647/11041/8613/6405/4884/8491/9706/2263/6567/53353/64766/5310/546/5128/55145/54469/200576/8434/7003/687/3915/1456/3321/10499/23189/26060/55578/1998/51421/22864/10484/10140/23001/4661/253782/9807/83660/95681/1960/2908/51719/2186/22846/108/9710/80205/2104/9522/7320/23243/23258/4602/129642/2669/51439/7716/54014/23047/5099/2823/3340/1942/7552/1843/100506658/8204/10979/79618/23015/8848/4692/5334/57515/26249/6563/23122/6196/4077/2494/10370/3953/51339/79365/54838/89795/8076/5577/3751/6310/1009/29994/10769/1501/57509/3899/2273/23116/4908/27244/775/5783/65055/4035/23414/2737/5744/6857/55800/10186/3131/9828/6935/4982/219654/1602/3708/4857
## ACTGTGA_MIR27A_MIR27B                                                                                                                                                                                                            80212/9865/10274/85021/6667/5569/23129/1112/4440/4724/4297/3218/2354/1856/7423/766/90/89797/783/9945/57556/30845/54778/64925/22878/10106/9781/5139/3092/6416/4661/26118/25900/9648/94134/23314/80000/162427/221037/56674/51719/825/490/23767/23390/8648/23429/4124/10231/10513/22902/5914/10253/27347/3643/55082/23047/9252/27076/9315/4804/8646/8204/2872/3213/3223/4885/7248/4212/9900/91694/4734/140890/167227/22849/387263/2308/80267/51280/1003/2494/5468/373/900/55799/23389/51809/79776/5863/55970/7832/1009/10161/10769/6594/1195/1909/6857/27123/23261/3667/219654/27124/10451/771/4857/81563/4148/11122

Gene Ontology Analysis

GO enrichment analysis is internally supported via enrichGO (for ORA) and gseGO (for GSEA). These two functions combine with OrgDb as background annotation. There are 19 OrgDb for 19 species respectively available via the Bioconductor proejct.

For example, org.Hs.eg.db contains genome wide annotation for human.

Here are the examples of using enrichGO and gseGO for GO enrichment analysis:

library(clusterProfiler)

data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]

ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                OrgDb         = org.Hs.eg.db,
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)

Instead of using whole genome annotation as background, users can specify background by the universe parameter. For example, all genes that are actually expressed in an experiment.

head(ego)

##                    ID                              Description GeneRatio
## GO:0005819 GO:0005819                                  spindle    25/202
## GO:0005876 GO:0005876                      spindle microtubule    12/202
## GO:0000779 GO:0000779 condensed chromosome, centromeric region    15/202
## GO:0072686 GO:0072686                          mitotic spindle    14/202
## GO:0000775 GO:0000775           chromosome, centromeric region    18/202
## GO:0000776 GO:0000776                              kinetochore    15/202
##              BgRatio       pvalue     p.adjust       qvalue
## GO:0005819 262/11812 2.569986e-12 8.146855e-10 7.520590e-10
## GO:0005876  45/11812 7.912518e-12 1.254134e-09 1.157726e-09
## GO:0000779  91/11812 3.254384e-11 3.438799e-09 3.174452e-09
## GO:0072686  84/11812 1.292642e-10 9.903693e-09 9.142376e-09
## GO:0000775 156/11812 1.562097e-10 9.903693e-09 9.142376e-09
## GO:0000776 106/11812 3.073022e-10 1.623580e-08 1.498772e-08
##                                                                                                                                                         geneID
## GO:0005819 CDCA8/CDC20/KIF23/CENPE/ASPM/DLGAP5/SKA1/NUSAP1/TPX2/TACC3/NEK2/CDK1/MAD2L1/KIF18A/BIRC5/KIF11/TTK/AURKB/PRC1/KIFC1/KIF18B/KIF20A/AURKA/CCNB1/KIF4A
## GO:0005876                                                                             CENPE/SKA1/NUSAP1/CDK1/KIF18A/BIRC5/KIF11/AURKB/PRC1/KIF18B/AURKA/KIF4A
## GO:0000779                                                            CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/CDT1/BIRC5/NCAPG/AURKB/AURKA/CCNB1
## GO:0072686                                                                KIF23/CENPE/ASPM/NUSAP1/TPX2/TACC3/CDK1/MAD2L1/KIF18A/KIF11/AURKB/KIFC1/KIF18B/AURKA
## GO:0000775                                           CDCA8/CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/NCAPG/AURKB/AURKA/CCNB1
## GO:0000776                                                             CENPE/NDC80/HJURP/SKA1/NEK2/CENPM/CENPN/ERCC6L/MAD2L1/KIF18A/CDT1/BIRC5/TTK/AURKB/CCNB1
##            Count
## GO:0005819    25
## GO:0005876    12
## GO:0000779    15
## GO:0072686    14
## GO:0000775    18
## GO:0000776    15

Similary, we can run GSEA test using gseGO function:

ego2 <- gseGO(geneList     = geneList,
              OrgDb        = org.Hs.eg.db,
              ont          = "CC",
              nPerm        = 1000,
              minGSSize    = 100,
              maxGSSize    = 500,
              pvalueCutoff = 0.05,
              verbose      = FALSE)

GO analysis for non-model organisms

Both enrichGO and gseGO functions require an OrgDb object as background annotation. For organisms that don’t have OrgDb provided by Bioconductor, users can query one (if available) online via AnnotationHub. If there is no OrgDb available, users can obtain GO annotation from other sources, e.g. from biomaRt or Blast2GO. Then using enricher or GSEA function to analyze, similar to the examples using wikiPathways and MSigDB. Another solution is to create OrgDb by your own using AnnotationForge package.

Reducing GO redundancy

GO is organized in parent-child structure, thus a parent term can be overlap with a large proportion with all its child terms. This can result in redundant findings. To solve this issue, clusterProfiler implement simplify method to reduce redundant GO terms from the outputs of enrichGO and gseGO. The function internally called GOSemSim (Yu et al. 2010) to calculate semantic similarity among GO terms and remove those highly similar terms by keeping one representative term.

sego <- simplify(ego)

library(enrichplot)
emapplot(ego)

emapplot(sego)

KEGG Analysis

Most of the tools for KEGG analysis use out-dated KEGG annotation. clusterProfiler supports downloading latest online version of KEGG data for enrichment analysis. It supports KEGG Pathway and Module.

kk <- enrichKEGG(gene, organism = "hsa")
head(kk)

##                ID                             Description GeneRatio
## hsa04110 hsa04110                              Cell cycle     11/90
## hsa04114 hsa04114                          Oocyte meiosis     10/90
## hsa04218 hsa04218                     Cellular senescence     10/90
## hsa03320 hsa03320                  PPAR signaling pathway      7/90
## hsa04914 hsa04914 Progesterone-mediated oocyte maturation      7/90
##           BgRatio       pvalue     p.adjust       qvalue
## hsa04110 124/7485 2.271800e-07 4.384573e-05 4.304463e-05
## hsa04114 125/7485 2.175154e-06 2.099024e-04 2.060673e-04
## hsa04218 160/7485 1.977884e-05 1.272439e-03 1.249190e-03
## hsa03320  74/7485 2.687347e-05 1.296645e-03 1.272954e-03
## hsa04914  99/7485 1.743529e-04 6.730023e-03 6.607059e-03
##                                                      geneID Count
## hsa04110 8318/991/9133/890/983/4085/7272/1111/891/4174/9232    11
## hsa04114    991/9133/983/4085/51806/6790/891/9232/3708/5241    10
## hsa04218     2305/4605/9133/890/983/51806/1111/891/776/3708    10
## hsa03320                 4312/9415/9370/5105/2167/3158/5346     7
## hsa04914                    9133/890/983/4085/6790/891/5241     7

gk <- gseKEGG(geneList, organism="hsa", nPerm=10000)
head(gk)

##                ID                       Description setSize
## hsa04510 hsa04510                    Focal adhesion     188
## hsa03030 hsa03030                   DNA replication      33
## hsa05340 hsa05340          Primary immunodeficiency      34
## hsa03050 hsa03050                        Proteasome      42
## hsa03008 hsa03008 Ribosome biogenesis in eukaryotes      55
## hsa01230 hsa01230       Biosynthesis of amino acids      63
##          enrichmentScore       NES       pvalue    p.adjust    qvalues
## hsa04510      -0.4188582 -1.705358 0.0001396258 0.004927116 0.00369513
## hsa03030       0.7227680  2.345351 0.0002590003 0.004927116 0.00369513
## hsa05340       0.6229650  2.039151 0.0002593361 0.004927116 0.00369513
## hsa03050       0.7097403  2.439212 0.0002650411 0.004927116 0.00369513
## hsa03008       0.5737068  2.083286 0.0002690342 0.004927116 0.00369513
## hsa01230       0.5504052  2.063442 0.0002730003 0.004927116 0.00369513
##          rank                   leading_edge
## hsa04510 2183 tags=27%, list=17%, signal=23%
## hsa03030 1905 tags=64%, list=15%, signal=54%
## hsa05340 1452 tags=41%, list=12%, signal=36%
## hsa03050 2516 tags=67%, list=20%, signal=53%
## hsa03008 4001 tags=75%, list=32%, signal=51%
## hsa01230 1750 tags=40%, list=14%, signal=34%
##                                                                                                                                                                                                                                                         core_enrichment
## hsa04510 5228/7424/1499/4636/83660/7059/5295/1288/23396/3910/3371/3082/1291/394/3791/7450/596/3685/1280/3675/595/2318/3912/1793/1278/1277/1293/10398/55742/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/1287/3679/7060/3479/10451/80310/1311/1101
## hsa03030                                                                                                                                                    4174/4171/4175/4173/2237/5984/5111/10535/1763/5427/23649/4176/5982/5557/5558/4172/5424/5983/5425/54107/6119
## hsa05340                                                                                                                                                                                               29851/6890/3932/3561/3575/915/959/925/5788/916/100/7374/6891/930
## hsa03050                                                                                                                5688/5709/5698/5693/3458/5713/11047/5721/5691/5685/5690/5684/5686/5695/10213/23198/7979/5699/5714/5702/5708/5692/5704/5683/5694/5718/51371/5682
## hsa03008                     23560/54913/10799/55131/27341/5822/10436/10248/29107/3692/54433/29889/81691/5901/10775/10557/55505/1460/23160/6949/10940/4809/2091/10556/51096/55651/92856/55813/7514/55226/56000/25996/10885/55127/4931/51068/1459/10171/55916/55341/9790
## hsa01230                                                                                                                                    29968/26227/875/445/5214/440/65263/6472/7086/5723/3418/7167/586/2597/5230/2023/5223/5831/6888/50/5315/3419/10993/2805/22934

It supports all species that have KEGG annotation data available in the KEGG database. The full list of KEGG supported organisms can be accessed via http://www.genome.jp/kegg/catalog/org_list.html.

KEGG Orthology (KO) Database is also supported by specifying organism = "ko".

Disease, Reactome and MeSH

clusterProfiler also supports Disease Ontology, Reactome Pathway and Medical Subject Headings (MeSH) via its sub-packages, DOSE (Yu et al. 2015), ReactomePA (Yu and He 2016) and meshes (Yu 2018). Please refer to the packages’ online vignettes for more details.

Functional enrichment of genomic region

Next-generation sequencing is also widely applied for detecting variable and regulatory regions, e.g. single nucleotide polymorphisms, copy number variations and transcription factor binding sites.

ChIPseeker provides a function, seq2gene, for linking genomic regions to genes in a many-to-many mapping. It consider host gene (exon/intron), promoter region and flanking gene from intergenic region that may under control via cis-regulation. After changing genomic region into gene information, we can apply clusterProfiler to test whether the genomic regions of interest is associated with particular functions and pathways.

Please refer to ChIPseeker vignette for more details.

Visualization

The visualization methods were implemented on enrichplot package.

Dot plot

p <- dotplot(ego, showCategory = 10)
print(p)

library(ggplot2)
p <- p + scale_color_continuous(low='purple', high='green') 
print(p)

p+ aes(shape = GeneRatio > 0.1) + 
  scale_size(range = c(5, 12), breaks=c(12, 18, 24))

The rich factors indicate the ratio of the number of DEGs mapped to a certain pathway to the total number of genes mapped to this pathway. Greater rich factor means greater intensiveness.

N <- as.numeric(sub("\\d+/", "", ego[1, "BgRatio"]))
dotplot(ego, showCategory=15, 
        x = ~Count/(BgRatio *N)) + 
  ggplot2::xlab("Rich Factor")

## dotplot for GSEA result
dotplot(gk, showCategory=30) + 
  facet_grid(.~ifelse(NES < 0, 'suppressed', 'activated'))

Gene-Concept Network

In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, enrichplot provides cnetplot function to extract the complex association. The cnetplot depicts the linkages of genes and biological concepts (e.g. GO terms or KEGG pathways) as a network. GSEA result is also supported with only core enriched genes displayed.

cnetplot(sego, foldChange=geneList)

Heatmap-like functional classification

The heatplot is similar to cnetplot, while displaying the relationships as a heatmap. The gene-concept network may become too complicated if user want to show a large number significant terms. The heatplot can simplify the result and more easy to identify expression patterns.

heatplot(sego, foldChange=geneList)

Enrichment Map

Enrichment map organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets are tend to cluster together, making it easy to identify functional module.

The emapplot function supports results obtained from hypergeometric test and gene set enrichment analysis.

p <- emapplot(ego)
p

p + scale_color_continuous(low = "purple", high="darkgreen")

## devtools::install_github("karthik/wesanderson")
require(wesanderson)
pal <- wes_palette("Zissou1", 100, type = "continuous")
p + scale_color_gradientn(colors = pal)

ridgeline plot for expression distribution of GSEA result

The ridgeplot will visualize expression distributions of core enriched genes for GSEA enriched categories. It helps users to interpret up/down-regulated pathways.

ridgeplot(gk)

running score and preranked list of GSEA result

Running score and preranked list are traditional methods for visualizing GSEA result. The enrichplot package supports both of them to visualize the distribution of the gene set and the enrichment score.

gseaplot(gk, geneSetID = 1, title = gk$Description[1])

Another method to plot GSEA result is the gseaplot2 function:

gseaplot2(gk, geneSetID = 1, title = gk$Description[1])

The gseaplot2 also supports multile gene sets to be displayed on the same figure:

gseaplot2(gk, geneSetID = 1:3)

The gsearank function plot the ranked list of genes belong to the specific gene set.

gsearank(gk, 1, title = gk[1, "Description"])

Multiple gene sets can be aligned using cowplot:

library(ggplot2)
library(cowplot)

pp <- lapply(1:3, function(i) {
    anno <- gk[i, c("NES",  "p.adjust")]
    lab <- paste0(names(anno), "=",  round(anno, 3), collapse="\n")

    es <- gk[i, "enrichmentScore"]
    x <- ifelse(es < 0, 0, length(geneList) * .8)
    gsearank(gk, i, gk[i, 2]) + xlab(NULL) +ylab(NULL) +
        annotate("text", x,  es * .5, label = lab, 
                hjust=0, vjust=0, size = 4) + xlim(0, 12500)
})
plot_grid(plotlist=pp, ncol=1)

Biological theme comparison

clusterProfiler provides compareCluster function to automatically calculate enriched functional categories among multiple gene clusters that obtained from different experimental conditions.

data(gcSample)
sapply(gcSample, length)

##  X1  X2  X3  X4  X5  X6  X7  X8 
## 216 805 392 838 929 585 582 237

lapply(gcSample, head)

## $X1
## [1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"
## 
## $X2
## [1] "23450" "5160"  "7126"  "26118" "8452"  "3675" 
## 
## $X3
## [1] "894"   "7057"  "22906" "3339"  "10449" "6566" 
## 
## $X4
## [1] "5573"  "7453"  "5245"  "23450" "6500"  "4926" 
## 
## $X5
## [1] "5982" "7318" "6352" "2101" "8882" "7803"
## 
## $X6
## [1] "5337"  "9295"  "4035"  "811"   "23365" "4629" 
## 
## $X7
## [1] "2621" "2665" "5690" "3608" "3550" "533" 
## 
## $X8
## [1] "2665" "4735" "1327" "3192" "5573" "9528"

ck <- compareCluster(geneCluster = gcSample, fun="enrichKEGG")
head(ck)

##   Cluster       ID                             Description GeneRatio
## 1      X2 hsa04110                              Cell cycle    18/359
## 2      X2 hsa05169            Epstein-Barr virus infection    23/359
## 3      X2 hsa05340                Primary immunodeficiency     8/359
## 4      X2 hsa04064            NF-kappa B signaling pathway    13/359
## 5      X2 hsa05166 Human T-cell leukemia virus 1 infection    22/359
## 6      X3 hsa04512                ECM-receptor interaction     9/173
##    BgRatio       pvalue    p.adjust      qvalue
## 1 124/7485 2.267654e-05 0.006530844 0.006158471
## 2 201/7485 8.648461e-05 0.012453784 0.011743700
## 3  37/7485 2.938634e-04 0.028210882 0.026602366
## 4  95/7485 5.596671e-04 0.040296031 0.037998450
## 5 219/7485 7.769278e-04 0.044751044 0.042199449
## 6  82/7485 1.051835e-04 0.026401058 0.024579723
##                                                                                                      geneID
## 1                   991/1869/890/1871/701/990/10926/9088/8317/9700/9134/1029/2810/699/11200/23594/8555/4173
## 2 4067/3383/7128/1869/890/1871/578/864/637/9641/6891/355/9134/5971/916/956/6850/7187/3551/919/4734/958/6772
## 3                                                                       100/6891/3932/973/916/925/958/64421
## 4                                           4067/3383/7128/3932/5971/4050/6850/7187/3551/10892/5588/330/958
## 5  3383/991/1869/890/1871/113/701/9700/3932/9134/5971/916/4487/3600/1029/3551/8498/4488/11200/4776/4214/958
## 6                                                              7057/3339/1299/3695/1101/3679/3910/3696/3693
##   Count
## 1    18
## 2    23
## 3     8
## 4    13
## 5    22
## 6     9

dotplot(ck)

Formula interface of compareCluster

compareCluster also supports passing a formula (the code to support formula has been contributed by Giovanni Dall’Olio) of type \(Entrez \sim group\) or \(Entrez \sim group + othergroup\).

mydf <- data.frame(Entrez=names(geneList), FC=geneList)
mydf <- mydf[abs(mydf$FC) > 1,]
mydf$group <- "upregulated"
mydf$group[mydf$FC < 0] <- "downregulated"
mydf$othergroup <- "A"
mydf$othergroup[abs(mydf$FC) > 2] <- "B"

formula_res <- compareCluster(Entrez~group+othergroup, data=mydf, fun="enricher",
                             TERM2GENE=wpid2gene, TERM2NAME = wpid2name)

head(formula_res)

##           Cluster         group othergroup     ID
## 1 downregulated.A downregulated          A  WP474
## 2 downregulated.A downregulated          A WP3932
## 3 downregulated.A downregulated          A  WP306
## 4 downregulated.A downregulated          A WP2911
## 5 downregulated.B downregulated          B WP3942
## 6 downregulated.B downregulated          B WP2895
##                                       Description GeneRatio  BgRatio
## 1                       Endochondral Ossification    13/280  65/6249
## 2  Focal Adhesion-PI3K-Akt-mTOR-signaling pathway    29/280 309/6249
## 3                                  Focal Adhesion    21/280 202/6249
## 4     miRNA targets in ECM and membrane receptors     8/280  43/6249
## 5                          PPAR signaling pathway      5/38  67/6249
## 6 Differentiation of white and brown adipocyte         3/38  25/6249
##         pvalue    p.adjust      qvalue
## 1 4.480512e-06 0.001200777 0.001174366
## 2 1.081666e-04 0.014494322 0.014175515
## 3 2.504461e-04 0.022373181 0.021881077
## 4 5.418905e-04 0.036306661 0.035508086
## 5 4.650188e-05 0.003859656 0.003573302
## 6 4.351306e-04 0.018057919 0.016718175
##                                                                                                                                                   geneID
## 1                                                                                     11096/85477/3480/2690/8817/5167/2737/5744/2487/5327/1300/1473/3479
## 2 1281/50509/1290/55970/5618/7058/10161/56034/3693/4254/6720/3480/5159/3991/1289/1292/3908/2690/3909/8817/3551/2791/63923/3913/3667/3679/7060/3479/80310
## 3                                         55742/50509/1290/2317/7058/25759/56034/3693/3480/5159/857/1292/3908/3909/63923/3913/3679/7060/3479/10451/80310
## 4                                                                                                               1281/50509/1290/7058/3693/1289/1292/3913
## 5                                                                                                                               9370/5105/2167/3158/5346
## 6                                                                                                                                         9370/23090/652
##   Count
## 1    13
## 2    29
## 3    21
## 4     8
## 5     5
## 6     3

dotplot(formula_res, x=~group) + ggplot2::facet_grid(~othergroup)

Summary

Using ORA, we need to first classify genes as interest or not interest (e.g. DE or not DE), which is rather arbitrary, and the independence assumption between genes is unrealistic. However, the simplicity and general applicability of ORA is unmet by other methods. For example, GSEA requires the expression data as input, which is not available from other experimental types.

There is another strategy, topology-based method, to perform functional analysis. However, network for many species is missing or at least incomplete, restricting the application of this method.

All methods have their own benefits and limitations and no single method is best suited for all applications.

An increasing concern upon the quality of gene annotation has raised an alarm in biomedical research. A previous study (Wadi et al. 2016) reported that about 42% of the tools were outdated by more than five years and functional significance were severely underestimated with only 26% of biological processes or pathways were captured compare to using up-to-date annotation.

Doing pathway analysis, we need to bear in mind that applying multiple methods can be beneficial, and pay attention to the release date of the background annotation.

clusterProfiler provides two most widely used methods, ORA and GSEA, and up-to-date annotation. All species annotated in KEGG database can be analyzed and the data is queried online (most updated). OrgDb that released within Bioconductor was updated biannual, and can be queried also online via AnnotationHub. In addition, clusterProfiler is a general tool and user can provide their own annotation to analyze their data. For instance users can download the latest wikiPathway, MSigDb, GO, etc.

References

Boyle, Elizabeth I, Shuai Weng, Jeremy Gollub, Heng Jin, David Botstein, J Michael Cherry, and Gavin Sherlock. 2004. “GO::TermFinder–open Source Software for Accessing Gene Ontology Information and Finding Significantly Enriched Gene Ontology Terms Associated with a List of Genes.” Bioinformatics (Oxford, England) 20 (18): 3710–5. https://doi.org/10.1093/bioinformatics/bth456.

Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102 (43): 15545–50. https://doi.org/10.1073/pnas.0506580102.

Wadi, Lina, Mona Meyer, Joel Weiser, Lincoln D. Stein, and Jüri Reimand. 2016. “Impact of Outdated Gene Annotations on Pathway Enrichment Analysis.” Nature Methods 13 (9): 705–6. https://doi.org/10.1038/nmeth.3963.

Yu, Guangchuang. 2018. “Using Meshes for MeSH Term Enrichment and Semantic Analyses.” Bioinformatics 34 (21): 3766–7. https://doi.org/10.1093/bioinformatics/bty410.

Yu, Guangchuang, and Qing-Yu He. 2016. “ReactomePA: An R/Bioconductor Package for Reactome Pathway Analysis and Visualization.” Molecular BioSystems 12 (2): 477–79. https://doi.org/10.1039/C5MB00663E.

Yu, Guangchuang, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, and Shengqi Wang. 2010. “GOSemSim: An R Package for Measuring Semantic Similarity Among Go Terms and Gene Products.” Bioinformatics 26 (7): 976–78. https://doi.org/10.1093/bioinformatics/btq064.

Yu, Guangchuang, Le-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.

Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. https://doi.org/10.1093/bioinformatics/btu684.