Evolution of my BioC packages

July 13, 2016 in R

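I found a video on YouTube called Evolution of clusterProfiler, made by Landon Wilkins with Gource. So I gave it a try myself, to look back at how I have been writing code over the past few years.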

[Bioc 33] NEWS of my BioC packages

May 5, 2016 in R

Today is my birthday, and it happens to be the release day of Bioconductor 3.3. It’s again time to reflect on what I’ve done in the past year.

convert biological ID with KEGG API using clusterProfiler

May 3, 2016 in R

bitr_kegg

clusterProfiler can convert biological IDs using an OrgDb object via the bitr function. I have now implemented another function, bitr_kegg, for converting IDs through the KEGG API.

library(clusterProfiler)
data(gcSample)
hg <- gcSample[[1]]
head(hg)

## [1] "4597"  "7111"  "5266"  "2175"  "755"   "23046"

eg2np <- bitr_kegg(hg, fromType='kegg', toType='ncbi-proteinid', organism='hsa')

## Warning in bitr_kegg(hg, fromType = "kegg", toType = "ncbi-proteinid",
## organism = "hsa"): 3.7% of input gene IDs are fail to map...

head(eg2np)

##     kegg ncbi-proteinid
## 1   8326      NP_003499
## 2  58487   NP_001034707
## 3 139081      NP_619647
## 4  59272      NP_068576
## 5    993      NP_001780
## 6   2676      NP_001487

np2up <- bitr_kegg(eg2np[,2], fromType='ncbi-proteinid', toType='uniprot', organism='hsa')

head(np2up)

##   ncbi-proteinid uniprot
## 1      NP_005457  O75586
## 2      NP_005792  P41567
## 3      NP_005792  Q6IAV3
## 4      NP_037536  Q13421
## 5      NP_006054  O60662
## 6   NP_001092002  O95398

The ID type (both fromType and toType) should be one of ‘kegg’, ‘ncbi-geneid’, ‘ncbi-proteinid’ or ‘uniprot’. ‘kegg’ is the primary ID used in the KEGG database, whose data come from NCBI. As a rule of thumb, the ‘kegg’ ID is the Entrez Gene ID for eukaryotic species and the locus ID for prokaryotes.
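A conversion can drop some IDs (as the warning above reports) or return several targets for one input (NP_005792 appears twice in the head(np2up) output). The following is a minimal base-R sketch of how both cases might be checked. The mapping table is copied from the six rows printed above; the input vector and the ID "NP_HYPOTHETICAL" are made up for illustration and do not come from KEGG.

```r
# Toy mapping table copied from the head(np2up) output shown above.
np2up_toy <- data.frame(
  `ncbi-proteinid` = c("NP_005457", "NP_005792", "NP_005792",
                       "NP_037536", "NP_006054", "NP_001092002"),
  uniprot = c("O75586", "P41567", "Q6IAV3", "Q13421", "O60662", "O95398"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical input: the mapped IDs plus one made-up ID with no match.
input_ids <- c(unique(np2up_toy[["ncbi-proteinid"]]), "NP_HYPOTHETICAL")

# Input IDs that are absent from the mapping table, and their fraction.
unmapped <- setdiff(input_ids, np2up_toy[["ncbi-proteinid"]])
unmapped_rate <- length(unmapped) / length(input_ids)

# Input IDs that map to more than one target ID.
hits <- table(np2up_toy[["ncbi-proteinid"]])
multi <- names(hits)[hits > 1]

print(unmapped)
print(multi)
```

With a real result, np2up_toy would be replaced by the data frame returned by bitr_kegg and input_ids by the vector passed to it.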

KEGG Module Enrichment Analysis

April 13, 2016 in R

KEGG MODULE is a collection of manually defined functional units, called KEGG modules and identified by the M numbers, used for annotation and biological interpretation of sequenced genomes. There are four types of KEGG modules:

pathway modules – representing tight functional units in KEGG metabolic pathway maps, such as M00002 (Glycolysis, core module involving three-carbon compounds)

structural complexes – often forming molecular machineries, such as M00072 (Oligosaccharyltransferase)

functional sets – for other types of essential sets, such as M00360 (Aminoacyl-tRNA synthases, prokaryotes)

signature modules – as markers of phenotypes, such as M00363 (EHEC pathogenicity signature, Shiga toxin)

GO analysis using clusterProfiler

January 4, 2016 in R

clusterProfiler supports over-representation tests and gene set enrichment analysis of Gene Ontology. It supports GO annotation from an OrgDb object, a GMT file, or the user’s own data.
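Over-representation tests of this kind are commonly based on the hypergeometric distribution. The sketch below shows the calculation for a single term in base R; all four counts are hypothetical, and the code is an illustration of the idea rather than clusterProfiler's own implementation.

```r
# All counts below are hypothetical.
N <- 10000  # background genes that carry any annotation
M <- 200    # background genes annotated to the term of interest
n <- 150    # genes in the list being tested
k <- 12     # genes in the list that are annotated to the term

# P(X >= k) when drawing n genes from a background with M annotated genes.
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on a 2x2 table.
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(p_value)
```

In a full analysis this is repeated for every term, and the resulting p-values are then adjusted for multiple testing.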

support many species

In the GitHub version of clusterProfiler, the enrichGO and gseGO functions have dropped the organism parameter and gained an OrgDb parameter, so that any species with an OrgDb object available can be analyzed in clusterProfiler. Bioconductor already provides OrgDb packages for about 20 species (see http://bioconductor.org/packages/release/BiocViews.html#___OrgDb), and users can build an OrgDb via AnnotationHub.
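As a hedged illustration of the OrgDb parameter described above, the sketch below runs enrichGO on the human annotation package org.Hs.eg.db. It assumes that clusterProfiler, DOSE and org.Hs.eg.db are installed and skips the analysis otherwise; the choice of ont = "CC" and the fold-change threshold of 2 are arbitrary choices for the example.

```r
# Run the example only when the required Bioconductor packages are available.
run_example <- requireNamespace("clusterProfiler", quietly = TRUE) &&
  requireNamespace("DOSE", quietly = TRUE) &&
  requireNamespace("org.Hs.eg.db", quietly = TRUE)

ego <- NULL
if (run_example) {
  # Example gene list shipped with DOSE; its names are Entrez Gene IDs.
  data(geneList, package = "DOSE")
  de <- names(geneList)[abs(geneList) > 2]

  # Pass the annotation as an OrgDb object instead of an organism name.
  ego <- clusterProfiler::enrichGO(gene  = de,
                                   OrgDb = org.Hs.eg.db::org.Hs.eg.db,
                                   ont   = "CC")
  print(head(as.data.frame(ego)))
}
```

For another species, the only change should be the OrgDb object that is passed in, provided the gene IDs match a key type that object supports.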

Comparison of clusterProfiler and GSEA-P

November 2, 2015 in R

Thanks to @mevers for raising the issue with me and for his efforts in benchmarking clusterProfiler.

He pointed out two issues:

the outputs of gseGO and GSEA-P overlap poorly.
p-values from gseGO are generally smaller and don’t show much variation.

For GSEA analysis, we have two inputs, a ranked gene list and gene set collections.

First of all, the gene set collections are very different. The GMT file used in his test is c5.cc.v5.0.symbols.gmt, which is a tiny subset of GO CC, while clusterProfiler used the whole GO CC corpus.
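When two tools test different gene set collections, one way to make their results comparable is to restrict the comparison to the sets that both tools tested. The base-R sketch below illustrates this with two hypothetical result tables; the set IDs and p-values are made up and do not come from gseGO or GSEA-P.

```r
# Hypothetical result tables, one per tool; IDs and p-values are made up.
res_a <- data.frame(
  ID     = c("set_01", "set_02", "set_03", "set_04", "set_05"),
  pvalue = c(0.001, 0.02, 0.2, 0.04, 0.6),
  stringsAsFactors = FALSE
)
res_b <- data.frame(
  ID     = c("set_02", "set_04", "set_06"),
  pvalue = c(0.03, 0.01, 0.5),
  stringsAsFactors = FALSE
)

# Gene sets that were tested by both tools.
shared <- intersect(res_a$ID, res_b$ID)

# Fraction of the larger collection that the smaller one covers.
coverage <- length(shared) / nrow(res_a)

# Side-by-side p-values, restricted to the shared gene sets.
merged <- merge(res_a, res_b, by = "ID", suffixes = c("_a", "_b"))
print(merged)
```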

use simplify to remove redundancy of enriched GO terms

October 21, 2015 in Visualization, R

To simplify enriched GO results, we can use a slim version of GO and analyze it with the enricher function.

Another strategy is to use GOSemSim to calculate the similarity of GO terms and remove highly similar terms, keeping one representative term. To make this feature available to clusterProfiler users, I developed a simplify method to reduce redundant GO terms in the output of the enrichGO function.

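library(clusterProfiler)
# geneList is a named numeric vector shipped with DOSE (names are gene IDs)
data(geneList, package="DOSE")
# keep genes whose absolute value exceeds 2
de <- names(geneList)[abs(geneList) > 2]
# over-representation test against GO Biological Process
bp <- enrichGO(de, ont="BP")
# enrichment map of the result before any redundancy removal
enrichMap(bp)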
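The idea of keeping one representative among highly similar terms can be sketched in base R as a greedy filter. This is an illustration of the strategy described above under stated assumptions, not the code of the simplify method: the term IDs, adjusted p-values, similarity matrix and the cutoff of 0.7 are all hypothetical.

```r
# Hypothetical enriched terms with adjusted p-values.
terms <- data.frame(
  ID       = c("term_A", "term_B", "term_C", "term_D"),
  p.adjust = c(0.001, 0.004, 0.02, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical pairwise semantic similarity between the terms.
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,
                0.9, 1.0, 0.3, 0.2,
                0.2, 0.3, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(terms$ID, terms$ID))

cutoff <- 0.7  # terms more similar than this are treated as redundant

# Visit terms from most to least significant; keep a term only if it is
# not too similar to any term that has already been kept.
keep <- character(0)
for (id in terms$ID[order(terms$p.adjust)]) {
  if (all(sim[id, keep] <= cutoff)) {
    keep <- c(keep, id)
  }
}

reduced <- terms[terms$ID %in% keep, ]
print(reduced)
```

With clusterProfiler itself, the corresponding call would be along the lines of simplify(bp, cutoff=0.7, by="p.adjust", select_fun=min), after which the reduced result can be plotted in the same way as bp.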


Guangchuang Yu