Today is my birthday, and it happens to be the release day of Bioconductor 3.3. It is again time to reflect on what I've done in the past year.
Several parameters, including ignoreDownstream, were added to annotatePeak at the request of @crazyhottommy, who uses ChIPseeker to annotate breakpoints from whole genome sequencing data.
An overlap parameter was also introduced. By default, overlap="TSS", and only a gene whose TSS overlaps the peak is reported as the nearest gene. If overlap="all", any gene that overlaps the peak is reported as the nearest gene, whether or not the overlap is at the TSS region.
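The difference between the two settings can be sketched in a few lines of Python. This is only an illustration of the rule described above, not ChIPseeker's code: the `Gene` type, the `nearest_gene` function and the toy coordinates are hypothetical, and falling back to the closest TSS when no gene body overlaps is an assumption.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate of the gene
    end: int     # rightmost coordinate of the gene
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        # The TSS is the 5' end: start on "+", end on "-".
        return self.start if self.strand == "+" else self.end


def tss_distance(gene: Gene, peak_start: int, peak_end: int) -> int:
    """Distance from the peak to the gene's TSS (0 if the TSS lies in the peak)."""
    if peak_start <= gene.tss <= peak_end:
        return 0
    return min(abs(gene.tss - peak_start), abs(gene.tss - peak_end))


def nearest_gene(peak_start: int, peak_end: int, genes: List[Gene],
                 overlap: str = "TSS") -> str:
    if overlap not in ("TSS", "all"):
        raise ValueError("overlap must be 'TSS' or 'all'")
    candidates = genes
    if overlap == "all":
        # Any gene whose body overlaps the peak wins over non-overlapping genes.
        hits = [g for g in genes if g.start <= peak_end and peak_start <= g.end]
        if hits:
            candidates = hits
    return min(candidates, key=lambda g: tss_distance(g, peak_start, peak_end)).name


# Toy data: the peak sits inside geneA's body but is closer to geneB's TSS.
genes = [Gene("geneA", 1000, 5000, "+"), Gene("geneB", 5200, 8000, "+")]
by_tss = nearest_gene(4800, 4900, genes, overlap="TSS")
by_all = nearest_gene(4800, 4900, genes, overlap="all")
```

With these toy coordinates the two settings disagree, which is exactly the case the parameter is meant to control.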
annotatePeak also supports annotating data with the user's own customized regions by passing
The getPromoters() function prepares a GRanges object of promoter regions from user-specified upstream and downstream distances from the Transcription Start Site (TSS). We can then align the peaks that map to these regions and visualize the profile or heatmap of ChIP binding to the TSS regions.
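As a rough illustration of what "upstream and downstream of the TSS" means on each strand, here is a small Python sketch. It is not the getPromoters() implementation; the function name `promoter_region`, its default distances and the clipping at coordinate 1 are assumptions made for the example.

```python
from typing import Tuple


def promoter_region(tss: int, strand: str,
                    upstream: int = 3000, downstream: int = 3000) -> Tuple[int, int]:
    """Return (start, end) of a strand-aware window around a TSS.

    On the "+" strand, upstream means smaller coordinates; on the "-"
    strand, upstream means larger coordinates.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Assumption: clip at the first base so the window never goes below 1.
    return max(start, 1), end


plus_window = promoter_region(10000, "+", upstream=2000, downstream=500)
minus_window = promoter_region(10000, "-", upstream=2000, downstream=500)
```

The same TSS gives mirrored windows on the two strands, which is why strand has to be part of the calculation.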
Users (1 and 2) are interested in the intensity of peaks binding to the start of introns/exons, and ChIPseeker provides a new function, getBioRegion, to output a GRanges object of intron/exon start regions.
GEO data mining
ChIPseeker incorporates the GEO database and supports data mining to infer cooperative regulation. The data was updated, and ChIPseeker now contains information on 19348 BED files.
We compared clusterProfiler with GSEA-P (released by the Broad Institute); the p-values calculated by the two programs are almost identical.
For comparing biological themes, clusterProfiler supports formulas to express complex conditions, and faceting is supported to visualize complex results.
The read.gmt function parses the GMT file format from the Molecular Signatures Database, so that gene set collections in this database can be used in clusterProfiler for both the hypergeometric test and GSEA.
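For readers unfamiliar with GMT: each line is tab-separated, with the gene set name first, a description second, and the member genes after that. The Python sketch below parses that layout. It only illustrates the file format and is not how read.gmt is written; the function name `read_gmt` and the two toy gene sets are made up for the example.

```python
import io
from typing import Dict, Iterable, List


def read_gmt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse GMT text: name <TAB> description <TAB> gene1 <TAB> gene2 ..."""
    gene_sets: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]  # drop empty trailing fields
    return gene_sets


# Toy input standing in for a downloaded .gmt file.
example = io.StringIO("SET_A\tfirst toy set\tTP53\tBRCA1\nSET_B\tna\tEGFR\n")
gene_sets = read_gmt(example)
```

The resulting name-to-genes mapping is the shape of data that both an over-representation test and GSEA need as input.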
KEGG Module is now supported just like KEGG Pathway; clusterProfiler queries the online annotation data, which keeps the annotation always up to date.
The KEGG database is updated quite frequently. KEGG.db, which has not been updated since 2012, contains annotation of 5894 human genes. In Feb. 2015, when clusterProfiler first supported querying online KEGG data, KEGG contained annotation of 6861 human genes, and today it has 7018 human genes annotated. Most tools and web servers use outdated data (e.g. DAVID has not been updated since 2010 and has 5085 human genes annotated by KEGG), and the results of an analysis may change completely with recently updated data. clusterProfiler is more reliable because it always uses the latest data.
In addition to the bitr function, which translates biological IDs using an OrgDb object, we provide bitr_kegg, which uses the KEGG API to translate biological IDs. It supports more than 4000 species (searchable via the search_kegg_species function), as in the KEGG Pathway and Module analyses.
The function calls of enrichGO and gseGO were changed. Now, in addition to species that have an OrgDb available in Bioconductor, any species with an OrgDb can be analyzed; the OrgDb can be queried online via AnnotationHub or built from the user's own data. With this update, gseGO can accept any gene ID type, provided the ID type is supported in the
GO enrichment analysis always outputs redundant terms, so we implemented a simplify function to remove redundant terms by calculating GO semantic similarity using GOSemSim. Several useful utilities, including gsfilter, are also provided.
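One simple way to remove redundancy from a similarity matrix is a greedy pass: visit terms from most to least significant and keep a term only if it is not too similar to any term already kept. The Python sketch below shows that idea. It is an assumption about the general approach, not a description of how simplify is implemented; the function name, the 0.7 cutoff and the toy scores are all made up.

```python
from typing import Dict, FrozenSet, List


def remove_redundant(pvalues: Dict[str, float],
                     similarity: Dict[FrozenSet[str], float],
                     cutoff: float = 0.7) -> List[str]:
    """Greedily keep the most significant term of each group of similar terms."""
    kept: List[str] = []
    for term in sorted(pvalues, key=pvalues.get):  # smallest p-value first
        # Missing pairs are treated as unrelated (similarity 0).
        too_similar = any(
            similarity.get(frozenset((term, other)), 0.0) > cutoff
            for other in kept
        )
        if not too_similar:
            kept.append(term)
    return kept


# Toy scores: term_a and term_b are near-duplicates, term_c is unrelated.
pvalues = {"term_a": 0.001, "term_b": 0.002, "term_c": 0.01}
similarity = {
    frozenset(("term_a", "term_b")): 0.9,
    frozenset(("term_a", "term_c")): 0.2,
    frozenset(("term_b", "term_c")): 0.3,
}
representatives = remove_redundant(pvalues, similarity)
```

In practice the similarity scores would come from a semantic similarity measure such as those in GOSemSim rather than from a hand-written table.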
I bumped the version to 3.0.0 for the following three reasons:
- changes to the function calls
- it can analyze any ontology/pathway annotation (supports user-customized annotation data)
- it can analyze all species that have annotation available (e.g. more than 4000 species for KEGG)
Although the package was very simple when I published it, I have kept updating it and adding new features from my own ideas or users' requests. The package is now in good shape. Here is the summary.
This package implements methods to analyze and visualize functional profiles of genomic coordinates (supported by ChIPseeker), gene and gene clusters.
clusterProfiler supports both hypergeometric test and Gene Set Enrichment Analysis for many ontologies/pathways, including:
- Disease Ontology (via DOSE)
- Network of Cancer Gene (via DOSE)
- Gene Ontology (supports many species, with GO annotation queried online via AnnotationHub)
- KEGG Pathway and Module with latest online data (supports more than 4000 species listed in http://www.genome.jp/kegg/catalog/org_list.html)
- Reactome Pathway (via ReactomePA)
- DAVID (via RDAVIDWebService)
- Molecular Signatures Database
  - hallmark gene sets
  - positional gene sets
  - curated gene sets
  - motif gene sets
  - computational gene sets
  - GO gene sets
  - oncogenic signatures
  - immunologic signatures
- Other Annotations
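For any of these annotations, the hypergeometric (over-representation) test asks how likely it is to see at least k annotated genes in a list of n genes, when M of the N background genes carry the annotation. A minimal Python sketch of that upper-tail probability follows. The symbol names and the function `hypergeom_pvalue` are my own for this example, and it is the textbook formula rather than clusterProfiler's code.

```python
from math import comb


def hypergeom_pvalue(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) when drawing n genes from N, of which M are annotated.

    k: annotated genes observed in the gene list
    M: annotated genes in the background
    n: size of the gene list
    N: size of the background
    """
    upper = min(M, n)
    # comb() returns 0 when asked to choose more items than are available,
    # so impossible outcomes contribute nothing to the sum.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers: all 3 selected genes fall in a 3-gene set, background of 10.
p_small = hypergeom_pvalue(k=3, M=3, n=3, N=10)
```

The exact integer arithmetic is fine for small examples; for genome-scale backgrounds a statistics library would be the more practical choice.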
clusterProfiler also provides several visualization methods to help interpret enrichment results, including:
- plotGOgraph (via topGO package)
- upsetplot (via UpSetR package)
and several useful utilities:
- bitr (Biological Id TranslatoR)
- bitr_kegg (bitr using KEGG source)
- compareCluster (biological theme comparison)
- dropGO (screen out GO term of specific level or specific term)
- go2ont (convert GO ID to Ontology)
- go2term (convert GO ID to descriptive term)
- gofilter (restrict result at specific GO level)
- gsfilter (restrict result by gene set size)
- search_kegg_organism (search kegg supported organism)
- setReadable (convert IDs stored in enrichResult object to gene symbol)
- simplify (remove redundant GO terms, supported via GOSemSim)
DOSE now tests the bimodal distribution separately in GSEA, and the output p-values are [more conservative](http://guangchuangyu.github.io/2015/11/comparison-of-clusterprofiler-and-gsea-p/).
A maxGSSize parameter was added, with a default value of 500. Usually, if a gene set contains more than 500 genes, its probability of being called significant by GSEA rises quite dramatically.
A gsfilter function was added for restricting enriched results by minimal and maximal gene set sizes.
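The size restriction itself is simple to picture. The Python sketch below keeps only gene sets whose size falls within a range. It illustrates the idea only: the function name `filter_by_size` and the lower bound of 10 are assumptions, while the upper bound of 500 mirrors the maxGSSize default mentioned above.

```python
from typing import Dict, List


def filter_by_size(gene_sets: Dict[str, List[str]],
                   min_size: int = 10, max_size: int = 500) -> Dict[str, List[str]]:
    """Keep gene sets whose number of distinct genes is within [min_size, max_size]."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(set(genes)) <= max_size
    }


# Toy gene sets with synthetic gene names.
toy_sets = {
    "tiny": ["g1", "g2"],
    "medium": [f"g{i}" for i in range(50)],
    "huge": [f"g{i}" for i in range(800)],
}
kept_sets = filter_by_size(toy_sets)
```

Very small sets carry little statistical power and very large ones tend to be significant almost by default, which is the motivation for bounding both ends.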
upsetplot was implemented to visualize overlap of enriched gene sets.
The dot sizes in enrichMap are now scaled by category sizes.
All these changes also affect
I put more effort into extending ggtree than into all the other packages combined. The major new features are listed here; small improvements and bug fixes can be found in the NEWS file.
- supports NHX file format via
- supports phylip tree format via
- raxml2nwk for converting raxml bootstrap tree to newick text
- all parser functions support passing textConnection(text_string) as a file
- supports annotating tree with ancestral sequences inferred by
obkData object defined by
phyloseq object defined by
- geom_point2, geom_text2, geom_segment2 and geom_label2 to support subsetting
- geom_treescale for adding scale of branch length
- geom_cladelabel for labeling selected clade
- geom_tiplab2 for adding tiplab of circular tree
- geom_taxalink for connecting related taxa
- geom_range for adding range to present uncertainty of branch lengths
- subview and inset now support annotating with image files
- rescale_tree function to rescale branch lengths using numerical variable
- MRCA for finding Most Recent Common Ancestor among a vector of tips
- viewClade to zoom in on a selected clade
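To make the MRCA item in the list above concrete: given a rooted tree stored as a child-to-parent mapping, the most recent common ancestor of a set of tips is the first node shared by all of their paths to the root. The Python sketch below shows this on a toy tree. It is not ggtree's implementation, and the node names and both helper functions are hypothetical.

```python
from typing import Dict, List


def path_to_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Return the node followed by all of its ancestors, ending at the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(tips: List[str], parent: Dict[str, str]) -> str:
    """Most recent common ancestor of the given tips in a rooted tree."""
    if not tips:
        raise ValueError("at least one tip is required")
    # Start with the first tip's path (ordered from tip to root) and keep
    # only the nodes that also appear on every other tip's path.
    common = path_to_root(tips[0], parent)
    for tip in tips[1:]:
        seen = set(path_to_root(tip, parent))
        common = [node for node in common if node in seen]
    return common[0]  # the deepest shared node


# Toy tree: root -> (n1 -> A, B) and (n2 -> C, D)
parent = {"A": "n1", "B": "n1", "C": "n2", "D": "n2", "n1": "root", "n2": "root"}
sister_mrca = mrca(["A", "B"], parent)
distant_mrca = mrca(["A", "C", "D"], parent)
```

Once the MRCA node is known, it can serve as the anchor for clade-level operations such as labeling or zooming in on that clade.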
I split the long vignette into several smaller ones and added more examples.
- Tree Data Import
- Tree Visualization
- Tree Annotation
- Tree Manipulation
- Advance Tree Annotation
Here is the NEWS record:
CHANGES IN VERSION 1.3.16
------------------------
 o geom_treescale() supports family argument <2016-04-27, Wed>
   + https://github.com/GuangchuangYu/ggtree/issues/56
 o update fortify.phylo to work with phylo that has missing value of edge length <2016-04-21, Thu>
   + https://github.com/GuangchuangYu/ggtree/issues/54
 o support passing textConnection(text_string) as a file <2016-04-21, Thu>
   + contributed by Casey Dunn <firstname.lastname@example.org>
   + https://github.com/GuangchuangYu/ggtree/pull/55#issuecomment-212859693

CHANGES IN VERSION 1.3.15
------------------------
 o geom_tiplab2 supports parameter hjust <2016-04-18, Mon>
 o geom_tiplab and geom_tiplab2 support using geom_label2 by passing geom="label" <2016-04-07, Thu>
 o geom_label2 that support subsetting <2016-04-07, Thu>
 o geom_tiplab2 for adding tip label of circular layout <2016-04-06, Wed>
 o use plot$plot_env to access ggplot2 parameter <2016-04-06, Wed>
 o geom_taxalink for connecting related taxa <2016-04-01, Fri>
 o geom_range for adding range of HPD to present uncertainty of evolutionary inference <2016-04-01, Fri>

CHANGES IN VERSION 1.3.14
------------------------
 o geom_tiplab works with NA values, compatible with collapse <2016-03-05, Sat>
 o update theme_tree2 due to the issue of https://github.com/hadley/ggplot2/issues/1567 <2016-03-05, Sat>
 o offset works in `align=FFALSE` with `annotation_image` function <2016-02-23, Tue>
   + see https://github.com/GuangchuangYu/ggtree/issues/46
 o subview and inset now supports annotating with img files <2016-02-23, Tue>

CHANGES IN VERSION 1.3.13
------------------------
 o add example of rescale_tree function in treeAnnotation.Rmd <2016-02-07, Sun>
 o geom_cladelabel works with collapse <2016-02-07, Sun>
   + see https://github.com/GuangchuangYu/ggtree/issues/38

CHANGES IN VERSION 1.3.12
------------------------
 o exchange function name of geom_tree and geom_tree2 <2016-01-25, Mon>
 o solved issues of geom_tree2 <2016-01-25, Mon>
   + https://github.com/hadley/ggplot2/issues/1512
 o colnames_level parameter in gheatmap <2016-01-25, Mon>
 o raxml2nwk function for converting raxml bootstrap tree to newick format <2016-01-25, Mon>

CHANGES IN VERSION 1.3.11
------------------------
 o solved issues of geom_tree2 <2016-01-25, Mon>
   + https://github.com/GuangchuangYu/ggtree/issues/36
 o change compute_group() to compute_panel in geom_tree2() <2016-01-21, Thu>
   + fixed issue, https://github.com/GuangchuangYu/ggtree/issues/36
 o support phyloseq object <2016-01-21, Thu>
 o update geom_point2, geom_text2 and geom_segment2 to support setup_tree_data <2016-01-21, Thu>
 o implement geom_tree2 layer that support duplicated node records via the setup_tree_data function <2016-01-21, Thu>
 o rescale_tree function for rescaling branch length of tree object <2016-01-20, Wed>
 o upgrade set_branch_length, now branch can be rescaled using feature in extraInfo slot <2016-01-20, Wed>

CHANGES IN VERSION 1.3.10
------------------------
 o remove dependency of gridExtra by implementing multiplot function instead of using grid.arrange <2016-01-20, Wed>
 o remove dependency of colorspace <2016-01-20, Wed>
 o support phylip tree format and update vignette of phylip example <2016-01-15, Fri>

CHANGES IN VERSION 1.3.9
------------------------
 o optimize getYcoord <2016-01-14, Thu>
 o add 'multiPhylo' example in 'Tree Visualization' vignette <2016-01-13, Wed>
 o viewClade, scaleClade, collapse, expand, rotate, flip, get_taxa_name and scale_x_ggtree accepts input tree_view=NULL. these function will access the last plot if tree_view=NULL. <2016-01-13, Wed>
   + > ggtree(rtree(30)); viewClade(node=35) works. no need to pipe.

CHANGES IN VERSION 1.3.8
------------------------
 o add example of viewClade in 'Tree Manipulation' vignette <2016-01-13, Wed>
 o add viewClade function <2016-01-12, Tue>
 o support obkData object defined by OutbreakTools <2016-01-12, Tue>
 o update vignettes <2016-01-07, Thu>
 o 05 advance tree annotation vignette <2016-01-04, Mon>
 o export theme_inset <2016-01-04, Mon>
 o inset, nodebar, nodepie functions <2015-12-31, Thu>

CHANGES IN VERSION 1.3.7
------------------------
 o split the long vignette to several vignettes
   + 00 ggtree <2015-12-29, Tue>
   + 01 tree data import <2015-12-28, Mon>
   + 02 tree visualization <2015-12-28, Mon>
   + 03 tree manipulation <2015-12-28, Mon>
   + 04 tree annotation <2015-12-29, Tue>

CHANGES IN VERSION 1.3.6
------------------------
 o MRCA function for finding Most Recent Common Ancestor among a vector of tips <2015-12-22, Tue>
 o geom_cladelabel: add bar and label to annotate a clade <2015-12-21, Mon>
   - remove annotation_clade and annotation_clade2 functions.
 o geom_treescale: tree scale layer. (add_legend was removed) <2015-12-21, Mon>

CHANGES IN VERSION 1.3.5
------------------------
 o bug fixed, read.nhx now works with scientific notation <2015-11-30, Mon>
   + see https://github.com/GuangchuangYu/ggtree/issues/30

CHANGES IN VERSION 1.3.4
------------------------
 o rename beast feature when name conflict with reserve keywords (label, branch, etc) <2015-11-27, Fri>
 o get_clade_position function <2015-11-26, Thu>
   + https://github.com/GuangchuangYu/ggtree/issues/28
 o get_heatmap_column_position function <2015-11-25, Wed>
   + see https://github.com/GuangchuangYu/ggtree/issues/26
 o support NHX (New Hampshire X) format via read.nhx function <2015-11-17, Tue>
 o bug fixed in extract.treeinfo.jplace <2015-11-17, Thu>

CHANGES IN VERSION 1.3.3
------------------------
 o support color=NULL in gheatmap, then no colored line will draw within the heatmap <2015-10-30, Fri>
 o add `angle` for also rectangular, so that it will be available for layout='rectangular' following by coord_polar() <2015-10-27, Tue>

CHANGES IN VERSION 1.3.2
------------------------
 o update vignette, add example of ape bootstrap and phangorn ancestral sequences <2015-10-26, Mon>
 o add support of ape bootstrap analysis <2015-10-26, Mon>
   see https://github.com/GuangchuangYu/ggtree/issues/20
 o add support of ancestral sequences inferred by phangorn <2015-10-26, Mon>
   see https://github.com/GuangchuangYu/ggtree/issues/21

CHANGES IN VERSION 1.3.1
------------------------
 o change angle to angle + 90, so that label will in radial direction <2015-10-22, Thu>
   + see https://github.com/GuangchuangYu/ggtree/issues/17
 o na.rm should be always passed to layer(), fixed it in geom_hilight and geom_text2 <2015-10-21, Wed>
   + see https://github.com/hadley/ggplot2/issues/1380
 o matching beast stats with tree using internal node number instead of label <2015-10-20, Tue>
Updated the IC data using the updated OrgDb packages.
We published ReactomePA in Molecular BioSystems.
Yu G, Wang LG and He QY*. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383.
Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
G Yu, LG Wang, GR Yan, QY He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609.
G Yu, DK Smith, H Zhu, Y Guan, TTY Lam*. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution.
Yu G†, Li F†, Qin Y, Bo X*, Wu Y and Wang S*. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010, 26(7):976-978.
G Yu, QY He*. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479.