QQ plot

虽然R提供了很多作图函数,但自己实现一下,是非常好的体验,而且能够让我们了解其中的细节。

最近在读<Modern Applied Statistics With S-PLUS>,115页讲到Q-Q图时,书中给出了一个Trellis的实现。(Trellis是S/S-PLUS的可视化系统,在R里的对等实现是lattice包)。

我们知道一组数字,可以算4分位数,分别是25%, 50%(中位数), 75%,它等于该组数字中所有数值由小到大排列后第X%的数字,事实上每个数字都可以对应一个X%,Q-Q图很简单,把样本数据和理论分布算出来的quantiles,画个散点图而已。分别用base graph和ggplot2实现,图中三个图分别由系统函数qqnorm,和这里定义的qqplot, qqplot2画出来。

Continue reading

上了几天的课,http://ibw2011.fmmu.edu.cn/schedule.htm 今天就上完了,只完成了project 1,想写gibbs sampling,但是没搞明白,汗。

这个纯属练习用,没啥实用价值。

Course Projects:

Project 1: Implementation of a simple gene finder

GOAL

Build a simple codon-usage based gene finder for finding genes in E.coli.

Procedure

Collect 100 gene sequences from the bacterium E. coli in the genbank (http://www.ncbi.nlm.nihh.gov). Compute the codon usage table based on these genes (and the translated protein sequences from them); Build a probabilistic model based on the codon usages; Implement a random sequence model in which the nucleotide frequency is computed from the 100 E. coli genes. For a given DNA sequence (and one selected reading frame), compare your model with a random sequence model; Results that you should submit:

Two FASTA files for the collected 100 genes and 100 translated protein sequences; The printed codon usage table; A program named ECgnfinder, running with the syntax as ECgnfinder –i inputfile

Inputfile stands for the name of input file, which should contain one DNA sequence in FASTA file format; the program should be able to report an error message if the input file is in the wrong format.

The output should be printed to the standard output as (xxx stands for the likelihood)

ORF1: xxx ORF2: xxx

Continue reading

泰勒公式学过微积分都应该知道,可以翻wiki复习一下,https://zh.wikipedia.org/wiki/泰勒公式.

用R简单实现一下:

 efv <- function(f, value, variable="x", a=0, eps=0.001) {
     #estimate function value using Taylor theorem
     assign(eval(variable), a)
     fv.old <- eval(f)
     k <- 1     
     repeat {
         df <- D(f, variable)
         if (df == 0)
             break
         fv.new <- fv.old + eval(df)*(value-a)^k/factorial(k)
         if (fv.new - fv.old < eps)
             break
         fv.old <- fv.new
         f <- df
         k <- k+1
     }
     return (fv.new)
 }

Continue reading

创建R包

> source("code.R") #载入写好的代码 
> package.skeleton(name="pkgname", list=c("function_name_list"))
# 生成R源码包的目录结构 

到man目录下改*.Rd文档。latex格式。这是包和函数的帮助文档。

如果需要vignette文档的话。在包目录下,新建inst/doc,在里面写pkgname.Rnw文档。基本上是latex格式,不过允许你插入R代码。make的时候,会先跑代码。再自动转换成latex,再编译成pdf。

$ R CMD check pkgname 
# 检验代码和文档。这个很重要。通常一些小问题都能在这一步发现。 
$ R CMD build pkgname 
# 打包,源码包格式 
$ R CMD build --binary pkgname
#编译后打包,二进制格式。

Continue reading

Author's picture

Guangchuang Yu

a senior-in-age-but-not-senior-in-knowledge bioinformatician

Postdoc researcher

Hong Kong