Ewan Birney最近的一篇博文(Five statistical things I wished I had been taught 20 years ago )讲述了统计对于生物学的重要性。

一开始从RA Fisher讲起,说生物压根就是统计。Fisher是个农业学家,他所建立的那些统计方法,都是从生物学问题出发。

Ewan所谈及的五个方面分别是:

1. Non parametric statistics. These are statistical tests which make a bare minimum of assumptions of underlying distributions; in biology we are rarely confident that we know the underlying distribution, and hand waving about central limit theorem can only get you so far. Wherever possible you should use a non parameteric test. This is Mann-Whitney (or Wilcoxon if you prefer) for testing “medians” (Medians is in quotes because this is not quite true. They test something which is closely related to the median) of two distributions, Spearman’s Rho (rather pearson’s r2) for correlation, and the Kruskal test rather than ANOVAs (though if I get this right, you can’t in Kruskal do the more sophisticated nested models you can do with ANOVA). Finally, don’t forget the rather wonderful Kolmogorov-Smirnov (I always think it sounds like really good vodka) test of whether two sets of observations come from the same distribution. All of these methods have a basic theme of doing things on the rank of items in a distribution, not the actual level. So - if in doubt, do things on the rank of metric, rather than the metric itself.

Continue reading

[bootstrap](http://en.wikipedia.org/wiki/Bootstrapping_(statistics))是对观测数据集进行有放回(replacement)的随机抽样,以评估总体的各项统计指标。可以用于假设检验、参数估计。好处是并不要求大样本,也不要求正态数据,并且对于不同的统计指标使用的是同样的计算方法。结果也更为可靠,坏处是计算量大。

统计推断(statistical inference)是基于样本统计值的抽样分布来计算的,抽样分布需要从总体中许多的样本来计算,在只有一个样本的情况下,bootstrap对这一随机样本进行有放回的重复抽样,每一个重抽样本与原始随机样本一样大,每次计算相应的抽样的统计值,重复了N次之后,就可以计算统计值的bootstrap分布。

下面做一个小小的试验:

a <- c(seq(1:10), rnorm(50))  

#创建一个样本,60个数据,非正态分布的,如下图

Continue reading

Author's picture

Guangchuang Yu

Bioinformatics Professor @ SMU

Bioinformatics Professor

Guangzhou