10 General Linear Model
10.1 T检验是回归的特例
data(iris)
tt <- iris[, c("Petal.Length", "Sepal.Length")]
t.test(Petal.Length, Sepal.Length, data=tt, var.equal=TRUE)
##
## Two Sample t-test
##
## data: Petal.Length and Sepal.Length
## t = -13.098, df = 298, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.398643 -1.772023
## sample estimates:
## mean of x mean of y
## 3.758000 5.843333
我们取iris的Petal.Length和Sepal.Length做两样本T检验,因为方差分析和回归分析都假定齐方差,所以这里以齐方差方式进行T检验。
数据可以转换成以下分组形式:
require(reshape2)
## Loading required package: reshape2
tt2 <- melt(tt)
## No id variables; using all as measure variables
head(tt2)
## variable value
## 1 Petal.Length 1.4
## 2 Petal.Length 1.4
## 3 Petal.Length 1.3
## 4 Petal.Length 1.5
## 5 Petal.Length 1.4
## 6 Petal.Length 1.7
t.test(value ~ variable, data=tt2, var.equal=TRUE)
##
## Two Sample t-test
##
## data: value by variable
## t = -13.098, df = 298, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.398643 -1.772023
## sample estimates:
## mean in group Petal.Length mean in group Sepal.Length
## 3.758000 5.843333
分成两组,按分组变量进行T检验,结果是一样的。
对分组数据进行回归分析:
lm.fit <- lm(value ~ variable, data=tt2)
summary(lm.fit)
##
## Call:
## lm(formula = value ~ variable, data = tt2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7580 -0.8433 0.1993 0.9457 3.1420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7580 0.1126 33.38 <2e-16 ***
## variableSepal.Length 2.0853 0.1592 13.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.379 on 298 degrees of freedom
## Multiple R-squared: 0.3654, Adjusted R-squared: 0.3632
## F-statistic: 171.6 on 1 and 298 DF, p-value: < 2.2e-16
回归分析完全包含了T检验,并且给出更丰富的信息。
Intercept相当于Petal.Length的均值,Slope则是均值差,slope的T检验结果和上面两样本均值差T检验是一样的。 而F值是T检验结果的平方:
13.10^2
## [1] 171.61
10.2 T检验是方差分析的特例
res <- aov(value ~ variable, data=tt2)
summary(res)
## Df Sum Sq Mean Sq F value Pr(>F)
## variable 1 326.1 326.1 171.6 <2e-16 ***
## Residuals 298 566.5 1.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
one way ANOVA进行两组分析和T检验是一样一样的,F值是t值的平方。
我们常说T检验和ANOVA是用来做组间比较的,而相关性和回归是用来度量相互关系的,实际上,他们是没有差别的。
10.3 ANOVA是多重回归的特例
这里使用two way ANOVA一节中使用的数据:
data <- read.table("data/gender_dose.tsv", header=TRUE)
summary(with(data, aov(Alertness~Gender+Dosage)))
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 76.56 76.56 3.197 0.0971 .
## Dosage 1 5.06 5.06 0.211 0.6533
## Residuals 13 311.31 23.95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(lm(Alertness~Gender+Dosage, data=data))
##
## Call:
## lm(formula = Alertness ~ Gender + Dosage, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.438 -3.406 0.000 1.594 10.562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.688 2.119 7.403 5.17e-06 ***
## Genderm -4.375 2.447 -1.788 0.0971 .
## Dosageb 1.125 2.447 0.460 0.6533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.894 on 13 degrees of freedom
## Multiple R-squared: 0.2077, Adjusted R-squared: 0.08584
## F-statistic: 1.704 on 2 and 13 DF, p-value: 0.2201
上面的例子中假设Gender和Dosage是完全无关的,如果把两因素的互作考虑在内,则:
summary(with(data, aov(Alertness~Gender*Dosage)))
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 76.56 76.56 2.952 0.111
## Dosage 1 5.06 5.06 0.195 0.666
## Gender:Dosage 1 0.06 0.06 0.002 0.962
## Residuals 12 311.25 25.94
summary(lm(Alertness~Gender+Dosage+Gender*Dosage, data=data))
##
## Call:
## lm(formula = Alertness ~ Gender + Dosage + Gender * Dosage, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.500 -3.375 0.000 1.562 10.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.750 2.546 6.185 4.69e-05 ***
## Genderm -4.500 3.601 -1.250 0.235
## Dosageb 1.000 3.601 0.278 0.786
## Genderm:Dosageb 0.250 5.093 0.049 0.962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.093 on 12 degrees of freedom
## Multiple R-squared: 0.2079, Adjusted R-squared: 0.009862
## F-statistic: 1.05 on 3 and 12 DF, p-value: 0.4062