Chapter 6 Regression
6.1 lm
This function is used for regression according to a linear model,
i.e. linear regression. It returns a model-class object. There are
specialized functions for such models, e.g. to extract residuals
(resid), to extract regression coefficients (coef), to modify the
model (update), etc.
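As a minimal sketch of these extractor functions (using R's built-in cars data here, rather than the talkers data introduced below):

```r
# Fit a simple linear model on R's built-in cars data
m <- lm(dist ~ speed, data = cars)
coef(m)                 # extract the intercept and slope
head(resid(m))          # extract (the first few) residuals
m0 <- update(m, . ~ 1)  # modify the model: refit with intercept only
```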
In the following example, we construct two regression models. The first step in your regression analysis should always be a thorough visual and numerical inspection, including histograms and scatterplots of the variables under study.
The first model is \(\texttt{nsyl} = b_0\), having only a constant
intercept. The second model includes the speakers’ age as a
predictor, i.e. \(\texttt{nsyl} = b_0 + b_1 \texttt{age}\). (The
intercept is included in this model too by default, unless suppressed
explicitly with -1 or ~0 in the regression formula.) The key question
here is whether inclusion of a predictor yields a better model, with
significantly smaller residuals and a significantly higher \(R^2\).
The intercept-only model and the linear-regression model are compared
with the anova function.
require(hqmisc)
data(talkers)
model1.lm <- lm( nsyl~1, data=talkers )   # only intercept
model2.lm <- lm( nsyl~age, data=talkers ) # intercept plus age predictor
anova( model1.lm, model2.lm )             # compare two models
## Analysis of Variance Table
##
## Model 1: nsyl ~ 1
## Model 2: nsyl ~ age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 381.85
## 2 78 375.06 1 6.7823 1.4105 0.2386
Including the age predictor does improve the model a little, as
indicated by the somewhat smaller residual sum-of-squares (RSS). The
improvement, however, is too small to reach significance. The linear
effect of a speaker’s age on his or her average phrase length (in
syllables) is not significant.
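The F test in the anova table can be recomputed by hand from the printed RSS values and residual degrees of freedom, which also yields the (very small) \(R^2\) of the age model; the numbers below are taken from the table above:

```r
# Recompute the model comparison from the RSS values printed above
rss1 <- 381.85; df1 <- 79  # intercept-only model
rss2 <- 375.06; df2 <- 78  # model with age predictor
F <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
F                                          # approx 1.41, as in the anova table
pf(F, df1 - df2, df2, lower.tail = FALSE)  # approx 0.24, not significant
1 - rss2 / rss1                            # R-squared of the age model, approx 0.018
```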
6.2 glm
For logistic regression we use the function glm(family=binomial),
again with a regression formula as an obligatory argument. Logistic
regression can be imagined as computing the logit of the hit-rate for
each cell, and then regressing these logits on the predictor(s). Here
is an annotated example (Moore and McCabe 2003, Exercise 15.25).
The response variable outcome indicates the death (0) or survival (1)
of 2900 patients in two hospitals.
ips1525 <- read.table(
  file=url("http://www.hugoquene.nl/emlar/ipsex1525.txt"),
  header=T, sep="," )
with( ips1525, table(outcome) )
## outcome
## 0 1
## 79 2821
log(2821/79) # log of odds of survival is 3.575
## [1] 3.575399
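The same logit can be obtained with R's built-in qlogis (logit) and plogis (inverse logit) functions, using the counts tabulated above:

```r
p <- 2821/2900     # overall proportion of survivals
qlogis(p)          # logit: log(p/(1-p)) = log(2821/79) = 3.575
plogis(qlogis(p))  # inverse logit recovers the proportion, 0.9728
```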
# intercept-only logistic-regression model
# intercept-only logistic-regression model
model1.glm <- glm( outcome~1, data=ips1525, family=binomial )
summary(model1.glm)
##
## Call:
## glm(formula = outcome ~ 1, family = binomial, data = ips1525)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.684 0.235 0.235 0.235 0.235
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.5754 0.1141 31.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 725.1 on 2899 degrees of freedom
## Residual deviance: 725.1 on 2899 degrees of freedom
## AIC: 727.1
##
## Number of Fisher Scoring iterations: 6
The intercept coefficient of the intercept-only logistic regression
model equals the overall log odds of survival, and this intercept is
significantly different from zero. Thus the intercept-only logistic
regression model captures the overall log odds of survival.
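This can be verified directly by rebuilding the outcome vector from the counts tabulated above (79 deaths, 2821 survivals) and refitting the intercept-only model: the fitted intercept equals log(2821/79), and back-transforming it with the inverse logit recovers the overall survival proportion:

```r
outcome <- rep(c(0, 1), times = c(79, 2821))  # 79 deaths, 2821 survivals
m0 <- glm(outcome ~ 1, family = binomial)
coef(m0)          # intercept = 3.5754 = log(2821/79)
plogis(coef(m0))  # overall survival proportion, 0.9728
```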
Next, let’s try to improve this model by including two predictors:
first, the hospital where the patient was treated, and second, the
patient’s condition at intake, classified as bad (0) or good (1).
model2.glm <- glm( outcome~hospital, data=ips1525, family=binomial )
model3.glm <- glm( outcome~hospital*condition,
                   data=ips1525, family=binomial )
The difference in deviance between two nested logistic-regression models approximately follows a \(\chi^2\) distribution. Hence we can compare nested models by computing the \(\chi^2\) probability of the change in deviance. In the sequential comparison below, each model is compared against the preceding one: model 2 against model 1, and model 3 against model 2.
anova( model1.glm, model2.glm, model3.glm, test="Chisq" )
## Analysis of Deviance Table
##
## Model 1: outcome ~ 1
## Model 2: outcome ~ hospital
## Model 3: outcome ~ hospital * condition
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2899 725.10
## 2 2898 722.78 1 2.3253 0.1273
## 3 2896 703.96 2 18.8222 8.181e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results indicate that there is no significant difference among hospitals in their survival rates (model 2, \(p>.10\)), but there is a significant effect of intake condition on the outcome (model 3, \(p<.001\)). Of course, you should also inspect the models themselves before drawing conclusions.
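One common way to inspect a fitted logistic-regression model (a sketch, assuming model3.glm fitted above is still in the workspace) is to look at its coefficient table and to exponentiate the coefficients, which turns log-odds differences into odds ratios:

```r
# Assumes model3.glm from the example above is available
summary(model3.glm)    # Wald z tests for the individual coefficients
exp(coef(model3.glm))  # exponentiated coefficients: odds ratios
```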