# Chapter 6 Regression

## 6.1 `lm`

This function is used for regression according to a linear model,
i.e. linear regression. It returns a model-class object. There are
specialized functions for such models, e.g. to extract residuals
(`resid`), to extract regression coefficients (`coef`), to modify
the model (`update`), etc.
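These extractor functions can be tried on any fitted model; a minimal sketch, using R's built-in `cars` dataset rather than the data of this chapter:

```
m <- lm( dist~speed, data=cars )  # fit a simple linear regression
coef(m)                           # regression coefficients b0 and b1
head( resid(m) )                  # residuals, one per observation
m0 <- update( m, .~.-speed )      # drop the predictor: intercept-only model
```

The same calls work unchanged on the models constructed below.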

In the following example, we construct two regression models. The first step in your regression analysis should always be a thorough visual and numerical inspection, including histograms and scatterplots of the variables under study.

The first model is

\(\texttt{nsyl} = b_0\)

having only a constant intercept. The second model includes the speakers’ age as a predictor, i.e.

\(\texttt{nsyl} = b_0 + b_1 \texttt{age}\)

(The intercept is included in this model too, by default, unless
suppressed explicitly with `-1` or `~0` in the regression formula.)
The key question here is whether inclusion of a predictor yields a
better model, with significantly smaller residuals and a significantly
higher \(R^2\). The intercept-only model and the linear-regression
model are compared with the `anova` function.

```
require(hqmisc)
data(talkers)
model1.lm <- lm( nsyl~1, data=talkers )   # intercept only
model2.lm <- lm( nsyl~age, data=talkers ) # intercept and age predictor
anova( model1.lm, model2.lm ) # compare two models
```

```
## Analysis of Variance Table
##
## Model 1: nsyl ~ 1
## Model 2: nsyl ~ age
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     79 381.85
## 2     78 375.06  1    6.7823 1.4105 0.2386
```

Including the `age` predictor does improve the model a little, as
indicated by the somewhat smaller residual sum of squares (`RSS`).
The improvement, however, is too small to be significant. The linear
effect of a speaker’s age on his or her average phrase length (in
syllables) is not significant.
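The \(R^2\) of a fitted model can also be inspected directly; a sketch, assuming the two models fitted above:

```
summary( model2.lm )$r.squared      # proportion of variance explained by age
summary( model2.lm )$adj.r.squared  # adjusted for the number of predictors
```

Both values will be close to zero here, in line with the non-significant `anova` comparison.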

## 6.2 `glm`

For logistic regression we use the function `glm`, with
`family=binomial`, again with a regression formula as an obligatory
argument. Logistic regression can be imagined as computing the logit
of the hit rate for each cell, and then regressing these logits on
the predictor(s). Here is an annotated example (Moore and McCabe
2003, Exercise 15.25).
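The logit transformation itself is easy to verify in R; a minimal sketch, using an arbitrary hit rate of 0.75:

```
p <- 0.75       # hypothetical hit rate
log( p/(1-p) )  # logit by hand: the log of the odds, log(3)
qlogis(p)       # same value, via R's built-in quantile function
```

Both expressions return \(\log(3) \approx 1.0986\).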

The response variable `outcome` indicates the death (`0`) or survival
(`1`) of 2900 patients in two hospitals.

```
ips1525 <- read.table(
  file=url("http://www.hugoquene.nl/emlar/ipsex1525.txt"),
  header=TRUE, sep="," )
with( ips1525, table(outcome) )
```

```
## outcome
## 0 1
## 79 2821
```

The overall log odds of survival, \(\log(2821/79) = 3.575399\), will reappear below as the intercept estimate.

```
# intercept-only logistic-regression model
model1.glm <- glm( outcome~1, data=ips1525, family=binomial )
summary(model1.glm)
```

```
##
## Call:
## glm(formula = outcome ~ 1, family = binomial, data = ips1525)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -2.684  0.235  0.235  0.235  0.235
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   3.5754     0.1141   31.34   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 725.1  on 2899  degrees of freedom
## Residual deviance: 725.1  on 2899  degrees of freedom
## AIC: 727.1
##
## Number of Fisher Scoring iterations: 6
```
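As a quick check on the output above, the intercept estimate can be back-transformed with `plogis` (the inverse logit); a small sketch, using the counts reported earlier:

```
plogis( 3.5754 )  # inverse logit of the intercept: survival probability
2821/2900         # observed proportion of survivors, ~0.9728
```

The two values agree, as they should for an intercept-only model.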

The intercept coefficient of the intercept-only logistic regression
model equals the overall log odds of survival; this intercept is
significantly different from zero.^{9} Next, let’s try to improve
this model by including two predictors: first, the `hospital` where
the patient was treated, and second, the patient’s `condition` at
intake, classified as bad (`0`) or good (`1`).

```
model2.glm <- glm( outcome~hospital, data=ips1525, family=binomial )
model3.glm <- glm( outcome~hospital*condition,
                   data=ips1525, family=binomial )
```

The difference in deviance between two nested logistic-regression models approximately follows a \(\chi^2\) distribution. Hence we can compare nested models by computing the \(\chi^2\) probability of the change in deviance. In the table below, model 2 is compared against model 1, and model 3 against model 2.
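The analysis-of-deviance table below can be produced with `anova`, now requesting \(\chi^2\) probabilities via `test="Chisq"`; a sketch, assuming the three models fitted above:

```
anova( model1.glm, model2.glm, model3.glm, test="Chisq" )
```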

```
## Analysis of Deviance Table
##
## Model 1: outcome ~ 1
## Model 2: outcome ~ hospital
## Model 3: outcome ~ hospital * condition
##   Resid. Df Resid. Dev Df Deviance   Pr(>Chi)
## 1      2899     725.10
## 2      2898     722.78  1   2.3253     0.1273
## 3      2896     703.96  2  18.8222 0.00008181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The results indicate that there is no significant difference between the two hospitals in their survival rates (model 2, \(p>.10\)), but there is a significant effect of intake condition on the outcome (model 3, \(p<.001\)). Of course, you should also inspect the models themselves before drawing conclusions.
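Such an inspection could include the coefficients on the odds scale; a sketch, assuming the models fitted above:

```
summary( model3.glm )    # coefficients, standard errors, z values
exp( coef(model3.glm) )  # coefficients expressed as odds ratios
```

An odds ratio above 1 indicates higher odds of survival for that predictor value.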

### References

Moore, David S., and George P. McCabe. 2003. *Introduction to the Practice of Statistics*. 4th ed. Freeman.

If the odds are \(1:1\), then the log of the odds is \(\log(1)=0\).↩