Among other things, library(MASS)
contains a data set on diabetes covering 200 people (Pima Indians, Native Americans in southern Arizona).
?Pima.tr
tells us about the data and their structure, as well as about the available variables.
Here we are interested in whether the presence of a diabetes diagnosis (variable "type") can be "predicted" from the other variables.
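Before modelling, the data can be inspected directly in R (a minimal sketch; output omitted here):
# the data set Pima.tr lives in library(MASS)
require(MASS)
str(Pima.tr)    # variable types and dimensions
head(Pima.tr)   # first few rows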
[res_begin:res_1]
# we find the data in library(MASS)
require(MASS)
## Loading required package: MASS
df <- Pima.tr
# the number of diagnoses (frequency table of yes and no)
table(df$type)
##
## No Yes
## 132 68
# null model (intercept only)
m.0 = glm(type ~ 1, data = df, family = 'binomial')
summary(m.0)
##
## Call:
## glm(formula = type ~ 1, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9116 -0.9116 -0.9116 1.4689 1.4689
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6633 0.1493 -4.444 8.85e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 256.41 on 199 degrees of freedom
## AIC: 258.41
##
## Number of Fisher Scoring iterations: 4
# get at the log(odds)
coef(m.0)
## (Intercept)
## -0.6632942
# odds of a 'Yes' diagnosis (exponentiated intercept)
round(exp(coef(m.0)),2)
## (Intercept)
## 0.52
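# sanity check (a minimal sketch): the odds of a 'Yes' diagnosis
# can be reproduced directly from the frequency table above
sum(df$type == 'Yes') / sum(df$type == 'No')   # 68/132, about 0.52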
# we look at the relation to number of pregnancies by extending model m.0
m.1 = update(m.0, .~. + npreg)
# or alternatively
m.1 <- glm(type ~ npreg, data = df, family = 'binomial')
# we compare the models
anova(m.0, m.1)
## Analysis of Deviance Table
##
## Model 1: type ~ 1
## Model 2: type ~ npreg
## Resid. Df Resid. Dev Df Deviance
## 1 199 256.41
## 2 198 242.03 1 14.388
# the residual deviance equals -2 * log-likelihood of the model.
# it is reported as "Resid. Dev" for each model
-2*logLik(m.0)
## 'log Lik.' 256.4142 (df=1)
-2*logLik(m.1)
## 'log Lik.' 242.0262 (df=2)
# differences of -2*log-likelihood values (deviances) are chi^2 distributed
# we use this to test it
anova(m.0, m.1, test = 'Chisq')
## Analysis of Deviance Table
##
## Model 1: type ~ 1
## Model 2: type ~ npreg
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 199 256.41
## 2 198 242.03 1 14.388 0.0001488 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we get the odds ratios
round(exp(coef(m.1)),2)
## (Intercept) npreg
## 0.27 1.18
For both models, Resid. Dev now reports the -2LL (-2*log-likelihood) of the model, but no significance test is carried out. The difference of -2LL values is chi²-distributed. To test the two models against each other, we therefore have to request the chi²-test explicitly in the call to anova().
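The same test can be reproduced by hand (a minimal sketch using the deviances reported above; deviance() returns the residual deviance of a fitted glm):
# difference in deviance between the nested models, 1 degree of freedom
pchisq(deviance(m.0) - deviance(m.1), df = 1, lower.tail = FALSE)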
For our new model, we can again have the coefficients printed as odds ratios, as shown in the exp(coef(m.1)) output above.
A visualization without ggplot()
# we find the data in library(MASS)
require(MASS)
df <- Pima.tr
# nice plot for the relation of x and y
par(mar = c(4,4,1,4))
# conditional density plot of type given npreg
cdplot(df$npreg, df$type)
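For comparison, a ggplot() version could look like this (a sketch, assuming ggplot2 is installed; the fitted curve is the logistic regression of type on npreg):
require(ggplot2)
# recode the factor (No/Yes) as 0/1 and overlay the fitted logistic curve
ggplot(df, aes(x = npreg, y = as.numeric(type) - 1)) +
  geom_jitter(height = 0.05, width = 0.2, alpha = 0.5) +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial')) +
  labs(y = 'P(type = Yes)')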
More than just one predictor variable:
# we find the data in library(MASS)
require(MASS)
df <- Pima.tr
m.0 = glm(type ~ 1, data = df, family = 'binomial')
m.1 <- glm(type ~ npreg, data = df, family = 'binomial')
# we add a second predictive variable "bmi"
#m.2 <- update(m.1, .~. + bmi)
m.2 <- glm(type ~ npreg + bmi, data = df, family = 'binomial')
anova(m.0, m.1, m.2, test = 'Chisq')
## Analysis of Deviance Table
##
## Model 1: type ~ 1
## Model 2: type ~ npreg
## Model 3: type ~ npreg + bmi
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 199 256.41
## 2 198 242.03 1 14.388 0.0001488 ***
## 3 197 225.82 1 16.204 5.687e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m.2)
##
## Call:
## glm(formula = type ~ npreg + bmi, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7318 -0.8744 -0.5543 1.0888 2.1471
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.93785 1.01696 -4.856 1.2e-06 ***
## npreg 0.17391 0.04798 3.625 0.000289 ***
## bmi 0.10957 0.02870 3.818 0.000135 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 225.82 on 197 degrees of freedom
## AIC: 231.82
##
## Number of Fisher Scoring iterations: 4
round(exp(coef(m.2)),2)
## (Intercept) npreg bmi
## 0.01 1.19 1.12
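# predicted probability for a hypothetical profile (a minimal sketch;
# npreg = 3 and bmi = 30 are arbitrary values chosen for illustration)
predict(m.2, newdata = data.frame(npreg = 3, bmi = 30), type = 'response')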
# and a third predictive variable "age"
#m.3 <- update(m.2, .~. + age)
m.3 <- glm(type ~ npreg + bmi + age, data = df, family = 'binomial')
anova(m.0, m.1, m.2, m.3, test = 'Chisq')
## Analysis of Deviance Table
##
## Model 1: type ~ 1
## Model 2: type ~ npreg
## Model 3: type ~ npreg + bmi
## Model 4: type ~ npreg + bmi + age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 199 256.41
## 2 198 242.03 1 14.388 0.0001488 ***
## 3 197 225.82 1 16.204 5.687e-05 ***
## 4 196 214.62 1 11.197 0.0008192 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(m.3)
##
## Call:
## glm(formula = type ~ npreg + bmi + age, family = "binomial",
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7413 -0.8235 -0.4918 0.9773 2.2382
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.44462 1.18360 -5.445 5.18e-08 ***
## npreg 0.06508 0.05720 1.138 0.255226
## bmi 0.10761 0.02986 3.604 0.000313 ***
## age 0.05937 0.01817 3.267 0.001086 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 256.41 on 199 degrees of freedom
## Residual deviance: 214.62 on 196 degrees of freedom
## AIC: 222.62
##
## Number of Fisher Scoring iterations: 4
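# the AIC values from the summaries can also be collected in one table
# (a minimal sketch; lower AIC indicates the better fit/complexity trade-off)
AIC(m.0, m.1, m.2, m.3)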
confint(m.3)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -8.90582169 -4.24467204
## npreg -0.04696579 0.17892197
## bmi 0.05088762 0.16853743
## age 0.02448616 0.09638742
round(exp(confint(m.3)),2)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.00 0.01
## npreg 0.95 1.20
## bmi 1.05 1.18
## age 1.02 1.10
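# point estimates and confidence intervals combined as odds ratios
# (a minimal sketch)
round(cbind(OR = exp(coef(m.3)), exp(confint(m.3))), 2)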
# model predictions
# predict(m.3)
range(predict(m.3))
## [1] -3.174282 1.998056
# predict(m.3, type = 'response')
range(predict(m.3, type = 'response'))
## [1] 0.0401451 0.8805928
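# classify with a 0.5 cutoff (an arbitrary choice for illustration)
# and cross-tabulate against the observed diagnoses (a minimal sketch)
pred <- ifelse(predict(m.3, type = 'response') > 0.5, 'Yes', 'No')
table(predicted = pred, observed = df$type)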
[res_end]