Rmd

Die Daten und das Beispiel entstammen dem Buch Everitt (2010), S. 84

Wieviel Zeit (Minuten/Woche) verbringen die Personen damit, sich um ihr Auto zu kümmern. Potenzielle erklärende Variablen: Das Geschlecht (0=f, 1=m) Alter (Jahre), ein Extroversionswert (Verfahren nicht angegeben)

Daten einlesen, deskriptive Daten, erster grafischer Eindruck

[solution:begin:cw1]

# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
# get data into namespace
attach(ddf)
# get some impression of the data
ddf[1:10,]

##    sex age extro time
## 1    1  55    40   46
## 2    1  43    45   79
## 3    0  57    52   33
## 4    1  26    62   63
## 5    0  22    31   20
## 6    0  32    28   18
## 7    0  26     2   11
## 8    1  29    83   97
## 9    1  40    55   63
## 10   0  30    32   46

# psych package gives a nice, short description of the data
# ::: operator selects command of a certain package
require("psych")

## Loading required package: psych

psych:::describe(ddf)

##       vars  n  mean    sd median trimmed   mad min max range skew kurtosis   se
## sex      1 40  0.50  0.51    0.5    0.50  0.74   0   1     1 0.00    -2.05 0.08
## age      2 40 35.95 11.39   34.0   35.34 14.83  21  58    37 0.32    -1.34 1.80
## extro    3 40 37.02 19.67   39.0   36.72 17.79   2  85    83 0.24    -0.10 3.11
## time     4 40 44.12 20.79   44.0   43.28 22.24   7  97    90 0.36    -0.47 3.29

# a correlation matrix
cor(ddf)

##               sex         age     extro      time
## sex    1.00000000 -0.05335758 0.4028614 0.6614013
## age   -0.05335758  1.00000000 0.3969553 0.2335658
## extro  0.40286143  0.39695528 1.0000000 0.6701736
## time   0.66140131  0.23356578 0.6701736 1.0000000

# a first graphical overview
# pairs generates multiple scatterplots
# panel defines content of windows
# text inserts the values in specified column (sex) as symbol at coordinates 
# 0 are values of women, 1 of men in this example
pairs(cbind(time,age,extro),panel=function(x,y) text(x,y,sex))

detach(ddf)

[soution:end:cw1]

Linear Model spezifizieren und anpassen

# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
# adapt model with two explanatory variables extro and sex
m.r <- lm(time ~ extro + sex, data=ddf)
# summary of model 
summary(m.r)

## 
## Call:
## lm(formula = time ~ extro + sex, data = ddf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.193  -9.474  -2.149  10.165  23.918 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15.6797     4.4365   3.534 0.001118 ** 
## extro         0.5093     0.1151   4.423 8.24e-05 ***
## sex          19.1801     4.4727   4.288 0.000124 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.95 on 37 degrees of freedom
## Multiple R-squared:  0.632,  Adjusted R-squared:  0.6121 
## F-statistic: 31.77 on 2 and 37 DF,  p-value: 9.284e-09

Standardisierte Koeffizienten (Gewichte)

# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.r  <- lm(time ~ extro + sex)
m.s <- lm(scale(time) ~ scale(extro) + scale(sex))
require(QuantPsyc)

## Loading required package: QuantPsyc

## Loading required package: boot

## 
## Attaching package: 'boot'

## The following object is masked from 'package:psych':
## 
##     logit

## Loading required package: MASS

## 
## Attaching package: 'QuantPsyc'

## The following object is masked from 'package:base':
## 
##     norm

# use this to get raw and standardized weights
coefficients(m.r)

## (Intercept)       extro         sex 
##   15.679695    0.509257   19.180128

coefficients(m.s)

##   (Intercept)  scale(extro)    scale(sex) 
## -8.124208e-17  4.819377e-01  4.672472e-01

lm.beta(m.r)

##     extro       sex 
## 0.4819377 0.4672472

detach(ddf)
detach("package:QuantPsyc", unload=TRUE)

Konfidenzintervalle für die Parameter

Zur Beurteilung der einzelnen Prädiktoren und Koeffizienten helfen die Konfidenzintervalle.

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# generate model
m.e_s <- lm(time ~ extro + sex)
# get the coefficients
m.e_s$coefficients

## (Intercept)       extro         sex 
##   15.679695    0.509257   19.180128

# and their confidence intervals 95% range
confint(m.e_s, level=0.95)

##                  2.5 %     97.5 %
## (Intercept)  6.6905865 24.6688032
## extro        0.2759686  0.7425455
## sex         10.1175468 28.2427083

detach(ddf)

Grafische Darstellung

Der nicht-ggplot Weg für das Modell time ~ extro + sex.

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
#Fig 4.2
# vergleichend die beiden Regressionsgeraden für Männer und Frauen
# Grafikumgebung zurücksetzen
par(mfrow=c(1,1))
# einen Vektor mit 'm' für Männer und 'f' für Frauen erstellen 
Sex<-rep("m",length(sex))
Sex[sex==0]<-"f"
# einen Vektor mit 50 Elementen erstellen die in regelmäßigen Abständen zwischen 0 und 90 liegen
x<-seq(0,90,length=50)
# die Geradengleichungen mit derselben Steigung und verschiedenem Y-Achsenabschnitt
ym<-15.68+19.18+0.51*x
yf<-15.68+0.51*x
# und nun der plot
plot(ddf$time ~ ddf$extro,      # plotte Grundmodell
 xlab="Extroversion score",                     # X-Achsen-Beschriftung
 ylab="Time spent looking after car (mins)",type="n") # Y-Achsen-Beschriftung
 text(ddf$extro,ddf$time,labels=Sex)    # an die Koordinaten den Text (Buchstaben) ausgeben
 lines(x,ym,lty=1)  # Regressionsgerade für Männer einzeichnen, Linienart festlegen
 lines(x,yf,lty=2)  # dito für Frauen
 legend("topleft",c("Male","Female"),   # eine Legende in die linke, obere Ecke
    lty=1:2)    # zugeordneter Linientyp in der Legende, der dem der Regressionsgeraden entspricht

detach(ddf)

ggplot und verschiedene Scatterplots.

Scatterplot mit unterschiedlichen Werten für Geschlecht und den beiden Regressionsgeraden für das jeweilige Geschlecht

scale_fill_manual(values=c_colors) +

To use for fills, add

scale_fill_manual(values=cbPalette)

To use for line and point colors, add

scale_colour_manual(values=cbPalette)

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# use ggplot2
require("ggplot2")

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

# raw scatterplot of extro and time
pplot <- ggplot(ddf, aes(x=extro, y=time))
pplot + 
  geom_point()

# gender specifity
# sex must have type factor to be used adequately
f.sex <- factor(sex, levels = c(0,1), labels = c("f", "m"))
# create base plot object 
pplot <- ggplot(ddf, aes(x=extro, y=time))
# add a geom_point to base plot object
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex))

# include global regression line
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  stat_smooth(method = "lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

# include global regression line with confidence interval
pplot <- ggplot(ddf, aes(x=extro, y=time))
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  stat_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

# include group specific regression lines without differences in slope (linear model without interaction)
# for this we need the coefficients of the model to plot
m.e_s <- lm(time ~ extro + sex)
# create base plot object 
pplot <- ggplot(ddf, aes(x=extro, y=time))
# color management
cbPalette = c('blue', 'red')
# add scatterplot and gender specific regression lines as estimated by model above 
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  scale_color_manual(values=cbPalette) +
  geom_abline(intercept = m.e_s$coefficients['(Intercept)'], slope =  m.e_s$coefficients['extro'], color=cbPalette[1]) + 
  geom_abline(intercept = m.e_s$coefficients['(Intercept)'] + m.e_s$coefficients['sex'], slope =  m.e_s$coefficients['extro'], color=cbPalette[2])

# include group specific regression lines - this is NOT the statistical model 'time~extro+sex' discussed here
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  stat_smooth(aes(fill = f.sex), method = "lm")

## `geom_smooth()` using formula 'y ~ x'

# include both regression lines
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  stat_smooth(method = "lm", se=FALSE) +
  stat_smooth(aes(fill = f.sex, color=f.sex), method = "lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

# include both regression lines, global confidence interval
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  stat_smooth(method = "lm") +
  stat_smooth(aes(fill = f.sex, color=f.sex), method = "lm", se=FALSE)

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

# include both regression lines and confidence intervals for gender
pplot + 
  geom_point(aes(color=f.sex, shape=f.sex)) +
  stat_smooth(method = "lm", se=FALSE) +
  stat_smooth(aes(fill = f.sex, color=f.sex), method = "lm")

## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

detach(ddf)

Residuen

Unstandardisierte und standardisierte Residuen. Mit rstandard() kann man die Residuen standardisieren.

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
# get residuals
resid(m.e_s)

##           1           2           3           4           5           6 
##  -9.2301034  21.2236114  -9.1610602  -3.4337580 -11.4666626 -11.9388916 
##           7           8           9          10          11          12 
##  -5.6982089  19.8718445   0.1310412  14.0240803 -18.6147751  13.2236114 
##          13          14          15          16          17          18 
##  -6.4152439   2.5198363  10.7555060   1.8533304   9.9685382 -20.7443734 
##          19          20          21          22          23          24 
## -15.1005051 -10.2074659  19.5518514   7.7648836  14.2050974  -5.8463213 
##          25          26          27          28          29          30 
##  12.1721927  -3.0129478  23.9180090  -6.7907792 -11.1560472   2.7143544 
##          31          32          33          34          35          36 
##  -3.1005051   5.9500241  -1.2856456  -4.0870040   8.7925341  18.5888795 
##          37          38          39          40 
## -14.7578745 -11.7393605   2.7513825 -26.1930753

# or via
m.e_s$residuals

##           1           2           3           4           5           6 
##  -9.2301034  21.2236114  -9.1610602  -3.4337580 -11.4666626 -11.9388916 
##           7           8           9          10          11          12 
##  -5.6982089  19.8718445   0.1310412  14.0240803 -18.6147751  13.2236114 
##          13          14          15          16          17          18 
##  -6.4152439   2.5198363  10.7555060   1.8533304   9.9685382 -20.7443734 
##          19          20          21          22          23          24 
## -15.1005051 -10.2074659  19.5518514   7.7648836  14.2050974  -5.8463213 
##          25          26          27          28          29          30 
##  12.1721927  -3.0129478  23.9180090  -6.7907792 -11.1560472   2.7143544 
##          31          32          33          34          35          36 
##  -3.1005051   5.9500241  -1.2856456  -4.0870040   8.7925341  18.5888795 
##          37          38          39          40 
## -14.7578745 -11.7393605   2.7513825 -26.1930753

# get standardized residuals
rstandard(m.e_s)

##           1           2           3           4           5           6 
## -0.73224152  1.68206102 -0.74229499 -0.27553450 -0.90890330 -0.94626353 
##           7           8           9          10          11          12 
## -0.46619465  1.68001111  0.01043039  1.11182964 -1.49515277  1.04802717 
##          13          14          15          16          17          18 
## -0.51336399  0.20871681  0.87046630  0.15786030  0.79260765 -1.71799770 
##          19          20          21          22          23          24 
## -1.21488477 -0.83314458  1.54956809  0.64124615  1.12602975 -0.46578420 
##          25          26          27          28          29          30 
##  0.97177203 -0.23924967  1.93087032 -0.54495136 -0.89030477  0.21513572 
##          31          32          33          34          35          36 
## -0.24944572  0.47387092 -0.10189837 -0.32690695  0.71765629  1.47432914 
##          37          38          39          40 
## -1.16978961 -0.93096781  0.21813228 -2.08271256

# check residuals grafically
# raw residuals
plot(m.e_s$fitted.values, m.e_s$residuals, 
     ylab="Raw Residuals", 
     xlab="Fitted Washing Time", 
     main="Washing Time \npredicted by Extraversion and Gender") 
abline(0, 0)

# standardized residuals - no much difference
plot(m.e_s$fitted.values, rstandard(m.e_s), 
     ylab="Standardized Residuals", 
     xlab="Fitted Washing Time", 
     main="Washing Time \npredicted by Extraversion and Gender") 
abline(0, 0)

# plot response variable values against raw residuals
plot(time, m.e_s$residuals, 
     ylab="Raw Residuals", 
     xlab="Washing Time", 
     main="Washing Time \npredicted by Extraversion and Gender") 
abline(0, 0)

# plot response variable values against standardized residuals
plot(time, rstandard(m.e_s), 
     ylab="Standardized Residuals", 
     xlab="Washing Time", 
     main="Washing Time \npredicted by Extraversion and Gender") 
abline(0, 0)

#{} Hinweise, was man Residualplots entnehmen kann 
detach(ddf)

Ein Plot der Residuen gegen die vorhergesagten Werte (vgl. Vorhersage) hilft bei der grafischen Beurteilung der Residuen.

Vorhersage

Vorhersage eines einzelnen Kriteriumswertes (Vp 3) mit Prädiktor sex und extero (zu Fuß)

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
# store coefficients
coe <- m.e_s$coefficients
# calculate predicted (fitted) value of 3rd subject
e.time.3 <- coe['(Intercept)'] + coe['sex'] * sex[3] + coe['extro'] * extro[3]
e.time.3

## (Intercept) 
##    42.16106

# predicted values are called fitted.values
m.e_s$fitted.values

##        1        2        3        4        5        6        7        8 
## 55.23010 57.77639 42.16106 66.43376 31.46666 29.93889 16.69821 77.12816 
##        9       10       11       12       13       14       15       16 
## 62.86896 31.97592 39.61478 57.77639 65.41524 41.48016 19.24449 78.14667 
##       17       18       19       20       21       22       23       24 
## 35.03146 46.74437 48.10051 17.20747 30.44815 46.23512 58.79490 24.84632 
##       25       26       27       28       29       30       31       32 
## 23.82781 34.01295 47.08199 21.79078 51.15605 58.28565 48.10051 36.04998 
##       33       34       35       36       37       38       39       40 
## 58.28565 38.08700 17.20747 28.41112 56.75787 55.73936 56.24862 53.19308

# müßte gleich sein dem 3. Wert in den Vorhersagewerten
# should be equal to 3 pos of fitted (predicted) values 
m.e_s$fitted.values[3]

##        3 
## 42.16106

# residual for this subject is difference between real value and predicted value
time[3] - e.time.3

## (Intercept) 
##    -9.16106

# compare with residuals vector pos 3
m.e_s$residuals

##           1           2           3           4           5           6 
##  -9.2301034  21.2236114  -9.1610602  -3.4337580 -11.4666626 -11.9388916 
##           7           8           9          10          11          12 
##  -5.6982089  19.8718445   0.1310412  14.0240803 -18.6147751  13.2236114 
##          13          14          15          16          17          18 
##  -6.4152439   2.5198363  10.7555060   1.8533304   9.9685382 -20.7443734 
##          19          20          21          22          23          24 
## -15.1005051 -10.2074659  19.5518514   7.7648836  14.2050974  -5.8463213 
##          25          26          27          28          29          30 
##  12.1721927  -3.0129478  23.9180090  -6.7907792 -11.1560472   2.7143544 
##          31          32          33          34          35          36 
##  -3.1005051   5.9500241  -1.2856456  -4.0870040   8.7925341  18.5888795 
##          37          38          39          40 
## -14.7578745 -11.7393605   2.7513825 -26.1930753

m.e_s$residuals[3]

##        3 
## -9.16106

detach(ddf)

Vorhergesagte Werte für alle Beobachtungen

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex)
# predicted or fitted values may be obtained by
m.e_s$fitted.values

##        1        2        3        4        5        6        7        8 
## 55.23010 57.77639 42.16106 66.43376 31.46666 29.93889 16.69821 77.12816 
##        9       10       11       12       13       14       15       16 
## 62.86896 31.97592 39.61478 57.77639 65.41524 41.48016 19.24449 78.14667 
##       17       18       19       20       21       22       23       24 
## 35.03146 46.74437 48.10051 17.20747 30.44815 46.23512 58.79490 24.84632 
##       25       26       27       28       29       30       31       32 
## 23.82781 34.01295 47.08199 21.79078 51.15605 58.28565 48.10051 36.04998 
##       33       34       35       36       37       38       39       40 
## 58.28565 38.08700 17.20747 28.41112 56.75787 55.73936 56.24862 53.19308

# or alternatively by
predict(m.e_s)

##        1        2        3        4        5        6        7        8 
## 55.23010 57.77639 42.16106 66.43376 31.46666 29.93889 16.69821 77.12816 
##        9       10       11       12       13       14       15       16 
## 62.86896 31.97592 39.61478 57.77639 65.41524 41.48016 19.24449 78.14667 
##       17       18       19       20       21       22       23       24 
## 35.03146 46.74437 48.10051 17.20747 30.44815 46.23512 58.79490 24.84632 
##       25       26       27       28       29       30       31       32 
## 23.82781 34.01295 47.08199 21.79078 51.15605 58.28565 48.10051 36.04998 
##       33       34       35       36       37       38       39       40 
## 58.28565 38.08700 17.20747 28.41112 56.75787 55.73936 56.24862 53.19308

detach(ddf)

Ausreißer und bedeutsame Fälle

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
# generate obeservation number
nr <- 1:length(time)
# get the most extreme observations: standardized residuals > 2 or < -2 (two extremes)
extremes <- nr[rstandard(m.e_s) > 2 | rstandard(m.e_s) < -2]
print(paste('extreme standardized residuals in observation: ', extremes))

## [1] "extreme standardized residuals in observation:  40"

#extremes
ddf[extremes,]

##    sex age extro time
## 40   1  25    36   27

# find observations with cooks distance > 1
extremes <- nr[cooks.distance(m.e_s) > 0.5]
print(paste('extreme cooks distances in observation: ', extremes))

## [1] "extreme cooks distances in observation:  "

# extremes
ddf[extremes,]

## [1] sex   age   extro time 
## <0 rows> (or 0-length row.names)

{}

## NULL

# insert extreme values to show effect
c.d.mod <- ddf
to.mod <- sample(length(time), 2)
c.d.mod$extro[to.mod] <- c.d.mod$extro[to.mod] * 2
mm.e_s <- lm(time ~ extro + sex, data=c.d.mod)
cooks.distance(mm.e_s)[cooks.distance(mm.e_s) > 0.1]

##         5         8 
## 0.1202940 0.1940179

detach(ddf)

Link

Voraussetzung: Unabhängigkeit

Berechnung mit durbinWatsonTest() or dwt().

Daumenregel: Je näher an 2, desto besser. Werte unter 1 und über 3 sollten aufmerksam machen.

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex)
require(car)

## Loading required package: car

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:boot':
## 
##     logit

## The following object is masked from 'package:psych':
## 
##     logit

# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
dwt(m.e_s)

##  lag Autocorrelation D-W Statistic p-value
##    1       -0.149987      2.175587   0.552
##  Alternative hypothesis: rho != 0

# or
durbinWatsonTest(m.e_s)

##  lag Autocorrelation D-W Statistic p-value
##    1       -0.149987      2.175587   0.572
##  Alternative hypothesis: rho != 0

detach(ddf)

Multikollinearität

VIF (Variation Inflation Factor) oder tolerance. Sie sind ineinander überführbar: \(tolerance = \frac{1}{vif}\)

Empfehlungen bei VIF

Achtung wenn es VIF größer 10 gibt bzw. wenn es Toleranzwerte < 0.1 gibt
Achtung wenn der durchschnittliche VIF deutlich größer als 1 ist.
Toleranzwerte < 0.2 sollten Verdacht auf Multikollinearität wecken

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex)
# output vif
print(paste("vif: ", vif(m.e_s)))

## [1] "vif:  1.19374097731722" "vif:  1.19374097731722"

# output tolerance
print(paste("tolerance: ", 1 / vif(m.e_s)))

## [1] "tolerance:  0.83770266666004" "tolerance:  0.83770266666004"

# mean of vif
mean(vif(m.e_s))

## [1] 1.193741

detach(ddf)

Grafiken im Modell

Wenn man den plot() Befehl auf ein lm-Objekt ausführt, bekommt man eine ganze Kette von Grafiken.

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex + age)
# get series of plots
plot(m.e_s)

detach(ddf)

Beispiel Autowasch-Daten

Rmd