Die Daten und das Beispiel entstammen dem Buch Everitt (2010), S. 84
Wieviel Zeit (Minuten/Woche) verbringen die Personen damit, sich um ihr Auto zu kümmern. Potenzielle erklärende Variablen: Das Geschlecht (0=f, 1=m) Alter (Jahre), ein Extroversionswert (Verfahren nicht angegeben)
Daten einlesen, deskriptive Daten, erster grafischer Eindruck
[solution:begin:cw1]
# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
# get data into namespace
attach(ddf)
# get some impression of the data
ddf[1:10,]
## sex age extro time
## 1 1 55 40 46
## 2 1 43 45 79
## 3 0 57 52 33
## 4 1 26 62 63
## 5 0 22 31 20
## 6 0 32 28 18
## 7 0 26 2 11
## 8 1 29 83 97
## 9 1 40 55 63
## 10 0 30 32 46
# psych package gives a nice, short description of the data
# ::: operator selects command of a certain package
require("psych")
## Loading required package: psych
psych:::describe(ddf)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## sex 1 40 0.50 0.51 0.5 0.50 0.74 0 1 1 0.00 -2.05 0.08
## age 2 40 35.95 11.39 34.0 35.34 14.83 21 58 37 0.32 -1.34 1.80
## extro 3 40 37.02 19.67 39.0 36.72 17.79 2 85 83 0.24 -0.10 3.11
## time 4 40 44.12 20.79 44.0 43.28 22.24 7 97 90 0.36 -0.47 3.29
# a correlation matrix
cor(ddf)
## sex age extro time
## sex 1.00000000 -0.05335758 0.4028614 0.6614013
## age -0.05335758 1.00000000 0.3969553 0.2335658
## extro 0.40286143 0.39695528 1.0000000 0.6701736
## time 0.66140131 0.23356578 0.6701736 1.0000000
# a first graphical overview
# pairs generates multiple scatterplots
# panel defines content of windows
# text inserts the values in specified column (sex) as symbol at coordinates
# 0 are values of women, 1 of men in this example
pairs(cbind(time,age,extro),panel=function(x,y) text(x,y,sex))
detach(ddf)
[soution:end:cw1]
# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
# adapt model with two explanatory variables extro and sex
m.r <- lm(time ~ extro + sex, data=ddf)
# summary of model
summary(m.r)
##
## Call:
## lm(formula = time ~ extro + sex, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.193 -9.474 -2.149 10.165 23.918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.6797 4.4365 3.534 0.001118 **
## extro 0.5093 0.1151 4.423 8.24e-05 ***
## sex 19.1801 4.4727 4.288 0.000124 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.95 on 37 degrees of freedom
## Multiple R-squared: 0.632, Adjusted R-squared: 0.6121
## F-statistic: 31.77 on 2 and 37 DF, p-value: 9.284e-09
# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.r <- lm(time ~ extro + sex)
m.s <- lm(scale(time) ~ scale(extro) + scale(sex))
require(QuantPsyc)
## Loading required package: QuantPsyc
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
##
## logit
## Loading required package: MASS
##
## Attaching package: 'QuantPsyc'
## The following object is masked from 'package:base':
##
## norm
# use this to get raw and standardized weights
coefficients(m.r)
## (Intercept) extro sex
## 15.679695 0.509257 19.180128
coefficients(m.s)
## (Intercept) scale(extro) scale(sex)
## -8.124208e-17 4.819377e-01 4.672472e-01
lm.beta(m.r)
## extro sex
## 0.4819377 0.4672472
detach(ddf)
detach("package:QuantPsyc", unload=TRUE)
Zur Beurteilung der einzelnen Prädiktoren und Koeffizienten helfen die Konfidenzintervalle.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# generate model
m.e_s <- lm(time ~ extro + sex)
# get the coefficients
m.e_s$coefficients
## (Intercept) extro sex
## 15.679695 0.509257 19.180128
# and their confidence intervals 95% range
confint(m.e_s, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 6.6905865 24.6688032
## extro 0.2759686 0.7425455
## sex 10.1175468 28.2427083
detach(ddf)
Der nicht-ggplot Weg für das Modell time ~ extro + sex
.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
#Fig 4.2
# vergleichend die beiden Regressionsgeraden für Männer und Frauen
# Grafikumgebung zurücksetzen
par(mfrow=c(1,1))
# einen Vektor mit 'm' für Männer und 'f' für Frauen erstellen
Sex<-rep("m",length(sex))
Sex[sex==0]<-"f"
# einen Vektor mit 50 Elementen erstellen die in regelmäßigen Abständen zwischen 0 und 90 liegen
x<-seq(0,90,length=50)
# die Geradengleichungen mit derselben Steigung und verschiedenem Y-Achsenabschnitt
ym<-15.68+19.18+0.51*x
yf<-15.68+0.51*x
# und nun der plot
plot(ddf$time ~ ddf$extro, # plotte Grundmodell
xlab="Extroversion score", # X-Achsen-Beschriftung
ylab="Time spent looking after car (mins)",type="n") # Y-Achsen-Beschriftung
text(ddf$extro,ddf$time,labels=Sex) # an die Koordinaten den Text (Buchstaben) ausgeben
lines(x,ym,lty=1) # Regressionsgerade für Männer einzeichnen, Linienart festlegen
lines(x,yf,lty=2) # dito für Frauen
legend("topleft",c("Male","Female"), # eine Legende in die linke, obere Ecke
lty=1:2) # zugeordneter Linientyp in der Legende, der dem der Regressionsgeraden entspricht
detach(ddf)
ggplot und verschiedene Scatterplots.
Scatterplot mit unterschiedlichen Werten für Geschlecht und den beiden Regressionsgeraden für das jeweilige Geschlecht
scale_fill_manual(values=c_colors) +
scale_fill_manual(values=cbPalette)
scale_colour_manual(values=cbPalette)
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# use ggplot2
require("ggplot2")
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# raw scatterplot of extro and time
pplot <- ggplot(ddf, aes(x=extro, y=time))
pplot +
geom_point()
# gender specifity
# sex must have type factor to be used adequately
f.sex <- factor(sex, levels = c(0,1), labels = c("f", "m"))
# create base plot object
pplot <- ggplot(ddf, aes(x=extro, y=time))
# add a geom_point to base plot object
pplot +
geom_point(aes(color=f.sex, shape=f.sex))
# include global regression line
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
stat_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
# include global regression line with confidence interval
pplot <- ggplot(ddf, aes(x=extro, y=time))
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
stat_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# include group specific regression lines without differences in slope (linear model without interaction)
# for this we need the coefficients of the model to plot
m.e_s <- lm(time ~ extro + sex)
# create base plot object
pplot <- ggplot(ddf, aes(x=extro, y=time))
# color management
cbPalette = c('blue', 'red')
# add scatterplot and gender specific regression lines as estimated by model above
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
scale_color_manual(values=cbPalette) +
geom_abline(intercept = m.e_s$coefficients['(Intercept)'], slope = m.e_s$coefficients['extro'], color=cbPalette[1]) +
geom_abline(intercept = m.e_s$coefficients['(Intercept)'] + m.e_s$coefficients['sex'], slope = m.e_s$coefficients['extro'], color=cbPalette[2])
# include group specific regression lines - this is NOT the statistical model 'time~extro+sex' discussed here
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
stat_smooth(aes(fill = f.sex), method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# include both regression lines
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
stat_smooth(method = "lm", se=FALSE) +
stat_smooth(aes(fill = f.sex, color=f.sex), method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# include both regression lines, global confidence interval
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
stat_smooth(method = "lm") +
stat_smooth(aes(fill = f.sex, color=f.sex), method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# include both regression lines and confidence intervals for gender
pplot +
geom_point(aes(color=f.sex, shape=f.sex)) +
stat_smooth(method = "lm", se=FALSE) +
stat_smooth(aes(fill = f.sex, color=f.sex), method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
detach(ddf)
Unstandardisierte und standardisierte Residuen. Mit rstandard()
kann man die Residuen standardisieren.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
# get residuals
resid(m.e_s)
## 1 2 3 4 5 6
## -9.2301034 21.2236114 -9.1610602 -3.4337580 -11.4666626 -11.9388916
## 7 8 9 10 11 12
## -5.6982089 19.8718445 0.1310412 14.0240803 -18.6147751 13.2236114
## 13 14 15 16 17 18
## -6.4152439 2.5198363 10.7555060 1.8533304 9.9685382 -20.7443734
## 19 20 21 22 23 24
## -15.1005051 -10.2074659 19.5518514 7.7648836 14.2050974 -5.8463213
## 25 26 27 28 29 30
## 12.1721927 -3.0129478 23.9180090 -6.7907792 -11.1560472 2.7143544
## 31 32 33 34 35 36
## -3.1005051 5.9500241 -1.2856456 -4.0870040 8.7925341 18.5888795
## 37 38 39 40
## -14.7578745 -11.7393605 2.7513825 -26.1930753
# or via
m.e_s$residuals
## 1 2 3 4 5 6
## -9.2301034 21.2236114 -9.1610602 -3.4337580 -11.4666626 -11.9388916
## 7 8 9 10 11 12
## -5.6982089 19.8718445 0.1310412 14.0240803 -18.6147751 13.2236114
## 13 14 15 16 17 18
## -6.4152439 2.5198363 10.7555060 1.8533304 9.9685382 -20.7443734
## 19 20 21 22 23 24
## -15.1005051 -10.2074659 19.5518514 7.7648836 14.2050974 -5.8463213
## 25 26 27 28 29 30
## 12.1721927 -3.0129478 23.9180090 -6.7907792 -11.1560472 2.7143544
## 31 32 33 34 35 36
## -3.1005051 5.9500241 -1.2856456 -4.0870040 8.7925341 18.5888795
## 37 38 39 40
## -14.7578745 -11.7393605 2.7513825 -26.1930753
# get standardized residuals
rstandard(m.e_s)
## 1 2 3 4 5 6
## -0.73224152 1.68206102 -0.74229499 -0.27553450 -0.90890330 -0.94626353
## 7 8 9 10 11 12
## -0.46619465 1.68001111 0.01043039 1.11182964 -1.49515277 1.04802717
## 13 14 15 16 17 18
## -0.51336399 0.20871681 0.87046630 0.15786030 0.79260765 -1.71799770
## 19 20 21 22 23 24
## -1.21488477 -0.83314458 1.54956809 0.64124615 1.12602975 -0.46578420
## 25 26 27 28 29 30
## 0.97177203 -0.23924967 1.93087032 -0.54495136 -0.89030477 0.21513572
## 31 32 33 34 35 36
## -0.24944572 0.47387092 -0.10189837 -0.32690695 0.71765629 1.47432914
## 37 38 39 40
## -1.16978961 -0.93096781 0.21813228 -2.08271256
# check residuals grafically
# raw residuals
plot(m.e_s$fitted.values, m.e_s$residuals,
ylab="Raw Residuals",
xlab="Fitted Washing Time",
main="Washing Time \npredicted by Extraversion and Gender")
abline(0, 0)
# standardized residuals - no much difference
plot(m.e_s$fitted.values, rstandard(m.e_s),
ylab="Standardized Residuals",
xlab="Fitted Washing Time",
main="Washing Time \npredicted by Extraversion and Gender")
abline(0, 0)
# plot response variable values against raw residuals
plot(time, m.e_s$residuals,
ylab="Raw Residuals",
xlab="Washing Time",
main="Washing Time \npredicted by Extraversion and Gender")
abline(0, 0)
# plot response variable values against standardized residuals
plot(time, rstandard(m.e_s),
ylab="Standardized Residuals",
xlab="Washing Time",
main="Washing Time \npredicted by Extraversion and Gender")
abline(0, 0)
#{} Hinweise, was man Residualplots entnehmen kann
detach(ddf)
Ein Plot der Residuen gegen die vorhergesagten Werte (vgl. Vorhersage) hilft bei der grafischen Beurteilung der Residuen.
Vorhersage eines einzelnen Kriteriumswertes (Vp 3) mit Prädiktor sex und extero (zu Fuß)
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
# store coefficients
coe <- m.e_s$coefficients
# calculate predicted (fitted) value of 3rd subject
e.time.3 <- coe['(Intercept)'] + coe['sex'] * sex[3] + coe['extro'] * extro[3]
e.time.3
## (Intercept)
## 42.16106
# predicted values are called fitted.values
m.e_s$fitted.values
## 1 2 3 4 5 6 7 8
## 55.23010 57.77639 42.16106 66.43376 31.46666 29.93889 16.69821 77.12816
## 9 10 11 12 13 14 15 16
## 62.86896 31.97592 39.61478 57.77639 65.41524 41.48016 19.24449 78.14667
## 17 18 19 20 21 22 23 24
## 35.03146 46.74437 48.10051 17.20747 30.44815 46.23512 58.79490 24.84632
## 25 26 27 28 29 30 31 32
## 23.82781 34.01295 47.08199 21.79078 51.15605 58.28565 48.10051 36.04998
## 33 34 35 36 37 38 39 40
## 58.28565 38.08700 17.20747 28.41112 56.75787 55.73936 56.24862 53.19308
# müßte gleich sein dem 3. Wert in den Vorhersagewerten
# should be equal to 3 pos of fitted (predicted) values
m.e_s$fitted.values[3]
## 3
## 42.16106
# residual for this subject is difference between real value and predicted value
time[3] - e.time.3
## (Intercept)
## -9.16106
# compare with residuals vector pos 3
m.e_s$residuals
## 1 2 3 4 5 6
## -9.2301034 21.2236114 -9.1610602 -3.4337580 -11.4666626 -11.9388916
## 7 8 9 10 11 12
## -5.6982089 19.8718445 0.1310412 14.0240803 -18.6147751 13.2236114
## 13 14 15 16 17 18
## -6.4152439 2.5198363 10.7555060 1.8533304 9.9685382 -20.7443734
## 19 20 21 22 23 24
## -15.1005051 -10.2074659 19.5518514 7.7648836 14.2050974 -5.8463213
## 25 26 27 28 29 30
## 12.1721927 -3.0129478 23.9180090 -6.7907792 -11.1560472 2.7143544
## 31 32 33 34 35 36
## -3.1005051 5.9500241 -1.2856456 -4.0870040 8.7925341 18.5888795
## 37 38 39 40
## -14.7578745 -11.7393605 2.7513825 -26.1930753
m.e_s$residuals[3]
## 3
## -9.16106
detach(ddf)
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex)
# predicted or fitted values may be obtained by
m.e_s$fitted.values
## 1 2 3 4 5 6 7 8
## 55.23010 57.77639 42.16106 66.43376 31.46666 29.93889 16.69821 77.12816
## 9 10 11 12 13 14 15 16
## 62.86896 31.97592 39.61478 57.77639 65.41524 41.48016 19.24449 78.14667
## 17 18 19 20 21 22 23 24
## 35.03146 46.74437 48.10051 17.20747 30.44815 46.23512 58.79490 24.84632
## 25 26 27 28 29 30 31 32
## 23.82781 34.01295 47.08199 21.79078 51.15605 58.28565 48.10051 36.04998
## 33 34 35 36 37 38 39 40
## 58.28565 38.08700 17.20747 28.41112 56.75787 55.73936 56.24862 53.19308
# or alternatively by
predict(m.e_s)
## 1 2 3 4 5 6 7 8
## 55.23010 57.77639 42.16106 66.43376 31.46666 29.93889 16.69821 77.12816
## 9 10 11 12 13 14 15 16
## 62.86896 31.97592 39.61478 57.77639 65.41524 41.48016 19.24449 78.14667
## 17 18 19 20 21 22 23 24
## 35.03146 46.74437 48.10051 17.20747 30.44815 46.23512 58.79490 24.84632
## 25 26 27 28 29 30 31 32
## 23.82781 34.01295 47.08199 21.79078 51.15605 58.28565 48.10051 36.04998
## 33 34 35 36 37 38 39 40
## 58.28565 38.08700 17.20747 28.41112 56.75787 55.73936 56.24862 53.19308
detach(ddf)
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
# generate obeservation number
nr <- 1:length(time)
# get the most extreme observations: standardized residuals > 2 or < -2 (two extremes)
extremes <- nr[rstandard(m.e_s) > 2 | rstandard(m.e_s) < -2]
print(paste('extreme standardized residuals in observation: ', extremes))
## [1] "extreme standardized residuals in observation: 40"
#extremes
ddf[extremes,]
## sex age extro time
## 40 1 25 36 27
# find observations with cooks distance > 1
extremes <- nr[cooks.distance(m.e_s) > 0.5]
print(paste('extreme cooks distances in observation: ', extremes))
## [1] "extreme cooks distances in observation: "
# extremes
ddf[extremes,]
## [1] sex age extro time
## <0 rows> (or 0-length row.names)
{}
## NULL
# insert extreme values to show effect
c.d.mod <- ddf
to.mod <- sample(length(time), 2)
c.d.mod$extro[to.mod] <- c.d.mod$extro[to.mod] * 2
mm.e_s <- lm(time ~ extro + sex, data=c.d.mod)
cooks.distance(mm.e_s)[cooks.distance(mm.e_s) > 0.1]
## 5 8
## 0.1202940 0.1940179
detach(ddf)
Berechnung mit durbinWatsonTest()
or dwt()
.
Daumenregel: Je näher an 2, desto besser. Werte unter 1 und über 3 sollten aufmerksam machen.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex)
require(car)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:boot':
##
## logit
## The following object is masked from 'package:psych':
##
## logit
# create and store linear model object
m.e_s <- lm(time ~ extro + sex)
dwt(m.e_s)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.149987 2.175587 0.552
## Alternative hypothesis: rho != 0
# or
durbinWatsonTest(m.e_s)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.149987 2.175587 0.572
## Alternative hypothesis: rho != 0
detach(ddf)
VIF (Variation Inflation Factor) oder tolerance. Sie sind ineinander überführbar: \(tolerance = \frac{1}{vif}\)
Empfehlungen bei VIF
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex)
# output vif
print(paste("vif: ", vif(m.e_s)))
## [1] "vif: 1.19374097731722" "vif: 1.19374097731722"
# output tolerance
print(paste("tolerance: ", 1 / vif(m.e_s)))
## [1] "tolerance: 0.83770266666004" "tolerance: 0.83770266666004"
# mean of vif
mean(vif(m.e_s))
## [1] 1.193741
detach(ddf)
Wenn man den plot()
Befehl auf ein lm-Objekt ausführt, bekommt man eine ganze Kette von Grafiken.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
attach(ddf)
m.e_s <- lm(time ~ extro + sex + age)
# get series of plots
plot(m.e_s)
detach(ddf)