Some background, theory etc.: lm_mult_background
Moderation analysis and interactions: lm_moderation
Parsimony of models: model comparisons based on parameter estimates obtained with the least-squares method
For the content-related conception of the PACS see body_comparison
The PACS/BMI data set that has already been introduced in detail, with the added variable body satisfaction (bodysatisf), which is meant to express satisfaction with one's own body. Its values vary between 0 and 100 and can be interpreted as percentage satisfaction (similar to the BIQ, Body Image Questionnaire).
The idea: predict bodysatisf from the individual PACS items, or: how much variance in bodysatisf can be explained by the PACS questions?
First, an impression of the relevant individual correlations.
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(psych)
## Lade nötiges Paket: psych
# an overview of relevant scales
#pairs.panels(df[,c("bodysatisf", "c_phys_app", "c_good_way", "c_dress", "grade", "c_figure")])
# gender specific
#pairs(cbind(df$c_figure, df$grade),panel=function(x,y) text(x,y,df$gender))
cor(df[,c("bodysatisf", "grade", "bmi")], df[,c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")])
## c_phys_app c_good_way c_dress c_bad_way c_figure
## bodysatisf 0.7511423 0.83906136 0.6966399 -0.8319948 0.809023784
## grade 0.2284217 -0.01805602 0.3034052 -0.2128557 -0.002494173
## bmi 0.2290078 0.24151014 0.3582303 -0.2283778 0.085188936
cor(df[,c("bodysatisf", "grade", "bmi")])
## bodysatisf grade bmi
## bodysatisf 1.00000000 -0.2351783 -0.00534248
## grade -0.23517829 1.0000000 0.43248817
## bmi -0.00534248 0.4324882 1.00000000
cor(df[,c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")])
## c_phys_app c_good_way c_dress c_bad_way c_figure
## c_phys_app 1.0000000 0.7675161 0.8699426 -0.8450965 0.6739475
## c_good_way 0.7675161 1.0000000 0.7911833 -0.8468327 0.7777649
## c_dress 0.8699426 0.7911833 1.0000000 -0.8790725 0.6922335
## c_bad_way -0.8450965 -0.8468327 -0.8790725 1.0000000 -0.8324704
## c_figure 0.6739475 0.7777649 0.6922335 -0.8324704 1.0000000
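With a single predictor, the Multiple R-squared of a regression is simply the squared bivariate correlation (e.g. 0.809² ≈ 0.654 for c_figure in the model below). A minimal sketch with simulated data (variable names are illustrative, not from the PACS data):

```r
# With one predictor, R-squared of the regression equals the squared
# correlation between predictor and criterion.
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
m <- lm(y ~ x)
r2.from.cor <- cor(x, y)^2
r2.from.lm  <- summary(m)$r.squared
all.equal(r2.from.cor, r2.from.lm)   # TRUE
```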
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# null model: we estimate only the constant, which is the mean of the response variable
m.0 <- lm(bodysatisf ~ 1, data = df)
# fit a model with one explanatory variable, c_figure
m.1 <- lm(bodysatisf ~ c_figure, data = df)
# summary of model
summary(m.1)
##
## Call:
## lm(formula = bodysatisf ~ c_figure, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.859 -9.735 1.141 7.326 22.191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.612 5.683 2.395 0.0235 *
## c_figure 12.049 1.654 7.283 6.27e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.76 on 28 degrees of freedom
## Multiple R-squared: 0.6545, Adjusted R-squared: 0.6422
## F-statistic: 53.05 on 1 and 28 DF, p-value: 6.268e-08
anova(m.0, m.1)
## Analysis of Variance Table
##
## Model 1: bodysatisf ~ 1
## Model 2: bodysatisf ~ c_figure
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 13191.0
## 2 28 4557.2 1 8633.7 53.047 6.268e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
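The F statistic of such a model comparison can be reproduced by hand from the two residual sums of squares: \(F = \frac{(RSS_0 - RSS_1)/\Delta df}{RSS_1 / df_{res}}\). A minimal sketch with simulated data (variable names are illustrative):

```r
# Reproduce the model-comparison F statistic from the residual sums of squares.
set.seed(2)
x <- rnorm(30)
y <- x + rnorm(30)
m0 <- lm(y ~ 1)   # null model: intercept only
m1 <- lm(y ~ x)   # one predictor
rss0 <- sum(residuals(m0)^2)
rss1 <- sum(residuals(m1)^2)
# one additional parameter, so delta df = 1
F.by.hand <- ((rss0 - rss1) / 1) / (rss1 / df.residual(m1))
all.equal(F.by.hand, anova(m0, m1)$F[2])   # TRUE
```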
# grade as additional predictor
m.2 <- lm(bodysatisf ~ c_figure + grade, data = df)
summary(m.2)
##
## Call:
## lm(formula = bodysatisf ~ c_figure + grade, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8523 -7.9442 -0.6461 9.6565 19.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.676 7.555 3.398 0.00212 **
## c_figure 12.041 1.547 7.786 2.26e-08 ***
## grade -7.501 3.340 -2.245 0.03313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
anova(m.0, m.2)
## Analysis of Variance Table
##
## Model 1: bodysatisf ~ 1
## Model 2: bodysatisf ~ c_figure + grade
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 13191.0
## 2 27 3840.1 2 9350.9 32.873 5.82e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.1, m.2)
## Analysis of Variance Table
##
## Model 1: bodysatisf ~ c_figure
## Model 2: bodysatisf ~ c_figure + grade
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 4557.2
## 2 27 3840.1 1 717.12 5.0421 0.03313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we want to take a look at the design matrix of our model
head(model.matrix(m.2))
## (Intercept) c_figure grade
## 1 1 1 2.02
## 2 1 5 1.00
## 3 1 5 2.03
## 4 1 3 1.00
## 5 1 1 1.00
## 6 1 4 1.00
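The design matrix is everything lm() needs: the least-squares coefficients solve the normal equations \(b = (X^TX)^{-1}X^Ty\). A sketch with simulated data (variable names are illustrative):

```r
# Least-squares coefficients from the design matrix via the normal equations.
set.seed(3)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- 1 + 2 * x1 - x2 + rnorm(30)
m  <- lm(y ~ x1 + x2)
X  <- model.matrix(m)
b  <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^-1 X'y
all.equal(as.vector(b), as.vector(coef(m)))   # TRUE
```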
# bmi_dichotom as additional predictor
m.3 <- lm(bodysatisf ~ c_figure + grade + bmi_dichotom, data = df)
summary(m.3)
##
## Call:
## lm(formula = bodysatisf ~ c_figure + grade + bmi_dichotom, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.011 -6.549 -1.282 8.161 15.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.254 8.565 5.050 2.94e-05 ***
## c_figure 9.040 1.638 5.519 8.58e-06 ***
## grade -9.328 2.946 -3.166 0.00392 **
## bmi_dichotomnormal -15.734 4.955 -3.176 0.00383 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.32 on 26 degrees of freedom
## Multiple R-squared: 0.7902, Adjusted R-squared: 0.766
## F-statistic: 32.65 on 3 and 26 DF, p-value: 5.714e-09
anova(m.0, m.3)
## Analysis of Variance Table
##
## Model 1: bodysatisf ~ 1
## Model 2: bodysatisf ~ c_figure + grade + bmi_dichotom
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 13191.0
## 2 26 2766.9 3 10424 32.651 5.714e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.1, m.3)
## Analysis of Variance Table
##
## Model 1: bodysatisf ~ c_figure
## Model 2: bodysatisf ~ c_figure + grade + bmi_dichotom
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 28 4557.2
## 2 26 2766.9 2 1790.3 8.4115 0.001524 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.2, m.3)
## Analysis of Variance Table
##
## Model 1: bodysatisf ~ c_figure + grade
## Model 2: bodysatisf ~ c_figure + grade + bmi_dichotom
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 27 3840.1
## 2 26 2766.9 1 1073.2 10.085 0.003827 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we want to take a look at the design matrix of our model
head(model.matrix(m.3))
## (Intercept) c_figure grade bmi_dichotomnormal
## 1 1 1 2.02 1
## 2 1 5 1.00 0
## 3 1 5 2.03 0
## 4 1 3 1.00 0
## 5 1 1 1.00 0
## 6 1 4 1.00 0
The omnibus F-test (F-statistic) tests the hypothesis that all regression coefficients are equal to 0, i.e. whether the linear combination of predictors explains a significant proportion of the variance in the criterion.
Multiple R-squared: the proportion of the criterion's variance that is bound/explained by the combination of predictors.
Adjusted R-squared: rescales R-squared so that the number of explanatory terms in the model is taken into account. Unlike R-squared, adjusted R-squared only increases if the new term improves the model by more than would be expected by chance. Adjusted R-squared can be negative and is always <= R-squared.
Under Coefficients we find the unstandardized regression weights with their standard errors, together with a t-test of the individual effects (t values are Estimate / Std. Error).
The unstandardized weights serve to compute predicted values on the scale of the original variables.
p is the number of regressors in the linear model (without the constant); n is the sample size.
Standardized weights (\(\beta\)-weights) are comparable with one another.
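Adjusted R-squared can be computed from R-squared, n and p as \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\). A sketch with simulated data (variable names are illustrative):

```r
# Adjusted R-squared from R-squared, n and p by hand.
set.seed(4)
n  <- 30
p  <- 2                      # number of predictors (without the constant)
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- x1 + rnorm(n)
s  <- summary(lm(y ~ x1 + x2))
r2.adj.by.hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(r2.adj.by.hand, s$adj.r.squared)   # TRUE
```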
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# standardized beta-weights are weights of standardized variables
# standardize variables - the hard way
s.bodysatisf <- (df$bodysatisf - mean(df$bodysatisf)) / sd(df$bodysatisf)
s.c_figure <- (df$c_figure - mean(df$c_figure)) / sd(df$c_figure)
s.grade <- (df$grade - mean(df$grade)) / sd(df$grade)
# compare summaries
# unstandardized
summary(lm(df$bodysatisf ~ df$c_figure + df$grade))
##
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8523 -7.9442 -0.6461 9.6565 19.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.676 7.555 3.398 0.00212 **
## df$c_figure 12.041 1.547 7.786 2.26e-08 ***
## df$grade -7.501 3.340 -2.245 0.03313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
# standardized predictive variables
summary(lm(df$bodysatisf ~ s.c_figure + s.grade))
##
## Call:
## lm(formula = df$bodysatisf ~ s.c_figure + s.grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8523 -7.9442 -0.6461 9.6565 19.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.367 2.177 23.591 < 2e-16 ***
## s.c_figure 17.242 2.215 7.786 2.26e-08 ***
## s.grade -4.973 2.215 -2.245 0.0331 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
# standardized predicted and explanatory variables
summary(lm(s.bodysatisf ~ s.c_figure + s.grade))
##
## Call:
## lm(formula = s.bodysatisf ~ s.c_figure + s.grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21216 -0.37249 -0.03029 0.45277 0.90202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.607e-17 1.021e-01 0.000 1.0000
## s.c_figure 8.084e-01 1.038e-01 7.786 2.26e-08 ***
## s.grade -2.332e-01 1.038e-01 -2.245 0.0331 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5592 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
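Instead of refitting with standardized variables, the standardized weights can also be computed directly from the unstandardized ones via \(\beta_j = b_j \cdot sd(x_j)/sd(y)\). A sketch with simulated data (variable names are illustrative):

```r
# Standardized weights from unstandardized ones: beta = b * sd(x) / sd(y).
set.seed(5)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- x1 - 0.5 * x2 + rnorm(30)
m   <- lm(y ~ x1 + x2)
m.s <- lm(scale(y) ~ scale(x1) + scale(x2))
beta.by.hand <- coef(m)[-1] * c(sd(x1), sd(x2)) / sd(y)
all.equal(unname(beta.by.hand), unname(coef(m.s)[-1]))   # TRUE
```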
Standardizing (z-transforming) can be done more easily with scale()
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# show the equivalence
s.bodysatisf <- (df$bodysatisf - mean(df$bodysatisf)) / sd(df$bodysatisf)
head(cbind(df$bodysatisf, scale(df$bodysatisf), s.bodysatisf))
## s.bodysatisf
## [1,] 16 -1.6582684 -1.6582684
## [2,] 76 1.1550050 1.1550050
## [3,] 68 0.7799019 0.7799019
## [4,] 70 0.8736777 0.8736777
## [5,] 45 -0.2985196 -0.2985196
## [6,] 58 0.3110230 0.3110230
require(psych)
describe(cbind(df$bodysatisf, scale(df$bodysatisf), s.bodysatisf))
## vars n mean sd median trimmed mad min max range skew kurtosis
## 1 1 30 51.37 21.33 51.50 51.92 25.95 14.00 91.00 77.00 -0.17 -0.98
## 2 2 30 0.00 1.00 0.01 0.03 1.22 -1.75 1.86 3.61 -0.17 -0.98
## 3 3 30 0.00 1.00 0.01 0.03 1.22 -1.75 1.86 3.61 -0.17 -0.98
## se
## 1 3.89
## 2 0.18
## 3 0.18
# use this to get raw and standardized weights
m.fb <- lm(df$bodysatisf ~ df$c_figure + df$grade)
m.s.fb <- lm(scale(df$bodysatisf) ~ scale(df$c_figure) + scale(df$grade))
coefficients(m.fb)
## (Intercept) df$c_figure df$grade
## 25.675687 12.040666 -7.500898
coefficients(m.s.fb)
## (Intercept) scale(df$c_figure) scale(df$grade)
## -3.277394e-17 8.084422e-01 -2.331619e-01
summary(m.fb)
##
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8523 -7.9442 -0.6461 9.6565 19.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.676 7.555 3.398 0.00212 **
## df$c_figure 12.041 1.547 7.786 2.26e-08 ***
## df$grade -7.501 3.340 -2.245 0.03313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
summary(m.s.fb)
##
## Call:
## lm(formula = scale(df$bodysatisf) ~ scale(df$c_figure) + scale(df$grade))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21216 -0.37249 -0.03029 0.45277 0.90202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.277e-17 1.021e-01 0.000 1.0000
## scale(df$c_figure) 8.084e-01 1.038e-01 7.786 2.26e-08 ***
## scale(df$grade) -2.332e-01 1.038e-01 -2.245 0.0331 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5592 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
Or with lm.beta() from the QuantPsyc package (the lm.beta package provides a function of the same name and is used here, since QuantPsyc is not installed)
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(QuantPsyc)
## Lade nötiges Paket: QuantPsyc
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : es gibt kein Paket namens 'QuantPsyc'
require(lm.beta)
## Lade nötiges Paket: lm.beta
# use this to get raw and standardized weights
m.fb <- lm(df$bodysatisf ~ df$c_figure + df$grade)
m.s.fb <- lm(scale(df$bodysatisf) ~ scale(df$c_figure) + scale(df$grade))
coefficients(m.fb)
## (Intercept) df$c_figure df$grade
## 25.675687 12.040666 -7.500898
lm.beta(m.fb)
##
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
##
## Standardized Coefficients::
## (Intercept) df$c_figure df$grade
## NA 0.8084422 -0.2331619
coefficients(m.s.fb)
## (Intercept) scale(df$c_figure) scale(df$grade)
## -3.277394e-17 8.084422e-01 -2.331619e-01
summary(m.fb)
##
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8523 -7.9442 -0.6461 9.6565 19.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.676 7.555 3.398 0.00212 **
## df$c_figure 12.041 1.547 7.786 2.26e-08 ***
## df$grade -7.501 3.340 -2.245 0.03313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
Confidence intervals help in judging the individual predictors and coefficients. Confidence intervals for individual predictions are found under prediction.
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(QuantPsyc)
## Lade nötiges Paket: QuantPsyc
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : es gibt kein Paket namens 'QuantPsyc'
# use this to get raw and standardized weights
m.fb <- lm(df$bodysatisf ~ df$c_figure + df$grade)
# get the coefficients
coefficients(m.fb)
## (Intercept) df$c_figure df$grade
## 25.675687 12.040666 -7.500898
lm.beta(m.fb)
##
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
##
## Standardized Coefficients::
## (Intercept) df$c_figure df$grade
## NA 0.8084422 -0.2331619
# and their 95% confidence intervals; these are the confidence intervals of the coefficients
confint(m.fb, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 10.173147 41.1782270
## df$c_figure 8.867478 15.2138539
## df$grade -14.354990 -0.6468053
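confint() applies estimate ± t(1−α/2, df) · SE. A sketch reproducing it by hand, with simulated data (variable names are illustrative):

```r
# Reproduce confint(): estimate +/- t(0.975, df) * SE.
set.seed(6)
x <- rnorm(30)
y <- x + rnorm(30)
m   <- lm(y ~ x)
est <- coef(summary(m))[, "Estimate"]
se  <- coef(summary(m))[, "Std. Error"]
tcrit <- qt(0.975, df.residual(m))
ci.by.hand <- cbind(est - tcrit * se, est + tcrit * se)
all.equal(unname(ci.by.hand), unname(confint(m, level = 0.95)))   # TRUE
```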
# to get individual confidence intervals see predict
# help
# help(predict.lm) # commented, so it can be rendered
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
#attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade, data=df)
# store coefficients
coe <- m.2$coefficients
# calculate predicted (fitted) value of 3rd subject
e.bodysatisf.3 <- coe['(Intercept)'] + coe['c_figure'] * df$c_figure[3] + coe['grade'] * df$grade[3]
e.bodysatisf.3
## (Intercept)
## 70.65219
# predicted values are called fitted.values
# we get at them by using 'fitted.values'
m.2$fitted.values
## 1 2 3 4 5 6 7 8
## 22.56454 78.37812 70.65219 54.29679 30.21546 66.33745 43.27047 56.06122
## 9 10 11 12 13 14 15 16
## 56.88632 64.76226 40.23088 39.52002 71.85234 25.56490 42.25612 22.26450
## 17 18 19 20 21 22 23 24
## 54.29679 66.33745 54.29679 49.53544 66.33745 61.83691 25.71492 40.00585
## 25 26 27 28 29 30
## 37.60556 67.27679 54.29679 78.37812 78.37812 21.58942
# or by using 'predict()'
predict(m.2)
## 1 2 3 4 5 6 7 8
## 22.56454 78.37812 70.65219 54.29679 30.21546 66.33745 43.27047 56.06122
## 9 10 11 12 13 14 15 16
## 56.88632 64.76226 40.23088 39.52002 71.85234 25.56490 42.25612 22.26450
## 17 18 19 20 21 22 23 24
## 54.29679 66.33745 54.29679 49.53544 66.33745 61.83691 25.71492 40.00585
## 25 26 27 28 29 30
## 37.60556 67.27679 54.29679 78.37812 78.37812 21.58942
# should equal position 3 of the fitted (predicted) values
m.2$fitted.values[3]
## 3
## 70.65219
# we can use the design matrix to calculate the predicted values
head(cbind(model.matrix(m.2) %*% m.2$coefficients, predict(m.2)))
## [,1] [,2]
## 1 22.56454 22.56454
## 2 78.37812 78.37812
## 3 70.65219 70.65219
## 4 54.29679 54.29679
## 5 30.21546 30.21546
## 6 66.33745 66.33745
# residual for this subject is difference between real value and predicted value
df$bodysatisf[3] - e.bodysatisf.3
## (Intercept)
## -2.652195
# compare with residuals vector pos 3
m.2$residuals
## 1 2 3 4 5 6
## -6.5645397 -2.3781193 -2.6521947 15.7032127 14.7845447 -8.3374533
## 7 8 9 10 11 12
## -3.2704678 -1.0612236 -6.8863223 19.2377352 -0.2308789 13.4799810
## 13 14 15 16 17 18
## -25.8523384 -11.5648988 -20.2561213 10.7354962 -8.2967873 0.6625467
## 19 20 21 22 23 24
## 5.7032127 -4.5354426 6.6625467 14.1630853 -11.7149167 -19.0058520
## 25 26 27 28 29 30
## 7.3944352 6.7232092 2.7032127 -8.3781193 12.6218807 10.4105770
m.2$residuals[3]
## 3
## -2.652195
#detach(ddf)
# predict bodysatisf based on new data; the same predictor names are required, the response variable is not needed
data.new <- data.frame(
c_figure = c(1, 3, 5),
grade = c(1.3, 2.1, 3.2)
)
# calculate the predicted response for each new observation
predict(m.2, data.new)
## 1 2 3
## 27.96519 46.04580 61.87614
# we might need the standard error to build a confidence interval
predict(m.2, data.new, se.fit=T)
## $fit
## 1 2 3
## 27.96519 46.04580 61.87614
##
## $se.fit
## 1 2 3
## 4.083912 2.742181 6.446045
##
## $df
## [1] 27
##
## $residual.scale
## [1] 11.92586
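The se.fit values can be combined with the t quantile to form a confidence interval for the expected response; predict() with interval = "confidence" does the same. A sketch with simulated data (variable names are illustrative):

```r
# Build the confidence interval of the expected response from se.fit.
set.seed(7)
d <- data.frame(x = rnorm(30))
d$y <- d$x + rnorm(30)
m   <- lm(y ~ x, data = d)
new <- data.frame(x = c(-1, 0, 1))
p   <- predict(m, new, se.fit = TRUE)
ci  <- predict(m, new, interval = "confidence")
lwr.by.hand <- p$fit - qt(0.975, p$df) * p$se.fit
upr.by.hand <- p$fit + qt(0.975, p$df) * p$se.fit
all.equal(unname(lwr.by.hand), unname(ci[, "lwr"]))   # TRUE
```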
rm(data.new, m.2, e.bodysatisf.3, coe)
The ggplot way for the model bodysatisf ~ c_figure + grade.
ggplot only draws two-dimensional graphs. We can, however, plot the individual regression lines and visualize the third dimension via the color of the points. With categorical predictors, coloring by group membership or multiple (faceted) plots can also help.
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# fit a model with two explanatory variables, c_figure and grade
m.r <- lm(bodysatisf ~ c_figure + grade, data = df)
require(ggplot2)
## Lade nötiges Paket: ggplot2
##
## Attache Paket: 'ggplot2'
## Die folgenden Objekte sind maskiert von 'package:psych':
##
## %+%, alpha
# only 2-dimensional regression line is possible
# we begin drawing regression of bodysatisf on c_figure
ggplot(data=df, mapping=aes(c_figure, bodysatisf, color=grade)) + geom_smooth(method=lm) + geom_point()
## `geom_smooth()` using formula 'y ~ x'
# ... and continue with drawing regression of bodysatisf on grade
ggplot(data=df, mapping=aes(grade, bodysatisf, color=c_figure)) + geom_smooth(method=lm) + geom_point()
## `geom_smooth()` using formula 'y ~ x'
# dichotomous grouping variable
m.r <- lm(bodysatisf ~ c_figure + bmi_dichotom, data = df)
ggplot(data=df, mapping=aes(c_figure, bodysatisf, color=bmi_dichotom)) + geom_smooth(method=lm) + geom_point()
## `geom_smooth()` using formula 'y ~ x'
# multiple plots
m.r <- lm(bodysatisf ~ c_figure + grade + bmi_dichotom, data = df)
ggplot(data=df, mapping=aes(c_figure, bodysatisf, color=grade)) + geom_smooth(method=lm) + geom_point() + facet_wrap(~ bmi_dichotom)
## `geom_smooth()` using formula 'y ~ x'
#ggplot(data=df, mapping=aes(x=c_figure, y=bodysatisf)) + geom_point() + facet_wrap(~bmi_dichotom)
rm(m.r)
With the rockchalk library
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.a <- lm(bodysatisf ~ c_figure + grade, data=df)
require(rockchalk)
## Lade nötiges Paket: rockchalk
plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade")
# various angles
plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade", theta = 0, phi = 10)
plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade", drawArrows = T, theta=0, phi=10)
plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade", drawArrows = T)
# for an interaction view see lm_moderation
The purpose is to look for conspicuous patterns in the plot of the residuals.
Conspicuous patterns point to systematic influences that are not yet captured by the model. We look for deviations of the residuals from a normal distribution with mean 0.
For inspection, a plot of the predicted values against the (possibly standardized) residuals is suitable.
Plots: residual plot and graphical check of the residuals' normality with ggplot()
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.2 <- lm(bodysatisf ~ c_figure + grade, data=df)
# we plot predicted/fitted values against standardized residuals
ggplot(data=df) + geom_point(aes(x=m.2$fitted.values, y=rstandard(m.2))) + geom_hline(yintercept = 0)
# we plot observed residuals against the residuals expected theoretically under normality
ggplot(df, aes(sample = m.2$residuals)) + stat_qq() + geom_abline(intercept=0, slope=sd(m.2$residuals))
# or scaled (standardized residuals have sd 1, so the reference line has slope 1)
ggplot(df, aes(sample = scale(m.2$residuals))) + stat_qq() + geom_abline(intercept=0, slope=1)
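rstandard() divides each raw residual by its estimated standard deviation, \(e_i / (\hat{\sigma}\sqrt{1 - h_{ii}})\). A sketch with simulated data (variable names are illustrative):

```r
# rstandard() by hand: residual / (sigma * sqrt(1 - leverage)).
set.seed(8)
x <- rnorm(30)
y <- x + rnorm(30)
m <- lm(y ~ x)
sigma.hat  <- summary(m)$sigma
h          <- hatvalues(m)
rs.by.hand <- residuals(m) / (sigma.hat * sqrt(1 - h))
all.equal(unname(rs.by.hand), unname(rstandard(m)))   # TRUE
```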
A linear trend, visualized, together with the residuals. Everything that the linear trend explains has been removed from the residuals.
# some trend
t <- rnorm(100,0, 15)
y <- t + rnorm(100,0,5) * -1 + ((t + rnorm(100, 0, 5))^2) / 20
#x <- ((t + rnorm(100, 0, 5))^2)*10
x <- (t + rnorm(100,0,2)) / 500
#x <- t^2
cor(cbind(t, y, x))
## t y x
## t 1.0000000 0.6075833 0.9911027
## y 0.6075833 1.0000000 0.6030845
## x 0.9911027 0.6030845 1.0000000
m.xy <- lm(y~x)
# check residuals graphically
require("ggplot2")
require("gridExtra")
## Lade nötiges Paket: gridExtra
# create scatterplot
ddf.t <- data.frame(x=x, y=y, pred=m.xy$fitted.values, std.res=rstandard(m.xy))
p1 <- ggplot(ddf.t, aes(x=x, y=y)) + geom_point() + stat_smooth(method = "lm", se=FALSE)
p2 <- ggplot(ddf.t, aes(x=pred, y=std.res)) + geom_point()
#p1 <- ggplot(dfbmi, x=nation, y=bmi, aes(nation, bmi, fill = gender))
grid.arrange(p1,p2, nrow=2)
## `geom_smooth()` using formula 'y ~ x'
This is how it could look when the residuals grow larger as the predicted values increase.
# random data
x <- 1:100
y <- x / 10 * rnorm(100, 0, 2)
m.xy <- lm(y ~ x)
# check residuals graphically
#plot(x, y)
# standardized residuals
plot(m.xy$fitted.values, rstandard(m.xy),
ylab="Standardized Residuals",
xlab="Fitted y",
main="Beispiel 1 Besonderheiten im Residualplot")
abline(0, 0)
In this example the residual plot suggests that a predictor not yet included in the model has a systematic influence. Alternatively, it could be that an already included predictor should perhaps enter the model squared (additionally or instead).
# some quadratic trend
y <- rnorm(100, 100, 15)
x <- y ^ 2 + rnorm(100, 0, 100)
m.xy <- lm(y ~ x)
# check residuals graphically
# standardized residuals
plot(m.xy$fitted.values, rstandard(m.xy),
ylab="Standardized Residuals",
xlab="Fitted y",
main="Beispiel 2 Besonderheiten im Residualplot")
abline(0, 0)
The importance of outliers becomes very clear with the so-called Anscombe's quartet:
Mean of x in each case: 9 (exact)
Variance of x in each case: 11 (exact)
Mean of y in each case: 7.50 (to 2 decimal places)
Variance of y in each case: 4.122 or 4.127 (to 3 decimal places)
Correlation between x and y in each case: 0.816 (to 3 decimal places)
Linear regression line in each case: y = 3.00 + 0.500x (to 2 and 3 decimal places, respectively)
All x-y correlations are equally high, yet the data are completely different.
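Anscombe's quartet ships with base R as the anscombe data frame in the datasets package, so the identical summary statistics can be checked directly:

```r
# Anscombe's quartet is a built-in data frame (datasets::anscombe).
data(anscombe)
# the means of x1..x4 are all exactly 9
sapply(anscombe[paste0("x", 1:4)], mean)
# all four correlations are about 0.816
cors <- sapply(1:4, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
round(cors, 3)
```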
## vars n mean sd median trimmed mad min max range skew kurtosis se
## x1 1 11 9 3.32 9 9 4.45 4 14 10 0.00 -1.53 1
## x2 2 11 9 3.32 9 9 4.45 4 14 10 0.00 -1.53 1
## x3 3 11 9 3.32 9 9 4.45 4 14 10 0.00 -1.53 1
## x4 4 11 9 3.32 8 8 0.00 8 19 11 2.47 4.52 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## y1 1 11 7.5 2.03 7.58 7.49 1.82 4.26 10.84 6.58 -0.05 -1.20 0.61
## y2 2 11 7.5 2.03 8.14 7.79 1.47 3.10 9.26 6.16 -0.98 -0.51 0.61
## y3 3 11 7.5 2.03 7.11 7.15 1.53 5.39 12.74 7.35 1.38 1.24 0.61
## y4 4 11 7.5 2.03 7.04 7.20 1.90 5.25 12.50 7.25 1.12 0.63 0.61
## x1*y1: 0.816
## x2*y2: 0.816
## x3*y3: 0.816
## x4*y4: 0.817
## (Intercept) y1
## -0.9975311 1.3328426
## (Intercept) y2
## -0.9948419 1.3324841
## (Intercept) y3
## -1.000315 1.333375
## (Intercept) y4
## -1.003640 1.333657
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
All of the functions listed below need only a fitted model object as their argument.
Outliers:
resid()
raw residuals
rstandard()
Standardized difference between the observed values (response variable) and the predicted values (fitted values); suggested limit: 2
rstudent()
Difference from the predictions of the adjusted model (the model refit with the respective observation excluded), standardized; suggested limit: 2
Influential cases:
Cook's distance: cooks.distance()
A global influence measure: the influence of a single observation on the model as a whole. Suggested limit: 1
dfbeta: dfbeta()
Shows the differences in the model coefficients when an observation is included vs. excluded.
dffits: dffits()
Difference between the fitted values and the adjusted fitted values. Adjusted fitted values are the predictions for an observation when the model was fit without that observation. Idea: an observation is 'unimportant' if this difference is small, i.e. if the model coefficients change only little because of the observation.
leverage, hat values: hatvalues()
The contribution of an observation to the predictor side: leverage is the proportion of the total sum of squares of the explanatory variable contributed by the i-th case. The response variable does not enter, so leverage says something about a case's contribution to the predicted values. It ranges between 0 (no influence) and 1 (complete influence); the average leverage is \(\frac{(p+1)}{n}\), with p = number of predictors and n = number of observations (subjects). Observations near the average leverage are 'unimportant'; influential observations have a leverage above two to three times the average. Leverage is based on how much the observation's value on the predictor variable differs from the mean of that predictor: the greater an observation's leverage, the more potential it has to be influential. For example, an observation at the mean of the predictor has no influence on the slope of the regression line regardless of its value on the criterion, while an observation that is extreme on the predictor can affect the slope greatly. For a single predictor, leverage (h) can be computed by standardizing the predictor, squaring the observation's standardized value, adding 1, and dividing by the number of observations. \(h_i > \frac{2p}{n}\) (p = number of predictors, n = sample size) is suggested as a large leverage.
covariance ratio: covratio()
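The leverage values are the diagonal of the hat matrix \(H = X(X^TX)^{-1}X^T\), and they sum to the number of estimated coefficients (p + 1). A sketch with simulated data (variable names are illustrative):

```r
# Leverage = diagonal of the hat matrix H = X (X'X)^-1 X'.
set.seed(9)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- x1 + x2 + rnorm(30)
m  <- lm(y ~ x1 + x2)
X  <- model.matrix(m)
H  <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(m)))   # TRUE
# the hat values sum to the number of coefficients (p + 1 = 3 here)
sum(hatvalues(m))
```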
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade)
# get the most extreme observations: standardized residuals > 2 or < -2 (both tails)
extremes <- subj[rstandard(m.2) > 2 | rstandard(m.2) < -2]
print(paste('extreme standardized residuals in observation: ', extremes))
## [1] "extreme standardized residuals in observation: 13"
#extremes
ddf[extremes,]
## subj name gender height weight grade c_phys_app c_good_way c_dress
## 13 13 Berking f 187 91 1.87 3 4 3
## c_bad_way c_figure filling_time bmi bmi_class bmi_dichotom bmi_normal
## 13 2 5 26.16 26.02305 high extreme FALSE
## i_c_bad_way pacs_sum pacs bodysatisf
## 13 4 19 3.8 46
# the suggested limit for Cook's distance would be 1
# here we already flag suspicious observations above 0.5
extremes <- subj[cooks.distance(m.2) > 0.5]
print(paste('extreme cooks distances in observation: ', extremes))
## [1] "extreme cooks distances in observation: "
# extremes
ddf[extremes,]
## [1] subj name gender height weight
## [6] grade c_phys_app c_good_way c_dress c_bad_way
## [11] c_figure filling_time bmi bmi_class bmi_dichotom
## [16] bmi_normal i_c_bad_way pacs_sum pacs bodysatisf
## <0 Zeilen> (oder row.names mit Länge 0)
# plot it
plot(m.2$fitted.values, cooks.distance(m.2))
# insert extreme values to show effect
ddf.mod <- ddf
to.mod <- c(20, sample(length(subj), 2))
print(paste('to modify: ', to.mod))
## [1] "to modify: 20" "to modify: 7" "to modify: 2"
ddf.mod[to.mod,]
## subj name gender height weight grade c_phys_app c_good_way c_dress
## 20 20 Chorvátc f 177 80 3.24 4 3 5
## 7 7 Valentin f 165 71 2.47 4 3 4
## 2 2 Müller m 176 54 1.00 3 4 3
## c_bad_way c_figure filling_time bmi bmi_class bmi_dichotom bmi_normal
## 20 1 4 113.48 25.53545 high extreme FALSE
## 7 2 3 26.00 26.07897 high extreme FALSE
## 2 1 5 26.00 17.43285 low extreme FALSE
## i_c_bad_way pacs_sum pacs bodysatisf
## 20 5 21 4.2 45
## 7 4 18 3.6 40
## 2 5 20 4.0 76
ddf.mod$c_figure[to.mod] <- ddf.mod$c_figure[to.mod] * 2
ddf.mod[to.mod,]
## subj name gender height weight grade c_phys_app c_good_way c_dress
## 20 20 Chorvátc f 177 80 3.24 4 3 5
## 7 7 Valentin f 165 71 2.47 4 3 4
## 2 2 Müller m 176 54 1.00 3 4 3
## c_bad_way c_figure filling_time bmi bmi_class bmi_dichotom bmi_normal
## 20 1 8 113.48 25.53545 high extreme FALSE
## 7 2 6 26.00 26.07897 high extreme FALSE
## 2 1 10 26.00 17.43285 low extreme FALSE
## i_c_bad_way pacs_sum pacs bodysatisf
## 20 5 21 4.2 45
## 7 4 18 3.6 40
## 2 5 20 4.0 76
m.2.m <- lm(bodysatisf ~ c_figure + grade, data=ddf.mod)
cooks.distance(m.2.m)[cooks.distance(m.2.m) > 0.1]
## 2 12 20
## 0.9270435 0.1431751 0.3458125
plot(m.2.m$fitted.values, cooks.distance(m.2.m))
rm(ddf.mod, to.mod, m.2.m)
detach(ddf)
Cook's distance plot with ggplot()
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade)
# again the model
#m.e_s <- lm(time ~ extro + sex, data=car.data)
#m.e_s <- lm(traitanx ~ bfagree + bfcon + bfext + bfneur + bfopen)
# we prepare adequate dataframe
ddf.plot = data.frame(subj = 1:length(subj), ddf[,c('bodysatisf', 'c_figure', 'grade')])
ddf.plot['fitted.values'] <- m.2$fitted.values
ddf.plot['studentized.residuals'] <- rstudent(m.2)
ddf.plot['cooks.distance'] <- cooks.distance(m.2)
ddf.plot['dfbeta'] <- dfbeta(m.2)
ddf.plot['dffits'] <- dffits(m.2)
ddf.plot['hatvalues'] <- hatvalues(m.2)
ddf.plot['covratio'] <- covratio(m.2)
# example: plot Cook's distance against subject number - you have to identify the person yourself
pplot <- ggplot(ddf.plot, aes(subj, cooks.distance, ymin=0, ymax=cooks.distance), title="Cook's Distance") +
geom_point() +
geom_linerange() +
scale_x_continuous("Observation Number") +
scale_y_continuous("Cook's distance")
#opts(title="Cook's Distance")
pplot
# get at observations with a Cook's distance above 0.5
extremes <- ddf.plot$subj[cooks.distance(m.2) > 0.5]
if(length(extremes)) {
print(paste('extreme cooks distances in observation: ', extremes))
ddf.plot[extremes,]
}
# example: plot hat values, i.e. leverage, against subject number
pplot <- ggplot(ddf.plot, aes(subj, hatvalues, ymin=0, ymax=hatvalues)) +
geom_point() +
geom_linerange() +
scale_x_continuous("Observation Number") +
scale_y_continuous("Hat values (leverage)")
#opts(title="Cook's Distance")
pplot
detach(ddf)
library(ggplot2)
library(ggplot2)
makes it possible to create very informative and, at the same time, visually appealing diagnostic plots.
Diagnostic plots with library(ggplot2)
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.2 <- lm(bodysatisf ~ c_figure + grade, data=ddf)
require(ggplot2)
# plot fitted values against residuals, aggregate smooth and show standard error
p1<-ggplot(m.2, aes(.fitted, .resid)) +
geom_point() +
stat_smooth(method="loess") +
geom_hline(yintercept=0, col="red", linetype="dashed") +
xlab("Fitted values")+ylab("Residuals") +
ggtitle("Residual vs Fitted Plot") +
theme_bw()
p1
## `geom_smooth()` using formula 'y ~ x'
# QQ plot to check normal distribution of residuals
p2<-ggplot(m.2, aes(qqnorm(.stdresid)[[1]], .stdresid)) +
geom_point(na.rm = TRUE) +
geom_abline() +
xlab("Theoretical Quantiles") +
ylab("Standardized Residuals") +
ggtitle("Normal Q-Q") +
theme_bw()
# to see the plot, uncomment next line. For some reason it is currently not rendered correctly by Knit HTML
#p2
# plot fitted values against sqrt(standardized residuals); the square root amplifies small differences relative to larger ones
p3<-ggplot(m.2, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3<-p3+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3<-p3+ylab(expression(sqrt("|Standardized residuals|")))
p3<-p3+ggtitle("Scale-Location")+theme_bw()
p3
## `geom_smooth()` using formula 'y ~ x'
# plot Cook's distance
p4<-ggplot(m.2, aes(seq_along(.cooksd), .cooksd))+geom_bar(stat="identity", position="identity")
p4<-p4+xlab("Obs. Number")+ylab("Cook's distance")
p4<-p4+ggtitle("Cook's distance")+theme_bw()
p4
# residuals vs leverage
p5<-ggplot(m.2, aes(.hat, .stdresid))+geom_point(aes(size=.cooksd), na.rm=TRUE)
p5<-p5+stat_smooth(method="loess", na.rm=TRUE)
p5<-p5+xlab("Leverage")+ylab("Standardized Residuals")
p5<-p5+ggtitle("Residual vs Leverage Plot")
p5<-p5+scale_size_continuous("Cook's Distance", range=c(1,5))
p5<-p5+theme_bw()+theme(legend.position="bottom")
p5
## `geom_smooth()` using formula 'y ~ x'
# plot leverage vs Cook's distance
p6<-ggplot(m.2, aes(.hat, .cooksd))+geom_point(na.rm=TRUE)+stat_smooth(method="loess", na.rm=TRUE)
p6<-p6+xlab("Leverage hii")+ylab("Cook's Distance")
p6<-p6+ggtitle("Cook's dist vs Leverage hii/(1-hii)")
p6<-p6+geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")
p6<-p6+theme_bw()
p6
## `geom_smooth()` using formula 'y ~ x'
Computation with durbinWatsonTest()
or dwt()
.
Rule of thumb: the closer to 2, the better. Values below 1 or above 3 should raise attention.
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade, data=ddf)
require(car)
## Lade nötiges Paket: car
## Lade nötiges Paket: carData
##
## Attache Paket: 'car'
## Das folgende Objekt ist maskiert 'package:psych':
##
## logit
dwt(m.2)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02889119 2.018337 0.934
## Alternative hypothesis: rho != 0
# or
durbinWatsonTest(m.2)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.02889119 2.018337 0.99
## Alternative hypothesis: rho != 0
VIF (Variance Inflation Factor) or tolerance. They can be converted into each other: \(tolerance = \frac{1}{vif}\)
The variance inflation factor (VIF) serves as a tool for detecting multicollinearity among the predictor variables of a model. The basic idea is to try to express one predictor variable through the remaining predictor variables. If this works well, i.e. the coefficient of determination \(R^2\) of this auxiliary regression is high and hence the VIF \(= \frac{1}{1 - R^2}\) is large, we may assume that the variable under scrutiny is (multi)collinear with the remaining predictor variables.
Recommendations for the VIF
We first check the VIFs of our body-comparison example.
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade, data=ddf)
require(car)
# create and store linear model object
#m.e_s <- lm(time ~ extro + sex)
# output vif
print(paste("vif: ", vif(m.2)))
## [1] "vif: 1.00000622093632" "vif: 1.00000622093632"
# output tolerance
print(paste("tolerance: ", 1 / vif(m.2)))
## [1] "tolerance: 0.999993779102384" "tolerance: 0.999993779102384"
# mean of vif
mean(vif(m.2))
## [1] 1.000006
As an example, we check the collinearity of a model in which body satisfaction is to be predicted from the individual PACS items.
# model with all explanatory variables pacs answers
m.1 <- lm(bodysatisf ~ c_phys_app + c_good_way + c_dress + c_bad_way + c_figure, data=ddf)
# output vif
print(paste("vif: ", vif(m.1)))
## [1] "vif: 4.72719520876748" "vif: 4.03488094015851" "vif: 6.06820748510201"
## [4] "vif: 8.80134563818313" "vif: 3.63561364134242"
# output tolerance
print(paste("tolerance: ", 1 / vif(m.1)))
## [1] "tolerance: 0.211541930433782" "tolerance: 0.247838787520881"
## [3] "tolerance: 0.164793310455367" "tolerance: 0.113618989766936"
## [5] "tolerance: 0.275056730073979"
# mean of vif
mean(vif(m.1))
## [1] 5.453449
If collinearity is too high, we should reconsider or change the composition of the (partly) redundant predictor variables. Another option, which is sometimes suggested (Everitt), is to combine the predictor variables into components of a PCA and to use these components instead of the original predictors in the multiple regression.
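A minimal sketch of this PCA-based remedy, using prcomp() from base R (the choice of two components here is purely illustrative, not a recommendation):

```r
# replace the collinear PACS items by principal components
pacs_items <- ddf[, c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")]
pca <- prcomp(pacs_items, scale. = TRUE)
summary(pca)                      # inspect explained variance per component
# regress bodysatisf on the first two components instead of the raw items
ddf.pca <- data.frame(bodysatisf = ddf$bodysatisf, pca$x[, 1:2])
m.pca <- lm(bodysatisf ~ PC1 + PC2, data = ddf.pca)
car::vif(m.pca)                   # components are orthogonal, so VIFs are 1
```

The price of this approach is interpretability: the coefficients now refer to components, not to the original PACS items.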
Running the plot()
command on an lm object produces a whole series of diagnostic graphics.
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
#m.3 <- lm(bodysatisf ~ c_figure + grade + gender, data=df)  # alternative model
m.3 <- lm(bodysatisf ~ c_figure + grade + bmi_dichotom, data=df)
# get series of plots
plot(m.3)
Model comparisons based on parameter estimates obtained with the least-squares method
Alternative approaches to parameter estimation and model comparison.
Depending on the data and the research question, adding interaction terms can make sense. This idea is presented in more detail under lm_moderation Rmd, html.
Here only the plots that ggplot()
offers for an exploratory approach. This is no longer the currently computed (purely additive) model bodysatisf ~ c_figure + grade
.
Idea and questions: - Is the prediction of bodysatisf different for women and men? - Is the prediction perhaps differently accurate? - More specifically: does the slope of the regression line differ depending on gender?
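These questions ultimately lead to a model with an interaction term (treated in detail in lm_moderation). A minimal sketch of how such a model could be specified, not the additive model computed above:

```r
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# c_figure * gender expands to c_figure + gender + c_figure:gender
m.int <- lm(bodysatisf ~ c_figure * gender, data = ddf)
summary(m.int)
# a substantial c_figure:genderm coefficient would indicate gender-specific slopes
```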
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require("ggplot2")
# raw scatterplot
pplot <- ggplot(ddf, aes(x=c_figure, y=bodysatisf))
pplot + geom_point()
# graphical look at gender specificity; gender must be of type factor to be used adequately, which is already the case here
ddf$gender
## [1] "m" "m" "f" "f" "f" "f" "f" "f" "f" "m" "f" "m" "f" "m" "m" "f" "f" "m" "m"
## [20] "f" "m" "m" "m" "m" "f" "m" "m" "m" "m" "f"
# plot values in different color and shape
pplot <- ggplot(ddf, aes(x=c_figure, y=bodysatisf))
# add a geom_point to base plot object
pplot + geom_point(aes(color=ddf$gender, shape=ddf$gender))
## Warning: Use of `ddf$gender` is discouraged. Use `gender` instead.
## Use of `ddf$gender` is discouraged. Use `gender` instead.
# include global regression line
pplot +
geom_point(aes(color=ddf$gender, shape=ddf$gender)) +
stat_smooth(method = "lm", se=FALSE)
## Warning: Use of `ddf$gender` is discouraged. Use `gender` instead.
## Use of `ddf$gender` is discouraged. Use `gender` instead.
## `geom_smooth()` using formula 'y ~ x'
# include group specific regression lines without differences in slope (linear model without interaction)
# for this we need the coefficients of the model to plot
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + gender, data=ddf)
#m.e_s <- lm(time ~ extro + sex)
# create base plot object
pplot <- ggplot(ddf, aes(x=c_figure, y=bodysatisf))
# color management
cbPalette = c('blue', 'red')
# add scatterplot and gender specific regression lines as estimated by model above
pplot +
geom_point(aes(color=gender, shape=gender)) +
scale_color_manual(values=cbPalette) +
geom_abline(intercept = m.2$coefficients['(Intercept)'], slope = m.2$coefficients['c_figure'], color=cbPalette[1]) +
geom_abline(intercept = m.2$coefficients['(Intercept)'] + m.2$coefficients['genderm'], slope = m.2$coefficients['c_figure'], color=cbPalette[2])
# include group specific regression lines - this is NOT the additive statistical model 'bodysatisf ~ c_figure + gender' discussed here
pplot +
geom_point(aes(color=gender, shape=gender)) +
stat_smooth(aes(fill = gender), method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# include both regression lines
pplot +
geom_point(aes(color=gender, shape=gender)) +
stat_smooth(method = "lm", se=FALSE) +
stat_smooth(aes(fill = gender, color=gender), method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# include both regression lines, global confidence interval
pplot +
geom_point(aes(color=gender, shape=gender)) +
stat_smooth(method = "lm") +
stat_smooth(aes(fill = gender, color=gender), method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
# include both regression lines and confidence intervals for gender
pplot +
geom_point(aes(color=gender, shape=gender)) +
stat_smooth(method = "lm", se=FALSE) +
stat_smooth(aes(fill = gender, color=gender), method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
The non-ggplot way for the model bodysatisf ~ c_figure + grade
.
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# fit model with the two explanatory variables c_figure and grade
m.r <- lm(bodysatisf ~ c_figure + grade, data = ddf)
summary(m.r)
##
## Call:
## lm(formula = bodysatisf ~ c_figure + grade, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.8523 -7.9442 -0.6461 9.6565 19.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.676 7.555 3.398 0.00212 **
## c_figure 12.041 1.547 7.786 2.26e-08 ***
## grade -7.501 3.340 -2.245 0.03313 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared: 0.7089, Adjusted R-squared: 0.6873
## F-statistic: 32.87 on 2 and 27 DF, p-value: 5.82e-08
coefficients(m.r)
## (Intercept) c_figure grade
## 25.675687 12.040666 -7.500898
# plot c_figure against bodysatisf
plot(ddf$bodysatisf ~ ddf$c_figure,
xlab="Intensity of comparing one's figure (c_figure)",
ylab="Body satisfaction (bodysatisf)")
abline(coefficients(m.r)[1:2])
# 3-d scatterplot
# require(rgl)
# plot3d(ddf$c_figure, ddf$bodysatisf, ddf$grade) # commented out to render
require(rockchalk)
plotPlane(model = m.r, plotx1 = "c_figure", plotx2 = "grade")
plotPlane(model = m.r, plotx1 = "c_figure", plotx2 = "grade", drawArrows = T)
… or with graphics from the base system
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
## Die folgenden Objekte sind maskiert von ddf (pos = 3):
##
## bmi, bmi_class, bmi_dichotom, bmi_normal, bodysatisf, c_bad_way,
## c_dress, c_figure, c_good_way, c_phys_app, filling_time, gender,
## grade, height, i_c_bad_way, name, pacs, pacs_sum, subj, weight
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade)
# standardized residuals
plot(m.2$fitted.values, rstandard(m.2),
ylab="Standardized Residuals",
xlab="Fitted y",
main="Plot of Residuals against predicted Values")
abline(0, 0)
# nothing suspicious to see
# test of normal distribution of residuals
qqnorm(m.2$residuals)
# a line may be added
abline(0,sd(m.2$residuals))
# test of normal distribution of residuals, now with standardized values
qqnorm(scale(m.2$residuals))
# a line may be added
abline(0,1)
“Both types of coefficient are ways to standardize the effect of x_1 in the milieu of other predictors.”
Illustrated with an example with two explanatory variables.
require(tibble)
## Lade nötiges Paket: tibble
tt <- tibble(
Y = c(14, 23, 30, 50, 39, 67), # memory span
X = c( 4, 4, 7, 7, 10, 10), # age
T = c( 1, 2, 2, 4, 3, 6) # speech rate
)
m <- lm(Y ~ X + T, data=tt)
summary(m)
##
## Call:
## lm(formula = Y ~ X + T, data = tt)
##
## Residuals:
## 1 2 3 4 5 6
## -1.167 -1.667 2.333 3.333 -1.167 -1.667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.667 3.598 0.463 0.67470
## X 1.000 0.725 1.379 0.26162
## T 9.500 1.087 8.736 0.00316 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.877 on 3 degrees of freedom
## Multiple R-squared: 0.9866, Adjusted R-squared: 0.9776
## F-statistic: 110.1 on 2 and 3 DF, p-value: 0.001559
cor(tt)
## Y X T
## Y 1.0000000 0.8027961 0.9889517
## X 0.8027961 1.0000000 0.7500000
## T 0.9889517 0.7500000 1.0000000
# Semipartial (part) correlation as increment in explained variance
# The squared semipartial correlation between X and Y is computed as
s.m <- summary(m)
s.m$r.squared - cor(tt$Y, tt$T)^2
## [1] 0.008528111
# ... this indicates that entering X last into the regression equation increases the squared multiple correlation R^2 by .0085
# the squared semipartial correlation between T and Y is
s.m$r.squared - cor(tt$Y, tt$X)^2
## [1] 0.342072
# ... T entered last increases multiple R^2 by 0.342072
# another way to calculate the same explained variance
sum1 <- summary(lm(Y ~ X + T, data=tt))
sum2 <- summary(lm(Y ~ T, data=tt))
sum1$r.squared - sum2$r.squared
## [1] 0.008528111
# Semipartial correlation as prediction from a residual
# squared semipartial correlation for explanatory variable X
m.X <- lm(X ~ T, data=tt)
r.X <- residuals(m.X)  # equivalently: residuals(lm(X ~ T, data=tt))
r.YT <- residuals(lm(Y ~ T, data=tt))
cor(tt$Y, r.X)^2
## [1] 0.008528111
# squared semipartial correlation for explanatory variable T
m.T <- lm(T ~ X, data=tt)
r.T <- residuals(m.T)
cor(tt$Y, r.T)^2
## [1] 0.342072
# multiple R^2 is the squared correlation of fitted values and the response variable
cor(tt$Y, m$fitted.values)^2
## [1] 0.9865536
# part of explained variance in Y that is not specific to one of the explanatory variables
cor(tt$Y, m$fitted.values)^2 - cor(tt$Y, r.X)^2 - cor(tt$Y, r.T)^2
## [1] 0.6359534
# we get the same coefficient for explanatory variable and residuals of explanatory variable
coefficients(lm(Y ~ X + T, data=tt))
## (Intercept) X T
## 1.666667 1.000000 9.500000
coefficients(lm(Y ~ r.X + T, data=tt))
## (Intercept) r.X T
## 5.291667 1.000000 10.625000
coefficients(lm(Y ~ X + r.T, data=tt))
## (Intercept) X r.T
## -3.083333 5.750000 9.500000
# we can even transform coefficients and partial coefficients into one another (take care: this is the correlation of the residuals, i.e. the partial correlation, not the semipartial coefficient)
m$coefficients[2] * sqrt(var(r.X)/var(r.YT))
## X
## 0.622969
cor(r.X, r.YT)
## [1] 0.622969
Work through the corresponding chapters in Field (2012).
Work through the corresponding chapters in Everitt (2010).
The data and the example are taken from the book [Everitt (2010), p. 84][everitt]
car_wash html
The data and the example are taken from the book [Everitt (2010), p. {}][everitt]
This is the introductory example, condensed and without explanatory, conceptual comments.
[lm_mult_example_course_data.html] - [lm_mult_example_course_data.Rmd]
require(psych)
#dd <- sat.act
dd <- bfi
#dd <- epi.bfi
dd[1:5,]
## A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
## 61617 2 4 3 4 4 2 3 3 4 4 3 3 3 4 4 3 4 2 2 3 3 6 3 4
## 61618 2 4 5 2 5 5 4 4 3 4 1 1 6 4 3 3 3 3 5 5 4 2 4 3
## 61620 5 4 5 4 4 4 5 4 2 5 2 4 4 4 5 4 5 4 2 3 4 2 5 5
## 61621 4 4 6 5 5 4 4 3 5 5 5 3 4 4 4 2 5 2 4 1 3 3 4 3
## 61622 2 3 3 4 5 4 4 5 3 2 2 2 5 4 5 2 3 4 4 3 3 3 4 3
## O5 gender education age
## 61617 3 1 NA 16
## 61618 3 2 NA 18
## 61620 2 2 NA 17
## 61621 5 2 NA 17
## 61622 3 1 NA 17
describe(dd)
## vars n mean sd median trimmed mad min max range skew
## A1 1 2784 2.41 1.41 2 2.23 1.48 1 6 5 0.83
## A2 2 2773 4.80 1.17 5 4.98 1.48 1 6 5 -1.12
## A3 3 2774 4.60 1.30 5 4.79 1.48 1 6 5 -1.00
## A4 4 2781 4.70 1.48 5 4.93 1.48 1 6 5 -1.03
## A5 5 2784 4.56 1.26 5 4.71 1.48 1 6 5 -0.85
## C1 6 2779 4.50 1.24 5 4.64 1.48 1 6 5 -0.85
## C2 7 2776 4.37 1.32 5 4.50 1.48 1 6 5 -0.74
## C3 8 2780 4.30 1.29 5 4.42 1.48 1 6 5 -0.69
## C4 9 2774 2.55 1.38 2 2.41 1.48 1 6 5 0.60
## C5 10 2784 3.30 1.63 3 3.25 1.48 1 6 5 0.07
## E1 11 2777 2.97 1.63 3 2.86 1.48 1 6 5 0.37
## E2 12 2784 3.14 1.61 3 3.06 1.48 1 6 5 0.22
## E3 13 2775 4.00 1.35 4 4.07 1.48 1 6 5 -0.47
## E4 14 2791 4.42 1.46 5 4.59 1.48 1 6 5 -0.82
## E5 15 2779 4.42 1.33 5 4.56 1.48 1 6 5 -0.78
## N1 16 2778 2.93 1.57 3 2.82 1.48 1 6 5 0.37
## N2 17 2779 3.51 1.53 4 3.51 1.48 1 6 5 -0.08
## N3 18 2789 3.22 1.60 3 3.16 1.48 1 6 5 0.15
## N4 19 2764 3.19 1.57 3 3.12 1.48 1 6 5 0.20
## N5 20 2771 2.97 1.62 3 2.85 1.48 1 6 5 0.37
## O1 21 2778 4.82 1.13 5 4.96 1.48 1 6 5 -0.90
## O2 22 2800 2.71 1.57 2 2.56 1.48 1 6 5 0.59
## O3 23 2772 4.44 1.22 5 4.56 1.48 1 6 5 -0.77
## O4 24 2786 4.89 1.22 5 5.10 1.48 1 6 5 -1.22
## O5 25 2780 2.49 1.33 2 2.34 1.48 1 6 5 0.74
## gender 26 2800 1.67 0.47 2 1.71 0.00 1 2 1 -0.73
## education 27 2577 3.19 1.11 3 3.22 1.48 1 5 4 -0.05
## age 28 2800 28.78 11.13 26 27.43 10.38 3 86 83 1.02
## kurtosis se
## A1 -0.31 0.03
## A2 1.05 0.02
## A3 0.44 0.02
## A4 0.04 0.03
## A5 0.16 0.02
## C1 0.30 0.02
## C2 -0.14 0.03
## C3 -0.13 0.02
## C4 -0.62 0.03
## C5 -1.22 0.03
## E1 -1.09 0.03
## E2 -1.15 0.03
## E3 -0.47 0.03
## E4 -0.30 0.03
## E5 -0.09 0.03
## N1 -1.01 0.03
## N2 -1.05 0.03
## N3 -1.18 0.03
## N4 -1.09 0.03
## N5 -1.06 0.03
## O1 0.43 0.02
## O2 -0.81 0.03
## O3 0.30 0.02
## O4 1.08 0.02
## O5 -0.24 0.03
## gender -1.47 0.01
## education -0.32 0.02
## age 0.56 0.21
Description of the data set: Link
# read data
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/div/affect.txt", fileEncoding = "UTF-8")
require("psych")
describe(ddf)
## vars n mean sd median trimmed mad min max range skew
## nr 1 330 165.50 95.41 165.5 165.50 122.31 1 330 329 0.00
## Study* 2 330 1.48 0.50 1.0 1.48 0.00 1 2 1 0.06
## Film 3 330 2.52 1.13 3.0 2.52 1.48 1 4 3 -0.03
## ext 4 330 13.16 4.47 13.0 13.38 4.45 0 22 22 -0.41
## neur 5 330 10.22 5.04 10.0 10.11 4.45 0 23 23 0.19
## imp 6 330 4.29 1.97 4.0 4.30 1.48 0 9 9 -0.02
## soc 7 330 7.51 2.86 8.0 7.75 2.97 0 13 13 -0.64
## lie 8 330 2.30 1.48 2.0 2.20 1.48 0 7 7 0.68
## traitanx 9 330 39.52 9.77 38.0 38.88 8.90 22 71 49 0.63
## state1 10 330 40.83 10.76 40.0 40.24 10.38 20 79 59 0.57
## EA1 11 330 9.19 7.11 8.0 8.67 8.90 0 29 29 0.52
## TA1 12 330 12.91 4.42 13.0 12.76 4.45 0 26 26 0.35
## PA1 13 330 8.89 6.79 8.0 8.28 7.41 0 29 29 0.70
## NA1 14 330 3.69 4.31 2.0 2.93 2.97 0 28 28 1.97
## EA2 15 330 11.01 6.85 11.0 10.82 7.41 0 29 29 0.21
## TA2 16 330 15.65 4.88 15.0 15.54 4.45 3 31 28 0.25
## PA2 17 330 8.98 6.43 8.0 8.52 7.41 0 29 29 0.53
## NA2 18 330 4.65 5.19 3.0 3.79 4.45 0 30 30 1.38
## state2 19 170 42.45 10.76 43.0 42.35 10.38 20 71 51 0.10
## MEQ 20 170 39.36 10.28 40.0 39.43 11.12 15 65 50 -0.06
## BDI 21 160 10.68 7.25 11.0 10.38 8.90 0 29 29 0.26
## kurtosis se
## nr -1.21 5.25
## Study* -2.00 0.03
## Film -1.38 0.06
## ext -0.15 0.25
## neur -0.44 0.28
## imp -0.67 0.11
## soc -0.13 0.16
## lie 0.38 0.08
## traitanx 0.16 0.54
## state1 0.17 0.59
## EA1 -0.65 0.39
## TA1 0.39 0.24
## PA1 -0.13 0.37
## NA1 5.29 0.24
## EA2 -0.79 0.38
## TA2 -0.13 0.27
## PA2 -0.39 0.35
## NA2 1.92 0.29
## state2 -0.45 0.83
## MEQ -0.38 0.79
## BDI -0.91 0.57
# examples from the psych dataset
describeBy(ddf, ddf$Film)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew
## nr 1 83 158.58 94.72 143.0 157.63 106.75 6 326 320 0.10
## Study* 2 83 1.51 0.50 2.0 1.51 0.00 1 2 1 -0.02
## Film 3 83 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN
## ext 4 83 13.40 4.35 13.0 13.43 4.45 2 22 20 -0.06
## neur 5 83 10.21 4.90 11.0 10.17 5.93 1 22 21 0.06
## imp 6 83 4.34 1.97 4.0 4.30 1.48 0 9 9 0.10
## soc 7 83 7.64 2.77 8.0 7.78 2.97 0 13 13 -0.42
## lie 8 83 2.36 1.50 2.0 2.30 1.48 0 7 7 0.42
## traitanx 9 83 39.91 9.60 39.0 39.55 10.38 23 61 38 0.31
## state1 10 83 39.81 10.75 38.0 39.22 10.38 20 69 49 0.56
## EA1 11 83 9.27 7.21 8.0 8.70 7.41 0 29 29 0.59
## TA1 12 83 12.49 4.80 12.9 12.35 4.30 0 26 26 0.37
## PA1 13 83 9.46 7.28 7.5 8.77 5.19 0 29 29 0.80
## NA1 14 83 3.50 3.73 2.0 2.94 2.97 0 17 17 1.41
## EA2 15 83 10.02 6.09 9.0 9.72 7.41 0 23 23 0.30
## TA2 16 83 17.48 4.74 18.0 17.55 4.45 8 27 19 -0.20
## PA2 17 83 7.27 5.93 6.0 6.63 5.93 0 23 23 0.78
## NA2 18 83 8.67 5.35 8.0 8.21 5.93 0 22 22 0.60
## state2 19 41 48.66 11.00 49.0 49.21 10.38 20 71 51 -0.45
## MEQ 20 41 38.39 11.49 40.0 38.00 13.34 16 65 49 0.26
## BDI 21 42 10.69 6.73 11.0 10.59 7.41 0 23 23 0.05
## kurtosis se
## nr -1.14 10.40
## Study* -2.02 0.06
## Film NaN 0.00
## ext -0.58 0.48
## neur -0.76 0.54
## imp -0.54 0.22
## soc -0.53 0.30
## lie -0.14 0.16
## traitanx -0.81 1.05
## state1 -0.17 1.18
## EA1 -0.48 0.79
## TA1 0.74 0.53
## PA1 -0.14 0.80
## NA1 1.90 0.41
## EA2 -0.60 0.67
## TA2 -0.77 0.52
## PA2 -0.25 0.65
## NA2 -0.43 0.59
## state2 0.01 1.72
## MEQ -0.40 1.79
## BDI -1.25 1.04
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew
## nr 1 78 164.23 94.57 174.5 164.98 114.16 8 316 308 -0.11
## Study* 2 78 1.47 0.50 1.0 1.47 0.00 1 2 1 0.10
## Film 3 78 2.00 0.00 2.0 2.00 0.00 2 2 0 NaN
## ext 4 78 12.38 4.55 12.0 12.47 2.97 1 21 20 -0.21
## neur 5 78 9.88 4.61 10.0 9.80 4.45 0 21 21 0.17
## imp 6 78 3.88 2.05 4.0 3.86 1.48 0 8 8 0.15
## soc 7 78 7.15 2.71 7.5 7.31 2.22 1 12 11 -0.49
## lie 8 78 2.15 1.47 2.0 2.02 1.48 0 7 7 0.97
## traitanx 9 78 39.87 10.14 38.0 39.33 10.38 22 71 49 0.69
## state1 10 78 41.44 10.41 41.0 41.03 10.38 22 79 57 0.67
## EA1 11 78 8.40 6.54 7.5 7.97 8.15 0 26 26 0.45
## TA1 12 78 12.62 3.84 13.0 12.51 2.97 5 25 20 0.45
## PA1 13 78 7.48 5.89 7.0 6.95 5.93 0 22 22 0.68
## NA1 14 78 3.36 4.45 2.0 2.57 2.97 0 28 28 2.73
## EA2 15 78 11.65 6.66 11.0 11.51 5.93 0 28 28 0.17
## TA2 16 78 18.33 5.15 19.0 18.34 4.45 6 31 25 -0.01
## PA2 17 78 8.08 6.18 7.0 7.56 7.41 0 26 26 0.74
## NA2 18 78 5.37 5.18 4.0 4.66 4.45 0 30 30 1.80
## state2 19 41 46.55 9.63 46.0 46.63 8.90 25 67 42 -0.01
## MEQ 20 41 39.63 10.04 39.0 39.58 10.38 16 58 42 -0.01
## BDI 21 37 10.65 7.19 11.0 10.48 8.90 0 26 26 0.11
## kurtosis se
## nr -1.28 10.71
## Study* -2.02 0.06
## Film NaN 0.00
## ext -0.37 0.52
## neur -0.35 0.52
## imp -0.78 0.23
## soc -0.42 0.31
## lie 1.12 0.17
## traitanx 0.46 1.15
## state1 1.15 1.18
## EA1 -0.65 0.74
## TA1 0.65 0.44
## PA1 -0.31 0.67
## NA1 10.66 0.50
## EA2 -0.62 0.75
## TA2 -0.15 0.58
## PA2 0.15 0.70
## NA2 5.06 0.59
## state2 -0.21 1.50
## MEQ -0.76 1.57
## BDI -1.05 1.18
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max range skew
## nr 1 85 165.85 94.21 160.0 165.68 124.54 1 330.0 329.0 -0.01
## Study* 2 85 1.51 0.50 2.0 1.51 0.00 1 2.0 1.0 -0.02
## Film 3 85 3.00 0.00 3.0 3.00 0.00 3 3.0 0.0 NaN
## ext 4 85 12.78 4.60 13.0 13.12 4.45 1 20.0 19.0 -0.64
## neur 5 85 10.61 5.14 10.0 10.30 4.45 0 23.0 23.0 0.46
## imp 6 85 4.12 1.98 4.0 4.07 1.48 0 8.0 8.0 0.23
## soc 7 85 7.31 3.15 8.0 7.60 2.97 0 12.0 12.0 -0.75
## lie 8 85 2.34 1.41 2.0 2.28 1.48 0 6.0 6.0 0.42
## traitanx 9 85 39.42 9.62 39.0 38.70 7.41 23 71.0 48.0 0.74
## state1 10 85 40.72 11.25 40.0 39.89 10.38 22 76.0 54.0 0.67
## EA1 11 85 9.42 7.24 8.0 8.87 8.90 0 29.0 29.0 0.55
## TA1 12 85 13.16 4.60 13.0 13.00 4.45 3 24.5 21.5 0.25
## PA1 13 85 8.95 6.91 8.0 8.29 5.93 0 28.0 28.0 0.77
## NA1 14 85 3.92 4.66 2.0 3.12 2.97 0 23.0 23.0 1.93
## EA2 15 85 9.07 6.90 7.0 8.66 8.90 0 27.0 27.0 0.46
## TA2 16 85 13.30 4.26 14.0 13.20 4.45 3 25.0 22.0 0.20
## PA2 17 85 8.75 6.45 8.0 8.23 7.41 0 26.0 26.0 0.57
## NA2 18 85 2.89 4.13 1.0 2.01 1.48 0 18.0 18.0 1.77
## state2 19 42 39.58 9.24 40.0 39.89 9.64 20 57.0 37.0 -0.20
## MEQ 20 42 38.64 10.20 38.5 38.82 8.15 15 59.0 44.0 -0.20
## BDI 21 43 10.81 7.99 9.0 10.37 10.38 0 29.0 29.0 0.40
## kurtosis se
## nr -1.14 10.22
## Study* -2.02 0.05
## Film NaN 0.00
## ext -0.12 0.50
## neur -0.20 0.56
## imp -0.87 0.22
## soc -0.16 0.34
## lie -0.35 0.15
## traitanx 0.51 1.04
## state1 0.33 1.22
## EA1 -0.59 0.78
## TA1 -0.19 0.50
## PA1 0.00 0.75
## NA1 4.61 0.50
## EA2 -0.87 0.75
## TA2 -0.08 0.46
## PA2 -0.42 0.70
## NA2 2.44 0.45
## state2 -0.83 1.43
## MEQ -0.27 1.57
## BDI -0.93 1.22
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew
## nr 1 84 173.17 99.15 168.5 173.46 137.88 12 324 312 -0.01
## Study* 2 84 1.45 0.50 1.0 1.44 0.00 1 2 1 0.19
## Film 3 84 4.00 0.00 4.0 4.00 0.00 4 4 0 NaN
## ext 4 84 14.01 4.27 14.5 14.28 3.71 0 22 22 -0.64
## neur 5 84 10.13 5.51 10.0 10.11 5.93 0 23 23 0.04
## imp 6 84 4.81 1.81 5.0 4.93 1.48 0 8 8 -0.54
## soc 7 84 7.91 2.78 9.0 8.17 2.97 0 13 13 -0.77
## lie 8 84 2.31 1.57 2.0 2.19 1.48 0 7 7 0.86
## traitanx 9 84 38.89 9.87 37.0 38.07 10.38 23 68 45 0.74
## state1 10 84 41.39 10.71 40.0 40.99 11.12 21 66 45 0.38
## EA1 11 84 9.62 7.47 8.0 9.22 8.90 0 26 26 0.38
## TA1 12 84 13.35 4.37 13.0 13.21 4.45 4 24 20 0.38
## PA1 13 84 9.56 6.88 9.0 9.15 8.15 0 27 27 0.36
## NA1 14 84 3.96 4.40 2.0 3.18 2.97 0 18 18 1.45
## EA2 15 84 13.35 7.02 14.0 13.46 7.41 0 29 29 -0.13
## TA2 16 84 13.74 3.14 14.0 13.89 2.97 6 20 14 -0.45
## PA2 17 84 11.75 6.31 12.0 11.65 5.93 0 29 29 0.11
## NA2 18 84 1.79 2.88 0.0 1.16 0.00 0 12 12 1.81
## state2 19 46 35.89 8.05 36.0 35.71 7.41 20 54 34 0.22
## MEQ 20 46 40.63 9.59 41.5 40.97 9.64 20 59 39 -0.32
## BDI 21 38 10.55 7.25 10.0 10.28 8.90 0 26 26 0.33
## kurtosis se
## nr -1.41 10.82
## Study* -1.99 0.05
## Film NaN 0.00
## ext 0.28 0.47
## neur -0.81 0.60
## imp 0.01 0.20
## soc 0.11 0.30
## lie 0.67 0.17
## traitanx 0.20 1.08
## state1 -0.73 1.17
## EA1 -1.14 0.81
## TA1 -0.07 0.48
## PA1 -0.70 0.75
## NA1 1.60 0.48
## EA2 -0.81 0.77
## TA2 -0.28 0.34
## PA2 -0.38 0.69
## NA2 2.54 0.31
## state2 -0.67 1.19
## MEQ -0.53 1.41
## BDI -0.98 1.18
pairs.panels(ddf[14:17],bg=c("red","black","white","blue")[ddf$Film],pch=21, main="Affect varies by movies ")
errorCircles(13,14,ddf[-1],group=1,labels=c("Sad","Fear","Neutral","Humor"), main="Energetic and Tense Arousal by Movie condition")
## Warning in qf(1 - alpha/2, df11n, df11d): NaNs wurden erzeugt
## Warning in qf(1 - alpha/2, df11d, df11n): NaNs wurden erzeugt
## Warning in cor(data, use = use, method = method): Standardabweichung ist Null
## Warning in cor(new.data[, 1:nvar], use = "pairwise", method = method):
## Standardabweichung ist Null
## Warning in sqrt(xvals$nw - 2): NaNs wurden erzeugt
## Warning in sqrt(n - 3): NaNs wurden erzeugt
## Warning in cor(new.data[, 1:(nvar)], new.data[, (nvar + 1):ncol(new.data)], :
## Standardabweichung ist Null
errorCircles(15,16,ddf[-1],group=1,labels=c("Sad","Fear","Neutral","Humor"),main="Positive and Negative Affect by Movie condition")
## Warning in qf(1 - alpha/2, df11n, df11d): NaNs wurden erzeugt
## Warning in qf(1 - alpha/2, df11d, df11n): NaNs wurden erzeugt
## Warning in cor(data, use = use, method = method): Standardabweichung ist Null
## Warning in cor(new.data[, 1:nvar], use = "pairwise", method = method):
## Standardabweichung ist Null
## Warning in sqrt(xvals$nw - 2): NaNs wurden erzeugt
## Warning in sqrt(n - 3): NaNs wurden erzeugt
## Warning in cor(new.data[, 1:(nvar)], new.data[, (nvar + 1):ncol(new.data)], :
## Standardabweichung ist Null
msq dataset from library(psych)
description
[lm_simple_exercises.html] or [lm_simple_exercises.Rmd]
Use the data set with the PACS data. Compute partial and semipartial coefficients for a model in which bodysatisfaction
is predicted by c_bad_way
, c_figure
and grade
.
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.full <- lm(bodysatisf ~ c_bad_way + c_figure + grade, data = df)
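A possible solution sketch using the residual approach demonstrated above (one of several equivalent ways; a dedicated package such as ppcor, if installed, yields these coefficients directly). Shown here for c_figure; the other predictors work analogously:

```r
# semipartial (part) correlation of c_figure: correlate bodysatisf with
# the part of c_figure that is independent of the other predictors
r.cfig <- residuals(lm(c_figure ~ c_bad_way + grade, data = df))
semipartial <- cor(df$bodysatisf, r.cfig)
# partial correlation: additionally partial the other predictors out of bodysatisf
r.bs <- residuals(lm(bodysatisf ~ c_bad_way + grade, data = df))
partial <- cor(r.bs, r.cfig)
c(semipartial = semipartial, partial = partial)
```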
dd <- readr::read_csv("https://md.psych.bio.uni-goettingen.de/mv/data/div/dusch_data.csv")
## Rows: 200 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): temperatur, genuss
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
temp_model <- lm(genuss ~ temperatur, data=dd)
dd$nr <- 1:nrow(dd)
dd$res <- temp_model$residuals
dd$res.s <- scale(temp_model$residuals)
dd$fit <- temp_model$fitted.values
# diagnostic plot 1
dd.f <- dd[order(dd$fit),]
ggplot(dd.f, aes(x=fit, y=res)) + geom_point()
#ggplot(dd.f, aes(x=fit, y=res.s)) + geom_point()
# diagnostic plot 3
ggplot(dd.f, aes(x=fit, y=sqrt(abs(res.s)))) + geom_point()
dd.r <- dd[order(dd$res),]
dd.r$nr.r <- 1:nrow(dd)
head(dd.r)
## # A tibble: 6 × 7
## temperatur genuss nr res res.s[,1] fit nr.r
## <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
## 1 11.5 -82.7 104 -144. -5.00 61.1 1
## 2 -4.85 -129. 160 -96.5 -3.35 -32.6 2
## 3 -5.78 -130. 186 -92.3 -3.21 -37.9 3
## 4 9.65 -35.0 95 -85.7 -2.98 50.7 4
## 5 9.36 -36.4 59 -85.4 -2.97 49.1 5
## 6 9.66 -28.0 12 -78.8 -2.74 50.8 6
# note: sqrt(res) is NaN for negative residuals, hence the warnings below
ggplot(dd.r, aes(x=nr.r, y=res)) + geom_point() + geom_point(aes(x=nr.r, y=sqrt(res)), color="red")
## Warning in sqrt(res): NaNs wurden erzeugt
## Warning: Removed 73 rows containing missing values (geom_point).
Field (2012, p. 245) chapter 7: Regression
continuous response variable, more than one continuous explanatory variable
individual coefficients of the explanatory variables as well as multiple R^2 and adjusted R^2 (cross-validation)
comparing coefficients (predictor importance, lm.beta()
)
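Standardized coefficients can be obtained by z-standardizing all variables by hand, which gives the same betas as lm.beta() (assumed here to come from the QuantPsyc package; a package named lm.beta also exists). A minimal sketch:

```r
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m <- lm(bodysatisf ~ c_figure + grade, data = df)
# standardized coefficients via z-standardized variables
m.z <- lm(scale(bodysatisf) ~ scale(c_figure) + scale(grade), data = df)
coefficients(m.z)          # betas are comparable across predictors
# equivalently, if QuantPsyc is installed:
# QuantPsyc::lm.beta(m)
```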
multiple R and \(R^2\)
formula and explanation, visualization
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/div/sat.txt")
m.mv <- lm(univ_GPA ~ math_SAT + verb_SAT, data=df)
df$pred <- predict(m.mv)
cor(df$univ_GPA, df$pred)
## [1] 0.6857272
ggplot(df, aes(x=pred, y=univ_GPA)) + geom_point() + stat_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
understanding the coefficients (the same for all observations, in contrast to multilevel models, MLM)
model comparisons (hierarchical models)
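Hierarchical (nested) models can be compared with anova(); a short sketch with the example data used above:

```r
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.small <- lm(bodysatisf ~ c_figure, data = df)
m.big   <- lm(bodysatisf ~ c_figure + grade, data = df)
# F-test: does adding grade significantly reduce the residual sum of squares?
anova(m.small, m.big)
```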
prediction, also for new data (data frames with the same predictors - without the response variable, predict(lm-object, dataframe)
)
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/div/sat.txt")
m.mv <- lm(univ_GPA ~ math_SAT + verb_SAT, data=df)
# estimate reaction variable univ_GPA values of new observations
predict(m.mv, data.frame(math_SAT = c(580, 630, 680), verb_SAT = c(560, 590, 710)))
## 1 2 3
## 2.943406 3.176105 3.613269
relationship of the explanatory variables to each other (correlation matrix, uncorrelated, multicollinearity (VIF > 10? tolerance < 0.1?)). The covariance of the explanatory variables adds an explanatory effect; cf. shared variance
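The rules of thumb in this list can be checked directly; a short sketch with the PACS model from above:

```r
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.1 <- lm(bodysatisf ~ c_phys_app + c_good_way + c_dress + c_bad_way + c_figure, data = df)
v <- car::vif(m.1)
any(v > 10)        # rule of thumb: a VIF above 10 is problematic
any(1 / v < 0.1)   # equivalently: a tolerance below 0.1
```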
Shared variance: residuals (partialling out), then continue the computation with the residuals. Example under shared variance
Interaction terms and moderation analysis in a separate unit: lm_moderation