
Unterthemen, in eigenen Units behandelt

Einführendes Beispiel PACS

Zur inhaltlichen Konzeption von PACS vergleiche body_comparison

Wir verwenden den bereits hinreichend vorgestellten Datensatz zu PACS bzw. BMI. Hinzugekommen ist die Variable body satisfaction (bodysatisf), die die Zufriedenheit mit dem eigenen Körper (Körperzufriedenheit) ausdrücken soll. Die Werte schwanken zwischen 0 und 100 und können als prozentuale Zufriedenheit interpretiert werden (ähnlich dem BIQ, Body Image Questionnaire).

Die Idee: Vorhersage von bodysatisf durch die PACS-Einzelwerte, oder: Wie viel Varianz an der Körperzufriedenheit kann durch die PACS-Fragen erklärt werden?

Zunächst ein Eindruck über relevante Einzelkorrelationen.

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(psych)
## Lade nötiges Paket: psych
# an overview of relevant scales
#pairs.panels(df[,c("bodysatisf", "c_phys_app", "c_good_way", "c_dress", "grade", "c_figure")])
# gender specific
#pairs(cbind(df$c_figure, df$grade),panel=function(x,y) text(x,y,df$gender))

cor(df[,c("bodysatisf", "grade", "bmi")], df[,c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")])
##            c_phys_app  c_good_way   c_dress  c_bad_way     c_figure
## bodysatisf  0.7511423  0.83906136 0.6966399 -0.8319948  0.809023784
## grade       0.2284217 -0.01805602 0.3034052 -0.2128557 -0.002494173
## bmi         0.2290078  0.24151014 0.3582303 -0.2283778  0.085188936
cor(df[,c("bodysatisf", "grade", "bmi")])
##             bodysatisf      grade         bmi
## bodysatisf  1.00000000 -0.2351783 -0.00534248
## grade      -0.23517829  1.0000000  0.43248817
## bmi        -0.00534248  0.4324882  1.00000000
cor(df[,c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")])
##            c_phys_app c_good_way    c_dress  c_bad_way   c_figure
## c_phys_app  1.0000000  0.7675161  0.8699426 -0.8450965  0.6739475
## c_good_way  0.7675161  1.0000000  0.7911833 -0.8468327  0.7777649
## c_dress     0.8699426  0.7911833  1.0000000 -0.8790725  0.6922335
## c_bad_way  -0.8450965 -0.8468327 -0.8790725  1.0000000 -0.8324704
## c_figure    0.6739475  0.7777649  0.6922335 -0.8324704  1.0000000

Additives lineares Modell spezifizieren und an Daten anpassen

  • Wie gut kann eine einzelne PACS-Frage die globale Körperzufriedenheit (bodysatisf) vorhersagen?
  • Wie viel kann diese Frage zusammen mit der Abschlussnote (Abitur, grade) zu einer Verbesserung der Vorhersage beitragen?
  • Wie viel trägt darüber hinaus der dichotomisierte BMI (bmi_dichotom) bei?
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")

# zero model, we estimate the constant only, which is the mean of the response variable
m.0 <- lm(bodysatisf ~ 1, data = df)


# adapt model with one explanatory variable, c_figure
m.1 <- lm(bodysatisf ~ c_figure, data = df)
# summary of model 
summary(m.1)
## 
## Call:
## lm(formula = bodysatisf ~ c_figure, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.859  -9.735   1.141   7.326  22.191 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.612      5.683   2.395   0.0235 *  
## c_figure      12.049      1.654   7.283 6.27e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.76 on 28 degrees of freedom
## Multiple R-squared:  0.6545, Adjusted R-squared:  0.6422 
## F-statistic: 53.05 on 1 and 28 DF,  p-value: 6.268e-08
anova(m.0, m.1)
## Analysis of Variance Table
## 
## Model 1: bodysatisf ~ 1
## Model 2: bodysatisf ~ c_figure
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 13191.0                                  
## 2     28  4557.2  1    8633.7 53.047 6.268e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# grade as additional predictor
m.2 <- lm(bodysatisf ~ c_figure + grade, data = df)
summary(m.2)
## 
## Call:
## lm(formula = bodysatisf ~ c_figure + grade, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8523  -7.9442  -0.6461   9.6565  19.2377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.676      7.555   3.398  0.00212 ** 
## c_figure      12.041      1.547   7.786 2.26e-08 ***
## grade         -7.501      3.340  -2.245  0.03313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08
anova(m.0, m.2)
## Analysis of Variance Table
## 
## Model 1: bodysatisf ~ 1
## Model 2: bodysatisf ~ c_figure + grade
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)    
## 1     29 13191.0                                 
## 2     27  3840.1  2    9350.9 32.873 5.82e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.1, m.2)
## Analysis of Variance Table
## 
## Model 1: bodysatisf ~ c_figure
## Model 2: bodysatisf ~ c_figure + grade
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     28 4557.2                              
## 2     27 3840.1  1    717.12 5.0421 0.03313 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we want to take a look at the design matrix of our model
head(model.matrix(m.2))
##   (Intercept) c_figure grade
## 1           1        1  2.02
## 2           1        5  1.00
## 3           1        5  2.03
## 4           1        3  1.00
## 5           1        1  1.00
## 6           1        4  1.00
# bmi_dichotom as additional predictor
m.3 <- lm(bodysatisf ~ c_figure + grade + bmi_dichotom, data = df)
summary(m.3)
## 
## Call:
## lm(formula = bodysatisf ~ c_figure + grade + bmi_dichotom, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.011  -6.549  -1.282   8.161  15.873 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          43.254      8.565   5.050 2.94e-05 ***
## c_figure              9.040      1.638   5.519 8.58e-06 ***
## grade                -9.328      2.946  -3.166  0.00392 ** 
## bmi_dichotomnormal  -15.734      4.955  -3.176  0.00383 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.32 on 26 degrees of freedom
## Multiple R-squared:  0.7902, Adjusted R-squared:  0.766 
## F-statistic: 32.65 on 3 and 26 DF,  p-value: 5.714e-09
anova(m.0, m.3)
## Analysis of Variance Table
## 
## Model 1: bodysatisf ~ 1
## Model 2: bodysatisf ~ c_figure + grade + bmi_dichotom
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1     29 13191.0                                  
## 2     26  2766.9  3     10424 32.651 5.714e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.1, m.3)
## Analysis of Variance Table
## 
## Model 1: bodysatisf ~ c_figure
## Model 2: bodysatisf ~ c_figure + grade + bmi_dichotom
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     28 4557.2                                
## 2     26 2766.9  2    1790.3 8.4115 0.001524 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m.2, m.3)
## Analysis of Variance Table
## 
## Model 1: bodysatisf ~ c_figure + grade
## Model 2: bodysatisf ~ c_figure + grade + bmi_dichotom
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     27 3840.1                                
## 2     26 2766.9  1    1073.2 10.085 0.003827 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we want to take a look at the design matrix of our model
head(model.matrix(m.3))
##   (Intercept) c_figure grade bmi_dichotomnormal
## 1           1        1  2.02                  1
## 2           1        5  1.00                  0
## 3           1        5  2.03                  0
## 4           1        3  1.00                  0
## 5           1        1  1.00                  0
## 6           1        4  1.00                  0

Interpretation des Summary

Der Omnibus-F-Test (F-statistic) prüft die Hypothese, dass alle Regressionskoeffizienten gleich 0 sind, bzw. ob durch die Vorhersage-Linearkombination ein signifikanter Anteil der Varianz des Kriteriums erklärt wird.

Multiple R-squared: Anteil der Varianz des Kriteriums, der durch die Kombination der Prädiktoren gebunden/erklärt wird.

Adjusted R-squared: rechnet R-squared so um, dass die Anzahl der erklärenden Terme im Modell berücksichtigt wird. Adjusted R-squared steigt im Gegensatz zu R-squared nur, wenn der neue Term das Modell um mehr verbessert, als durch Zufall zu erwarten wäre. Adjusted R-squared kann negativ sein und ist immer <= R-squared.

Unter Coefficients finden sich die unstandardisierten Beta-Gewichte (Estimate) mit ihrem Standardfehler sowie ein t-Test zur Prüfung der Einzeleffekte (t-Wert = Estimate / Std. Error).
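Der Zusammenhang t-Wert = Estimate / Std. Error lässt sich an der Koeffiziententabelle direkt nachrechnen (eine Skizze am Modell bodysatisf ~ c_figure + grade von oben):

```r
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.2 <- lm(bodysatisf ~ c_figure + grade, data = df)
ko <- summary(m.2)$coefficients
# Quotient Estimate / Std. Error reproduziert die Spalte 't value'
# vgl. summary(m.2) oben: 3.398, 7.786, -2.245
ko[, "Estimate"] / ko[, "Std. Error"]
```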

Die unstandardisierten Beta-Gewichte dienen zur Berechnung der Vorhersagewerte in der Skalierung der Original-Variablen.

p ist die Anzahl der Regressoren im linearen Modell (ohne Konstante), n die Stichprobengröße.
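Die übliche Korrekturformel lautet (eine Skizze, Notation wie oben):

\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}
\]

Kontrolle mit den Werten aus der Summary von oben: \(1 - (1 - 0.7089) \cdot \frac{29}{27} \approx 0.6873\), also genau das dort berichtete Adjusted R-squared.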

Standardisierte Gewichte

Standardisierte Gewichte (\(\beta\)-Gewichte) sind untereinander vergleichbar.

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# standardized beta-weights are weights of standardized variables
# standardize variables - the hard way
s.bodysatisf   <- (df$bodysatisf - mean(df$bodysatisf)) / sd(df$bodysatisf)
s.c_figure <- (df$c_figure - mean(df$c_figure)) / sd(df$c_figure)
s.grade <- (df$grade - mean(df$grade)) / sd(df$grade)
# compare summaries
# unstandardisiert
summary(lm(df$bodysatisf ~ df$c_figure + df$grade))
## 
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8523  -7.9442  -0.6461   9.6565  19.2377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.676      7.555   3.398  0.00212 ** 
## df$c_figure   12.041      1.547   7.786 2.26e-08 ***
## df$grade      -7.501      3.340  -2.245  0.03313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08
# standardized predictive variables
summary(lm(df$bodysatisf ~ s.c_figure + s.grade))
## 
## Call:
## lm(formula = df$bodysatisf ~ s.c_figure + s.grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8523  -7.9442  -0.6461   9.6565  19.2377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   51.367      2.177  23.591  < 2e-16 ***
## s.c_figure    17.242      2.215   7.786 2.26e-08 ***
## s.grade       -4.973      2.215  -2.245   0.0331 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08
# standardized predicted and explanatory variables
summary(lm(s.bodysatisf ~ s.c_figure + s.grade))
## 
## Call:
## lm(formula = s.bodysatisf ~ s.c_figure + s.grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21216 -0.37249 -0.03029  0.45277  0.90202 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.607e-17  1.021e-01   0.000   1.0000    
## s.c_figure   8.084e-01  1.038e-01   7.786 2.26e-08 ***
## s.grade     -2.332e-01  1.038e-01  -2.245   0.0331 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5592 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08

Standardisieren bzw. z-transformieren geht auch einfacher mit scale()

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# show the equivalence
s.bodysatisf   <- (df$bodysatisf - mean(df$bodysatisf)) / sd(df$bodysatisf)
head(cbind(df$bodysatisf, scale(df$bodysatisf), s.bodysatisf))
##                    s.bodysatisf
## [1,] 16 -1.6582684   -1.6582684
## [2,] 76  1.1550050    1.1550050
## [3,] 68  0.7799019    0.7799019
## [4,] 70  0.8736777    0.8736777
## [5,] 45 -0.2985196   -0.2985196
## [6,] 58  0.3110230    0.3110230
require(psych)
describe(cbind(df$bodysatisf, scale(df$bodysatisf), s.bodysatisf))
##   vars  n  mean    sd median trimmed   mad   min   max range  skew kurtosis
## 1    1 30 51.37 21.33  51.50   51.92 25.95 14.00 91.00 77.00 -0.17    -0.98
## 2    2 30  0.00  1.00   0.01    0.03  1.22 -1.75  1.86  3.61 -0.17    -0.98
## 3    3 30  0.00  1.00   0.01    0.03  1.22 -1.75  1.86  3.61 -0.17    -0.98
##     se
## 1 3.89
## 2 0.18
## 3 0.18
# use this to get raw and standardized weights
m.fb <- lm(df$bodysatisf ~ df$c_figure + df$grade)
m.s.fb <- lm(scale(df$bodysatisf) ~ scale(df$c_figure) + scale(df$grade))
coefficients(m.fb)
## (Intercept) df$c_figure    df$grade 
##   25.675687   12.040666   -7.500898
coefficients(m.s.fb)
##        (Intercept) scale(df$c_figure)    scale(df$grade) 
##      -3.277394e-17       8.084422e-01      -2.331619e-01
summary(m.fb)
## 
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8523  -7.9442  -0.6461   9.6565  19.2377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.676      7.555   3.398  0.00212 ** 
## df$c_figure   12.041      1.547   7.786 2.26e-08 ***
## df$grade      -7.501      3.340  -2.245  0.03313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08
summary(m.s.fb)
## 
## Call:
## lm(formula = scale(df$bodysatisf) ~ scale(df$c_figure) + scale(df$grade))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21216 -0.37249 -0.03029  0.45277  0.90202 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -3.277e-17  1.021e-01   0.000   1.0000    
## scale(df$c_figure)  8.084e-01  1.038e-01   7.786 2.26e-08 ***
## scale(df$grade)    -2.332e-01  1.038e-01  -2.245   0.0331 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5592 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08

oder mit lm.beta() aus den Packages QuantPsyc bzw. lm.beta

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(QuantPsyc)
## Lade nötiges Paket: QuantPsyc
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : es gibt kein Paket namens 'QuantPsyc'
require(lm.beta)
## Lade nötiges Paket: lm.beta
# use this to get raw and standardized weights
m.fb <- lm(df$bodysatisf ~ df$c_figure + df$grade)
m.s.fb <- lm(scale(df$bodysatisf) ~ scale(df$c_figure) + scale(df$grade))
coefficients(m.fb)
## (Intercept) df$c_figure    df$grade 
##   25.675687   12.040666   -7.500898
lm.beta(m.fb)
## 
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
## 
## Standardized Coefficients::
## (Intercept) df$c_figure    df$grade 
##          NA   0.8084422  -0.2331619
coefficients(m.s.fb)
##        (Intercept) scale(df$c_figure)    scale(df$grade) 
##      -3.277394e-17       8.084422e-01      -2.331619e-01
summary(m.fb)
## 
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8523  -7.9442  -0.6461   9.6565  19.2377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.676      7.555   3.398  0.00212 ** 
## df$c_figure   12.041      1.547   7.786 2.26e-08 ***
## df$grade      -7.501      3.340  -2.245  0.03313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08

Konfidenzintervalle für die Parameter

Zur Beurteilung der einzelnen Prädiktoren und Koeffizienten helfen die Konfidenzintervalle. Individuelle Konfidenzintervalle für einzelne Vorhersagewerte finden sich im Abschnitt Vorhersage.

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(QuantPsyc)
## Lade nötiges Paket: QuantPsyc
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : es gibt kein Paket namens 'QuantPsyc'
# use this to get raw and standardized weights
m.fb <- lm(df$bodysatisf ~ df$c_figure + df$grade)
# get the coefficients
coefficients(m.fb)
## (Intercept) df$c_figure    df$grade 
##   25.675687   12.040666   -7.500898
lm.beta(m.fb)
## 
## Call:
## lm(formula = df$bodysatisf ~ df$c_figure + df$grade)
## 
## Standardized Coefficients::
## (Intercept) df$c_figure    df$grade 
##          NA   0.8084422  -0.2331619
# and their confidence intervals (95% level). These are the confidence intervals of the coefficients.
confint(m.fb, level=0.95)
##                  2.5 %     97.5 %
## (Intercept)  10.173147 41.1782270
## df$c_figure   8.867478 15.2138539
## df$grade    -14.354990 -0.6468053
# to get individual confidence intervals see predict
# help
# help(predict.lm) # commented, so it can be rendered 

Vorhersage

  • Vorhersage eines einzelnen Kriteriumswertes (Beobachtung 3) mit Prädiktor c_figure und grade (‘zu Fuß’).
  • Residuum für diese Beobachtung ‘zu Fuß’.
  • Vorhersagen für neue Daten machen, in denen die Reaktionsvariable nicht bekannt ist und geschätzt werden muss
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
#attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade, data=df)

# store coefficients
coe <- m.2$coefficients
# calculate predicted (fitted) value of 3rd subject
e.bodysatisf.3 <- coe['(Intercept)'] + coe['c_figure'] * df$c_figure[3] + coe['grade'] * df$grade[3]
e.bodysatisf.3
## (Intercept) 
##    70.65219
# predicted values are called fitted.values
# we get at them by using 'fitted.values'
m.2$fitted.values
##        1        2        3        4        5        6        7        8 
## 22.56454 78.37812 70.65219 54.29679 30.21546 66.33745 43.27047 56.06122 
##        9       10       11       12       13       14       15       16 
## 56.88632 64.76226 40.23088 39.52002 71.85234 25.56490 42.25612 22.26450 
##       17       18       19       20       21       22       23       24 
## 54.29679 66.33745 54.29679 49.53544 66.33745 61.83691 25.71492 40.00585 
##       25       26       27       28       29       30 
## 37.60556 67.27679 54.29679 78.37812 78.37812 21.58942
# or by using 'predict()'
predict(m.2)
##        1        2        3        4        5        6        7        8 
## 22.56454 78.37812 70.65219 54.29679 30.21546 66.33745 43.27047 56.06122 
##        9       10       11       12       13       14       15       16 
## 56.88632 64.76226 40.23088 39.52002 71.85234 25.56490 42.25612 22.26450 
##       17       18       19       20       21       22       23       24 
## 54.29679 66.33745 54.29679 49.53544 66.33745 61.83691 25.71492 40.00585 
##       25       26       27       28       29       30 
## 37.60556 67.27679 54.29679 78.37812 78.37812 21.58942
# should be equal to 3 pos of fitted (predicted) values 
m.2$fitted.values[3]
##        3 
## 70.65219
# we can use the design matrix to calculate the predicted values
head(cbind(model.matrix(m.2) %*% m.2$coefficients, predict(m.2)))
##       [,1]     [,2]
## 1 22.56454 22.56454
## 2 78.37812 78.37812
## 3 70.65219 70.65219
## 4 54.29679 54.29679
## 5 30.21546 30.21546
## 6 66.33745 66.33745
# residual for this subject is difference between real value and predicted value
df$bodysatisf[3] - e.bodysatisf.3
## (Intercept) 
##   -2.652195
# compare with residuals vector pos 3
m.2$residuals
##           1           2           3           4           5           6 
##  -6.5645397  -2.3781193  -2.6521947  15.7032127  14.7845447  -8.3374533 
##           7           8           9          10          11          12 
##  -3.2704678  -1.0612236  -6.8863223  19.2377352  -0.2308789  13.4799810 
##          13          14          15          16          17          18 
## -25.8523384 -11.5648988 -20.2561213  10.7354962  -8.2967873   0.6625467 
##          19          20          21          22          23          24 
##   5.7032127  -4.5354426   6.6625467  14.1630853 -11.7149167 -19.0058520 
##          25          26          27          28          29          30 
##   7.3944352   6.7232092   2.7032127  -8.3781193  12.6218807  10.4105770
m.2$residuals[3]
##         3 
## -2.652195
#detach(ddf)
# predict bodysatisf on base of new data, same predictor names are required. We don't know/need response variable.
data.new <- data.frame(
  c_figure = c(1, 3, 5),
  grade    = c(1.3, 2.1, 3.2)
  )
# Calculate the individual estimated response variable
predict(m.2, data.new)
##        1        2        3 
## 27.96519 46.04580 61.87614
# we might need the standard error to build a confidence interval
predict(m.2, data.new, se.fit=T)
## $fit
##        1        2        3 
## 27.96519 46.04580 61.87614 
## 
## $se.fit
##        1        2        3 
## 4.083912 2.742181 6.446045 
## 
## $df
## [1] 27
## 
## $residual.scale
## [1] 11.92586
rm(data.new, m.2, e.bodysatisf.3, coe)
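Aus se.fit und den Residual-Freiheitsgraden lässt sich ein Konfidenzintervall für die vorhergesagten Mittelwerte ‘zu Fuß’ bauen; predict() liefert es mit interval = "confidence" auch direkt (eine Skizze, die Objekte von oben werden dafür neu angelegt):

```r
df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.2 <- lm(bodysatisf ~ c_figure + grade, data = df)
data.new <- data.frame(c_figure = c(1, 3, 5), grade = c(1.3, 2.1, 3.2))
p <- predict(m.2, data.new, se.fit = TRUE)
t.krit <- qt(0.975, df = p$df)  # kritischer t-Wert bei Residual-df
cbind(fit = p$fit,
      lwr = p$fit - t.krit * p$se.fit,
      upr = p$fit + t.krit * p$se.fit)
# dasselbe direkt über predict():
predict(m.2, data.new, interval = "confidence", level = 0.95)
```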

Grafische Darstellung

Der ggplot Weg für das Modell bodysatisf ~ c_figure + grade.

ggplot stellt nur zweidimensionale Graphen dar. Wir können aber die einzelnen Regressionsgeraden darstellen und die dritte Dimension über die Farbgebung in den Punkten visualisieren. Bei kategorialen Vorhersagevariablen kann auch eine Farbgebung in Abhängigkeit von der Gruppenzugehörigkeit helfen, oder aber multiple Plots.

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# adapt model with two explanatory variables c_figure and grade
m.r <- lm(bodysatisf ~ c_figure + grade, data = df)
require(ggplot2)
## Lade nötiges Paket: ggplot2
## 
## Attache Paket: 'ggplot2'
## Die folgenden Objekte sind maskiert von 'package:psych':
## 
##     %+%, alpha
# only 2-dimensional regression line is possible
# we begin drawing regression of bodysatisf on c_figure
ggplot(data=df, mapping=aes(c_figure, bodysatisf, color=grade)) + geom_smooth(method=lm)  + geom_point()
## `geom_smooth()` using formula 'y ~ x'

# ... and continue with drawing regression of bodysatisf on grade
ggplot(data=df, mapping=aes(grade, bodysatisf, color=c_figure)) + geom_smooth(method=lm)  + geom_point()
## `geom_smooth()` using formula 'y ~ x'

# dichotomous grouping variable
m.r <- lm(bodysatisf ~ c_figure + bmi_dichotom, data = df)
ggplot(data=df, mapping=aes(c_figure, bodysatisf, color=bmi_dichotom)) + geom_smooth(method=lm)  + geom_point()
## `geom_smooth()` using formula 'y ~ x'

# multiple plots
m.r <- lm(bodysatisf ~ c_figure + grade + bmi_dichotom, data = df)
ggplot(data=df, mapping=aes(c_figure, bodysatisf, color=grade)) + geom_smooth(method=lm)  + geom_point() + facet_wrap(~ bmi_dichotom)
## `geom_smooth()` using formula 'y ~ x'

#ggplot(data=df, mapping=aes(x=c_figure, y=bodysatisf)) + geom_point() + facet_wrap(~bmi_dichotom)


rm(m.r)

Grafische Darstellung als 3D-Plot

Mit der library(rockchalk)

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.a <- lm(bodysatisf ~ c_figure + grade, data=df)
require(rockchalk)
## Lade nötiges Paket: rockchalk
plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade")

# various angles
plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade", theta = 0, phi = 10)

plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade", drawArrows = T, theta=0, phi=10)

plotPlane(model = m.a, plotx1 = "c_figure", plotx2 = "grade", drawArrows = T)

# for an interaction view see lm_moderation

Residualanalyse

Zweck ist die Suche nach Besonderheiten im Plot der Residuen

Besonderheiten deuten auf systematische Einflüsse hin, die bisher nicht durch das Modell erfasst werden. Es geht um Abweichungen der Residuen von der NV mit Mittelwert 0.

Zur Inspektion bietet sich ein Plot der vorhergesagten Werte gegen die (ev. standardisierten) Residuen an.

Grafiken, Residualplot und grafische Prüfung auf NV der Residuen mit ggplot()

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.2 <- lm(bodysatisf ~ c_figure + grade, data=df)
# we plot predicted/fitted values (x) against standardized residuals (y)
ggplot(data=df) + geom_point(aes(x=m.2$fitted.values, y=rstandard(m.2)))  + geom_hline(yintercept = 0)

# we plot observed residuals against theoretically expected residuals if normally distributed
ggplot(df, aes(sample = m.2$residuals)) + stat_qq() + geom_abline(intercept=0, slope=1*sd(m.2$residuals))

# or scaled
ggplot(df, aes(sample = scale(m.2$residuals))) + stat_qq() + geom_abline(intercept=0, slope=1*sd(scale(m.2$residuals)))

Beispiele

Beispiel linearer Trend

Linearer Trend, visualisieren, Residuen visualisieren. Aus den Residuen ist alles, was der lineare Trend erklärt, herausgerechnet.

# some trend
t <- rnorm(100,0, 15)
y <- t + rnorm(100,0,5) * -1 + ((t + rnorm(100, 0, 5))^2) / 20
#x <-  ((t + rnorm(100, 0, 5))^2)*10
x <- (t + rnorm(100,0,2)) / 500
#x <- t^2
cor(cbind(t, y, x))
##           t         y         x
## t 1.0000000 0.6075833 0.9911027
## y 0.6075833 1.0000000 0.6030845
## x 0.9911027 0.6030845 1.0000000
m.xy <- lm(y~x)
# check residuals graphically
require("ggplot2")
require("gridExtra")
## Lade nötiges Paket: gridExtra
# create scatterplot
ddf.t <- data.frame(x=x, y=y, pred=m.xy$fitted.values, std.res=rstandard(m.xy))
p1 <- ggplot(ddf.t, aes(x=x, y=y)) + geom_point() + stat_smooth(method = "lm", se=FALSE)
p2 <- ggplot(ddf.t, aes(x=pred, y=std.res)) + geom_point()
#p1 <- ggplot(dfbmi, x=nation, y=bmi, aes(nation, bmi, fill = gender))
grid.arrange(p1,p2, nrow=2)
## `geom_smooth()` using formula 'y ~ x'

Residualanalyse: Beispiele für Besonderheiten

Die Residuen sind manchmal abhängig von der Höhe der vorhergesagten Werte

So könnte es aussehen, wenn die Ausprägung der Residuen steigt, je größer die vorhergesagten Werte sind.

# random data
x <- 1:100
y <- x / 10 * rnorm(100, 0, 2)
m.xy <- lm(y ~ x)
# check residuals graphically
#plot(x, y)
# standardized residuals
plot(m.xy$fitted.values, rstandard(m.xy), 
     ylab="Standardized Residuals", 
     xlab="Fitted y", 
     main="Beispiel 1 Besonderheiten im Residualplot") 
abline(0, 0)  

Beispiel quadratischer Trend

In diesem Beispiel legt der Residualplot nahe, dass eine bisher nicht in das Modell aufgenommene Vorhersagevariable systematischen Einfluss hat. Alternativ könnte es sein, dass eine bereits aufgenommene Vorhersagevariable vielleicht quadriert (zusätzlich/alternativ) in das Modell aufgenommen werden sollte.

# some quadratic trend
y <- rnorm(100, 100, 15)
x <- y ^ 2 + rnorm(100, 0, 100)
m.xy <- lm(y ~x)
# check residuals graphically

# standardized residuals
plot(m.xy$fitted.values, rstandard(m.xy), 
     ylab="Standardized Residuals", 
     xlab="Fitted y", 
     main="Beispiel 2 Besonderheiten im Residualplot") 
abline(0, 0)  

Ausreißer und bedeutsame Fälle

Ausreißer am Beispiel von Anscombe’s Quartet

Das Problem der Bedeutsamkeit von Ausreißern wird sehr deutlich am sog. Anscombe’s Quartet.

  • Mean of x in each case: 9 (exact)
  • Variance of x in each case: 11 (exact)
  • Mean of y in each case: 7.50 (to 2 decimal places)
  • Variance of y in each case: 4.122 or 4.127 (to 3 decimal places)
  • Correlation between x and y in each case: 0.816 (to 3 decimal places)
  • Linear regression line in each case: y = 3.00 + 0.500x (to 2 and 3 decimal places, respectively)

Alle x*y Korrelationen sind gleich hoch. Die Daten sind völlig verschieden.

##    vars  n mean   sd median trimmed  mad min max range skew kurtosis se
## x1    1 11    9 3.32      9       9 4.45   4  14    10 0.00    -1.53  1
## x2    2 11    9 3.32      9       9 4.45   4  14    10 0.00    -1.53  1
## x3    3 11    9 3.32      9       9 4.45   4  14    10 0.00    -1.53  1
## x4    4 11    9 3.32      8       8 0.00   8  19    11 2.47     4.52  1
##    vars  n mean   sd median trimmed  mad  min   max range  skew kurtosis   se
## y1    1 11  7.5 2.03   7.58    7.49 1.82 4.26 10.84  6.58 -0.05    -1.20 0.61
## y2    2 11  7.5 2.03   8.14    7.79 1.47 3.10  9.26  6.16 -0.98    -0.51 0.61
## y3    3 11  7.5 2.03   7.11    7.15 1.53 5.39 12.74  7.35  1.38     1.24 0.61
## y4    4 11  7.5 2.03   7.04    7.20 1.90 5.25 12.50  7.25  1.12     0.63 0.61
## x1*y1:  0.816
## x2*y2:  0.816
## x3*y3:  0.816
## x4*y4:  0.817
## (Intercept)          y1 
##  -0.9975311   1.3328426
## (Intercept)          y2 
##  -0.9948419   1.3324841
## (Intercept)          y3 
##   -1.000315    1.333375
## (Intercept)          y4 
##   -1.003640    1.333657
## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'
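Das Quartett liegt in R als Datensatz anscombe (Package datasets) vor; die Gleichheit der Kennwerte lässt sich damit direkt nachrechnen (eine Skizze):

```r
data(anscombe)
# Korrelationen x_i * y_i: alle ca. 0.816
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
# Regressionskoeffizienten y_i ~ x_i: Intercept jeweils ca. 3.0, Steigung ca. 0.5
sapply(1:4, function(i) coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])))
```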

Ausreißer und Extremwerte entdecken und identifizieren

Alle im Folgenden angegebenen Funktionen brauchen als Parameter lediglich das Ergebnisobjekt eines Modells (lm-Objekt).

Outlier:

  • Residuen resid()
  • standardisierte Residuen rstandard(): standardisierte Differenz zwischen beobachteten Werten (response variable) und Vorhersagewerten (fitted values); vorgeschlagene Grenze: |Wert| > 2
  • studentisierte Residuen rstudent(): Differenz zu den angepassten Vorhersagewerten (Modell, bei dem die jeweilige Beobachtung ausgeschlossen ist); vorgeschlagene Grenze: |Wert| > 2

bedeutsame Fälle:

  • cooks distance cooks.distance(): globales Einflussmaß; Einfluss einer einzelnen Beobachtung auf das Gesamtmodell; vorgeschlagene Grenze: 1

  • dfbeta dfbeta(): bildet die Unterschiede in den Koeffizienten des Modells ab, je nachdem, ob eine Beobachtung ein- oder ausgeschlossen ist.

  • dffits dffits(): Differenz zwischen den Vorhersagewerten (fitted values) und den angepassten Vorhersagewerten (adjusted fitted values). Angepasste Vorhersagewerte sind die Vorhersagewerte einer Beobachtung, wenn das Modell ohne diese Beobachtung (Vp) angepasst wurde. Idee: Eine Beobachtung ist dann ‘unwichtig’, wenn die Differenz klein ist, wenn sich also die Vorhersagewerte aufgrund der Beobachtung nur wenig ändern.

  • leverage, Hebelwert (hat values) hatvalues(): der Erklärungsbeitrag einer Beobachtung zur Vorhersagevariable (Anteil der i-ten Beobachtung an der Gesamtquadratsumme der Vorhersagevariablen). Die Reaktionsvariable geht nicht ein; die Leverage sagt also etwas über den Beitrag einer Beobachtung zu den Vorhersagewerten aus. Sie schwankt zwischen 0 (kein Einfluss) und 1 (kompletter Einfluss); mittlerer Einfluss \(\frac{p+1}{n}\), p = Anzahl Prädiktoren, n = Anzahl Beobachtungen (subjects). Beobachtungen nahe am mittleren Einfluss sind ‘unwichtig’; einflussreiche Beobachtungen haben Leverage-Werte über dem Zwei- bis Dreifachen des mittleren Einflusses. Als Grenze für eine große Leverage wird auch \(h_i > \frac{2p}{n}\) vorgeschlagen (p = Anzahl der Vorhersagevariablen, n = Stichprobengröße). Leverage is the proportion of the total sum of squares of the explanatory variable contributed by the i-th case. The leverage of an observation is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The greater an observation’s leverage, the more potential it has to be an influential observation. For example, an observation with a value equal to the mean of the predictor variable has no influence on the slope of the regression line, regardless of its value on the criterion variable. On the other hand, an observation that is extreme on the predictor variable has the potential to affect the slope greatly. Calculation of leverage (h): first, standardize the predictor variable so that it has a mean of 0 and a standard deviation of 1; then the leverage (h) is computed by squaring the observation’s value on the standardized predictor variable, adding 1, and dividing by the number of observations.

  • covariance ratio covratio() Maß dafür, wie stark eine Beobachtung die Präzision der Parameterschätzungen beeinflusst: Verhältnis der (generalisierten) Varianz der Koeffizientenschätzungen mit und ohne die jeweilige Beobachtung. Werte nahe 1 sind unproblematisch.
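Die genannten Eigenschaften lassen sich an simulierten Daten nachvollziehen. Eine minimale Skizze (Daten und Variablennamen frei erfunden): Die Hat-Werte summieren sich zu p+1 auf, ihr Mittelwert ist (p+1)/n, und dffits() setzt die Leave-One-Out-Idee um.

```r
# sketch with simulated data: properties of hat values and the idea behind dffits
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
m  <- lm(y ~ x1 + x2)

h <- hatvalues(m)
p <- 2                 # number of predictors
sum(h)                 # hat values sum to p + 1
mean(h)                # mean leverage is (p + 1) / n

# leave-one-out idea behind dffits: refit the model without observation i
# and compare the prediction for observation i
i   <- 1
m.i <- lm(y ~ x1 + x2, subset = -i)
d.i <- fitted(m)[i] - predict(m.i, newdata = data.frame(x1 = x1[i], x2 = x2[i]))
# dffits() standardizes this difference by s_(i) * sqrt(h_i)
s.i <- summary(m.i)$sigma
d.i / (s.i * sqrt(h[i]))   # identical to dffits(m)[i]
dffits(m)[i]
```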

Cook’s Distance an Daten aus Anscombe’s Quartet

##    vars  n mean   sd median trimmed  mad min max range skew kurtosis se
## x1    1 11    9 3.32      9       9 4.45   4  14    10 0.00    -1.53  1
## x2    2 11    9 3.32      9       9 4.45   4  14    10 0.00    -1.53  1
## x3    3 11    9 3.32      9       9 4.45   4  14    10 0.00    -1.53  1
## x4    4 11    9 3.32      8       8 0.00   8  19    11 2.47     4.52  1
##    vars  n mean   sd median trimmed  mad  min   max range  skew kurtosis   se
## y1    1 11  7.5 2.03   7.58    7.49 1.82 4.26 10.84  6.58 -0.05    -1.20 0.61
## y2    2 11  7.5 2.03   8.14    7.79 1.47 3.10  9.26  6.16 -0.98    -0.51 0.61
## y3    3 11  7.5 2.03   7.11    7.15 1.53 5.39 12.74  7.35  1.38     1.24 0.61
## y4    4 11  7.5 2.03   7.04    7.20 1.90 5.25 12.50  7.25  1.12     0.63 0.61
## correlation:  0.816
## `geom_smooth()` using formula 'y ~ x'

## named numeric(0)

## `geom_smooth()` using formula 'y ~ x'

## named numeric(0)

## `geom_smooth()` using formula 'y ~ x'

##        3 
## 14.01311

## `geom_smooth()` using formula 'y ~ x'

##        8 
## 10.35181

Ausreißer beim Körpervergleichs-Beispiel

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade)
# get the most extreme observations: standardized residuals > 2 or < -2 (two extremes)
extremes <- subj[rstandard(m.2) > 2 | rstandard(m.2) < -2]
print(paste('extreme standardized residuals in observation: ', extremes))
## [1] "extreme standardized residuals in observation:  13"
#extremes
ddf[extremes,]
##    subj    name gender height weight grade c_phys_app c_good_way c_dress
## 13   13 Berking      f    187     91  1.87          3          4       3
##    c_bad_way c_figure filling_time      bmi bmi_class bmi_dichotom bmi_normal
## 13         2        5        26.16 26.02305      high      extreme      FALSE
##    i_c_bad_way pacs_sum pacs bodysatisf
## 13           4       19  3.8         46
# Cook's distance > 1 would be the commonly recommended limit
# here we use a stricter cutoff of 0.5 to find suspicious observations
extremes <- subj[cooks.distance(m.2) > 0.5]
print(paste('extreme cooks distances in observation: ', extremes))
## [1] "extreme cooks distances in observation:  "
# extremes
ddf[extremes,]
##  [1] subj         name         gender       height       weight      
##  [6] grade        c_phys_app   c_good_way   c_dress      c_bad_way   
## [11] c_figure     filling_time bmi          bmi_class    bmi_dichotom
## [16] bmi_normal   i_c_bad_way  pacs_sum     pacs         bodysatisf  
## <0 Zeilen> (oder row.names mit Länge 0)
# plot it
plot(m.2$fitted.values, cooks.distance(m.2))

# insert extreme values to show effect
ddf.mod <- ddf
to.mod <- c(20, sample(length(subj), 2))
print(paste('to modify: ', to.mod))
## [1] "to modify:  20" "to modify:  7"  "to modify:  2"
ddf.mod[to.mod,]
##    subj     name gender height weight grade c_phys_app c_good_way c_dress
## 20   20 Chorvátc      f    177     80  3.24          4          3       5
## 7     7 Valentin      f    165     71  2.47          4          3       4
## 2     2   Müller      m    176     54  1.00          3          4       3
##    c_bad_way c_figure filling_time      bmi bmi_class bmi_dichotom bmi_normal
## 20         1        4       113.48 25.53545      high      extreme      FALSE
## 7          2        3        26.00 26.07897      high      extreme      FALSE
## 2          1        5        26.00 17.43285       low      extreme      FALSE
##    i_c_bad_way pacs_sum pacs bodysatisf
## 20           5       21  4.2         45
## 7            4       18  3.6         40
## 2            5       20  4.0         76
ddf.mod$c_figure[to.mod] <- ddf.mod$c_figure[to.mod] * 2
ddf.mod[to.mod,]
##    subj     name gender height weight grade c_phys_app c_good_way c_dress
## 20   20 Chorvátc      f    177     80  3.24          4          3       5
## 7     7 Valentin      f    165     71  2.47          4          3       4
## 2     2   Müller      m    176     54  1.00          3          4       3
##    c_bad_way c_figure filling_time      bmi bmi_class bmi_dichotom bmi_normal
## 20         1        8       113.48 25.53545      high      extreme      FALSE
## 7          2        6        26.00 26.07897      high      extreme      FALSE
## 2          1       10        26.00 17.43285       low      extreme      FALSE
##    i_c_bad_way pacs_sum pacs bodysatisf
## 20           5       21  4.2         45
## 7            4       18  3.6         40
## 2            5       20  4.0         76
m.2.m <- lm(bodysatisf ~ c_figure + grade, data=ddf.mod) 

cooks.distance(m.2.m)[cooks.distance(m.2.m) > 0.1]
##         2        12        20 
## 0.9270435 0.1431751 0.3458125
plot(m.2.m$fitted.values, cooks.distance(m.2.m))

rm(ddf.mod, to.mod, m.2.m)
detach(ddf)

Cook's-Distance-Plot mit ggplot()

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade)
# again the model
#m.e_s <- lm(time ~ extro + sex, data=car.data)
#m.e_s <- lm(traitanx ~ bfagree + bfcon + bfext + bfneur + bfopen)
# we prepare adequate dataframe
ddf.plot = data.frame(subj = 1:length(subj), ddf[,c('bodysatisf', 'c_figure', 'grade')])
ddf.plot['fitted.values']         <- m.2$fitted.values
ddf.plot['studentized.residuals'] <- rstudent(m.2)
ddf.plot['cooks.distance']        <- cooks.distance(m.2)
ddf.plot['dfbeta']                <- dfbeta(m.2)
ddf.plot['dffits']                <- dffits(m.2)
ddf.plot['hatvalues']             <- hatvalues(m.2)
ddf.plot['covratio']              <- covratio(m.2)
# example: plot cooks distance against subject number - you have to identify the person
pplot <- ggplot(ddf.plot, aes(subj, cooks.distance, ymin=0, ymax=cooks.distance)) +
  geom_point() + 
  geom_linerange() +
  scale_x_continuous("Observation Number") +
  scale_y_continuous("Cook's distance") +
  ggtitle("Cook's Distance")
pplot

# get persons with Cook's distance above 0.5
# rule of thumb: values > 1 are considered problematic, we use a stricter cutoff here
extremes <- ddf.plot$subj[cooks.distance(m.2) > 0.5]
if(length(extremes)) {
  print(paste('extreme cooks distances in observation: ', extremes))
  ddf.plot[extremes,]
  }
# example: plot hat values, i.e. leverage, against subject number
pplot <- ggplot(ddf.plot, aes(subj, hatvalues, ymin=0, ymax=hatvalues)) +
  geom_point() + 
  geom_linerange() +
  scale_x_continuous("Observation Number") +
  scale_y_continuous("Hat values (leverage)") +
  ggtitle("Leverage")
pplot

detach(ddf)

Diagnostische Plots mit library(ggplot2)

library(ggplot2) ermöglicht es, sehr aussagekräftige und zugleich schöne diagnostische Plots zu erstellen.


Quelle

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.2 <- lm(bodysatisf ~ c_figure + grade, data=ddf)

require(ggplot2)
# plot fitted values against residuals, aggregate smooth and show standard error
p1<-ggplot(m.2, aes(.fitted, .resid)) +
    geom_point() +
    stat_smooth(method="loess") +
    geom_hline(yintercept=0, col="red", linetype="dashed") +
    xlab("Fitted values")+ylab("Residuals") +
    ggtitle("Residual vs Fitted Plot") +
    theme_bw()
p1
## `geom_smooth()` using formula 'y ~ x'

# QQ plot to check normal distribution of residuals
p2<-ggplot(m.2, aes(qqnorm(.stdresid)[[1]], .stdresid)) +
  geom_point(na.rm = TRUE) + 
  geom_abline() +
  xlab("Theoretical Quantiles") +
  ylab("Standardized Residuals") + 
  ggtitle("Normal Q-Q") +
  theme_bw()
# to see the plot, uncomment next line. For some reason it is currently not rendered correctly by Knit HTML
#p2

# plot fitted values against sqrt(|standardized residuals|); the square root amplifies small differences relative to larger ones
p3<-ggplot(m.2, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3<-p3+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3<-p3+ylab(expression(sqrt("|Standardized residuals|")))
p3<-p3+ggtitle("Scale-Location")+theme_bw()
p3
## `geom_smooth()` using formula 'y ~ x'

# plot Cook's distance
p4<-ggplot(m.2, aes(seq_along(.cooksd), .cooksd))+geom_bar(stat="identity", position="identity")
p4<-p4+xlab("Obs. Number")+ylab("Cook's distance")
p4<-p4+ggtitle("Cook's distance")+theme_bw()
p4

# residuals vs leverage
p5<-ggplot(m.2, aes(.hat, .stdresid))+geom_point(aes(size=.cooksd), na.rm=TRUE)
p5<-p5+stat_smooth(method="loess", na.rm=TRUE)
p5<-p5+xlab("Leverage")+ylab("Standardized Residuals")
p5<-p5+ggtitle("Residual vs Leverage Plot")
p5<-p5+scale_size_continuous("Cook's Distance", range=c(1,5))
p5<-p5+theme_bw()+theme(legend.position="bottom")
p5
## `geom_smooth()` using formula 'y ~ x'

# plot leverage vs Cook's distance
p6<-ggplot(m.2, aes(.hat, .cooksd))+geom_point(na.rm=TRUE)+stat_smooth(method="loess", na.rm=TRUE)
p6<-p6+xlab("Leverage hii")+ylab("Cook's Distance")
p6<-p6+ggtitle("Cook's dist vs Leverage hii/(1-hii)")
p6<-p6+geom_abline(slope=seq(0,3,0.5), color="gray", linetype="dashed")
p6<-p6+theme_bw()
p6    
## `geom_smooth()` using formula 'y ~ x'

Quelle: https://personality-project.org/r/#linear

Voraussetzung: Unabhängigkeit der Residuen untereinander (keine Autokorrelation). Dies prüft der Durbin-Watson-Test.

Berechnung mit durbinWatsonTest() oder dwt() aus library(car).

Daumenregel: Je näher an 2, desto besser. Werte unter 1 oder über 3 sollten aufmerksam machen.
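Zur Veranschaulichung kann man die Durbin-Watson-Statistik auch direkt aus den Residuen berechnen, \(d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}\). Eine minimale Skizze mit simulierten, unkorrelierten Residuen:

```r
# sketch: Durbin-Watson statistic computed by hand from the residuals
set.seed(2)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)       # independent errors, no autocorrelation
m <- lm(y ~ x)
e <- residuals(m)
d <- sum(diff(e)^2) / sum(e^2)  # same statistic that dwt() reports
d                               # close to 2 for uncorrelated residuals
```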

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade, data=ddf)

require(car)
## Lade nötiges Paket: car
## Lade nötiges Paket: carData
## 
## Attache Paket: 'car'
## Das folgende Objekt ist maskiert 'package:psych':
## 
##     logit
dwt(m.2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02889119      2.018337   0.934
##  Alternative hypothesis: rho != 0
# or
durbinWatsonTest(m.2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.02889119      2.018337    0.99
##  Alternative hypothesis: rho != 0

Multikollinearität

VIF (Variance Inflation Factor) oder Toleranz (tolerance). Sie sind ineinander überführbar: \(tolerance = \frac{1}{vif}\)

Der Variance Inflation Factor (VIF) dient als Hilfsmittel, um Multikollinearität zwischen den Vorhersagevariablen eines Modells zu entdecken. Die grundlegende Idee besteht darin, dass man versucht, eine Vorhersagevariable durch die verbleibenden Vorhersagevariablen auszudrücken. Gelingt dies gut, ist also das Bestimmtheitsmaß \(R^2\) dieser Hilfsregression hoch und damit der VIF \(= \frac{1}{1-R^2}\) groß, so kann man annehmen, dass die geprüfte Variable zu den restlichen Vorhersagevariablen (multi)kollinear ist.
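Diese Hilfsregression lässt sich direkt nachrechnen. Eine minimale Skizze mit simulierten, korrelierten Prädiktoren (Variablennamen frei gewählt):

```r
# sketch: VIF via auxiliary regression, VIF = 1 / (1 - R^2)
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x2 is correlated with x1
y  <- 1 + x1 + x2 + rnorm(n)
# auxiliary regression: explain x1 by the remaining predictor(s)
r2.aux <- summary(lm(x1 ~ x2))$r.squared
vif.x1 <- 1 / (1 - r2.aux)
vif.x1                                 # same value that car::vif(lm(y ~ x1 + x2)) would report for x1
```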

Empfehlungen bei VIF

  • Achtung wenn es VIF größer 10 gibt bzw. wenn es Toleranzwerte < 0.1 gibt
  • Achtung wenn der durchschnittliche VIF deutlich größer als 1 ist.
  • Toleranzwerte < 0.2 sollten Verdacht auf Multikollinearität wecken

Wir prüfen hier zunächst die VIF unseres Körpervergleichs-Beispiels.

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade, data=ddf)
require(car)
# create and store linear model object
#m.e_s <- lm(time ~ extro + sex)
# output vif
print(paste("vif: ", vif(m.2)))
## [1] "vif:  1.00000622093632" "vif:  1.00000622093632"
# output tolerance
print(paste("tolerance: ", 1 / vif(m.2)))
## [1] "tolerance:  0.999993779102384" "tolerance:  0.999993779102384"
# mean of vif
mean(vif(m.2))
## [1] 1.000006

Beispielhaft überprüfen wir die Kollinearität eines Modells, bei dem die Körperzufriedenheit aus den einzelnen PACS-Items vorhergesagt werden soll.

# model with all explanatory variables pacs answers
m.1 <- lm(bodysatisf ~ c_phys_app + c_good_way + c_dress + c_bad_way + c_figure, data=ddf)
# output vif
print(paste("vif: ", vif(m.1)))
## [1] "vif:  4.72719520876748" "vif:  4.03488094015851" "vif:  6.06820748510201"
## [4] "vif:  8.80134563818313" "vif:  3.63561364134242"
# output tolerance
print(paste("tolerance: ", 1 / vif(m.1)))
## [1] "tolerance:  0.211541930433782" "tolerance:  0.247838787520881"
## [3] "tolerance:  0.164793310455367" "tolerance:  0.113618989766936"
## [5] "tolerance:  0.275056730073979"
# mean of vif
mean(vif(m.1))
## [1] 5.453449

Als Maßnahme bei zu hoher Kollinearität sollten wir die Zusammenstellung der dann (teilweise) redundanten Vorhersagevariablen überprüfen bzw. verändern. Eine andere Möglichkeit, die teilweise vorgeschlagen wird (Everitt), ist das Zusammenfassen von Vorhersagevariablen in Komponenten einer PCA, mit denen dann an Stelle der Vorhersagevariablen in der MR gerechnet wird.
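Die zuletzt genannte Idee (Hauptkomponenten statt der kollinearen Prädiktoren) als minimale Skizze mit simulierten Daten; in der Praxis würde man nur die ersten, bedeutsamen Komponenten behalten:

```r
# sketch: principal component regression as a remedy for collinearity
set.seed(4)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)          # highly collinear with x1
x3 <- x1 + rnorm(n, sd = 0.3)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
pca    <- prcomp(cbind(x1, x2, x3), scale. = TRUE)
scores <- as.data.frame(pca$x)         # component scores PC1, PC2, PC3
round(cor(scores), 10)                 # components are uncorrelated by construction
m.pcr <- lm(y ~ PC1 + PC2 + PC3, data = scores)
# using all components reproduces the fit of the original predictors
summary(m.pcr)$r.squared
summary(lm(y ~ x1 + x2 + x3))$r.squared
```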

Grafiken über das Modell

Wenn man den plot() Befehl auf ein lm-Objekt ausführt, bekommt man eine ganze Kette von Grafiken.

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
# two alternative models, the second assignment overwrites the first
#m.3 <- lm(bodysatisf ~ c_figure + grade + gender, data=df)
m.3 <- lm(bodysatisf ~ c_figure + grade + bmi_dichotom, data=df)

# get series of plots
plot(m.3)

Modellvergleiche

Modellvergleiche auf Basis von Parameterschätzungen nach der Least-Square-Methode

Alternative Ansätze zur Parameterschätzung und zu Modellvergleichen.

Modell verbessern durch Interaktionsterme

Je nach Daten und Fragestellung kann das Hinzufügen von Interaktionstermen sinnvoll sein. Diese Idee wird unter lm_moderation Rmd, html näher vorgestellt.

Hier nur die Plots, die ggplot() für ein exploratives Vorgehen anbietet. Dies ist nicht mehr das aktuell berechnete (rein additive) Modell bodysatisf ~ c_figure + grade.

Idee bzw. Fragen:

  • Ist die Vorhersage von bodysatisf bei Frauen und Männern verschieden?
  • Ist die Vorhersage vielleicht unterschiedlich genau?
  • Spezifischer: Ist die Steigung der Regressionsgeraden unterschiedlich in Abhängigkeit vom Geschlecht?

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require("ggplot2")
# raw scatterplot 
pplot <- ggplot(ddf, aes(x=c_figure, y=bodysatisf))
pplot + geom_point() 

# graphical look at gender specifity. Gender must have type factor to be used adequately. This is given already
ddf$gender
##  [1] "m" "m" "f" "f" "f" "f" "f" "f" "f" "m" "f" "m" "f" "m" "m" "f" "f" "m" "m"
## [20] "f" "m" "m" "m" "m" "f" "m" "m" "m" "m" "f"
# plot values in different color and shape 
pplot <- ggplot(ddf, aes(x=c_figure, y=bodysatisf))
# add a geom_point to base plot object; refer to gender directly inside aes(), not to ddf$gender
pplot + geom_point(aes(color=gender, shape=gender))

# include global regression line
pplot + 
  geom_point(aes(color=gender, shape=gender)) +
  stat_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

# include group specific regression lines without differences in slope (linear model without interaction)
# for this we need the coefficients of the model to plot
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# the model
m.2 <- lm(bodysatisf ~ c_figure + gender, data=ddf)
#m.e_s <- lm(time ~ extro + sex)
# create base plot object 
pplot <- ggplot(ddf, aes(x=c_figure, y=bodysatisf))
# color management
cbPalette = c('blue', 'red')

# add scatterplot and gender specific regression lines as estimated by model above 
pplot + 
  geom_point(aes(color=gender, shape=gender)) +
  scale_color_manual(values=cbPalette) +
  geom_abline(intercept = m.2$coefficients['(Intercept)'], slope =  m.2$coefficients['c_figure'], color=cbPalette[1]) + 
  geom_abline(intercept = m.2$coefficients['(Intercept)'] + m.2$coefficients['genderm'], slope =  m.2$coefficients['c_figure'], color=cbPalette[2]) 

# include group specific regression lines - this is NOT the additive model bodysatisf ~ c_figure + gender estimated above
pplot + 
  geom_point(aes(color=gender, shape=gender)) +
  stat_smooth(aes(fill = gender), method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# include both regression lines
pplot + 
  geom_point(aes(color=gender, shape=gender)) +
  stat_smooth(method = "lm", se=FALSE) +
  stat_smooth(aes(fill = gender, color=gender), method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

# include both regression lines, global confidence interval
pplot + 
  geom_point(aes(color=gender, shape=gender)) +
  stat_smooth(method = "lm") +
  stat_smooth(aes(fill = gender, color=gender), method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

# include both regression lines and confidence intervals for gender
pplot + 
  geom_point(aes(color=gender, shape=gender)) +
  stat_smooth(method = "lm", se=FALSE) +
  stat_smooth(aes(fill = gender, color=gender), method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'

Graphische Darstellung mit Base System

Regressionsplot

Der nicht-ggplot-Weg für das Modell bodysatisf ~ c_figure + grade.

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")

# fit model with two explanatory variables c_figure and grade
m.r <- lm(bodysatisf ~ c_figure + grade, data = ddf)
summary(m.r)
## 
## Call:
## lm(formula = bodysatisf ~ c_figure + grade, data = ddf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.8523  -7.9442  -0.6461   9.6565  19.2377 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   25.676      7.555   3.398  0.00212 ** 
## c_figure      12.041      1.547   7.786 2.26e-08 ***
## grade         -7.501      3.340  -2.245  0.03313 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.93 on 27 degrees of freedom
## Multiple R-squared:  0.7089, Adjusted R-squared:  0.6873 
## F-statistic: 32.87 on 2 and 27 DF,  p-value: 5.82e-08
coefficients(m.r)
## (Intercept)    c_figure       grade 
##   25.675687   12.040666   -7.500898
# plot c_figure against bodysatisf
plot(ddf$bodysatisf ~ ddf$c_figure,
 xlab="Intensity of comparing figure (c_figure)",
 ylab="Body satisfaction (bodysatisf)")
# regression line based on intercept and c_figure coefficient only (grade is ignored)
abline(coefficients(m.r)[1:2])

# 3-d scatterplot
# require(rgl)
# plot3d(ddf$c_figure, ddf$bodysatisf, ddf$grade) # commented out to render

require(rockchalk)
plotPlane(model = m.r, plotx1 = "c_figure", plotx2 = "grade")

plotPlane(model = m.r, plotx1 = "c_figure", plotx2 = "grade", drawArrows = T)

Residualplot mit Base System

… oder mit Grafiken aus dem base System

ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
attach(ddf)
## Die folgenden Objekte sind maskiert von ddf (pos = 3):
## 
##     bmi, bmi_class, bmi_dichotom, bmi_normal, bodysatisf, c_bad_way,
##     c_dress, c_figure, c_good_way, c_phys_app, filling_time, gender,
##     grade, height, i_c_bad_way, name, pacs, pacs_sum, subj, weight
# the model
m.2 <- lm(bodysatisf ~ c_figure + grade)

# standardized residuals
plot(m.2$fitted.values, rstandard(m.2), 
     ylab="Standardized Residuals", 
     xlab="Fitted y", 
     main="Plot of Residuals against predicted Values") 
abline(0, 0)  

# nothing suspicious to see

# test of normal distribution of residuals
qqnorm(m.2$residuals)
# a line may be added
abline(0,sd(m.2$residuals))

# test of normal distribution of residuals, now with standardized values
qqnorm(scale(m.2$residuals))
# a line may be added
abline(0,1)

Ausflug: Semipartialkoeffizienten, erklärte Varianz der Reaktionsvariable und Koeffizienten (Gewichte)

Quelle

“Both types of coefficient are ways to standardize the effect of x_1 in the milieu of other predictors.”

  • Semipartial-Koeffizienten beschreiben den Zuwachs an erklärter Varianz in der Reaktionsvariablen durch Hinzunahme einer erklärenden Variablen
  • Der Semipartial-Koeffizient einer erklärenden Variablen x_1 ist die Korrelation der Reaktionsvariablen mit den Residuen von x_1, wenn alle anderen erklärenden Variablen x_i aus x_1 auspartialisiert sind.
  • die Residuen von x_1 als Prädiktor erhalten dasselbe Regressionsgewicht wie x_1 selbst
  • das multiple R^2 setzt sich zusammen aus Varianzanteilen:
    • der Summe der quadrierten Semipartial-Koeffizienten (spezifischer Varianzanteil, den die jeweilige erklärende Variable an der Varianz der Reaktionsvariablen bindet)
    • dem Rest der erklärten Varianz der Reaktionsvariablen, der durch Kovarianzen der Vorhersagevariablen erklärt wird

An einem Beispiel mit zwei erklärenden Variablen.

require(tibble)
## Lade nötiges Paket: tibble
tt <- tibble(
    Y  = c(14, 23, 30, 50, 39, 67),  # memory span
  X = c( 4, 4, 7, 7, 10, 10), # age
  T = c( 1, 2, 2, 4, 3, 6) # speech rate
)

m <- lm(Y ~ X + T, data=tt)
summary(m)
## 
## Call:
## lm(formula = Y ~ X + T, data = tt)
## 
## Residuals:
##      1      2      3      4      5      6 
## -1.167 -1.667  2.333  3.333 -1.167 -1.667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    1.667      3.598   0.463  0.67470   
## X              1.000      0.725   1.379  0.26162   
## T              9.500      1.087   8.736  0.00316 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.877 on 3 degrees of freedom
## Multiple R-squared:  0.9866, Adjusted R-squared:  0.9776 
## F-statistic: 110.1 on 2 and 3 DF,  p-value: 0.001559
cor(tt)
##           Y         X         T
## Y 1.0000000 0.8027961 0.9889517
## X 0.8027961 1.0000000 0.7500000
## T 0.9889517 0.7500000 1.0000000
# Squared semipartial correlation as increment in explained variance
# The squared semipartial correlation between X and Y is computed as
s.m <- summary(m)
s.m$r.squared - cor(tt$Y, tt$T)^2
## [1] 0.008528111
# ... This indicates that when X is entered last in the regression equation, it increases the multiple R^2 by .0085
# squared semipartial correlation between T and Y is
s.m$r.squared - cor(tt$Y, tt$X)^2
## [1] 0.342072
# ... T entered last increases multiple R^2 by 0.342072

# another way to calculate the same explained variance
sum1 <- summary(lm(Y ~ X + T, data=tt))
sum2 <- summary(lm(Y ~ T, data=tt))
sum1$r.squared - sum2$r.squared
## [1] 0.008528111
# Semipartial correlation as correlation of the reaction variable with a residual
# squared semipartial correlation for explanatory variable X
r.X  <- residuals(lm(X ~ T, data=tt))
r.YT <- residuals(lm(Y ~ T, data=tt))
cor(tt$Y, r.X)^2
## [1] 0.008528111
# squared semipartial correlation for explanatory variable T
m.T <- lm(T ~ X, data=tt)
r.T <- residuals(m.T)
cor(tt$Y, r.T)^2
## [1] 0.342072
# multiple R^2 is the squared correlation of fitted values and reaction variable
cor(tt$Y, m$fitted.values)^2
## [1] 0.9865536
# part of explained variance in Y that is not specific to one of the explanatory variables
cor(tt$Y, m$fitted.values)^2 - cor(tt$Y, r.X)^2 - cor(tt$Y, r.T)^2
## [1] 0.6359534
# we get the same coefficient for explanatory variable and residuals of explanatory variable
coefficients(lm(Y ~   X + T, data=tt))
## (Intercept)           X           T 
##    1.666667    1.000000    9.500000
coefficients(lm(Y ~ r.X + T, data=tt))
## (Intercept)         r.X           T 
##    5.291667    1.000000   10.625000
coefficients(lm(Y ~ X + r.T, data=tt))
## (Intercept)           X         r.T 
##   -3.083333    5.750000    9.500000
# we can even transform coefficients and partial correlations into one another (take care: this is the partial correlation, i.e. the correlation of two residuals, not the semipartial coefficient)
m$coefficients[2] * sqrt(var(r.X)/var(r.YT))
##        X 
## 0.622969
cor(r.X, r.YT)
## [1] 0.622969

Beispiele und Aufgaben

Vollziehen Sie die entsprechenden Kapitel in Field (2012) nach.

Vollziehen Sie die entsprechenden Kapitel in Everitt (2010) nach.

Anscombe’s Quartet

html

Autowaschen und Persönlichkeit

Die Daten und das Beispiel entstammen dem Buch [Everitt (2010), S. 84][everitt]

car_wash html


Angst und Persönlichkeit

Dies ist das einführende Beispiel verknappt und ohne erläuternde, konzeptionelle Kommentare.

lm_mult_example_bdi_bfi.html

[lm_mult_example_course_data.html] - [lm_mult_example_course_data.Rmd]

psych package data examples

require(psych)
#dd <- sat.act
dd <- bfi
#dd <- epi.bfi
dd[1:5,]
##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5
## 61621  4  4  6  5  5  4  4  3  5  5  5  3  4  4  4  2  5  2  4  1  3  3  4  3
## 61622  2  3  3  4  5  4  4  5  3  2  2  2  5  4  5  2  3  4  4  3  3  3  4  3
##       O5 gender education age
## 61617  3      1        NA  16
## 61618  3      2        NA  18
## 61620  2      2        NA  17
## 61621  5      2        NA  17
## 61622  3      1        NA  17
describe(dd)
##           vars    n  mean    sd median trimmed   mad min max range  skew
## A1           1 2784  2.41  1.41      2    2.23  1.48   1   6     5  0.83
## A2           2 2773  4.80  1.17      5    4.98  1.48   1   6     5 -1.12
## A3           3 2774  4.60  1.30      5    4.79  1.48   1   6     5 -1.00
## A4           4 2781  4.70  1.48      5    4.93  1.48   1   6     5 -1.03
## A5           5 2784  4.56  1.26      5    4.71  1.48   1   6     5 -0.85
## C1           6 2779  4.50  1.24      5    4.64  1.48   1   6     5 -0.85
## C2           7 2776  4.37  1.32      5    4.50  1.48   1   6     5 -0.74
## C3           8 2780  4.30  1.29      5    4.42  1.48   1   6     5 -0.69
## C4           9 2774  2.55  1.38      2    2.41  1.48   1   6     5  0.60
## C5          10 2784  3.30  1.63      3    3.25  1.48   1   6     5  0.07
## E1          11 2777  2.97  1.63      3    2.86  1.48   1   6     5  0.37
## E2          12 2784  3.14  1.61      3    3.06  1.48   1   6     5  0.22
## E3          13 2775  4.00  1.35      4    4.07  1.48   1   6     5 -0.47
## E4          14 2791  4.42  1.46      5    4.59  1.48   1   6     5 -0.82
## E5          15 2779  4.42  1.33      5    4.56  1.48   1   6     5 -0.78
## N1          16 2778  2.93  1.57      3    2.82  1.48   1   6     5  0.37
## N2          17 2779  3.51  1.53      4    3.51  1.48   1   6     5 -0.08
## N3          18 2789  3.22  1.60      3    3.16  1.48   1   6     5  0.15
## N4          19 2764  3.19  1.57      3    3.12  1.48   1   6     5  0.20
## N5          20 2771  2.97  1.62      3    2.85  1.48   1   6     5  0.37
## O1          21 2778  4.82  1.13      5    4.96  1.48   1   6     5 -0.90
## O2          22 2800  2.71  1.57      2    2.56  1.48   1   6     5  0.59
## O3          23 2772  4.44  1.22      5    4.56  1.48   1   6     5 -0.77
## O4          24 2786  4.89  1.22      5    5.10  1.48   1   6     5 -1.22
## O5          25 2780  2.49  1.33      2    2.34  1.48   1   6     5  0.74
## gender      26 2800  1.67  0.47      2    1.71  0.00   1   2     1 -0.73
## education   27 2577  3.19  1.11      3    3.22  1.48   1   5     4 -0.05
## age         28 2800 28.78 11.13     26   27.43 10.38   3  86    83  1.02
##           kurtosis   se
## A1           -0.31 0.03
## A2            1.05 0.02
## A3            0.44 0.02
## A4            0.04 0.03
## A5            0.16 0.02
## C1            0.30 0.02
## C2           -0.14 0.03
## C3           -0.13 0.02
## C4           -0.62 0.03
## C5           -1.22 0.03
## E1           -1.09 0.03
## E2           -1.15 0.03
## E3           -0.47 0.03
## E4           -0.30 0.03
## E5           -0.09 0.03
## N1           -1.01 0.03
## N2           -1.05 0.03
## N3           -1.18 0.03
## N4           -1.09 0.03
## N5           -1.06 0.03
## O1            0.43 0.02
## O2           -0.81 0.03
## O3            0.30 0.02
## O4            1.08 0.02
## O5           -0.24 0.03
## gender       -1.47 0.01
## education    -0.32 0.02
## age           0.56 0.21

Beispiel Affect

Beschreibung des Datensatzes Link

# read data
ddf <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/div/affect.txt", fileEncoding = "UTF-8")
require("psych")
describe(ddf) 
##          vars   n   mean    sd median trimmed    mad min max range  skew
## nr          1 330 165.50 95.41  165.5  165.50 122.31   1 330   329  0.00
## Study*      2 330   1.48  0.50    1.0    1.48   0.00   1   2     1  0.06
## Film        3 330   2.52  1.13    3.0    2.52   1.48   1   4     3 -0.03
## ext         4 330  13.16  4.47   13.0   13.38   4.45   0  22    22 -0.41
## neur        5 330  10.22  5.04   10.0   10.11   4.45   0  23    23  0.19
## imp         6 330   4.29  1.97    4.0    4.30   1.48   0   9     9 -0.02
## soc         7 330   7.51  2.86    8.0    7.75   2.97   0  13    13 -0.64
## lie         8 330   2.30  1.48    2.0    2.20   1.48   0   7     7  0.68
## traitanx    9 330  39.52  9.77   38.0   38.88   8.90  22  71    49  0.63
## state1     10 330  40.83 10.76   40.0   40.24  10.38  20  79    59  0.57
## EA1        11 330   9.19  7.11    8.0    8.67   8.90   0  29    29  0.52
## TA1        12 330  12.91  4.42   13.0   12.76   4.45   0  26    26  0.35
## PA1        13 330   8.89  6.79    8.0    8.28   7.41   0  29    29  0.70
## NA1        14 330   3.69  4.31    2.0    2.93   2.97   0  28    28  1.97
## EA2        15 330  11.01  6.85   11.0   10.82   7.41   0  29    29  0.21
## TA2        16 330  15.65  4.88   15.0   15.54   4.45   3  31    28  0.25
## PA2        17 330   8.98  6.43    8.0    8.52   7.41   0  29    29  0.53
## NA2        18 330   4.65  5.19    3.0    3.79   4.45   0  30    30  1.38
## state2     19 170  42.45 10.76   43.0   42.35  10.38  20  71    51  0.10
## MEQ        20 170  39.36 10.28   40.0   39.43  11.12  15  65    50 -0.06
## BDI        21 160  10.68  7.25   11.0   10.38   8.90   0  29    29  0.26
##          kurtosis   se
## nr          -1.21 5.25
## Study*      -2.00 0.03
## Film        -1.38 0.06
## ext         -0.15 0.25
## neur        -0.44 0.28
## imp         -0.67 0.11
## soc         -0.13 0.16
## lie          0.38 0.08
## traitanx     0.16 0.54
## state1       0.17 0.59
## EA1         -0.65 0.39
## TA1          0.39 0.24
## PA1         -0.13 0.37
## NA1          5.29 0.24
## EA2         -0.79 0.38
## TA2         -0.13 0.27
## PA2         -0.39 0.35
## NA2          1.92 0.29
## state2      -0.45 0.83
## MEQ         -0.38 0.79
## BDI         -0.91 0.57
# examples from the psych dataset
describeBy(ddf, ddf$Film)
## 
##  Descriptive statistics by group 
## group: 1
##          vars  n   mean    sd median trimmed    mad min max range  skew
## nr          1 83 158.58 94.72  143.0  157.63 106.75   6 326   320  0.10
## Study*      2 83   1.51  0.50    2.0    1.51   0.00   1   2     1 -0.02
## Film        3 83   1.00  0.00    1.0    1.00   0.00   1   1     0   NaN
## ext         4 83  13.40  4.35   13.0   13.43   4.45   2  22    20 -0.06
## neur        5 83  10.21  4.90   11.0   10.17   5.93   1  22    21  0.06
## imp         6 83   4.34  1.97    4.0    4.30   1.48   0   9     9  0.10
## soc         7 83   7.64  2.77    8.0    7.78   2.97   0  13    13 -0.42
## lie         8 83   2.36  1.50    2.0    2.30   1.48   0   7     7  0.42
## traitanx    9 83  39.91  9.60   39.0   39.55  10.38  23  61    38  0.31
## state1     10 83  39.81 10.75   38.0   39.22  10.38  20  69    49  0.56
## EA1        11 83   9.27  7.21    8.0    8.70   7.41   0  29    29  0.59
## TA1        12 83  12.49  4.80   12.9   12.35   4.30   0  26    26  0.37
## PA1        13 83   9.46  7.28    7.5    8.77   5.19   0  29    29  0.80
## NA1        14 83   3.50  3.73    2.0    2.94   2.97   0  17    17  1.41
## EA2        15 83  10.02  6.09    9.0    9.72   7.41   0  23    23  0.30
## TA2        16 83  17.48  4.74   18.0   17.55   4.45   8  27    19 -0.20
## PA2        17 83   7.27  5.93    6.0    6.63   5.93   0  23    23  0.78
## NA2        18 83   8.67  5.35    8.0    8.21   5.93   0  22    22  0.60
## state2     19 41  48.66 11.00   49.0   49.21  10.38  20  71    51 -0.45
## MEQ        20 41  38.39 11.49   40.0   38.00  13.34  16  65    49  0.26
## BDI        21 42  10.69  6.73   11.0   10.59   7.41   0  23    23  0.05
##          kurtosis    se
## nr          -1.14 10.40
## Study*      -2.02  0.06
## Film          NaN  0.00
## ext         -0.58  0.48
## neur        -0.76  0.54
## imp         -0.54  0.22
## soc         -0.53  0.30
## lie         -0.14  0.16
## traitanx    -0.81  1.05
## state1      -0.17  1.18
## EA1         -0.48  0.79
## TA1          0.74  0.53
## PA1         -0.14  0.80
## NA1          1.90  0.41
## EA2         -0.60  0.67
## TA2         -0.77  0.52
## PA2         -0.25  0.65
## NA2         -0.43  0.59
## state2       0.01  1.72
## MEQ         -0.40  1.79
## BDI         -1.25  1.04
## ------------------------------------------------------------ 
## group: 2
##          vars  n   mean    sd median trimmed    mad min max range  skew
## nr          1 78 164.23 94.57  174.5  164.98 114.16   8 316   308 -0.11
## Study*      2 78   1.47  0.50    1.0    1.47   0.00   1   2     1  0.10
## Film        3 78   2.00  0.00    2.0    2.00   0.00   2   2     0   NaN
## ext         4 78  12.38  4.55   12.0   12.47   2.97   1  21    20 -0.21
## neur        5 78   9.88  4.61   10.0    9.80   4.45   0  21    21  0.17
## imp         6 78   3.88  2.05    4.0    3.86   1.48   0   8     8  0.15
## soc         7 78   7.15  2.71    7.5    7.31   2.22   1  12    11 -0.49
## lie         8 78   2.15  1.47    2.0    2.02   1.48   0   7     7  0.97
## traitanx    9 78  39.87 10.14   38.0   39.33  10.38  22  71    49  0.69
## state1     10 78  41.44 10.41   41.0   41.03  10.38  22  79    57  0.67
## EA1        11 78   8.40  6.54    7.5    7.97   8.15   0  26    26  0.45
## TA1        12 78  12.62  3.84   13.0   12.51   2.97   5  25    20  0.45
## PA1        13 78   7.48  5.89    7.0    6.95   5.93   0  22    22  0.68
## NA1        14 78   3.36  4.45    2.0    2.57   2.97   0  28    28  2.73
## EA2        15 78  11.65  6.66   11.0   11.51   5.93   0  28    28  0.17
## TA2        16 78  18.33  5.15   19.0   18.34   4.45   6  31    25 -0.01
## PA2        17 78   8.08  6.18    7.0    7.56   7.41   0  26    26  0.74
## NA2        18 78   5.37  5.18    4.0    4.66   4.45   0  30    30  1.80
## state2     19 41  46.55  9.63   46.0   46.63   8.90  25  67    42 -0.01
## MEQ        20 41  39.63 10.04   39.0   39.58  10.38  16  58    42 -0.01
## BDI        21 37  10.65  7.19   11.0   10.48   8.90   0  26    26  0.11
##          kurtosis    se
## nr          -1.28 10.71
## Study*      -2.02  0.06
## Film          NaN  0.00
## ext         -0.37  0.52
## neur        -0.35  0.52
## imp         -0.78  0.23
## soc         -0.42  0.31
## lie          1.12  0.17
## traitanx     0.46  1.15
## state1       1.15  1.18
## EA1         -0.65  0.74
## TA1          0.65  0.44
## PA1         -0.31  0.67
## NA1         10.66  0.50
## EA2         -0.62  0.75
## TA2         -0.15  0.58
## PA2          0.15  0.70
## NA2          5.06  0.59
## state2      -0.21  1.50
## MEQ         -0.76  1.57
## BDI         -1.05  1.18
## ------------------------------------------------------------ 
## group: 3
##          vars  n   mean    sd median trimmed    mad min   max range  skew
## nr          1 85 165.85 94.21  160.0  165.68 124.54   1 330.0 329.0 -0.01
## Study*      2 85   1.51  0.50    2.0    1.51   0.00   1   2.0   1.0 -0.02
## Film        3 85   3.00  0.00    3.0    3.00   0.00   3   3.0   0.0   NaN
## ext         4 85  12.78  4.60   13.0   13.12   4.45   1  20.0  19.0 -0.64
## neur        5 85  10.61  5.14   10.0   10.30   4.45   0  23.0  23.0  0.46
## imp         6 85   4.12  1.98    4.0    4.07   1.48   0   8.0   8.0  0.23
## soc         7 85   7.31  3.15    8.0    7.60   2.97   0  12.0  12.0 -0.75
## lie         8 85   2.34  1.41    2.0    2.28   1.48   0   6.0   6.0  0.42
## traitanx    9 85  39.42  9.62   39.0   38.70   7.41  23  71.0  48.0  0.74
## state1     10 85  40.72 11.25   40.0   39.89  10.38  22  76.0  54.0  0.67
## EA1        11 85   9.42  7.24    8.0    8.87   8.90   0  29.0  29.0  0.55
## TA1        12 85  13.16  4.60   13.0   13.00   4.45   3  24.5  21.5  0.25
## PA1        13 85   8.95  6.91    8.0    8.29   5.93   0  28.0  28.0  0.77
## NA1        14 85   3.92  4.66    2.0    3.12   2.97   0  23.0  23.0  1.93
## EA2        15 85   9.07  6.90    7.0    8.66   8.90   0  27.0  27.0  0.46
## TA2        16 85  13.30  4.26   14.0   13.20   4.45   3  25.0  22.0  0.20
## PA2        17 85   8.75  6.45    8.0    8.23   7.41   0  26.0  26.0  0.57
## NA2        18 85   2.89  4.13    1.0    2.01   1.48   0  18.0  18.0  1.77
## state2     19 42  39.58  9.24   40.0   39.89   9.64  20  57.0  37.0 -0.20
## MEQ        20 42  38.64 10.20   38.5   38.82   8.15  15  59.0  44.0 -0.20
## BDI        21 43  10.81  7.99    9.0   10.37  10.38   0  29.0  29.0  0.40
##          kurtosis    se
## nr          -1.14 10.22
## Study*      -2.02  0.05
## Film          NaN  0.00
## ext         -0.12  0.50
## neur        -0.20  0.56
## imp         -0.87  0.22
## soc         -0.16  0.34
## lie         -0.35  0.15
## traitanx     0.51  1.04
## state1       0.33  1.22
## EA1         -0.59  0.78
## TA1         -0.19  0.50
## PA1          0.00  0.75
## NA1          4.61  0.50
## EA2         -0.87  0.75
## TA2         -0.08  0.46
## PA2         -0.42  0.70
## NA2          2.44  0.45
## state2      -0.83  1.43
## MEQ         -0.27  1.57
## BDI         -0.93  1.22
## ------------------------------------------------------------ 
## group: 4
##          vars  n   mean    sd median trimmed    mad min max range  skew
## nr          1 84 173.17 99.15  168.5  173.46 137.88  12 324   312 -0.01
## Study*      2 84   1.45  0.50    1.0    1.44   0.00   1   2     1  0.19
## Film        3 84   4.00  0.00    4.0    4.00   0.00   4   4     0   NaN
## ext         4 84  14.01  4.27   14.5   14.28   3.71   0  22    22 -0.64
## neur        5 84  10.13  5.51   10.0   10.11   5.93   0  23    23  0.04
## imp         6 84   4.81  1.81    5.0    4.93   1.48   0   8     8 -0.54
## soc         7 84   7.91  2.78    9.0    8.17   2.97   0  13    13 -0.77
## lie         8 84   2.31  1.57    2.0    2.19   1.48   0   7     7  0.86
## traitanx    9 84  38.89  9.87   37.0   38.07  10.38  23  68    45  0.74
## state1     10 84  41.39 10.71   40.0   40.99  11.12  21  66    45  0.38
## EA1        11 84   9.62  7.47    8.0    9.22   8.90   0  26    26  0.38
## TA1        12 84  13.35  4.37   13.0   13.21   4.45   4  24    20  0.38
## PA1        13 84   9.56  6.88    9.0    9.15   8.15   0  27    27  0.36
## NA1        14 84   3.96  4.40    2.0    3.18   2.97   0  18    18  1.45
## EA2        15 84  13.35  7.02   14.0   13.46   7.41   0  29    29 -0.13
## TA2        16 84  13.74  3.14   14.0   13.89   2.97   6  20    14 -0.45
## PA2        17 84  11.75  6.31   12.0   11.65   5.93   0  29    29  0.11
## NA2        18 84   1.79  2.88    0.0    1.16   0.00   0  12    12  1.81
## state2     19 46  35.89  8.05   36.0   35.71   7.41  20  54    34  0.22
## MEQ        20 46  40.63  9.59   41.5   40.97   9.64  20  59    39 -0.32
## BDI        21 38  10.55  7.25   10.0   10.28   8.90   0  26    26  0.33
##          kurtosis    se
## nr          -1.41 10.82
## Study*      -1.99  0.05
## Film          NaN  0.00
## ext          0.28  0.47
## neur        -0.81  0.60
## imp          0.01  0.20
## soc          0.11  0.30
## lie          0.67  0.17
## traitanx     0.20  1.08
## state1      -0.73  1.17
## EA1         -1.14  0.81
## TA1         -0.07  0.48
## PA1         -0.70  0.75
## NA1          1.60  0.48
## EA2         -0.81  0.77
## TA2         -0.28  0.34
## PA2         -0.38  0.69
## NA2          2.54  0.31
## state2      -0.67  1.19
## MEQ         -0.53  1.41
## BDI         -0.98  1.18
pairs.panels(ddf[14:17],bg=c("red","black","white","blue")[ddf$Film],pch=21, main="Affect varies by movies ")

errorCircles(13,14,ddf[-1],group=1,labels=c("Sad","Fear","Neutral","Humor"), main="Energetic and Tense Arousal by Movie condition")
## Warning in qf(1 - alpha/2, df11n, df11d): NaNs wurden erzeugt
## Warning in qf(1 - alpha/2, df11d, df11n): NaNs wurden erzeugt
## Warning in cor(data, use = use, method = method): Standardabweichung ist Null
## Warning in cor(new.data[, 1:nvar], use = "pairwise", method = method):
## Standardabweichung ist Null
## Warning in sqrt(xvals$nw - 2): NaNs wurden erzeugt
## Warning in sqrt(n - 3): NaNs wurden erzeugt
## Warning in cor(new.data[, 1:(nvar)], new.data[, (nvar + 1):ncol(new.data)], :
## Standardabweichung ist Null

errorCircles(15,16,ddf[-1],group=1,labels=c("Sad","Fear","Neutral","Humor"),main="Positive and Negative Affect by Movie condition")
## Warning in qf(1 - alpha/2, df11n, df11d): NaNs wurden erzeugt
## Warning in qf(1 - alpha/2, df11d, df11n): NaNs wurden erzeugt
## Warning in cor(data, use = use, method = method): Standardabweichung ist Null
## Warning in cor(new.data[, 1:nvar], use = "pairwise", method = method):
## Standardabweichung ist Null
## Warning in sqrt(xvals$nw - 2): NaNs wurden erzeugt
## Warning in sqrt(n - 3): NaNs wurden erzeugt
## Warning in cor(new.data[, 1:(nvar)], new.data[, (nvar + 1):ncol(new.data)], :
## Standardabweichung ist Null

Example: mood

msq dataset from library(psych), description

Example

[lm_simple_exercises.html] or [lm_simple_exercises.Rmd]

Exercise: partial and semipartial coefficients

Use the dataset with the PACS data. Compute partial and semipartial coefficients for a model in which bodysatisf is predicted by c_bad_way, c_figure, and grade.

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
m.full         <- lm(bodysatisf ~ c_bad_way + c_figure + grade, data = df)
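A minimal sketch of the residualization idea behind partial and semipartial correlations, using simulated stand-ins so it runs on its own; with the PACS data, `z`, `x`, and `y` would be replaced by `grade`, `c_bad_way`, and `bodysatisf` (the names below are illustrative, not from the dataset):

```r
# Partial vs. semipartial correlation via residualization (base R).
set.seed(1)
n <- 100
z <- rnorm(n)                        # control variable (e.g. grade)
x <- 0.6 * z + rnorm(n)              # predictor (e.g. c_bad_way)
y <- -0.5 * x + 0.3 * z + rnorm(n)   # response (e.g. bodysatisf)

x.res <- resid(lm(x ~ z))            # x with z partialled out
y.res <- resid(lm(y ~ z))            # y with z partialled out

r.partial     <- cor(x.res, y.res)   # partial correlation r_xy.z
r.semipartial <- cor(x.res, y)       # semipartial: z removed from x only
c(partial = r.partial, semipartial = r.semipartial)
```

The semipartial correlation is never larger in absolute value than the partial one, because the denominator of the partial correlation additionally removes the z-variance from y.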

diagnostic plots

dd <- readr::read_csv("https://md.psych.bio.uni-goettingen.de/mv/data/div/dusch_data.csv")
## Rows: 200 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): temperatur, genuss
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
temp_model <- lm(genuss ~ temperatur, data=dd)
dd$nr <- 1:nrow(dd)
dd$res <- temp_model$residuals
dd$res.s <- scale(temp_model$residuals)
dd$fit <- temp_model$fitted.values

# diagnostic plot 1
dd.f <- dd[order(dd$fit),]
ggplot(dd.f, aes(x=fit, y=res)) + geom_point()

#ggplot(dd.f, aes(x=fit, y=res.s)) + geom_point()

# diagnostic plot 3
ggplot(dd.f, aes(x=fit, y=sqrt(abs(res.s)))) + geom_point()

dd.r <- dd[order(dd$res),]
dd.r$nr.r <- 1:nrow(dd)
head(dd.r)
## # A tibble: 6 × 7
##   temperatur genuss    nr    res res.s[,1]   fit  nr.r
##        <dbl>  <dbl> <int>  <dbl>     <dbl> <dbl> <int>
## 1      11.5   -82.7   104 -144.      -5.00  61.1     1
## 2      -4.85 -129.    160  -96.5     -3.35 -32.6     2
## 3      -5.78 -130.    186  -92.3     -3.21 -37.9     3
## 4       9.65  -35.0    95  -85.7     -2.98  50.7     4
## 5       9.36  -36.4    59  -85.4     -2.97  49.1     5
## 6       9.66  -28.0    12  -78.8     -2.74  50.8     6
ggplot(dd.r, aes(x=nr.r, y=res)) + geom_point() + geom_point(aes(x=nr.r, y=sqrt(res)), color="red")
## Warning in sqrt(res): NaNs wurden erzeugt
## Warning: Removed 73 rows containing missing values (geom_point).
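The hand-built ggplot diagnostics above mirror base R's plot method for lm objects: `which = 1` is residuals vs. fitted values, `which = 3` is the scale-location plot. A self-contained sketch with simulated data standing in for the shower data (`temperatur`/`genuss`) loaded above:

```r
# Built-in diagnostic plots for an lm fit (simulated stand-in data).
set.seed(2)
sim <- data.frame(temperatur = runif(200, -10, 40))
sim$genuss <- 5 * sim$temperatur + rnorm(200, sd = 20)
m <- lm(genuss ~ temperatur, data = sim)
plot(m, which = 1)  # residuals vs fitted values
plot(m, which = 3)  # scale-location: sqrt(|standardized residuals|) vs fitted
```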

check

  • continuous response variable, more than one continuous explanatory variable

  • individual coefficients of the explanatory variables, and multiple R^2 vs. adjusted R^2 (cross-validation)
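Adjusted R^2 penalizes the number of predictors p: \(R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}\). A small check against summary(), on simulated data:

```r
# Adjusted R^2 by hand vs. summary(lm)$adj.r.squared.
set.seed(3)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + rnorm(n)
m  <- lm(y ~ x1 + x2, data = d)          # p = 2 predictors
r2  <- summary(m)$r.squared
adj <- 1 - (1 - r2) * (n - 1) / (n - 2 - 1)
c(by_hand = adj, from_summary = summary(m)$adj.r.squared)
```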

  • comparing coefficients (predictor importance, lm.beta())
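Standardized (beta) coefficients can also be obtained without the lm.beta package: z-standardize all variables, then the slopes are the beta weights and become comparable in size. A sketch on simulated data:

```r
# Beta weights = slopes of the regression on z-standardized variables.
set.seed(4)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(80)
m.std <- lm(scale(y) ~ scale(x1) + scale(x2), data = d)
coef(m.std)[-1]  # change in y (in SD) per one-SD change of each predictor
```

Equivalently, each beta weight is the raw slope rescaled by sd(x_j)/sd(y).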

  • multiple R and \(R^2\)
    formula, explanation, and visualization

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/div/sat.txt")
m.mv <- lm(univ_GPA ~ math_SAT + verb_SAT, data=df)
df$pred <- predict(m.mv)
cor(df$univ_GPA, df$pred)
## [1] 0.6857272
ggplot(df, aes(x=pred, y=univ_GPA)) + geom_point() + stat_smooth(method = "lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'

  • understanding the coefficients (identical for all observations, in contrast to MLM)

  • model comparisons (hierarchical, i.e. nested, models)
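Nested models can be compared with anova(), which tests whether the added predictor explains significantly more variance. With the SAT data above, one would compare lm(univ_GPA ~ math_SAT) against the full model; a self-contained sketch with simulated data:

```r
# F test of the R^2 increment between nested lm fits.
set.seed(5)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.8 * d$x1 + rnorm(100)
m.small <- lm(y ~ x1, data = d)
m.big   <- lm(y ~ x1 + x2, data = d)
anova(m.small, m.big)  # one row per model; Df = 1 for the added predictor
```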

  • prediction, also for new data (data frames with the same predictors but without the response variable; predict(lm-object, dataframe))

df <- read.delim("https://md.psych.bio.uni-goettingen.de/mv/data/div/sat.txt")
m.mv <- lm(univ_GPA ~ math_SAT + verb_SAT, data=df)
# estimate reaction variable univ_GPA values of new observations
predict(m.mv, data.frame(math_SAT = c(580, 630, 680), verb_SAT = c(560, 590, 710)))
##        1        2        3 
## 2.943406 3.176105 3.613269
  • relationship of the explanatory variables among each other (correlation matrix, uncorrelated, multicollinearity (VIF > 10? Tol < 0.1?)). The covariance of the explanatory variables enters as an additional explanatory effect. Cf. shared variance.
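The variance inflation factor can be computed by hand as \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor j on the remaining predictors; tolerance is its reciprocal (the car package offers vif() for the same purpose). A base-R sketch with two deliberately collinear predictors:

```r
# VIF and tolerance by hand for a strongly collinear predictor.
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + 0.3 * rnorm(n)   # strongly collinear with x1
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
tol_x1 <- 1 / vif_x1
c(VIF = vif_x1, Tol = tol_x1)     # VIF well above 1, small tolerance
```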

  • shared variance: residuals (partialling out), continuing the analysis with the residuals; example under shared variance

  • interaction terms and moderation analysis are covered in their own unit, lm_moderation

Screencast

Version: 17 May 2022, 16:59