Korrelation, Partialkorrelation

Field (2012, p. 206)

Beispiel-Datensatz, der für alle Beispiele dieser Seite benutzt wird.

# read data
dfbmi <- read.delim("http://r.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi.txt", fileEncoding = "UTF-8")
# gender is a factor
dfbmi$gender <- factor(dfbmi$gender, levels = c(1,2), labels = c("f", "m"))
# invert item comp_bad_way and add it to dataframe
dfbmi["i_c_bad_way"] <- 6 - dfbmi$c_bad_way
# define comp items
c_item_names <- c("c_phys_app", "c_good_way", "c_dress", "i_c_bad_way", "c_figure")
# aggregate comp items to comp scale using apply()
dfbmi["comp"] <- apply(dfbmi[,c_item_names], 1, mean)
# create new variable bmi and add it to the dataframe
dfbmi["bmi"] <- dfbmi$weight / (dfbmi$height / 100) ** 2
dfbmi[1:5,]

##   nr    name gender height weight grade c_phys_app c_good_way c_dress
## 1  1   Meier      m    180     60  2.02          2          1       2
## 2  2  Müller      m    176     54  1.00          3          4       3
## 3  3  Strauß      f    180    110  2.03          5          5       5
## 4  4 Übeling      f    182     52  1.00          4          5       4
## 5  5   Ärger      f    160     62  1.00          3          2       2
##   c_bad_way c_figure filling_time i_c_bad_way comp   bmi
## 1         4        1        39.63           2  1.6 18.52
## 2         1        5        26.00           5  4.0 17.43
## 3         1        5        26.00           5  5.0 33.95
## 4         2        3        26.00           4  4.0 15.70
## 5         4        1        26.00           2  2.0 24.22

# create three groups, imagine as countries
dfbmi$nation <- factor(c(rep(1, 10), rep(2,10), rep(3,10)), levels = c(1,2,3), labels = c("Norway", "Switzerland", "Portugal"))

Korrelationen im R-Commander

Möglichkeiten Produkt-Moment Korrelation (Pearson) aber auch Rangkorrelatin (Spearman) unter

Statistik | Deskriptive Statistik | Korrelationsmatrix
Statistik | Deskriptive Statistik | Test auf Signifikanz der Korrelation

Beispiel: Größe / Gewicht / comp

Beispiel: c_phys_app / comp

Kontingenz-Koeffizient (Pearson's Chi-squared) für nominalskalierte Variablen

Statistik | Kontingenztabellen | Kreuztabelle

Beispiel: gender / nation

Partialkorrelationen unter

Statistik | Deskriptive Statistik | Korrelationsmatrix
im Folgedialog: Typ der Korrelation: Partialkorrelation

Beispiel: Größe / Gewicht / comp (körperl. Vergleichsintensität)

Covarianz

Aufsummierte Abweichungsquadrate, geteilt durch die Anzahl der Beobachtungen (-1).

\[ cov(x, y) = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{N - 1} \]

# classical covariance of height and weight
cov(dfbmi$height, dfbmi$weight)

## [1] 128

In seiner Höhe stark abhängig von Skalierung der Variablen.

Den größten Einfluss haben Beobachtungen, die in beiden Variablen zugleich weit weg vom jeweiligen Mittelwert sind.

Produkt-Moment-Korrelation als standardisierte Covarianz

Covarianz standardisieren durch Teilen durch Produkt der Standardabweichungen.

Korrelationskoeffizient \( -1 <= r <= +1 \)

\[ r(x, y) = \frac{cov(x, y)}{s_x s_y} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{(N - 1) s_x s_y} \]

# classical covariance of height and weight
# one single correlation
cor(dfbmi$height, dfbmi$weight)

## [1] 0.5651

# also correlation matrices
cor(dfbmi[c_item_names])

##             c_phys_app c_good_way c_dress i_c_bad_way c_figure
## c_phys_app      1.0000     0.7675  0.8699      0.8451   0.6739
## c_good_way      0.7675     1.0000  0.7912      0.8468   0.7778
## c_dress         0.8699     0.7912  1.0000      0.8791   0.6922
## i_c_bad_way     0.8451     0.8468  0.8791      1.0000   0.8325
## c_figure        0.6739     0.7778  0.6922      0.8325   1.0000

Auch standardisiert haben die Beobachtungen den größten Einfluss, die in beiden Variablen zugleich weit weg vom jeweiligen Mittelwert sind.

\( R^2 \) als erklärter Varianzanteil (Determinationskoeffizient)

Erklärter Varianzanteil. Gebundene Varianz. Anteil der erklärten Varianz an Gesamtvarianz. Varianzenverhältnis.

Korrelationsmatrix

# base package cor() supports correlation matrices
cor(dfbmi[c_item_names])

##             c_phys_app c_good_way c_dress i_c_bad_way c_figure
## c_phys_app      1.0000     0.7675  0.8699      0.8451   0.6739
## c_good_way      0.7675     1.0000  0.7912      0.8468   0.7778
## c_dress         0.8699     0.7912  1.0000      0.8791   0.6922
## i_c_bad_way     0.8451     0.8468  0.8791      1.0000   0.8325
## c_figure        0.6739     0.7778  0.6922      0.8325   1.0000

# Hmisc packages offers correlation matrices with automatic test against 0-correlation
require("Hmisc")

## Loading required package: Hmisc
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## 
## Die folgenden Objekte sind maskiert from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

cc <- rcorr.adjust(dfbmi[c_item_names], type="pearson", use="complete")

## Error: konnte Funktion "rcorr.adjust" nicht finden

Signifikanztest gegen Nullkorrelation

Sig. hängt nur von n und der Höhe des Koeffizienten ab

# one single correlation
cor(dfbmi$height, dfbmi$weight)

## [1] 0.5651

# cor.test() does not work with correlation matrices
cor.test(dfbmi$height, dfbmi$weight)

## 
##  Pearson's product-moment correlation
## 
## data:  dfbmi$height and dfbmi$weight
## t = 3.625, df = 28, p-value = 0.001138
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2572 0.7689
## sample estimates:
##    cor 
## 0.5651

#
require(ggm)

## Loading required package: ggm
## Loading required package: igraph
## 
## Attaching package: 'ggm'
## 
## Das folgende Objekt ist maskiert from 'package:Hmisc':
## 
##     rcorr

# to get p-values
rcorr(dfbmi$height, dfbmi$weight)

## Error: unbenutztes Argument (dfbmi$weight)

# use rcorr() for several variables simuntaneously
# as.matrix() converts a dataframe to matrix format
rcorr(as.matrix(dfbmi[c_item_names]))

##         [,1]    [,2]
## [1,]  1.0000 -0.2359
## [2,] -0.2359  1.0000

cor() listet lediglich die Korrelationen auf, aber auch in Matrizenform. rcor() gibt für Matrizen sowohl Korrelationen, als auch Signifikanztests aus. rcorr() liefert Details für Korrelationen wie z. B. Konfidenzintervalle. Leider nur für eine Korrelation und nicht für Korrelationsmatrizen.

Signifikanztest zweier Korrelationen gegeneinander

field S. 238

über Z-Transformation der Korrelationskoeffizienten.

Keine spezielle R-Funktion.

Online-Programm hierzu (und zu vielem mehr):

quelle

COMPARING TWO CORRELATION COEFFICIENTS

(computing significance of the difference)

A function com.cor(x,nx,y,ny) returns significance level for the difference between two correlation coefficients (x and y are the correlation coefficients, nx and ny are the respective sample sizes).

# com.cor(x, nx, y, ny):
com.cor=function(x, nx, y, ny){
z1=0.5*log((1+x)/(1-x))
z2=0.5*log((1+y)/(1-y))
w=sqrt(1/(nx-1)+2/(nx-1)^2+1/(ny-1)+2/(ny-1)^2)
2*pnorm(-abs((z1-z2)/w),0,1)
} 
com.cor(cor(dfbmi$c_dress, dfbmi$i_c_bad_way), 
        length(dfbmi$c_dress),
        cor(dfbmi$c_dress, dfbmi$c_figure), 
        length(dfbmi$c_dress)
        )

## [1] 0.05574

Vergleich zweier Korrelationskoeffizienten auf signifikanten Unterschied online

Rangkorrelation, Nonparametrische Korrelation, Spearman Korrelation, Kendall's tau

Der Befehl cor() kennt den method Parameter. Aus der Hilfe: method = c("pearson", "kendall", "spearman")

# one spearman correlation
cor(dfbmi$c_phys_app, dfbmi$c_good_way, method = "spearman")

## [1] 0.7635

# details for this rank correlation
cor.test(dfbmi$c_phys_app, dfbmi$c_good_way, method = "spearman")

## Warning: Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  dfbmi$c_phys_app and dfbmi$c_good_way
## S = 1063, p-value = 9.244e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##    rho 
## 0.7635

Kendal Tau b (Rangkorrelation, zu bevorzugen bei kleinen Stichproben und vielen verbundenen Rängen) Spearman (Rangkorrelation)

Korrelation bei kategorialen Variablen

Vierfelder-Tafel

Kontingenz-Koeffizienten

.Table <- xtabs(~gender+nation, data=dfbmi)
.Table

##       nation
## gender Norway Switzerland Portugal
##      f      7           5        2
##      m      3           5        8

.Test <- chisq.test(.Table, correct=FALSE)

## Warning: Chi-squared approximation may be incorrect

.Test

## 
##  Pearson's Chi-squared test
## 
## data:  .Table
## X-squared = 5.089, df = 2, p-value = 0.0785

Punktbiseriale Korrelation

Echte dichotome Variable z. B. schwanger.

Berechnung durch Pearson Corr mit einer dichotomen Variable (0, 1). Die dichotome Varible muss vom Typ numerisch sein. ggf. Umwandlung.

# pearson corr point-biseral, weight * gender
# convert gender to a numeric vector
# create vector of missings with length of weight/gender
dfbmi$n_gender <- rep(NA, length(dfbmi$gender))
dfbmi$n_gender[dfbmi$gender == 'f'] <- 1
dfbmi$n_gender[dfbmi$gender == 'm'] <- 2
# now calculate the correlation
cor(dfbmi$weight, dfbmi$n_gender)

## [1] 0.1613

# details for this correlation
cor.test(dfbmi$weight, dfbmi$n_gender)

## 
##  Pearson's product-moment correlation
## 
## data:  dfbmi$weight and dfbmi$n_gender
## t = 0.8647, df = 28, p-value = 0.3946
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2113  0.4929
## sample estimates:
##    cor 
## 0.1613

Biseriale Korrelation

dichotomisierte kontinuierliche Variable (künstliche Dichotomisierung) z. B. bestehen oder nicht bestehen einer Prüfung aufgrund einer Prüfungsleistung

Berechnung der Punktbiserialen Korrelation, dann Korrekturformel (biserialel Korrelation ist immer etwas höher als die punktbiseriale Korrelation).

Korrekturformel field S. 230:

\[ r_b = \frac{r_{pb}sqrt{pq}}{y} \]

\( r_b \) : biserialer Korrelationskoeffizient p: Menge der Beobachtungen in größerer Kategorie q: Menge der Beobachtungen in kleinerer Kategorie

# pearson corr biseral, weight * grade
# median split of grade, make it numeric
dfbmi$n_grade_class <- rep(NA, length(dfbmi$grade))
dfbmi$n_grade_class[dfbmi$grade <= median(dfbmi$grade)] <- 0
dfbmi$n_grade_class[dfbmi$grade >  median(dfbmi$grade)] <- 1
# now calculate the correlation
cor(dfbmi$weight, dfbmi$n_grade_class)

## [1] 0.3795

# details for this correlation
cor.test(dfbmi$weight, dfbmi$n_grade_class)

## 
##  Pearson's product-moment correlation
## 
## data:  dfbmi$weight and dfbmi$n_grade_class
## t = 2.171, df = 28, p-value = 0.0386
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02228 0.65080
## sample estimates:
##    cor 
## 0.3795

# ... missing parts ...

Alternativberechnung mit polyserial() aus dem Paket polycor.

# Library polycor is required
require(polycor)

## Loading required package: polycor
## Loading required package: mvtnorm
## Loading required package: sfsmisc
## 
## Attaching package: 'sfsmisc'
## 
## Das folgende Objekt ist maskiert from 'package:Hmisc':
## 
##     errbar

# pearson corr biseral, weight * grade
# median split of grade, make it numeric
dfbmi$n_grade_class <- rep(NA, length(dfbmi$grade))
dfbmi$n_grade_class[dfbmi$grade <= median(dfbmi$grade)] <- 0
dfbmi$n_grade_class[dfbmi$grade >  median(dfbmi$grade)] <- 1
# now calculate the correlation
polyserial(dfbmi$comp, dfbmi$n_grade_class)

## [1] -0.0333

Partial und Semipartial-Korrelation

Einfluss einer dritten Variable auf die Enge des Zusammenhangs von zwei Variablen.

Das Package ggm stellt pcor() bzw. pcor.test() für die Berechnung von Partialkorrelationen zur Verfügung.

gemeinsame Varianz

# Library ggm is required
require(ggm)
# pearson correlation weight / height / comp
cor(c(dfbmi$weight, dfbmi$height, dfbmi$comp))

## Error: supply both 'x' and 'y' or a matrix-like 'x'

cor.test(dfbmi$weight, dfbmi$height            )

## 
##  Pearson's product-moment correlation
## 
## data:  dfbmi$weight and dfbmi$height
## t = 3.625, df = 28, p-value = 0.001138
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2572 0.7689
## sample estimates:
##    cor 
## 0.5651

cor.test(dfbmi$weight,               dfbmi$comp)

## 
##  Pearson's product-moment correlation
## 
## data:  dfbmi$weight and dfbmi$comp
## t = 1.591, df = 28, p-value = 0.1229
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08073  0.58727
## sample estimates:
##    cor 
## 0.2879

cor.test(              dfbmi$height, dfbmi$comp)

## 
##  Pearson's product-moment correlation
## 
## data:  dfbmi$height and dfbmi$comp
## t = 1.31, df = 28, p-value = 0.2007
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1313  0.5528
## sample estimates:
##    cor 
## 0.2404

# now calculate the partial correlation
pc <- pcor(c("weight", "height", "comp"), var(dfbmi))

## Warning: NAs durch Umwandlung erzeugt

pc

## [1] 0.5335

pc^2

## [1] 0.2846

# general form: pcor.test(pcor-object, number-of-control-variables, sample-size)
pcor.test(pc, 1, length(dfbmi$height))

## $tval
## [1] 3.278
## 
## $df
## [1] 27
## 
## $pvalue
## [1] 0.00288

Auspartialisieren entspricht MR und Weiterrechnen mit den Residuen. Dabei sind die beiden interessierenden Variablen Kriterium (output var) und die auszupartialisierende Var Prädiktor. vgl. [../lm_simple] vgl. [../lm_multiple]

Bei Semipartial-Koeffizienten (part correlation) wird der Einfluss der auszupartialisierenden Var auf nur eine der beiden interessierenden (korrelierten) Variablen herausgerechnet. Praktisch besprochen in MR [../lm_multiple]

zero-order Var ist Produkt-Moment Korrelation. first-order partial correlation: der Einfluss einer Variable wurde auspartialisiert. second-order partial correlation: der Einfluss von zwei Variablen wird auspartialisiert usw.

Visualisierungen

Scatterplot, Multiple Scatterplots,

# simple plot of scattergram of to variables
plot(dfbmi$c_phys_app, dfbmi$c_good_way)

plot of chunk unnamed-chunk-13

# scatterplot with ggplot 2
require(ggplot2)

## Loading required package: ggplot2

pplot <- ggplot(dfbmi, aes(x=height, y=weight))
# add a geom_point to base plot object
pplot + 
  geom_point()