example_head

Versteh-Beispiel: Head Data (Everitt, 2010)

Minimal-Versteh-Beispiel head data [Everitt (2010, p. 196)][everitt]

Kopfgröße (head length in mm) der ersten beiden Söhne von 25 Familien. Zwei Variablen. Wie sehen die Komponenten aus? Was kann aus ihnen gelesen werden?

Bei zwei Variablen erklären zwei Komponenten die gesamte, erklärbare Varianz (Kovarianz).

Hauptkomponentenanalyse kann auf Basis der Rohwertmatrix oder auf Basis der Varianz-Kovarianz-Matrix berechnet werden.

leichte Abweichungen bei der Standardabweichung im Vergleich zur Berechnung mit Var-Covar-Matr
individuelle Vorhersage Familie 3 als Beispiel, wie man aus den factor-loadings die vorhergesagten Werte (Faktorwerte) berechnen könnte

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/be/head.txt")
head(ddf)

##   fam hlfs hlss
## 1   1  191  179
## 2   2  195  201
## 3   3  181  185
## 4   4  183  188
## 5   5  176  171
## 6   6  208  192

# we are only interested in first and second son's head size
sizes <- c('hlfs', 'hlss')
# get means of size of head of first and second son
apply(ddf[sizes], 2, mean)

##   hlfs   hlss 
## 185.72 183.84

# covariance matrix
var(ddf[sizes])

##          hlfs      hlss
## hlfs 95.29333  69.66167
## hlss 69.66167 100.80667

# pca based on raw data
# you may notice small differences in sd compared to pca based on covariance matrix
pca.h <- princomp(ddf[sizes])
print(summary(pca.h),digits=3,loadings=TRUE)

## Importance of components:
##                            Comp.1    Comp.2
## Standard deviation     12.6907660 5.2154059
## Proportion of Variance  0.8555135 0.1444865
## Cumulative Proportion   0.8555135 1.0000000
## 
## Loadings:
##      Comp.1 Comp.2
## hlfs  0.693  0.721
## hlss  0.721 -0.693

# pca based on covariance matrix
pca.cov <- princomp(covmat=var(ddf[sizes]))
print(summary(pca.cov),digits=3,loadings=TRUE)

## Importance of components:
##                            Comp.1    Comp.2
## Standard deviation     12.9524588 5.3229513
## Proportion of Variance  0.8555135 0.1444865
## Cumulative Proportion   0.8555135 1.0000000
## 
## Loadings:
##      Comp.1 Comp.2
## hlfs  0.693  0.721
## hlss  0.721 -0.693

# values of family 3
ddf[3,]

##   fam hlfs hlss
## 3   3  181  185

# factor loadings for each component can be found in resultobject$loadings
g1 <- pca.h$loadings[,1]
g2 <- pca.h$loadings[,2]
# get factor scores (individual value of the two components) for family 3
sum(ddf[3,2:3] * g1)

## [1] 258.8064

sum(ddf[3,2:3] * g2)

## [1] 2.289788

# predict family 3 component 1 manually
# multiply difference from mean by factor loading
sum((ddf[3,sizes] - apply(ddf[sizes], 2, mean)) * g1)

## [1] -2.43459

# check against scores provided by princomp
princomp(ddf[sizes], scores=T)$scores[3,]

##    Comp.1    Comp.2 
## -2.434590 -4.206753

# add factor scores to data.frame
ddf <- cbind(ddf, princomp(ddf[sizes], scores=T)$scores)

# a visualization via included plot()
plot(pca.h)

# you may apply what you have found in pca to new data
ddf.new <- data.frame(
    hlfs = c(166, 168, 185),
    hlss = c(157, 170, 190)
    )
predict(pca.h, ddf.new)

##         Comp.1    Comp.2
## [1,] -33.01601  4.382582
## [2,] -22.25767 -3.184331
## [3,]   3.94211 -4.787878

rm(ddf.new)

# what is different using prcomp()

# using principal() of library(psych)
require(psych)

## Loading required package: psych

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

pca.h2 <- principal(ddf[sizes], nfactors=2, rotate="none")
pca.h2

## Principal Components Analysis
## Call: principal(r = ddf[sizes], nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       PC1   PC2 h2 u2 com
## hlfs 0.92 -0.38  1  0 1.3
## hlss 0.92  0.38  1  0 1.3
## 
##                        PC1  PC2
## SS loadings           1.71 0.29
## Proportion Var        0.86 0.14
## Cumulative Var        0.86 1.00
## Proportion Explained  0.86 0.14
## Cumulative Proportion 0.86 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1

# component scores are available
head(pca.h2$scores)

##              PC1        PC2
## [1,]  0.03180061 -1.3449328
## [2,]  1.43791839  0.9972252
## [3,] -0.19893757  0.7876139
## [4,]  0.07335984  0.9110939
## [5,] -1.22967480 -0.3722605
## [6,]  1.67326381 -1.9322253

# we plot the components weight in the scatterplot of the two head sizes
ddf %>% ggplot(aes(x=hlfs, y=hlss, color=pca.h$scores[,1])) + geom_point()

# ddf %>% ggplot(aes(x=hlfs, y=hlss, color=pca.h$scores[,2])) + geom_point()


ddf %>% ggplot(aes(x=pca.h$scores[,1], y=pca.h$scores[,2], color=hlss)) + geom_point()

# ddf %>% ggplot(aes(x=pca.h$scores[,1], y=pca.h$scores[,2], color=hlfs)) + geom_point()

‘Loadings’ zeigt die Gewichte, mit denen die Komponenten-Werte gebildet werden können. ‘Proportion Var’ ist der Anteil an erklärter Varianz, den die jeweilige Komponente erklärt. ‘Cumulative Var’ ist der kumulierte Anteil an der Varianz, der durch alle höheren Komponenten inkl. der aktuellen erkärt wird.

Ein Komponenten-Wert ist eine Linearkombination aus den Rohwerten der Variablen und den Gewichten (loadings) für eine Komponente. Konkret wird oben die Abweichungen einer Beobachtung (Vp) vom Gruppenmittelwert multipliziert mit dem zugehörigen loading. Dies passiert für alle Variablen in der Komponente und wird aufaddiert.

Die Komponenten-Werte können im Quell-Data-Frame mit abgespeichert werden. Dies ist u. U. sinnvoll, wenn man die Anzahl der Variablen verringern möchte und mit den Komponenten weiterrechnen will.

Komponenten-Gewichte (Koeffizienten) können auch auf neue Daten angewendet werden. Dazu müssen die Gewichte von einem Data-Frame generiert worden sein und die Variablen (Columns) müssen gleich benannt sein.

example_head

rmd

Versteh-Beispiel: Head Data (Everitt, 2010)