rmd

Einführendes Beispiel Body-Data

Erklärungen und Hinweise in den Beispielen

Idee hier: Faktorisierung der PACS-Items und ein paar körperbezogene Items (Größe, Gewicht)

Pre-Tests:

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#ddf.f <- dplyr::select(ddf, c(c_phys_app:c_dress, i_c_bad_way, c_figure))
ddf.f <- dplyr::select(ddf, c(height, weight, c_phys_app:c_dress, i_c_bad_way, c_figure))
require(psych)
## Loading required package: psych
## Warning: package 'psych' was built under R version 3.2.5
# Ask Bartlett test whether correlation matrix is idendity matrix (all elements are 0), should be significant
# run Bartlett test on dataframe
cortest.bartlett(ddf.f)
## R was not square, finding R from data
## $chisq
## [1] 164.7114
## 
## $p.value
## [1] 2.695317e-24
## 
## $df
## [1] 21
# alternativly on correlation matrix (add n), here calculated on the fly using cor()
cortest.bartlett(cor(ddf.f), n=length(ddf.f[,1]))
## $chisq
## [1] 164.7114
## 
## $p.value
## [1] 2.695317e-24
## 
## $df
## [1] 21
# get the KMO coefficients
kmo <- KMO(ddf.f)
# show it
kmo
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = ddf.f)
## Overall MSA =  0.84
## MSA for each item = 
##      height      weight  c_phys_app  c_good_way     c_dress i_c_bad_way 
##        0.64        0.65        0.86        0.93        0.85        0.84 
##    c_figure 
##        0.85
# get ordered MSAi
kmo$MSAi[order(kmo$MSAi)]
##      height      weight i_c_bad_way    c_figure     c_dress  c_phys_app 
##   0.6410269   0.6513657   0.8431849   0.8459168   0.8487658   0.8618381 
##  c_good_way 
##   0.9303134

Anzahl der Komponenten

Via Parallelanalyse

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(dplyr)
#ddf.f <- dplyr::select(ddf, c(c_phys_app:c_dress, i_c_bad_way, c_figure))
ddf.f <- dplyr::select(ddf, c(height, weight, c_phys_app:c_dress, i_c_bad_way, c_figure))
require(psych)
# get parallel analysis for principal component analysis
# plot is generated automagically
res.p <- fa.parallel(ddf.f, fa="pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2
# parallel analysis suggests a one component solution

Nach der Parallelanalyse sicher eine, ev. zwei Komponenten.

mögliche R Befehle

  • princomp() im base package
  • prcomp() im base package
  • principal() aus library(psych)

Ansatz mit princom()

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(dplyr)
#ddf.f <- dplyr::select(ddf, c(c_phys_app:c_dress, i_c_bad_way, c_figure))
ddf.f <- dplyr::select(ddf, c(height, weight, c_phys_app:c_dress, i_c_bad_way, c_figure))

# fit model, run PCA
# use base package princomp()
fit <- princomp(ddf.f, cor=FALSE)

# simple scree plot
plot(fit,type="lines")

# print variance accounted for
summary(fit)
## Importance of components:
##                            Comp.1    Comp.2     Comp.3      Comp.4
## Standard deviation     21.9654645 8.2311689 2.68215274 0.810451473
## Proportion of Variance  0.8634056 0.1212431 0.01287363 0.001175407
## Cumulative Proportion   0.8634056 0.9846488 0.99752241 0.998697815
##                              Comp.5       Comp.6       Comp.7
## Standard deviation     0.6076003082 0.4308661782 0.4157559709
## Proportion of Variance 0.0006606479 0.0003322148 0.0003093222
## Cumulative Proportion  0.9993584630 0.9996906778 1.0000000000
# component scores (principal components) are available and may be used for further computations
head(round(fit$scores, 2))
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## [1,]   6.17   6.58  -3.46  -0.94  -0.53  -0.27  -0.19
## [2,]  12.92   4.85   2.35   1.38  -0.23  -0.45  -0.57
## [3,] -41.54  -9.01   2.64   0.33   0.01   0.65   0.11
## [4,]  12.91  11.14   2.03  -0.94   1.32  -0.07   0.12
## [5,]  10.53 -13.01  -2.00  -0.95   0.44   0.24  -0.42
## [6,]  31.59 -10.06   1.95   0.81   0.77  -0.03   0.52
# biplot(fit)

Gut zu sehen ist (Cumulative Var), dass dieselbe Anzahl an Komponenten wie Variablen die gesamte Varianz erklärt, die Original Korrelationsmatrix also repliziert.

Das Problem der subjektiven Festlegung der Anzahl der Komponenten bleibt, obwohl der Screeplot hilft.

princomp() erlaubt keine Anpassung auf eine bestimmte Anzahl von Komponenten und keine Rotation.

psych::principal() und noch mehr psych::fa() sind flexibler und damit häufig auch anschaulicher.

Ansatz mit principal() bzw. fa() aus der library(psych)

Hier mit einigen Erklärungen und Erläuterungen.

ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(dplyr)
#ddf.f <- dplyr::select(ddf, c(c_phys_app:c_dress, i_c_bad_way, c_figure))
ddf.f <- dplyr::select(ddf, c(height, weight, c_phys_app:c_dress, i_c_bad_way, c_figure))
require(psych)
# equivalence to base package's princomp()
# full model: as much components as items, all variance explained by components.
# Cumulative Proportion adds up to 1, communalities h2 are set to 1
psych::principal(ddf.f, rotate='none', nfactors=ncol(ddf.f))
## Principal Components Analysis
## Call: psych::principal(r = ddf.f, nfactors = ncol(ddf.f), rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              PC1   PC2   PC3   PC4   PC5   PC6   PC7 h2      u2 com
## height      0.36  0.81  0.41 -0.22  0.03  0.02  0.02  1 1.1e-16 2.1
## weight      0.41  0.78 -0.38  0.26  0.02  0.04 -0.02  1 1.1e-15 2.3
## c_phys_app  0.89 -0.22 -0.21 -0.25  0.02  0.23  0.01  1 1.0e-15 1.6
## c_good_way  0.91 -0.07  0.10  0.09 -0.38 -0.02  0.01  1 1.2e-15 1.4
## c_dress     0.93 -0.03 -0.20 -0.15  0.09 -0.20  0.14  1 1.8e-15 1.3
## i_c_bad_way 0.95 -0.12  0.05 -0.01  0.09 -0.09 -0.23  1 1.6e-15 1.2
## c_figure    0.84 -0.25  0.30  0.32  0.17  0.07  0.09  1 7.8e-16 1.9
## 
##                        PC1  PC2  PC3  PC4  PC5  PC6  PC7
## SS loadings           4.40 1.39 0.51 0.31 0.19 0.11 0.08
## Proportion Var        0.63 0.20 0.07 0.04 0.03 0.02 0.01
## Cumulative Var        0.63 0.83 0.90 0.95 0.97 0.99 1.00
## Proportion Explained  0.63 0.20 0.07 0.04 0.03 0.02 0.01
## Cumulative Proportion 0.63 0.83 0.90 0.95 0.97 0.99 1.00
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 7 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0 
##  with the empirical chi square  0  with prob <  NA 
## 
## Fit based upon off diagonal values = 1
# what suggests fa.parallel() of library(psych)
# already done above
# items.parallel <- fa.parallel(ddf, fa="pc")

# we go on with 2 components
pc.2 <- psych::principal(ddf.f, rotate='none', nfactors=2)
# default output
# h2 are communalities
# u2 are residuals
# h2 and u2 sum up to 1
pc.2
## Principal Components Analysis
## Call: psych::principal(r = ddf.f, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              PC1   PC2   h2    u2 com
## height      0.36  0.81 0.78 0.222 1.4
## weight      0.41  0.78 0.78 0.216 1.5
## c_phys_app  0.89 -0.22 0.83 0.166 1.1
## c_good_way  0.91 -0.07 0.84 0.162 1.0
## c_dress     0.93 -0.03 0.87 0.133 1.0
## i_c_bad_way 0.95 -0.12 0.93 0.073 1.0
## c_figure    0.84 -0.25 0.77 0.234 1.2
## 
##                        PC1  PC2
## SS loadings           4.40 1.39
## Proportion Var        0.63 0.20
## Cumulative Var        0.63 0.83
## Proportion Explained  0.76 0.24
## Cumulative Proportion 0.76 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  5.52  with prob <  0.7 
## 
## Fit based upon off diagonal values = 0.99
# graph it
plot(pc.2)

# get loadings
# alternatively available via fit$loadings
loadings(pc.2)
## 
## Loadings:
##             PC1    PC2   
## height       0.360  0.806
## weight       0.413  0.783
## c_phys_app   0.887 -0.218
## c_good_way   0.913       
## c_dress      0.931       
## i_c_bad_way  0.955 -0.124
## c_figure     0.840 -0.248
## 
##                  PC1   PC2
## SS loadings    4.402 1.393
## Proportion Var 0.629 0.199
## Cumulative Var 0.629 0.828
# loadings understands parameter cutoff. only loadings above cutoff will be printed. default is .1
# a higher cutoff might result in a better understanding of the structure
print(loadings(pc.2), cutoff= .45)
## 
## Loadings:
##             PC1    PC2   
## height              0.806
## weight              0.783
## c_phys_app   0.887       
## c_good_way   0.913       
## c_dress      0.931       
## i_c_bad_way  0.955       
## c_figure     0.840       
## 
##                  PC1   PC2
## SS loadings    4.402 1.393
## Proportion Var 0.629 0.199
## Cumulative Var 0.629 0.828
# does varimax rotation help?
pc.2.r <- principal(ddf.f, nfactors=2, rotate='varimax')
print(loadings(pc.2.r), cutoff= .3)
## 
## Loadings:
##             RC1   RC2  
## height            0.877
## weight            0.873
## c_phys_app  0.911      
## c_good_way  0.889      
## c_dress     0.896      
## i_c_bad_way 0.947      
## c_figure    0.875      
## 
##                  RC1   RC2
## SS loadings    4.115 1.680
## Proportion Var 0.588 0.240
## Cumulative Var 0.588 0.828
plot(pc.2.r)

# even better
print.psych(pc.2.r, cut=0.3, sort=T)
## Principal Components Analysis
## Call: principal(r = ddf.f, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##             item  RC1  RC2   h2    u2 com
## i_c_bad_way    6 0.95      0.93 0.073 1.1
## c_phys_app     3 0.91      0.83 0.166 1.0
## c_dress        5 0.90      0.87 0.133 1.2
## c_good_way     4 0.89      0.84 0.162 1.1
## c_figure       7 0.88      0.77 0.234 1.0
## height         1      0.88 0.78 0.222 1.0
## weight         2      0.87 0.78 0.216 1.1
## 
##                        RC1  RC2
## SS loadings           4.12 1.68
## Proportion Var        0.59 0.24
## Cumulative Var        0.59 0.83
## Proportion Explained  0.71 0.29
## Cumulative Proportion 0.71 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  5.52  with prob <  0.7 
## 
## Fit based upon off diagonal values = 0.99
# a quick look at some factor scores
# unrotated
head(round(pc.2$scores, 2))
##        PC1   PC2
## [1,] -1.15  0.62
## [2,]  0.51 -0.68
## [3,]  1.57  0.68
## [4,]  0.60 -0.28
## [5,] -0.98 -0.58
## [6,] -0.12 -1.87
# component scores are different for rotated components
head(round(pc.2.r$scores, 2))
##        RC1   RC2
## [1,] -1.29  0.24
## [2,]  0.69 -0.49
## [3,]  1.28  1.14
## [4,]  0.66 -0.08
## [5,] -0.76 -0.86
## [6,]  0.46 -1.82
# component scores may be saved to dataframe for further computations
ddf <- cbind(ddf.f, pc.2.r$scores)
head(ddf)
##   height weight c_phys_app c_good_way c_dress i_c_bad_way c_figure
## 1    180     60          2          1       2           2        1
## 2    176     54          3          4       3           5        5
## 3    180    110          5          5       5           5        5
## 4    182     52          4          5       4           4        3
## 5    160     62          3          2       2           2        1
## 6    156     41          3          4       3           3        4
##          RC1         RC2
## 1 -1.2905641  0.23508492
## 2  0.6941589 -0.48814640
## 3  1.2830310  1.13533746
## 4  0.6603628 -0.07911112
## 5 -0.7566471 -0.85680501
## 6  0.4631791 -1.81583204

in Output unter Standardized loadings based ...

h2 sind die Kommunalitäten (Anteil an Varianz aller Variablen, der durch die beiden Komponenten erklärt/repliziert wird).

u2 sind die uniqueness-Werte (siehe efa).

com Complexity index, wieviele Faktoren brauchen wir, um eine Variable zu erklären (siehe efa)

Eine Erhöhung des cutoffs macht die Tabelle der Factor-Loadings klarer:

Die Variablen sind gut bereits unrotiert einer der beiden Komponenten zuordenbar. Die Körperitems und die PACS-Items laden klar in je eine der beiden Komponenten. Zwei Komponenten erklären 83% der Varianz. Andersherum gesehen gehen 17% der Information verloren, wenn man mit 2 Komponenten anstelle von 12 Items in einem multivariaten Verfahren weiterrechnen würde.

Nach Varimax-Rotation findet sich dieselbe Struktur.

Mit den gespeicherten Komponenten-Werten kann man dann weiterrechnen.

Alles hier passiert zunächst ohne inhaltlichen Bezug. Hat man aber mit Komponenten gerechnete Ergebnisse, ist u. U. fraglich, was sie eigentlich aussagen. In diesem Beispiel lassen sich die beiden inhaltlichen Itemgruppen klar in den beiden Komponenten identifizieren.