Describing (reproducing) the covariance of a set of correlated variables by a few uncorrelated variables (components). The components are ordered by 'importance' (share of explained variance).
(Research) questions
Coefficients/parameters for each variable that express how much the variable contributes to forming the factor. In PCA, when computed from the correlation matrix: the correlations of the variables with the factor scores (the persons' values on that factor). They can also be computed by multiplying the unscaled loadings (the eigenvector) by the standard deviation of the respective factor.
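A minimal sketch of this relation (using the item data frame dd that is loaded below; the signs of a component may be flipped):
R.mat <- cor(dd)
ev <- eigen(R.mat)
# each eigenvector column times the component's standard deviation
# (the square root of its eigenvalue) gives the loadings
L.mat <- ev$vectors %*% diag(sqrt(ev$values))
round(L.mat[, 1:2], 2)  # compare with principal(dd, nfactors=2, rotate="none")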
Commands
princomp() in the base package
prcomp() in the base package
principal() from library(psych)
For a PCA used for variable reduction, rotation is not meaningful. It can be switched off via a flag: principal(..., rotate="none")
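A usage sketch (assuming a data frame dd containing the item variables):
require(psych)
# unrotated PCA with a fixed number of components
pc <- principal(dd, nfactors = 2, rotate = "none")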
The example is based on a study by Tanja Lischetzke et al. (2001, Diagnostica, Vol. 47, No. 4, pp. 167-177):
From the abstract: "Recognizing one's own feelings and the feelings of other people is an important competence in dealing with emotions and moods. The constructs of emotional self-attention and clarity about one's own feelings, so far studied mainly in the English-speaking world, are presented, and the conceptual separation of the constructs is transferred for the first time to the perception of other people's feelings."
The discussion here, however, refers only to the part concerning the perception of one's own feelings.
A (simulated) data set in the usual tab-delimited text format is available at: fa-ei.txt
The questions/items are rated from 1 to 4 (almost never / sometimes / often / almost always).
The question to be addressed here is whether the items above can be reduced to a few components and, if so, how many are sensible.
# get data
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_ei.txt")
require(psych)
## Loading required package: psych
# Bartlett test asks whether the correlation matrix is an identity matrix (all off-diagonal elements are 0); it should be significant
# run Bartlett test on dataframe
cortest.bartlett(dd)
## R was not square, finding R from data
## $chisq
## [1] 374.6439
##
## $p.value
## [1] 1.072894e-44
##
## $df
## [1] 66
# alternatively on a correlation matrix (add n), here calculated on the fly using cor()
cortest.bartlett(cor(dd), n=length(dd[,1]))
## $chisq
## [1] 374.6439
##
## $p.value
## [1] 1.072894e-44
##
## $df
## [1] 66
# get the KMO coefficients
kmo <- KMO(dd)
# show it
kmo
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = dd)
## Overall MSA = 0.81
## MSA for each item =
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12
## 0.64 0.86 0.77 0.82 0.74 0.78 0.67 0.88 0.83 0.84 0.74 0.82
# get ordered MSAi
kmo$MSAi[order(kmo$MSAi)]
## i1 i7 i11 i5 i3 i6 i4 i12
## 0.6395808 0.6744224 0.7404746 0.7435706 0.7730711 0.7834420 0.8164525 0.8177575
## i9 i10 i2 i8
## 0.8256334 0.8434389 0.8556920 0.8799089
Via parallel analysis: psych::fa.parallel()
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_ei.txt")
require(psych)
# get parallel analysis for principal component analysis
# plot is generated automagically
psych::fa.parallel(dd, fa="pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
# parallel analysis suggests a two component solution
Parallel analysis suggests two components.
princomp()
princomp() is already included in R's base system.
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_ei.txt")
# fit model, run PCA
# use base package princomp()
fit <- princomp(dd, cor=FALSE)  # cor=FALSE: the covariance (not the correlation) matrix is decomposed
# simple scree plot would be
# plot(fit,type="lines")
# but we already did fa.parallel
# print variance accounted for
summary(fit)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9385555 1.2467494 0.92256025 0.87555710 0.76591709
## Proportion of Variance 0.3644755 0.1507545 0.08254701 0.07434997 0.05689517
## Cumulative Proportion 0.3644755 0.5152299 0.59777695 0.67212692 0.72902208
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.7540646 0.71094546 0.68050758 0.6381646 0.59495713
## Proportion of Variance 0.0551479 0.04902125 0.04491359 0.0394982 0.03433074
## Cumulative Proportion 0.7841700 0.83319124 0.87810483 0.9176030 0.95193377
## Comp.11 Comp.12
## Standard deviation 0.50532069 0.49015043
## Proportion of Variance 0.02476544 0.02330079
## Cumulative Proportion 0.97669921 1.00000000
# component scores (principal components) are available and may be used for further computations
head(round(fit$scores, 2))
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## [1,] 0.67 1.07 0.45 1.03 -1.36 -0.75 0.15 0.77 -0.24 0.35
## [2,] 1.97 1.03 -0.05 0.63 0.15 -0.83 -0.84 0.36 0.26 0.53
## [3,] 0.22 2.09 0.16 0.76 -0.81 -0.93 -0.27 -0.05 -1.15 -0.06
## [4,] -1.34 -1.75 0.00 2.14 -0.78 0.14 0.65 1.04 0.50 0.52
## [5,] 1.38 1.96 -0.25 0.37 -0.09 1.66 -0.37 0.79 -0.33 -0.18
## [6,] 0.08 -0.32 -0.84 -1.15 0.09 0.06 -0.35 -0.04 -1.02 -0.26
## Comp.11 Comp.12
## [1,] 0.11 -0.14
## [2,] 0.46 0.20
## [3,] 0.55 0.30
## [4,] -0.21 0.54
## [5,] 0.09 -1.80
## [6,] -0.69 -0.05
# biplot(fit)
It is easy to see (Cumulative Var) that as many components as there are variables explain the total variance; the decomposed matrix (here the covariance matrix, since cor=FALSE) is thus reproduced exactly.
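Where the variance table comes from, as a small sketch: the proportions are the component variances (squared standard deviations) relative to their sum.
# Proportion of Variance and Cumulative Proportion by hand
round(fit$sdev^2 / sum(fit$sdev^2), 2)
round(cumsum(fit$sdev^2) / sum(fit$sdev^2), 2)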
The problem of subjectively choosing the number of components remains, although the scree plot helps.
princomp() allows neither fitting a fixed number of components nor rotation. psych::principal() and, even more so, psych::fa() are more flexible and therefore often more instructive.
principal() and fa() from library(psych)
Here with some explanations and commentary.
# get data
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_ei.txt")
# take a look at the data
head(dd)
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12
## 1 4 3 4 2 3 1 4 3 3 1 3 3
## 2 4 3 3 1 4 1 4 4 4 1 3 2
## 3 4 3 4 2 4 2 4 2 4 1 3 3
## 4 3 3 2 2 1 2 4 3 1 2 2 3
## 5 4 4 4 4 4 1 4 4 4 2 3 1
## 6 2 4 3 2 4 2 2 3 3 2 3 2
# we need library(psych)
require(psych)
# some descriptives
psych::describe(dd)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## i1 1 100 3.06 0.93 3 3.15 1.48 1 4 3 -0.56 -0.77 0.09
## i2 2 100 3.08 0.97 3 3.20 1.48 1 4 3 -0.68 -0.68 0.10
## i3 3 100 3.02 0.86 3 3.09 1.48 1 4 3 -0.50 -0.56 0.09
## i4 4 100 1.87 0.96 2 1.71 1.48 1 4 3 0.94 -0.10 0.10
## i5 5 100 3.00 0.93 3 3.09 1.48 1 4 3 -0.52 -0.75 0.09
## i6 6 100 1.91 0.96 2 1.77 1.48 1 4 3 0.78 -0.44 0.10
## i7 7 100 2.91 0.92 3 2.98 1.48 1 4 3 -0.36 -0.86 0.09
## i8 8 100 3.05 1.04 3 3.19 1.48 1 4 3 -0.74 -0.71 0.10
## i9 9 100 3.01 0.90 3 3.08 1.48 1 4 3 -0.42 -0.88 0.09
## i10 10 100 1.94 0.99 2 1.81 1.48 1 4 3 0.67 -0.74 0.10
## i11 11 100 2.90 0.75 3 2.91 0.00 1 4 3 -0.28 -0.27 0.07
## i12 12 100 1.86 0.92 2 1.76 1.48 1 4 3 0.66 -0.71 0.09
# Bartlett test asks whether the correlation matrix is an identity matrix (all off-diagonal elements are 0); it should be significant
cortest.bartlett(dd)
## R was not square, finding R from data
## $chisq
## [1] 374.6439
##
## $p.value
## [1] 1.072894e-44
##
## $df
## [1] 66
# alternatively on the correlation matrix (add n)
cortest.bartlett(cor(dd), n=nrow(dd))
## $chisq
## [1] 374.6439
##
## $p.value
## [1] 1.072894e-44
##
## $df
## [1] 66
# get KMO: are the data suitable for factoring?
psych::KMO(dd)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = dd)
## Overall MSA = 0.81
## MSA for each item =
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12
## 0.64 0.86 0.77 0.82 0.74 0.78 0.67 0.88 0.83 0.84 0.74 0.82
# the determinant (the 'area' of the data) should be higher than 0.00001, otherwise singularity may be looming
det(cor(dd))
## [1] 0.01871332
# fa.parallel() suggests 2 components
items.parallel <- fa.parallel(dd, fa="pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
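The idea behind parallel analysis, as a minimal sketch: compare the observed eigenvalues with the mean eigenvalues of random data of the same dimensions and keep the components whose observed eigenvalue is larger.
set.seed(1)
n <- nrow(dd); p <- ncol(dd)
# eigenvalues of 100 random data sets of the same size
rand.ev <- replicate(100, eigen(cor(matrix(rnorm(n * p), n, p)))$values)
round(eigen(cor(dd))$values, 2)  # observed eigenvalues
round(rowMeans(rand.ev), 2)      # mean random eigenvalues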
# the original correlation matrix, which the PCA will more or less reproduce
round(cor(dd), 2)
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12
## i1 1.00 0.18 0.35 -0.06 0.17 -0.23 0.18 0.14 0.43 -0.08 0.30 -0.11
## i2 0.18 1.00 0.14 -0.48 0.18 -0.52 0.21 0.57 0.18 -0.61 0.12 -0.60
## i3 0.35 0.14 1.00 -0.07 0.30 -0.11 0.13 0.25 0.34 -0.25 0.33 -0.05
## i4 -0.06 -0.48 -0.07 1.00 -0.12 0.63 -0.12 -0.45 -0.11 0.53 -0.15 0.54
## i5 0.17 0.18 0.30 -0.12 1.00 -0.04 0.18 0.31 0.26 -0.13 0.33 -0.11
## i6 -0.23 -0.52 -0.11 0.63 -0.04 1.00 -0.04 -0.47 -0.18 0.58 -0.04 0.45
## i7 0.18 0.21 0.13 -0.12 0.18 -0.04 1.00 0.18 0.17 -0.15 0.13 -0.03
## i8 0.14 0.57 0.25 -0.45 0.31 -0.47 0.18 1.00 0.23 -0.57 0.18 -0.53
## i9 0.43 0.18 0.34 -0.11 0.26 -0.18 0.17 0.23 1.00 -0.19 0.26 -0.13
## i10 -0.08 -0.61 -0.25 0.53 -0.13 0.58 -0.15 -0.57 -0.19 1.00 -0.17 0.48
## i11 0.30 0.12 0.33 -0.15 0.33 -0.04 0.13 0.18 0.26 -0.17 1.00 -0.17
## i12 -0.11 -0.60 -0.05 0.54 -0.11 0.45 -0.03 -0.53 -0.13 0.48 -0.17 1.00
require(psych)
# the counterpart of the base package's princomp(); principal() works from the correlation matrix
# full model: as many components as items, all variance explained by the components.
# Cumulative Proportion adds up to 1, communalities h2 are all 1
psych::principal(dd, rotate='none', nfactors=ncol(dd))
## Principal Components Analysis
## Call: psych::principal(r = dd, nfactors = ncol(dd), rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 h2
## i1 0.37 0.56 -0.51 0.18 -0.20 0.06 0.10 0.41 0.05 0.06 -0.14 0.13 1
## i2 0.78 -0.22 0.06 0.12 0.01 0.07 0.31 0.07 0.16 -0.34 0.27 0.11 1
## i3 0.38 0.58 -0.08 -0.18 0.45 -0.42 0.07 0.05 -0.29 -0.09 0.06 -0.04 1
## i4 -0.71 0.34 0.01 0.04 0.18 0.08 0.43 0.02 0.28 0.03 -0.02 -0.27 1
## i5 0.37 0.48 0.51 -0.24 0.11 0.39 -0.25 0.24 0.05 -0.14 -0.08 -0.05 1
## i6 -0.72 0.31 0.32 -0.05 -0.04 0.05 0.33 -0.15 -0.13 -0.05 -0.13 0.32 1
## i7 0.28 0.28 0.40 0.79 -0.12 -0.15 -0.04 -0.04 -0.09 0.02 -0.02 -0.09 1
## i8 0.77 -0.09 0.21 -0.05 0.21 0.12 0.18 0.03 0.03 0.50 0.10 0.09 1
## i9 0.42 0.53 -0.34 0.11 0.09 0.38 -0.08 -0.50 -0.02 -0.03 0.04 0.00 1
## i10 -0.78 0.22 -0.03 0.01 -0.21 0.24 0.00 0.17 -0.31 0.07 0.34 -0.05 1
## i11 0.36 0.53 0.16 -0.35 -0.53 -0.30 0.01 -0.16 0.17 0.06 0.10 -0.04 1
## i12 -0.70 0.32 -0.01 0.16 0.26 -0.16 -0.33 0.03 0.33 0.05 0.17 0.19 1
## u2 com
## i1 -2.2e-16 4.7
## i2 3.3e-16 2.5
## i3 1.0e-15 4.7
## i4 1.6e-15 3.2
## i5 4.4e-16 5.7
## i6 -1.6e-15 3.2
## i7 2.6e-15 2.3
## i8 1.4e-15 2.4
## i9 8.9e-16 4.8
## i10 1.3e-15 2.5
## i11 1.1e-16 5.0
## i12 4.4e-16 3.5
##
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## SS loadings 4.10 1.95 0.98 0.94 0.74 0.70 0.62 0.57 0.45 0.40 0.29
## Proportion Var 0.34 0.16 0.08 0.08 0.06 0.06 0.05 0.05 0.04 0.03 0.02
## Cumulative Var 0.34 0.50 0.59 0.66 0.73 0.78 0.84 0.88 0.92 0.95 0.98
## Proportion Explained 0.34 0.16 0.08 0.08 0.06 0.06 0.05 0.05 0.04 0.03 0.02
## Cumulative Proportion 0.34 0.50 0.59 0.66 0.73 0.78 0.84 0.88 0.92 0.95 0.98
## PC12
## SS loadings 0.27
## Proportion Var 0.02
## Cumulative Var 1.00
## Proportion Explained 0.02
## Cumulative Proportion 1.00
##
## Mean item complexity = 3.7
## Test of the hypothesis that 12 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
## with the empirical chi square 0 with prob < NA
##
## Fit based upon off diagonal values = 1
# what does fa.parallel() from library(psych) suggest?
# already done above
# items.parallel <- fa.parallel(dd, fa="pc")
# we go on with 2 components
pc.2 <- psych::principal(dd, rotate='none', nfactors=2)
# default output
# h2 are the communalities
# u2 is the uniqueness (1 - h2)
# h2 and u2 sum up to 1
pc.2
## Principal Components Analysis
## Call: psych::principal(r = dd, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## i1 0.37 0.56 0.45 0.55 1.7
## i2 0.78 -0.22 0.65 0.35 1.2
## i3 0.38 0.58 0.49 0.51 1.7
## i4 -0.71 0.34 0.61 0.39 1.4
## i5 0.37 0.48 0.36 0.64 1.9
## i6 -0.72 0.31 0.61 0.39 1.4
## i7 0.28 0.28 0.16 0.84 2.0
## i8 0.77 -0.09 0.60 0.40 1.0
## i9 0.42 0.53 0.46 0.54 1.9
## i10 -0.78 0.22 0.65 0.35 1.2
## i11 0.36 0.53 0.41 0.59 1.8
## i12 -0.70 0.32 0.59 0.41 1.4
##
## PC1 PC2
## SS loadings 4.10 1.95
## Proportion Var 0.34 0.16
## Cumulative Var 0.34 0.50
## Proportion Explained 0.68 0.32
## Cumulative Proportion 0.68 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.08
## with the empirical chi square 90.97 with prob < 2.7e-05
##
## Fit based upon off diagonal values = 0.93
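As a check (a minimal sketch using pc.2 from above): the communalities h2 are the row sums of the squared loadings, and u2 is their complement.
round(rowSums(pc.2$loadings^2), 2)      # reproduces the h2 column
round(1 - rowSums(pc.2$loadings^2), 2)  # reproduces the u2 column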
# graph it
plot(pc.2)
# get the loadings
# alternatively available via pc.2$loadings
loadings(pc.2)
##
## Loadings:
## PC1 PC2
## i1 0.368 0.560
## i2 0.777 -0.217
## i3 0.383 0.583
## i4 -0.705 0.342
## i5 0.366 0.480
## i6 -0.718 0.314
## i7 0.276 0.282
## i8 0.769
## i9 0.417 0.535
## i10 -0.776 0.223
## i11 0.361 0.528
## i12 -0.699 0.318
##
## PC1 PC2
## SS loadings 4.096 1.949
## Proportion Var 0.341 0.162
## Cumulative Var 0.341 0.504
# loadings() understands the parameter cutoff: only loadings above the cutoff are printed; the default is .1
# a higher cutoff may make the structure easier to see
print(loadings(pc.2), cutoff= .45)
##
## Loadings:
## PC1 PC2
## i1 0.560
## i2 0.777
## i3 0.583
## i4 -0.705
## i5 0.480
## i6 -0.718
## i7
## i8 0.769
## i9 0.535
## i10 -0.776
## i11 0.528
## i12 -0.699
##
## PC1 PC2
## SS loadings 4.096 1.949
## Proportion Var 0.341 0.162
## Cumulative Var 0.341 0.504
# does varimax rotation help?
pc.2.r <- principal(dd, nfactors=2, rotate='varimax')
print(loadings(pc.2.r), cutoff= .3)
##
## Loadings:
## RC1 RC2
## i1 0.667
## i2 -0.789
## i3 0.694
## i4 0.784
## i5 0.596
## i6 0.782
## i7 0.378
## i8 -0.721
## i9 0.667
## i10 0.790
## i11 0.636
## i12 0.767
##
## RC1 RC2
## SS loadings 3.631 2.414
## Proportion Var 0.303 0.201
## Cumulative Var 0.303 0.504
# even better: sorted, with small loadings suppressed
print.psych(pc.2.r, cut=0.3, sort=T)
## Principal Components Analysis
## Call: principal(r = dd, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item RC1 RC2 h2 u2 com
## i10 10 0.79 0.65 0.35 1.1
## i2 2 -0.79 0.65 0.35 1.1
## i4 4 0.78 0.61 0.39 1.0
## i6 6 0.78 0.61 0.39 1.0
## i12 12 0.77 0.59 0.41 1.0
## i8 8 -0.72 0.60 0.40 1.3
## i3 3 0.69 0.49 0.51 1.0
## i9 9 0.67 0.46 0.54 1.1
## i1 1 0.67 0.45 0.55 1.0
## i11 11 0.64 0.41 0.59 1.0
## i5 5 0.60 0.36 0.64 1.1
## i7 7 0.38 0.16 0.84 1.2
##
## RC1 RC2
## SS loadings 3.63 2.41
## Proportion Var 0.30 0.20
## Cumulative Var 0.30 0.50
## Proportion Explained 0.60 0.40
## Cumulative Proportion 0.60 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.08
## with the empirical chi square 90.97 with prob < 2.7e-05
##
## Fit based upon off diagonal values = 0.93
# a quick look at some factor scores
# unrotated
head(round(pc.2$scores, 2))
## PC1 PC2
## [1,] 0.37 0.81
## [2,] 1.02 0.63
## [3,] 0.22 1.59
## [4,] -0.85 -1.41
## [5,] 0.78 1.35
## [6,] 0.03 -0.21
# component scores are different for rotated components
head(round(pc.2.r$scores, 2))
## RC1 RC2
## [1,] 0.05 0.89
## [2,] -0.61 1.03
## [3,] 0.55 1.51
## [4,] 0.10 -1.64
## [5,] -0.06 1.56
## [6,] -0.12 -0.17
# component scores may be saved to dataframe for further computations
dd <- cbind(dd, pc.2.r$scores)
head(dd)
## i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 RC1 RC2
## 1 4 3 4 2 3 1 4 3 3 1 3 3 0.04794665 0.8910171
## 2 4 3 3 1 4 1 4 4 4 1 3 2 -0.60773589 1.0319947
## 3 4 3 4 2 4 2 4 2 4 1 3 3 0.54578505 1.5066073
## 4 3 3 2 2 1 2 4 3 1 2 2 3 0.09548343 -1.6403599
## 5 4 4 4 4 4 1 4 4 4 2 3 1 -0.06131421 1.5556293
## 6 2 4 3 2 4 2 2 3 3 2 3 2 -0.12486039 -0.1743935
# finally we might want to visualize the models
# we can use the usual plot() method
plot(pc.2)
plot(pc.2.r)
# or fa.diagram() from library(psych); parameter cut works like the print cutoff for the loadings, parameter simple=TRUE suppresses cross-loadings
fa.diagram(pc.2, simple=TRUE, cut=.2, digits=2)
fa.diagram(pc.2.r, simple=TRUE, cut=.2, digits=2)
In the output under 'Standardized loadings based ...':
h2 are the communalities (the share of a variable's variance that is explained/replicated by the two components).
u2 are the uniqueness values (see EFA).
com is the complexity index: how many factors do we need to explain a variable (see EFA)? The complexity index is a positive number indicating on average how many factors are used to explain each variable in a factor solution.
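The com column can be reproduced by hand. A sketch of Hoffman's complexity index, using pc.2 from above: the square of an item's sum of squared loadings, divided by the sum of its loadings raised to the fourth power.
lam <- unclass(pc.2$loadings)  # plain loading matrix
round(rowSums(lam^2)^2 / rowSums(lam^4), 1)  # matches the com column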
Raising the cutoff to .45 makes the table of factor loadings clearer:
Even unrotated, the variables can readily be assigned to one of the two components. The structure is clearly visible: the even and the odd items each load on one of the components. Item i7 cannot be unambiguously assigned to either component. Cross-loadings, i.e. 'misassignments', are not observed. Two components explain 50% of the variance. Seen the other way around, 50% of the information would be lost if we continued a multivariate analysis with the 2 components instead of the 12 items.
After varimax rotation the same structure emerges. Here the critical item i7 is at least marginally assigned (with a low loading) to the second component, like all the odd items.
We could then continue computing with the saved component scores.
All of this happens without any reference to content at first. But once results have been computed with components, what they actually mean may be questionable. In this example, however, the two content-based item groups can be clearly identified in the two components.
todo: adapt to EI data
dd.raw <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#dd.f <- dplyr::select(dd, c(c_phys_app:c_dress, i_c_bad_way, c_figure))
dd <- dplyr::select(dd.raw, c(height, weight, c_phys_app:c_dress, i_c_bad_way, c_figure))
require(psych)
pc.2 <- principal(dd, rotate='none', nfactors=2)
# model test: is the model with 2 components significantly different from the model with all components?
pc.2
## Principal Components Analysis
## Call: principal(r = dd, nfactors = 2, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## height 0.36 0.81 0.78 0.222 1.4
## weight 0.41 0.78 0.78 0.216 1.5
## c_phys_app 0.89 -0.22 0.83 0.166 1.1
## c_good_way 0.91 -0.07 0.84 0.162 1.0
## c_dress 0.93 -0.03 0.87 0.133 1.0
## i_c_bad_way 0.95 -0.12 0.93 0.073 1.0
## c_figure 0.84 -0.25 0.77 0.234 1.2
##
## PC1 PC2
## SS loadings 4.40 1.39
## Proportion Var 0.63 0.20
## Cumulative Var 0.63 0.83
## Proportion Explained 0.76 0.24
## Cumulative Proportion 0.76 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 5.52 with prob < 0.7
##
## Fit based upon off diagonal values = 0.99
# the original correlation matrix, which the PCA will more or less reproduce
round(cor(dd), 2)
## height weight c_phys_app c_good_way c_dress i_c_bad_way c_figure
## height 1.00 0.57 0.12 0.29 0.26 0.26 0.16
## weight 0.57 1.00 0.22 0.30 0.39 0.28 0.12
## c_phys_app 0.12 0.22 1.00 0.77 0.87 0.85 0.67
## c_good_way 0.29 0.30 0.77 1.00 0.79 0.85 0.78
## c_dress 0.26 0.39 0.87 0.79 1.00 0.88 0.69
## i_c_bad_way 0.26 0.28 0.85 0.85 0.88 1.00 0.83
## c_figure 0.16 0.12 0.67 0.78 0.69 0.83 1.00
# get the reproduced correlation matrix, which is calculated on the basis of the loadings
# the diagonal contains the communalities
round(factor.model(pc.2$loadings), 2)
## height weight c_phys_app c_good_way c_dress i_c_bad_way c_figure
## height 0.78 0.78 0.14 0.27 0.31 0.24 0.10
## weight 0.78 0.78 0.20 0.32 0.36 0.30 0.15
## c_phys_app 0.14 0.20 0.83 0.82 0.83 0.87 0.80
## c_good_way 0.27 0.32 0.82 0.84 0.85 0.88 0.78
## c_dress 0.31 0.36 0.83 0.85 0.87 0.89 0.79
## i_c_bad_way 0.24 0.30 0.87 0.88 0.89 0.93 0.83
## c_figure 0.10 0.15 0.80 0.78 0.79 0.83 0.77
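'Calculated on the basis of the loadings' means, as a minimal sketch: the reproduced matrix is simply the loading matrix multiplied by its transpose.
LL <- unclass(pc.2$loadings)
round(LL %*% t(LL), 2)  # same result as factor.model(pc.2$loadings)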
# get the residuals: the differences between the original and the reproduced correlation matrix
pc.diff <- cor(dd) - factor.model(pc.2$loadings)
round(pc.diff, 2)
## height weight c_phys_app c_good_way c_dress i_c_bad_way c_figure
## height 0.22 -0.21 -0.03 0.01 -0.05 0.02 0.06
## weight -0.21 0.22 0.03 -0.02 0.03 -0.02 -0.03
## c_phys_app -0.03 0.03 0.17 -0.06 0.04 -0.03 -0.12
## c_good_way 0.01 -0.02 -0.06 0.16 -0.06 -0.03 -0.01
## c_dress -0.05 0.03 0.04 -0.06 0.13 -0.01 -0.10
## i_c_bad_way 0.02 -0.02 -0.03 -0.03 -0.01 0.07 0.00
## c_figure 0.06 -0.03 -0.12 -0.01 -0.10 0.00 0.23
rm(pc.diff)
# get the same via a call to factor.residuals()
# the diagonal contains the uniquenesses
pc.diff.2 <- factor.residuals(cor(dd), pc.2$loadings)
round(pc.diff.2, 2)
## height weight c_phys_app c_good_way c_dress i_c_bad_way c_figure
## height 0.22 -0.21 -0.03 0.01 -0.05 0.02 0.06
## weight -0.21 0.22 0.03 -0.02 0.03 -0.02 -0.03
## c_phys_app -0.03 0.03 0.17 -0.06 0.04 -0.03 -0.12
## c_good_way 0.01 -0.02 -0.06 0.16 -0.06 -0.03 -0.01
## c_dress -0.05 0.03 0.04 -0.06 0.13 -0.01 -0.10
## i_c_bad_way 0.02 -0.02 -0.03 -0.03 -0.01 0.07 0.00
## c_figure 0.06 -0.03 -0.12 -0.01 -0.10 0.00 0.23
rm(pc.diff.2)
The diagonal of the reproduced correlation matrix contains the communalities. The diagonal of the differences (residuals) contains the specific shares (uniquenesses).
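Both diagonals can be read off directly (a small sketch using the objects from above):
round(diag(factor.model(pc.2$loadings)), 2)               # communalities h2
round(diag(factor.residuals(cor(dd), pc.2$loadings)), 2)  # uniquenesses u2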
A measure of fit quality can build on these residuals. The size of the residuals relative to the size of the correlations is the basic idea of a fit index: one minus the ratio of the sum of squared residuals to the sum of squared correlations is the value reported as 'Fit based upon off diagonal values = ...'. Values above .95 indicate a good fit.
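A minimal sketch of this index, assuming it is formed exactly as just described, using the objects from above:
rr <- cor(dd)
res <- factor.residuals(rr, pc.2$loadings)
off <- upper.tri(rr)  # off-diagonal elements only
1 - sum(res[off]^2) / sum(rr[off]^2)  # compare with the printed fit value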
The likelihood ratio test checks whether the number of factors is sufficient to explain the data adequately. It assumes normal distribution. The null hypothesis tested is that the data and the model with k factors are statistically equal. Including an additional factor has its 'price': the number of element equations enters into the model test.
https://aaronschlegel.me/principal-component-analysis-r-example.html
# data, 3 vars, 10 subjects
XX <- matrix(c(7,4,6,8,8,7,5,9,7,8,4,1,3,6,5,2,3,5,4,2,3,8,5,1,7,9,3,8,5,2), ncol=3)
colnames(XX) <- c("v1", "v2", "v3")
# center
cXX <- scale(XX, scale=FALSE)
# z transform
sXX <- scale(XX)
# correlation matrix
CC <- cor(XX)
VC <- cov(XX)
# invert the correlation matrix
iCC <- solve(CC)
# Solve for the roots of R
# i.e. get the eigenvalues
eigen(CC)
## eigen() decomposition
## $values
## [1] 1.7687741 0.9270759 0.3041499
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.6420046 0.38467229 0.6632174
## [2,] -0.6863616 0.09713033 -0.7207450
## [3,] 0.3416692 0.91792861 -0.2016662
# Notice that each eigenvalue satisfies |R - lambda*I| = 0,
# i.e. R - lambda*I is singular and its determinant is (numerically) zero
GG <- CC - eigen(CC)$values[1] * diag(3)  # diag(3) is the 3x3 identity matrix
det(GG)  # numerically zero
# - The sum of the eigenvalues = 3 = p, which is equal to the trace of R (i.e., the sum of the main diagonal elements).
# - The determinant of R is the product of the eigenvalues: λ1 × λ2 × λ3 = 0.499.
det(CC)
## [1] 0.4987414
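Both properties can be checked directly (a small sketch with the objects from above):
sum(eigen(CC)$values)   # = 3 = p, the trace of R
sum(diag(CC))           # the trace of R, also 3
prod(eigen(CC)$values)  # = det(CC) = 0.499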
# the V matrix
VV <- eigen(CC)$vectors
# Notice that V is orthogonal: multiplying V by its transpose yields the identity matrix, VV' = V'V = I.
VV %*% t(VV)
## [,1] [,2] [,3]
## [1,] 1.000000e+00 -1.110223e-16 5.551115e-17
## [2,] -1.110223e-16 1.000000e+00 2.775558e-17
## [3,] 5.551115e-17 2.775558e-17 1.000000e+00
# check with various PCA implementations
r.psych.nr <- psych::principal(XX, rotate='none', nfactors=ncol(XX))
r.psych <- psych::principal(XX, nfactors=ncol(XX))  # default rotation (varimax)
# prcomp() and princomp()
r.prcomp <- prcomp(XX, scale. = TRUE)
r.prcomp.nr <- prcomp(XX)
r.princomp <- princomp(XX, cor = TRUE)
# overwrite r.psych with the unrotated solution; the scores below refer to it
r.psych <- psych::principal(XX, rotate='none', nfactors=ncol(XX))
head(XX)
## v1 v2 v3
## [1,] 7 4 3
## [2,] 4 1 8
## [3,] 6 3 5
## [4,] 8 6 1
## [5,] 8 5 7
## [6,] 7 2 9
head(psych::predict.psych(r.psych, XX))
## PC1 PC2 PC3
## [1,] 0.3870911 -0.65517681 0.06076455
## [2,] -2.0000790 0.06523276 0.59998998
## [3,] -0.4391426 -0.30181899 0.28393399
## [4,] 1.5397222 -0.94473367 0.66414254
## [5,] 0.6641393 1.02944877 0.61929241
## [6,] -0.8148893 1.25520467 -0.81063301
head(r.psych$scores)
## PC1 PC2 PC3
## [1,] 0.3870911 -0.65517681 0.06076455
## [2,] -2.0000790 0.06523276 0.59998998
## [3,] -0.4391426 -0.30181899 0.28393399
## [4,] 1.5397222 -0.94473367 0.66414254
## [5,] 0.6641393 1.02944877 0.61929241
## [6,] -0.8148893 1.25520467 -0.81063301
# raw (uncentered) scores weighted with eigenvectors of the correlation matrix (scale mismatch)
XX[1,1] * eigen(CC)$vectors[1,1] + XX[1,2] * eigen(CC)$vectors[2,1] + XX[1,3] * eigen(CC)$vectors[3,1]
## v1
## -6.214471
# raw scores weighted with eigenvectors of the covariance matrix (centering is missing)
XX[1,1] * eigen(VC)$vectors[1,1] + XX[1,2] * eigen(VC)$vectors[2,1] + XX[1,3] * eigen(VC)$vectors[3,1]
## v1
## 0.910074
# z-standardized scores weighted with eigenvectors of the correlation matrix: an unstandardized component score
sXX[1,1] * eigen(CC)$vectors[1,1] + sXX[1,2] * eigen(CC)$vectors[2,1] + sXX[1,3] * eigen(CC)$vectors[3,1]
## v1
## -0.5148128
# raw scores weighted with the loadings from psych::principal()
XX[1,1] * r.psych$loadings[1,1] + XX[1,2] * r.psych$loadings[2,1] + XX[1,3] * r.psych$loadings[3,1]
## v1
## 8.264952
# XX %*% eigen(CC)$values
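None of the hand computations above reproduces r.psych$scores exactly. A minimal sketch of the missing steps (assumption: psych::principal standardizes its component scores): standardize the data, weight them by the eigenvectors, and divide each component by its standard deviation; the result matches r.psych$scores up to the sign of a component.
pcs <- sXX %*% eigen(CC)$vectors  # raw component scores from z-scores
pcs.std <- sweep(pcs, 2, sqrt(eigen(CC)$values), "/")  # standardize each component
head(round(pcs.std, 2))  # compare with head(r.psych$scores); signs may flip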
Example: body data
Example: head data from the Everitt book.
Example: crime data from the Everitt book.
Among the examples, the following cover PCA:
EI PCA
MSQ
R-Bloggers tutorial by John Quick [http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/]
Tutorial by Wollschläger: [http://www.uni-kiel.de/psychologie/rexrepos/posts/multFA.html]
Field (2012, p. 112) Chapter 3: R Environment
Version: June 14, 2021, 16:53