Chapter 7 Row-wise Transformations

7.1 Base R

Base R has a whole family of functions for repeatedly applying the same operation: https://statologie.de/apply-lapply-sapply-und-tapply-r/

It is illustrated here with row-wise and column-wise operations using apply().

7.2 Vectorization

Vectorization is the property of R that an operation is automatically applied to all, or to selected, elements of a structure such as a vector. Data transformations rely on this behavior.

For two-dimensional data objects (tibbles, data frames, matrices), the relevant part has to be selected before it is transformed. If data frames are organized in the usual way (rows are observations, columns are variables), transformations are typically operations performed within an observation, that is, row-wise (horizontally). In principle, R can transform in either direction.
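As a minimal sketch of a row-wise transformation that relies on vectorization alone: arithmetic on whole columns is carried out element by element, so every observation gets its own result. The data frame `grades` and the column `grade_mean` are made up for this illustration.

```r
# hypothetical mini data frame: one row per observation, one column per variable
grades <- data.frame(
  subj   = c(1, 2, 3),
  grade1 = c(1.0, 2.7, 3.7),
  grade2 = c(4.0, 3.0, 1.3)
)

# column arithmetic is vectorized: the i-th elements of both columns
# are combined, which yields one mean per row (observation)
grades$grade_mean <- (grades$grade1 + grades$grade2) / 2
grades$grade_mean
```

This only works for functions and operators that are themselves vectorized, such as `+` and `/`. A summary function like mean() collapses its whole input into one value and needs a different approach.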

# simple calculator operations
1 + 1

## [1] 2

2 * 5

## [1] 10

9 / 3

## [1] 3

2 ^ 3

## [1] 8

# ... work also on vector objects
# operation is repeated for each element
vv <- c(0,1,2,3,4,5,9,10)
vv + 1

## [1]  1  2  3  4  5  6 10 11

vv * 5

## [1]  0  5 10 15 20 25 45 50

# square a vector
vv^2

## [1]   0   1   4   9  16  25  81 100

# root extraction
sqrt(vv)

## [1] 0.000000 1.000000 1.414214 1.732051 2.000000 2.236068 3.000000 3.162278

# natural logarithm of each element; note that log(0) is -Inf
log(vv)

## [1]      -Inf 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 2.1972246 2.3025851

# if vv were a vector of ratings 0 ... 10
# we could invert it using
10 - vv

## [1] 10  9  8  7  6  5  1  0

# be aware of recycling: a shorter vector is repeated until it matches the length of the longer one
vv * c(2,4)

## [1]  0  4  4 12  8 20 18 40

# ... the longer length should be a multiple of the shorter one
# otherwise R still recycles, but issues a warning (not an error), therefore commented out
# vv * c(1,2,3)
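The recycling behavior for lengths that do not match can be checked with a small sketch. The helper names `res` and `msg` are chosen for this illustration only. The warning is suppressed once to inspect the result and captured once to show that it is a warning rather than an error.

```r
vv <- c(0, 1, 2, 3, 4, 5, 9, 10)

# the multiplier c(1, 2, 3) is recycled as 1, 2, 3, 1, 2, 3, 1, 2
res <- suppressWarnings(vv * c(1, 2, 3))
res

# capture the condition: the handler for warnings is triggered, no error occurs
msg <- tryCatch(vv * c(1, 2, 3), warning = function(w) conditionMessage(w))
msg
```

The computation completes, so silent recycling of this kind can hide mistakes in real analyses. Checking lengths explicitly is the safer habit.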

# we can get basic descriptive values on the fly
min(vv)

## [1] 0

max(vv)

## [1] 10

mean(vv)

## [1] 4.25

sd(vv)

## [1] 3.615443

var(vv)

## [1] 13.07143

# get n
length(vv)

## [1] 8

# when working with two dimensional data objects (matrices, data frames, tibbles) we have to select the parts we want to work on
 
# create sample dataframe
dd <- data.frame(
  subj   = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  grade1 = c(1.0, 2.7, 3.7, 1.3, 1.7, 1.0, 3.3, 4.0, 3.7),
  grade2 = c(4.0, 3.0, 1.3, 1.3, 1.0, 1.3, 2.7, 4.0, 3.3),
  grade3 = c(1.3, 2.3, 2.7, 1.0, 1.3, 1.3, 2.3, 3.7, 3.0)
)
write.table(dd, 'v_grades_mini.txt', row.names = FALSE, quote = FALSE, sep = '\t')

dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_grades_mini.txt")

dd

##   subj grade1 grade2 grade3
## 1    1    1.0    4.0    1.3
## 2    2    2.7    3.0    2.3
## 3    3    3.7    1.3    2.7
## 4    4    1.3    1.3    1.0
## 5    5    1.7    1.0    1.3
## 6    6    1.0    1.3    1.3
## 7    7    3.3    2.7    2.3
## 8    8    4.0    4.0    3.7
## 9    9    3.7    3.3    3.0

# calculate mean grade
# accessing a column returns a vector
mean(dd$grade1)

## [1] 2.488889

mean(dd[,2])

## [1] 2.488889

# when we use a whole row, the result is meaningless because the subj column is averaged in as well
rowMeans(dd[2,])

##   2 
## 2.5

mean(as.numeric(dd[2,]))

## [1] 2.5

# we want to get mean grade of subj 2
mean(as.numeric(dd[2,2:4]))

## [1] 2.666667

mean(as.numeric(dd[2,c('grade1', 'grade2', 'grade3')]))

## [1] 2.666667

# slicing out part of the data frame and applying a function to it yields overall results such as the grand mean
# row-wise calculations, e.g. one result per observation, are cumbersome this way

# to get the mean grade for every subject we need to apply mean() row by row - see the apply() examples below

By combining slicing and apply() we can make targeted data modifications.

apply() usually takes three arguments, optionally followed by more:
- the part of the data frame it should be applied to
- a direction (1 is row-wise, 2 is column-wise)
- a function to apply
- if needed, further arguments that are passed on to the function
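The arguments listed above can be sketched on a small matrix. The matrix `m` and its missing value are made up for this illustration. The extra argument `na.rm = TRUE` is handed through to mean().

```r
# hypothetical 2 x 3 matrix, filled column by column, with one missing value
m <- matrix(c(1, 2, 3, 4, NA, 6), nrow = 2)

# direction 1: one mean per row; the NA makes the first row mean NA
row_means_raw <- apply(m, 1, mean)

# further arguments after the function are passed on to it
row_means <- apply(m, 1, mean, na.rm = TRUE)

# direction 2: one mean per column
col_means <- apply(m, 2, mean, na.rm = TRUE)

row_means_raw
row_means
col_means
```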

library(tidyverse)
dd <- read_delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_grades_mini.txt", delim='\t')

## Rows: 9 Columns: 4
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (4): subj, grade1, grade2, grade3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# we would like to know the mean of the grades for every observation
apply(dd[,2:4], 1, mean)

## [1] 2.100000 2.666667 2.566667 1.200000 1.333333 1.200000 2.766667 3.900000 3.333333

# we get column means
apply(dd[,2:4], 2, mean)

##   grade1   grade2   grade3 
## 2.488889 2.433333 2.100000

# or robust estimators of the column means
# trim = 0.2 drops the lowest 20% and the highest 20% of the values (rounded down to whole observations) and averages the rest
apply(dd[,2:4], 2, mean, trim=0.2)

##   grade1   grade2   grade3 
## 2.485714 2.414286 2.028571
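To make the effect of `trim` concrete, the trimmed mean of grade1 can be rebuilt by hand. This is a sketch of what mean() documents for `trim`, using the grade1 values from the sample data. The names `n_drop` and `kept` are chosen for this illustration only.

```r
grade1 <- c(1.0, 2.7, 3.7, 1.3, 1.7, 1.0, 3.3, 4.0, 3.7)

# number of observations dropped from EACH end: floor(9 * 0.2) = 1
n_drop <- floor(length(grade1) * 0.2)

# sort, then cut off n_drop values at the bottom and at the top
kept <- sort(grade1)[(n_drop + 1):(length(grade1) - n_drop)]

mean(kept)                 # manual trimmed mean
mean(grade1, trim = 0.2)   # built-in trimmed mean
```

With nine values, one value is removed at each end, so seven values enter the mean.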

7.3 Tidyverse: `dplyr::rowwise()`

7.3.1 Actions within an observation: row-wise data modifications and transformations with `rowwise()`, its pitfalls, and approaches to solving them

It is often necessary to compute, for each individual observation, a new variable from one or more existing variables. A typical example is computing sum or mean scores from the individual items of a questionnaire, which are then interpreted as the level on a trait dimension. A concrete example is averaging the 12 neuroticism items of the NEO-FFI to obtain the neuroticism scale.

The central command for row-wise modifications is rowwise().

Combined with rowwise(), mutate() lets us apply arbitrary functions, including self-defined ones, to the data row by row.

library(tidyverse)
dd <- readr::read_tsv("https://md.psych.bio.uni-goettingen.de/mv/data/div/df_dplyr.txt")

## Rows: 12 Columns: 7
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): gender
## dbl (6): subj, age, grp, v1, v2, v3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# require(dplyr)


# verb: rowwise()
# we might want to have the individual means of the three variables v1 to v3
dplyr::rowwise(dd) %>% dplyr::mutate(v_mean = mean(c(v1, v2, v3)))

## # A tibble: 12 × 8
## # Rowwise: 
##     subj gender   age   grp    v1    v2    v3 v_mean
##    <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1     1 f         17     1     9    16     8   11  
##  2     2 f         33     2    67    73    66   68.7
##  3     3 f         47     3    86    87    91   88  
##  4     4 f         10     1    64    68    67   66.3
##  5     5 f         21     2    40    46    44   43.3
##  6     6 f         30     3    26    34    28   29.3
##  7     7 m         51     1    64    66    64   64.7
##  8     8 m         13     2    61    66    64   63.7
##  9     9 m         17     3    67    67    67   67  
## 10    10 m         25     1    38    36    35   36.3
## 11    11 m         33     2    22    25    21   22.7
## 12    12 m         27     3    81    86    81   82.7

# just to note: without rowwise(), mean() sees the complete columns and returns the grand mean of all 36 values in every row
dplyr::mutate(dd, v_mean = mean(c(v1, v2, v3)))  # caution: THIS IS WRONG

## # A tibble: 12 × 8
##     subj gender   age   grp    v1    v2    v3 v_mean
##    <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1     1 f         17     1     9    16     8   53.6
##  2     2 f         33     2    67    73    66   53.6
##  3     3 f         47     3    86    87    91   53.6
##  4     4 f         10     1    64    68    67   53.6
##  5     5 f         21     2    40    46    44   53.6
##  6     6 f         30     3    26    34    28   53.6
##  7     7 m         51     1    64    66    64   53.6
##  8     8 m         13     2    61    66    64   53.6
##  9     9 m         17     3    67    67    67   53.6
## 10    10 m         25     1    38    36    35   53.6
## 11    11 m         33     2    22    25    21   53.6
## 12    12 m         27     3    81    86    81   53.6

# building questionnaire scales often requires inverting items before aggregating them
dplyr::rowwise(dd) %>% dplyr::mutate(v2_i = 100 - v2) %>% dplyr::mutate(v_mean = mean(c(v1, v2_i, v3)))

## # A tibble: 12 × 9
## # Rowwise: 
##     subj gender   age   grp    v1    v2    v3  v2_i v_mean
##    <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1     1 f         17     1     9    16     8    84   33.7
##  2     2 f         33     2    67    73    66    27   53.3
##  3     3 f         47     3    86    87    91    13   63.3
##  4     4 f         10     1    64    68    67    32   54.3
##  5     5 f         21     2    40    46    44    54   46  
##  6     6 f         30     3    26    34    28    66   40  
##  7     7 m         51     1    64    66    64    34   54  
##  8     8 m         13     2    61    66    64    34   53  
##  9     9 m         17     3    67    67    67    33   55.7
## 10    10 m         25     1    38    36    35    64   45.7
## 11    11 m         33     2    22    25    21    75   39.3
## 12    12 m         27     3    81    86    81    14   58.7

# caution: THIS IS WRONG
# inside mutate(), v1:v3 is the base R sequence operator applied to the values in the row, not a selection of columns
# row 1 therefore gets mean(9:8) = 8.5 instead of the correct mean(c(9, 16, 8)) = 11
# a column range needs a selection helper, e.g. mean(dplyr::c_across(v1:v3))
dplyr::rowwise(dd) %>% dplyr::mutate(v_mean = mean(c(v1:v3)))

## # A tibble: 12 × 8
## # Rowwise: 
##     subj gender   age   grp    v1    v2    v3 v_mean
##    <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1     1 f         17     1     9    16     8    8.5
##  2     2 f         33     2    67    73    66   66.5
##  3     3 f         47     3    86    87    91   88.5
##  4     4 f         10     1    64    68    67   65.5
##  5     5 f         21     2    40    46    44   42  
##  6     6 f         30     3    26    34    28   27  
##  7     7 m         51     1    64    66    64   64  
##  8     8 m         13     2    61    66    64   62.5
##  9     9 m         17     3    67    67    67   67  
## 10    10 m         25     1    38    36    35   36.5
## 11    11 m         33     2    22    25    21   21.5
## 12    12 m         27     3    81    86    81   81
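The values in the output above are not the means of v1, v2 and v3: inside mutate(), `v1:v3` builds the number sequence from the value of v1 to the value of v3 (9:8 in the first row). A sketch of selecting a column range in a rowwise context with dplyr::c_across() follows. The data frame `v` is a hypothetical excerpt holding the first three rows of the data.

```r
library(dplyr)

# hypothetical excerpt: first three observations of v1, v2, v3
v <- data.frame(v1 = c(9, 67, 86), v2 = c(16, 73, 87), v3 = c(8, 66, 91))

res <- v %>%
  rowwise() %>%
  # c_across() collects the values of the selected columns for the current row
  mutate(v_mean = mean(c_across(v1:v3))) %>%
  ungroup()   # drop the rowwise grouping again

res$v_mean
```

The result agrees with mean(c(v1, v2, v3)) in a rowwise context. c_across() pays off mainly when many adjacent columns have to be combined.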

# we might even use some function outside of dplyr commands with piping: here rowMeans
dd %>% dplyr::mutate(v2_i = 100 - v2) %>% dplyr::select(v1, v2_i, v3) %>% dplyr::rowwise() %>% rowMeans()

##  [1] 33.66667 53.33333 63.33333 54.33333 46.00000 40.00000 54.00000 53.00000 55.66667 45.66667 39.33333 58.66667

# as rowMeans already works rowwise, we don't need rowwise when using with rowMeans()
# so we get the same doing
dd %>% dplyr::mutate(v2_i = 100 - v2) %>% dplyr::select(v1, v2_i, v3) %>% rowMeans()

##  [1] 33.66667 53.33333 63.33333 54.33333 46.00000 40.00000 54.00000 53.00000 55.66667 45.66667 39.33333 58.66667
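Piping into rowMeans() returns a bare vector, so the other columns are lost. One possible way to keep the result as a new column is to call rowMeans() inside mutate() on a matrix built with cbind(). This is a sketch on a hypothetical three-row excerpt `v` of the data, not the only way to do it.

```r
library(dplyr)

# hypothetical excerpt: first three observations of v1, v2, v3
v <- data.frame(v1 = c(9, 67, 86), v2 = c(16, 73, 87), v3 = c(8, 66, 91))

res <- v %>%
  mutate(v2_i = 100 - v2) %>%
  # cbind() binds the column vectors into a matrix, rowMeans() averages each row
  mutate(v_mean = rowMeans(cbind(v1, v2_i, v3)))

res
```

No rowwise() is needed here because rowMeans() already works across rows.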

# we might even define our own function and apply it to each row
# we define a function that returns the mean of a vector of values excluding the lowest one
mean_c <- function(x){return (mean(x[order(x)[2:length(x)]]))}
# just to see what mean_c does: this should return 6, the mean of 8 and 4, 2 is excluded
mean_c(c(8,4,2))

## [1] 6

# we want to have the mean of v1, v2, and v3, excluding the minimum value of the three. this should be done for each observation.
dplyr::rowwise(dd) %>%  dplyr::mutate(v_mean_c = mean_c(c(v1, v2, v3)))

## # A tibble: 12 × 8
## # Rowwise: 
##     subj gender   age   grp    v1    v2    v3 v_mean_c
##    <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1     1 f         17     1     9    16     8     12.5
##  2     2 f         33     2    67    73    66     70  
##  3     3 f         47     3    86    87    91     89  
##  4     4 f         10     1    64    68    67     67.5
##  5     5 f         21     2    40    46    44     45  
##  6     6 f         30     3    26    34    28     31  
##  7     7 m         51     1    64    66    64     65  
##  8     8 m         13     2    61    66    64     65  
##  9     9 m         17     3    67    67    67     67  
## 10    10 m         25     1    38    36    35     37  
## 11    11 m         33     2    22    25    21     23.5
## 12    12 m         27     3    81    86    81     83.5
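For comparison, the same self-defined function can be applied row by row with base R apply(). This is a sketch on a hypothetical three-row excerpt `v` of the data. mean_c() is repeated so that the block runs on its own.

```r
# mean of a vector excluding its lowest value (as defined above)
mean_c <- function(x) {
  return(mean(x[order(x)[2:length(x)]]))
}

# hypothetical excerpt: first three observations of v1, v2, v3
v <- data.frame(v1 = c(9, 67, 86), v2 = c(16, 73, 87), v3 = c(8, 66, 91))

# direction 1: call mean_c() once per row
v_mean_c <- apply(v, 1, mean_c)
v_mean_c
```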

7.4 References

Examples and explanations: Unit transformation base
Examples and explanations: Unit transformation dplyr

7.5 Screencast(s)

… presenting some of the options