Chapter 5 Data Access

5.1 Two Philosophies of Data Access

In base R, slicing is a powerful way to access specific subsets of the data.

The tidyverse instead tends to chain many small, modular access functions together with pipes.

The two approaches can also be combined.
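As a minimal sketch of such a combination, the following example filters rows with dplyr and then slices the result with base R. The data frame toy and its columns are hypothetical and serve only as an illustration; the sketch assumes that the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, not the stud.rds data set used in this chapter
toy <- data.frame(
  id    = 1:6,
  group = c("a", "b", "a", "b", "a", "b"),
  score = c(2.0, 3.5, 1.5, 4.0, 2.5, 3.0)
)

# tidyverse step: keep only the rows of group "a"
group_a <- toy %>% dplyr::filter(group == "a")

# base R step: slice the first two rows, then extract one column with $
first_two_scores <- group_a[1:2, ]$score
first_two_scores

## [1] 2.0 1.5
```

Because dplyr::filter() returns a data frame here, the base R operators can be applied to its result directly.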

5.2 Base R: Slicing

Columns can be addressed with the $ operator. The pattern is: data_object$column_name

Subsets are accessed with slicing brackets (square brackets) that contain the ranges to select. One-dimensional data objects take one index, two-dimensional objects take two.

Elements of two-dimensional data frames (including tibbles) are addressed by row and column. The slicing brackets then take two indices separated by a comma: the first refers to rows, the second to columns. A single index without a comma refers to columns. If an index is left empty, whether for rows or for columns, everything in that dimension is selected.

Indices can take very different forms: row or column numbers, row or column names, ranges, and logical expressions. Sequences are needed frequently and are created with the : operator. 2:5 yields the sequence 2, 3, 4, 5 and selects the same elements as c(2, 3, 4, 5). Arbitrary orders are also possible, as in c(4, 2, 9).
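The index forms described above can be tried on a small, self-contained data frame before working with real data. This is a sketch only: the data frame toy and its columns are hypothetical and not part of the chapter's data set.

```r
# Hypothetical toy data frame, only for illustration
toy <- data.frame(
  id    = 1:9,
  value = seq(10, 90, by = 10),
  label = letters[1:9]
)

# The : operator builds a sequence with the same values as c(2, 3, 4, 5)
# (2:5 is stored as integer, c(2, 3, 4, 5) as double)
seq_a <- 2:5
same_values <- all(seq_a == c(2, 3, 4, 5))

# Two indices: rows first, columns second
rows_2_to_5 <- toy[2:5, c("id", "value")]

# Arbitrary row order, empty column index: all columns
reordered <- toy[c(4, 2, 9), ]

# Empty row index: all rows of column 2, returned as a vector
all_values <- toy[, 2]

# A single index refers to columns and returns a data frame
one_column <- toy[2]
```

Note the difference between the last two calls: with a plain data frame, toy[, 2] drops to a vector, while toy[2] keeps the data frame structure, as the head(dd[2]) output below also shows.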

# we read some data
dd <- readRDS(gzcon(url("http://md.psych.bio.uni-goettingen.de/mv/data/div/stud.rds")))

# one dimension
# we can get the second column of dd named height by
head(dd$height)  # head() limits to the first 6 elements

## [1] 177 190 162 179 178 186

head(dd[2])

##   height
## 1    177
## 2    190
## 3    162
## 4    179
## 5    178
## 6    186

# the extracted column is a vector and can be sliced in turn
dd$height[2:5]

## [1] 190 162 179 178

# two dimensions
# birth_year (column 7) of subject no 8, stored in row 8; the value is 86
dd[8,7]

## [1] 86

# the first 5 subjects and their columns 2 to 7
dd[1:5, 2:7]

##   height shoe_size weight gender birth_month birth_year
## 1    177      43.0     75      1           3         88
## 2    190      45.0     87      2           6         83
## 3    162      37.0     49      1           1         89
## 4    179      42.5     80      2           6         85
## 5    178      39.0     52      1           7         85

# ranges and single indices can be combined using c()
dd[c(1:5, 10:15), c(3, 2, 7:9)]

##    shoe_size height birth_year statistics_grade abitur
## 1       43.0    177         88              3.0    1.6
## 2       45.0    190         83              4.0    3.7
## 3       37.0    162         89              3.2    2.1
## 4       42.5    179         85              3.0    2.0
## 5       39.0    178         85              2.7    2.2
## 10      38.0    172         86              1.7    2.1
## 11      39.0    177         87              1.3    2.0
## 12      44.0    180         87              1.7    2.0
## 13      41.0    173         88              2.7    1.2
## 14      43.0    181         90              2.7    2.8
## 15      39.0    167         89              2.3    1.9

# we can work with column names also - column names have to be given as strings
dd[1:10, c("no", "statistics_grade", "abitur")]

##    no statistics_grade abitur
## 1   1              3.0    1.6
## 2   2              4.0    3.7
## 3   3              3.2    2.1
## 4   4              3.0    2.0
## 5   5              2.7    2.2
## 6   6              2.7    1.7
## 7   7              2.0    2.6
## 8   8              4.0    2.3
## 9   9              1.7    1.6
## 10 10              1.7    2.1

5.2.1 Conditional Access

Slicing also works with logical conditions. A condition evaluates to a vector of TRUE and FALSE values.
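Before applying this to a data frame, it may help to see the mechanism on a plain vector. The following sketch uses a hypothetical vector grades; which() and sum() are base R helpers that are not used elsewhere in this chapter.

```r
# Hypothetical grades, only for illustration
grades <- c(1.6, 3.7, 2.1, 2.0, 2.2)

# The comparison is evaluated element by element
is_good <- grades < 2.1
is_good

## [1]  TRUE FALSE FALSE  TRUE FALSE

# Used as an index, the logical vector keeps the TRUE positions
grades[is_good]

## [1] 1.6 2.0

# which() reports the TRUE positions, sum() counts them
which(is_good)

## [1] 1 4

sum(is_good)

## [1] 2
```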

dd <- readRDS(gzcon(url("http://md.psych.bio.uni-goettingen.de/mv/data/div/stud.rds")))

# all women, they have a 1 in column gender
head(dd$gender == 1)  # a true-false-vector

## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

# only rows with TRUE are selected/returned, they meet the condition
head(dd[dd$gender == 1, c("no", "gender", "abitur")])

##    no gender abitur
## 1   1      1    1.6
## 3   3      1    2.1
## 5   5      1    2.2
## 8   8      1    2.3
## 9   9      1    1.6
## 10 10      1    2.1

# < and > work also
head(dd[dd$abitur < 2, c("no", "gender", "abitur")])

##    no gender abitur
## 1   1      1    1.6
## 6   6      2    1.7
## 9   9      1    1.6
## 13 13      1    1.2
## 15 15      1    1.9
## 18 18      1    1.9

# we can use logical AND and OR (& and |)
head(dd[dd$abitur < 2 & dd$gender == 1, c("no", "gender", "abitur")])

##    no gender abitur
## 1   1      1    1.6
## 9   9      1    1.6
## 13 13      1    1.2
## 15 15      1    1.9
## 18 18      1    1.9
## 27 27      1    1.0

head(dd[dd$abitur < 2 | dd$gender == 1, c("no", "gender", "abitur")])

##   no gender abitur
## 1  1      1    1.6
## 3  3      1    2.1
## 5  5      1    2.2
## 6  6      2    1.7
## 8  8      1    2.3
## 9  9      1    1.6

# any logical vector works as a row index; a vector shorter than the
# number of rows is recycled, so this pattern repeats every four rows
vv <- c(TRUE, TRUE, FALSE, FALSE)
head(dd[vv, c("no", "gender", "abitur")])  # rows 3, 4, 7, 8, ... are excluded

##    no gender abitur
## 1   1      1    1.6
## 2   2      2    3.7
## 5   5      1    2.2
## 6   6      2    1.7
## 9   9      1    1.6
## 10 10      1    2.1

5.3 Tidyverse Data Access

Column names do not have to be quoted strings.
dplyr::filter() selects rows (observations).
dplyr::select() selects columns (variables).
%>% passes the result on its left as the first argument to the function on its right.
%in% tests whether each element on the left occurs in the vector on the right.
: can refer to a range of adjacent variables, as in height:birth_year.
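The building blocks listed above can be sketched together on a small example. The data frame toy and its values are hypothetical; the sketch assumes that the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  id     = 1:6,
  height = c(170, 182, 165, 176, 181, 169),
  weight = c(68, 85, 55, 74, 79, 61),
  gender = c(1, 2, 1, 2, 1, 2)
)

# filter() keeps rows, select() keeps columns, %>% chains the two steps
result <- toy %>%
  dplyr::filter(id %in% c(1, 3, 6)) %>%  # %in%: is id one of 1, 3, 6?
  dplyr::select(height:gender)           # range of three adjacent columns

# The same steps without the pipe, nested from the inside out
nested <- dplyr::select(dplyr::filter(toy, id %in% c(1, 3, 6)), height:gender)
```

Both variants should give the same result; the piped version reads in the order in which the steps are carried out.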

dd <- readRDS(gzcon(url("http://md.psych.bio.uni-goettingen.de/mv/data/div/stud.rds")))
library(tidyverse)
# one dimension
# we can get the second column of dd named height by
head(dd %>% dplyr::select(height))

##   height
## 1    177
## 2    190
## 3    162
## 4    179
## 5    178
## 6    186

dd %>% dplyr::select(height) %>% head()

##   height
## 1    177
## 2    190
## 3    162
## 4    179
## 5    178
## 6    186

# two dimensions
# we filter the rows to get subject 8 and select column birth_year, which is 86
dd %>% dplyr::filter(no == 8) %>% dplyr::select(birth_year)

##   birth_year
## 1         86

# the first 5 subjects and their columns height to birth_year
dd %>% dplyr::filter(no %in% 1:5) %>% dplyr::select(height:birth_year)

##   height shoe_size weight gender birth_month birth_year
## 1    177      43.0     75      1           3         88
## 2    190      45.0     87      2           6         83
## 3    162      37.0     49      1           1         89
## 4    179      42.5     80      2           6         85
## 5    178      39.0     52      1           7         85

# combinations are possible
dd %>% dplyr::filter(gender == 1 & no %in% 10:15) %>% dplyr::select(height, gender, birth_year:math_intense)

##   height gender birth_year statistics_grade abitur math_intense
## 1    172      1         86              1.7    2.1            0
## 2    177      1         87              1.3    2.0            1
## 3    173      1         88              2.7    1.2            0
## 4    167      1         89              2.3    1.9            0

# single column names and a range of column names can be mixed
dd %>% dplyr::filter(no %in% 1:10) %>% dplyr::select(height, shoe_size, birth_year:academic_background)

##    height shoe_size birth_year statistics_grade abitur math_intense academic_background
## 1     177      43.0         88              3.0    1.6            0                   1
## 2     190      45.0         83              4.0    3.7            0                   1
## 3     162      37.0         89              3.2    2.1            0                   0
## 4     179      42.5         85              3.0    2.0            1                   1
## 5     178      39.0         85              2.7    2.2            0                   1
## 6     186      44.0         89              2.7    1.7            1                   0
## 7     187      46.0         88              2.0    2.6            0                   0
## 8     177      40.0         86              4.0    2.3            0                   1
## 9     176      39.0         88              1.7    1.6            0                   0
## 10    172      38.0         86              1.7    2.1            0                   1

5.4 References

tidyverse
Examples and explanations: Unit transformation_base
Examples and explanations: Unit dplyr

5.5 Screencast

… presenting some of the options