Chapter 12 Datenteile zusammenfügen
Teile zu einer Datentabelle zusammenfügen bzw. Teile zu bestehender Tabelle hinzufügen.
12.1 Base R: Merging
12.1.1 Ausweiten eines Dataframe: Hinzufügen von Beobachtungen, hinzufügen von Spalten(Variablen), Kombination aus beiden
Zwei Datenframes mit derselben Struktur (dieselben Spaltennamen) können mit dem Befehl rbind()
aneinandergehängt werden.
Dies entspricht in etwa dem merge
Befehl in anderen Paketen.
rbind()
bezieht sich auf Zeilen, wie sich cbind()
auf Spalten bezieht.
Mit rbind()
werden Zeilen angehängt. Dies entspricht dem Hinzufügen von Beobachtungen (Personen).
12.1.1.1 verschiedene bzw. fehlende Spalten/Variablen
rbind verknüpft nur Dataframes mit den gleichen Spalten (gleiche Spaltennamen).
Entweder, die Variable(n), die es in einem Teil-Dataframe nicht gibt, werden dort gelöscht, wo sie existieren,
oder die fehlenden Variablen werden dort, wo sie fehlen, generiert und sinnvollerweise mit NA vorbesetzt.
Dies muss vor dem rbind()
Befehl passieren.
12.1.1.2 Mini-Beispiel hinzufügen von Beobachtungen (Zeilen)
# data of s1 and s2 that have the same variables, i.e. add more observations with the same variables
<- data.frame(
data.s1 id = c( 1, 2, 3),
gender = c('w', 'm', 'w'),
age = c( 20, 22, 27)
)# variable sequence in dataframe doesn't matter
<- data.frame(
data.s2 gender = c('w', 'm'),
id = c( 4, 5),
age = c( 19, 23)
)<- rbind(data.s1, data.s2)
data.all data.all
## id gender age
## 1 1 w 20
## 2 2 m 22
## 3 3 w 27
## 4 4 w 19
## 5 5 m 23
# if a variable is missing in a dataframe, create it and fill it with NA (missings)
# in data.s3 age is missing
<- data.frame(
data.s3 gender = c('m', 'w'),
id = c( 6, 7)
)# rbinding it directly would cause an error
# rbind(data.all, data.t3)
# Fehler in rbind(deparse.level, ...) :
# Anzahl der Spalten der Argumente unterschiedlich
# so we aggregate the missing column
$age = NA
data.s3rbind(data.all, data.s3)
## id gender age
## 1 1 w 20
## 2 2 m 22
## 3 3 w 27
## 4 4 w 19
## 5 5 m 23
## 6 6 m NA
## 7 7 w NA
# or
rbind(data.s1, data.s2, data.s3)
## id gender age
## 1 1 w 20
## 2 2 m 22
## 3 3 w 27
## 4 4 w 19
## 5 5 m 23
## 6 6 m NA
## 7 7 w NA
# the same can be done using merge, usually used to add columns/variables
# but take care to specify all column names in the 'by=c(...)' part and add the all=T flag
merge(data.s1, data.s2, by=c('id','gender', 'age'), all=T)
## id gender age
## 1 1 w 20
## 2 2 m 22
## 3 3 w 27
## 4 4 w 19
## 5 5 m 23
# also with merge(), both dataframes must have all variables/columns, sequence doesn't matter
12.1.1.3 Mini Beispiel: Hinzufügen von Variablen
Bei regelmäßigen und vollständigen Teil-Datenframes gibt es keine Probleme
# data of v1: three subjects, 3 vars
<- data.frame(
ddf.v1 subj = c( 1, 2, 3),
gender = c('w', 'm', 'w'),
age = c( 20, 22, 27)
)# data of v2, same subjects, two more vars
<- data.frame(
ddf.v2 subj = c( 1, 2, 3),
weight = c( 67, 85, 78),
height = c(172, 185, 180)
)# if there are no missings and no aditional subjects it's easy and can be done using cbind()
cbind(ddf.v1, ddf.v2) # this duplicates subj column, what doesnt make sense
## subj gender age subj weight height
## 1 1 w 20 1 67 172
## 2 2 m 22 2 85 185
## 3 3 w 27 3 78 180
# without redundancy
cbind(ddf.v1, ddf.v2[c('weight', 'height')])
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 85 185
## 3 3 w 27 78 180
# the same using merge
# argument `by` specifies comon columns in the two dataframes
merge(ddf.v1, ddf.v2, by="subj")
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 85 185
## 3 3 w 27 78 180
Ein Dataframe enthält mehr Beobachtungen als der andere, keine Überschneidung bei den Variablen
# data of v1: three subjects, 3 vars
<- data.frame(
ddf.v1 subj = c( 1, 2, 3),
gender = c('w', 'm', 'w'),
age = c( 20, 22, 27)
)# data of v2, more subjects, two more vars
<- data.frame(
ddf.v2 subj = c( 1, 2, 3, 4, 5),
weight = c( 67, 85, 78, 66, 72),
height = c(172, 185, 180, 165, 177)
)# problem: what to do with subjects, that only exist in one dataframe
# solution 1: get only complete subjects after merge
merge(ddf.v1, ddf.v2, by="subj")
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 85 185
## 3 3 w 27 78 180
# solution 2: impute NA where data are missing by using flag `all`
merge(ddf.v1, ddf.v2, by="subj", all=T)
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 85 185
## 3 3 w 27 78 180
## 4 4 <NA> NA 66 165
## 5 5 <NA> NA 72 177
Ein Dataframe enthält mehr Beobachtungen als der andere, Überschneidung bei den Variablen, Inkonsistenzen bei zusammengehörigen Variablen.
# data of v1: three subjects, 3 vars
<- data.frame(
ddf.v1 subj = c( 1, 2, 3),
gender = c('w', 'm', 'w'),
age = c( 20, 22, 27)
)# data of v2, more subjects, two more vars
<- data.frame(
ddf.v2 subj = c( 1, 2, 3, 4, 5),
weight = c( 67, 85, 78, 66, 72),
gender = c('w', 'm', 'w', 'w', 'm'),
height = c(172, 185, 180, 165, 177)
)# take care: gender of ddf.v2 replaces gender of ddf.v1
# variables, the dataframes have in common, have to appear in parameter `by`
# there still are missing data, so the problem of what to do with incomplete data persists
# solution 1: get only complete subjects after merge
merge(ddf.v1, ddf.v2, by=c("subj", "gender"))
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 85 185
## 3 3 w 27 78 180
# solution 2: impute NA where data are missing by using flag `all`
merge(ddf.v1, ddf.v2, by=c("subj", "gender"), all=T)
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 85 185
## 3 3 w 27 78 180
## 4 4 w NA 66 165
## 5 5 m NA 72 177
# take care: variables in common have to be equal in both dataframes for all subjects
# example: gender of subj 2 is different in the two dataframes
# data of v1: three subjects, 3 vars
<- data.frame(
ddf.v1 subj = c( 1, 2, 3),
gender = c('w', 'm', 'w'),
age = c( 20, 22, 27)
)# data of v2, more subjects, two more vars, subj 2 has a typo in variable gender
<- data.frame(
ddf.v2 subj = c( 1, 2, 3, 4, 5),
weight = c( 67, 85, 78, 66, 72),
gender = c('w', 'x', 'w', 'w', 'm'),
height = c(172, 185, 180, 165, 177)
)# take care: gender of ddf.v2 replaces gender of ddf.v1
# variables, the dataframes have in common, have to appear in parameter `by`
# there still are missing data, so the problem of what to do with incomplete data persists
# solution 1: get only complete subjects after merge
merge(ddf.v1, ddf.v2, by=c("subj", "gender"))
## subj gender age weight height
## 1 1 w 20 67 172
## 2 3 w 27 78 180
# solution 2: impute NA where data are missing by using flag `all`
merge(ddf.v1, ddf.v2, by=c("subj", "gender"), all=T)
## subj gender age weight height
## 1 1 w 20 67 172
## 2 2 m 22 NA NA
## 3 2 x NA 85 185
## 4 3 w 27 78 180
## 5 4 w NA 66 165
## 6 5 m NA 72 177
require(tidyverse)
<- readr::read_tsv("https://md.psych.bio.uni-goettingen.de/mv/data/div/df_dplyr.txt") dd
## Rows: 12 Columns: 7
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): gender
## dbl (6): subj, age, grp, v1, v2, v3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# we add some older subjects with partly incomplete data
.1 <- data_frame(subj = c(20, 21, 22, 23),
ddgender = c("f", "f", "m", "m"),
age = c(50, 54, 53, 59),
grp = c(1,2,4,3),
v1 = c(89, 88, 67, 79)
)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
::full_join(dd, dd.1, by=c("subj", "gender", "age", "grp", "v1")) dplyr
## # A tibble: 16 × 7
## subj gender age grp v1 v2 v3
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 f 17 1 9 16 8
## 2 2 f 33 2 67 73 66
## 3 3 f 47 3 86 87 91
## 4 4 f 10 1 64 68 67
## 5 5 f 21 2 40 46 44
## 6 6 f 30 3 26 34 28
## 7 7 m 51 1 64 66 64
## 8 8 m 13 2 61 66 64
## 9 9 m 17 3 67 67 67
## 10 10 m 25 1 38 36 35
## 11 11 m 33 2 22 25 21
## 12 12 m 27 3 81 86 81
## 13 20 f 50 1 89 NA NA
## 14 21 f 54 2 88 NA NA
## 15 22 m 53 4 67 NA NA
## 16 23 m 59 3 79 NA NA
12.2 Tidyverse: Merging
Im Tidyverse wird für das Zusammenfügen von Daten auf Prinzipien aus der Datenbank-Technik zurückgegriffen (SQL).
In den Joins (left_join()
, full_join()
kann der Parameter by
auch ein Array von Spaltennamen sein, die in beiden Datentabellen vorkommen).
require(tidyverse)
<- readr::read_tsv("https://md.psych.bio.uni-goettingen.de/mv/data/div/df_dplyr.txt") dd
## Rows: 12 Columns: 7
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): gender
## dbl (6): subj, age, grp, v1, v2, v3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# we have to add some older subjects with partly incomplete data but also with additional columns
.1 <- data_frame(subj = c(20, 21, 22, 23),
ddgender = c("f", "f", "m", "m"),
age = c(50, 54, 53, 59),
grp = c(1,2,4,3),
v1 = c(89, 88, 67, 79),
v4 = c(2, 7, 9, 8),
v5 = c(98, 94, 91, 97)
)
::full_join(dd, dd.1, by=c("subj", "gender", "age", "grp", "v1")) dplyr
## # A tibble: 16 × 9
## subj gender age grp v1 v2 v3 v4 v5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 f 17 1 9 16 8 NA NA
## 2 2 f 33 2 67 73 66 NA NA
## 3 3 f 47 3 86 87 91 NA NA
## 4 4 f 10 1 64 68 67 NA NA
## 5 5 f 21 2 40 46 44 NA NA
## 6 6 f 30 3 26 34 28 NA NA
## 7 7 m 51 1 64 66 64 NA NA
## 8 8 m 13 2 61 66 64 NA NA
## 9 9 m 17 3 67 67 67 NA NA
## 10 10 m 25 1 38 36 35 NA NA
## 11 11 m 33 2 22 25 21 NA NA
## 12 12 m 27 3 81 86 81 NA NA
## 13 20 f 50 1 89 NA NA 2 98
## 14 21 f 54 2 88 NA NA 7 94
## 15 22 m 53 4 67 NA NA 9 91
## 16 23 m 59 3 79 NA NA 8 97
12.3 Referenzen
- data wrangling cheat sheet
- two table verbs
- Entsprechende Seite in Quick-R
12.3.1 Units
- Beispiele und Erklärungen: Unit transformation base
- Beispiele und Erklärungen: Unit transformation dplyr
- Beispiele und Erklärungen: Unit dataframe