
Data often come split across several partial data sets. For analysis they need to be combined into a single data frame. This requires a key variable (column) that occurs in both partial data sets. The procedure is analogous to database queries, which is why the commands are similar.

Illustrative example

Example: confidential data and experimental data are kept separate. They reside in different data frames and might come from different sources.

# confidential data
df.c <- data.frame(
  subj = c('a',   'b',    'c',     'd'),
  name = c('Anna','Berta','Carlos','Dario'),
  year = c( 1995,   1998,   2000,    1956)
)
write.table(df.c, 'join_conf.txt', row.names = FALSE, quote = FALSE, sep = '\t')
# experimental data
df.e <- data.frame(
  subj = c('a','a','b','b','b','d','d','e','e'),
  time = c(  1,  3,  1,  2,  3,  2,  3,  1,  2),
  res  = c( 12, 17,  9, 12, 13, 15, 19, 10, 11)
)
write.table(df.e, 'join_exp.txt', row.names = FALSE, quote = FALSE, sep = '\t')

# note: column 'subj' exists in both data frames
# note: subj 'c' exists in df.c but not in df.e
# note: subj 'e' exists in df.e but not in df.c

Both data frames have an identically named column ‘subj’. df.c contains a value in column ‘subj’ that does not occur in df.e, namely ‘c’. df.e contains a value in column ‘subj’ that does not occur in df.c, namely ‘e’.
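The mismatch between the key columns can be checked before any join is run. The following is a minimal sketch using base R set functions; the data frames are reduced to the key column and redefined so that the block runs on its own, and the names `only.c`, `only.e` and `shared` are made up for this illustration.

```r
# Key columns of the example data (redefined so this block is self-contained)
df.c <- data.frame(subj = c('a', 'b', 'c', 'd'),
                   stringsAsFactors = FALSE)
df.e <- data.frame(subj = c('a', 'a', 'b', 'b', 'b', 'd', 'd', 'e', 'e'),
                   stringsAsFactors = FALSE)

# keys that occur only in df.c
only.c <- setdiff(df.c$subj, df.e$subj)
# keys that occur only in df.e
only.e <- setdiff(df.e$subj, df.c$subj)
# keys that occur in both data frames
shared <- intersect(df.c$subj, df.e$subj)

print(only.c)
print(only.e)
print(shared)
```

Keys listed in `only.c` or `only.e` are the ones that a join will either drop or fill with NA, depending on the join type.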

inner_join

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.

require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df <- inner_join(df.e, df.c)

## Joining, by = "subj"

## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

df

##   subj time res  name year
## 1    a    1  12  Anna 1995
## 2    a    3  17  Anna 1995
## 3    b    1   9 Berta 1998
## 4    b    2  12 Berta 1998
## 5    b    3  13 Berta 1998
## 6    d    2  15 Dario 1956
## 7    d    3  19 Dario 1956

# we lose experimental data of 'e' because it doesn't exist in df.c
# we lose confidential data of 'c' because it doesn't exist in df.e

This yields complete records, i.e. records without missing values caused by keys that have no match. Note: missing values that already exist within the individual data frames “survive”, because they have nothing to do with combining the data.
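The point about surviving missing values can be illustrated with a small variant of the example. This is only a sketch: setting Berta's year to NA is a made-up change to the data. Passing `by = "subj"` names the key explicitly, and `stringsAsFactors = FALSE` keeps ‘subj’ as character, so with a current dplyr neither the "Joining, by" message nor the factor warning should appear.

```r
library(dplyr)

# Hypothetical variant of the example data: Berta's year is unknown (NA)
df.c <- data.frame(
  subj = c('a', 'b', 'c', 'd'),
  name = c('Anna', 'Berta', 'Carlos', 'Dario'),
  year = c(1995, NA, 2000, 1956),
  stringsAsFactors = FALSE
)
df.e <- data.frame(
  subj = c('a', 'a', 'b', 'b', 'b', 'd', 'd', 'e', 'e'),
  time = c(1, 3, 1, 2, 3, 2, 3, 1, 2),
  res  = c(12, 17, 9, 12, 13, 15, 19, 10, 11),
  stringsAsFactors = FALSE
)

df <- inner_join(df.e, df.c, by = "subj")

# the rows for 'b' are kept; their 'year' is still NA
print(df[df$subj == 'b', ])
# the join itself introduces no NA into the experimental columns
print(anyNA(df$res))
```

The NA in `year` comes from df.c itself, not from a missing match, which is why an inner join does not remove it.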

left_join

left_join(x, y): Return all rows from x, and all columns from x and y. Rows in x with no match in y get NA in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.

require(dplyr)
df <- left_join(df.e, df.c)

## Joining, by = "subj"

## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

df

##   subj time res  name year
## 1    a    1  12  Anna 1995
## 2    a    3  17  Anna 1995
## 3    b    1   9 Berta 1998
## 4    b    2  12 Berta 1998
## 5    b    3  13 Berta 1998
## 6    d    2  15 Dario 1956
## 7    d    3  19 Dario 1956
## 8    e    1  10  <NA>   NA
## 9    e    2  11  <NA>   NA

# we keep all experimental data but we don't have confidential data for 'e', because it doesn't exist in df.c

The order in which the tables are passed determines what is lost.

require(dplyr)
# now, df.c is the first argument of left_join()
df <- left_join(df.c, df.e)

## Joining, by = "subj"

## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

df

##   subj   name year time res
## 1    a   Anna 1995    1  12
## 2    a   Anna 1995    3  17
## 3    b  Berta 1998    1   9
## 4    b  Berta 1998    2  12
## 5    b  Berta 1998    3  13
## 6    c Carlos 2000   NA  NA
## 7    d  Dario 1956    2  15
## 8    d  Dario 1956    3  19

# we lose experimental data of 'e' because it doesn't exist in df.c

The rows of the “leading” (first) data frame are retained.
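Which data frame is the “leading” one can also be switched without swapping the arguments. As a sketch, dplyr's `right_join()` keeps all rows of the second data frame instead of the first; the variable names `left` and `right` are chosen for this illustration only.

```r
library(dplyr)

# Example data (redefined so this block is self-contained)
df.c <- data.frame(
  subj = c('a', 'b', 'c', 'd'),
  name = c('Anna', 'Berta', 'Carlos', 'Dario'),
  year = c(1995, 1998, 2000, 1956),
  stringsAsFactors = FALSE
)
df.e <- data.frame(
  subj = c('a', 'a', 'b', 'b', 'b', 'd', 'd', 'e', 'e'),
  time = c(1, 3, 1, 2, 3, 2, 3, 1, 2),
  res  = c(12, 17, 9, 12, 13, 15, 19, 10, 11),
  stringsAsFactors = FALSE
)

# df.c leads in both calls: all of its subjects are kept
left  <- left_join(df.c, df.e, by = "subj")
right <- right_join(df.e, df.c, by = "subj")

# both results contain the same subjects, including 'c' without measurements
print(sort(unique(left$subj)))
print(sort(unique(right$subj)))
# the number of rows agrees; row and column order may differ
print(nrow(left) == nrow(right))
```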

full_join

full_join(x, y): Return all rows and all columns from both x and y. Where there are no matching values, NA is returned for the missing side. This is a mutating join.

require(dplyr)
df <- full_join(df.e, df.c)

## Joining, by = "subj"

## Warning in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector

df

##    subj time res   name year
## 1     a    1  12   Anna 1995
## 2     a    3  17   Anna 1995
## 3     b    1   9  Berta 1998
## 4     b    2  12  Berta 1998
## 5     b    3  13  Berta 1998
## 6     d    2  15  Dario 1956
## 7     d    3  19  Dario 1956
## 8     e    1  10   <NA>   NA
## 9     e    2  11   <NA>   NA
## 10    c   NA  NA Carlos 2000

# we keep all experimental data but we don't have confidential data for 'e', because it doesn't exist in df.c
# we keep all confidential data although we don't have experimental data for 'c'

All information from both data frames is retained and, where the corresponding information is missing, filled with NA.
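After a full join it is often useful to see which rows were filled with NA. The following sketch assumes that the source columns `name` and `time` contain no missing values of their own, so that an NA there can only stem from a missing match.

```r
library(dplyr)

# Example data (redefined so this block is self-contained)
df.c <- data.frame(
  subj = c('a', 'b', 'c', 'd'),
  name = c('Anna', 'Berta', 'Carlos', 'Dario'),
  year = c(1995, 1998, 2000, 1956),
  stringsAsFactors = FALSE
)
df.e <- data.frame(
  subj = c('a', 'a', 'b', 'b', 'b', 'd', 'd', 'e', 'e'),
  time = c(1, 3, 1, 2, 3, 2, 3, 1, 2),
  res  = c(12, 17, 9, 12, 13, 15, 19, 10, 11),
  stringsAsFactors = FALSE
)

df <- full_join(df.e, df.c, by = "subj")

# rows that exist only in df.e: no confidential data available
print(df[is.na(df$name), ])
# rows that exist only in df.c: no experimental data available
print(df[is.na(df$time), ])
```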

Examples and exercises

Extend the example data sets above with made-up data and observe the effects that occur with each join type.
Also experiment with the join types anti_join() and semi_join().
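As a possible starting point for the second exercise, here is a minimal sketch of the two filtering joins. Unlike the mutating joins above, they only select rows of the first data frame and add no columns from the second; the names `matched` and `unmatched` are made up for this illustration.

```r
library(dplyr)

# Example data (redefined so this block is self-contained)
df.c <- data.frame(
  subj = c('a', 'b', 'c', 'd'),
  name = c('Anna', 'Berta', 'Carlos', 'Dario'),
  year = c(1995, 1998, 2000, 1956),
  stringsAsFactors = FALSE
)
df.e <- data.frame(
  subj = c('a', 'a', 'b', 'b', 'b', 'd', 'd', 'e', 'e'),
  time = c(1, 3, 1, 2, 3, 2, 3, 1, 2),
  res  = c(12, 17, 9, 12, 13, 15, 19, 10, 11),
  stringsAsFactors = FALSE
)

# semi_join: rows of df.e that have a match in df.c
matched <- semi_join(df.e, df.c, by = "subj")
# anti_join: rows of df.e that have no match in df.c
unmatched <- anti_join(df.e, df.c, by = "subj")

print(matched)
print(unmatched)
```

Swapping the two arguments answers the opposite question, i.e. which subjects in df.c have no measurements in df.e.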

Links

tutorial

two table verbs

todo: possibly integrate all join types