
Missing data are represented in R by NA (not available) and come with their own functionality. The result of an operation that involves NA is often NA as well. Many functions and procedures have a flag (call parameter) for handling NA (na.rm). In base R functions such as mean() or sum() its default is FALSE, not TRUE, so the result is NA unless na.rm = TRUE is set explicitly. With na.rm = TRUE the observations with NA are excluded from the respective calculation. This corresponds to the standard procedure of many statistics programs, but may lead to different samples across calculations.
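
A minimal sketch of how NA propagates through a calculation and how na.rm changes the result. The vectors x and y are made up for illustration only.

```r
x <- c(1, 2, NA)

# arithmetic with NA yields NA for the affected element
x_plus_one <- x + 1

# summary functions return NA unless missing values are removed explicitly
m_default <- mean(x)                 # NA
m_removed <- mean(x, na.rm = TRUE)   # 1.5

# each variable keeps its own set of non-missing observations,
# so sample sizes can differ between calculations
y <- c(NA, NA, 3)
n_x <- sum(!is.na(x))   # 2
n_y <- sum(!is.na(y))   # 1
```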

For logical tests there is the command is.na() and its negation !is.na() (not is NA, a logical test for non-missing values).

# get a working base idea
ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
is.na(ddf)
##          x     y
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE  TRUE
!is.na(ddf)
##         x     y
## [1,] TRUE  TRUE
## [2,] TRUE  TRUE
## [3,] TRUE FALSE

Often the number of missing values is of interest. Counting the TRUE values returned by the test for missing values helps here. Alternatively, we can use which().

# create sample dataframe
ddf <- data.frame(
  subj   = c(  1,   2,   3,   4,   5,   6,   7,   8,   9),
  uni    = c(  1,   1,   1,   2,   2,   2,   3,   3,   3),
  grade1 = c(1.0,  NA, 3.7, 1.3,  NA, 1.0, 3.3, 4.0,  NA),
  grade2 = c(4.0, 3.0, 1.3, 1.3, 1.0, 1.3, 2.7, 4.0, 3.3),
  grade3 = c(1.3,  NA, 2.7, 1.0, 1.3, 1.3, 2.3, 3.7, 3.0)
)
# we get a logical (TRUE/FALSE) matrix, one cell per value
is.na(ddf)
##        subj   uni grade1 grade2 grade3
##  [1,] FALSE FALSE  FALSE  FALSE  FALSE
##  [2,] FALSE FALSE   TRUE  FALSE   TRUE
##  [3,] FALSE FALSE  FALSE  FALSE  FALSE
##  [4,] FALSE FALSE  FALSE  FALSE  FALSE
##  [5,] FALSE FALSE   TRUE  FALSE  FALSE
##  [6,] FALSE FALSE  FALSE  FALSE  FALSE
##  [7,] FALSE FALSE  FALSE  FALSE  FALSE
##  [8,] FALSE FALSE  FALSE  FALSE  FALSE
##  [9,] FALSE FALSE   TRUE  FALSE  FALSE
# total number of cells (9 rows x 5 columns)
length(is.na(ddf))
## [1] 45
# number of cells that are TRUE, i.e. missing
length(is.na(ddf)[is.na(ddf) == TRUE])
## [1] 4
# or using which()
which(is.na(ddf))
## [1] 20 23 27 38
# and count it
length(which(is.na(ddf)))
## [1] 4
# how about missings in a column
length(which(is.na(ddf$grade1)))
## [1] 3
# or in a single row
length(which(is.na(ddf[2,])))
## [1] 2
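
Because TRUE counts as 1 in arithmetic, the same counts can also be sketched with sum() and colSums(). The data frame `grades` below is a small made-up example, not the data used above.

```r
grades <- data.frame(
  subj   = 1:4,
  grade1 = c(1.0, NA, 3.7, NA),
  grade2 = c(4.0, 3.0, NA, 1.3)
)

# total number of missing cells
n_total <- sum(is.na(grades))

# number of missing values per column
n_by_col <- colSums(is.na(grades))
```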

Which observations in a data frame have missing values?

# take care: variables in common have to be equal in both dataframes for all subjects
# example: gender of subj 2 is different in the two dataframes
# data of v1: three subjects, 3 vars
ddf.v1 <- data.frame(
  subj   = c(  1,   2,   3),
  gender = c('w', 'm', 'w'),
  age    = c( 20,  22,  27)
  )
# data of v2, more subjects, two more vars, subj 2 has a typo in variable gender
ddf.v2 <- data.frame(
  subj   = c(  1,   2,   3,   4,   5),
  weight = c( 67,  85,  78,  66,  72),
  gender = c('w', 'x', 'w', 'w', 'm'),
  height = c(172, 185, 180, 165, 177)
  )
# take care: gender is used as a merge key, so subj 2 ('m' vs. 'x') does not match across the dataframes
# variables the dataframes have in common have to appear in parameter `by`
# there still are missing data, so the problem of what to do with incomplete data persists
# solution 1: get only complete subjects after merge
merge(ddf.v1, ddf.v2, by=c("subj", "gender"))
##   subj gender age weight height
## 1    1      w  20     67    172
## 2    3      w  27     78    180
# solution 2: keep all rows and fill in NA where data are missing by using flag `all`
merge(ddf.v1, ddf.v2, by=c("subj", "gender"), all=TRUE)
##   subj gender age weight height
## 1    1      w  20     67    172
## 2    2      m  22     NA     NA
## 3    2      x  NA     85    185
## 4    3      w  27     78    180
## 5    4      w  NA     66    165
## 6    5      m  NA     72    177
ddf <- merge(ddf.v1, ddf.v2, by=c("subj", "gender"), all=TRUE)

# where are missings in ddf? how many?
apply(ddf, 1, function(x) length(which(is.na(x))))
## [1] 0 2 1 0 1 1
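
The row-wise count can also be sketched without apply(), using rowSums() on the logical matrix. The data frame `merged` is a made-up stand-in for a merge result with gaps.

```r
merged <- data.frame(
  subj   = c(1, 2, 2, 3),
  gender = c("w", "m", "x", "w"),
  age    = c(20, 22, NA, 27),
  weight = c(67, NA, 85, 78),
  height = c(172, NA, 185, 180)
)

# number of missing values per observation (row)
na_per_row <- rowSums(is.na(merged))

# row numbers of the observations with at least one missing value
rows_with_na <- which(na_per_row > 0)
```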

To exclude all observations with missing values from a data frame, use cleaned <- na.omit(ddf).

# get a working base idea
ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
ddf
##   x  y
## 1 1  0
## 2 2 10
## 3 3 NA
# na.omit() drops every row that contains at least one NA
na.omit(ddf)
##   x  y
## 1 1  0
## 2 2 10
# search for the first missing value
ddf <- read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")
# first occurrence of a missing value in the dataframe
which(is.na(ddf))
## [1] 375
# the index counts cells column by column, so the remainder after
# dividing by the number of rows gives the row of the observation
which(is.na(ddf)) %% nrow(ddf)
## [1] 39
# exclude all observations with missing data
ddf.cleaned <- na.omit(ddf)     #remove the cases with missing values
# n of excluded:
cat(nrow(ddf) - nrow(ddf.cleaned), ' observations excluded')
## 1  observations excluded
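
The modulo shortcut returns 0 when the missing cell sits in the last row. As a sketch of an alternative, which() with arr.ind = TRUE reports row and column directly, and na.omit() records the dropped rows in the attribute "na.action". The data frame `d` is made up for illustration.

```r
d <- data.frame(
  id    = c(1, 2, 3),
  label = c("x", "y", "z"),
  value = c(0.5, 1.5, NA)
)

# row and column position of every missing cell
pos <- which(is.na(d), arr.ind = TRUE)

# name of the column that holds the missing value
col_name <- names(d)[pos[, "col"]]

# the linear index modulo the number of rows is 0 here (last row)
shortcut <- which(is.na(d)) %% nrow(d)

# rows removed by na.omit() are stored as an attribute of the result
cleaned <- na.omit(d)
dropped <- as.integer(attr(cleaned, "na.action"))
```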

Set all negative values of a data frame to NA.

ddf <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))

ddf <- data.frame(lapply(ddf, function(x) { x[x < 0] <- NA; x }))
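
The data frame above contains no negative value, so the effect is not visible there. A minimal sketch with a made-up data frame `ddf_neg` that does contain one:

```r
ddf_neg <- data.frame(x = c(1, -2, 3), y = c(0, 10, NA))

# replace negative values column by column; assigning into ddf_neg[]
# keeps the data frame structure
ddf_neg[] <- lapply(ddf_neg, function(col) {
  col[which(col < 0)] <- NA   # which() skips comparisons that are NA
  col
})
```

Existing NA values stay as they are; only the value -2 in column x is replaced.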

Which observations in a data frame have missing values?

ddf <- read.csv("http://www.bodowinter.com/tutorial/politeness_data.csv")
## todo
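
One possible way to answer this question is complete.cases(), sketched here on a made-up data frame `obs` (column names and values are illustrative and not taken from the file above).

```r
obs <- data.frame(
  subject   = c("F1", "F2", "M1", "M2"),
  frequency = c(213.3, NA, 110.2, 98.7)
)

# TRUE for every observation with at least one missing value
incomplete <- !complete.cases(obs)

# row numbers of these observations
incomplete_rows <- which(incomplete)

# the incomplete observations themselves
obs_incomplete <- obs[incomplete, ]
```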

A few examples of using Hmisc::impute()

library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
DF <- data.frame(age = c(10, 20, NA, 40), sex = c('male','female'))

# impute with mean value

DF$imputed_age <- with(DF, impute(age, mean))

# impute with random value
DF$imputed_age2 <- with(DF, impute(age, 'random'))

# impute with the median
with(DF, impute(age, median))
##   1   2   3   4 
##  10  20 20*  40
# impute with the minimum
with(DF, impute(age, min))
##   1   2   3   4 
##  10  20 10*  40
# impute with the maximum
with(DF, impute(age, max))
##   1   2   3   4 
##  10  20 40*  40
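
What single-value imputation amounts to can be sketched in base R. This is only an illustration of the idea under the same toy values, not the Hmisc implementation, and the names `was_missing` and `age_filled` are made up.

```r
age <- c(10, 20, NA, 40)

# remember which values were missing before filling them in
was_missing <- is.na(age)

# replace each missing value by the mean of the observed values
age_filled <- age
age_filled[was_missing] <- mean(age, na.rm = TRUE)

# filling in the mean leaves the mean unchanged but shrinks the variance
var_observed <- var(age, na.rm = TRUE)
var_filled   <- var(age_filled)
```

Keeping `was_missing` makes it possible to tell imputed from observed values later, similar to the asterisk in the printed Hmisc output.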

aregImpute()

require(Hmisc)
# read data
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi.txt")
# attach ddf for more comfort
# (objects created afterwards live in the workspace, not in ddf)
attach(ddf)
# copy column c_phys_app into a free-standing vector
n_c_phys_app <- c_phys_app
# choose values to set to NA
to_erase <- sample(length(c_phys_app), 5)
# show observations to erase
to_erase
## [1] 28  2 12 26 18
# set selected values to missing
n_c_phys_app[to_erase] <- NA
# show new var with erased data
n_c_phys_app
##  [1]  2 NA  5  4  3  3  4  4  4  5  2 NA  3  2  1  3  2 NA  3  4  3  5  1
## [24]  2  1 NA  3 NA  5  2
# compare with original
cbind(c_phys_app, n_c_phys_app)
##       c_phys_app n_c_phys_app
##  [1,]          2            2
##  [2,]          3           NA
##  [3,]          5            5
##  [4,]          4            4
##  [5,]          3            3
##  [6,]          3            3
##  [7,]          4            4
##  [8,]          4            4
##  [9,]          4            4
## [10,]          5            5
## [11,]          2            2
## [12,]          4           NA
## [13,]          3            3
## [14,]          2            2
## [15,]          1            1
## [16,]          3            3
## [17,]          2            2
## [18,]          4           NA
## [19,]          3            3
## [20,]          4            4
## [21,]          3            3
## [22,]          5            5
## [23,]          1            1
## [24,]          2            2
## [25,]          1            1
## [26,]          4           NA
## [27,]          3            3
## [28,]          3           NA
## [29,]          5            5
## [30,]          2            2
# n_c_phys_app was never stored in ddf, so the column does not exist there
ddf$n_c_phys_app
## NULL
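# note: the formula uses the original c_phys_app, which has no missing values,
# so there is nothing to impute in this run (see "Number of NAs" in the output)
imputed_n_c_phys_app <- aregImpute(
  ~ gender + height + weight + grade + c_phys_app + c_good_way + c_dress +
    c_bad_way + c_figure + filling_time,
  data = ddf, n.impute = 10
)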
## Iteration 1 
## Iteration 2 
## Iteration 3 
## Iteration 4 
## Iteration 5 
## Iteration 6 
## Iteration 7 
## Iteration 8 
## Iteration 9 
## Iteration 10 
imputed_n_c_phys_app
## 
## Multiple Imputation using Bootstrap and PMM
## 
## aregImpute(formula = ~gender + height + weight + grade + c_phys_app + 
##     c_good_way + c_dress + c_bad_way + c_figure + filling_time, 
##     data = ddf, n.impute = 10)
## 
## n: 30    p: 10   Imputations: 10     nk: 3 
## 
## Number of NAs:
##       gender       height       weight        grade   c_phys_app 
##            0            0            0            0            0 
##   c_good_way      c_dress    c_bad_way     c_figure filling_time 
##            0            0            0            0            0 
## 
##              type d.f.
## gender          l   NA
## height          s   NA
## weight          s   NA
## grade           s   NA
## c_phys_app      s   NA
## c_good_way      s   NA
## c_dress         s   NA
## c_bad_way       s   NA
## c_figure        s   NA
## filling_time    s   NA
## 
## Transformation of Target Variables Forced to be Linear
## 
## R-squares for Predicting Non-Missing Values for Each Variable
## Using Last Imputations of Predictors
## named numeric(0)
ddf$n_c_phys_app
## NULL
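
In the run above ddf$n_c_phys_app is NULL because the copy with the erased values exists only as a free-standing vector. A minimal sketch, using a made-up data frame `d` and an arbitrary seed, of storing the modified variable in the data frame itself so that a later call with data = d can see it:

```r
d <- data.frame(id = 1:5, score = c(2, 3, 5, 4, 3))

set.seed(1)                      # arbitrary seed, only for reproducibility
to_erase <- sample(nrow(d), 2)   # positions to set to missing

# create the copy as a column of d instead of a workspace variable
d$score_miss <- d$score
d$score_miss[to_erase] <- NA

# the new column exists in d and carries the missing values
n_missing <- sum(is.na(d$score_miss))
```

Working without attach() avoids the confusion between workspace objects and columns of the data frame.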

From Baron & Li [http://www.psych.upenn.edu/~baron/rpsych/rpsych.html]

7.15 Imputation of missing data

Schafer and Graham (2002) provide a good review of methods for dealing with missing data. R provides many of the methods that they discuss. One method is multiple imputation, which is found in the Hmisc package. Each missing datum is inferred repeatedly from different samples of other variables, and the repeated inferences are used to determine the error. It turns out that this method works best with the ols() function from the Design package rather than with (the otherwise equivalent) lm() function. Here is an example, using the data set t1.

# todo:
#library(Hmisc)
#f <- aregImpute(~v1+v2+v3+v4, n.impute=20,
#     fweighted=.2, tlinear=T, data=t1)
#library(Design)
#fmp <- fit.mult.impute(v1~v2+v3, ols, f, data=t1)
#summary(fmp)

The first command (f) imputes missing values in all four variables, using the other three for each variable. The second command (fmp) estimates a regression model in which v1 is predicted from two of the remaining variables. A variable can be used (and should be used, if it is useful) for imputation even when it is not used in the model.

Examples and exercises

Exercise dataset

#ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/data/virt/v_bmi_miss.txt")