Rmd

These notes follow Wickham (2014).

Data structures

Dimensions   Homogeneous     Heterogeneous
1d           Atomic vector   List
2d           Matrix          Data frame
nd           Array

Vector, (atomic) vector

All elements of an atomic vector have the same data type.

dbl_var <- c(1, 2.5, 4.5)
# With the L suffix, you get an integer rather than a double 
int_var <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors 
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")

logical vectors

# Use TRUE and FALSE (or T and F) to create logical vectors 
log_var <- c(TRUE, FALSE, T, F)
# FALSE becomes 0 and TRUE becomes 1 when coerced to a number
as.numeric(log_var)
## [1] 1 0 1 0
# so useful calculations become possible
# number of TRUEs
sum(log_var)
## [1] 2
# proportion that are TRUE
mean(log_var)
## [1] 0.5

Lists (list)

Lists are vectors whose elements need not share the same data type. The elements of a list can themselves be vectors or lists. Unlike the columns of a data frame, the elements of a list need not have the same length.

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9)) 
str(x)
## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9

list of vectors, list of lists
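The two cases named above can be sketched as follows. The object names `lv` and `ll` are made up for this illustration.

```r
# A list of vectors: the elements may differ in type and in length
lv <- list(1:3, c("a", "b"))
length(lv[[1]])
## [1] 3
length(lv[[2]])
## [1] 2

# A list of lists: lists can be nested
ll <- list(list(1, "a"), list(TRUE))
is.list(ll[[1]])
## [1] TRUE
```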

Scalars

R has no separate scalar type. A scalar is simply a vector of length 1, that is, a vector with a single element.

x <- 1
str(x)
##  num 1
typeof(x)
## [1] "double"
is.atomic(x)
## [1] TRUE

Functions for working with data types

Inspecting existing objects

str()
typeof()
class()
attributes()

is.atomic()
is.list()

is.character()
is.double()
is.integer()
is.logical() 
is.numeric() # TRUE for both integer and double
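As a small sketch of how these functions behave, here they are applied to an integer vector. The name `v` is chosen only for this example.

```r
v <- c(1L, 6L, 10L)
typeof(v)
## [1] "integer"
class(v)
## [1] "integer"
is.integer(v)
## [1] TRUE
is.double(v)
## [1] FALSE
# is.numeric() does not distinguish between integer and double
is.numeric(v)
## [1] TRUE
is.atomic(v)
## [1] TRUE
is.list(v)
## [1] FALSE
# a plain vector carries no attributes
attributes(v)
## NULL
```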

Coercion order, from least to most flexible: logical < integer < double < character. A string representation is always possible.
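The coercion order can be observed by combining values of different types with c(). This is a minimal sketch; the result always takes the most flexible type present.

```r
typeof(c(TRUE, 1L))
## [1] "integer"
typeof(c(1L, 2.5))
## [1] "double"
typeof(c(2.5, "a"))
## [1] "character"
# mixing all four types yields a character vector
c(TRUE, 1L, 2.5, "a")
## [1] "TRUE" "1"    "2.5"  "a"
```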

Explicit conversion (type casting)

as.character() 
as.double()
as.integer()
as.logical()
# ...
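A short sketch of explicit conversion with these functions. The input values are arbitrary examples; values that cannot be converted become NA, and R issues a warning, which is suppressed here.

```r
as.integer(c("1", "2", "3"))
## [1] 1 2 3
as.character(1:3)
## [1] "1" "2" "3"
# zero becomes FALSE, every other number becomes TRUE
as.logical(c(0, 1, 2))
## [1] FALSE  TRUE  TRUE
# converting a double to integer truncates the decimal part
as.integer(2.9)
## [1] 2
# a string that is not a number cannot be converted
suppressWarnings(as.integer("abc"))
## [1] NA
```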

Attributes

Attributes are metadata attached to objects. Any object can carry attributes, and new attributes can be assigned to any object.

y <- 1:10
y
##  [1]  1  2  3  4  5  6  7  8  9 10
str(y)
##  int [1:10] 1 2 3 4 5 6 7 8 9 10
# we assign attributes to objects using attr()
attr(y, "my_attribute") <- "This is a vector" 
# we also can access attributes using attr()
attr(y, "my_attribute")
## [1] "This is a vector"
attr(y, "my_explanation") <- "With this sample vector we explain attributes"
# we access a list of all attributes of an object using attributes()
attributes(y)
## $my_attribute
## [1] "This is a vector"
## 
## $my_explanation
## [1] "With this sample vector we explain attributes"
str(attributes(y))
## List of 2
##  $ my_attribute  : chr "This is a vector"
##  $ my_explanation: chr "With this sample vector we explain attributes"

Names

Elements of vectors can have names, which we can use to refer to them.

x <- 1:3 
names(x) <- c("a", "b", "c")
# we access a single element of x using its position
x[2]
## b 
## 2
# ... or its name
x["b"]
## b 
## 2
attributes(x)
## $names
## [1] "a" "b" "c"

factor

Factors are a data type in which only certain values can (or may) occur. They are often used to encode group membership.

Wickham (2015, p. 21): One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.

x <- factor(c("a", "b", "b", "a")) 
x
## [1] a b b a
## Levels: a b
class(x)
## [1] "factor"
levels(x)
## [1] "a" "b"
# character vectors and factors are not the same
y <- c("a", "b", "b", "a")
typeof(y)
## [1] "character"
as.factor(y)
## [1] a b b a
## Levels: a b
# not all factor levels have to be present
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)
## sex_char
## m 
## 3
table(sex_factor)
## sex_factor
## m f 
## 3 0
# the order of the levels determines the coercion to integers
as.integer(sex_factor)
## [1] 1 1 1
sex_factor <- factor(sex_char, levels = c("f", "m"))
as.integer(sex_factor)
## [1] 2 2 2
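The statement that factors are integer vectors with two attributes can be checked directly. This is a sketch using a made-up factor `f`.

```r
f <- factor(c("a", "b", "b", "a"))
# the underlying storage type is integer
typeof(f)
## [1] "integer"
# the two attributes that turn the integer vector into a factor
attributes(f)
## $levels
## [1] "a" "b"
## 
## $class
## [1] "factor"
# unclass() removes the class and reveals the integer codes
unclass(f)
## [1] 1 2 2 1
## attr(,"levels")
## [1] "a" "b"
```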

Arrays and matrices

R's arrays are atomic vectors that have been given dimensions with dim(). A two-dimensional array is a matrix, which is a special case of an array.

# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# length() generalises to nrow() and ncol() for matrices, and dim() for arrays.
length(a)
## [1] 6
nrow(a)
## [1] 2
ncol(a)
## [1] 3
dim(a)
## [1] 2 3
# names() generalises to rownames() and colnames() for matrices, and dimnames() for arrays
rownames(a) <- c("A", "B") 
colnames(a) <- c("a", "b", "c")
a
##   a b c
## A 1 3 5
## B 2 4 6
is.matrix(a)
## [1] TRUE
is.array(a)
## [1] TRUE
# One vector argument to describe all dimensions 
b <- array(1:12, c(2, 3, 2))
b
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
typeof(b)
## [1] "integer"
dim(b)
## [1] 2 3 2
# check: due to 3 dimensions, b is no matrix, but an array
is.matrix(b)
## [1] FALSE
is.array(b)
## [1] TRUE
# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2) 
c
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
dim(c) <- c(2, 3)
c
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Data frames

Because of their importance, data frames are covered in a separate unit: DataFrames

A data frame is the most common way of storing data in R, and if used systematically (http://vita.had.co.nz/papers/tidy-data.pdf) makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.

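# You create a data frame using data.frame(), which takes named vectors as input:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
# The output above is from R < 4.0.0, where strings became factors by default.
# stringsAsFactors = FALSE (the default since R 4.0.0) keeps y as character:
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)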

# To check whether an object is a data frame, use class() or test explicitly with is.data.frame()
typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
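The properties described in the paragraph on data frames (names(), colnames(), rownames(), length(), ncol(), nrow()) can be illustrated with the same small data frame. This is a sketch, not part of the original notes.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
# names() and colnames() return the same thing for a data frame
names(df)
## [1] "x" "y"
colnames(df)
## [1] "x" "y"
rownames(df)
## [1] "1" "2" "3"
# length() is the length of the underlying list, i.e. the number of columns
length(df)
## [1] 2
ncol(df)
## [1] 2
nrow(df)
## [1] 3
```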

You can combine data frames using cbind() and rbind().

When combining row-wise, both the number and names of columns must match. Use plyr::rbind.fill() to combine data frames that don’t have the same columns.
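A minimal sketch of both operations in base R. The names `wide` and `long` and the added values are made up for illustration.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# cbind() adds columns; the number of rows must match
wide <- cbind(df, data.frame(z = 3:1))
dim(wide)
## [1] 3 3
names(wide)
## [1] "x" "y" "z"

# rbind() adds rows; the number and names of columns must match
long <- rbind(df, data.frame(x = 10L, y = "z", stringsAsFactors = FALSE))
dim(long)
## [1] 4 2
```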

The dplyr equivalent is dplyr::bind_rows().

It’s a common mistake to try and create a data frame by cbind()ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly.
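The difference can be seen in a short sketch. The names `bad` and `good` are chosen for this example only.

```r
# cbind() on plain vectors yields a matrix; mixed types are coerced to character
bad <- cbind(a = 1:2, b = c("a", "b"))
is.matrix(bad)
## [1] TRUE
typeof(bad)
## [1] "character"

# data.frame() keeps each column's own type
good <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = FALSE)
is.data.frame(good)
## [1] TRUE
typeof(good$a)
## [1] "integer"
typeof(good$b)
## [1] "character"
```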

Simplifying vs. preserving

Simplifying: returns the simplest possible data structure that can represent the output.
Preserving: the result always has the same data type as the input.
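A sketch of the distinction for a list and for a data frame. The objects `l` and `d` are made up for this example.

```r
l <- list(a = 1, b = "x")
# preserving: [ returns a list again
typeof(l["a"])
## [1] "list"
# simplifying: [[ returns the element itself
typeof(l[["a"]])
## [1] "double"

d <- data.frame(x = 1:3, y = 3:1)
# simplifying: selecting a single column returns a vector by default
class(d[, "x"])
## [1] "integer"
# preserving: drop = FALSE keeps the data frame
class(d[, "x", drop = FALSE])
## [1] "data.frame"
```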

Subsetting and assignment

All subsetting operators can be combined with assignment to modify selected values of the input vector.
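A minimal sketch of subassignment with positional and logical indices. The vector `x` and its values are arbitrary.

```r
x <- 1:5
# replace by position
x[c(1, 2)] <- 2:3
x
## [1] 2 3 3 4 5
# replace by logical condition
x[x > 3] <- 0L
x
## [1] 2 3 3 0 0
# named elements can be replaced by name
names(x) <- c("a", "b", "c", "d", "e")
x["e"] <- 9L
x[["e"]]
## [1] 9
```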

Vocabulary (p. 75 in Hadley Wickham, Advanced R)

Deleting objects

rm()

a <- matrix(1:6, ncol = 3, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# overwriting is destructive
a <- "a"
a
## [1] "a"
# delete explicitly
rm(a)
# a # accessing a throws an error

Exercises