This unit is preliminary and will be reworked further until Tuesday, 20 April 2021.

`qplot` and `ggplot`

qplot() and ggplot() are the two functions for creating graphics with library(ggplot2).

ggplot doc

`qplot()` simple graphics

A potential replacement for base R's plot(), provided by the ggplot2 package. It behaves much like plot() but offers many more options. Note that qplot() is deprecated as of ggplot2 3.4.0.

For qplot(), see the tutorial by Christian Treffenstädt.

`ggplot()` concepts

Every ggplot2 graphic has three core components:

Data (a data frame)
A set of mappings between variables in the data and visual properties (aesthetics)
A geom, a geometric layer object that describes how each data point is rendered.

Two approaches:

Creating graphics from raw data. The required parameters, for example means or measures of dispersion, are computed directly within ggplot(). Applying a geom_...() or stat_...() operation transforms (aggregates) the raw data into the data to be displayed.
Creating graphics from parameters that were computed elsewhere or are already known. This is done with stat="identity", i.e. no statistic is applied to aggregate the data in the given data frame. The precomputed values must be stored in a data frame.
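
The two approaches can be compared side by side. The following is a minimal sketch with made-up values; the data frame names raw and agg are assumptions for illustration only. The first plot lets stat_summary() compute the group means from the raw data, the second plots means that were computed beforehand and passes them through unchanged with stat="identity".

```r
library(ggplot2)

# made-up raw data: 4 observations in each of two groups
raw <- data.frame(
  g = rep(c("a", "b"), each = 4),
  y = c(3, 2, 4, 4, 6, 8, 9, 8)
)

# approach 1: ggplot aggregates the raw data itself (mean per group)
p_raw <- ggplot(raw, aes(x = g, y = y)) +
  stat_summary(fun = mean, geom = "bar")

# approach 2: aggregate beforehand, then plot the values unchanged
agg <- aggregate(y ~ g, data = raw, FUN = mean)
p_agg <- ggplot(agg, aes(x = g, y = y)) +
  geom_bar(stat = "identity")

p_raw
p_agg
```

Both plots should show the same bar heights; only the place where the aggregation happens differs.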

ggplot is not suited for 3D graphics.

ggplot graphics are modular; they implement the concept of layers.

The ggplot() command creates a plot object. This is the sensible place to set the basic properties of that object: usually the (main) data source and the variables that form the coordinates. Aesthetics that should apply to all subsequent layers can also be set here.

Not specific to ggplot(), but particularly important in this context: when an Rmd is rendered (knitted) to HTML, a single HTML file is produced that also contains the graphics (base64 conversion: <img src="data:image/png;base64,iVBOR...).

Layers

Concept by Hadley Wickham

Normally a layer is not created explicitly but implicitly via geom_...(). A layer has five components: data, the aesthetic mappings, the geom, the stat, and position adjustments. Each layer can use its own data (data frame) or inherit the data of the ggplot object.

Geometric objects `geom_...()`

They define which kind of geometric object is drawn on the layer. The layer is created implicitly.

A layer is created either with the layer() command or with specialized commands that imply a particular type of layer. layer(geom="point", stat="identity", position="identity") is equivalent to geom_point(). A few geom_...() functions:

  geom_bar
  geom_point
  geom_line
  geom_smooth
  geom_histogram
  geom_boxplot
  geom_text
  geom_density
  geom_errorbar
  geom_hline, geom_vline

Which geoms are there?

require("ggplot2")

## Loading required package: ggplot2

# get a list of all currently available geoms
geoms <- help.search("geom_", package="ggplot2")
# show name and title of the first 5 geoms
geoms$matches[1:5, 1:2]

##         Topic                                               Title
## 1 geom_abline Reference lines: horizontal, vertical, and diagonal
## 2 geom_abline Reference lines: horizontal, vertical, and diagonal
## 3 geom_abline Reference lines: horizontal, vertical, and diagonal
## 4    geom_bar                                          Bar charts
## 5    geom_bar                                          Bar charts

# which arguments can be used in a specific geom, here for geom_boxplot
args(geom_boxplot)

## function (mapping = NULL, data = NULL, stat = "boxplot", position = "dodge2", 
##     ..., outlier.colour = NULL, outlier.color = NULL, outlier.fill = NULL, 
##     outlier.shape = 19, outlier.size = 1.5, outlier.stroke = 0.5, 
##     outlier.alpha = NULL, notch = FALSE, notchwidth = 0.5, varwidth = FALSE, 
##     na.rm = FALSE, orientation = NA, show.legend = NA, inherit.aes = TRUE) 
## NULL
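
The help search above lists a help topic once for every matching alias, which is why geom_abline appears several times. As an alternative sketch, the exported function names of the package can be filtered directly; the variable name geom_names is an assumption for illustration.

```r
library(ggplot2)

# all exported ggplot2 functions whose name starts with "geom_"
geom_names <- sort(grep("^geom_", getNamespaceExports("ggplot2"), value = TRUE))

# show the first few names and the total count
head(geom_names)
length(geom_names)
```

The exact number depends on the installed ggplot2 version.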

Aesthetics `aes()`

A set of mappings between variables in the data and visual properties (aesthetics). These are design attributes that can be set globally and, where needed, overridden locally, or that are defined only locally. The geom determines which aesthetics (a subset) are allowed, e.g.

position linetype size shape colour alpha

Aesthetics can apply globally or locally.
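
A minimal sketch of global versus local aesthetics, using made-up data (the object names p and p2 are assumptions for illustration). Mappings given in ggplot() are inherited by every layer; a layer can add its own mappings or, with inherit.aes = FALSE, ignore the global ones.

```r
library(ggplot2)

dd <- data.frame(
  x = 1:8,
  y = c(3, 2, 4, 4, 6, 8, 9, 8),
  g = rep(c("m", "f"), each = 4)
)

# global mappings: x, y and colour apply to every layer
p <- ggplot(dd, aes(x = x, y = y, colour = g)) +
  geom_line() +                          # inherits x, y and colour
  geom_point(aes(shape = g), size = 3)   # adds a local mapping for shape

# a layer can also opt out of the global mappings entirely
p2 <- ggplot(dd, aes(x = x, y = y, colour = g)) +
  geom_point() +
  geom_line(aes(x = x, y = y), inherit.aes = FALSE)  # one uncoloured line

p
p2
```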

stats

stats can be used by geoms, or they generate graphical elements on their own. Their purpose is to compute the parameters needed for the display from (raw) data.

Many geoms have to transform the raw data. This is what the stat_...() functions are for.

# find all versions of stats
stats <- help.search("stat_", package="ggplot2")
# get the first 5, their name and title
stats$matches[1:5, 1:2]

##          Topic                                           Title
## 1     geom_bar                                      Bar charts
## 2   geom_bin2d                        Heatmap of 2d bin counts
## 3   geom_bin2d                        Heatmap of 2d bin counts
## 4 geom_boxplot A box and whiskers plot (in the style of Tukey)
## 5 geom_contour                     2D contours of a 3D surface

To understand what stats do, we can run them outside the ggplot context and look at the results.

# some values in a vector
vv <- c(100, 111, 112, 104,  98,  87,  95,  90, 
         90,  97, 102,  95,  88,  79,  82,  83)
# a factor-like vector
gg <- c(  1,   1,   1,   1,   1,   1,   1,   1,
          2,   2,   2,   2,   2,   2,   2,   2)
# tapply(data, group, function)
# tapply applies a function to each subgroup defined by group to the data of that group

require(Hmisc)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

# mean_cl_normal is a stat used by errorbar
tapply(vv, gg, mean_cl_normal)

## $`1`
##        y   ymin    ymax
## 1 99.625 92.029 107.221
## 
## $`2`
##      y     ymin     ymax
## 1 89.5 82.76719 96.23281

# boxplot() returns the statistics a boxplot is built from (and draws a base R plot as a side effect)
tapply(vv, gg, boxplot)

## $`1`
## $`1`$stats
##       [,1]
## [1,]  87.0
## [2,]  92.5
## [3,]  99.0
## [4,] 107.5
## [5,] 112.0
## 
## $`1`$n
## [1] 8
## 
## $`1`$conf
##           [,1]
## [1,]  90.62078
## [2,] 107.37922
## 
## $`1`$out
## numeric(0)
## 
## $`1`$group
## numeric(0)
## 
## $`1`$names
## [1] "1"
## 
## 
## $`2`
## $`2`$stats
##       [,1]
## [1,]  79.0
## [2,]  82.5
## [3,]  89.0
## [4,]  96.0
## [5,] 102.0
## 
## $`2`$n
## [1] 8
## 
## $`2`$conf
##          [,1]
## [1,] 81.45871
## [2,] 96.54129
## 
## $`2`$out
## numeric(0)
## 
## $`2`$group
## numeric(0)
## 
## $`2`$names
## [1] "1"
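
The values a stat actually hands to its geom can also be inspected inside ggplot2 with layer_data(). The following sketch reuses the numbers from above in a data frame (the names df and box_stats are assumptions for illustration). Note that ggplot2 computes the hinges as quartiles, so lower and upper may differ slightly from the boxplot() output shown above.

```r
library(ggplot2)

df <- data.frame(
  v = c(100, 111, 112, 104,  98,  87,  95,  90,
         90,  97, 102,  95,  88,  79,  82,  83),
  g = factor(rep(1:2, each = 8))
)

p <- ggplot(df, aes(x = g, y = v)) + geom_boxplot()

# one row per box: whisker ends, hinges and median as computed by the stat
box_stats <- layer_data(p)[, c("x", "ymin", "lower", "middle", "upper", "ymax")]
box_stats
```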

Anatomy

Create the plot object. Add layers. Print to the screen or save to a file. Layers are appended to the plot object with the '+' operator. In multi-line plot commands, the '+' that adds a layer may stand at the end of a line but not at the beginning.
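
The steps above can be sketched as follows, with made-up data. The point to note is the position of '+': a line that ends with '+' tells R that the expression continues, whereas a line that starts with '+' would be parsed as a new, separate expression.

```r
library(ggplot2)

dd <- data.frame(x = 1:8, y = c(3, 2, 4, 4, 6, 8, 9, 8))

# 1. create the plot object
p <- ggplot(dd, aes(x = x, y = y))

# 2. add layers; each '+' stands at the END of a line
p <- p +
  geom_point() +
  geom_line()

# 3. output on the current graphics device
print(p)
```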

geom() and aes()

A brief introduction / quick reference to the structure (anatomy) of ggplot2 graphics:

A slideshow on ggplot2 by Istvan Zahn: Slideshow

A few examples to aid understanding

Many ways to the same graphic

Using scatterplots as an example.

Because the graphics are 'programmed', a structure as open and modular as ggplot's often offers several ways to achieve the same result. This is demonstrated here with the different ways of passing the required information (data frame, assignment of columns to axes).

# mini dataframe
dd <- data.frame(
    x = c( 1, 2, 3, 4, 5, 6, 7, 8),
    y = c( 3, 2, 4, 4, 6, 8, 9, 8),
    g = c(rep('m', 4), rep('f', 4))
    )
# package ggplot2 has to be loaded
require("ggplot2")
# create a plot object without any defaults or preset parameters/values/geoms/aes ...
pplot <- ggplot()
# add a geom of type scatterplot (geom_point) including data 
pplot + 
  geom_point(data=dd, aes(x=x, y=y))

# create a plot object with some basic information
pplot <- ggplot(data=dd, aes(x=x, y=y))
# we can also use unnamed parameters for aes x and y
pplot <- ggplot(data=dd, aes(x, y))

# add scatterplot using `geom_point()` that inherits its parameters from the parent ggplot object
pplot + 
  geom_point()

# alternative syntax for the same plot using explicit layer()
# geom_point() as used above implicitly generates a layer(); here this is done explicitly
# again the data may be provided by the layer or by the base plot object
pplot <- ggplot()
pplot + layer(
  data=dd, 
  mapping=aes(x=x, y=y),
  position = "identity",
  stat="identity",
  geom = "point"
)

# alternatively via data inherited from the base plot object pplot; layer() refers to that and only sets position, stat and geom
pplot <- ggplot(data=dd, aes(x=x, y=y))
pplot + layer(
  mapping = NULL,
  position = "identity",
  stat="identity",
  geom = "point"
)

# finally the same graph using qplot()
qplot(x, y, data=dd)

We can create an "empty" plot object. No basic information is set at all, neither a data frame nor assignments of variables to axes or the like. All information is supplied via the parameters of geom_point().

Alternatively, we can create a plot object, e.g. named "pplot", that already contains the data frame and the assignment of variables to axes. All layers added to "pplot" via "+" automatically fall back on these. Our geom_point() then needs no parameters at all.

While geom_point() implicitly creates a layer(), we can also use the layer() command explicitly. The inheritance of parameters from the base object described above works here as well.

qplot() does everything implicitly but offers less control.

Multiple `geom_...()`, how `stats` work, interplay with `geom_...()`

When a geom_...() has to aggregate over the individual data points, a stat_...() comes into play.

# mini dataframe
dd <- data.frame(
    x = c( 1, 2, 3, 4, 5, 6, 7, 8),
    y = c( 3, 2, 4, 4, 6, 8, 9, 8),
    g = c(rep('m', 4), rep('f', 4))
    )
require("ggplot2")
# create a scatterplot object 
pplot <- ggplot(data=dd, aes(x=x, y=y)) + geom_point()
# add a geom smooth, representing the regression line, we exclude the default 95% confidence interval
pplot + 
  geom_smooth(method = lm, se=F)

## `geom_smooth()` using formula 'y ~ x'

# add a geom smooth, representing the regression line, 95% confidence interval included by default
pplot + 
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

# we can add informative descriptive lines mean and sd
ggplot(data=dd, aes(x=x, y=y)) + 
    geom_hline(yintercept=mean(dd$y), color="red") +
    geom_hline(yintercept=mean(dd$y) + sd(dd$y), color="red", alpha=0.2) + 
    geom_hline(yintercept=mean(dd$y) - sd(dd$y), color="red", alpha=0.2) + 
    geom_point(data=dd, mapping=aes(x=x, y=y))

# we can combine geom_boxplot and geom_point and geom_hline
ggplot(data=dd, aes(x=x, y=y)) + 
    geom_boxplot(data=dd, mapping=aes(x=y)) + # boxplot uses median
    geom_point() + 
    geom_hline(yintercept=mean(dd$y), color="red") # we add mean of y

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Line graphs

Schematic pattern (the data frame mry is not defined in this unit):

p <- ggplot(mry, aes(x=year, y=number, group=rating))  # fill=rating?
p + geom_line()

A single line, also interrupted

# mini dataframe
dd <- data.frame(
    x = c( 1, 2, 3, 4, 5, 6, 7, 8),
    y = c( 3, 2, 4, 4, 6, 8, 9, 8),
    g = c(rep('m', 4), rep('f', 4))
    )
# package ggplot2 has to be loaded
require("ggplot2")

# create a plot object with some basic information
pplot <- ggplot(data=dd, aes(x=x, y=y))
# add a line layer
pplot + geom_line()

# line interrupted
pplot <- ggplot(data=dd, aes(x=x, y=y, group=g))
pplot + geom_line()

Multiple lines

The lines to be displayed are stored in separate columns of the data frame and can have different colors.

# mini dataframe
dd <- data.frame(
    x  = c( 1, 2, 3, 4, 5, 6, 7, 8),
    y1 = c( 3, 2, 4, 4, 6, 8, 9, 8),
    y2 = c( 4, 5, 6, 5, 8, 9,11,14),
    g = c(rep('m', 4), rep('f', 4))
    )
# package ggplot2 has to be loaded
require("ggplot2")
# create a plot object with some basic information
pplot <- ggplot(data=dd, aes(x=x))
# add two line layers
pplot + geom_line(aes(y=y1)) + geom_line(aes(y=y2))

pplot + geom_line(aes(y=y1), color='red') + geom_line(aes(y=y2), color='blue')

# two lines interrupted
# dd has no column y, so only x and group are set globally; each layer supplies its own y
pplot <- ggplot(data=dd, aes(x=x, group=g))
pplot + geom_line(aes(y=y1), color='red') + geom_line(aes(y=y2), color='blue')

Multiple lines, data in long format

# mini dataframe
dd <- data.frame(
    g  = factor(c( 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)),
    # x  = rep(c(1, 2, 3, 4, 5), 3),
    x  = c( 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
    y  = c( 3, 2, 3, 4, 6, 2, 2, 4, 3, 7, 1, 3, 5, 8, 9)
    )
# package ggplot2 has to be loaded
require("ggplot2")
# create a plot object with some basic information
pplot <- ggplot(data=dd, aes(x=x, y=y, group=g))
pplot + geom_line()

# each line with separate color
pplot + geom_line(aes(color=g))

Lines with points

# mini dataframe
dd <- data.frame(
    g  = factor(c( 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)),
    # x  = rep(c(1, 2, 3, 4, 5), 3),
    x  = c( 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
    y  = c( 3, 2, 3, 4, 6, 2, 2, 4, 3, 7, 1, 3, 5, 8, 9)
    )
# package ggplot2 has to be loaded
require("ggplot2")
# create a plot object with some basic information
pplot <- ggplot(data=dd, aes(x=x, y=y, group=g))
# each line with separate color, points colored accordingly
pplot + geom_line(aes(color=g)) + geom_point(aes(color=g))

# each line with different color and shape
pplot + geom_line(aes(color=g)) + geom_point(aes(color=g, shape = g))

# each line with different color and shape and a bigger size of point
pplot + geom_line(aes(color=g)) + geom_point(aes(color=g, shape = g), size=7)

Lines with dispersion information (error bars)

# mini dataframe
dd <- data.frame(
    g  = factor(c( 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)),
    # x  = factor(rep(c(1, 2, 3, 4, 5), 3)),
    x  =        c( 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
    y  =        c( 3, 2, 3, 4, 6, 2, 2, 4, 3, 7, 1, 3, 5, 8, 9),
    se =        c(.2,.3,.1,.2,.1,.3,.4,.3,.3,.4,.1,.2,.1,.1,.2)
    )
# package ggplot2 has to be loaded
require("ggplot2")
# create a plot object with some basic information
pplot <- ggplot(data=dd, aes(x=x, y=y, group=g))

# just to show that errorbars are plotted on own layer
pplot + geom_errorbar(aes(ymax = y + se, ymin=y - se), width=0.1)

# combine it with lines
pplot + geom_line(aes(color=g)) + geom_errorbar(aes(ymax = y + se, ymin=y - se), width=0.1)

# each line with different color and shape and a bigger size of point and errorbar
pplot + 
    geom_line(aes(color=g)) + 
    geom_point(aes(color=g, shape = g), size=4) + 
    geom_errorbar(aes(ymax = y + se, ymin=y - se, color=g), width=0.1)

# todo: points must not overlap

Grouped data

Using scatterplots as an example. We can use grouping variables very intuitively by adding them to aes().

# mini dataframe with two grouping vars
dd <- data.frame(
    x = c( 1, 2, 3, 4, 5, 6, 7, 8),
    y = c( 3, 2, 4, 4, 6, 8, 9, 8),
    g = c(rep('m', 4), rep('f', 4)),
    h = c(rep(c('A', 'B'), 4))
    )
require("ggplot2")
# we create a plot object and associate it with data
pplot <- ggplot(data=dd)
# we add aes with x and y and group g, colored differently
pplot + geom_point(aes(x, y, group=g, color=g))

# we add aes with x and y and group h, colored differently
pplot + geom_point(aes(x, y, group=h, color=h))

# we can visualize two groups combining color and shape
pplot + geom_point(aes(x, y, group=g, color=g, shape=h))

# we can mix points and lines
pplot + 
    geom_point(aes(x, y, group=g, color=g, shape=h)) +
    geom_line(aes(x, y, group=g, color=g, shape=h))

## Warning: Ignoring unknown aesthetics: shape

# or
pplot + 
    geom_point(aes(x, y, group=g, color=g, shape=h)) +
    geom_line(aes(x, y, group=h, color=h))

# use both grouping vars combined by applying interaction and line
pplot + (aes(x, y, colour=h, shape = g,
  group=interaction(g, h))) + 
  geom_point() + geom_line()

Bar charts (bar plots)

# mini dataframe
dd <- data.frame(
    time  = c("t1", "t2", "t1", "t2"),
    group = c("f", "f", "m", "m"),
    mean  = c(1.6, 1.8, 3.5, 4.1),
    se    = c(0.1, 0.3, 0.3, 0.2)
    )
# package ggplot2 has to be loaded
require("ggplot2")

# create a plot object, define dataframe to use, map time to fill colour already in base plot
pplot <- ggplot(dd, aes(x = group, y = mean, fill = time))

# create a bar plot; error bars and points use the same dodge width as the bars (0.9) so they line up
pplot +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_errorbar(aes(ymax = mean + se, ymin= mean - se), position = position_dodge(width=0.9), width=0.2) +
  geom_point(stat="identity", shape=21, size=5, position=position_dodge(width=0.9))

# we might prefer other colors
pplot +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_errorbar(aes(ymax = mean + se, ymin= mean - se), position = position_dodge(width=0.9), width=0.2) +
  geom_point(stat="identity", shape=21, size=5, position=position_dodge(width=0.9)) +
  scale_fill_manual(values=c("#CC6666", "#9999CC", "#66CC99"))

Graphics based on multiple data sources

A single ggplot graphic can use several data sources of type tibble/data frame. We can pass a data= parameter to a geom_...() to specify the data it should use. This way, raw values and results from other computations can be shown together in one graphic.

require("ggplot2")
# define dataframe 3 groups, gender and two vars (weight and height)
df1 <- data.frame(
  m.weight = c( 50,  55,  75,  77,  90,  98), 
  m.height = c(165, 184, 170, 179, 167, 182), 
  group = factor(c(rep(1,2),rep(2,2),rep(3,2))), 
  gender = factor(c(rep(c(1,2),3)))
  )
# we might have new weights after some treatment
df2 <- data.frame(
  m.weight=c( 55,  57,  74,  75,  78,  88), 
  m.height=c(165, 184, 170, 179, 167, 182), 
  group=factor(c(rep(1,2),rep(2,2),rep(3,2))), 
  gender=factor(c(rep(c(1,2),3)))
  )
df3 <- data.frame(
    m1  = mean(df1$m.weight),
    se1 =   sd(df1$m.weight) / sqrt(length(df1$m.weight)),
    m2  = mean(df2$m.weight),
    se2 =   sd(df2$m.weight) / sqrt(length(df2$m.weight))
)
# we do a plot of t1
pplot <- ggplot() +  
  geom_point(data = df1, aes(x=m.height, y=m.weight, shape=group))
pplot

# we add t2; fixed values for colour and size are set outside aes(),
# inside aes() 'red' would be treated as a data value, not as a colour name
pplot2 <- pplot + 
  geom_point(data=df2, aes(m.height, m.weight, shape=group), color='red', show.legend=FALSE) 
pplot2

# we add something of df3
pplot3 <- pplot2 + 
  geom_point(data=df3, stat='identity', aes(160, m1), size=10, show.legend=FALSE) +
  geom_errorbar(data=df3, stat='identity', aes(x=160, ymax = m1 + se1, ymin=m1 - se1), show.legend=FALSE) + 
  geom_point(data=df3, stat='identity', aes(161, m2), size=10, color='red', show.legend=FALSE) +
  geom_errorbar(data=df3, stat='identity', aes(x=161, ymax = m2 + se2, ymin=m2 - se2), color='red', show.legend=FALSE)
pplot3

# erase data to keep environment clean
rm(df1)
rm(df2)

Adding extras to graphics, modifying axes

Labels, titles, axis modifications, extras etc. are also components that are added to the plot with '+', just like layers.

# mini dataframe
dd <- data.frame(
    x = c( 1, 2, 3, 4, 5, 6, 7, 8),
    y = c( 3, 2, 4, 4, 6, 8, 9, 8)
    )
# package ggplot2 has to be loaded
require("ggplot2")

# create a plot object with some basic information and a point layer
pplot <- ggplot(data=dd, aes(x=x, y=y)) + geom_point()

# we might want to add a title
pplot + 
  ggtitle("Streudiagramm von einigen x und y")

# title can have several lines
pplot + 
  ggtitle("Streudiagramm von einigen x und y \n wenig vertrauenserweckende Werte!")

# maybe some other axis names in addition to a title?
pplot + 
  labs(title="Streudiagramm von einigen x und y", x = "x regelmäßig steigend", y = "eng, aber nicht vollständig gebunden")

# let's restrict the y scale to the range from 1 to 10
pplot + 
  labs(title="Streudiagramm von einigen x und y", x = "x regelmäßig steigend", y = "1 < y < 10") +
  ylim(1, 10)

pplot + 
  labs(title="Streudiagramm von einigen x und y", x = "x regelmäßig steigend", y = "1 < y < 10") +
  #ylim(1, 10) +
  scale_y_continuous(breaks=seq(1, 10, 1))  # ticks from 1 to 10, every 1.0

Colors

Colors in ggplot.

Saving graphics

When saving graphics from a script, the available options depend on the platform we are working on.

# mini dataframe
dd <- data.frame(
  time  = c("t1", "t2", "t1", "t2"),
    group = c("f", "f", "m", "m"),
    mean  = c(1.6, 1.8, 3.5, 4.1),
    se    = c(0.1, 0.3, 0.3, 0.2)
    )
# package ggplot2 has to be loaded
require("ggplot2")

# create a plot object, define dataframe to use, map time to fill colour already in base plot
pplot <- ggplot(dd, aes(x = group, y = mean, fill = time))

# create a bar plot; error bars and points use the same dodge width as the bars (0.9) so they line up
pplot +
  geom_bar(stat="identity", position=position_dodge()) +
  geom_errorbar(aes(ymax = mean + se, ymin= mean - se), position = position_dodge(width=0.9), width=0.2) +
  geom_point(stat="identity", shape=21, size=5, position=position_dodge(width=0.9))

# now save it to a file; by default ggsave() saves the last plot that was displayed
ggsave(filename="test.pdf")

## Saving 7 x 5 in image

# ggsave(filename="test.jpeg", dpi=72)
# ggsave(filename="test.svg", plot=pplot, width=10, height=5)
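
Relying on the last displayed plot can be fragile in scripts. A more explicit variant is sketched below: the plot object is passed via plot=, the size is stated, and the file goes to a temporary directory. The file name bars.pdf and the object names are assumptions for illustration; geom_col() is used as shorthand for geom_bar(stat="identity").

```r
library(ggplot2)

dd <- data.frame(
  time  = c("t1", "t2", "t1", "t2"),
  group = c("f", "f", "m", "m"),
  mean  = c(1.6, 1.8, 3.5, 4.1)
)

# keep the finished plot in an object so it can be passed to ggsave()
p <- ggplot(dd, aes(x = group, y = mean, fill = time)) +
  geom_col(position = position_dodge())

# the file type is chosen from the extension; the size is given explicitly
out_pdf <- file.path(tempdir(), "bars.pdf")
ggsave(filename = out_pdf, plot = p, width = 7, height = 5, units = "in")

# check that the file was written
file.exists(out_pdf)
```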

Multiple graphics - several graphics on one page

The `facet_grid()` approach for multiple graphics or subplots

# scatterplot of all the pacs items with the scale using faceting
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# or, as UTF-8 is default in tidyverse readr
require(tidyverse)

## Loading required package: tidyverse

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ tibble  3.1.0     ✓ dplyr   1.0.5
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x dplyr::src()       masks Hmisc::src()
## x dplyr::summarize() masks Hmisc::summarize()

dd <- read_tsv("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   name = col_character(),
##   gender = col_character(),
##   bmi_class = col_character(),
##   bmi_dichotom = col_character(),
##   bmi_normal = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.

p <- ggplot(dd, aes(bmi, pacs)) + geom_point()
# With one variable
p + facet_grid(. ~ gender)

p + facet_grid(bmi_dichotom ~ gender)

Alternative approaches for multiple graphics

Unfortunately, the classic way of splitting the output with par() does not work in ggplot2. As an alternative, we can use grid.arrange() from library(gridExtra).

cf. [http://stackoverflow.com/questions/1249548/side-by-side-plots-with-ggplot2-in-r]

or the cookbook recipe multiplot

cf. [http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_%28ggplot2%29/]

Alternatively: cowplot for more sophisticated grids

require(ggplot2)
require(grid)

## Loading required package: grid

require(gridExtra)

## Loading required package: gridExtra

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

#require("gridExtra")
#require(grid)

# create four plots and show them in grid
x <- rnorm(100, 100, 15)
y <- x + 5 + rnorm(100, 0, 5)
dd <- data.frame(x, y)
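# after_stat(density) replaces the older ..density.. notation (requires ggplot2 >= 3.3.0)
p1 <- ggplot(dd, aes(x=x, y=y)) + geom_histogram(aes(y=after_stat(density)))
p2 <- ggplot(dd, aes(x=x, y=y)) + geom_histogram(aes(x=y, y=after_stat(density)))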
p3 <- ggplot(dd, aes(x=x, y=y)) + geom_point()
p4 <- p3 + geom_hline(yintercept=mean(y))
#p1 <- ggplot(dfbmi, x=nation, y=bmi, aes(nation, bmi, fill = gender))
grid.arrange(p1,p2,p3,p4, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Graphics with precomputed values that are to be used for the display

The principle is explained very nicely on Christian Treffenstädt's rspace under grafiken

r-cookbook also stores the minimal information in a small data frame and uses it: bar graphs

Important: all values to be used in a geom_...() must be stored in a data frame. See also the remarks on "Graphics based on multiple data sources".

The decisive parameter is stat="identity", which means that the result of the computation (stat) is the value itself; in other words, nothing is computed and the result equals the value that was passed in.

require("ggplot2")
# define dataframe 3 groups, gender and two vars (weight and height)
dd <- data.frame(
  m.weight=c( 50,  55,  75,  77,  90,  98), 
  m.height=c(165, 184, 170, 179, 167, 182), 
  group=factor(c(rep(1,2),rep(2,2),rep(3,2))), 
  gender=factor(c(rep(c(1,2),3)))
  )
ggplot(data=dd, 
  aes(x=group, y=m.weight, fill=gender)) + 
  geom_bar(stat="identity", position=position_dodge(), colour="black")

Preparing data for `ggplot()` with `Rmisc::summarySE()`

The function summarySE() turns a data frame into a data frame aggregated over grouping variables, containing descriptive statistics. We can use such a result data frame to create ggplot graphics.

# define dataframe 3 groups, gender and two vars (weight and height)
dd <- data.frame(
  m.weight=c( 50,  55,  75,  77,  90,  98,     53,  57,  76,  75,  91,  99), 
  m.height=c(165, 184, 170, 179, 167, 182,    166, 183, 171, 179, 174, 183), 
  group=factor(c(rep('a',2),rep('b',2),rep('c',2),rep('a',2),rep('b',2),rep('c',2))), 
  gender=factor(c(rep(c('f', 'm'),6)))
  )
# we use summarySE to calculate group descriptives
require(Rmisc)

## Loading required package: Rmisc

## Loading required package: plyr

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize

# columns are addressed by name, so the result column is called m.weight
ggframe <- Rmisc::summarySE(data = dd, measurevar = "m.weight", groupvars = "group")

# we get a data frame with all required parameters for the plot
ggframe

##   group N m.weight        sd        se       ci
## 1     a 4    53.75 2.9860788 1.4930394 4.751518
## 2     b 4    75.75 0.9574271 0.4787136 1.523480
## 3     c 4    94.50 4.6547467 2.3273733 7.406741

# we use that for ggplot, referring to columns by name inside aes()
ggplot(ggframe, aes(x = group, y = m.weight)) + 
  geom_bar(stat = "identity", width = 0.5, aes(fill = group)) + 
  labs(title = "Gewichtsgruppe", x = "Gruppenzugehörigkeit", y = "Gewicht (kg)") + 
  geom_errorbar(aes(ymin = m.weight - ci, ymax = m.weight + ci), width = 0.1, size = 0.5)

Adding text and arrows

require("ggplot2")
dd <- data.frame(
  m.weight=c( 50,  55,  75,  77,  90,  98,     53,  57,  76,  75,  91,  99), 
  m.height=c(165, 184, 170, 179, 167, 182,    166, 183, 171, 179, 174, 183), 
  group=factor(c(rep('a',2),rep('b',2),rep('c',2),rep('a',2),rep('b',2),rep('c',2))), 
  gender=factor(c(rep(c('f', 'm'),6)))
  )

# we add an annotation
ggplot(data=dd, aes(x=m.height, y=m.weight, color=gender)) + 
  geom_point() + 
    geom_text(x=180, y=72, label="important", color="red")

# we add an arrow with label
require(grid)  # needed for arrow
ggplot(data=dd, aes(x=m.height, y=m.weight)) + 
  geom_point() + 
    geom_text(x=174, y=80, label="critical", color="red") +
    geom_segment(aes(x = 174, y = 85, xend = 174, yend = 90), arrow = arrow(length = unit(0.5, "cm")))

# we add a red arrow with label and highlight one point with a circle
require(grid)  # needed for arrow
ggplot(data=dd, aes(x=m.height, y=m.weight)) + 
  geom_point() + 
    geom_text(x=174, y=83, label="critical", color="red") +
    geom_segment(aes(x = 174, y = 85, xend = 174, yend = 90), arrow = arrow(length = unit(0.5, "cm")), color="red") +
  geom_point(aes(x = 167, y = 90), shape=1, size=10, color="green")
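
In the examples above, geom_text() and geom_segment() inherit the data frame, so the same label and arrow are drawn once per data row, on top of each other. For single annotations, annotate() is an alternative that draws each element exactly once. A minimal sketch with a shortened version of the data (the object name p is an assumption for illustration):

```r
library(ggplot2)

dd <- data.frame(
  m.weight = c( 50,  55,  75,  77,  90,  98),
  m.height = c(165, 184, 170, 179, 167, 182)
)

p <- ggplot(dd, aes(x = m.height, y = m.weight)) +
  geom_point() +
  # one text label and one arrow, independent of the number of data rows
  annotate("text", x = 174, y = 83, label = "critical", colour = "red") +
  annotate("segment", x = 174, y = 85, xend = 174, yend = 90,
           colour = "red", arrow = arrow(length = unit(0.5, "cm")))
p
```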

Density plots in `ggplot2`

require(ggplot2)
#Sample data
dat <- data.frame(dens = c(rnorm(100), rnorm(100, 10, 5))
                   , lines = rep(c("a", "b"), each = 100))
#Plot.
ggplot(dat, aes(x = dens, fill = lines)) + geom_density(alpha = 0.5)

# a normal distribution
df <- data.frame(x=1:100, y=rnorm(100,0,1))
ggplot(df, aes(y)) + stat_function(fun=dnorm, args=list(mean=mean(df$y), sd=sd(df$y)))

Time courses: long, wide, and ggplot

Graphical displays with ggplot() are often much easier when the data are "tidy". A wide -> long conversion may be useful to make the data easier to visualize. There is no simple way in ggplot() to plot repeated-measures data in wide format along the x-axis.

Illustrative example: 2*3 pure repeated-measures design

DV: number of reproduced syllables
subj: participant
gender: 'f', 'm'
mnemotechnik (mnemonic technique): 'm', 'c'
t1: DV at time 1
t2: DV at time 2
t3: DV at time 3

# we generate data frame in wide format
df.w <- data.frame(
  subj   = c(1,2,3,4),
  gender = c('f', 'f', 'm', 'm'),
  t1c     = c( 4,  5,  7,  6),
  t2c     = c( 6,  5,  9,  8),
  t3c     = c( 7,  6, 12, 10),
  t1m     = c( 8,  6, 15, 10),
  t2m     = c(12,  7, 19, 13),
  t3m     = c(13,  7, 22, 14)
  )
# take a look at it
df.w

##   subj gender t1c t2c t3c t1m t2m t3m
## 1    1      f   4   6   7   8  12  13
## 2    2      f   5   5   6   6   7   7
## 3    3      m   7   9  12  15  19  22
## 4    4      m   6   8  10  10  13  14

df.w$subj <- factor(df.w$subj)

# we could try to visualize using multiple aes in geoms and manual adaptation of x
require(ggplot2)
plw <-  ggplot() +
        geom_point(data=df.w, aes(x=1, y=t1c, group=subj, color=subj, shape=subj)) +
        geom_point(data=df.w, aes(x=2, y=t2c, group=subj, color=subj, shape=subj)) +
        geom_point(data=df.w, aes(x=3, y=t3c, group=subj, color=subj, shape=subj))
plw

# but this is not an elegant way to do it, not at all


# we better go the tidy way ...

# we gather all repeated measure variables at once
require(tidyr)
df.l <- df.w %>% tidyr::gather(key=m_t, value=dv, t1c:t3m)
# we make new columns to specify the repeated measure factors by splitting the source column name after its 2nd character; the source column is kept to control what's happening
df.l %>% tidyr::separate(m_t, c("time", "cond"), sep=2, remove=FALSE) -> df.l
# we want to have all repeated measures for each person in consecutive lines
df.l %>% dplyr::arrange(subj, cond, time) -> df.l

df.l$subj <- as.factor(df.l$subj)
df.l$cond <- as.factor(df.l$cond)

# with tidy data it boils down to this ..
require(ggplot2)
pl <-  ggplot(data=df.l, aes(x=time, y=dv, group=subj, color=subj, shape=subj)) + 
    geom_line(aes(group=subj)) +
    geom_point() +
    facet_wrap(~cond)
    
pl
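
The gather() and separate() steps can also be combined into a single call with tidyr::pivot_longer(), which the tidyr documentation recommends over gather() for new code. This is a sketch under the assumption that all repeated-measures column names follow the pattern of two characters for time plus one character for condition; the result name df.l2 is chosen for illustration.

```r
library(tidyr)

# same wide data as above
df.w <- data.frame(
  subj   = factor(1:4),
  gender = c("f", "f", "m", "m"),
  t1c = c( 4,  5,  7,  6),
  t2c = c( 6,  5,  9,  8),
  t3c = c( 7,  6, 12, 10),
  t1m = c( 8,  6, 15, 10),
  t2m = c(12,  7, 19, 13),
  t3m = c(13,  7, 22, 14)
)

# reshape and split the column names in one step
df.l2 <- pivot_longer(
  df.w,
  cols      = t1c:t3m,
  names_to  = c("time", "cond"),
  names_sep = 2,        # split after the 2nd character: "t1" | "c"
  values_to = "dv"
)
df.l2
```

The resulting tibble has one row per participant, time and condition and can be passed to the ggplot() call shown above in place of df.l.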

Legend

legend

Colors

If we map the color aesthetic to a variable of type factor, the number of colors is determined by the number of factor levels, and the colors are assigned in the order of the levels (alphabetical by default).

Stackoverflow: background

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

hue_pal()(5)

## [1] "#F8766D" "#A3A500" "#00BF7D" "#00B0F6" "#E76BF3"

show_col(hue_pal()(5))

# background from the stackoverflow post above
# It is just equally spaced hues around the color wheel, starting from 15:

gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
n = 4
cols = gg_color_hue(n)
cols

## [1] "#F8766D" "#7CAE00" "#00BFC4" "#C77CFF"

# dev.new(width = 4, height = 4)
plot(1:n, pch = 16, cex = 2, col = cols)

dd <- tibble(
  nr = c(1,2,3,4,5),
  yy = c(2, 4, 5, 3, 5.5),
  ff = factor(c('a', 'b', 'b', 'a', 'b'))
)
# dd$ff has two levels
dd$ff

## [1] a b b a b
## Levels: a b

# we get these colors for dd$ff
hue_pal()(2)

## [1] "#F8766D" "#00BFC4"

dd %>% ggplot(aes(x = nr, y = yy, color = ff)) + 
       geom_point() + 
       geom_hline(yintercept = 2.5,   color = hue_pal()(2)[1]) +
       geom_hline(yintercept = 4.833, color = hue_pal()(2)[2])
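
The two yintercept values above are the group means of yy, typed in by hand. As a hedged sketch of an alternative, the means can be computed first and passed to geom_hline() as their own data frame; because color is then mapped to ff, the lines pick up the matching default colors without calling hue_pal(). The name dd_means is an illustrative assumption.

```r
library(dplyr)
library(ggplot2)

dd <- tibble(
  nr = c(1, 2, 3, 4, 5),
  yy = c(2, 4, 5, 3, 5.5),
  ff = factor(c("a", "b", "b", "a", "b"))
)

# one row per factor level with the mean of yy
dd_means <- dd %>%
  group_by(ff) %>%
  summarise(m = mean(yy))

# geom_hline() does not inherit the plot aesthetics, so it only needs yintercept and color
p <- ggplot(dd, aes(x = nr, y = yy, color = ff)) +
  geom_point() +
  geom_hline(data = dd_means, aes(yintercept = m, color = ff))
p
```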

Examples and exercises

An example with virtual data on reaction times and P300 (EEG) html

A few experiments and example code: bmi_graphics html

Example graphics: body_comarison_graphics html

Exercise: vary the examples

“play” with the examples above by changing the data and observing the effect on the resulting graphics
“play” with the examples above by changing the parameters of ggplot
enhance the example graphics with additional highlights, explanations, annotations etc.

Exercise: data set “df_dplyr.txt”

For the example data set df_dplyr.txt, create:
a few scatterplots (v1, v2, v3)
v1 … v3 as trajectories along the x axis
a boxplot for gender, group
a bar graph for gender, group
… with error bars
… with precomputed aggregated values, e.g. using summarySE()
treat the variables v1 … v3 as repeated measures and create a graphic that shows the trajectory separately for each group

require(tidyverse)
dd <- read_delim("http://md.psych.bio.uni-goettingen.de/mv/data/div/df_dplyr.txt", delim="\t")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   subj = col_double(),
##   gender = col_character(),
##   age = col_double(),
##   grp = col_double(),
##   v1 = col_double(),
##   v2 = col_double(),
##   v3 = col_double()
## )

# or
dd <- read_tsv("http://md.psych.bio.uni-goettingen.de/mv/data/div/df_dplyr.txt")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   subj = col_double(),
##   gender = col_character(),
##   age = col_double(),
##   grp = col_double(),
##   v1 = col_double(),
##   v2 = col_double(),
##   v3 = col_double()
## )

Exercise: economics

The ggplot2 package ships with a data set of economic time series for the USA (economics).

Note:

In German usage, the number name Billion stands for 1000 Milliarden, i.e. 1,000,000,000,000 = \(10^{12}\), a one followed by 12 zeros in the decimal system. 1000 Billionen make one Billiarde. The unit prefix for the factor \(10^{12}\) is tera with the symbol T. The German abbreviation is Bio. or Bill., the latter of which can be confused with Billiarde.

The US-American billion, in contrast, corresponds to the German Milliarde (\(10^{9}\)).

plot the number of unemployed over the years
additionally draw a smoother to visualize the long-term trend

help(economics)
# we make economics accessible by a shorter name, df
df <- economics

pp <- ggplot(data = df, aes(x = date, y = unemploy))
pp <- pp + geom_line()
pp <- pp + geom_smooth()

pp

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Example: l/100km and mpg

1 L/100km corresponds to \(100 \cdot 3.79/1.61 = 235.4\) mpg.

1 gallon = 3.79 liters (L)
1 mile = 1.61 kilometers (km)

mpg -> l/100km

\(l/100km = 100 \cdot 3.79 / (mpg \cdot 1.61) = 235.4 / mpg\)
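
The conversion can be wrapped in a small helper. This is a minimal sketch using the rounded factors stated above (1 gallon = 3.79 L, 1 mile = 1.61 km); the function name mpg_to_l100km and its arguments are hypothetical and not part of any package.

```r
# convert miles per gallon into liters per 100 km
# l_per_gallon and km_per_mile default to the rounded factors given in the text
mpg_to_l100km <- function(mpg, l_per_gallon = 3.79, km_per_mile = 1.61) {
  100 * l_per_gallon / (km_per_mile * mpg)
}

mpg_to_l100km(c(10, 20, 30))
```

Because the relation is reciprocal, applying the same function to a value in l/100km gives back mpg.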

require(ggplot2)
?mpg
dd <- mpg

require(psych)

## Loading required package: psych

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:scales':
## 
##     alpha, rescale

## The following object is masked from 'package:Hmisc':
## 
##     describe

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

psych::describe(dd)

##               vars   n    mean    sd median trimmed   mad    min  max range
## manufacturer*    1 234    7.76  5.13    6.0    7.68  5.93    1.0   15  14.0
## model*           2 234   19.09 11.15   18.5   18.98 14.08    1.0   38  37.0
## displ            3 234    3.47  1.29    3.3    3.39  1.33    1.6    7   5.4
## year             4 234 2003.50  4.51 2003.5 2003.50  6.67 1999.0 2008   9.0
## cyl              5 234    5.89  1.61    6.0    5.86  2.97    4.0    8   4.0
## trans*           6 234    5.65  2.88    4.0    5.53  1.48    1.0   10   9.0
## drv*             7 234    1.67  0.66    2.0    1.59  1.48    1.0    3   2.0
## cty              8 234   16.86  4.26   17.0   16.61  4.45    9.0   35  26.0
## hwy              9 234   23.44  5.95   24.0   23.23  7.41   12.0   44  32.0
## fl*             10 234    4.63  0.70    5.0    4.77  0.00    1.0    5   4.0
## class*          11 234    4.59  1.99    5.0    4.64  2.97    1.0    7   6.0
##                skew kurtosis   se
## manufacturer*  0.21    -1.63 0.34
## model*         0.11    -1.23 0.73
## displ          0.44    -0.91 0.08
## year           0.00    -2.01 0.29
## cyl            0.11    -1.46 0.11
## trans*         0.29    -1.65 0.19
## drv*           0.48    -0.76 0.04
## cty            0.79     1.43 0.28
## hwy            0.36     0.14 0.39
## fl*           -2.25     5.76 0.05
## class*        -0.14    -1.52 0.13

require(tidyverse)
# we convert miles per gallon into liters per 100 km and save the result in additional columns
dd$ctl100km <- 100 * 3.79 / (dd$cty * 1.61)
dd$hwl100km <- 100 * 3.79 / (dd$hwy * 1.61)
# we generate column eco which represents the order from most economic to least economic consumption
dd <- dd %>% arrange(ctl100km)
dd$eco <- 1:nrow(dd)

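# we plot city consumption and highway consumption for the manufacturers
# a constant inside aes() becomes a legend label, not a color name, so we use descriptive labels
dd %>% ggplot(aes(x=manufacturer, y=ctl100km, color="city")) + geom_point() + geom_point(aes(y=hwl100km, color="highway"))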

# we plot average city consumption for the manufacturers
dd %>% ggplot(aes(x=manufacturer, y=ctl100km, color=manufacturer)) + stat_summary(fun = mean, geom="point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width=0.2, size=1)

# we aggregate data to get a tibble with average consumption per manufacturer (row)
dd %>% dplyr::group_by(manufacturer) %>% dplyr::summarise(ct_m=mean(ctl100km), ct_se=sd(ctl100km)/sqrt(n())) -> dd.m
dd.m <- dd.m %>% dplyr::mutate(upper = ct_m + ct_se, lower = ct_m - ct_se) %>% dplyr::arrange(ct_m)
# we generate column eco which represents the order from most economic to least economic manufacturer in consumption
dd.m$eco <- 1:nrow(dd.m)

dd.m$manuf <- factor(dd.m$manufacturer)

# consumption ordered by manufacturer (alphabetically)
# x must stay row-aligned with ct_m, so we do not reverse the manufacturer vector
dd.m %>% ggplot(aes(x=manufacturer, y=ct_m, color=manufacturer)) + geom_point(size=3) + 
  geom_errorbar(mapping=aes(ymin=lower, ymax=upper), width=0.2, size=1)

# consumption ordered by consumption
dd.m %>% ggplot(aes(x=reorder(manuf, eco), y=ct_m, color=reorder(manuf, eco))) + 
  geom_point(size=3) + 
  geom_errorbar(mapping=aes(ymin=lower, ymax=upper), width=0.2, size=1)

Interpolation

… for missing values in repeated-measures trajectories.

For political reasons, a survey running over 7 years was not conducted in years 4 and 5. The two missing years are to be estimated by suitable values for each individual.

require(tidyverse)
dd <-tibble(
  subj = as.factor(c(1,2,3,4,5)),
  y1   = c( 2,  0,  5,  4,  7),
  y2   = c( 6,  2,  7, 12,  9),
  y3   = c( 7,  9,  9, 15, 15),
  y4   = c(NA, NA, NA, NA, NA),
  y5   = c(NA, NA, NA, NA, NA),
  y6   = c(14, 18, 15, 22, 25),
  y7   = c(18, 22, 19, 29, 23)
)
# we convert dd to long format
dd.l <- dd %>% tidyr::gather(year, score, y1:y7) %>% dplyr::arrange(subj)
# and show a graph with the gap between the missing years
dd.l %>% ggplot2::ggplot(aes(x=year, y=score, group=subj, color=subj)) + ggplot2::geom_line() + ggplot2::geom_point(size=5)

## Warning: Removed 10 rows containing missing values (geom_point).

# we interpolate the missing values linearly between year 3 and year 6 (three equal steps)
dd %>% dplyr::mutate(diff_3_6 = y6 - y3, y4 = y3 + diff_3_6/3, y5 = y6 - diff_3_6 / 3) -> dd
# we convert dd to long format
dd.l <- dd %>% tidyr::gather(year, score, y1:y7) %>% dplyr::arrange(subj)
# and show the graph again, now with the gap filled by the interpolated values
dd.l %>% ggplot2::ggplot(aes(x=year, y=score, group=subj, color=subj)) + ggplot2::geom_line() + ggplot2::geom_point(size=5)

# btw: we could also think of an individual regression line
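
The manual step above is tied to a gap of exactly two years. As a hedged sketch of a more general approach, base R's approx() interpolates linearly over any interior gap; the helper name interpolate_linear is hypothetical, and the example uses the scores of participant 1 from the table above.

```r
# linear interpolation of interior missing values with base R (stats::approx)
interpolate_linear <- function(x, y) {
  ok <- !is.na(y)
  # estimate y at every x from the observed points only
  approx(x = x[ok], y = y[ok], xout = x)$y
}

years  <- 1:7
scores <- c(2, 6, 7, NA, NA, 14, 18)   # participant 1, years 4 and 5 missing

interpolate_linear(years, scores)
```

With equally spaced years this reproduces the values from the mutate() call. With its default settings, approx() returns NA for positions outside the observed range, so missing values at the start or end of a series stay missing.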

Experiments with stat_summary

require(ggplot2)
year_plot <- ggplot(mpg, aes(x = class, y = cty))
year_plot + stat_summary(fun = mean, geom = "point") + 
  stat_summary(fun = mean, geom = "line", aes(group = 1)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2) +
  labs(x = "Type of car",
       y = "Miles per gallon (city)",
       title = "Miles per gallon in the city by type of car")

mpg_data <- mpg
year_plot <- ggplot(mpg, aes(x = class, y = cty))
# without aes(group = 1) each class forms its own group, so no line can be drawn
year_plot + stat_summary(fun = mean, geom = "point") + 
  stat_summary_bin(fun = mean, geom = "line")

## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

Sub-exercise sheet_plots

At the students' request, a section as presented by Johannes Brachem on 7 May 2020

library(tidyverse)
# tm <- read_delim("text_messages.dat", delim = "\t")
tm <- read_delim("https://owncloud.gwdg.de/index.php/s/d1QBJpKEpqhPIa9/download", delim = "\t")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Group = col_character(),
##   Baseline = col_double(),
##   Six_months = col_double()
## )

tm$Group %>% unique()

## [1] "Text Messagers" "Controls"

tm_long <- tm %>% gather(Baseline, Six_months, key = "time", value = "score")
p1 <- ggplot(tm_long, aes(x = time, y = score, color = Group))
p1 + 
  stat_summary(fun = "mean",      # compute the means
               geom = "point",    # display them as points
               size = 2.5,        # change the size
               aes(shape = Group)
  ) +
  stat_summary(fun.data = mean_cl_boot,
               geom = "errorbar",
               width = 0.2
               ) +
  stat_summary(fun = "mean",
               geom = "line",
               aes(group = Group, linetype = Group)
               ) +
  scale_linetype_manual(values = c(5, 4)) +
  
  labs(x = "Time", 
       y = "Grammar score", 
       title = "Plot title",
       color = "Group",
       shape = "Group",
       linetype = "Group"
       ) +
  NULL

Custom legend

library(tidyverse)
ds <- read_tsv("https://formr.org/assets/tmp/admin/_43eBqnR8S-ekngSj_Ya9r_MZnbwhk2nD4p2nGHYIrkD.txt?v1590078725")

## Warning: Missing column names filled in: 'X1' [1]

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   day = col_double(),
##   trueValue = col_double(),
##   error = col_double(),
##   obsValue = col_double()
## )

# we add vars, only to get the legend we want
ds$gar <- factor(rep(1:3, 10))
ds$Legende <- factor(c(rep("Punkt: Messwert", 10), rep("blau: Trendlinie", 10), rep("grau: Konfidenzintervall", 10)))
fitlmT      <- lm(obsValue ~ day, data = ds)
ggplot(ds, aes(day,obsValue)) +
  geom_point() + 
  geom_smooth(method="lm") + 
  labs(title ="Verlauf über die Tage", x = "Tag der Beobachtung", y = "tägliche Anzahl") +
  geom_point(aes(color=Legende), alpha=0)

## `geom_smooth()` using formula 'y ~ x'

ggplot(ds, aes(day,obsValue)) +
  geom_point() + 
  geom_smooth(method="lm") + 
  labs(title ="Verlauf über die Tage", x = "Tag der Beobachtung", y = "tägliche Anzahl") +
  geom_point(aes(color=gar), alpha=0) + 
  scale_colour_identity("Legende", guide = "legend", aesthetics = "colour", breaks=c(1,2,3), labels=c("Wert", "Trend", "KI"))

## `geom_smooth()` using formula 'y ~ x'

to get a legend, we have to use an aes() with color, shape, fill, … which in turn creates a scale
we can adapt what is shown in the legend in several ways [https://www.r-graph-gallery.com/239-custom-layout-legend-ggplot2.html]
the trick in the first example is to draw a layer with alpha=0, which makes its points invisible but still generates the legend
the second example uses scale_colour_identity() in combination with the newly created three-level factor gar
both solutions are workarounds, because legend generation has to be triggered by a mapped aesthetic
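
Another common way to trigger a legend, sketched here under stated assumptions, is to map a constant label to color inside aes() of each layer and to assign the actual colors with scale_color_manual(). The data frame ds_demo is simulated for illustration only and is not the data set read above; the labels and colors are arbitrary choices.

```r
library(ggplot2)

# simulated stand-in data (assumption): 30 days with a noisy linear trend
set.seed(1)
ds_demo <- data.frame(day = 1:30)
ds_demo$obsValue <- 10 + 0.5 * ds_demo$day + rnorm(30, sd = 2)

# each layer maps a constant label to color; the manual scale turns labels into colors
p_legend <- ggplot(ds_demo, aes(x = day, y = obsValue)) +
  geom_point(aes(color = "observed value")) +
  geom_smooth(aes(color = "linear trend"), method = "lm", formula = y ~ x) +
  scale_color_manual(
    name   = "Legend",
    values = c("observed value" = "black", "linear trend" = "blue")
  )
p_legend
```

No invisible helper layer is needed here, because the visible layers themselves carry the mapped aesthetic.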

References/Links

ggplot cheat sheet

Hadley Wickham's ggplot2 book

Homepage of ggplot2 or ggplot2

Documentation: “official documentation”

Overview of colors in ggplot2: [http://www.cookbook-r.com/Graphs/Colors_%28ggplot2%29/]

[http://sape.inf.usi.ch/quick-reference/ggplot2]

color specification

List of color names

course day 1 course day 2

example

Hadley Wickham on ggplot2 link

TEDs of Hans Rosling

Maps

Maps: only a mini example here.

library(maps)

## 
## Attaching package: 'maps'

## The following object is masked from 'package:plyr':
## 
##     ozone

## The following object is masked from 'package:purrr':
## 
##     map

# maps::map() is called explicitly because purrr::map() has the same name
outlines <- maps::map("world", xlim = c(-113.8, -56.2), ylim = c(-21.2, 36.2))

ToDo

todo: how to show and store a ggplot object in one command: use () around the ggplot and store command: ( my_graph <- ggplot(...) + ....)

todo: include, cool multi whisker plots for grouping variables with single observations integrated http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/79-plot-meansmedians-and-error-bars/
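
The first todo above (show and store a ggplot object in one command) can be sketched as follows. Wrapping an assignment in parentheses makes R print the assigned value, which for a ggplot object means drawing it. The example uses the mpg data from ggplot2; the name my_graph is taken from the todo note.

```r
library(ggplot2)

# the outer parentheses store the plot in my_graph and display it at once
(my_graph <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point())
```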

Check

Plots are cool and help us to understand statistics.

Some Points to understand

Graphs are programmed in R; there is no interactive, direct-manipulation editing
Plots are R objects with attributes (components or layers): a container that keeps all the information needed to generate a plot
A plot connects raw data, aggregating functions, plot types, scales etc., and subsamples
Layers are responsible for creating the objects that we perceive on the plot. A layer is composed of five parts:
- Data
- Aesthetic mappings.
- A statistical transformation (stat). Generates a layer.
- A geometric object (geom). Generates a layer.
- A position adjustment.
the complete process behind:

mastery-schema.png

source: ggplot2 book
the cheat sheet ggplot, a great point to learn about the possibilities

Some hints

+ has to be at the end of a line and not at the beginning

require(tidyverse)
dd <- read_tsv("http://md.psych.bio.uni-goettingen.de/data/div/df_dplyr.txt")
# this works
pp <- ggplot(dd, aes(x = age, y = v1)) +
   geom_point()
pp
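# this does not work: the first line is already a complete statement,
# so R evaluates the second line on its own as a unary plus and throws an error
pp <- ggplot(dd, aes(x = age, y = v1))
   + geom_point()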
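
Why the position of + matters can be checked with base R's parse(), which reports how many separate expressions a piece of code contains. This is a small sketch; f() and g() are placeholder names that are only parsed, never evaluated.

```r
code_trailing_plus <- "pp <- f(x) +\n  g()"    # + at the end of the line
code_leading_plus  <- "pp <- f(x)\n  + g()"    # + at the beginning of the next line
code_in_parens     <- "pp <- (f(x)\n  + g())"  # leading + inside parentheses

length(parse(text = code_trailing_plus))  # one expression
length(parse(text = code_leading_plus))   # two expressions: the second line is on its own
length(parse(text = code_in_parens))      # one expression
```

Inside parentheses the parser keeps reading until the closing parenthesis, so a leading + is accepted there.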

understand inheritance

require(tidyverse)
dd <- read_tsv("http://md.psych.bio.uni-goettingen.de/mv/data/div/df_dplyr.txt")
# geom_point knows which data to show, because it inherits dd and aes from its parent object
pp <- ggplot(dd, aes(x = age, y = v1)) +
   geom_point()
# but everything can be overwritten ...
pp <- ggplot(dd, aes(x = age, y = v1)) +
   geom_point() +
   geom_point(aes(x=v2), color = "blue")
   # the second geom_point() inherits dd and aes(y = v1), aes(x) is overwritten by v2
pp

factors on axes: the sequence matters

library(tidyverse)
dd <- readr::read_tsv("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   name = col_character(),
##   gender = col_character(),
##   bmi_class = col_character(),
##   bmi_dichotom = col_character(),
##   bmi_normal = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.

# default factor ordering by alphabet
dd$f_bmi_class <- factor(dd$bmi_class)
head(dd$f_bmi_class)

## [1] normal low    obese  low    high   low   
## Levels: high low normal obese

# sequence from low to high
dd$of_bmi_class <- factor(dd$bmi_class, levels = c("low", "normal", "high", "obese"))
head(dd$of_bmi_class)

## [1] normal low    obese  low    high   low   
## Levels: low normal high obese

# factor order matters in ggplot
ggplot(dd, aes(x = f_bmi_class, y = bmi)) +
   stat_summary(fun = mean, color ="red") +
   geom_point()

## Warning: Removed 4 rows containing missing values (geom_segment).

ggplot(dd, aes(x = of_bmi_class, y = bmi)) +
   stat_summary(fun = mean, color ="red") +
   geom_point()

## Warning: Removed 4 rows containing missing values (geom_segment).
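
Instead of typing the level order by hand, the levels can also be ordered by a statistic of another variable with base R's reorder(). The following is a minimal sketch with made-up toy values; bmi_demo and mf_bmi_class are illustrative names and not part of the data set above.

```r
# toy data (assumption), only to show the mechanism
bmi_demo <- data.frame(
  bmi_class = c("normal", "low", "obese", "low", "high", "normal"),
  bmi       = c(22, 17, 33, 18, 27, 23)
)

# order the factor levels by the mean bmi within each class
bmi_demo$mf_bmi_class <- reorder(factor(bmi_demo$bmi_class), bmi_demo$bmi, FUN = mean)
levels(bmi_demo$mf_bmi_class)
```

Used as x in ggplot(), such a factor places the classes in ascending order of their mean.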

if possible, long format is often better (tidy data)

See long wide and ggplot

Screencast

ggplot() - basic ideas and some hints StudIP - ownCloud

\end{comment}

Version: 17 April, 2021 07:49

Graphics with ggplot from the package library(ggplot2)

Peter Zezula (pzezula@uni-goettingen.de)

Rmd