The simple linear regression model
\[ Y = \beta_0 + \beta_1X + E \]
Random variable \(Y\): criterion, dependent variable (Everitt: response variable, outcome variable). Random variable \(X\): predictor, explanatory variable, independent variable, regressor (Everitt: explanatory variable). \(\beta_0\): constant, y-intercept. \(\beta_1\): slope.
In the sample
\[ y_i = (b_0 + b_1x_i) + \epsilon_i \]
\(y_i\) are the observed values of the random variables \(Y_i\); \(x_i\) are the realizations of the random variables \(X_i\). \(b_0\): constant, y-intercept. \(b_1\): slope, weight, units in the y-direction per 1 unit in the x-direction.
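The population model above can be illustrated with simulated data; a minimal sketch in which the true parameters and the error standard deviation are made-up values:

```r
# simulate data from Y = beta_0 + beta_1 * X + E  (illustrative parameter values)
set.seed(1)
beta_0 <- 10
beta_1 <- 2
x <- runif(100, 0, 10)                          # predictor values
y <- beta_0 + beta_1 * x + rnorm(100, sd = 3)   # add normally distributed error E
# the sample estimates b_0 and b_1 recover the true parameters approximately
coef(lm(y ~ x))
```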
Example data set used for all examples on this page.
# read data
#ddf <- read.delim("http://r.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi.txt", fileEncoding = "UTF-8")
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# create three groups, imagine as countries
ddf$nation <- factor(c(rep(1, 10), rep(2,10), rep(3,10)), levels = c(1,2,3), labels = c("Norway", "Switzerland", "Portugal"))
# take a look at ddf
names(ddf)
## [1] "subj" "name" "gender" "height" "weight"
## [6] "grade" "c_phys_app" "c_good_way" "c_dress" "c_bad_way"
## [11] "c_figure" "filling_time" "bmi" "bmi_class" "bmi_dichotom"
## [16] "bmi_normal" "i_c_bad_way" "pacs_sum" "pacs" "bodysatisf"
## [21] "nation"
head(ddf)
## subj name gender height weight grade c_phys_app c_good_way c_dress
## 1 1 Meier m 180 60 2.02 2 1 2
## 2 2 Müller m 176 54 1.00 3 4 3
## 3 3 Strauß f 180 110 2.03 5 5 5
## 4 4 Übeling f 182 52 1.00 4 5 4
## 5 5 Ärger f 160 62 1.00 3 2 2
## 6 6 Östro f 156 41 1.00 3 4 3
## c_bad_way c_figure filling_time bmi bmi_class bmi_dichotom bmi_normal
## 1 4 1 39.63 18.51852 normal normal TRUE
## 2 1 5 26.00 17.43285 low extreme FALSE
## 3 1 5 26.00 33.95062 obese extreme FALSE
## 4 2 3 26.00 15.69859 low extreme FALSE
## 5 4 1 26.00 24.21875 high extreme FALSE
## 6 3 4 26.00 16.84747 low extreme FALSE
## i_c_bad_way pacs_sum pacs bodysatisf nation
## 1 2 8 1.6 16 Norway
## 2 5 20 4.0 76 Norway
## 3 5 25 5.0 68 Norway
## 4 4 20 4.0 70 Norway
## 5 2 10 2.0 45 Norway
## 6 3 17 3.4 58 Norway
# attach() would put ddf's columns on the search path so names could be used directly (use with caution)
#attach(ddf)
Least squares visualized
Try it out yourself with a Shiny app by Andreas Cordes
Simple regression - purely graphical at first.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# default scatterplot
plot(ddf$height, ddf$weight)
# x=0 and y=0 included
plot(ddf$height, ddf$weight, ylim=c(-5,130), xlim=c(-5,200))
abline(h=0)
abline(v=0)
A regression always has a direction of inference. In R a regression is specified as a model: the explained or predicted variable (criterion, DV, response variable) is predicted by the predictor (explanatory variable, IV). A linear model can be stored as an object. In R it 'shows itself' by printing its parameters. These and more can be accessed and reused.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# default scatterplot
plot(ddf$height, ddf$weight)
# response variable y is predicted by explanatory variable x
# model <- lm(y ~ x)
model <- lm(weight ~ height, data=ddf)
# here weight predicted by height
# the coefficients are shown by default
model
##
## Call:
## lm(formula = weight ~ height, data = ddf)
##
## Coefficients:
## (Intercept) height
## -131.929 1.138
# put regression line to graph
abline(model)
# visual check: the constant is the intercept on the y-axis
plot(ddf$height, ddf$weight, ylim=c(-140,130), xlim=c(-5,200))
abline(h=0)
abline(v=0)
abline(model)
What information do I get when fitting a simple regression model? The linear model object contains and provides everything relevant.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# calculate model
model <- lm(weight ~ height, data=ddf)
# model has attributes to offer
attributes(model)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
# list of names
names(model)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
# access attributes in model via object['attribute-name']
model['fitted.values'] # predicted values
## $fitted.values
## 1 2 3 4 5 6 7 8
## 72.87765 68.32639 72.87765 75.15328 50.12134 45.57008 55.81042 50.12134
## 9 10 11 12 13 14 15 16
## 48.98352 83.11799 47.84571 86.53143 80.84236 87.66925 64.91294 55.81042
## 17 18 19 20 21 22 23 24
## 58.08605 67.18857 72.87765 69.46420 71.73983 75.15328 70.60202 74.01546
## 25 26 27 28 29 30
## 78.56673 72.87765 77.42891 71.73983 76.29109 52.39697
model$fitted.values # alternative access
## 1 2 3 4 5 6 7 8
## 72.87765 68.32639 72.87765 75.15328 50.12134 45.57008 55.81042 50.12134
## 9 10 11 12 13 14 15 16
## 48.98352 83.11799 47.84571 86.53143 80.84236 87.66925 64.91294 55.81042
## 17 18 19 20 21 22 23 24
## 58.08605 67.18857 72.87765 69.46420 71.73983 75.15328 70.60202 74.01546
## 25 26 27 28 29 30
## 78.56673 72.87765 77.42891 71.73983 76.29109 52.39697
# residuals
residuals(model)
## 1 2 3 4 5 6
## -12.8776484 -14.3263867 37.1223516 -23.1532793 11.8786603 -4.5700780
## 7 8 9 10 11 12
## 15.1895831 19.8786603 -4.9835243 -23.1179873 -2.8457089 43.4685663
## 13 14 15 16 17 18
## 10.1576435 -1.6692491 -8.9129404 -8.8104169 -6.0860478 -14.1885713
## 19 20 21 22 23 24
## 17.1223516 10.5357979 18.2601670 24.8467207 -10.6020176 0.9845361
## 25 26 27 28 29 30
## 1.4332744 -18.8776484 -19.4289102 -14.7398330 -23.2910947 1.6030294
# or via
model$residuals
## 1 2 3 4 5 6
## -12.8776484 -14.3263867 37.1223516 -23.1532793 11.8786603 -4.5700780
## 7 8 9 10 11 12
## 15.1895831 19.8786603 -4.9835243 -23.1179873 -2.8457089 43.4685663
## 13 14 15 16 17 18
## 10.1576435 -1.6692491 -8.9129404 -8.8104169 -6.0860478 -14.1885713
## 19 20 21 22 23 24
## 17.1223516 10.5357979 18.2601670 24.8467207 -10.6020176 0.9845361
## 25 26 27 28 29 30
## 1.4332744 -18.8776484 -19.4289102 -14.7398330 -23.2910947 1.6030294
# coefficients necessary to set up regression line
coefficients(model) # constant (par 1) slope (par 2)
## (Intercept) height
## -131.929130 1.137815
# str() gives a complete overview of the object attributes
str(model)
## List of 12
## $ coefficients : Named num [1:2] -131.93 1.14
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
## $ residuals : Named num [1:30] -12.9 -14.3 37.1 -23.2 11.9 ...
## ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
## $ effects : Named num [1:30] -371.5 65 38.3 -22.4 17.4 ...
## ..- attr(*, "names")= chr [1:30] "(Intercept)" "height" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:30] 72.9 68.3 72.9 75.2 50.1 ...
## ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:30, 1:2] -5.477 0.183 0.183 0.183 0.183 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:30] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "height"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.18 1
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 28
## $ xlevels : Named list()
## $ call : language lm(formula = weight ~ height, data = ddf)
## $ terms :Classes 'terms', 'formula' language weight ~ height
## .. ..- attr(*, "variables")= language list(weight, height)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. ..$ : chr "height"
## .. ..- attr(*, "term.labels")= chr "height"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(weight, height)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## $ model :'data.frame': 30 obs. of 2 variables:
## ..$ weight: int [1:30] 60 54 110 52 62 41 71 70 44 60 ...
## ..$ height: int [1:30] 180 176 180 182 160 156 165 160 159 189 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language weight ~ height
## .. .. ..- attr(*, "variables")= language list(weight, height)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "weight" "height"
## .. .. .. .. ..$ : chr "height"
## .. .. ..- attr(*, "term.labels")= chr "height"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(weight, height)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
## - attr(*, "class")= chr "lm"
# and a plot - first default
plot(ddf$height, ddf$weight)
# second adapted to visualize the parameters
plot(ddf$height, ddf$weight, ylim=c(-140,130), xlim=c(-5,200))
abline(h=0)
abline(v=0)
abline(model)
New parameters. The slope is now expressed in x-units per one y-unit: \(? cm/1 kg\). The second regression line.
The coefficients would only be directly usable if the x- and y-axes were swapped. By converting them, they can be added to the existing plot. Constant (y-intercept):
Relative to the existing plot, y.c gives the x-value at which y = 0.
Start from: y = y.s * x + y.c
For y = 0:
0 = y.s * x + y.c
solve for x:
0 - y.c = y.s * x
(0 - y.c) / y.s = x
Since y and x are swapped in the current plot, this is the y-intercept.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# model of height predicted from weight - the other way round
ymodel <- lm(height ~ weight, data=ddf)
# coefficients
ymodel
##
## Call:
## lm(formula = height ~ weight, data = ddf)
##
## Coefficients:
## (Intercept) weight
## 156.5262 0.2807
# plot it
plot(ddf$height, ddf$weight)
abline(h=0)
abline(v=0)
abline(model)
# constant of ymodel, in units of height (the response of ymodel)
y.c <- ymodel$coefficients[1]
# y.s: height units per 1 kg; the slope in the existing plot is 1/y.s
y.s <- ymodel$coefficients[2]
print(c(y.c, y.s))
## (Intercept) weight
## 156.5261961 0.2806949
# converted values to use in the existing plot
print(c((0 - y.c)/y.s, 1/y.s))
## (Intercept) weight
## -557.638194 3.562587
abline((0 - y.c) / y.s, 1/y.s, lty=4)
A modified plot so that the relevant y-intercepts are also visible.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
model <- lm(weight ~ height, data=ddf)
ymodel <- lm(height ~ weight, data=ddf)
# default plot
plot(ddf$height, ddf$weight)
abline(h=0)
abline(v=0)
abline(model)
abline((0 - y.c) / y.s, 1/y.s, lty=4)
# visual check: the constant is the intercept on the y-axis
plot(ddf$height, ddf$weight, ylim=c(-560,130), xlim=c(-5,200))
abline(h=0)
abline(v=0)
abline(model)
abline((0 - y.c) / y.s, 1/y.s, lty=4)
Correlation as a standardized measure of association, independent of the scaling of the variables involved.
Correlation coefficient: \( -1 \le r \le +1 \)
\[ r(x, y) = \frac{cov(x, y)}{s_x s_y} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{(N - 1) s_x s_y} \]
cf. Korrelation
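The formula can be checked directly in R; a quick sketch using the example variables height and weight:

```r
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
x <- ddf$height
y <- ddf$weight
# correlation as covariance divided by the product of the standard deviations
r.manual <- cov(x, y) / (sd(x) * sd(y))
# the same via the sum of cross products and (N - 1)
r.sum <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
# all three should agree with cor()
c(r.manual, r.sum, cor(x, y))
```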
The correlation between the explanatory and the response variable is the cosine of the angle between the two centered variable vectors; equivalently, the smaller the angle between the two regression lines, the larger \(|r|\).
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
model <- lm(weight ~ height, data=dd)
# default plot
# correlation reaction variable and predicted values
cor(dd$weight, model$fitted.values)
## [1] 0.5651363
# coefficient of determination: proportion of the variance of the response variable that the predictor can explain
cor(dd$weight, model$fitted.values)^2
## [1] 0.319379
# as we can see, linear transformation does not influence the correlation
cor(dd$weight, dd$height)
## [1] 0.5651363
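A related fact: if both variables are z-standardized, the regression slope equals the correlation coefficient (since \(b_1 = r \cdot s_y / s_x\) and both standard deviations are 1). A quick sketch:

```r
dd <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# regression on z-standardized variables
m.z <- lm(scale(weight) ~ scale(height), data = dd)
# intercept is (numerically) 0, slope equals cor(dd$weight, dd$height)
coef(m.z)
```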
A graphical overview is always a good first step.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# define items
#c_item_names <- c("c_phys_app", "c_good_way", "c_dress", "i_c_bad_way", "c_figure")
c_item_names <- c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")
require(ggplot2)
## Lade nötiges Paket: ggplot2
# single plot
ggplot(ddf, aes(x=c_phys_app, y=pacs)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm) # Add linear regression line
## `geom_smooth()` using formula 'y ~ x'
# (by default includes 95% confidence region)
# package GGally, recommended by Hadley, provides ggpairs to visualize multiple scatterplots
require(GGally)
## Lade nötiges Paket: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# show all pacs-variables in a scatterplot matrix
# colour and alpha have to be passed as aesthetics via 'mapping'
ggpairs(ddf, columns = c_item_names, mapping = ggplot2::aes(colour = gender, alpha = 0.4))
Significance test
Model comparison with the null model
Two models are compared. One model must be a submodel of the other. The number of parameters to be estimated increases (nested models).
Null model: constant prediction via the parameter constant (the mean)
Model fitting works here via the method of least squares (the squared deviations of the data from the prediction line are minimized).
An alternative for model fitting: the maximum likelihood method. In addition, there are many graphical fitting methods such as splines etc.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
model <- lm(weight ~ height, data=ddf)
# to be more precise
model <- lm(weight ~ 1 + height, data=ddf)
# the usual model test
summary(model)
##
## Call:
## lm(formula = weight ~ 1 + height, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.291 -13.861 -3.708 11.543 43.469
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -131.9291 55.2076 -2.390 0.02384 *
## height 1.1378 0.3139 3.625 0.00114 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.93 on 28 degrees of freedom
## Multiple R-squared: 0.3194, Adjusted R-squared: 0.2951
## F-statistic: 13.14 on 1 and 28 DF, p-value: 0.001138
# idea of model comparison
zero.model <- lm(weight ~ 1, data=ddf)
summary(zero.model)
##
## Call:
## lm(formula = weight ~ 1, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.833 -14.583 -7.833 12.167 62.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.833 3.898 17.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.35 on 29 degrees of freedom
# zero model parameter 'constant' is mean of response variable
mean(ddf$weight)
## [1] 67.83333
# model comparison using anova()
# we compare a simpler model with a more complex one. More complex models have more parameters but include the parameters of the simpler model.
# The simpler model is a submodel of the more complex one.
anova(zero.model, model)
## Analysis of Variance Table
##
## Model 1: weight ~ 1
## Model 2: weight ~ 1 + height
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 13220.2
## 2 28 8997.9 1 4222.2 13.139 0.001138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot zero.model
plot(ddf$height, ddf$weight)
abline(zero.model)
# add model to plot of zero.model
plot(ddf$height, ddf$weight)
abline(zero.model)
abline(model)
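The F value in the anova() table above can be reproduced by hand from the two residual sums of squares; a sketch using the printed numbers:

```r
# F = ((RSS_null - RSS_model) / df difference) / (RSS_model / residual df)
rss.null  <- 13220.2   # RSS of weight ~ 1
rss.model <- 8997.9    # RSS of weight ~ 1 + height
f <- ((rss.null - rss.model) / 1) / (rss.model / 28)
f   # about 13.14, matching the anova() and summary() output
```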
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
ddf <- ddf[order(ddf$weight),]
model <- lm(weight ~ height, data=ddf)
# plot model
plot(ddf$height, ddf$weight)
abline(model)
# plot(y ~ x, main='einfacher Scatterplot\nmit beiden Regressionsgeraden')
# abline(reg)
# ylm <- lm(x ~ y)
# yc <- coefficients(ylm)
# abline((0 - yc[1])/yc[2], 2 - yc[2])
#
# # estimate for an x = 16.5
# plot(y ~ x, main='Schätzung eines unbekannten y\naufgrund eines x')
# abline(reg)
# abline(v=16.5, lty=2)
#
# # draw the estimated value of y
# abline(h=reg$coefficients[1] + reg$coefficients[2] * 16.5, lty=2)
# normal distribution of residuals
qqnorm(model$residuals)
# a reference line may be added; qqline() fits it through the sample quartiles
qqline(model$residuals)
# abline(0, 1) would only be appropriate for standardized residuals
# plot residuals against predictors
plot(ddf$weight, model$residuals, main='Residuen über den Wertebereich x')
abline(h=0)
# plot of residuals against predicted (fitted) values
#plot(reg$fitted.values, reg$residuals, main='Residuen über predicted values x')
… and with ggplot
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
ddf <- ddf[order(ddf$weight),]
model <- lm(weight ~ height, data=ddf)
require(ggplot2)
# raw data
ggplot(data=ddf, aes(x=height, y=weight)) + geom_point()
# residual plot
ggplot(data=ddf, aes(x=model$fitted.values, y=model$residuals)) + geom_point() + geom_hline(yintercept = 0)
Make a plot, fit the regression line, draw it, show the 95% confidence interval, inspect the model parameters.
#ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term.txt", fileEncoding = "UTF-8")
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
head(ddf)
## subj name gender height weight grade c_phys_app c_good_way c_dress
## 1 1 Meier m 180 60 2.02 2 1 2
## 2 2 Müller m 176 54 1.00 3 4 3
## 3 3 Strauß f 180 110 2.03 5 5 5
## 4 4 Übeling f 182 52 1.00 4 5 4
## 5 5 Ärger f 160 62 1.00 3 2 2
## 6 6 Östro f 156 41 1.00 3 4 3
## c_bad_way c_figure filling_time bmi bmi_class bmi_dichotom bmi_normal
## 1 4 1 39.63 18.51852 normal normal TRUE
## 2 1 5 26.00 17.43285 low extreme FALSE
## 3 1 5 26.00 33.95062 obese extreme FALSE
## 4 2 3 26.00 15.69859 low extreme FALSE
## 5 4 1 26.00 24.21875 high extreme FALSE
## 6 3 4 26.00 16.84747 low extreme FALSE
## i_c_bad_way pacs_sum pacs bodysatisf
## 1 2 8 1.6 16
## 2 5 20 4.0 76
## 3 5 25 5.0 68
## 4 4 20 4.0 70
## 5 2 10 2.0 45
## 6 3 17 3.4 58
# correlation matrix
#cor(ddf[2:length(ddf)])
require(GGally)
#pacs.v <- c("c_phys_app", "c_good_way", "c_dress", "c_bad_way", "c_figure")
pacs.v <- c("pacs", "c_dress")
# show all pacs-variables in a scatterplot matrix
#ggpairs(ddf, pacs.v, alpha=0.4)
# reset of graphics environment
par(mfrow=c(1,1))
# predict outcome var pacs by explanatory variable c_dress
# fit model and store result object
#reg <- lm(ddf$d2 ~ ddf$d1)
# predict pacs from c_dress
model <- lm(pacs ~ c_dress, data=ddf)
# default scatterplot, using the same formula notation as lm()
plot(ddf$pacs ~ ddf$c_dress)
# regression line can be drawn using result object
abline(model)
# check residuals normal distribution graphically
qqnorm(model$residuals, main="NV der Residuen?")
# check residuals
plot(ddf$c_dress, model$residuals, main="Residuen gegen Prädiktor")
# residuals against predicted y-values
#plot(reg$fitted.values, reg$residuals, main="Residuen gegen Prädiktor")
# or equivalently in model notation
#plot(reg$residuals ~ reg$fitted.values, main="Residuen gegen Prädiktor2")
# base plot
plot(ddf$pacs ~ ddf$c_dress)
abline(lm(ddf$pacs ~ ddf$c_dress))
# predict() computes predicted values
# se.fit is a flag to also return the standard errors of the predicted values
# all is stored in result object 'pred'
pred<-predict(model, se.fit=TRUE)
fitval <- pred$fit
se <- pred$se.fit
# prepare se values for visualization
# 'index' holds the ordering permutation of the predictor values
index <- order(ddf$c_dress)
# values have to be ordered so that the line doesn't 'jump'
y <- fitval[index] # predicted values in x-order
stde <- se[index] # standard errors in x-order
yu<-y+1.96*stde # upper and
yl<-y-1.96*stde # lower limits of confidence interval
# plot limits
lines(ddf$c_dress[index],yu,lty=2)
lines(ddf$c_dress[index],yl,lty=2)
# reset of graphics environment
par(mfrow=c(1,1))
# todo: maybe better using two continuous vars
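The manual ±1.96·se construction above can also be left to predict() itself; a sketch using interval = "confidence", which returns a matrix with fit, lwr and upr columns:

```r
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
model <- lm(pacs ~ c_dress, data = ddf)
# predict() can compute the confidence limits directly
ci <- predict(model, interval = "confidence")
head(ci)   # columns: fit, lwr, upr
# draw the limits over the scatterplot, ordered by the predictor
index <- order(ddf$c_dress)
plot(ddf$pacs ~ ddf$c_dress)
abline(model)
lines(ddf$c_dress[index], ci[index, "lwr"], lty = 2)
lines(ddf$c_dress[index], ci[index, "upr"], lty = 2)
```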
In addition to the linear component, a quadratic or higher-order component of one and the same predictor can be added to a model.
Hints about components not yet captured in the response variable show up in the residuals, or already in the plot of the raw values.
The example data set: [generation data_generation.Rmd]
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term.txt", fileEncoding = "UTF-8")
# we sort ddf for better graphical displays of effects
ddf <- ddf[order(ddf$x),]
head(ddf)
## subj x y.c y.cl y.s1 y.s2 y.c.1 y.cs y.csl
## 1 15 71.06 69.50 98.62 154.47 152.75 -169.11 -90.45 -2.86
## 2 81 78.03 89.91 117.07 123.31 130.13 -36.11 30.26 125.09
## 3 11 79.32 78.85 109.29 123.95 113.53 -15.49 32.77 128.07
## 4 68 81.41 84.45 120.10 114.47 110.48 10.06 50.63 172.41
## 5 6 82.31 91.15 118.92 113.55 114.82 29.62 52.74 172.53
## 6 42 84.00 90.96 119.61 109.45 105.89 39.85 67.79 157.30
# a first graphical overview of the relations
pairs(ddf[c('x','y.s1', 'y.cs')])
# prediction of y.s1 by a simple, linear model
r.l <- lm(y.s1 ~ x, data = ddf)
summary(r.l)
##
## Call:
## lm(formula = y.s1 ~ x, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.143 -7.522 -4.224 3.227 64.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.8230 13.8040 3.247 0.0016 **
## x 0.6308 0.1387 4.549 1.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.66 on 98 degrees of freedom
## Multiple R-squared: 0.1744, Adjusted R-squared: 0.1659
## F-statistic: 20.7 on 1 and 98 DF, p-value: 1.543e-05
require(ggplot2)
ggplot(data=ddf, aes(x=x, y=y.s1)) + geom_point() + geom_smooth(method="lm" )
## `geom_smooth()` using formula 'y ~ x'
#plot(ddf$x, ddf$y.s1, ylim=c(50,200), xlim=c(50,150), xlab="x", ylab="y.s1", main="full range")
# include the regression line
#abline(r.l)
# we see a quadratic trend in the raw plot already
# take a look at the residuals. We plot fitted values against residuals
#plot(ddf$x, r.l$residuals, main='Residuen über den Wertebereich x')
ggplot(data=ddf, aes(x=r.l$fitted.values, y=r.l$residuals)) + geom_point() + geom_hline(yintercept = 0)
Range restriction can be deceptive here. A purely linear model appears to fit well.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term.txt", fileEncoding = "UTF-8")
# we sort ddf for better graphical displays of effects
ddf <- ddf[order(ddf$x),]
# and we don't have subjects with x <= 95
ddf <- ddf[ddf$x > 95,]
require(psych)
## Lade nötiges Paket: psych
##
## Attache Paket: 'psych'
## Die folgenden Objekte sind maskiert von 'package:ggplot2':
##
## %+%, alpha
psych::describe(ddf)
## vars n mean sd median trimmed mad min max range skew
## subj 1 69 51.33 28.80 51.00 51.46 38.55 1.00 100.00 99.00 -0.02
## x 2 69 103.80 6.11 102.02 103.34 5.75 95.02 122.52 27.50 0.80
## y.c 3 69 108.93 8.26 108.66 108.73 9.25 83.88 127.31 43.43 0.05
## y.cl 4 69 141.43 9.16 140.23 140.88 10.72 123.95 164.15 40.20 0.46
## y.s1 5 69 108.72 14.52 102.57 106.45 6.98 92.72 172.00 79.28 2.08
## y.s2 6 69 108.70 15.08 104.95 106.45 9.33 89.76 170.28 80.52 1.90
## y.c.1 7 69 110.97 23.05 105.16 107.22 10.14 81.44 239.64 158.20 3.21
## y.cs 8 69 116.86 33.29 105.80 111.21 17.41 84.35 283.80 199.45 2.80
## y.csl 9 69 249.33 40.13 241.43 243.23 20.45 188.31 422.19 233.88 2.16
## kurtosis se
## subj -1.26 3.47
## x 0.12 0.74
## y.c -0.10 0.99
## y.cl -0.48 1.10
## y.s1 5.23 1.75
## y.s2 4.41 1.82
## y.c.1 13.44 2.78
## y.cs 9.67 4.01
## y.csl 5.82 4.83
# prediction of y.s1 by a simple, linear model
r.l <- lm(y.s1 ~ x, data = ddf)
summary(r.l)
##
## Call:
## lm(formula = y.s1 ~ x, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.745 -2.990 -1.555 1.726 21.465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -123.16033 10.24694 -12.02 <2e-16 ***
## x 2.23389 0.09855 22.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.968 on 67 degrees of freedom
## Multiple R-squared: 0.8847, Adjusted R-squared: 0.8829
## F-statistic: 513.8 on 1 and 67 DF, p-value: < 2.2e-16
#plot(ddf$x, ddf$y.s1, ylim=c(50,200), xlim=c(50,150), xlab="x", ylab="y.s1", main="limited range of x")
# include the regression line
#abline(r.l)
# we see a quadratic trend in the raw plot already
require(ggplot2)
ggplot(data=ddf, aes(x=x, y=y.s1)) + geom_point() + geom_smooth(method="lm" )
## `geom_smooth()` using formula 'y ~ x'
# take a look at the residuals
ggplot(data=ddf, aes(x=r.l$fitted.values, y=r.l$residuals)) + geom_point() + geom_hline(yintercept = 0)
The graphical residual analysis suggests including a quadratic term as a predictor.
This is done by adding x squared ( \(x^{2}\) ) as an additional predictor to the model.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term.txt", fileEncoding = "UTF-8")
# we sort ddf for better graphical displays of effects
ddf <- ddf[order(ddf$x),]
# we create a new variable that is x squared
ddf['xs'] <- ddf$x ^ 2
# add both the linear and the quadratic term to the model
r.ls <- lm(y.s1 ~ x + xs, data=ddf)
# model parameters
summary(r.ls)
##
## Call:
## lm(formula = y.s1 ~ x + xs, data = ddf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8239 -1.3803 -0.0551 1.3930 4.5626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 988.288192 14.551697 67.92 <2e-16 ***
## x -18.759804 0.296778 -63.21 <2e-16 ***
## xs 0.098760 0.001508 65.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.892 on 97 degrees of freedom
## Multiple R-squared: 0.9817, Adjusted R-squared: 0.9814
## F-statistic: 2608 on 2 and 97 DF, p-value: < 2.2e-16
# compare simple linear model with model that includes x squared
r.l <- lm(y.s1 ~ x, data=ddf)
# compare the models statistically
anova(r.l, r.ls)
## Analysis of Variance Table
##
## Model 1: y.s1 ~ x
## Model 2: y.s1 ~ x + xs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 15702.7
## 2 97 347.2 1 15356 4289.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# show a graph of the base model
plot(ddf$x, ddf$y.s1)
# add regression line
abline(r.l)
# add regression curve that combines the two components
lines(ddf$x, r.ls$fit)
# a look at the residuals
plot(ddf$x, r.ls$residuals)
abline(h=0)
# plot raw data
require(ggplot2)
#ggplot(data=ddf, aes(x=x, y=y.s1)) + geom_point() + geom_point(aes(x=x, y=r.ls$fitted.values, color="red"))
ggplot(data=ddf, aes(x=x, y=y.s1)) + geom_point() + geom_line(aes(x=x, y=r.ls$fitted.values, color="red"))
#todo ggplot(data=ddf, aes(x=sqrt(x), y=sqrt(y.s1))) + geom_point() + geom_line(aes(x=sqrt(x), y=sqrt(r.ls$fitted.values), color="red"))
# take a look at the residuals
ggplot(data=ddf, aes(x=r.ls$fitted.values, y=r.ls$residuals)) + geom_point() + geom_hline(yintercept = 0)
But this is really already a multiple regression, which is the topic of lm_mult (Rmd or html).
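The helper column xs is not strictly necessary; the quadratic term can be written directly in the formula. A sketch, either with I() or with orthogonal polynomials via poly():

```r
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term.txt", fileEncoding = "UTF-8")
# I() protects the arithmetic inside the formula
r.ls2 <- lm(y.s1 ~ x + I(x^2), data = ddf)
# poly() uses orthogonal polynomials: same fit, less correlated coefficients
r.poly <- lm(y.s1 ~ poly(x, 2), data = ddf)
# both specifications produce identical fitted values
all.equal(fitted(r.ls2), fitted(r.poly))
```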
This is not a mathematical model of the relation between predictor and criterion, but a graphical technique (smoothing). Briefly sketched here are the 'lowess fit' and the 'spline'.
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm.txt")
# for a sensible display the table has to be sorted by d1
ddf.sorted <- ddf[order(ddf$d1),]
# fit a spline
# reset the graphics environment
par(mfrow=c(1,1))
#
plot(ddf.sorted$q21 ~ ddf.sorted$d1)
abline(lm(ddf.sorted$q21 ~ ddf.sorted$d1))
lines(lowess(ddf.sorted$q21 ~ ddf.sorted$d1),lty=2)
lines(smooth.spline(ddf.sorted$d1, ddf.sorted$q21),lty=3)
legend("topleft",c("Linear Fit","Lowess fit","Spline fit"),lty=1:3)
# using ggplot, method loess is default
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x psych::%+%() masks ggplot2::%+%()
## x psych::alpha() masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ddf.sorted %>% ggplot(aes(x=d1, y=q21)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot of the residuals against the estimated values is better suited to reveal peculiarities in the residuals that might, for example, suggest another model. In the plot below one can nicely see the effect of regression to the mean, which is absent in the estimated values: where there are deviations from the fitted line (= residuals), the estimated values 'move' toward the fitted line.
# reset sub graph system
par(mfrow=c(1,1))
ddf <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
#data <- read.delim("http://r.psych.bio.uni-goettingen.de/mv/data/be/car-data.txt")
lll <- lm(weight ~ height, data=ddf)
plot(lll$residuals, ddf$weight,
main = paste("Gewicht ~ Größe", round(cor(ddf$weight, ddf$height), digits=2), sep=" ", collapse = NULL),
ylab='black: y, red: predicted')
points(cbind(lll$residuals, lll$fitted.values), type='p', pch=21, col="red")
ll.res <- lm(lll$fitted.values ~ lll$residuals)
abline(ll.res, col="red")
ll.res <- lm(ddf$weight ~ lll$residuals)
abline(ll.res)
# better: residuals on the y-axis
plot(ddf$weight, lll$residuals,
main = paste("Gewicht ~ Größe", round(cor(ddf$weight, ddf$height), digits=2), sep=" ", collapse = NULL),
ylab='black: y, red: predicted')
points(cbind(lll$fitted.values, lll$residuals), type='p', pch=21, col="red")
ll.res <- lm(lll$residuals ~ lll$fitted.values)
abline(ll.res, col="red")
ll.res <- lm(lll$residuals ~ ddf$weight)
abline(ll.res)
Once fitted, a model can be used to predict response values for arbitrary predictor values.
# In R, you can generally fit a model doing something like this:
# mymodel <- method(y~x, data=mydata)
# ...and then predict the outcome for new data using the generic predict function:
# predvals <- predict(mymodel, newdataframe)
df <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt", fileEncoding = "UTF-8")
# calculate model
model <- lm(weight ~ height, data=df)
# calculate a prediction for 163 using coefficients (model parameters)
# model$coefficients[1] + model$coefficients[2] * 163
# to get rid of the irritating label
as.double(model$coefficients[1] + model$coefficients[2] * 163)
## [1] 53.53479
require(dplyr)
df.new <- dplyr::tibble(height = c(163, 156, 177, 199, 167))
predict(model, df.new)
## 1 2 3 4 5
## 53.53479 45.57008 69.46420 94.49614 58.08605
df.new$estimated_weight <- predict(model, df.new)
# look at it
df.new
## # A tibble: 5 × 2
## height estimated_weight
## <dbl> <dbl>
## 1 163 53.5
## 2 156 45.6
## 3 177 69.5
## 4 199 94.5
## 5 167 58.1
# keep environment neat
rm(df.new)
rm(model)
Here various outliers are produced and their effect on the diagnostic plots is shown.
R functions for outlier inspection
• rstandard() for the internally studentized residuals • rstudent() for the externally studentized residuals • cooks.distance() for the Cook's distances • dffits() for the DFFITS • hatvalues() for the leverage (hat) values • dfbetas() for the DFBETAS • covratio() for the scaled variance and confidence-ellipsoid ratios (COVRATIO)
library(tidyverse)
# we generate some data
set.seed(43)
y.t <- rnorm(100)
dd <- dplyr::tibble(
y = y.t,
x = y.t + rnorm(100, 0.3)
)
cor(dd)
## y x
## y 1.0000000 0.7781306
## x 0.7781306 1.0000000
plot(dd)
m.1 <- lm(y ~ x, data=dd)
# a typical outlier: far out, but on the trend (high leverage, small residual)
dd.t <- dd %>% add_row(x = 16, y = 8)
m.2 <- lm(y ~ x, data=dd.t)
# an untypical outlier: far off the trend (high leverage, large residual)
dd.t <- dd %>% add_row(x = 12, y = 1)
m.3 <- lm(y ~ x, data=dd.t)
# base sample
plot(m.1)
# ... with a typical outlier
plot(m.2)
# ... with a non typical outlier
plot(m.3)
# we get Cook's distance by calling cooks.distance() on the model
head(cooks.distance(m.1))
## 1 2 3 4 5 6
## 0.0037980279 0.0181441040 0.0727529049 0.0008097697 0.0177657095 0.0044440711
tail(cooks.distance(m.1))
## 95 96 97 98 99 100
## 2.500816e-02 9.659158e-04 1.060513e-03 3.972685e-02 6.048756e-05 8.771396e-02
# we look at the maxima of Cook's distance
max(cooks.distance(m.1))
## [1] 0.08771396
max(cooks.distance(m.2))
## [1] 0.04985645
max(cooks.distance(m.3))
## [1] 8.749748
max(hatvalues(m.1))
## [1] 0.06937655
max(hatvalues(m.2))
## [1] 0.5241967
max(hatvalues(m.3))
## [1] 0.3797013
max(dffits(m.1))
## [1] 0.4269897
max(dffits(m.2))
## [1] 0.3218417
max(dffits(m.3))
## [1] 0.3922781
max(dfbetas(m.1))
## [1] 0.3633103
max(dfbetas(m.2))
## [1] 0.2337102
max(dfbetas(m.3))
## [1] 0.6590918
tail(influence.measures(m.2)$infmat)
## dfb.1_ dfb.x dffit cov.r cook.d hat
## 96 0.037904102 -0.021852175 0.03972675 1.0328970 7.962690e-04 0.01419636
## 97 -0.045709958 0.006772670 -0.04610450 1.0265839 1.071384e-03 0.01011936
## 98 0.153060896 0.129316568 0.23493562 0.9588002 2.683028e-02 0.01420470
## 99 -0.010888974 0.001499681 -0.01099902 1.0306630 6.109909e-05 0.01008854
## 100 0.148539484 0.233710182 0.32184173 0.9465132 4.985645e-02 0.02094630
## 101 0.005375748 -0.038028668 -0.03839298 2.1447611 7.445207e-04 0.52419669
tail(influence.measures(m.3)$infmat)
## dfb.1_ dfb.x dffit cov.r cook.d hat
## 96 0.005539391 -0.003461047 0.005865465 1.0362057 1.737698e-05 0.01518988
## 97 -0.037809645 0.005611832 -0.038232266 1.0279425 7.372351e-04 0.01011901
## 98 0.149979574 0.158878957 0.259582123 0.9536621 3.264010e-02 0.01583179
## 99 -0.008022238 0.001095002 -0.008127343 1.0307716 3.336165e-05 0.01008404
## 100 0.148275584 0.304095240 0.392278149 0.9283486 7.320795e-02 0.02481077
## 101 0.659091818 -4.870406178 -4.935175352 0.8322233 8.749748e+00 0.37970130
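As a rough screening device, observations whose Cook's distance exceeds the common rule-of-thumb cutoff 4/n are often inspected more closely. A minimal self-contained sketch (simulated data, not the dd sample above):

```r
# flag observations with Cook's distance above the 4/n rule of thumb
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
y[50] <- y[50] + 10          # contaminate the last observation
m <- lm(y ~ x)
cd <- cooks.distance(m)
which(cd > 4 / length(cd))   # the contaminated observation shows up here
```

The 4/n cutoff is only a convention; influential points should always be judged together with the diagnostic plots above.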
Example: weight predicted by height
The SAT is a nationwide university admission test in the USA.
American letter grades (A+ to F) are translated into numeric values (0-4, higher is better). How the GPA is computed depends on the school.
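As an illustration only (the exact conversion varies by school; the table below is the common 4.0 scale and the grades are made up):

```r
# hypothetical letter-grade to grade-point conversion (4.0 scale)
grade_points <- c("A" = 4, "B" = 3, "C" = 2, "D" = 1, "F" = 0)
grades <- c("A", "B", "B", "C")
mean(grade_points[grades])   # simple (unweighted) GPA → 3
```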
When deciding whether to admit an applicant, colleges take lots of factors, such as grades, sports, activities, leadership positions, awards, teacher recommendations, and test scores, into consideration. Using SAT scores as a basis of whether to admit a student or not has created some controversy. Among other things, people question whether the SATs are fair and whether they predict college performance.
This study examines the SAT and GPA information of 105 students who graduated from a state university with a B.S. in computer science. Using the grades and test scores from high school, can you predict a student’s college grades?
Questions to Answer: Can the math and verbal SAT scores be used to predict college GPA? Are the high school and college GPAs related?
Design Issues: The conclusions from this study should not be generalized to students of other majors.
Descriptions of Variables
Variable Description
high_GPA High school grade point average
math_SAT Math SAT score
verb_SAT Verbal SAT score
comp_GPA Computer science grade point average
univ_GPA Overall university grade point average
For all scores (SAT, GPAs), higher values are better.
# get the data
df <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/div/sat.txt")
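A minimal sketch of how the first question could be approached; the column names are taken from the variable table above, and the choice of univ_GPA as response is just one of several sensible models:

```r
# predict university GPA from high school GPA (simple regression)
df <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/div/sat.txt")
m.sat <- lm(univ_GPA ~ high_GPA, data = df)
summary(m.sat)$r.squared          # proportion of variance explained
predict(m.sat, data.frame(high_GPA = 3.0))
```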
The data set v_lm_term2.txt contains 100 observations (subj), one predictor variable p, and several response variables r.*.
# get the data
df <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term2.txt")
# get the data
require(readr)
df <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term2.txt")
# we add powers of p
df$p.2 <- df$p^2
df$p.3 <- df$p^3
m.0 <- lm(r.qc ~ 1, data=df)
m.1 <- lm(r.qc ~ p, data=df)
m.2 <- lm(r.qc ~ p + p.2, data=df)
m.3 <- lm(r.qc ~ p + p.2 + p.3, data=df)
# plot raw data
require(ggplot2)
#ggplot(data=df, aes(x=p, y=r.qc)) + geom_point() + geom_point(aes(x = p, y = m.3$fitted.values, color = "red"))
ggplot(data=df, aes(x=p, y=r.qc)) + geom_point() + geom_line(aes(x = p, y = m.3$fitted.values, color="red"))
# take a look at the residuals
ggplot(data=df, aes(x=m.3$fitted.values, y=m.3$residuals)) + geom_point() + geom_hline(yintercept = 0)
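To decide which polynomial degree the data actually support, the nested models can be compared with F-tests; a sketch (the models are refitted with poly() so the block is self-contained):

```r
# compare nested polynomial models with F-tests
df <- read.delim("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_lm_term2.txt")
m.0 <- lm(r.qc ~ 1, data = df)
m.1 <- lm(r.qc ~ poly(p, 1), data = df)
m.2 <- lm(r.qc ~ poly(p, 2), data = df)
m.3 <- lm(r.qc ~ poly(p, 3), data = df)
anova(m.0, m.1, m.2, m.3)   # each row tests the gain of one more degree
```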
Examples of simple regression
Learning ability across the lifespan: - it declines continuously (linear, cell aging), - it rises sharply from birth, peaks at around 15-20 years, and then keeps declining (inverted quadratic)
Simple regression applies when both the response variable and the explanatory variable are continuous.
Our model is basically a line characterized by two parameters, its intercept and its slope. The intercept is the y value at x = 0; the slope is the change in y when x increases by 1.
The estimated values of our response variable depend on the value of the explanatory variable.
R^2 is the squared correlation of our estimated values with the observed values of the response variable.
Simple regression already allows us to do much of what we usually do with more complex models.
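The two parameters can also be computed directly from the data, b1 = cov(x, y)/var(x) and b0 = mean(y) - b1 * mean(x); a quick self-contained check against lm() (simulated data):

```r
# least-squares slope and intercept by hand, checked against lm()
set.seed(7)
x <- rnorm(30, 170, 10)
y <- 0.9 * x + rnorm(30, 0, 5)
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
all.equal(unname(coef(lm(y ~ x))), c(b0, b1))   # TRUE
```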
library(tidyverse)
dd <- read_tsv("http://md.psych.bio.uni-goettingen.de/mv/data/virt/v_bmi_preprocessed.txt")
# define a model: weight ~ height
# adapt a model: lm(weight ~ height, data=dd)
m.1 <- lm(weight ~ height, data=dd)
# parameters
coefficients(m.1)
# residuals
head(m.1$residuals)
# estimated values (and real values)
head(m.1$fitted.values)
head(dd$weight)
# predictive power, R^2
summary(m.1)$r.squared
cor(dd$weight, m.1$fitted.values)^2
# model test
m.0 <- lm(weight ~ 1, data=dd)
anova(m.0, m.1)
# estimate reaction variable values of new observations
predict(m.1, data.frame(height = c(160, 170, 180, 200)))
# visualize data and model
dd %>% ggplot(aes(x=height, y=weight)) + geom_point() + geom_smooth(method=lm)
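Beyond point predictions, predict() can also return interval estimates; a sketch using the model m.1 fitted above (interval = "confidence" for the mean response, interval = "prediction" for a single new observation):

```r
# confidence vs. prediction intervals for new heights
predict(m.1, data.frame(height = c(160, 180)), interval = "confidence")
predict(m.1, data.frame(height = c(160, 180)), interval = "prediction")
```

Prediction intervals are wider than confidence intervals, because they additionally include the residual variability of individual observations.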