Machine Learning with R | The R Bootcamp @ DHLab
[Image: comic adapted from xkcd.com]
In this practical, you'll practice the basics of fitting and
exploring regression models in R using the tidymodels package.
By the end of this practical you will know how to: specify a recipe, a model,
and a workflow; fit regression and classification models; and evaluate model
performance.
Open your TheRBootcamp R project. It should already have the folders 1_Data
and 2_Code. Make sure that the data file(s) listed in the Datasets section are
in your 1_Data folder.
# Done!
Open a new R script and save it as a new file called Fitting_practical.R in
the 2_Code folder.
# Done!
Using library(), load the set of packages for this practical listed in the
packages section above.
# Load packages necessary for this script
library(tidyverse)
library(tidymodels)
tidymodels_prefer() # to resolve common conflicts
The dataset is stored in the file airbnb.csv. Using the following template,
load the dataset into R as airbnb:
# Load in airbnb.csv data as airbnb
airbnb <- read_csv(file = "1_Data/airbnb.csv")
airbnb
# A tibble: 1,191 × 23
price accom…¹ bedro…² bathr…³ clean…⁴ avail…⁵ distr…⁶ host_…⁷ host_…⁸ host_…⁹
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <lgl>
1 99 3 1 2 30 3 Pankow within… 100 TRUE
2 61 4 1 1 35 0 Mitte within… 100 TRUE
3 50 2 2 0.5 0 0 Mitte within… 100 FALSE
4 30 2 1 1.5 25 31 Mitte within… 63 FALSE
5 60 2 1 1 40 87 Tempel… within… 100 FALSE
6 45 2 1 1.5 10 0 Friedr… within… 83 FALSE
7 32 2 0 1 0 14 Lichte… within… 100 FALSE
8 62 6 2 1 30 76 Pankow within… 100 TRUE
9 50 2 1 1 0 53 Charlo… within… 100 TRUE
10 60 2 1 1 0 16 Charlo… within… 100 FALSE
# … with 1,181 more rows, 13 more variables: host_listings_count <dbl>,
# review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
# review_scores_checkin <dbl>, review_scores_communication <dbl>,
# review_scores_location <dbl>, review_scores_value <dbl>, kitchen <chr>,
# tv <chr>, coffe_machine <chr>, dishwasher <chr>, terrace <chr>,
# bathtub <chr>, and abbreviated variable names ¹accommodates, ²bedrooms,
# ³bathrooms, ⁴cleaning_fee, ⁵availability_90_days, ⁶district, …
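Before diving into the tasks, a quick way to see all 23 columns at once is glimpse() from the tidyverse (not required for this practical, just a handy sketch):
# Compact overview of all columns, their types, and first few values
glimpse(airbnb)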
Print the number of rows and columns of airbnb using the dim() function.
# Print number of rows and columns of airbnb
dim(XX)
# Print number of rows and columns of airbnb
dim(airbnb)
[1] 1191 23
Open the dataset in a spreadsheet-like window using View(). How does it look?
View(XX)
Print the column names of airbnb using names().
# Print column names of airbnb
names(XX)
# Print column names of airbnb
names(airbnb)
[1] "price" "accommodates"
[3] "bedrooms" "bathrooms"
[5] "cleaning_fee" "availability_90_days"
[7] "district" "host_respons_time"
[9] "host_response_rate" "host_superhost"
[11] "host_listings_count" "review_scores_accuracy"
[13] "review_scores_cleanliness" "review_scores_checkin"
[15] "review_scores_communication" "review_scores_location"
[17] "review_scores_value" "kitchen"
[19] "tv" "coffe_machine"
[21] "dishwasher" "terrace"
[23] "bathtub"
In a recipe, we specify (a) what to predict, (b) what to predict it with (the
features), and (c) how to prepare our data. Create your first recipe, called
airbnb_recipe, by adding the formula price ~ accommodates to specify that we
want to predict price with the number of people the Airbnb can accommodate
(accommodates), and by passing airbnb as the data.
# create basic recipe
airbnb_recipe <- recipe(XX ~ XX, data = XX)
# create basic recipe
airbnb_recipe <- recipe(price ~ accommodates, data = airbnb)
airbnb_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 1
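If you want to see how the recipe assigned roles to the individual variables, you can call summary() on it; a minimal sketch:
# Show each variable with its type and role (outcome vs. predictor)
summary(airbnb_recipe)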
To fit a model with tidymodels, we first have to set up our model. We do this
by specifying (a) the model type, (b) the engine we want to use, and (c)
whether we are working on a regression or a classification problem. We will do
this step by step. To perform step (a), call the linear_reg() function and
assign the result the name lm_model.
# set up our model
lm_model <-
XX()
# set up our model
lm_model <-
linear_reg()
For step (b), specify that we want to use the stats package's engine. To do
so, add a pipe (%>%) to the code you specified above, and add the
set_engine(XX) function, with the engine "lm".
# set up the engine
lm_model <-
linear_reg() %>%
XX(XX)
# set up the engine
lm_model <-
linear_reg() %>%
set_engine("lm")
show_engines("MODEL_TYPE")
. Check which other engines would
be available with the model type linear_reg
.show_engines("linear_reg")
# A tibble: 7 × 2
engine mode
<chr> <chr>
1 lm regression
2 glm regression
3 glmnet regression
4 stan regression
5 spark regression
6 keras regression
7 brulee regression
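Switching engines only requires changing the set_engine() call; everything else stays the same. A minimal sketch (lm_model_glm is a made-up name, and the glm engine is not needed for this practical):
# The same linear regression, but fitted via stats::glm() instead of stats::lm()
lm_model_glm <-
  linear_reg() %>%
  set_engine("glm")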
For step (c), specify whether we are working on a regression or a
classification problem, which in tidymodels is referred to as the problem mode.
To do so, add yet another pipe to the definition of lm_model and call the
set_mode() function. Pass "regression" as argument to this function.
# set up the problem mode
lm_model <-
linear_reg() %>%
set_engine("lm") %>%
XX(XX)
# set up the problem mode
lm_model <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
Print the lm_model object.
# print lm_model
lm_model
Linear Regression Model Specification (regression)
Computational engine: lm
Using translate(), we can view the function that will be called to fit our
model. Arguments not yet specified (and thus, at this point, unknown) will be
shown as missing_arg(). Use translate() and pass lm_model as argument.
# view the underlying function used to fit the model
translate(lm_model)
Linear Regression Model Specification (regression)
Computational engine: lm
Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
Next, we create a workflow in which we bring the model specification and the
recipe together, to then fit our model. To do so, create an object called
lm_workflow: call the workflow() function to initiate the workflow, add the
airbnb_recipe using the add_recipe() function, and add the lm_model model
specification using the add_model() function.
# lm workflow
lm_workflow <-
XX() %>%
XX(XX) %>%
XX(XX)
# lm workflow
lm_workflow <-
workflow() %>%
add_recipe(airbnb_recipe) %>%
add_model(lm_model)
Print the lm_workflow object to view a summary of how the modeling will be
done.
lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Now we can fit the model using the fit() function. Pipe the lm_workflow into
the fit function and save the result as price_lm. We also have to provide the
data to the fit function, by specifying the data argument. Set data = airbnb.
# Fit the regression model
price_lm <-
XX %>%
XX(XX = XX)
# Fit the regression model
price_lm <-
lm_workflow %>%
fit(airbnb)
Print the price_lm object.
price_lm
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) accommodates
-13.5 27.6
To look at the parameter estimates, call the tidy() function on the price_lm
object.
# Show the parameter estimates
tidy(price_lm)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -13.5 4.04 -3.34 8.60e- 4
2 accommodates 27.6 1.12 24.6 1.15e-108
# For every additional person a flat accommodates, the price of an airbnb is
# predicted to rise by $27.60.
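As a quick sanity check, we can reproduce a prediction by hand from these two estimates; for instance, for a (made-up) flat that accommodates 4 people:
# intercept + slope * accommodates
-13.5 + 27.6 * 4 # predicted price: 96.9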
Use the predict() function to extract the model predictions. This returns them
as a column named .pred. Then, using bind_cols(), add the true values.
# generate predictions
lm_pred <-
XX %>%
predict(new_data = airbnb) %>%
bind_cols(airbnb %>% select(price))
# generate predictions
lm_pred <-
price_lm %>%
predict(new_data = airbnb) %>%
bind_cols(airbnb %>% select(price))
Print the lm_pred object and make sure you understand the meaning of these
variables. What is contained in the .pred variable, and what in the price
variable? Which part of the code in the previous task generated the .pred
variable, and which the price variable? (This page also provides an overview
of the naming convention introduced by the tidymodels predict methods.)
# The first variable, .pred, was created in the call to the predict() function.
# It contains the predicted prices. The second variable, price, contains the
# actual prices from our dataset.
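As an aside, recent tidymodels versions also provide augment(), which combines predicting and binding in one step; a sketch, assuming a workflows version that supports augment() on fitted workflows:
# Predictions bound directly to the original data
augment(price_lm, new_data = airbnb) %>%
  select(.pred, price)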
Visualize the model's performance by plotting the predicted against the true
prices:
# use the lm_pred object to generate the plot
ggplot(lm_pred, aes(x = .pred, y = price)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Regression: One Feature",
subtitle = "Line indicates perfect performance",
x = "Predicted Airbnb Prices in $",
y = "True Airbnb Prices in $") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
# The points do not fall on the line, which indicates that the model fit
# is not great. Also there are two outliers that cannot be captured by the
# model.
The metrics() function returns the MAE, RMSE, and the R^2 of a model. Compute
these indices by passing the price variable as truth and the .pred variable as
estimate to the metrics() function.
# evaluate performance
XX(lm_pred, truth = XX, estimate = XX)
# evaluate performance
metrics(lm_pred, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 73.7
2 rsq standard 0.338
3 mae standard 29.5
# On average, the model commits a prediction error of 29.5 when predicting the
# price of an airbnb. The large difference between the MAE and the RMSE indicates
# that the prediction errors vary strongly. This is also apparent in the
# plot we created before.
# The R^2 value is 0.338, that is, about 34% of the variation in the data
# can be captured by our model.
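If you only need a single index, yardstick also exposes each metric as its own function with the same truth/estimate interface; a minimal sketch:
# Each index is also available individually
mae(lm_pred, truth = price, estimate = .pred)
rmse(lm_pred, truth = price, estimate = .pred)
rsq(lm_pred, truth = price, estimate = .pred)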
So far we have only used one feature (accommodates) to predict price. Let's
try again, but now we'll use a total of four features:
accommodates - the number of people the airbnb accommodates.
bedrooms - number of bedrooms.
bathrooms - number of bathrooms.
district - location of the airbnb.
To do this, we first have to update our recipe airbnb_recipe. Specifically, we
want to add the three new features to the formula. Update the recipe from
before by extending the formula.
# updated recipe
airbnb_recipe <-
recipe(XX ~ XX + XX + XX + XX, data = XX)
# updated recipe
airbnb_recipe <-
recipe(price ~ accommodates + bedrooms + bathrooms + district,
data = airbnb)
Because we are now also using a categorical feature (district), we also have
to update the recipe by adding a pre-processing step that ensures that
categorical predictors are dummy-coded. Add a pipe (%>%) to the recipe
definition of the previous task and call step_dummy(all_nominal_predictors())
to define this pre-processing step.
# updated recipe
airbnb_recipe <-
recipe(price ~ accommodates + bedrooms + bathrooms + district,
data = airbnb) %>%
XX(XX())
# updated recipe
airbnb_recipe <-
recipe(price ~ accommodates + bedrooms + bathrooms + district,
data = airbnb) %>%
step_dummy(all_nominal_predictors())
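To see what step_dummy() actually does to the data, you can prep() the recipe and then bake() it; bake(new_data = NULL) returns the processed training data. A minimal sketch:
# Inspect the pre-processed data: district becomes one 0/1 column per level
airbnb_recipe %>%
  prep() %>%
  bake(new_data = NULL)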
Now update the workflow using the update_recipe() function. Pass the new
airbnb_recipe to update_recipe().
# update lm workflow with new recipe
lm_workflow <-
lm_workflow %>%
XX(XX)
# update lm workflow with new recipe
lm_workflow <-
lm_workflow %>%
update_recipe(airbnb_recipe)
Print the lm_workflow object to view a summary of how the modeling will be
done. The recipe should now be updated, which you can see in the Preprocessor
section.
lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Fit the updated model and save it again as price_lm.
# Fit the regression model
price_lm <-
lm_workflow %>%
fit(airbnb)
Using the tidy() function on the price_lm object, take a look at the parameter
estimates.
# Show the parameter estimates
tidy(price_lm)
# A tibble: 15 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -46.2 10.8 -4.27 2.15e- 5
2 accommodates 22.3 1.57 14.2 3.22e-42
3 bedrooms 13.7 4.39 3.12 1.86e- 3
4 bathrooms 23.5 7.23 3.26 1.16e- 3
5 district_Friedrichshain.Kreuzberg 3.42 9.13 0.374 7.08e- 1
6 district_Lichtenberg -6.44 14.7 -0.438 6.61e- 1
7 district_Marzahn...Hellersdorf -19.8 30.8 -0.641 5.22e- 1
8 district_Mitte 22.6 9.26 2.44 1.48e- 2
9 district_Neukölln 0.0925 10.0 0.00923 9.93e- 1
10 district_Pankow 7.52 9.44 0.797 4.26e- 1
11 district_Reinickendorf -17.9 18.5 -0.965 3.35e- 1
12 district_Spandau -29.6 24.4 -1.22 2.24e- 1
13 district_Steglitz...Zehlendorf -0.446 16.4 -0.0272 9.78e- 1
14 district_Tempelhof...Schöneberg 8.90 10.9 0.814 4.16e- 1
15 district_Treptow...Köpenick -11.1 16.9 -0.658 5.11e- 1
Again use the predict() function to extract the model predictions and bind
them together with the true values using bind_cols().
# generate predictions
lm_pred <-
XX %>%
XX(XX) %>%
XX(airbnb %>% select(price))
# generate predictions
lm_pred <-
price_lm %>%
predict(new_data = airbnb) %>%
bind_cols(airbnb %>% select(price))
Again plot the predicted against the true prices:
# use the lm_pred object to generate the plot
ggplot(lm_pred, aes(x = .pred, y = price)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Regression: Four Features",
subtitle = "Line indicates perfect performance",
x = "Predicted Airbnb Prices in $",
y = "True Airbnb Prices in $") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
# The model seems to do a little better than before, but the points still do not
# really fall on the line. Also, the model still cannot account for the two
# price outliers.
Using the metrics() function, evaluate the model performance. Pass it the
price variable as truth and the .pred variable as estimate.
# evaluate performance
XX(lm_pred, truth = XX, estimate = XX)
# evaluate performance
metrics(lm_pred, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 72.2
2 rsq standard 0.365
3 mae standard 28.7
# On average, the model commits a prediction error of 28.7 when predicting the
# price of an airbnb. This is slightly smaller than with only one predictor
# (29.5). The large difference between the MAE and the RMSE indicates
# that the prediction errors vary strongly. This is also apparent in the
# plot we created before.
# The R^2 value is 0.365, that is, about 37% of the variation in the data
# can be captured by our model, somewhat more than the 34% the model with
# only one feature explained.
Alright, now it's time to use all features available! Update the formula in
airbnb_recipe and set it to price ~ . The . indicates that all available
variables that are not outcomes should be used as features.
# updated recipe
airbnb_recipe <-
recipe(XX ~ XX, data = XX) %>%
step_dummy(all_nominal_predictors())
# updated recipe
airbnb_recipe <-
recipe(price ~ ., data = airbnb) %>%
step_dummy(all_nominal_predictors())
Again update the workflow using the update_recipe() function. Pass the new
airbnb_recipe to update_recipe().
# update lm workflow with new recipe
lm_workflow <-
lm_workflow %>%
XX(XX)
# update lm workflow with new recipe
lm_workflow <-
lm_workflow %>%
update_recipe(airbnb_recipe)
Fit the model again and save it as price_lm.
# Fit the regression model
price_lm <-
lm_workflow %>%
fit(airbnb)
Using the tidy() function on the price_lm object, take a look at the parameter
estimates.
# Show the parameter estimates
tidy(price_lm)
# A tibble: 35 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -149. 68.1 -2.19 2.86e- 2
2 accommodates 23.3 1.78 13.1 9.41e-37
3 bedrooms 13.0 4.55 2.86 4.34e- 3
4 bathrooms 24.0 7.36 3.26 1.15e- 3
5 cleaning_fee -0.241 0.0915 -2.64 8.46e- 3
6 availability_90_days -0.0400 0.0741 -0.540 5.89e- 1
7 host_response_rate -0.160 0.261 -0.613 5.40e- 1
8 host_superhostTRUE 10.7 4.95 2.17 3.05e- 2
9 host_listings_count 0.276 0.551 0.501 6.16e- 1
10 review_scores_accuracy 8.43 5.82 1.45 1.48e- 1
# … with 25 more rows
Again use the predict() function to extract the model predictions and bind
them together with the true values using bind_cols().
# generate predictions
lm_pred <-
XX %>%
XX(XX) %>%
XX(airbnb %>% select(price))
# generate predictions
lm_pred <-
price_lm %>%
predict(new_data = airbnb) %>%
bind_cols(airbnb %>% select(price))
Once more, plot the predicted against the true prices:
# use the lm_pred object to generate the plot
ggplot(lm_pred, aes(x = .pred, y = price)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Regression: All Features",
subtitle = "Line indicates perfect performance",
x = "Predicted Airbnb Prices in $",
y = "True Airbnb Prices in $") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
# Even with all predictors, the model seems to have some issues.
Using the metrics() function, evaluate the model performance. Pass it the
price variable as truth and the .pred variable as estimate.
# evaluate performance
XX(lm_pred, truth = XX, estimate = XX)
# evaluate performance
metrics(lm_pred, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 71.1
2 rsq standard 0.385
3 mae standard 29.7
# On average, the model commits a prediction error of 29.7 when predicting the
# price of an airbnb. This is slightly larger than with one (29.5) or four
# (28.7) features. The large difference between the MAE and the RMSE indicates
# that the prediction errors vary strongly. This is also apparent in the
# plot we created before.
# The R^2 value is 0.385, that is, about 38% of the variation in the data
# can be captured by our model.
Now it’s time to do a classification task! Recall that in a
classification task, we are predicting a category, not a continuous
number. In this task, we’ll predict whether or not a host is a superhost
(these are experienced hosts that meet a set
of criteria). Whether or not a host is a superhost is stored in the
variable host_superhost
.
For classification with tidymodels, the outcome variable has to be a factor.
To test whether it is coded as a factor, you can look at its class as follows.
# Look at the class of the variable host_superhost, should be a factor!
class(airbnb$host_superhost)
[1] "logical"
The host_superhost variable is of class logical. Therefore, we have to change
it to factor. Important note: In binary classification tasks, the first factor
level will be chosen as the positive class. We therefore explicitly specify
that TRUE is the first level.
# Recode host_superhost to be a factor with TRUE as first level
airbnb <-
airbnb %>%
mutate(host_superhost = factor(host_superhost, levels = c(TRUE, FALSE)))
Verify that host_superhost is now a factor, and check whether the order of the
levels is as intended using levels() (the order should be "TRUE", "FALSE").
XX(airbnb$host_superhost)
XX(airbnb$host_superhost)
class(airbnb$host_superhost)
[1] "factor"
levels(airbnb$host_superhost)
[1] "TRUE" "FALSE"
Because we are predicting a new outcome (host_superhost) with a new model (a
logistic regression), we need to update both our model and our recipe. Specify
the new recipe. Specifically:
use the formula host_superhost ~ ., to use all possible features;
add step_dummy(all_nominal_predictors()) to pre-process nominal features;
save the recipe as logistic_recipe.
# create new recipe
XX <-
XX(XX, data = XX) %>%
XX(XX())
# create new recipe
logistic_recipe <-
recipe(host_superhost ~ ., data = airbnb) %>%
step_dummy(all_nominal_predictors())
logistic_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Dummy variables from all_nominal_predictors()
Set up the model: create logistic_model, with the model type logistic_reg, the
engine "glm", and mode "classification".
# create a logistic regression model
XX_model <-
XX() %>%
set_XX(XX) %>%
set_XX(XX)
# create a logistic regression model
logistic_model <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Print the logistic_model object. Using translate(), check out the underlying
function that will be used to fit the model.
logistic_model
Logistic Regression Model Specification (classification)
Computational engine: glm
translate(logistic_model)
Logistic Regression Model Specification (classification)
Computational engine: glm
Model fit template:
stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
family = stats::binomial)
Create a new workflow called logistic_workflow, where you add the
logistic_model and the logistic_recipe together.
# create logistic_workflow
logistic_workflow <-
workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(logistic_model)
logistic_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
Fit the model using fit(). Save the result as superhost_glm.
# Fit the logistic regression model
superhost_glm <-
logistic_workflow %>%
fit(airbnb)
Let's evaluate the model. We will again use the metrics() function to do so.
First, we again create a dataset containing the predicted and true values.
This time, we call the predict() function twice: once to obtain the predicted
classes, and once to obtain the probabilities with which the classes are
predicted.
# Get fitted values from the superhost_glm object
logistic_pred <-
predict(superhost_glm, airbnb, type = "prob") %>%
bind_cols(predict(superhost_glm, airbnb)) %>%
bind_cols(airbnb %>% select(host_superhost))
Print the logistic_pred object and make sure you understand what the variables
mean.
# The first two variables contain the predicted class probabilities and were
# created from the first call to predict(), where type = "prob" was used.
# The third variable, .pred_class, contains the predicted class. If the .pred_TRUE
# variable in a given row was >=.5, this will be TRUE, otherwise it will be FALSE.
# Finally, the last variable, host_superhost, contains the true values.
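To convince yourself of this threshold rule, you can re-derive the predicted classes from the probabilities by hand (a sketch; pred_manual is a made-up name):
# Re-derive .pred_class from .pred_TRUE using the .5 threshold
logistic_pred %>%
  mutate(pred_manual = factor(.pred_TRUE >= .5, levels = c(TRUE, FALSE))) %>%
  summarise(mismatches = sum(pred_manual != .pred_class))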
Create a confusion matrix by using the conf_mat() function and passing it the
host_superhost variable as truth, and the .pred_class variable as estimate.
Just by looking at the confusion matrix, do you think the model is doing well?
XX(logistic_pred, truth = XX, estimate = XX)
conf_mat(logistic_pred, truth = host_superhost, estimate = .pred_class)
Truth
Prediction TRUE FALSE
TRUE 337 152
FALSE 143 559
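The confusion matrix object can do more than print: calling summary() on it derives a whole set of metrics (accuracy, kappa, sensitivity, specificity, and more); a minimal sketch:
# Derive accuracy, kappa, sensitivity, specificity, etc. from the matrix
conf_mat(logistic_pred, truth = host_superhost, estimate = .pred_class) %>%
  summary()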
Use the metrics() function, with exactly the same arguments as you used in the
call to conf_mat() before, to obtain the accuracy and the kappa statistic (a
chance-corrected measure of agreement between model prediction and true
value).
metrics(logistic_pred, truth = host_superhost, estimate = .pred_class)
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.752
2 kap binary 0.487
logistic_pred %>%
pull(host_superhost) %>%
table() %>%
prop.table() %>%
round(2)
.
TRUE FALSE
0.4 0.6
# Just by predicting always FALSE, the model could reach an accuracy of 60%.
# By using all the features available, the model can do about 15 percentage points
# better than that. According to some (admittedly arbitrary) guidelines, the
# kappa value of .49 can be considered moderate or fair to good.
# Whether such values are acceptable depends on the use case.
To additionally obtain probability-based metrics, pass the column of predicted
probabilities (.pred_TRUE) as an unnamed, fourth argument:
XX(logistic_pred, truth = XX, estimate = XX, XX)
metrics(logistic_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.752
2 kap binary 0.487
3 mn_log_loss binary 0.485
4 roc_auc binary 0.837
What does the roc_auc value indicate?
# It indicates the area under the curve (AUC) of the receiver operating
# characteristic (ROC) curve. A value of 1 would be perfect, indicating that both
# sensitivity and specificity simultaneously take perfect values.
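Like the class-based metrics, the AUC is also available as its own function; it takes the true classes and the probability column of the positive (first) level:
# AUC on its own
roc_auc(logistic_pred, truth = host_superhost, .pred_TRUE)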
Let's plot the ROC curve. Use the roc_curve() function to create sensitivity
and specificity values at different cut-offs, and pass this into the
autoplot() function to plot the curve. Add the host_superhost column as truth,
and the .pred_TRUE column as third, unnamed argument, to the roc_curve()
function and plot the curve.
XX(logistic_pred, truth = XX, XX) %>%
autoplot()
roc_curve(logistic_pred, truth = host_superhost, .pred_TRUE) %>%
autoplot()
# Fitting and evaluating a regression model ------------------------------------
# Step 0: Load packages---------------------------------------------------------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(tidymodels) # For ML mastery
tidymodels_prefer() # To resolve common conflicts
# Step 1: Load, Clean, and Explore Training Data ------------------------------
# I'll use the mpg dataset from the ggplot2 package in this example
data_train <- read_csv("1_Data/mpg_train.csv")
# Explore training data
data_train # Print the dataset
View(data_train) # Open in a new spreadsheet-like window
dim(data_train) # Print dimensions
names(data_train) # Print the names
# Step 2: Define recipe --------------------------------------------------------
# The recipe defines what to predict with what, and how to pre-process the data
lm_recipe <-
recipe(hwy ~ year + cyl + displ + trans, # Specify formula
data = data_train) %>% # Specify the data
step_dummy(all_nominal_predictors()) # Dummy code all categorical predictors
# Step 3: Define model ---------------------------------------------------------
# The model definition defines what kind of model we want to use and how to
# fit it
lm_model <-
linear_reg() %>% # Specify model type
set_engine("lm") %>% # Specify engine (often package name) to use
set_mode("regression") # Specify whether it's a regressio or classification
# problem.
# Step 4: Define workflow ------------------------------------------------------
# The workflow combines model and recipe, so that we can fit the model
lm_workflow <-
workflow() %>% # Initialize workflow
add_model(lm_model) %>% # Add the model to the workflow
add_recipe(lm_recipe) # Add the recipe to the workflow
# Step 5: Fit the model --------------------------------------------------------
hwy_lm <-
lm_workflow %>% # Use the specified workflow
fit(data_train) # Fit the model on the specified data
tidy(hwy_lm) # Look at summary information
# Step 6: Assess fit -----------------------------------------------------------
# Save model predictions and observed values
lm_fitted <-
hwy_lm %>% # Model from which to extract predictions
predict(data_train) %>% # Obtain predictions, based on entered data (in this
# case, these predictions are not out-of-sample)
bind_cols(data_train %>% select(hwy)) # Extract observed/true values
# Obtain performance metrics
metrics(lm_fitted, truth = hwy, estimate = .pred)
# Step 7: Visualize Accuracy ---------------------------------------------------
# use the lm_fitted object to generate the plot
ggplot(lm_fitted, aes(x = .pred, y = hwy)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Regression: Four Features",
subtitle = "Line indicates perfect performance",
x = "Predicted hwy",
y = "True hwy") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
The dataset contains data on 1,191 apartments that were listed on Airbnb in the Berlin area in 2018.
File | Rows | Columns |
---|---|---|
airbnb.csv | 1191 | 23 |
Variables of airbnb:
Name | Description |
---|---|
price | Price per night (in $s) |
accommodates | Number of people the airbnb accommodates |
bedrooms | Number of bedrooms |
bathrooms | Number of bathrooms |
cleaning_fee | Amount of cleaning fee (in $s) |
availability_90_days | How many of the following 90 days the airbnb is available |
district | The district the Airbnb is located in |
host_respons_time | Host average response time |
host_response_rate | Host response rate |
host_superhost | Whether host is a superhost TRUE/FALSE |
host_listings_count | Number of listings the host has |
review_scores_accuracy | Accuracy of information rating [0, 10] |
review_scores_cleanliness | Cleanliness rating [0, 10] |
review_scores_checkin | Check-in rating [0, 10] |
review_scores_communication | Communication rating [0, 10] |
review_scores_location | Location rating [0, 10] |
review_scores_value | Value rating [0, 10] |
kitchen | Kitchen available TRUE/FALSE |
tv | TV available TRUE/FALSE |
coffe_machine | Coffee machine available TRUE/FALSE |
dishwasher | Dishwasher available TRUE/FALSE |
terrace | Terrace/balcony available TRUE/FALSE |
bathtub | Bathtub available TRUE/FALSE |
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
tidymodels | install.packages("tidymodels") |
Function | Package | Description |
---|---|---|
read_csv() | tidyverse | Read in data |
mutate() | tidyverse | Manipulate or create columns |
bind_cols() | tidyverse | Bind columns together and return a tibble |
linear_reg() / logistic_reg() | tidymodels | Initialize linear/logistic regression model |
set_engine() | tidymodels | Specify which engine to use for the modeling (e.g., "lm" to use stats::lm(), or "stan" to use rstanarm::stan_lm()) |
set_mode() | tidymodels | Specify whether it's a regression or classification problem |
recipe() | tidymodels | Initialize recipe |
step_dummy() | tidymodels | Pre-process data into dummy variables |
workflow() | tidymodels | Initialize workflow |
add_recipe() | tidymodels | Add recipe to workflow |
update_recipe() | tidymodels | Update workflow with a new recipe |
add_model() | tidymodels | Add model to workflow |
fit() | tidymodels | Fit model |
tidy() | tidymodels | Show model parameters |
predict() | tidymodels | Create model predictions based on specified data |
metrics() | tidymodels | Evaluate model performance |
conf_mat() | tidymodels | Create confusion matrix |
roc_curve() | tidymodels | Calculate sensitivity and specificity with different thresholds for the ROC curve |
autoplot() | tidymodels | Plot methods for different objects, such as those created from roc_curve() to plot the ROC curve |
For more information on the functions used here, see the documentation of the
tidymodels framework.