Machine Learning with R | The R Bootcamp @ DHLab
By the end of this practical you will know how to:
Open your `TheRBootcamp` R project. It should already have the folders `1_Data` and `2_Code`. Make sure that the data file(s) listed in the Datasets section are in your `1_Data` folder.
Open a new R script and save it as a new file called `Prediction_practical.R` in the `2_Code` folder.
Using `library()`, load the set of packages for this practical listed in the Packages section.
# Load packages necessary for this script
library(rpart.plot)
library(tidyverse)
library(tidymodels)
tidymodels_prefer() # to resolve common conflicts
We will work with the `airbnb` data. Load the dataset using the code below.
# airbnb data
airbnb <- read_csv(file = "1_Data/airbnb.csv")
Inspect the variable names using `names()` and the contents using `View()`.
Previously, we used the complete `airbnb` dataset to fit the models. To check for over-fitting, we will now split the data into a training set and a test set. Use the `initial_split()` function to create a split. Pass it the `airbnb` data as argument and save the output as `airbnb_split`.
# initialize split
XX <- XX(XX)
airbnb_split <- initial_split(airbnb)
Create the training data using the `training()` function. Pass it the `airbnb_split` object as argument and save the output as `airbnb_train`.
# training data
XX <- XX(XX)
airbnb_train <- training(airbnb_split)
Create the test data using the `testing()` function. Pass it the `airbnb_split` object as argument and save the output as `airbnb_test`.
# test data
XX <- XX(XX)
airbnb_test <- testing(airbnb_split)
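Because `initial_split()` assigns rows at random, every run produces a different split, so your metrics will differ slightly from the outputs printed in this practical. The sketch below shows how calling `set.seed()` first makes a split reproducible. It uses the built-in `mtcars` data instead of `airbnb` and an arbitrary seed, both chosen purely for illustration. By default, `initial_split()` puts 3/4 of the rows into the training set.

```r
library(rsample)

# Setting a seed first makes the random split reproducible.
set.seed(42)  # arbitrary seed, chosen for illustration
split_a <- initial_split(mtcars)

set.seed(42)
split_b <- initial_split(mtcars)

# Both splits select exactly the same training rows
identical(training(split_a), training(split_b))

# Training and test set together contain every row exactly once
nrow(training(split_a)) + nrow(testing(split_a)) == nrow(mtcars)
```

Placing a `set.seed()` call directly before `initial_split(airbnb)` would work the same way.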
Your goal in this set of tasks is again to fit models predicting `price`, the price of Airbnbs located in Berlin.
Create a recipe called `lm_recipe` by calling the `recipe()` function. Use all available predictors by setting the formula to `price ~ .` and use the `airbnb_train` data. Also, add a pipe (`%>%`) and `step_dummy(all_nominal_predictors())` to dummy-code all categorical predictors.
# create recipe
XX <-
XX(XX, data = XX) %>%
XX(XX())
# create recipe
lm_recipe <-
recipe(price ~ ., data = airbnb_train) %>%
step_dummy(all_nominal_predictors())
lm_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Dummy variables from all_nominal_predictors()
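To see what `step_dummy()` actually does to the data, it can help to apply a recipe to a tiny example. The sketch below uses a made-up tibble (`toy`, with hypothetical values) and calls `prep()` and `bake()` to return the processed data. A factor with three levels becomes two 0/1 columns, with the first level serving as the reference category.

```r
library(tidymodels)

# Made-up data for illustration only
toy <- tibble(
  price    = c(50, 80, 65, 120),
  bedrooms = c(1, 2, 1, 3),
  district = factor(c("Mitte", "Pankow", "Mitte", "Spandau"))
)

dummy_recipe <-
  recipe(price ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the recipe, bake(new_data = NULL) returns the processed
# version of the data the recipe was defined on
baked <-
  dummy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
baked
```

Within a workflow you never call `prep()` or `bake()` yourself; `fit()` and `predict()` apply the recipe automatically.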
Set up the model: call the `linear_reg()` function, set the engine to `"lm"` using `set_engine()`, set the mode to `"regression"` using `set_mode()`, and save the output as `lm_model`.
# set up the regression model
XX <-
XX() %>%
XX(XX) %>%
XX(XX)
# set up the regression model
lm_model <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
lm_model
Linear Regression Model Specification (regression)
Computational engine: lm
Create a workflow called `lm_workflow` using `workflow()` and add the `lm_recipe` and `lm_model` objects using `add_recipe()` and `add_model()`.
# lm workflow
lm_workflow <-
XX() %>%
XX(XX) %>%
XX(XX)
# lm workflow
lm_workflow <-
workflow() %>%
add_recipe(lm_recipe) %>%
add_model(lm_model)
lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
Pass the `lm_workflow` and the `airbnb_train` data to the `fit()` function and save the output as `price_lm`.
# Fit the regression model
XX <-
XX %>%
XX(XX)
# Fit the regression model
price_lm <-
lm_workflow %>%
fit(airbnb_train)
Using the `tidy()` function on the `price_lm` object, take a look at the parameter estimates.
# regression model parameters
tidy(price_lm)
# A tibble: 35 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -187. 92.3 -2.03 4.30e- 2
2 accommodates 27.2 2.28 11.9 1.60e-30
3 bedrooms 12.7 5.86 2.17 3.02e- 2
4 bathrooms 17.9 9.49 1.89 5.96e- 2
5 cleaning_fee -0.301 0.114 -2.64 8.40e- 3
6 availability_90_days -0.0576 0.0965 -0.597 5.51e- 1
7 host_response_rate -0.205 0.351 -0.583 5.60e- 1
8 host_superhostTRUE 10.0 6.47 1.55 1.22e- 1
9 host_listings_count 0.411 0.736 0.558 5.77e- 1
10 review_scores_accuracy 9.83 7.96 1.24 2.17e- 1
# … with 25 more rows
Use the `predict()` function to extract the model predictions from `price_lm` and bind them together with the true values from `airbnb_train` using `bind_cols()`.
# get predicted values from training data
lm_pred <-
XX %>%
XX(XX) %>%
XX(airbnb_train %>% select(price))
# get predicted values from training data
lm_pred <-
price_lm %>%
predict(new_data = airbnb_train) %>%
bind_cols(airbnb_train %>% select(price))
Using the `metrics()` function, evaluate the model performance. Pass it the `price` variable as `truth` and the `.pred` variable as `estimate`.
# evaluate performance
XX(lm_pred, truth = XX, estimate = XX)
# evaluate performance
metrics(lm_pred, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 41.6
2 rsq standard 0.468
3 mae standard 30.9
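The three regression metrics are easy to reproduce by hand, which helps when interpreting them. The sketch below uses made-up `truth` and `pred` vectors (hypothetical values, not taken from the `airbnb` data) and base R only. Note that the `rsq` reported by `metrics()` is the squared correlation between true and predicted values, which is why it is computed with `cor()` here.

```r
# Made-up true prices and predictions, for illustration only
truth <- c(50, 80, 120, 60)
pred  <- c(55, 70, 110, 65)

# Root mean squared error: penalizes large errors more strongly
rmse <- sqrt(mean((truth - pred)^2))

# Mean absolute error: the average size of the errors
mae <- mean(abs(truth - pred))

# R squared as squared correlation between truth and prediction
rsq <- cor(truth, pred)^2

c(rmse = rmse, rsq = rsq, mae = mae)
```

Because `rmse` squares the errors before averaging, a few very badly predicted listings can raise it considerably while leaving `mae` comparatively low.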
# use the lm_pred object to generate the plot
ggplot(lm_pred, aes(x = .pred, y = price)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Regression: All Features",
subtitle = "Line indicates perfect performance",
x = "Predicted Airbnb Prices in $",
y = "True Airbnb Prices in $") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
Create a new recipe called `tree_recipe` that uses all available predictors to predict the `price` of Airbnbs based on the `airbnb_train` data. In addition, use the pre-processing step `step_other(all_nominal_predictors(), threshold = 0.005)`. This will lump together all levels of categorical variables that make up less than 0.5% of the cases into an `other` category. This will prevent issues when assessing performance using the test set.
tree_recipe <-
recipe(price ~ ., data = airbnb_train) %>%
step_other(all_nominal_predictors(), threshold = 0.005)
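The effect of `step_other()` is easiest to see on a small example. The sketch below uses made-up data (`toy_train` and `toy_new` are hypothetical) and a larger threshold of 0.1, because the toy data has only 20 rows. Levels that occur in fewer than 10% of the rows are pooled into `other`, and, according to the recipes documentation, a level that was never seen during training is also assigned to `other` when new data is processed.

```r
library(tidymodels)

# Made-up training data: two frequent and two rare districts
toy_train <- tibble(
  price    = seq(40, 135, by = 5),
  district = factor(c(rep("Mitte", 12), rep("Pankow", 6), "Spandau", "Marzahn"))
)

other_recipe <-
  recipe(price ~ district, data = toy_train) %>%
  step_other(all_nominal_predictors(), threshold = 0.1) %>%
  prep()

# The two rare districts are pooled into "other"
lumped <- bake(other_recipe, new_data = NULL)
table(lumped$district)

# A district that never occurred in the training data
toy_new <- tibble(price = 70, district = factor("Neukoelln"))
bake(other_recipe, new_data = toy_new)$district
```

This is the situation the recipe guards against: a rare category that ends up only in the test set.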
Use the `decision_tree()` function to specify the model, and set the engine to `"rpart"`. Set the mode to `"regression"`. Call the output `dt_model`.
# set up the decision tree model
XX <-
XX() %>%
XX(XX) %>%
XX(XX)
# set up the decision tree model
dt_model <-
decision_tree() %>%
set_engine("rpart") %>%
set_mode("regression")
Create a new workflow called `dt_workflow`, where you add the newly created `tree_recipe` and the `dt_model`.
# decision tree workflow
dt_workflow <-
XX() %>%
XX(XX) %>%
XX(XX)
# decision tree workflow
dt_workflow <-
workflow() %>%
add_recipe(tree_recipe) %>%
add_model(dt_model)
dt_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_other()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)
Computational engine: rpart
Pass the `dt_workflow` and the `airbnb_train` data to the `fit()` function and save the output as `price_dt`.
# Fit the decision tree
XX <-
XX %>%
XX(XX)
# Fit the decision tree
price_dt <-
dt_workflow %>%
fit(airbnb_train)
The `tidy()` function won’t work with decision tree fit objects, but we can print the output using the following code:
# print the decision tree output
price_dt %>%
extract_fit_parsnip() %>%
pluck("fit")
n= 893
node), split, n, deviance, yval
* denotes terminal node
1) root 893 9230000 73.7
2) accommodates< 11.5 884 1920000 68.1
4) accommodates< 3.5 615 457000 50.7 *
5) accommodates>=3.5 269 853000 108.0
10) accommodates< 5.5 180 318000 91.0 *
11) accommodates>=5.5 89 383000 142.0 *
3) accommodates>=11.5 9 4490000 631.0 *
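In the printout, `yval` is the predicted value of a node: for a regression tree, this is the mean outcome of the training cases that fall into that node. The sketch below checks this with `rpart` directly, using the built-in `mtcars` data as a stand-in (an assumption for illustration; it is not the `airbnb` model).

```r
library(rpart)

# Fit a small regression tree on a built-in dataset
tree_fit <- rpart(mpg ~ wt + hp, data = mtcars)

# `where` stores the terminal node each training case ends up in
leaf_id <- tree_fit$where

# Mean outcome of all cases sharing the same terminal node
leaf_means <- ave(mtcars$mpg, leaf_id)

# The tree's predictions for the training data are exactly these means
all.equal(unname(predict(tree_fit)), leaf_means)
```

A tree with four terminal nodes, like the one printed above, can therefore only ever predict four different prices.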
Instead of printing the fit, you can pass it to the `rpart.plot()` function. This will create a visualization of the decision tree (in this case, the plot does not look very useful, but depending on the variables used by the model it can be).
price_dt %>%
extract_fit_parsnip() %>%
pluck("fit") %>%
rpart.plot()
Use the `predict()` function to extract the model predictions from `price_dt` and bind them together with the true values from `airbnb_train` using `bind_cols()`.
# get predicted values from training data
dt_pred <-
XX %>%
XX(XX) %>%
XX(airbnb_train %>% select(price))
# get predicted values from training data
dt_pred <-
price_dt %>%
predict(new_data = airbnb_train) %>%
bind_cols(airbnb_train %>% select(price))
Using the `metrics()` function, evaluate the model performance. Pass it the `price` variable as `truth` and the `.pred` variable as `estimate`.
# evaluate performance
XX(dt_pred, truth = XX, estimate = XX)
# evaluate performance
metrics(dt_pred, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 79.5
2 rsq standard 0.388
3 mae standard 30.2
How does the decision tree's performance compare to that of the regression model, based on the training data?
Using the following code, plot the fitted against the true values to judge how well our model performed.
# use the dt_pred object to generate the plot
ggplot(dt_pred, aes(x = .pred, y = price)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Decision Tree: All Features",
subtitle = "Line indicates perfect performance",
x = "Predicted Airbnb Prices in $",
y = "True Airbnb Prices in $") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
Use the `rand_forest()` function to specify the model, and set the engine to `"ranger"`. Set the mode to `"regression"`. Call the output `rf_model`.
# set up the random forest model
XX <-
XX() %>%
XX(XX) %>%
XX(XX)
# set up the random forest model
rf_model <-
rand_forest() %>%
set_engine("ranger") %>%
set_mode("regression")
Create a new workflow called `rf_workflow`, where you add the `tree_recipe` and the newly created `rf_model`.
# random forest workflow
rf_workflow <-
XX() %>%
XX(XX) %>%
XX(XX)
# random forest workflow
rf_workflow <-
workflow() %>%
add_recipe(tree_recipe) %>%
add_model(rf_model)
rf_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_other()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Computational engine: ranger
Pass the `rf_workflow` and the `airbnb_train` data to the `fit()` function and save the output as `price_rf`.
# Fit the random forest
XX <-
XX %>%
XX(XX)
# Fit the random forest
price_rf <-
rf_workflow %>%
fit(airbnb_train)
The `tidy()` function won’t work with random forest fit objects, but we can print the output using the following code:
# print the random forest output
price_rf %>%
extract_fit_parsnip() %>%
pluck("fit")
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 500
Sample size: 893
Number of independent variables: 22
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 5677
R squared (OOB): 0.451
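The out-of-bag (OOB) values in this printout are computed from predictions that each tree makes for the training cases it did not see, so they behave more like test performance than training performance. As a rough check, the sketch below recomputes the OOB R squared from numbers printed in this practical. It assumes that the R squared is defined as 1 minus the OOB MSE divided by the variance of `price`, and that the decision tree printout above (root deviance 9230000, n = 893) comes from the same training data.

```r
# Values copied from the printed outputs in this practical
root_deviance <- 9230000  # sum of squared deviations of price (tree root node)
n_train       <- 893      # number of training cases
oob_mse       <- 5677     # OOB prediction error (MSE) of the random forest

# Sample variance of price in the training data
price_variance <- root_deviance / (n_train - 1)

# Share of the price variance explained by the OOB predictions
oob_rsq <- 1 - oob_mse / price_variance
round(oob_rsq, 3)
```

Under these assumptions the result agrees with the `R squared (OOB)` line of the printout.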
Use the `predict()` function to extract the model predictions from `price_rf` and bind them together with the true values from `airbnb_train` using `bind_cols()`.
# get predicted values from training data
rf_pred <-
XX %>%
XX(XX) %>%
XX(airbnb_train %>% select(price))
# get predicted values from training data
rf_pred <-
price_rf %>%
predict(new_data = airbnb_train) %>%
bind_cols(airbnb_train %>% select(price))
Using the `metrics()` function, evaluate the model performance. Pass it the `price` variable as `truth` and the `.pred` variable as `estimate`.
# evaluate performance
XX(rf_pred, truth = XX, estimate = XX)
# evaluate performance
metrics(rf_pred, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 48.4
2 rsq standard 0.803
3 mae standard 13.7
How does the training performance of the random forest compare to that of the other two models?
Using the following code, plot the fitted against the true values to judge how well our model performed.
# use the rf_pred object to generate the plot
ggplot(rf_pred, aes(x = .pred, y = price)) +
# Create a diagonal line:
geom_abline(lty = 2) +
# Add data points:
geom_point(alpha = 0.5) +
labs(title = "Random Forest: All Features",
subtitle = "Line indicates perfect performance",
x = "Predicted Airbnb Prices in $",
y = "True Airbnb Prices in $") +
# Scale and size the x- and y-axis uniformly:
coord_obs_pred()
Use the `predict()` function to extract the model predictions from `price_lm`, but this time based on `airbnb_test`, and bind them together with the true values from `airbnb_test` using `bind_cols()`. Save the output as `lm_pred_test`.
# get predicted values from test data
lm_pred_test <-
XX %>%
XX(XX) %>%
XX(airbnb_test %>% select(price))
# get predicted values from test data
lm_pred_test <-
price_lm %>%
predict(new_data = airbnb_test) %>%
bind_cols(airbnb_test %>% select(price))
Repeat the step above for the decision tree and the random forest, saving the outputs as `dt_pred_test` and `rf_pred_test`.
# decision tree
dt_pred_test <-
price_dt %>%
predict(new_data = airbnb_test) %>%
bind_cols(airbnb_test %>% select(price))
# random forest
rf_pred_test <-
price_rf %>%
predict(new_data = airbnb_test) %>%
bind_cols(airbnb_test %>% select(price))
Using the `metrics()` function, evaluate the models’ out-of-sample performances. Pass it the `price` variable as `truth` and the `.pred` variable as `estimate`.
# evaluate performance
XX(XX, truth = XX, estimate = XX)
XX(XX, truth = XX, estimate = XX)
XX(XX, truth = XX, estimate = XX)
# evaluate performance
metrics(lm_pred_test, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 41.6
2 rsq standard 0.468
3 mae standard 30.9
metrics(dt_pred_test, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 45.7
2 rsq standard 0.208
3 mae standard 26.7
metrics(rf_pred_test, truth = price, estimate = .pred)
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 26.9
2 rsq standard 0.601
3 mae standard 19.3
# The regression model's performance is very similar in the training and
# test data. The test performance of the decision tree drops somewhat and
# the test performance of the random forest has the most significant drop
# in comparison to its training performance.
# The random forest predictions are still the most accurate.
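Comparing several models across training and test data gets repetitive when every `predict()`, `bind_cols()` and `metrics()` call is written out separately. One possible way to collect everything in a single table is sketched below. The helper `evaluate()` and the use of the built-in `mtcars` data with the outcome `mpg` are assumptions made for illustration; they are not part of the practical's solution.

```r
library(tidymodels)

set.seed(123)  # arbitrary seed, so the split is reproducible
car_split <- initial_split(mtcars)
car_train <- training(car_split)
car_test  <- testing(car_split)

car_recipe <- recipe(mpg ~ wt + hp, data = car_train)

# Two model specifications, fitted with the same recipe
specs <- list(
  lm   = linear_reg() %>% set_engine("lm") %>% set_mode("regression"),
  tree = decision_tree() %>% set_engine("rpart") %>% set_mode("regression")
)

fits <- purrr::map(specs, function(spec) {
  workflow() %>%
    add_recipe(car_recipe) %>%
    add_model(spec) %>%
    fit(car_train)
})

# Hypothetical helper: predictions plus true values, passed on to metrics()
evaluate <- function(fitted, data) {
  fitted %>%
    predict(new_data = data) %>%
    bind_cols(data %>% select(mpg)) %>%
    metrics(truth = mpg, estimate = .pred)
}

# One row per model, metric, and dataset
comparison <- bind_rows(
  purrr::map(fits, evaluate, data = car_train) %>%
    bind_rows(.id = "model") %>%
    mutate(data = "train"),
  purrr::map(fits, evaluate, data = car_test) %>%
    bind_rows(.id = "model") %>%
    mutate(data = "test")
)

comparison
```

With a table like this, the gap between training and test performance of each model can be read off directly or plotted with `ggplot()`.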
In this section, the goal is to predict the `host_superhost` variable. Like in the previous practical, we first have to change our criterion to be a `factor`. We again explicitly specify `TRUE` as first level.
# Recode host_superhost to be a factor with TRUE as first level
airbnb <-
airbnb %>%
mutate(host_superhost = factor(host_superhost, levels = c(TRUE, FALSE)))
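The order of the levels matters because the yardstick metric functions treat the first factor level as the event of interest by default. Without the `levels` argument, `factor()` sorts the values, which would put `FALSE` first. The short sketch below, using a made-up logical vector, shows the difference.

```r
# Made-up logical vector for illustration
is_superhost <- c(TRUE, FALSE, FALSE, TRUE)

# Default: levels are sorted, so FALSE comes first
default_levels <- levels(factor(is_superhost))
default_levels

# Explicit levels: TRUE becomes the first level
explicit_levels <- levels(factor(is_superhost, levels = c(TRUE, FALSE)))
explicit_levels
```

With `TRUE` as first level, metrics such as sensitivity refer to correctly identifying superhosts rather than non-superhosts.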
Create a new split that is stratified by `host_superhost` and that uses 80% of the data for training.
airbnb_split <- initial_split(XX, prop = XX, strata = XX)
airbnb_split <- initial_split(airbnb, prop = .8, strata = host_superhost)
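Stratifying means that the split is carried out separately within each class, so the training and test data end up with nearly the same share of superhosts as the full data. The sketch below illustrates this with made-up data (`toy`, with 30% `TRUE` cases) and an arbitrary seed; both are assumptions for illustration.

```r
library(rsample)

set.seed(1)  # arbitrary seed, chosen for illustration

# Made-up data: 30 superhosts and 70 non-superhosts
toy <- data.frame(
  id = 1:100,
  superhost = factor(rep(c("TRUE", "FALSE"), times = c(30, 70)),
                     levels = c("TRUE", "FALSE"))
)

toy_split <- initial_split(toy, prop = 0.8, strata = superhost)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of superhosts in the full, training, and test data
share_full  <- mean(toy$superhost == "TRUE")
share_train <- mean(toy_train$superhost == "TRUE")
share_test  <- mean(toy_test$superhost == "TRUE")

c(full = share_full, train = share_train, test = share_test)
```

Without `strata`, the shares can drift apart by chance, which matters most when one class is rare.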
From the new split, create new `airbnb_train` and `airbnb_test` data.
XX <- XX(XX)
XX <- XX(XX)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)
Create a new recipe called `logistic_recipe`. Set the formula to `host_superhost ~ .` to use all possible features, use the `airbnb_train` data, and add `step_dummy(all_nominal_predictors())` to pre-process nominal features.
# create new recipe
XX <-
XX(XX, data = XX) %>%
XX(XX())
# create new recipe
logistic_recipe <-
recipe(host_superhost ~ ., data = airbnb_train) %>%
step_dummy(all_nominal_predictors())
logistic_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Dummy variables from all_nominal_predictors()
Create a new model called `logistic_model`, with the model type `logistic_reg()`, the engine `"glm"`, and mode `"classification"`.
# create a logistic regression model
XX_model <-
XX() %>%
set_XX(XX) %>%
set_XX(XX)
# create a logistic regression model
logistic_model <-
logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
Create a new workflow called `logistic_workflow`, where you add the `logistic_model` and the `logistic_recipe` together.
# create logistic_workflow
logistic_workflow <-
workflow() %>%
add_recipe(logistic_recipe) %>%
add_model(logistic_model)
Fit the model on the training data (`airbnb_train`) using `fit()`. Save the result as `superhost_glm`.
# Fit the logistic regression model
superhost_glm <-
logistic_workflow %>%
fit(airbnb_train)
Evaluate the model's performance on the training data. We again use the `metrics()` function to do so. First, we again create a dataset containing the predicted and true values. This time, we call the `predict()` function twice: once to obtain the predicted classes, and once to obtain the probabilities with which the classes are predicted.
# Get fitted values from the superhost_glm object
logistic_pred <-
predict(superhost_glm, airbnb_train, type = "prob") %>%
bind_cols(predict(superhost_glm, airbnb_train)) %>%
bind_cols(airbnb_train %>% select(host_superhost))
Now call the `metrics()` function and pass it the `host_superhost` variable as `truth`, the `.pred_class` variable as `estimate`, and `.pred_TRUE` as last argument.
XX(logistic_pred, truth = XX, estimate = XX, XX)
metrics(logistic_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.754
2 kap binary 0.492
3 mn_log_loss binary 0.480
4 roc_auc binary 0.841
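Accuracy and kappa (`kap`) are both derived from the table of predicted against true classes. The sketch below computes them by hand in base R for made-up predictions (hypothetical values, not the `airbnb` results). Kappa corrects accuracy for the agreement one would expect by chance given how often each class occurs.

```r
# Made-up true and predicted classes, for illustration only
truth <- factor(c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
                levels = c(TRUE, FALSE))
pred  <- factor(c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
                levels = c(TRUE, FALSE))

# Confusion matrix: predicted classes in rows, true classes in columns
confusion <- table(Prediction = pred, Truth = truth)
confusion

n <- sum(confusion)

# Accuracy: share of cases on the diagonal (correct predictions)
accuracy <- sum(diag(confusion)) / n

# Agreement expected by chance, given the row and column totals
expected <- sum(rowSums(confusion) * colSums(confusion)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement
cohen_kappa <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kap = cohen_kappa)
```

A kappa far below the accuracy, as in the outputs of this practical, indicates that part of the accuracy could be reached simply by predicting the more frequent class.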
Plot the ROC curve. Use the `roc_curve()` function to compute sensitivity and specificity values at different cut-offs, and pass the result into the `autoplot()` function to plot the curve. Add the `host_superhost` column as `truth`, and the `.pred_TRUE` column as third, unnamed argument, to the `roc_curve()` function.
XX(logistic_pred, truth = XX, XX) %>%
autoplot()
roc_curve(logistic_pred, truth = host_superhost, .pred_TRUE) %>%
autoplot()
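Each point on a ROC curve corresponds to one cut-off applied to the predicted probabilities. The base R sketch below shows the idea with made-up probabilities and a hypothetical helper `sens_spec()`: lowering the cut-off classifies more cases as `TRUE`, which raises sensitivity and lowers specificity.

```r
# Made-up true classes and predicted probabilities of TRUE
truth     <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
prob_true <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.1)

# Hypothetical helper: sensitivity and specificity at one cut-off
sens_spec <- function(threshold) {
  predicted <- prob_true >= threshold
  c(threshold   = threshold,
    sensitivity = sum(predicted & truth) / sum(truth),
    specificity = sum(!predicted & !truth) / sum(!truth))
}

# One row per cut-off; these are the points a ROC curve connects
roc_points <- t(sapply(c(0.2, 0.35, 0.5, 0.8), sens_spec))
roc_points
```

The `roc_curve()` function does the same for every distinct predicted probability in the data.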
Create a new recipe called `tree_recipe` that uses all available predictors to predict `host_superhost`. In addition, again use the pre-processing step `step_other(all_nominal_predictors(), threshold = 0.005)`.
tree_recipe <-
recipe(host_superhost ~ ., data = airbnb_train) %>%
step_other(all_nominal_predictors(), threshold = 0.005)
Use the `decision_tree()` function to specify the model, and set the engine to `"rpart"`. Set the mode to `"classification"`. Call the output `dt_model`.
# set up the decision tree model
dt_model <-
decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
Create a new workflow called `dt_workflow`, where you add the newly created `tree_recipe` and the `dt_model`.
# decision tree workflow
dt_workflow <-
workflow() %>%
add_recipe(tree_recipe) %>%
add_model(dt_model)
dt_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_other()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Computational engine: rpart
Pass the `dt_workflow` and the `airbnb_train` data to the `fit()` function and save the output as `superhost_dt`.
# Fit the decision tree
superhost_dt <-
dt_workflow %>%
fit(airbnb_train)
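Evaluate the decision tree's performance on the training data. We again use the `metrics()` function to do so. Use the code from the logistic regression above as template. Save the predicted and true values as `dt_pred`.
dt_pred <-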
predict(superhost_dt, airbnb_train, type = "prob") %>%
bind_cols(predict(superhost_dt, airbnb_train)) %>%
bind_cols(airbnb_train %>% select(host_superhost))
Now call the `metrics()` function and pass it the `host_superhost` variable as `truth`, the `.pred_class` variable as `estimate`, and `.pred_TRUE` as last argument.
metrics(dt_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.756
2 kap binary 0.495
3 mn_log_loss binary 0.544
4 roc_auc binary 0.763
Plot the ROC curve. Use the `roc_curve()` function to compute sensitivity and specificity values at different cut-offs, and pass the result into the `autoplot()` function to plot the curve. Add the `host_superhost` column as `truth`, and the `.pred_TRUE` column as third, unnamed argument, to the `roc_curve()` function.
roc_curve(dt_pred, truth = host_superhost, .pred_TRUE) %>%
autoplot()
Use the `rand_forest()` function to specify the model, and set the engine to `"ranger"`. Set the mode to `"classification"`. Call the output `rf_model`.
# set up the random forest model
rf_model <-
rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
Create a new workflow called `rf_workflow`, where you add the previously created `tree_recipe` and the new `rf_model`.
# random forest workflow
rf_workflow <-
workflow() %>%
add_recipe(tree_recipe) %>%
add_model(rf_model)
rf_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_other()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Computational engine: ranger
Pass the `rf_workflow` and the `airbnb_train` data to the `fit()` function and save the output as `superhost_rf`.
# Fit the random forest
superhost_rf <-
rf_workflow %>%
fit(airbnb_train)
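Evaluate the random forest's performance on the training data. We again use the `metrics()` function to do so. First, save the predicted and true values as `rf_pred`.
rf_pred <-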
predict(superhost_rf, airbnb_train, type = "prob") %>%
bind_cols(predict(superhost_rf, airbnb_train)) %>%
bind_cols(airbnb_train %>% select(host_superhost))
Now call the `metrics()` function and pass it the `host_superhost` variable as `truth`, the `.pred_class` variable as `estimate`, and `.pred_TRUE` as last argument.
metrics(rf_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.959
2 kap binary 0.915
3 mn_log_loss binary 0.278
4 roc_auc binary 0.995
Plot the ROC curve. Use the `roc_curve()` function to compute sensitivity and specificity values at different cut-offs, and pass the result into the `autoplot()` function to plot the curve.
roc_curve(rf_pred, truth = host_superhost, .pred_TRUE) %>%
autoplot()
Use the `predict()` function twice to extract the model predictions from `superhost_glm` (as done with the training data), but this time based on `airbnb_test`, and bind them together with the true values from `airbnb_test` using `bind_cols()`. Save the output as `glm_pred_test`.
# get predicted values from test data
glm_pred_test <-
superhost_glm %>%
predict(airbnb_test, type = "prob") %>%
bind_cols(predict(superhost_glm, airbnb_test)) %>%
bind_cols(airbnb_test %>% select(host_superhost))
Repeat the step above for the decision tree and the random forest, saving the outputs as `dt_pred_test` and `rf_pred_test`.
# decision tree
dt_pred_test <-
superhost_dt %>%
predict(airbnb_test, type = "prob") %>%
bind_cols(predict(superhost_dt, airbnb_test)) %>%
bind_cols(airbnb_test %>% select(host_superhost))
# random forest
rf_pred_test <-
superhost_rf %>%
predict(airbnb_test, type = "prob") %>%
bind_cols(predict(superhost_rf, airbnb_test)) %>%
bind_cols(airbnb_test %>% select(host_superhost))
Using the `metrics()` function, evaluate the models’ out-of-sample performances. Pass it the `host_superhost` variable as `truth`, the `.pred_class` variable as `estimate`, and `.pred_TRUE` as last argument.
# evaluate performance
metrics(glm_pred_test, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.732
2 kap binary 0.443
3 mn_log_loss binary 0.517
4 roc_auc binary 0.815
metrics(dt_pred_test, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.732
2 kap binary 0.447
3 mn_log_loss binary 0.581
4 roc_auc binary 0.716
metrics(rf_pred_test, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.762
2 kap binary 0.503
3 mn_log_loss binary 0.508
4 roc_auc binary 0.824
# The random forest predictions are still the most accurate.
roc_curve(glm_pred_test, truth = host_superhost, .pred_TRUE) %>%
autoplot()
roc_curve(dt_pred_test, truth = host_superhost, .pred_TRUE) %>%
autoplot()
roc_curve(rf_pred_test, truth = host_superhost, .pred_TRUE) %>%
autoplot()
# Fitting and evaluating a regression model ------------------------------------
# Step 0: Load packages---------------------------------------------------------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(tidymodels) # For ML mastery
tidymodels_prefer() # To resolve common conflicts
# Step 1: Load and Clean, and Explore Training data ----------------------------
# I'll use the mpg dataset from the ggplot2 package (loaded with tidyverse)
# Explore training data
mpg # Print the dataset
View(mpg) # Open in a new spreadsheet-like window
dim(mpg) # Print dimensions
names(mpg) # Print the names
# Step 2: Split the data--------------------------------------------------------
mpg_split <- initial_split(mpg)
data_train <- training(mpg_split)
data_test <- testing(mpg_split)
# Step 3: Define recipe --------------------------------------------------------
# The recipe defines what to predict with what, and how to pre-process the data
lm_recipe <-
recipe(hwy ~ year + cyl + displ + trans, # Specify formula
data = data_train) %>% # Specify the data
step_dummy(all_nominal_predictors()) # Dummy code all categorical predictors
# Step 4: Define model ---------------------------------------------------------
# The model definition defines what kind of model we want to use and how to
# fit it
lm_model <-
linear_reg() %>% # Specify model type
set_engine("lm") %>% # Specify engine (often package name) to use
set_mode("regression") # Specify whether it's a regression or classification
# problem.
# Step 5: Define workflow ------------------------------------------------------
# The workflow combines model and recipe, so that we can fit the model
lm_workflow <-
workflow() %>% # Initialize workflow
add_model(lm_model) %>% # Add the model to the workflow
add_recipe(lm_recipe) # Add the recipe to the workflow
# Step 6: Fit the model --------------------------------------------------------
hwy_lm <-
lm_workflow %>% # Use the specified workflow
fit(data_train) # Fit the model on the specified data
tidy(hwy_lm) # Look at summary information
# Step 7: Assess fit -----------------------------------------------------------
# Save model predictions and observed values
lm_fitted <-
hwy_lm %>% # Model from which to extract predictions
predict(data_train) %>% # Obtain predictions, based on entered data (in this
# case, these predictions are not out-of-sample)
bind_cols(data_train %>% select(hwy)) # Extract observed/true values
# Obtain performance metrics
metrics(lm_fitted, truth = hwy, estimate = .pred)
# Step 8: Assess prediction performance ----------------------------------------
# Save model predictions and observed values
lm_pred <-
hwy_lm %>% # Model from which to extract predictions
predict(data_test) %>% # Obtain predictions, based on entered data (in this
# case, these predictions ARE out-of-sample)
bind_cols(data_test %>% select(hwy)) # Extract observed/true values
# Obtain performance metrics
metrics(lm_pred, truth = hwy, estimate = .pred)
The dataset contains data of the 1191 apartments that were added on Airbnb for the Berlin area in the year 2018.
File | Rows | Columns |
---|---|---|
airbnb.csv | 1191 | 23 |
airbnb
Name | Description |
---|---|
price | Price per night (in $s) |
accommodates | Number of people the airbnb accommodates |
bedrooms | Number of bedrooms |
bathrooms | Number of bathrooms |
cleaning_fee | Amount of cleaning fee (in $s) |
availability_90_days | How many of the following 90 days the airbnb is available |
district | The district the Airbnb is located in |
host_respons_time | Host average response time |
host_response_rate | Host response rate |
host_superhost | Whether host is a superhost TRUE/FALSE |
host_listings_count | Number of listings the host has |
review_scores_accuracy | Accuracy of information rating [0, 10] |
review_scores_cleanliness | Cleanliness rating [0, 10] |
review_scores_checkin | Check in rating [0, 10] |
review_scores_communication | Communication rating [0, 10] |
review_scores_location | Location rating [0, 10] |
review_scores_value | Value rating [0, 10] |
kitchen | Kitchen available TRUE/FALSE |
tv | TV available TRUE/FALSE |
coffe_machine | Coffee machine available TRUE/FALSE |
dishwasher | Dishwasher available TRUE/FALSE |
terrace | Terrace/balcony available TRUE/FALSE |
bathtub | Bathtub available TRUE/FALSE |
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
tidymodels | install.packages("tidymodels") |
rpart.plot | install.packages("rpart.plot") |
Function | Package | Description |
---|---|---|
read_csv() | tidyverse | Read in data |
mutate() | tidyverse | Manipulate or create columns |
bind_cols() | tidyverse | Bind columns together and return a tibble |
pluck() | tidyverse | Extract element from list |
initial_split() | tidymodels | Initialize splitting dataset into training and test data |
training() | tidymodels | Create training data from initial_split output |
testing() | tidymodels | Create test data from initial_split output |
linear_reg() / logistic_reg() | tidymodels | Initialize linear/logistic regression model |
set_engine() | tidymodels | Specify which engine to use for the modeling (e.g., “lm” to use stats::lm(), or “stan” to use rstanarm::stan_glm()) |
set_mode() | tidymodels | Specify whether it’s a regression or classification problem |
recipe() | tidymodels | Initialize recipe |
step_dummy() | tidymodels | Pre-process data into dummy variables |
workflow() | tidymodels | Initialize workflow |
add_recipe() | tidymodels | Add recipe to workflow |
update_recipe() | tidymodels | Update workflow with a new recipe |
add_model() | tidymodels | Add model to workflow |
fit() | tidymodels | Fit model |
tidy() | tidymodels | Show model parameters |
predict() | tidymodels | Create model predictions based on specified data |
metrics() | tidymodels | Evaluate model performance |
conf_mat() | tidymodels | Create confusion matrix |
roc_curve() | tidymodels | Calculate sensitivity and specificity with different thresholds for ROC-curve |
autoplot() | tidymodels | Plot methods for different objects such as those created from roc_curve() to plot the ROC-curve |
rpart.plot() | rpart.plot | Plot a decision tree from an rpart fit object |
tidymodels
framework.