In this practical you'll practice the basics of machine learning in R.
By the end of this practical you will know how to:

- Fit regression and decision tree models to training data
- Evaluate each model's fitting and prediction accuracy
- Compare the prediction performance of competing models
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
broom | install.packages("broom") |
rpart | install.packages("rpart") |
FFTrees | install.packages("FFTrees") |
partykit | install.packages("partykit") |
party | install.packages("party") |
randomForest | install.packages("randomForest") |
caret | install.packages("caret") |
library(tidyverse)
library(rpart)
library(FFTrees)
library(partykit)
library(party)
library(randomForest)
library(broom)
library(caret)
attrition_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_train.csv")
attrition_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_test.csv")
heartdisease_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_train.csv")
heartdisease_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_test.csv")
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
File | Rows | Columns |
---|---|---|
house_train.csv | 1000 | 21 |
house_test.csv | 15000 | 21 |
Function | Package | Description |
---|---|---|
summary() | base | Get summary information from an R object |
names() | base | See the named elements of a list |
LIST$NAME | base | Get the named element NAME from the list LIST |
predict(object, newdata) | stats | Predict the criterion values of newdata based on object |
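As a quick refresher on these functions, here is a minimal sketch using the built-in mtcars data (the model and variable names are just for illustration, not part of the practical):

```r
# Fit a simple regression on the built-in mtcars data
mpg_lm <- lm(formula = mpg ~ wt + hp, data = mtcars)

summary(mpg_lm)       # summary information about the fitted model
names(mpg_lm)         # named elements stored in the model object (a list)
mpg_lm$coefficients   # extract the named element "coefficients" with $

# Predict the criterion (mpg) for new data
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
predict(mpg_lm, newdata = new_cars)
```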
# Machine learning basics ------------------------------------
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(rpart) # For rpart()
library(broom) # For tidy()
library(caret)     # For postResample()
library(partykit) # For nice decision trees
# Step 1: Load Training and Test data ----------------------
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
# Step 2: Explore training data -----------------------------
summary(house_train)
# We will do a log-transformation on price
# because it is so heavily skewed
# Log-transform price
house_train <- house_train %>%
  mutate(price = log(price))
# Log-transform price
house_test <- house_test %>%
  mutate(price = log(price))
# Step 3: Fit models predicting price ------------------
# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)
# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)
# Step 4: Explore models -------------------------------
# Regression
summary(price_lm)
# Decision Trees
price_rpart
plot(price_rpart)
text(price_rpart)
# Nicer version!
plot(as.party(price_rpart))
# Step 5: Assess fitting accuracy ----------------------------
# Get fitted values
lm_fit <- predict(price_lm,
                  newdata = house_train)
rpart_fit <- predict(price_rpart,
                     newdata = house_train)
# Regression Fitting Accuracy
postResample(pred = lm_fit,
             obs = house_train$price)
# Decision Tree Fitting Accuracy
postResample(pred = rpart_fit,
             obs = house_train$price)
# Step 6: Predict new data -------------------------
lm_pred <- predict(object = price_lm,
                   newdata = house_test)
rpart_pred <- predict(object = price_rpart,
                      newdata = house_test)
# Step 7: Compare accuracy --------------------------
# Regression Prediction Accuracy
postResample(pred = lm_pred,
             obs = house_test$price)
# Decision Tree Prediction Accuracy
postResample(pred = rpart_pred,
             obs = house_test$price)
# Plot results
# Tidy competition results
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred) %>%
  gather(group, prediction, -truth)
# Plot!
ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression versus decision trees")
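If you are curious what postResample() is reporting, the three values can be computed by hand with base R. The sketch below uses toy numbers (not the housing data); note that caret computes R-squared as a squared correlation by default:

```r
# Hand-computed versions of the three postResample() metrics
pred <- c(12.8, 13.1, 13.4, 12.9)   # toy predictions
obs  <- c(13.0, 13.0, 13.5, 12.7)   # toy observed values

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
rsq  <- cor(pred, obs)^2            # R-squared (squared correlation)
mae  <- mean(abs(pred - obs))       # mean absolute error

round(c(RMSE = rmse, Rsquared = rsq, MAE = mae), 3)
```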
Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.
# Done!
Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called machinelearning_practical.R in the 2_Code folder.
Using library(), load the set of packages for this practical listed in the packages section above.
## NAME
## DATE
## Machine Learning Practical
library(XX)
library(XX)
#...
The data come as a model training dataset house_train.csv and a model testing dataset house_test.csv. Using the following template, load the datasets into R as house_train and house_test:
house_train <- read_csv(file = "XXX/XXX")
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(rpart) # For rpart()
library(broom) # For tidy()
library(caret)     # For postResample()
library(partykit) # For nice decision trees
# Step 1: Load Training and Test data ----------------------
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
Parsed with column specification:
cols(
.default = col_integer(),
id = col_character(),
date = col_datetime(format = ""),
price = col_double(),
bathrooms = col_double(),
floors = col_double(),
lat = col_double(),
long = col_double()
)
See spec(...) for full column specifications.
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
Parsed with column specification:
cols(
.default = col_integer(),
id = col_character(),
date = col_datetime(format = ""),
price = col_double(),
bathrooms = col_double(),
floors = col_double(),
lat = col_double(),
long = col_double()
)
See spec(...) for full column specifications.
# Step 2: Explore training data -----------------------------
summary(house_train)
id date price
Length:1000 Min. :2014-05-02 00:00:00 Min. : 82000
Class :character 1st Qu.:2014-07-24 00:00:00 1st Qu.: 324875
Mode :character Median :2014-10-13 12:00:00 Median : 452750
Mean :2014-10-27 04:07:40 Mean : 549251
3rd Qu.:2015-02-10 00:00:00 3rd Qu.: 636900
Max. :2015-05-14 00:00:00 Max. :5110800
bedrooms bathrooms sqft_living sqft_lot
Min. :1.00 Min. :0.75 Min. : 520 Min. : 740
1st Qu.:3.00 1st Qu.:1.75 1st Qu.:1440 1st Qu.: 5200
Median :3.00 Median :2.25 Median :1955 Median : 7782
Mean :3.41 Mean :2.17 Mean :2100 Mean : 15011
3rd Qu.:4.00 3rd Qu.:2.50 3rd Qu.:2540 3rd Qu.: 10779
Max. :7.00 Max. :6.00 Max. :8010 Max. :920423
floors waterfront view condition
Min. :1.0 Min. :0.000 Min. :0.00 Min. :1.00
1st Qu.:1.0 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3.00
Median :1.5 Median :0.000 Median :0.00 Median :3.00
Mean :1.5 Mean :0.008 Mean :0.24 Mean :3.45
3rd Qu.:2.0 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:4.00
Max. :3.0 Max. :1.000 Max. :4.00 Max. :5.00
grade sqft_above sqft_basement yr_built
Min. : 4.00 Min. : 520 Min. : 0 Min. :1900
1st Qu.: 7.00 1st Qu.:1220 1st Qu.: 0 1st Qu.:1954
Median : 7.00 Median :1610 Median : 0 Median :1977
Mean : 7.68 Mean :1813 Mean : 288 Mean :1972
3rd Qu.: 8.00 3rd Qu.:2200 3rd Qu.: 550 3rd Qu.:1996
Max. :12.00 Max. :6430 Max. :3500 Max. :2015
yr_renovated zipcode lat long
Min. : 0 Min. :98001 Min. :47.2 Min. :-122
1st Qu.: 0 1st Qu.:98033 1st Qu.:47.5 1st Qu.:-122
Median : 0 Median :98059 Median :47.6 Median :-122
Mean : 66 Mean :98076 Mean :47.6 Mean :-122
3rd Qu.: 0 3rd Qu.:98117 3rd Qu.:47.7 3rd Qu.:-122
Max. :2015 Max. :98199 Max. :47.8 Max. :-121
sqft_living15 sqft_lot15
Min. : 620 Min. : 915
1st Qu.:1500 1st Qu.: 5200
Median :1820 Median : 7830
Mean :1989 Mean : 13190
3rd Qu.:2370 3rd Qu.: 10142
Max. :5030 Max. :411962
# We will do a log-transformation on price
# because it is so heavily skewed
# Log-transform price
house_train <- house_train %>%
  mutate(price = log(price))
# Log-transform price
house_test <- house_test %>%
  mutate(price = log(price))
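One thing to keep in mind after this transformation: the models will now predict log(price), so exp() is needed to translate a prediction back to the original dollar scale. A tiny illustration with a made-up price:

```r
# A price on the log scale...
log_price <- log(450000)
log_price

# ...is converted back to dollars with exp()
exp(log_price)
```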
# Step 3: Fit models predicting price ------------------
# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)
# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)
# Step 4: Explore models -------------------------------
# Regression
summary(price_lm)
Call:
glm(formula = price ~ bedrooms + bathrooms + floors, data = house_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.312 -0.326 -0.002 0.298 1.945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.1723 0.0622 195.6 <2e-16 ***
bedrooms 0.0373 0.0187 2.0 0.046 *
bathrooms 0.3661 0.0234 15.6 <2e-16 ***
floors -0.0239 0.0299 -0.8 0.425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.191)
Null deviance: 277.59 on 999 degrees of freedom
Residual deviance: 190.42 on 996 degrees of freedom
AIC: 1189
Number of Fisher Scoring iterations: 2
# Decision Trees
price_rpart
n= 1000
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 278.00 13.1
2) bathrooms< 3.12 904 192.00 13.0
4) bathrooms< 1.62 216 40.90 12.7 *
5) bathrooms>=1.62 688 126.00 13.1
10) bathrooms< 2.62 598 99.10 13.0 *
11) bathrooms>=2.62 90 23.20 13.3 *
3) bathrooms>=3.12 96 35.40 13.7
6) bathrooms< 4.62 89 25.40 13.7
12) bedrooms< 3.5 17 3.42 13.3 *
13) bedrooms>=3.5 72 18.50 13.8 *
7) bathrooms>=4.62 7 1.52 14.8 *
plot(price_rpart)
text(price_rpart)
# Nicer version!
plot(as.party(price_rpart))
# Step 5: Assess fitting accuracy ----------------------------
# Get fitted values
lm_fit <- predict(price_lm,
                  newdata = house_train)
rpart_fit <- predict(price_rpart,
                     newdata = house_train)
# Regression Fitting Accuracy
postResample(pred = lm_fit,
             obs = house_train$price)
RMSE Rsquared MAE
0.436 0.314 0.351
# Decision Tree Fitting Accuracy
postResample(pred = rpart_fit,
             obs = house_train$price)
RMSE Rsquared MAE
0.432 0.328 0.346
# Step 6: Predict new data -------------------------
lm_pred <- predict(object = price_lm,
                   newdata = house_test)
rpart_pred <- predict(object = price_rpart,
                      newdata = house_test)
# Step 7: Compare accuracy --------------------------
# Regression Prediction Accuracy
postResample(pred = lm_pred,
             obs = house_test$price)
RMSE Rsquared MAE
0.438 0.308 0.352
# Decision Tree Prediction Accuracy
postResample(pred = rpart_pred,
             obs = house_test$price)
RMSE Rsquared MAE
0.444 0.289 0.354
# Plot results
# Tidy competition results
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred) %>%
  gather(group, prediction, -truth)
# Plot!
ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression versus decision trees")
Run Steps 0 through 4 in the Examples section above. Run each line of code individually and explore each object you create. Try to understand each of the steps! If you have trouble, ask for help!
Which of the three features (bedrooms, bathrooms, floors) do the regression and decision tree models use? Do they treat the features equally?
Run step 5. Look at the results. Which model has the best fitting performance?
Run Steps 6-7. Look at the results. Which model has the best prediction performance? How does each model’s prediction performance compare to its fitting performance?
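If the pattern behind these questions isn't obvious, here is a small self-contained sketch (simulated data, not the housing data) of why fitting accuracy usually looks better than prediction accuracy: a flexible model can chase noise in its training data.

```r
set.seed(1)
train <- data.frame(x = runif(30)); train$y <- train$x + rnorm(30, sd = 0.3)
test  <- data.frame(x = runif(30)); test$y  <- test$x  + rnorm(30, sd = 0.3)

# A very flexible model: a degree-10 polynomial
flexible <- lm(y ~ poly(x, 10), data = train)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
rmse(predict(flexible, newdata = train), train$y)  # fitting error
rmse(predict(flexible, newdata = test),  test$y)   # prediction error, typically larger
```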
Now try a random forest! Use the randomForest() function from the randomForest package to fit your model!
# Just add the following:
# Random forest
price_rf <- randomForest(formula = price ~ bedrooms + bathrooms + floors,
                         data = house_train)
# Get fitted values
rf_fit <- predict(price_rf,
                  newdata = house_train)
# Fitting accuracy
postResample(pred = rf_fit,
             obs = house_train$price)
RMSE Rsquared MAE
0.415 0.395 0.335
# Get predicted values
rf_pred <- predict(price_rf,
                   newdata = house_test)
# Prediction accuracy
postResample(pred = rf_pred,
             obs = house_test$price)
RMSE Rsquared MAE
0.435 0.326 0.347
# rf is better!
Until now, you’ve only been predicting price based on 3 features (bedrooms, bathrooms, and floors). Of course, you have access to lots more data to predict housing prices! Now it’s time to try using more data to predict price.
Look closely again at the columns in the house_train data. There are two features in the data that you definitely don't want to include in your models. Which two are they?
Remove those two features from your training data using the following template:
# Remove two features (columns) from house_train
house_train <- house_train %>%
  select(-XX, -XX)
# Remove two features (columns) from house_train
house_train <- house_train %>%
  select(-id, -date)
You can include all features in the data as predictors using the shorthand formula = price ~ ., where the . stands for "all other columns in the data".
# Regression example
price_lm <- glm(formula = price ~ .,
                data = house_train)
# Same with other models...
# Yes each model improves!!
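A minimal illustration of the . shorthand with a toy data frame (the names here are made up):

```r
df <- data.frame(y = c(1, 2, 3, 4),
                 a = c(1, 1, 2, 2),
                 b = c(0, 1, 0, 1))

fit <- lm(y ~ ., data = df)  # equivalent to y ~ a + b
names(coef(fit))             # "(Intercept)" "a" "b"
```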
# On your own!
So far, we have predicted house prices based on many features. Now, see how well you can predict the year a house was built (yr_built) based on the following four features: bedrooms, bathrooms, condition, and sqft_living.
Go through Steps 0 through 4 using regression (glm()) and decision trees (rpart()) to build models predicting the year a house was built (yr_built) based on bedrooms, bathrooms, condition, and sqft_living.
Based on your model exploration (Step 4), which of the four features seem to be the most important in predicting the year a house was built?
Complete Steps 5 through 7.
Which of your three models is the best at predicting the year a house was built?
For more details, check out caret's vignette using vignette("caret").
Also check out Applied Predictive Modeling by Kuhn & Johnson.