Overview

In this practical you’ll practice the basics of machine learning in R.

By the end of this practical you will know how to:

  1. Fit regression models, decision trees, and random forests to training data using the original model packages.
  2. Explore each model object with generic functions.
  3. Predict outcomes from new data with all models.
  4. Evaluate each model’s fitting and prediction performance.

Packages

Package        Installation
tidyverse      install.packages("tidyverse")
broom          install.packages("broom")
rpart          install.packages("rpart")
FFTrees        install.packages("FFTrees")
partykit       install.packages("partykit")
party          install.packages("party")
randomForest   install.packages("randomForest")
caret          install.packages("caret")
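If you prefer, you can install all of these packages in one call:

# Install all packages for this practical at once (only needs to be done once)
install.packages(c("tidyverse", "broom", "rpart", "FFTrees",
                   "partykit", "party", "randomForest", "caret"))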

Datasets

library(tidyverse)
library(broom)
library(rpart)
library(FFTrees)
library(partykit)
library(party)
library(randomForest)
library(caret)


attrition_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_train.csv")
attrition_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_test.csv")
heartdisease_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_train.csv")
heartdisease_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_test.csv")
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
File              Rows    Columns
house_train.csv    1000    21
house_test.csv    15000    21
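After loading, you can confirm these dimensions with dim():

# Check the dimensions of the house data
dim(house_train)   # should be 1000 rows, 21 columns
dim(house_test)    # should be 15000 rows, 21 columns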

Glossary

Function                   Package   Description
summary()                  base      Get summary information from an R object
names()                    base      See the named elements of a list
LIST$NAME                  base      Get the named element NAME from the list LIST
predict(object, newdata)   stats     Predict the criterion values of newdata based on object
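As a minimal sketch of how these functions fit together (assuming the price_lm model and house_test data created in the Examples below):

# Explore a fitted model object with generic functions
summary(price_lm)         # summary information
names(price_lm)           # named elements of the model object
price_lm$coefficients     # access one named element

# Predict the criterion values of new data
predict(price_lm, newdata = house_test)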

Examples

# Machine learning basics ------------------------------------

# Step 0: Load packages-----------

library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(rpart)        # For rpart()
library(broom)        # For tidy()
library(caret)        # For postResample()
library(partykit)     # For nice decision trees

# Step 1: Load Training and Test data ----------------------

house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")

house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")

# Step 2: Explore training data -----------------------------

summary(house_train)

# We will do a log-transformation on price
#  because it is so heavily skewed

# Log-transform price in the training data
house_train <- house_train %>%
  mutate(price = log(price))

# Log-transform price in the test data
house_test <- house_test %>%
  mutate(price = log(price))

# Step 3: Fit models predicting price ------------------

# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)

# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)

# Step 4: Explore models -------------------------------

# Regression
summary(price_lm)

# Decision Trees
price_rpart
plot(price_rpart)
text(price_rpart)

# Nicer version!
plot(as.party(price_rpart))


# Step 5: Assess fitting accuracy ----------------------------


# Get fitted values
lm_fit <- predict(price_lm, 
                 newdata = house_train)

rpart_fit <- predict(price_rpart, 
                    newdata = house_train)

# Regression Fitting Accuracy
postResample(pred = lm_fit, 
             obs = house_train$price)

# Decision Tree Fitting Accuracy
postResample(pred = rpart_fit, 
             obs = house_train$price)

# Step 6: Predict new data -------------------------

lm_pred <- predict(object = price_lm, 
                   newdata = house_test)

rpart_pred <- predict(object = price_rpart, 
                      newdata = house_test)

# Step 7: Compare accuracy --------------------------

# Regression Prediction Accuracy
postResample(pred = lm_pred, 
             obs = house_test$price)

# Decision Tree Prediction Accuracy
postResample(pred = rpart_pred, 
             obs = house_test$price)


# Plot results

# Tidy competition results
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred) %>%
                       gather(group, prediction, -truth)

# Plot!
ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression versus decision trees")

Tasks

A - Setup

  1. Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.
# Done!
  2. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called machinelearning_practical.R in the 2_Code folder.

  3. Using library(), load the set of packages for this practical listed in the Packages section above.

## NAME
## DATE
## Machine Learning Practical

library(XX)     
library(XX)
#...
  4. For this practical, we’ll use two datasets related to the prices of houses in King County, Washington: a training dataset house_train.csv and a model testing dataset house_test.csv. Using the following template, load the datasets into R as house_train and house_test:
house_train <- read_csv(file = "XXX/XXX")
  5. Take a look at the first few rows of each dataset by printing them to the console.
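Because read_csv() returns a tibble, simply printing each object shows its first rows:

# Print the first few rows of each dataset
house_train
house_test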

B - Walking through the 7 steps

# Step 0: Load packages-----------

library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(rpart)        # For rpart()
library(broom)        # For tidy()
library(caret)        # For postResample()
library(partykit)     # For nice decision trees

# Step 1: Load Training and Test data ----------------------

house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
# Step 2: Explore training data -----------------------------

summary(house_train)
      id                 date                         price        
 Length:1000        Min.   :2014-05-02 00:00:00   Min.   :  82000  
 Class :character   1st Qu.:2014-07-24 00:00:00   1st Qu.: 324875  
 Mode  :character   Median :2014-10-13 12:00:00   Median : 452750  
                    Mean   :2014-10-27 04:07:40   Mean   : 549251  
                    3rd Qu.:2015-02-10 00:00:00   3rd Qu.: 636900  
                    Max.   :2015-05-14 00:00:00   Max.   :5110800  
    bedrooms      bathrooms     sqft_living      sqft_lot     
 Min.   :1.00   Min.   :0.75   Min.   : 520   Min.   :   740  
 1st Qu.:3.00   1st Qu.:1.75   1st Qu.:1440   1st Qu.:  5200  
 Median :3.00   Median :2.25   Median :1955   Median :  7782  
 Mean   :3.41   Mean   :2.17   Mean   :2100   Mean   : 15011  
 3rd Qu.:4.00   3rd Qu.:2.50   3rd Qu.:2540   3rd Qu.: 10779  
 Max.   :7.00   Max.   :6.00   Max.   :8010   Max.   :920423  
     floors      waterfront         view        condition   
 Min.   :1.0   Min.   :0.000   Min.   :0.00   Min.   :1.00  
 1st Qu.:1.0   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:3.00  
 Median :1.5   Median :0.000   Median :0.00   Median :3.00  
 Mean   :1.5   Mean   :0.008   Mean   :0.24   Mean   :3.45  
 3rd Qu.:2.0   3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:4.00  
 Max.   :3.0   Max.   :1.000   Max.   :4.00   Max.   :5.00  
     grade         sqft_above   sqft_basement     yr_built   
 Min.   : 4.00   Min.   : 520   Min.   :   0   Min.   :1900  
 1st Qu.: 7.00   1st Qu.:1220   1st Qu.:   0   1st Qu.:1954  
 Median : 7.00   Median :1610   Median :   0   Median :1977  
 Mean   : 7.68   Mean   :1813   Mean   : 288   Mean   :1972  
 3rd Qu.: 8.00   3rd Qu.:2200   3rd Qu.: 550   3rd Qu.:1996  
 Max.   :12.00   Max.   :6430   Max.   :3500   Max.   :2015  
  yr_renovated     zipcode           lat            long     
 Min.   :   0   Min.   :98001   Min.   :47.2   Min.   :-122  
 1st Qu.:   0   1st Qu.:98033   1st Qu.:47.5   1st Qu.:-122  
 Median :   0   Median :98059   Median :47.6   Median :-122  
 Mean   :  66   Mean   :98076   Mean   :47.6   Mean   :-122  
 3rd Qu.:   0   3rd Qu.:98117   3rd Qu.:47.7   3rd Qu.:-122  
 Max.   :2015   Max.   :98199   Max.   :47.8   Max.   :-121  
 sqft_living15    sqft_lot15    
 Min.   : 620   Min.   :   915  
 1st Qu.:1500   1st Qu.:  5200  
 Median :1820   Median :  7830  
 Mean   :1989   Mean   : 13190  
 3rd Qu.:2370   3rd Qu.: 10142  
 Max.   :5030   Max.   :411962  
# We will do a log-transformation on price
#  because it is so heavily skewed

# Log-transform price in the training data
house_train <- house_train %>%
  mutate(price = log(price))

# Log-transform price in the test data
house_test <- house_test %>%
  mutate(price = log(price))

# Step 3: Fit models predicting price ------------------

# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)

# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)

# Step 4: Explore models -------------------------------

# Regression
summary(price_lm)

Call:
glm(formula = price ~ bedrooms + bathrooms + floors, data = house_train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.312  -0.326  -0.002   0.298   1.945  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.1723     0.0622   195.6   <2e-16 ***
bedrooms      0.0373     0.0187     2.0    0.046 *  
bathrooms     0.3661     0.0234    15.6   <2e-16 ***
floors       -0.0239     0.0299    -0.8    0.425    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.191)

    Null deviance: 277.59  on 999  degrees of freedom
Residual deviance: 190.42  on 996  degrees of freedom
AIC: 1189

Number of Fisher Scoring iterations: 2
# Decision Trees
price_rpart
n= 1000 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1000 278.00 13.1  
   2) bathrooms< 3.12 904 192.00 13.0  
     4) bathrooms< 1.62 216  40.90 12.7 *
     5) bathrooms>=1.62 688 126.00 13.1  
      10) bathrooms< 2.62 598  99.10 13.0 *
      11) bathrooms>=2.62 90  23.20 13.3 *
   3) bathrooms>=3.12 96  35.40 13.7  
     6) bathrooms< 4.62 89  25.40 13.7  
      12) bedrooms< 3.5 17   3.42 13.3 *
      13) bedrooms>=3.5 72  18.50 13.8 *
     7) bathrooms>=4.62 7   1.52 14.8 *
plot(price_rpart)
text(price_rpart)

# Nicer version!
plot(as.party(price_rpart))

# Step 5: Assess fitting accuracy ----------------------------


# Get fitted values
lm_fit <- predict(price_lm, 
                 newdata = house_train)

rpart_fit <- predict(price_rpart, 
                    newdata = house_train)

# Regression Fitting Accuracy
postResample(pred = lm_fit, 
             obs = house_train$price)
    RMSE Rsquared      MAE 
   0.436    0.314    0.351 
# Decision Tree Fitting Accuracy
postResample(pred = rpart_fit, 
             obs = house_train$price)
    RMSE Rsquared      MAE 
   0.432    0.328    0.346 
# Step 6: Predict new data -------------------------

lm_pred <- predict(object = price_lm, 
                   newdata = house_test)

rpart_pred <- predict(object = price_rpart, 
                      newdata = house_test)

# Step 7: Compare accuracy --------------------------

# Regression Prediction Accuracy
postResample(pred = lm_pred, 
             obs = house_test$price)
    RMSE Rsquared      MAE 
   0.438    0.308    0.352 
# Decision Tree Prediction Accuracy
postResample(pred = rpart_pred, 
             obs = house_test$price)
    RMSE Rsquared      MAE 
   0.444    0.289    0.354 
# Plot results

# Tidy competition results
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred) %>%
                       gather(group, prediction, -truth)

# Plot!
ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression versus decision trees")

  1. Run Steps 0 through 4 in the Examples section above. Run each line of code individually and explore each object you create. Try to understand each of the steps! If you have trouble, ask for help!

  2. Which of the three features (bedrooms, bathrooms, floors) do the regression and decision tree models use? Do they treat the features equally?
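One way to inspect this is sketched below; it assumes the price_lm and price_rpart objects from Steps 0 through 4 are in your workspace (variable.importance is the element where rpart stores how much each feature contributes to the tree’s splits):

# Inspect the regression coefficients
tidy(price_lm)

# See which features the decision tree actually uses
price_rpart$variable.importance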

  3. Run Step 5. Look at the results. Which model has the best fitting performance?

  4. Run Steps 6-7. Look at the results. Which model has the best prediction performance? How does each model’s prediction performance compare to its fitting performance?

D - Include random forests

  1. Random forests are much more complex than regression and decision trees. Try including random forests in your analyses as a new competitor. You can use the randomForest() function from the randomForest package to fit your model!
# Just add the following:

# Random forest
price_rf <- randomForest(formula = price ~ bedrooms + bathrooms + floors,
                         data = house_train)


# Get fitted values
rf_fit <- predict(price_rf, 
                 newdata = house_train)


# Fitting accuracy
postResample(pred = rf_fit, 
             obs = house_train$price)
    RMSE Rsquared      MAE 
   0.415    0.395    0.335 
# Get predicted values for the test data
rf_pred <- predict(price_rf, 
                 newdata = house_test)


# Prediction accuracy
postResample(pred = rf_pred, 
             obs = house_test$price)
    RMSE Rsquared      MAE 
   0.435    0.326    0.347 
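If you also want to see random forests in the Step 7 prediction plot, you could extend the competition results like this (a sketch reusing the earlier plotting code):

# Add random forests to the competition results and re-plot
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred,
                              Random_Forests = rf_pred) %>%
                       gather(group, prediction, -truth)

ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression, decision trees, and random forests")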
  2. How does the fitting performance of random forests compare to the other algorithms in training?
# Random forests fit the training data best (lowest RMSE, highest R-squared)!
  3. How does the prediction performance of random forests compare to the other algorithms in testing?
# Random forests also predict the test data best, though by a smaller margin than in training!

E - Include more features

Until now, you’ve only been predicting price based on 3 features (bedrooms, bathrooms, and floors). Of course, you have access to lots more data to predict housing prices! Now it’s time to try using more data to predict price.

  1. Look closely again at the columns in the house_train data. There are two features in the data that you definitely don’t want to include in your models. Which two are they?

  2. Remove those two features from your training data using the following template:

# Remove two features (columns) from house_train
house_train <- house_train %>%
  select(-XX, -XX)
# Remove two features (columns) from house_train
house_train <- house_train %>%
  select(-id, -date)
  3. Re-run your models, but now predict price based on all of the features. To do this, use the formula shorthand formula = price ~ . (the period means “all other features in the data”).
# Regression example
price_lm <- glm(formula = price ~ .,
                data = house_train)

# Same with other models...
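# For instance, the other models might look like this (a sketch,
# assuming id and date have already been removed from house_train):

price_rpart <- rpart(formula = price ~ .,
                     data = house_train)

price_rf <- randomForest(formula = price ~ .,
                         data = house_train)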
  4. How does the overall fitting and prediction performance of the models compare to when you only used three features? Did each model improve?
# Yes, each model's fitting and prediction performance improves!

F - Predict the year a house was built

# On your own!

So far, we have predicted house prices based on many features. Now, see how well you can predict the year a house was built (yr_built) based on the following four features: bedrooms, bathrooms, condition, sqft_living.

  1. Go through Steps 0 through 4 using regression (glm()) and decision trees (rpart()) to build models predicting the year a house was built (yr_built) based on bedrooms, bathrooms, condition, and sqft_living.
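If you want a starting point for Step 3, a minimal sketch might look like this (the object names yr_lm, yr_rpart, and yr_rf are just suggestions; a random forest is included because task 4 below compares three models):

# Fit models predicting yr_built (a sketch)
yr_lm <- glm(formula = yr_built ~ bedrooms + bathrooms + condition + sqft_living,
             data = house_train)

yr_rpart <- rpart(formula = yr_built ~ bedrooms + bathrooms + condition + sqft_living,
                  data = house_train)

yr_rf <- randomForest(formula = yr_built ~ bedrooms + bathrooms + condition + sqft_living,
                      data = house_train)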

  2. Based on your model exploration (Step 4), which of the four features seem to be the most important in predicting the year a house was built?

  3. Complete Steps 5 through 7.

  4. Which of your three models is the best at predicting the year a house was built?

Additional reading