Overview

In this practical you’ll practice the basics of machine learning in R.

By the end of this practical you will know how to:

  1. Fit regression, decision tree, and random forest models to training data using their original model packages.
  2. Explore each fitted model object with generic functions.
  3. Predict outcomes from new data with all models.
  4. Evaluate each model's fitting and prediction performance (a schematic of this workflow follows the list).
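
Whichever model you use, the workflow is the same four generic calls. Here is a schematic sketch (MODEL(), train_data, test_data, and the variable names are placeholders, not real objects):

# Schematic only: MODEL stands for glm(), rpart(), or randomForest()
model_fit <- MODEL(formula = criterion ~ predictor_1 + predictor_2,
                   data = train_data)                    # 1. Fit to training data
summary(model_fit)                                       # 2. Explore the fitted object
model_pred <- predict(model_fit, newdata = test_data)    # 3. Predict new data
postResample(pred = model_pred,                          # 4. Evaluate accuracy (caret)
             obs = test_data$criterion)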

Packages

Package        Installation
tidyverse      install.packages("tidyverse")
broom          install.packages("broom")
rpart          install.packages("rpart")
FFTrees        install.packages("FFTrees")
partykit       install.packages("partykit")
party          install.packages("party")
randomForest   install.packages("randomForest")
caret          install.packages("caret")
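
If you prefer, the same packages can be installed in one call (equivalent to the rows above):

install.packages(c("tidyverse", "broom", "rpart", "FFTrees",
                   "partykit", "party", "randomForest", "caret"))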

Datasets

library(tidyverse)
library(rpart)
library(FFTrees)
library(partykit)
library(party)
library(randomForest)
library(broom)
library(caret)


attrition_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_train.csv")
attrition_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_test.csv")
heartdisease_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_train.csv")
heartdisease_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_test.csv")
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
File              Rows    Columns
house_train.csv    1000         21
house_test.csv    15000         21
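
As a quick sanity check (a sketch), you can confirm these dimensions after loading:

dim(house_train)   # 1000 rows, 21 columns
dim(house_test)    # 15000 rows, 21 columns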

Glossary

Function                   Package   Description
summary()                  base      Get summary information from an R object
names()                    base      See the named elements of a list
LIST$NAME                  base      Get the named element NAME from the list LIST
predict(object, newdata)   stats     Predict the criterion values of newdata based on object
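
Here is a minimal sketch of these functions in action, assuming house_train and house_test have been loaded as in the Datasets section above:

# Fit a simple regression, then inspect it with the generic functions
price_lm <- lm(formula = price ~ bedrooms, data = house_train)

summary(price_lm)       # Summary information about the model
names(price_lm)         # Named elements stored in the model object
price_lm$coefficients   # Extract one named element
predict(price_lm,       # Predict the criterion for new data
        newdata = house_test)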

Examples

# Machine learning basics ------------------------------------

# Step 0: Load packages-----------

library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(rpart)        # For rpart()
library(broom)        # For tidy()
library(caret)        # For postResample()
library(partykit)     # For nice decision trees

# Step 1: Load Training and Test data ----------------------

house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")

house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")

# Step 2: Explore training data -----------------------------

summary(house_train)

# We will do a log-transformation on price
#  because it is so heavily skewed

# Log-transform price
house_train <- house_train %>%
  mutate(price = log(price))

# Log-transform price
house_test <- house_test %>%
  mutate(price = log(price))
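
# Optional check (a sketch): compare the distribution before and after the
# transformation. price is already on the log scale here, so exp() recovers
# the original, heavily right-skewed prices.
hist(exp(house_train$price))   # original scale: strong right skew
hist(house_train$price)        # log scale: far more symmetric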

# Step 3: Fit models predicting price ------------------

# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)

# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)

# Step 4: Explore models -------------------------------

# Regression
summary(price_lm)

# Decision Trees
price_rpart
plot(price_rpart)
text(price_rpart)

# Nicer version!
plot(as.party(price_rpart))
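
# broom is loaded above for tidy(); a sketch of a tidier look at the
# regression coefficients (a data frame instead of printed summary() output)
tidy(price_lm)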


# Step 5: Assess fitting accuracy ----------------------------


# Get fitted values
lm_fit <- predict(price_lm, 
                 newdata = house_train)

rpart_fit <- predict(price_rpart, 
                    newdata = house_train)

# Regression Fitting Accuracy
postResample(pred = lm_fit, 
             obs = house_train$price)

# Decision Tree Fitting Accuracy
postResample(pred = rpart_fit, 
             obs = house_train$price)
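
# postResample() reports RMSE, R-squared, and MAE. As a sketch, the first and
# last of these for the regression fit, computed by hand:
sqrt(mean((lm_fit - house_train$price)^2))   # RMSE
mean(abs(lm_fit - house_train$price))        # MAE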

# Step 6: Predict new data -------------------------

lm_pred <- predict(object = price_lm, 
                   newdata = house_test)

rpart_pred <- predict(object = price_rpart, 
                      newdata = house_test)

# Step 7: Compare accuracy --------------------------

# Regression Prediction Accuracy
postResample(pred = lm_pred, 
             obs = house_test$price)

# Decision Tree Prediction Accuracy
postResample(pred = rpart_pred, 
             obs = house_test$price)


# Plot results

# Tidy competition results
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred) %>%
                       gather(group, prediction, -truth)

# Plot!
ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression versus decision trees")
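
The overview also mentions random forests; they follow exactly the same steps. A minimal sketch using the randomForest package (listed in the Packages section) with the same three predictors:

# Random Forest (sketch): fit, predict, and evaluate like the other models
library(randomForest)

price_rf <- randomForest(formula = price ~ bedrooms + bathrooms + floors,
                         data = house_train)

rf_pred <- predict(price_rf, newdata = house_test)

# Random Forest Prediction Accuracy
postResample(pred = rf_pred, obs = house_test$price)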

Tasks

A - Setup

  1. Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.
# Done!
  2. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called machinelearning_practical.R in the 2_Code folder.

  3. Using library(), load the packages for this practical listed in the Packages section above.

## NAME
## DATE
## Machine Learning Practical

library(XX)     
library(XX)
#...
  4. For this practical, we’ll use two datasets related to the prices of houses in King County, Washington: a training dataset house_train.csv and a test dataset house_test.csv. Using the following template, load the datasets into R as house_train and house_test:
house_train <- read_csv(file = "XXX/XXX")
  5. Take a look at the first few rows of each dataset by printing them to the console.
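For example (a sketch; printing a tibble shows its first 10 rows, head() its first 6):

house_train
head(house_test)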

B - Walking through the 7 steps

# Step 0: Load packages-----------

library(tidyverse)    # Load tidyverse for dplyr and tidyr
library(rpart)        # For rpart()
library(broom)        # For tidy()
library(caret)        # For postResample()
library(partykit)     # For nice decision trees

# Step 1: Load Training and Test data ----------------------

house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
Parsed with column specification:
cols(
  .default = col_integer(),
  id = col_character(),
  date = col_datetime(format = ""),
  price = col_double(),
  bathrooms = col_double(),
  floors = col_double(),
  lat = col_double(),
  long = col_double()
)
See spec(...) for full column specifications.
# Step 2: Explore training data -----------------------------

summary(house_train)
      id                 date                         price        
 Length:1000        Min.   :2014-05-02 00:00:00   Min.   :  82000  
 Class :character   1st Qu.:2014-07-24 00:00:00   1st Qu.: 324875  
 Mode  :character   Median :2014-10-13 12:00:00   Median : 452750  
                    Mean   :2014-10-27 04:07:40   Mean   : 549251  
                    3rd Qu.:2015-02-10 00:00:00   3rd Qu.: 636900  
                    Max.   :2015-05-14 00:00:00   Max.   :5110800  
    bedrooms      bathrooms     sqft_living      sqft_lot     
 Min.   :1.00   Min.   :0.75   Min.   : 520   Min.   :   740  
 1st Qu.:3.00   1st Qu.:1.75   1st Qu.:1440   1st Qu.:  5200  
 Median :3.00   Median :2.25   Median :1955   Median :  7782  
 Mean   :3.41   Mean   :2.17   Mean   :2100   Mean   : 15011  
 3rd Qu.:4.00   3rd Qu.:2.50   3rd Qu.:2540   3rd Qu.: 10779  
 Max.   :7.00   Max.   :6.00   Max.   :8010   Max.   :920423  
     floors      waterfront         view        condition   
 Min.   :1.0   Min.   :0.000   Min.   :0.00   Min.   :1.00  
 1st Qu.:1.0   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:3.00  
 Median :1.5   Median :0.000   Median :0.00   Median :3.00  
 Mean   :1.5   Mean   :0.008   Mean   :0.24   Mean   :3.45  
 3rd Qu.:2.0   3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:4.00  
 Max.   :3.0   Max.   :1.000   Max.   :4.00   Max.   :5.00  
     grade         sqft_above   sqft_basement     yr_built   
 Min.   : 4.00   Min.   : 520   Min.   :   0   Min.   :1900  
 1st Qu.: 7.00   1st Qu.:1220   1st Qu.:   0   1st Qu.:1954  
 Median : 7.00   Median :1610   Median :   0   Median :1977  
 Mean   : 7.68   Mean   :1813   Mean   : 288   Mean   :1972  
 3rd Qu.: 8.00   3rd Qu.:2200   3rd Qu.: 550   3rd Qu.:1996  
 Max.   :12.00   Max.   :6430   Max.   :3500   Max.   :2015  
  yr_renovated     zipcode           lat            long     
 Min.   :   0   Min.   :98001   Min.   :47.2   Min.   :-122  
 1st Qu.:   0   1st Qu.:98033   1st Qu.:47.5   1st Qu.:-122  
 Median :   0   Median :98059   Median :47.6   Median :-122  
 Mean   :  66   Mean   :98076   Mean   :47.6   Mean   :-122  
 3rd Qu.:   0   3rd Qu.:98117   3rd Qu.:47.7   3rd Qu.:-122  
 Max.   :2015   Max.   :98199   Max.   :47.8   Max.   :-121  
 sqft_living15    sqft_lot15    
 Min.   : 620   Min.   :   915  
 1st Qu.:1500   1st Qu.:  5200  
 Median :1820   Median :  7830  
 Mean   :1989   Mean   : 13190  
 3rd Qu.:2370   3rd Qu.: 10142  
 Max.   :5030   Max.   :411962  
# We will do a log-transformation on price
#  because it is so heavily skewed

# Log-transform price
house_train <- house_train %>%
  mutate(price = log(price))

# Log-transform price
house_test <- house_test %>%
  mutate(price = log(price))

# Step 3: Fit models predicting price ------------------

# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)

# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)

# Step 4: Explore models -------------------------------

# Regression
summary(price_lm)

Call:
glm(formula = price ~ bedrooms + bathrooms + floors, data = house_train)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.312  -0.326  -0.002   0.298   1.945  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.1723     0.0622   195.6   <2e-16 ***
bedrooms      0.0373     0.0187     2.0    0.046 *  
bathrooms     0.3661     0.0234    15.6   <2e-16 ***
floors       -0.0239     0.0299    -0.8    0.425    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.191)

    Null deviance: 277.59  on 999  degrees of freedom
Residual deviance: 190.42  on 996  degrees of freedom
AIC: 1189

Number of Fisher Scoring iterations: 2
# Decision Trees
price_rpart
n= 1000 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1000 278.00 13.1  
   2) bathrooms< 3.12 904 192.00 13.0  
     4) bathrooms< 1.62 216  40.90 12.7 *
     5) bathrooms>=1.62 688 126.00 13.1  
      10) bathrooms< 2.62 598  99.10 13.0 *
      11) bathrooms>=2.62 90  23.20 13.3 *
   3) bathrooms>=3.12 96  35.40 13.7  
     6) bathrooms< 4.62 89  25.40 13.7  
      12) bedrooms< 3.5 17   3.42 13.3 *
      13) bedrooms>=3.5 72  18.50 13.8 *
     7) bathrooms>=4.62 7   1.52 14.8 *
plot(price_rpart)
text(price_rpart)

# Nicer version!
plot(as.party(price_rpart))