In this practical you’ll practice the basics of machine learning in R.
By the end of this practical you will know how to:
Package | Installation |
---|---|
tidyverse | install.packages("tidyverse") |
broom | install.packages("broom") |
rpart | install.packages("rpart") |
FFTrees | install.packages("FFTrees") |
partykit | install.packages("partykit") |
party | install.packages("party") |
randomForest | install.packages("randomForest") |
caret | install.packages("caret") |
library(tidyverse)
library(rpart)
library(FFTrees)
library(partykit)
library(party)
library(randomForest)
library(rminer)
library(caret)
attrition_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_train.csv")
attrition_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/attrition_test.csv")
heartdisease_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_train.csv")
heartdisease_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/heartdisease_test.csv")
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
File | Rows | Columns |
---|---|---|
house_train.csv | 1000 | 21 |
house_test.csv | 15000 | 21 |
Function | Package | Description |
---|---|---|
summary() | base | Get summary information from an R object |
names() | base | See the named elements of a list |
LIST$NAME | base | Get the named element NAME from the list LIST |
predict(object, newdata) | stats | Predict the criterion values of newdata based on object |
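The four functions in the table can be tried out together on a small model. The following is a minimal sketch, assuming the built-in mtcars dataset and a linear model fitted with lm(); the object names car_lm, car_coefs, new_cars and car_pred are made up for this illustration and are not part of the practical.

```r
# Fit a small linear model on mtcars, a dataset that ships with R
car_lm <- lm(formula = mpg ~ wt + hp, data = mtcars)

# summary(): get summary information from the fitted model
car_summary <- summary(car_lm)
car_summary

# names(): see which named elements the model object contains
names(car_lm)

# LIST$NAME: pull out one named element, here the coefficients
car_coefs <- car_lm$coefficients
car_coefs

# predict(object, newdata): predicted mpg for two hypothetical cars
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))
car_pred <- predict(object = car_lm, newdata = new_cars)
car_pred
```

The same pattern should carry over to the models in this practical: fit a model, inspect it with summary() and names(), then pass new data to predict().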
# Machine learning basics ------------------------------------
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(rpart) # For rpart()
library(broom) # For tidy()
library(caret) # For postResample()
library(partykit) # For nice decision trees
# Step 1: Load Training and Test data ----------------------
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
# Step 2: Explore training data -----------------------------
summary(house_train)
# We will do a log-transformation on price
# because it is so heavily skewed
# Log-transform price in the training data
house_train <- house_train %>%
  mutate(price = log(price))
# Log-transform price in the test data
house_test <- house_test %>%
  mutate(price = log(price))
# Step 3: Fit models predicting price ------------------
# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)
# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)
# Step 4: Explore models -------------------------------
# Regression
summary(price_lm)
# Decision Trees
price_rpart
plot(price_rpart)
text(price_rpart)
# Nicer version!
plot(as.party(price_rpart))
# Step 5: Assess fitting accuracy ----------------------------
# Get fitted values
lm_fit <- predict(price_lm,
                  newdata = house_train)
rpart_fit <- predict(price_rpart,
                     newdata = house_train)
# Regression Fitting Accuracy
postResample(pred = lm_fit,
             obs = house_train$price)
# Decision Tree Fitting Accuracy
postResample(pred = rpart_fit,
             obs = house_train$price)
# Step 6: Predict new data -------------------------
lm_pred <- predict(object = price_lm,
                   newdata = house_test)
rpart_pred <- predict(object = price_rpart,
                      newdata = house_test)
# Step 7: Compare accuracy --------------------------
# Regression Prediction Accuracy
postResample(pred = lm_pred,
             obs = house_test$price)
# Decision Tree Prediction Accuracy
postResample(pred = rpart_pred,
             obs = house_test$price)
# Plot results
# Tidy competition results
competition_results <- tibble(truth = house_test$price,
                              Regression = lm_pred,
                              Decision_Trees = rpart_pred) %>%
  gather(group, prediction, -truth)
# Plot!
ggplot(data = competition_results,
       aes(x = truth, y = prediction, col = group)) +
  geom_point(alpha = .2) +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Predicting housing prices",
       subtitle = "Regression versus decision trees")
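Steps 5 and 7 above rely on postResample() to summarise accuracy. For a numeric criterion it reports three values: RMSE, Rsquared and MAE. The sketch below recomputes them in base R to show what the numbers mean. It assumes that Rsquared is the squared correlation between predictions and observations, which is my understanding of how caret defines it. The helper accuracy_by_hand() and the toy vectors are made up for this illustration.

```r
# Hand-rolled versions of the three accuracy measures.
# accuracy_by_hand() is a hypothetical helper, not a caret function.
accuracy_by_hand <- function(pred, obs) {
  c(RMSE     = sqrt(mean((pred - obs)^2)),  # root mean squared error
    Rsquared = cor(pred, obs)^2,            # squared correlation (assumed definition)
    MAE      = mean(abs(pred - obs)))       # mean absolute error
}

# Toy values on the log-price scale (invented for illustration)
obs  <- c(12.5, 13.0, 13.4, 12.9, 13.8)
pred <- c(12.7, 13.1, 13.2, 13.0, 13.5)

toy_accuracy <- accuracy_by_hand(pred = pred, obs = obs)
toy_accuracy
```

Lower RMSE and MAE mean predictions sit closer to the true values, while a higher Rsquared means predictions and true values move together more closely. Comparing these values between the training data and the test data shows how much accuracy is lost when a model is applied to new cases.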
Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.
# Done!
Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called machinelearning_practical.R in the 2_Code folder.
Using library(), load the set of packages for this practical listed in the packages section above.
## NAME
## DATE
## Machine learning practical
library(XX)
library(XX)
#...
For this practical, you will use a model training dataset house_train.csv and a model testing dataset house_test.csv. Using the following template, load the datasets into R as house_train and house_test:
house_train <- read_csv(file = "XXX/XXX")
# Step 0: Load packages-----------
library(tidyverse) # Load tidyverse for dplyr and tidyr
library(rpart) # For rpart()
library(broom) # For tidy()
library(caret) # For postResample()
library(partykit) # For nice decision trees
# Step 1: Load Training and Test data ----------------------
house_train <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_train.csv")
Parsed with column specification:
cols(
.default = col_integer(),
id = col_character(),
date = col_datetime(format = ""),
price = col_double(),
bathrooms = col_double(),
floors = col_double(),
lat = col_double(),
long = col_double()
)
See spec(...) for full column specifications.
house_test <- read_csv("https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/house_test.csv")
Parsed with column specification:
cols(
.default = col_integer(),
id = col_character(),
date = col_datetime(format = ""),
price = col_double(),
bathrooms = col_double(),
floors = col_double(),
lat = col_double(),
long = col_double()
)
See spec(...) for full column specifications.
# Step 2: Explore training data -----------------------------
summary(house_train)
id date price
Length:1000 Min. :2014-05-02 00:00:00 Min. : 82000
Class :character 1st Qu.:2014-07-24 00:00:00 1st Qu.: 324875
Mode :character Median :2014-10-13 12:00:00 Median : 452750
Mean :2014-10-27 04:07:40 Mean : 549251
3rd Qu.:2015-02-10 00:00:00 3rd Qu.: 636900
Max. :2015-05-14 00:00:00 Max. :5110800
bedrooms bathrooms sqft_living sqft_lot
Min. :1.00 Min. :0.75 Min. : 520 Min. : 740
1st Qu.:3.00 1st Qu.:1.75 1st Qu.:1440 1st Qu.: 5200
Median :3.00 Median :2.25 Median :1955 Median : 7782
Mean :3.41 Mean :2.17 Mean :2100 Mean : 15011
3rd Qu.:4.00 3rd Qu.:2.50 3rd Qu.:2540 3rd Qu.: 10779
Max. :7.00 Max. :6.00 Max. :8010 Max. :920423
floors waterfront view condition
Min. :1.0 Min. :0.000 Min. :0.00 Min. :1.00
1st Qu.:1.0 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3.00
Median :1.5 Median :0.000 Median :0.00 Median :3.00
Mean :1.5 Mean :0.008 Mean :0.24 Mean :3.45
3rd Qu.:2.0 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:4.00
Max. :3.0 Max. :1.000 Max. :4.00 Max. :5.00
grade sqft_above sqft_basement yr_built
Min. : 4.00 Min. : 520 Min. : 0 Min. :1900
1st Qu.: 7.00 1st Qu.:1220 1st Qu.: 0 1st Qu.:1954
Median : 7.00 Median :1610 Median : 0 Median :1977
Mean : 7.68 Mean :1813 Mean : 288 Mean :1972
3rd Qu.: 8.00 3rd Qu.:2200 3rd Qu.: 550 3rd Qu.:1996
Max. :12.00 Max. :6430 Max. :3500 Max. :2015
yr_renovated zipcode lat long
Min. : 0 Min. :98001 Min. :47.2 Min. :-122
1st Qu.: 0 1st Qu.:98033 1st Qu.:47.5 1st Qu.:-122
Median : 0 Median :98059 Median :47.6 Median :-122
Mean : 66 Mean :98076 Mean :47.6 Mean :-122
3rd Qu.: 0 3rd Qu.:98117 3rd Qu.:47.7 3rd Qu.:-122
Max. :2015 Max. :98199 Max. :47.8 Max. :-121
sqft_living15 sqft_lot15
Min. : 620 Min. : 915
1st Qu.:1500 1st Qu.: 5200
Median :1820 Median : 7830
Mean :1989 Mean : 13190
3rd Qu.:2370 3rd Qu.: 10142
Max. :5030 Max. :411962
# We will do a log-transformation on price
# because it is so heavily skewed
# Log-transform price in the training data
house_train <- house_train %>%
  mutate(price = log(price))
# Log-transform price in the test data
house_test <- house_test %>%
  mutate(price = log(price))
# Step 3: Fit models predicting price ------------------
# Regression
price_lm <- glm(formula = price ~ bedrooms + bathrooms + floors,
                data = house_train)
# Decision Trees
price_rpart <- rpart(formula = price ~ bedrooms + bathrooms + floors,
                     data = house_train)
# Step 4: Explore models -------------------------------
# Regression
summary(price_lm)
Call:
glm(formula = price ~ bedrooms + bathrooms + floors, data = house_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.312 -0.326 -0.002 0.298 1.945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.1723 0.0622 195.6 <2e-16 ***
bedrooms 0.0373 0.0187 2.0 0.046 *
bathrooms 0.3661 0.0234 15.6 <2e-16 ***
floors -0.0239 0.0299 -0.8 0.425
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.191)
Null deviance: 277.59 on 999 degrees of freedom
Residual deviance: 190.42 on 996 degrees of freedom
AIC: 1189
Number of Fisher Scoring iterations: 2
# Decision Trees
price_rpart
n= 1000
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 278.00 13.1
2) bathrooms< 3.12 904 192.00 13.0
4) bathrooms< 1.62 216 40.90 12.7 *
5) bathrooms>=1.62 688 126.00 13.1
10) bathrooms< 2.62 598 99.10 13.0 *
11) bathrooms>=2.62 90 23.20 13.3 *
3) bathrooms>=3.12 96 35.40 13.7
6) bathrooms< 4.62 89 25.40 13.7
12) bedrooms< 3.5 17 3.42 13.3 *
13) bedrooms>=3.5 72 18.50 13.8 *
7) bathrooms>=4.62 7 1.52 14.8 *
plot(price_rpart)
text(price_rpart)
# Nicer version!
plot(as.party(price_rpart))
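Because price was log-transformed before fitting, the coefficients and the tree's yval entries in the output above are on the log scale. The sketch below converts some of them back with exp() so they are easier to read. The numbers are copied from the printed output, so they carry its rounding, and the object names are made up for this illustration.

```r
# yval of the six terminal nodes, copied from the rpart printout (log scale)
leaf_log_price <- c(12.7, 13.0, 13.3, 13.3, 13.8, 14.8)

# Back-transform to the original price scale
leaf_price <- exp(leaf_log_price)
round(leaf_price)

# Bathrooms coefficient from the regression summary (log scale)
bathrooms_coef <- 0.3661

# On the original scale, a log-scale coefficient acts as a multiplier
bathrooms_multiplier <- exp(bathrooms_coef)
bathrooms_multiplier
```

Read this way, one additional bathroom is associated with a predicted price that is roughly 1.44 times as high, holding bedrooms and floors constant. Note that exp() of an average log price gives the geometric mean of the prices in a node, which is typically lower than their arithmetic mean.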