In this practical you’ll conduct machine learning analyses on a dataset on heart disease. You will see how well several different machine learning models can predict new data. By the end of this practical you will know how to train machine learning models on one set of data, explore the fitted models, and compare how accurately they classify new data.
Here are the main functions and packages you’ll be using. For more information about the specific models, click on the link in Additional Details.
Algorithm | Function | Package | Additional Details |
---|---|---|---|
Regression | glm() | Base R | https://bookdown.org/ndphillips/YaRrr/regression.html#the-linear-model |
Fast-and-Frugal decision trees | FFTrees() | FFTrees | https://cran.r-project.org/web/packages/FFTrees/vignettes/guide.html |
Random Forests | randomForest() | randomForest | http://www.blopig.com/blog/2017/04/a-very-basic-introduction-to-random-forests-using-r/ |
You’ll use two datasets in this practical: heartdisease.csv and ACTG175.csv. They are available in the data_BaselRBootcamp_Day2.zip file on the main course page. If you haven’t already, download data_BaselRBootcamp_Day2.zip and unzip it to get the two files.
# -----------------------------------------------
# A step-by-step tutorial for conducting machine learning
# In this tutorial, we'll see how well 3 different models can
# predict medical data
# ------------------------------------------------
# -----------------------
# Part A:
# Load libraries
# -----------------------
library(tidyverse) # for dplyr and ggplot2
library(randomForest) # for randomForest()
library(FFTrees) # for FFTrees and the heartdisease data
# -----------------------
# Part B: Create datasets
# heart_train, heart_test
# -----------------------
heartdisease <- read_csv(file = "data/heartdisease.csv") # Read in the heartdisease data
set.seed(100) # To fix the training / test randomization
heartdisease <- heartdisease %>%
mutate_if(is.character, factor) %>% # Convert character to factor
sample_frac(1) # Randomly sort rows
# Save first 100 rows as heart_train and remaining as heart_test
heart_train <- heartdisease %>%
slice(1:100)
heart_test <- heartdisease %>%
slice(101:nrow(heartdisease))
# ------------------------------
# Part I: Build Models
# ------------------------------
# Build FFTrees_model
FFTrees_model <- FFTrees(formula = sex ~ .,
data = heart_train)
# Build glm_model
glm_model <- glm(formula = factor(sex) ~ .,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_model <- randomForest(formula = factor(sex) ~ .,
data = heart_train)
# ------------------------------
# Part II: Explore Models
# ------------------------------
print(FFTrees_model)
summary(FFTrees_model)
print(glm_model)
summary(glm_model)
print(randomForest_model)
summary(randomForest_model)
# ------------------------------
# Part III: Training Accuracy
# ------------------------------
# FFTrees training decisions
FFTrees_fit <- predict(object = FFTrees_model,
newdata = heart_train)
# Regression training decisions
# Positive values are predicted to be 1, negative values are 0
glm_fit <- predict(object = glm_model,
newdata = heart_train) > 0
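# Note (illustrative sketch, not required for the practical): the same decisions
# can be made on the probability scale. predict() with type = "response" returns
# fitted probabilities, so a 0.5 cutoff is equivalent to the 0 cutoff on the
# logit scale used above. The object name glm_fit_prob is just for this example.
glm_fit_prob <- predict(object = glm_model,
                        newdata = heart_train,
                        type = "response") > 0.5
mean(glm_fit_prob == glm_fit)  # should be 1: both cutoffs give the same decisions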
# randomForest training decisions
randomForest_fit <- predict(object = randomForest_model,
newdata = heart_train)
# Now calculate fitting accuracies and put in dataframe
# Truth value for training data is heart_train$sex
train_truth <- heart_train$sex
# Put training results together
training_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_fit == train_truth),
mean(glm_fit == train_truth),
mean(randomForest_fit == train_truth)))
# Plot training results
ggplot(data = training_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Training Accuracy",
y = "Correct Classifications")
# ------------------------------
# Part IV: Prediction Accuracy!
# ------------------------------
# Calculate predictions for each model for heart_test
# FFTrees testing decisions
FFTrees_pred <- predict(object = FFTrees_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_pred <- predict(object = glm_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_pred <- predict(object = randomForest_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$sex
test_truth <- heart_test$sex
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
A. Open your BaselRBootcamp project. This project should have the folders R and data in the working directory.
B. Open a new R script and save it in the R folder as a new file called machinelearning_practical.R. At the top of the script, using comments, write your name and the date. Then, load the tidyverse, randomForest and FFTrees packages. Here’s how the top of your script should look:
## NAME
## DATE
## Machine Learning Practical
library(tidyverse) # for dplyr and ggplot2
── Attaching packages ─────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1 ✔ purrr 0.2.4
✔ tibble 1.4.2 ✔ dplyr 0.7.4
✔ tidyr 0.8.0 ✔ stringr 1.3.0
✔ readr 1.1.1 ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(randomForest) # for randomForest()
Warning: package 'randomForest' was built under R version 3.4.4
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
library(FFTrees) # for FFTrees and the heartdisease data
O
/ \
F O
/ \
F T
FFTrees v1.3.5. Email: Nathaniel.D.Phillips.is@gmail.com
FFTrees.guide() opens the package guide. Citation info at citation('FFTrees')
C. Make sure the heartdisease.csv file is in the data folder in your working directory. Then, using read_csv(), read the heartdisease.csv data and assign it to a new object in R called heartdisease.
heartdisease <- read_csv(file = "data/heartdisease.csv")
D. Explore the heartdisease dataset using summary() and View().
summary(heartdisease)
age sex cp trestbps
Min. :29.00 Min. :0.0000 Length:303 Min. : 94.0
1st Qu.:48.00 1st Qu.:0.0000 Class :character 1st Qu.:120.0
Median :56.00 Median :1.0000 Mode :character Median :130.0
Mean :54.44 Mean :0.6799 Mean :131.7
3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:140.0
Max. :77.00 Max. :1.0000 Max. :200.0
chol fbs restecg thalach
Min. :126.0 Min. :0.0000 Length:303 Min. : 71.0
1st Qu.:211.0 1st Qu.:0.0000 Class :character 1st Qu.:133.5
Median :241.0 Median :0.0000 Mode :character Median :153.0
Mean :246.7 Mean :0.1485 Mean :149.6
3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:166.0
Max. :564.0 Max. :1.0000 Max. :202.0
exang oldpeak slope ca
Min. :0.0000 Min. :0.00 Length:303 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00 Class :character 1st Qu.:0.0000
Median :0.0000 Median :0.80 Mode :character Median :0.0000
Mean :0.3267 Mean :1.04 Mean :0.6766
3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:1.0000
Max. :1.0000 Max. :6.20 Max. :3.0000
thal diagnosis
Length:303 Min. :0.0000
Class :character 1st Qu.:0.0000
Mode :character Median :0.0000
Mean :0.4587
3rd Qu.:1.0000
Max. :1.0000
# View(heartdisease)
E. There is a help menu for the heartdisease dataset in the FFTrees package. Look at the help menu for heartdisease by running ?heartdisease.
?heartdisease
F. Create separate dataframes heart_train for model training and heart_test for model testing. Print each of these dataframes to make sure they look ok.
set.seed(100) # To fix the training / test randomization
heartdisease <- heartdisease %>%
mutate_if(is.character, factor) %>% # Convert character to factor
sample_frac(1) # Randomly sort rows
Warning: package 'bindrcpp' was built under R version 3.4.4
# Save first 100 rows as heart_train and remaining as heart_test
heart_train <- heartdisease %>%
slice(1:100)
heart_test <- heartdisease %>%
slice(101:nrow(heartdisease))
heart_train
# A tibble: 100 x 14
age sex cp trestbps chol fbs restecg thalach exang oldpeak
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 44. 0. np 108. 141. 0. normal 175. 0. 0.600
2 51. 0. np 140. 308. 0. hypertrop… 142. 0. 1.50
3 52. 1. np 138. 223. 0. normal 169. 0. 0.
4 48. 1. aa 110. 229. 0. normal 168. 0. 1.00
5 59. 1. aa 140. 221. 0. normal 164. 1. 0.
6 58. 1. np 105. 240. 0. hypertrop… 154. 1. 0.600
7 41. 0. aa 126. 306. 0. normal 163. 0. 0.
8 39. 1. a 118. 219. 0. normal 140. 0. 1.20
9 77. 1. a 125. 304. 0. hypertrop… 162. 1. 0.
10 41. 0. aa 105. 198. 0. normal 168. 0. 0.
# ... with 90 more rows, and 4 more variables: slope <fct>, ca <dbl>,
# thal <fct>, diagnosis <dbl>
heart_test
# A tibble: 203 x 14
age sex cp trestbps chol fbs restecg thalach exang oldpeak
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 60. 1. np 140. 185. 0. hypertrop… 155. 0. 3.00
2 48. 1. aa 130. 245. 0. hypertrop… 180. 0. 0.200
3 64. 0. a 130. 303. 0. normal 122. 0. 2.00
4 62. 1. a 120. 267. 0. normal 99. 1. 1.80
5 43. 0. a 132. 341. 1. hypertrop… 136. 1. 3.00
6 55. 0. aa 135. 250. 0. hypertrop… 161. 0. 1.40
7 51. 1. a 140. 298. 0. normal 122. 1. 4.20
8 62. 1. aa 120. 281. 0. hypertrop… 103. 0. 1.40
9 67. 1. a 120. 229. 0. hypertrop… 129. 1. 2.60
10 71. 0. np 110. 265. 1. hypertrop… 130. 0. 0.
# ... with 193 more rows, and 4 more variables: slope <fct>, ca <dbl>,
# thal <fct>, diagnosis <dbl>
Now it’s time to build models that predict each patient’s diagnosis. To do this, create three new model objects FFTrees_model, glm_model, and randomForest_model, each predicting diagnosis using the training data heart_train.
# ------------------------------
# Part I: Build Models
# ------------------------------
# Build FFTrees_model
FFTrees_model <- FFTrees(formula = diagnosis ~ .,
data = heart_train)
Growing FFTs with ifan
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
# Build glm_model
glm_model <- glm(formula = factor(diagnosis) ~ .,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_model <- randomForest(formula = factor(diagnosis) ~ .,
data = heart_train)
Calculate training fits for each model using predict(), then create training_results containing the fitting accuracy of each model in a dataframe. The code will be almost identical to what is in the Example; all you need to do is change the value of train_truth to the correct column in heart_train. Afterwards, plot the results using ggplot. Which model had the best training accuracy?
# FFTrees training decisions
FFTrees_fit <- predict(object = FFTrees_model,
newdata = heart_train)
# Regression training decisions
# Positive values are predicted to be 1, negative values are 0
glm_fit <- predict(object = glm_model,
newdata = heart_train) > 0
# randomForest training decisions
randomForest_fit <- predict(object = randomForest_model,
newdata = heart_train)
# Now calculate fitting accuracies and put in dataframe
# Truth value for training data is heart_train$diagnosis
train_truth <- heart_train$diagnosis
# Put training results together
training_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_fit == train_truth),
mean(glm_fit == train_truth),
mean(randomForest_fit == train_truth)))
# Plot training results
ggplot(data = training_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Training Accuracy",
y = "Correct Classifications")
Explore the three models by applying print(), summary(), plot(), and names() to the model objects. Which of these methods work for each object? Can you interpret any of the outputs?
print(FFTrees_model)
FFT #1 predicts diagnosis using 4 cues: {cp,thal,ca,oldpeak}
[1] If cp != {a}, predict False.
[2] If thal = {rd,fd}, predict True.
[3] If ca > 0, predict True.
[4] If oldpeak <= 0.9, predict False, otherwise, predict True.
train
cases :n 100.00
speed :mcu 1.69
frugality :pci 0.88
accuracy :acc 0.83
weighted :wacc 0.82
sensitivity :sens 0.73
specificity :spec 0.91
pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4
print(glm_model)
Call: glm(formula = factor(diagnosis) ~ ., family = "binomial", data = heart_train)
Coefficients:
(Intercept) age sex
-42.74285 0.10042 3.89776
cpaa cpnp cpta
-4.81183 -9.60073 -11.51812
trestbps chol fbs
0.05502 0.01829 0.11396
restecghypertrophy restecgnormal thalach
-7.16047 -7.38367 0.17836
exang oldpeak slopeflat
-3.96418 1.31366 4.26435
slopeup ca thalnormal
-2.07748 5.23272 1.36110
thalrd
5.06249
Degrees of Freedom: 99 Total (i.e. Null); 81 Residual
Null Deviance: 137.6
Residual Deviance: 30.26 AIC: 68.26
print(randomForest_model)
Call:
randomForest(formula = factor(diagnosis) ~ ., data = heart_train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 18%
Confusion matrix:
0 1 class.error
0 48 7 0.1272727
1 11 34 0.2444444
summary(FFTrees_model)
$train
tree n hi mi fa cr sens spec ppv npv far
1 1 100 33 12 5 50 0.7333333 0.9090909 0.8684211 0.8064516 0.09090909
2 2 100 31 14 3 52 0.6888889 0.9454545 0.9117647 0.7878788 0.05454545
3 3 100 40 5 14 41 0.8888889 0.7454545 0.7407407 0.8913043 0.25454545
4 4 100 38 7 12 43 0.8444444 0.7818182 0.7600000 0.8600000 0.21818182
5 5 100 42 3 22 33 0.9333333 0.6000000 0.6562500 0.9166667 0.40000000
6 6 100 21 24 1 54 0.4666667 0.9818182 0.9545455 0.6923077 0.01818182
7 7 100 45 0 34 21 1.0000000 0.3818182 0.5696203 1.0000000 0.61818182
8 8 100 17 28 0 55 0.3777778 1.0000000 1.0000000 0.6626506 0.00000000
acc bacc wacc bpv dprime cost pci mcu
1 0.83 0.8212121 0.8212121 0.8374363 1.958103 0.17 0.8792857 1.69
2 0.83 0.8171717 0.8171717 0.8498217 2.094996 0.17 0.8864286 1.59
3 0.81 0.8171717 0.8171717 0.8160225 1.880894 0.19 0.8700000 1.82
4 0.81 0.8131313 0.8131313 0.8100000 1.791242 0.19 0.8771429 1.72
5 0.75 0.7666667 0.7666667 0.7864583 1.754433 0.25 0.8507143 2.09
6 0.75 0.7242424 0.7242424 0.8234266 2.009186 0.25 0.8764286 1.73
7 0.66 0.6909091 0.6909091 0.7848101 1.999716 0.34 0.8407143 2.23
8 0.72 0.6888889 0.6888889 0.8313253 2.064228 0.28 0.8607143 1.95
$test
NULL
summary(glm_model)
Call:
glm(formula = factor(diagnosis) ~ ., family = "binomial", data = heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.90474 -0.07549 -0.00268 0.04191 1.85561
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.74285 25.08901 -1.704 0.08845 .
age 0.10042 0.09606 1.045 0.29583
sex 3.89776 2.69543 1.446 0.14816
cpaa -4.81183 2.52717 -1.904 0.05691 .
cpnp -9.60073 4.31506 -2.225 0.02609 *
cpta -11.51812 4.85146 -2.374 0.01759 *
trestbps 0.05502 0.05502 1.000 0.31737
chol 0.01829 0.01733 1.055 0.29137
fbs 0.11396 1.97646 0.058 0.95402
restecghypertrophy -7.16047 12.51323 -0.572 0.56717
restecgnormal -7.38367 12.38725 -0.596 0.55113
thalach 0.17836 0.07913 2.254 0.02420 *
exang -3.96418 2.91634 -1.359 0.17405
oldpeak 1.31366 1.31935 0.996 0.31940
slopeflat 4.26435 3.42511 1.245 0.21312
slopeup -2.07748 2.64305 -0.786 0.43186
ca 5.23272 1.81873 2.877 0.00401 **
thalnormal 1.36110 6.36321 0.214 0.83062
thalrd 5.06249 6.46121 0.784 0.43332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.628 on 99 degrees of freedom
Residual deviance: 30.255 on 81 degrees of freedom
AIC: 68.255
Number of Fisher Scoring iterations: 9
summary(randomForest_model)
Length Class Mode
call 3 -none- call
type 1 -none- character
predicted 100 factor numeric
err.rate 1500 -none- numeric
confusion 6 -none- numeric
votes 200 matrix numeric
oob.times 100 -none- numeric
classes 2 -none- character
importance 13 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 100 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
terms 3 terms call
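The output above comes from print() and summary(). A minimal sketch of the other two methods mentioned in the task (assuming the model objects built above; their output is not shown here):
plot(FFTrees_model)         # draws the best fast-and-frugal tree along with its accuracy statistics
names(randomForest_model)   # lists the components stored in the randomForest object
names(glm_model)            # lists the components stored in the glm object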
# ----------------------
# Explore the importance of different variables in models
# ----------------------
# Print the elements of the FFTrees model
FFTrees_model
FFT #1 predicts diagnosis using 4 cues: {cp,thal,ca,oldpeak}
[1] If cp != {a}, predict False.
[2] If thal = {rd,fd}, predict True.
[3] If ca > 0, predict True.
[4] If oldpeak <= 0.9, predict False, otherwise, predict True.
train
cases :n 100.00
speed :mcu 1.69
frugality :pci 0.88
accuracy :acc 0.83
weighted :wacc 0.82
sensitivity :sens 0.73
specificity :spec 0.91
pars: algorithm = 'ifan', goal = 'wacc', goal.chase = 'bacc', sens.w = 0.5, max.levels = 4
# Look at significance of regression
summary(glm_model)
Call:
glm(formula = factor(diagnosis) ~ ., family = "binomial", data = heart_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.90474 -0.07549 -0.00268 0.04191 1.85561
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -42.74285 25.08901 -1.704 0.08845 .
age 0.10042 0.09606 1.045 0.29583
sex 3.89776 2.69543 1.446 0.14816
cpaa -4.81183 2.52717 -1.904 0.05691 .
cpnp -9.60073 4.31506 -2.225 0.02609 *
cpta -11.51812 4.85146 -2.374 0.01759 *
trestbps 0.05502 0.05502 1.000 0.31737
chol 0.01829 0.01733 1.055 0.29137
fbs 0.11396 1.97646 0.058 0.95402
restecghypertrophy -7.16047 12.51323 -0.572 0.56717
restecgnormal -7.38367 12.38725 -0.596 0.55113
thalach 0.17836 0.07913 2.254 0.02420 *
exang -3.96418 2.91634 -1.359 0.17405
oldpeak 1.31366 1.31935 0.996 0.31940
slopeflat 4.26435 3.42511 1.245 0.21312
slopeup -2.07748 2.64305 -0.786 0.43186
ca 5.23272 1.81873 2.877 0.00401 **
thalnormal 1.36110 6.36321 0.214 0.83062
thalrd 5.06249 6.46121 0.784 0.43332
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.628 on 99 degrees of freedom
Residual deviance: 30.255 on 81 degrees of freedom
AIC: 68.255
Number of Fisher Scoring iterations: 9
# Look at importance of variables in randomforests
randomForest_model$importance
MeanDecreaseGini
age 4.8117369
sex 1.0653393
cp 7.8707268
trestbps 3.6289871
chol 3.8782158
fbs 0.4243759
restecg 0.7612203
thalach 4.0227932
exang 1.1082324
oldpeak 5.2384905
slope 2.2656301
ca 6.6686235
thal 6.8666014
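As an aside (a small sketch, not part of the original solution), the randomForest package also provides importance() and varImpPlot() for working with these importance values:
importance(randomForest_model)   # the importance values as a matrix
varImpPlot(randomForest_model)   # dot chart of MeanDecreaseGini for each predictor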
Now calculate predictions for each model on the test data heart_test, and then calculate the prediction accuracies. Don’t forget to change the value of test_truth to reflect the true value for the current analysis! Then plot the results. Which model predicted each patient’s diagnosis the best?
# ------------------------------
# Part IV: Prediction Accuracy!
# ------------------------------
# Calculate predictions for each model for heart_test
# FFTrees testing decisions
FFTrees_pred <- predict(object = FFTrees_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_pred <- predict(object = glm_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_pred <- predict(object = randomForest_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
You can control how many trees a random forest uses with the ntree argument in randomForest(). Try creating 2 new randomForest objects, one based on a small number of trees (e.g. 10), and one based on 10,000 trees. How much better (or worse) do these models do compared to the default model with 500 trees?
# Build randomForest models with different numbers of trees
randomForest_10_model <- randomForest(formula = factor(diagnosis) ~ .,
data = heart_train,
ntree = 10)
randomForest_10000_model <- randomForest(formula = factor(diagnosis) ~ .,
data = heart_train,
ntree = 10000)
# randomForest testing decisions
randomForest_10_pred <- predict(object = randomForest_10_model,
newdata = heart_test)
randomForest_10000_pred <- predict(object = randomForest_10000_model,
newdata = heart_test)
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("RF_10", "RF_10000"),
result = c(mean(randomForest_10_pred == test_truth),
mean(randomForest_10000_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
You can use the my.tree argument in FFTrees() to create your own custom tree ‘in words’. To see how this works, look at the “Specifying FFTs directly” vignette in the FFTrees package by running vignette("FFTrees_mytree", package = "FFTrees"). Look through the vignette to see how the my.tree argument works. Then, try creating a new FFTrees object called custom_FFTrees_model trained on the heart_train data using the following rule: “If sex = 0, predict False. If chol > 250, predict True. Otherwise, predict True.” Plot the object to see how the tree did!
custom_FFTrees_model <- FFTrees(formula = diagnosis ~ .,
data = heart_train,
my.tree = "If sex = 0, predict False.
If chol > 250, predict TRUE.
Otherwise, predict TRUE.")
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
plot(custom_FFTrees_model, main = "My custom FFT")
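To double-check that the fitted tree matches the verbal rule, you could also describe it in words with FFTrees’ inwords() function (a small sketch, assuming custom_FFTrees_model from above):
inwords(custom_FFTrees_model)   # describe the custom tree in plain language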
A colleague claims that support vector machines will outperform all of your other models, and suggests you add them to your analyses using the svm() function from the e1071 package. You’ll need to add code for support vector machines at each stage of the machine learning process: model building, data fitting, and data prediction. Was she right?
library(e1071)
# Build svm model
svm_model <- svm(formula = factor(diagnosis) ~ .,
data = heart_train)
# svm testing decisions
svm_pred <- predict(object = svm_model,
newdata = heart_test)
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest", "svm"),
result = c(mean(FFTrees_pred == test_truth),
mean(glm_pred == test_truth),
mean(randomForest_pred == test_truth),
mean(svm_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
So far, the models have been allowed to use all of the variables in the heartdisease dataset. However, some of the data is more expensive to collect than other data. What do you think would happen if we only let the models use a few cheap predictors like age, sex and chol? Test your prediction by replicating the machine learning process, but only allow the models to make predictions based on the three variables age, sex and chol. Does one model substantially outperform the others? (Hint: You can easily tell a model which specific variables to include using the formula argument. For example, formula = y ~ a + b will tell a model to predict a variable y using only the variables a and b.)
# Build FFTrees_model
FFTrees_asc_model <- FFTrees(formula = diagnosis ~ age + sex + chol,
data = heart_train)
Growing FFTs with ifan
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
# Build glm_model
glm_asc_model <- glm(formula = factor(diagnosis) ~ age + sex + chol,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_asc_model <- randomForest(formula = factor(diagnosis) ~ age + sex + chol,
data = heart_train)
# FFTrees testing decisions
FFTrees_asc_pred <- predict(object = FFTrees_asc_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_asc_pred <- predict(object = glm_asc_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_asc_pred <- predict(object = randomForest_asc_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$diagnosis
test_truth <- heart_test$diagnosis
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_asc_pred == test_truth),
mean(glm_asc_pred == test_truth),
mean(randomForest_asc_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
Add a new column called random, containing randomly sampled values of 0 and 1, to heart_train and heart_test. Then, run your machine learning analysis, but now train and test the models to predict the new random data column. How well do the models do in training and testing?
# Add a new column random to heart_train and heart_test
heart_train <- heart_train %>%
mutate(
random = sample(c(0, 1), size = nrow(.), replace = TRUE)
)
heart_test <- heart_test %>%
mutate(
random = sample(c(0, 1), size = nrow(.), replace = TRUE)
)
# Build FFTrees_model
FFTrees_random_model <- FFTrees(formula = random ~ age + sex + chol,
data = heart_train)
Growing FFTs with ifan
Fitting non-FFTrees algorithms for comparison (you can turn this off with do.comp = FALSE) ...
# Build glm_model
glm_random_model <- glm(formula = factor(random) ~ age + sex + chol,
data = heart_train,
family = "binomial") # For predicting a binary variable
# Build randomForest model
randomForest_random_model <- randomForest(formula = factor(random) ~ age + sex + chol,
data = heart_train)
# FFTrees testing decisions
FFTrees_random_pred <- predict(object = FFTrees_random_model,
newdata = heart_test)
# Regression testing decisions
# Positive values are predicted to be 1, negative values are 0
glm_random_pred <- predict(object = glm_random_model,
newdata = heart_test) >= 0
# randomForest testing decisions
randomForest_random_pred <- predict(object = randomForest_random_model,
newdata = heart_test)
# Now calculate testing accuracies and put in dataframe
# Truth value for test data is heart_test$random
test_truth <- heart_test$random
testing_results <- tibble(model = c("FFTrees", "glm", "randomForest"),
result = c(mean(FFTrees_random_pred == test_truth),
mean(glm_random_pred == test_truth),
mean(randomForest_random_pred == test_truth)))
# Plot testing results
ggplot(data = testing_results,
aes(x = model, y = result, fill = model)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Testing Accuracy",
y = "Correct Classifications")
For more advanced machine learning functionality in R, check out the caret package and the mlr package. Both of these packages contain functions that automate most aspects of the machine learning process.
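As a brief, hedged illustration of what caret automates (this specific analysis is not part of the practical), a single train() call can handle resampling and model fitting for the same heart disease data:
library(caret)
# 10-fold cross-validation of a random forest on the training data
rf_caret <- train(factor(diagnosis) ~ .,
                  data = heart_train,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10))
rf_caret                                  # cross-validated accuracy for each mtry value
predict(rf_caret, newdata = heart_test)   # class predictions for the test data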
To read more about the fundamentals of machine learning and statistical prediction, check out An Introduction to Statistical Learning by James et al.