Here a link to the lecture slides for this session: LINK
In this practical you’ll learn how to produce tidy code (and data). By the end of this practical you will know how to:
Filenames should be meaningful. To order them, prefix them with numbers.
# Good
analyze_my_data.R
0_read_my_data.R
1_analyze_my_data.R
# Bad
stuff.r
code.r
Object names should be lowercase. Use _
rather than .
or ‘CamelCase’ (using capitalization) for multi-word names. If possible use nouns for variables and verbs for functions. Use meaningful names. Avoid using names of existing objects.
# Good
trial_id
trial_1
# Bad
nameOFtrial
trial.object
t
ekfjw
Place spaces around all operators, such as, =
, +
, -
, <-
, etc. Also applies for defining arguments in functions. Always put a space after, never before a comma.
# Good
var_rt <- var(rt, na.rm = TRUE)
# Bad
var_rt<-var(rt,na.rm=TRUE)
Extra spacing may be used to align assignments.
# Good
list(
var_rt = var(rt)
mean_rt = mean(rt)
)
An opening curly bracket should never be on its own line. Always indent within curly brackets. To indent code use two spaces. Don’t use tabs.
# Good
if (my_dbl < 2){
message('my_dbl is smaller 2')
} else {
message('my_dbl is larger or equal 2')
}
# Bad
if (my_dbl < 2)
{
message('my_dbl is smaller 2')
} else {
message('my_dbl is smaller 2')
}
For assignments use <-
, not =
.
# Good
x <- 24324
# Bad
x = 24324
Comment each line of your code. To break up your code in chunks use -
or =
.
# Plot data ----------------------------
# Plot data ============================
R’s error messages are not always very intuitive, but over time you will learn to understand them. In the beginning it helps to focus on the part after the colon. E.g.,
sapply(1:10, 'fefkl')
Error in get(as.character(FUN), mode = "function", envir = envir): object 'fefkl' of mode 'function' was not found
Here the key message is 'function' was not found
. I.e., R is interpreting 'fekl'
as a function but cannot find an instance of this function anywhere (because it doesn’t exist).
According to an analysis of stackoverflow.com, a popular help forum, the 7 most frequent error messages and their meaning are:
Error | Example | Description |
---|---|---|
'could not find function' |
lenth(my_vec) | There is a typo in the function name or that a package has not been loaded. |
'error in if' |
if(NA == 2) 2 + 2 | The object in the if clause is non-logical or NA. |
'error in eval' |
lm(fefq~wzfe) | An object is used that does not exist. |
'cannot open()' |
read_csv(‘hjht.txt’) | The file does not exist. Could be a typo or a missing filepath. |
'no applicable method' |
predict(‘efwe’) | A ‘generic function’ has not been defined for this type/class |
'subsscript out of bounds' |
a <- matrix(c(1,2)); a[2,2] | R tried to access an element (or variable) that does not exist |
package errors | Occur when R is unable to install, compile, or load a package. Often this means that some software in the background is missing. |
For more information visit here.
mice
exampleTo impute missing values, the mice
package is very helpful. The code below loads the titanic dataset containing records on 1313 Titanic survivors and then attempts to predict missing values in Age using central tendencies and mice
.
# Load packages
library(readr)
library(mice)
# read and duplicate in titanic data
data <- read_csv('https://tinyurl.com/y99aj5ed')
data_mean <- data
data_median <- data
data_mice <- data
# use central tendencies
data_mean$Age[is.na(data_mean$Age)] <- mean(data_mean$Age, na.rm = TRUE)
data_median$Age[is.na(data_median$Age)] <- median(data_median$Age, na.rm = TRUE)
# use mice
mice_model <- mice(data_mice, method = 'rf') # uses random forests
data_mice <- complete(mice_model)
If you haven’t already, begin a new project in a new folder. Within the folder, create two new folders called 1_data
and 2_code
.
1_data
folder using write_csv()
. Finally, change in the code the URL its loading the data from to the new file path on your computer. When ready save the code in the 2_code
folder as cleaned_code_2018Jan27.R
.Note: The code is using a data set by Sir Francis Galton on the heights of parents and their children, illustrating one of the most classic cases of regression toward the mean.
library(readr )
library( magrittr)
library( dplyr)
syc = read_csv('https://tinyurl.com/galton-txt')
syc = syc%>%mutate(fcm=father/2.54,mcm=mother/2.54)
a = syc$father
b = syc[['mother']]
t.test(a,b)
2_code
folder. Now, install and load the formatR
package (install.packages("formatR")
and library("formatR")
) and apply it to the file path of the ‘dirty code’ using the tidy_source(file_path)
function. Did it do a good job? What did it miss?2_code
folder as corrected_code_2018Jan27.R
.library(readr )
library(magritt)
syc = read_csv('https://tinyurl.com/galton-txt')
syc %>% mutate(higher_ratio = height / fther)
m=lm(height~father,data=syk)
yhat=predict(M)
plot(syc[[father]],yhat)
vs = syc %>% group_by(father) %>% summarize(a = meaen(height))
points(vs,pch=16)
On to different adventures. This section will be about missing values and how one could and should deal with them. Using the very powerful and user-friendly mice
-package this will entail some machine learning.
mice
example’ from introductory part above (uses the titanic dataset). Then evaluate the mean and variance (var()
) of the variable Age
for the imputation by mean
and the imputation by random forests (rf) through mice
. Do mean and variance differ between the imputation methods? Which imputation method produces the larger variance and why? Remember what Nathaniel said about overfitting and model complexity?# Load packages
library(readr)
library(mice)
# read and duplicate in titanic data
data <- read_csv('https://tinyurl.com/y99aj5ed')
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
Name = col_character(),
PClass = col_character(),
Age = col_double(),
Sex = col_character(),
Survived = col_integer(),
SexCode = col_integer()
)
data_mean <- data
data_median <- data
data_mice <- data
# use central tendencies
data_mean$Age[is.na(data_mean$Age)] <- mean(data_mean$Age, na.rm = TRUE)
data_median$Age[is.na(data_median$Age)] <- median(data_median$Age, na.rm = TRUE)
# use mice
mice_model <- mice(data_mice, method = 'rf') # uses random forests
iter imp variable
1 1 Age
1 2 Age
1 3 Age
1 4 Age
1 5 Age
2 1 Age
2 2 Age
2 3 Age
2 4 Age
2 5 Age
3 1 Age
3 2 Age
3 3 Age
3 4 Age
3 5 Age
4 1 Age
4 2 Age
4 3 Age
4 4 Age
4 5 Age
5 1 Age
5 2 Age
5 3 Age
5 4 Age
5 5 Age
data_mice <- complete(mice_model)
# evaluate
mean(data_mean$Age) ; mean(data_median$Age) ; mean(data_mice$Age)
[1] 30.39799
[1] 29.38072
[1] 29.99966
var(data_mean$Age) ; var(data_median$Age) ; var(data_mice$Age)
[1] 117.0023
[1] 118.4079
[1] 188.4796
NAs
, so that we can evaluate the relative performance of mean
versus mice
. To do this, create a copy of the Galton dataset (i.e., assign to new object), and remove in the copy 10% of the values in the variable height
. Any easy way to do this, is to use the sample()
- function. Provided with the number of rows in the dataset and the number of samples, it will return a vector of indices that can be used to eliminated values (i.e., assign NA
s to those indices). Now create additional copies of the Galton copy (with the NA
s), one for each method (e.g., Galton_mean
and Galton_mice
) and impute within those the NA
s using the respective methods. To evaluate then the performance of the two methods, we then only need to compare the height
variable in the original data with those from the two data sets with imputed values. To summarize the differences use mean squared error mean((original - imputed)^2)
. What do you expect, which method works better? And, did it?# load packages
library(readr)
library(mice)
# load and copy data
galton_data <- read_csv('https://tinyurl.com/galton-txt')
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
family = col_character(),
father = col_double(),
mother = col_double(),
sex = col_character(),
height = col_double(),
nkids = col_integer()
)
galton_copy <- galton_data
# introduce NAs
galton_copy$height[sample(nrow(galton_copy), ceiling(nrow(galton_copy)*.1))] <- NA
galton_mean <- galton_copy
galton_mice <- galton_copy
# impute mean
galton_mean$height[is.na(galton_mean$height)] <- mean(galton_mean$height, na.rm = TRUE)
# impute mice
mice_model <- mice(galton_mice, method="rf")
iter imp variable
1 1 height
1 2 height
1 3 height
1 4 height
1 5 height
2 1 height
2 2 height
2 3 height
2 4 height
2 5 height
3 1 height
3 2 height
3 3 height
3 4 height
3 5 height
4 1 height
4 2 height
4 3 height
4 4 height
4 5 height
5 1 height
5 2 height
5 3 height
5 4 height
5 5 height
galton_mice <- complete(mice_model)
# evaluate
mean((galton_mean$height - galton_data$height)[is.na(galton_copy$height)]**2)
[1] 11.60851
mean((galton_mice$height - galton_data$height)[is.na(galton_copy$height)]**2)
[1] 18.40489
NA
s randomly. Now, let’s see what happens, when NA
s are not missing at random. Repeat the analysis of the last exercise, but this time introduce NA
s systematically. That is use other variable or combination of other variables to determine the locations of NA
s in the height
variable. For instance, introduce NA
s for father > 70
and the sex == 'M'
. Have the results changed? If yes, why?# load packages
library(readr)
library(mice)
# load and copy data
galton_data <- read_csv('https://tinyurl.com/galton-txt')
Warning: Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
X1 = col_integer(),
family = col_character(),
father = col_double(),
mother = col_double(),
sex = col_character(),
height = col_double(),
nkids = col_integer()
)
galton_copy <- galton_data
# introduce NAs
sel <- with(galton_copy, father > 70 | sex == 'M')
galton_copy$height[sel] <- NA
galton_mean <- galton_copy
galton_mice <- galton_copy
# impute mean
galton_mean$height[is.na(galton_mean$height)] <- mean(galton_mean$height, na.rm = TRUE)
# impute mice
mice_model <- mice(galton_mice, method="rf")
iter imp variable
1 1 height
1 2 height
1 3 height
1 4 height
1 5 height
2 1 height
2 2 height
2 3 height
2 4 height
2 5 height
3 1 height
3 2 height
3 3 height
3 4 height
3 5 height
4 1 height
4 2 height
4 3 height
4 4 height
4 5 height
5 1 height
5 2 height
5 3 height
5 4 height
5 5 height
galton_mice <- complete(mice_model)
# evaluate error - MSE
mean((galton_mean$height - galton_data$height)[is.na(galton_copy$height)]**2)
[1] 32.288
mean((galton_mice$height - galton_data$height)[is.na(galton_copy$height)]**2)
[1] 31.16207
?mice
. Try out different methods and compare their performance.For more details check out check out Hadley Wickham’s Advanced R.