Trulli
https://en.wikipedia.org/wiki/Seattle_box

Overview

In this practical, you’ll practice grouping and analysing data with the dplyr and tidyr packages (part of the `tidyverse collection of packages).

By the end of this practical you will know how to:

  1. Group data and calculate summary statistics
  2. Run simple statistical analyses.

Datasets

library(tidyverse)
kc_house <- read_csv("https://raw.githubusercontent.com/therbootcamp/BaselRBootcamp_2018July/master/_sessions/_data//baselrbootcamp_data/kc_house.csv")
File Rows Columns Description
kc_house.csv 21613 21 House sale prices for King County between May 2014 and May 2015.

Packages

Package Installation
tidyverse install.packages("tidyverse")

Glossary

Wrangling

Function Package Description
rename() dplyr Rename columns
select() dplyr Select columns based on name or index
filter() dplyr Select rows based on some logical criteria
arrange() dplyr Sort rows
mutate() dplyr Add new columns
case_when() dplyr Recode values of a column
group_by(), summarise() dplyr Group data and then calculate summary statistics
left_join() dplyr Combine multiple data sets using a key column
spread() tidyr Convert long data to wide format - from rows to columns
gather() tidyr Convert wide data to long format - from columns to rows

Statistical Tests

Function Hypothesis Test
t.test() One and two sample t-test
cor.test() Correlation test
glm(), lm() Generalized linear model and linear model

Examples

# Wrangling with dplyr and tidyr ---------------------------

library(tidyverse)    # Load tidyverse for dplyr and tidyr

# Load baselers data
baselers <- read_csv("1_Data/baselers.txt")

# No grouping variables

bas <- baselers %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

# One grouping variable
bas_sex <- baselers %>%
  group_by(sex) %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

bas_sex

# Two grouping variables
bas_sex_ed <- baselers %>%
  group_by(sex, education) %>%
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    income_median = median(income, na.rm = TRUE),
    N = n()
  )

# Advanced scoping

# Calculate mean of ALL numeric variables
baselers %>%
  group_by(sex, education) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)


# Examples of hypothesis tests on the diamonds -------------

library(tidyverse)
library(broom)
library(rsq)

# First few rows of the diamonds data

diamonds


# 2-sample t- test ---------------------------

# Q: Is there a difference in the carats of color = E and color = I diamonds?

htest_B <- t.test(formula = carat ~ color,     # DV ~ IV
                   alternative = "two.sided",  # Two-sided test
                   data = diamonds,         # The data
                   subset = color %in% c("E", "I")) # Compare Diet 1 and Diet 2

htest_B  # Print result


# Regression ----------------------------


# Q: Create regression equation predicting price by carat, depth, table, and x

price_glm <- glm(formula = price ~ carat + depth + table + x,
                  data = diamonds)

# Print coefficients
price_glm$coefficients

# Tidy version
tidy(price_glm)

# Extract R-Squared
rsq(price_glm)

Tasks

A - Setup

  1. Open your baselrbootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.
# Done!
  1. Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called analysing_practical.R in the 2_Code folder.
  2. Using library() load the set of packages for this practical listed in the packages section above.

  3. For this practical, we’ll use the kc_house.csv data. This dataset contains house sale prices for King County, Washington. It includes homes sold between May 2014 and May 2015. Using the following template, load the data into R and store it as a new object called kc_house.

kc_house <- read_csv(file = "XX")
  1. Using print(), summary(), and head(), explore the data to make sure it was loaded correctly.
kc_house
# A tibble: 21,613 x 21
   id         date                  price bedrooms bathrooms sqft_living
   <chr>      <dttm>                <dbl>    <int>     <dbl>       <int>
 1 7129300520 2014-10-13 00:00:00  221900        3      1           1180
 2 6414100192 2014-12-09 00:00:00  538000        3      2.25        2570
 3 5631500400 2015-02-25 00:00:00  180000        2      1            770
 4 2487200875 2014-12-09 00:00:00  604000        4      3           1960
 5 1954400510 2015-02-18 00:00:00  510000        3      2           1680
 6 7237550310 2014-05-12 00:00:00 1225000        4      4.5         5420
 7 1321400060 2014-06-27 00:00:00  257500        3      2.25        1715
 8 2008000270 2015-01-15 00:00:00  291850        3      1.5         1060
 9 2414600126 2015-04-15 00:00:00  229500        3      1           1780
10 3793500160 2015-03-12 00:00:00  323000        3      2.5         1890
# ... with 21,603 more rows, and 15 more variables: sqft_lot <int>,
#   floors <dbl>, waterfront <int>, view <int>, condition <int>,
#   grade <int>, sqft_above <int>, sqft_basement <int>, yr_built <int>,
#   yr_renovated <int>, zipcode <int>, lat <dbl>, long <dbl>,
#   sqft_living15 <int>, sqft_lot15 <int>
summary(kc_house)
      id                 date                         price        
 Length:21613       Min.   :2014-05-02 00:00:00   Min.   :  75000  
 Class :character   1st Qu.:2014-07-22 00:00:00   1st Qu.: 321950  
 Mode  :character   Median :2014-10-16 00:00:00   Median : 450000  
                    Mean   :2014-10-29 04:38:01   Mean   : 540088  
                    3rd Qu.:2015-02-17 00:00:00   3rd Qu.: 645000  
                    Max.   :2015-05-27 00:00:00   Max.   :7700000  
    bedrooms      bathrooms     sqft_living       sqft_lot      
 Min.   : 0.0   Min.   :0.00   Min.   :  290   Min.   :    520  
 1st Qu.: 3.0   1st Qu.:1.75   1st Qu.: 1427   1st Qu.:   5040  
 Median : 3.0   Median :2.25   Median : 1910   Median :   7618  
 Mean   : 3.4   Mean   :2.11   Mean   : 2080   Mean   :  15107  
 3rd Qu.: 4.0   3rd Qu.:2.50   3rd Qu.: 2550   3rd Qu.:  10688  
 Max.   :33.0   Max.   :8.00   Max.   :13540   Max.   :1651359  
     floors       waterfront         view        condition   
 Min.   :1.00   Min.   :0.000   Min.   :0.00   Min.   :1.00  
 1st Qu.:1.00   1st Qu.:0.000   1st Qu.:0.00   1st Qu.:3.00  
 Median :1.50   Median :0.000   Median :0.00   Median :3.00  
 Mean   :1.49   Mean   :0.008   Mean   :0.23   Mean   :3.41  
 3rd Qu.:2.00   3rd Qu.:0.000   3rd Qu.:0.00   3rd Qu.:4.00  
 Max.   :3.50   Max.   :1.000   Max.   :4.00   Max.   :5.00  
     grade         sqft_above   sqft_basement     yr_built   
 Min.   : 1.00   Min.   : 290   Min.   :   0   Min.   :1900  
 1st Qu.: 7.00   1st Qu.:1190   1st Qu.:   0   1st Qu.:1951  
 Median : 7.00   Median :1560   Median :   0   Median :1975  
 Mean   : 7.66   Mean   :1788   Mean   : 292   Mean   :1971  
 3rd Qu.: 8.00   3rd Qu.:2210   3rd Qu.: 560   3rd Qu.:1997  
 Max.   :13.00   Max.   :9410   Max.   :4820   Max.   :2015  
  yr_renovated     zipcode           lat            long     
 Min.   :   0   Min.   :98001   Min.   :47.2   Min.   :-122  
 1st Qu.:   0   1st Qu.:98033   1st Qu.:47.5   1st Qu.:-122  
 Median :   0   Median :98065   Median :47.6   Median :-122  
 Mean   :  84   Mean   :98078   Mean   :47.6   Mean   :-122  
 3rd Qu.:   0   3rd Qu.:98118   3rd Qu.:47.7   3rd Qu.:-122  
 Max.   :2015   Max.   :98199   Max.   :47.8   Max.   :-121  
 sqft_living15    sqft_lot15    
 Min.   : 399   Min.   :   651  
 1st Qu.:1490   1st Qu.:  5100  
 Median :1840   Median :  7620  
 Mean   :1987   Mean   : 12768  
 3rd Qu.:2360   3rd Qu.: 10083  
 Max.   :6210   Max.   :871200  
head(kc_house)
# A tibble: 6 x 21
  id    date                 price bedrooms bathrooms sqft_living sqft_lot
  <chr> <dttm>               <dbl>    <int>     <dbl>       <int>    <int>
1 7129… 2014-10-13 00:00:00 2.22e5        3      1           1180     5650
2 6414… 2014-12-09 00:00:00 5.38e5        3      2.25        2570     7242
3 5631… 2015-02-25 00:00:00 1.80e5        2      1            770    10000
4 2487… 2014-12-09 00:00:00 6.04e5        4      3           1960     5000
5 1954… 2015-02-18 00:00:00 5.10e5        3      2           1680     8080
6 7237… 2014-05-12 00:00:00 1.23e6        4      4.5         5420   101930
# ... with 14 more variables: floors <dbl>, waterfront <int>, view <int>,
#   condition <int>, grade <int>, sqft_above <int>, sqft_basement <int>,
#   yr_built <int>, yr_renovated <int>, zipcode <int>, lat <dbl>,
#   long <dbl>, sqft_living15 <int>, sqft_lot15 <int>

B - Recap

  1. Print the names of the kc_house data with names().
names(kc_house)
 [1] "id"            "date"          "price"         "bedrooms"     
 [5] "bathrooms"     "sqft_living"   "sqft_lot"      "floors"       
 [9] "waterfront"    "view"          "condition"     "grade"        
[13] "sqft_above"    "sqft_basement" "yr_built"      "yr_renovated" 
[17] "zipcode"       "lat"           "long"          "sqft_living15"
[21] "sqft_lot15"   
  1. Change the following column names using rename().
New Name Old Name
living_sqft sqft_living
lot_sqft sqft_lot
above_sqft sqft_above
basement_sqft sqft_basement
built_yr yr_built
renovated_yr yr_renovated
  1. Create new column(s) living_sqm, lot_sqm, above_sqm and basement_sqm which show the respective room sizes in square meters rather than square feet (Hint: Multiply each by 0.093).
kc_house <- kc_house %>%
  mutate(living_sqm = XXX * XXX,
         lot_sqm = XXX * XXX,
         XXX = XXX,
         XXX = XXX)
kc_house <- kc_house %>%
  mutate(living_sqm = living_sqft * 0.093,
         lot_sqm = lot_sqft * 0.093,
         above_sqm = above_sqft * 0.093,
         basement_sqm  = basement_sqft * 0.093)
  1. Add a new variable to the dataframe called mansion which is “Yes” when the sum of the house’s living, above, and basement space is above 750 square meters.
kc_house <- kc_house %>%
                mutate(mansion = case_when(
                              living_sqm + above_sqm +  basement_sqm > 750 ~ "Yes",
                              living_sqm + above_sqm + basement_sqm <= 750 ~ "No"))

C - Simple summaries

  1. Using the base-R df$col notation, calculate the mean price of all houses.
mean(XXX$XXX)
mean(kc_house$price)
[1] 540088
  1. Now, do the same using summarise() using the following template. Do you get the same answer? What is different about the output from summarise() versus using the dollar sign?
kc_house %>%
  summarise(
    price_mean = mean(XXX)
  )
kc_house %>%
  summarise(
    price_mean = mean(price)
  )
# A tibble: 1 x 1
  price_mean
       <dbl>
1    540088.
  1. What is the median price of all houses? Use the median() function!
kc_house %>%
  summarise(
    price_median = median(price)
  )
# A tibble: 1 x 1
  price_median
         <dbl>
1       450000
  1. What was the highest selling price? Use the max() function!
kc_house %>%
  summarise(
    price_max = max(price)
  )
# A tibble: 1 x 1
  price_max
      <dbl>
1   7700000
  1. Using the following template, sort the data frame in descending order of price. Then, print it. Do you see the house with the highest selling price at the top?
kc_house <- kc_house %>%
  arrange(desc(XXX))

kc_house
kc_house <- kc_house %>%
  arrange(desc(price))

kc_house
# A tibble: 21,613 x 26
   id         date                  price bedrooms bathrooms living_sqft
   <chr>      <dttm>                <dbl>    <int>     <dbl>       <int>
 1 6762700020 2014-10-13 00:00:00 7700000        6      8          12050
 2 9808700762 2014-06-11 00:00:00 7062500        5      4.5        10040
 3 9208900037 2014-09-19 00:00:00 6885000        6      7.75        9890
 4 2470100110 2014-08-04 00:00:00 5570000        5      5.75        9200
 5 8907500070 2015-04-13 00:00:00 5350000        5      5           8000
 6 7558700030 2015-04-13 00:00:00 5300000        6      6           7390
 7 1247600105 2014-10-20 00:00:00 5110800        5      5.25        8010
 8 1924059029 2014-06-17 00:00:00 4668000        5      6.75        9640
 9 7738500731 2014-08-15 00:00:00 4500000        5      5.5         6640
10 3835500195 2014-06-18 00:00:00 4489000        4      3           6430
# ... with 21,603 more rows, and 20 more variables: lot_sqft <int>,
#   floors <dbl>, waterfront <int>, view <int>, condition <int>,
#   grade <int>, above_sqft <int>, basement_sqft <int>, built_yr <int>,
#   renovated_yr <int>, zipcode <int>, lat <dbl>, long <dbl>,
#   sqft_living15 <int>, sqft_lot15 <int>, living_sqm <dbl>,
#   lot_sqm <dbl>, above_sqm <dbl>, basement_sqm <dbl>, mansion <chr>
  1. What percentage of houses sold for more than 1,000,000? Let’s answer this with summarise().
kc_house %>%
  summarise(mil_p = mean(XXX > 1000000))
kc_house %>%
  summarise(mil_p = mean(price > 1000000))
# A tibble: 1 x 1
   mil_p
   <dbl>
1 0.0678
  1. For mansions only, calculate the mean number of floors and bathrooms (hint: before summarising the data, use filter() to only select mansions!)
kc_house %>%
  filter(mansion == "Yes") %>%
  summarise(
    floors_mean = mean(floors),
    bathrooms_mean = mean(bathrooms)
  )
# A tibble: 1 x 2
  floors_mean bathrooms_mean
        <dbl>          <dbl>
1        1.92           3.68

D - Simple grouped summaries

  1. How many mansions are there? To do this, use group_by() to group the dataset by the mansions column, then use the n() function to count the number of cases.
kc_house %>%
  group_by(XXX) %>%
  summarise(XXX = n())
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n())
# A tibble: 2 x 2
  mansion     N
  <chr>   <int>
1 No      20862
2 Yes       751
  1. What is the mean selling price of mansions versus non-mansions? To do this, just add another argument to your summarise() function!
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n(),
            XXX = XXX(XXX))
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n(),
            price_mean = mean(price))
# A tibble: 2 x 3
  mansion     N price_mean
  <chr>   <int>      <dbl>
1 No      20862    504024.
2 Yes       751   1541915.
  1. Using group_by() and summarise(), create a dataframe showing the same results as the following table.
mansion N price_min price_mean price_median price_max
No 20862 75000 504024 441000 3100000
Yes 751 404000 1541915 1300000 7700000
  1. Do houses built in later years tend to have more living space? To answer this, group the data by built_yr, and then calculate the mean number of living square meters. Be sure to also include the number of houses built in each year!
kc_house %>%
  group_by(built_yr) %>%
  summarise(N = n(),
            living = mean(living_sqm))
# A tibble: 116 x 3
   built_yr     N living
      <int> <int>  <dbl>
 1     1900    87   161.
 2     1901    29   164.
 3     1902    27   179.
 4     1903    46   140.
 5     1904    45   149.
 6     1905    74   183.
 7     1906    92   168.
 8     1907    65   177.
 9     1908    86   158.
10     1909    94   177.
# ... with 106 more rows
  1. Was that table too big? Try using the following code to get the results for each decade rather than each year!
kc_house %>%
  mutate(built_decade = floor(built_yr / 10)) %>%
  group_by(built_decade) %>%
  summarise(XX = XX,
            XX = XX(XX))
  1. A friend of yours who is getting into Seattle real estate wants to know how the number of floors a house has affects its selling price. Create a table for her showing the minimum, mean, and maximum price for houses separated by the number of floors they have.

E - Multiple groups

  1. Your friend Brumhilda is interested in statistics on houses in the following 4 zipcodes only: 98001, 98109, 98117, 98199. Create a new dataframe called brumhilda that only contains data from houses in those zipcode (hint: use filter() combined with the %in% operator as follows:
brumhilda <- kc_house %>%
  filter(XXX %in% c(XXX, XXX, XXX, XXX))
brumhilda <- kc_house %>%
  filter(zipcode %in% c(98001, 98109, 98117, 98199))
  1. For each of the zip codes, calculate the mean() and median() selling price (as well as the number of houses) in each zip code.
brumhilda %>%
  group_by(zipcode) %>%
  summarise(price_mean = mean(price),
           price_median = median(price),
           N = n())
# A tibble: 4 x 4
  zipcode price_mean price_median     N
    <int>      <dbl>        <dbl> <int>
1   98001    280805.       260000   362
2   98109    879624.       736000   109
3   98117    576795.       544000   553
4   98199    791821.       689800   317
  1. Now Brumhilda wants the same data separated by whether or not the house is a mansion or not. Include these results by also grouping the data by mansion (as well as zipcode), and calculating the same summary statistics as before.
brumhilda %>%
  group_by(zipcode, mansion) %>%
  summarise(price_mean = mean(price),
           price_median = median(price),
           N = n())
# A tibble: 8 x 5
# Groups:   zipcode [?]
  zipcode mansion price_mean price_median     N
    <int> <chr>        <dbl>        <dbl> <int>
1   98001 No         277589.       260000   359
2   98001 Yes        665667.       637000     3
3   98109 No         833528.       730500   106
4   98109 Yes       2508333.      2900000     3
5   98117 No         575626.       543000   551
6   98117 Yes        898750        898750     2
7   98199 No         753625.       675000   305
8   98199 Yes       1762618.      1425000    12
  1. Ok that was good, but now she also wants to know what the maximum and minimum number of floors were in each group. Add these summary statistics!
brumhilda %>%
  group_by(zipcode) %>%
  summarise(price_mean = mean(price),
           price_median = median(price),
           floors_min = min(floors),
           floors_max = max(floors),
           N = n())
# A tibble: 4 x 6
  zipcode price_mean price_median floors_min floors_max     N
    <int>      <dbl>        <dbl>      <dbl>      <dbl> <int>
1   98001    280805.       260000          1        2.5   362
2   98109    879624.       736000          1        3     109
3   98117    576795.       544000          1        3     553
4   98199    791821.       689800          1        3     317

F - Statistics

  1. Let’s see if there is a significant difference between the selling prices of houses on the waterfront versus those not on the waterfront. To do this, we’ll conduct a t-test using the t.test() function and assign the result to waterfront_htest. Fill in the XXXs in the code below, such that the dependent variable is price and the independent variable is waterfront
waterfront_htest <- t.test(formula = XXX ~ XXX,
                           data = XXX)
waterfront_htest <- t.test(formula = price ~ waterfront,
                           data = kc_house)
  1. Print your waterfront_htest object to see a printout of the main results.
waterfront_htest

    Welch Two Sample t-test

data:  price by waterfront
t = -10, df = 200, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1303662  -956963
sample estimates:
mean in group 0 mean in group 1 
         531564         1661876 
  1. Look at the names of your waterfront_htest object with names().
names(waterfront_htest)
[1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
[6] "null.value"  "alternative" "method"      "data.name"  
  1. Using the $, print the test statistic (statistic) from your waterfront_htest object.
waterfront_htest$statistic
    t 
-12.9 
  1. Now using $, print only the p-value (p.value) from the object.
waterfront_htest$p.value
[1] 1.38e-26
  1. Which of the variables bedrooms, bathrooms, living_sqm, waterfront, lat and long best predict a house’s price (price)? Answer this by conducting a regression analysis using the glm() function and assign the result to price_glm. Fill in the XXXs in the code below.
price_glm <- glm(formula = XXX ~ XXX + XXX + XXX + XXX + XXX + XXX,
                 data = XXX)
price_glm <- glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront + lat + long,
                 data = kc_house)
  1. Print your price_glm object. What do you see?
price_glm

Call:  glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront + 
    lat + long, data = kc_house)

Coefficients:
(Intercept)     bedrooms    bathrooms   living_sqm   waterfront  
  -65680578       -46146        17422         3157       790264  
        lat         long  
     675209      -275008  

Degrees of Freedom: 21612 Total (i.e. Null);  21606 Residual
Null Deviance:      2.91e+15 
Residual Deviance: 1.09e+15     AIC: 594000
  1. Look at the names of your price_glm object with names().
names(price_glm)
 [1] "coefficients"      "residuals"         "fitted.values"    
 [4] "effects"           "R"                 "rank"             
 [7] "qr"                "family"            "linear.predictors"
[10] "deviance"          "aic"               "null.deviance"    
[13] "iter"              "weights"           "prior.weights"    
[16] "df.residual"       "df.null"           "y"                
[19] "converged"         "boundary"          "model"            
[22] "call"              "formula"           "terms"            
[25] "data"              "offset"            "control"          
[28] "method"            "contrasts"         "xlevels"          
  1. Using $ print the coefficients from your analysis.
price_glm$coefficients
(Intercept)    bedrooms   bathrooms  living_sqm  waterfront         lat 
  -65680578      -46146       17422        3157      790264      675209 
       long 
    -275008 
  1. Using the summary() function, show summary results from your price_glm object.
summary(price_glm)

Call:
glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront + 
    lat + long, data = kc_house)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1563708   -116127    -14138     84523   4180382  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.57e+07   1.41e+06  -46.51  < 2e-16 ***
bedrooms    -4.61e+04   2.05e+03  -22.53  < 2e-16 ***
bathrooms    1.74e+04   3.07e+03    5.67  1.4e-08 ***
living_sqm   3.16e+03   2.95e+01  107.17  < 2e-16 ***
waterfront   7.90e+05   1.79e+04   44.14  < 2e-16 ***
lat          6.75e+05   1.12e+04   60.20  < 2e-16 ***
long        -2.75e+05   1.14e+04  -24.15  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 5.06e+10)

    Null deviance: 2.9129e+15  on 21612  degrees of freedom
Residual deviance: 1.0943e+15  on 21606  degrees of freedom
AIC: 594064

Number of Fisher Scoring iterations: 2
  1. The $residuals element in your price_glm object contains the residuals (difference from the predicted and true dependent variable values). Add this vector as a new column called residuals to your kc_house object using mutate()
kc_house <- kc_house %>%
  mutate(residuals = price_glm$residuals)
  1. Using the following template, create a histogram of the residuals.
ggplot(data = kc_house,
       aes(x = residuals)) +
  geom_histogram(col = "white") +
  labs(title = "Residual regression")
ggplot(data = kc_house,
       aes(x = residuals)) +
  geom_histogram(col = "white") +
  labs(title = "Residual regression")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. What is the mean of the residuals? How do you interpret this?
kc_house %>% 
  summarise(resid_mean = mean(residuals))
# A tibble: 1 x 1
    resid_mean
         <dbl>
1 -0.000000151
  1. Now, create a new column called residuals_abs which shows the absolute value of the residuals (hint: use the abs() function).
kc_house <- kc_house %>%
  mutate(residuals_abs = abs(price_glm$residuals))
  1. Create a histogram of the absolute value of the residuals.
ggplot(data = kc_house,
       aes(x = residuals_abs)) +
  geom_histogram(col = "white") +
  labs(title = "Residual regression")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. What is the mean of the absolute value of the residuals? How do you interpret this? In general, do you think you can predict the price of a house very well based on the features in your regression model?
kc_house %>% 
  summarise(resid_abs_mean = mean(residuals_abs))
# A tibble: 1 x 1
  resid_abs_mean
           <dbl>
1        144018.
# On average, the model's fitted values are off by 144,018. So the model isn't terribly accurate on average.

X - Challenges

  1. Which zipcode has the highest percentage of houses on the waterfront? (Hint: group by zipcode, calculate the percentage of houses on the waterfront using mean(), then sort the data in descending order) with arrange(), then select the first row with slice(). Once you find it, try searching for that zipcode on Google Maps and see if it’s location makes sense!
kc_house %>%
  group_by(zipcode) %>%
  summarise(waterfront_p = mean(waterfront)) %>%
  arrange(desc(waterfront_p)) %>%
  slice(1)
# A tibble: 1 x 2
  zipcode waterfront_p
    <int>        <dbl>
1   98070        0.203
  1. Which house had the highest price to living space ratio? To answer this, create a new variable called price_to_living that takes price / living_sqm. Then, sort the data in descending order of this variable, and select the first row with slice()! What id value do you get?
kc_house %>%
  mutate(price_to_living = price / living_sqm) %>%
  arrange(desc(price_to_living)) %>%
  slice(1)
# A tibble: 1 x 29
  id    date                 price bedrooms bathrooms living_sqft lot_sqft
  <chr> <dttm>               <dbl>    <int>     <dbl>       <int>    <int>
1 6021… 2015-04-07 00:00:00 874950        2         1        1080     4000
# ... with 22 more variables: floors <dbl>, waterfront <int>, view <int>,
#   condition <int>, grade <int>, above_sqft <int>, basement_sqft <int>,
#   built_yr <int>, renovated_yr <int>, zipcode <int>, lat <dbl>,
#   long <dbl>, sqft_living15 <int>, sqft_lot15 <int>, living_sqm <dbl>,
#   lot_sqm <dbl>, above_sqm <dbl>, basement_sqm <dbl>, mansion <chr>,
#   residuals <dbl>, residuals_abs <dbl>, price_to_living <dbl>
  1. Which are the top 10 zip codes in terms of mean housing prices? To answer this, group the data by zipcode, calculate the mean price, arrange the dataset in descending order of mean price, then select the top 10 rows!
kc_house %>%
  group_by(zipcode) %>%
  summarise(price_mean = mean(price)) %>%
  arrange(desc(price_mean)) %>%
  slice(1:10)
# A tibble: 10 x 2
   zipcode price_mean
     <int>      <dbl>
 1   98039   2160607.
 2   98004   1355927.
 3   98040   1194230.
 4   98112   1095499.
 5   98102    901258.
 6   98109    879624.
 7   98105    862825.
 8   98006    859685.
 9   98119    849448.
10   98005    810165.
  1. Create the following dataframe exactly as it appears.
built_yr N price_mean price_max living_sqm_mean
1990 320 563966 3640900 234
1991 224 630441 5300000 244
1992 198 548169 2480000 223
1993 202 556612 3120000 226
1994 249 486834 2880500 209
1995 169 577771 3200000 224
1996 195 639534 3100000 240
1997 177 606058 3800000 234
1998 239 594159 1960000 241
kc_house %>%
  filter(built_yr >= 1990 & built_yr < 1999) %>%
  group_by(built_yr) %>%
  summarise(N = n(),
            price_mean = mean(price),
            price_max = max(price),
            living_sqm_mean = mean(living_sqm)) %>%
  knitr::kable(digits = 0)
built_yr N price_mean price_max living_sqm_mean
1990 320 563966 3640900 234
1991 224 630441 5300000 244
1992 198 548169 2480000 223
1993 202 556612 3120000 226
1994 249 486834 2880500 209
1995 169 577771 3200000 224
1996 195 639534 3100000 240
1997 177 606058 3800000 234
1998 239 594159 1960000 241
  1. Create a regression object called living_glm predicting the amount of living space (living_sqm) as a function of bedrooms, bathrooms, waterfront, and built_yr. Explore the object with names() and summary(). Which variables seem to predict living space?
living_glm <- glm(formula = living_sqm ~ bedrooms + bathrooms + waterfront + built_yr,
                  data = kc_house)

names(living_glm)
 [1] "coefficients"      "residuals"         "fitted.values"    
 [4] "effects"           "R"                 "rank"             
 [7] "qr"                "family"            "linear.predictors"
[10] "deviance"          "aic"               "null.deviance"    
[13] "iter"              "weights"           "prior.weights"    
[16] "df.residual"       "df.null"           "y"                
[19] "converged"         "boundary"          "model"            
[22] "call"              "formula"           "terms"            
[25] "data"              "offset"            "control"          
[28] "method"            "contrasts"         "xlevels"          
summary(living_glm)

Call:
glm(formula = living_sqm ~ bedrooms + bathrooms + waterfront + 
    built_yr, data = kc_house)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-705.7   -32.0    -4.9    25.7   566.1  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 219.6403    27.7140    7.93  2.4e-15 ***
bedrooms     23.1621     0.4532   51.11  < 2e-16 ***
bathrooms    71.3214     0.6294  113.31  < 2e-16 ***
waterfront   62.5121     4.1476   15.07  < 2e-16 ***
built_yr     -0.1297     0.0143   -9.09  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 2750)

    Null deviance: 157675161  on 21612  degrees of freedom
Residual deviance:  59420739  on 21608  degrees of freedom
AIC: 232503

Number of Fisher Scoring iterations: 2
  1. The chisq.test() function allows you to do conduct a chi square test testing the relationship between two nominal variables. Look at the help menu to see how the function works. Then, conduct a chi-square test to see if there is a relationship between whether a house is on the waterfront and the grade of the house. Do houses on the waterfront tend to have higher (or lower) grades than houses not on the waterfront?
# First look at a table

table(kc_house$waterfront, kc_house$grade)
   
       1    3    4    5    6    7    8    9   10   11   12   13
  0    1    3   29  238 2026 8958 6028 2590 1106  379   79   13
  1    0    0    0    4   12   23   40   25   28   20   11    0
chisq.test(table(kc_house$waterfront, kc_house$grade))

    Pearson's Chi-squared test

data:  table(kc_house$waterfront, kc_house$grade)
X-squared = 300, df = 10, p-value <2e-16

Additional Resources