R for Data Science | Basel R Bootcamp
adapted from trueloveproperty.co.uk

In this practical, you'll practice grouping and analysing data with dplyr.

By the end of this practical you will know how to:

- calculate summary statistics with summarise()
- calculate grouped summary statistics with group_by() and summarise()
- conduct simple hypothesis tests with t.test() and chisq.test()
- fit and explore regression models with glm()

Open your BaselRBootcamp R project. It should already have the folders 1_Data and 2_Code. Make sure that the data files listed in the Datasets section above are in your 1_Data folder.

# Done!

Open a new R script. At the top of the script, using comments, write your name, the date, and the name of the practical:
## NAME
## DATE
## Analysing Practical
Save it as a new file called analysing_practical.R in the 2_Code folder.
Using library(), load the tidyverse and broom packages listed in the Functions section above. If you don't have them installed yet, you'll need to install them first; see the Functions tab above for installation instructions.
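For reference, the installation commands from the Functions table are shown below; you only need to run these once per machine.

# Install the packages (only needed once)
install.packages("tidyverse")
install.packages("broom")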
# Load packages
library(tidyverse)
library(broom)
In this practical, you will work with the kc_house.csv data. This dataset contains house sale prices for King County, Washington, for homes sold between May 2014 and May 2015. Using the following template, load the data into R and store it as a new object called kc_house.

kc_house <- read_csv(file = "XX/XX")
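A filled-in version might look like the following sketch, assuming kc_house.csv sits in your 1_Data folder (adjust the path if your project is organised differently).

# Load kc_house.csv from the 1_Data folder
kc_house <- read_csv(file = "1_Data/kc_house.csv")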
Using print(), dim(), summary(), head(), and View(), explore the data to make sure it was loaded correctly.

kc_house
# A tibble: 21,613 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
7 1321… 2014-06-27 00:00:00 2.58e5 3 2.25 1715 6819
8 2008… 2015-01-15 00:00:00 2.92e5 3 1.5 1060 9711
9 2414… 2015-04-15 00:00:00 2.30e5 3 1 1780 7470
10 3793… 2015-03-12 00:00:00 3.23e5 3 2.5 1890 6560
# … with 21,603 more rows, and 14 more variables: floors <dbl>,
# waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
# sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>,
# yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
# sqft_living15 <dbl>, sqft_lot15 <dbl>
dim(kc_house)
[1] 21613 21
summary(kc_house)
id date price
Length:21613 Min. :2014-05-02 00:00:00 Min. : 75000
Class :character 1st Qu.:2014-07-22 00:00:00 1st Qu.: 321950
Mode :character Median :2014-10-16 00:00:00 Median : 450000
Mean :2014-10-29 04:38:01 Mean : 540088
3rd Qu.:2015-02-17 00:00:00 3rd Qu.: 645000
Max. :2015-05-27 00:00:00 Max. :7700000
bedrooms bathrooms sqft_living sqft_lot
Min. : 0.0 Min. :0.00 Min. : 290 Min. : 520
1st Qu.: 3.0 1st Qu.:1.75 1st Qu.: 1427 1st Qu.: 5040
Median : 3.0 Median :2.25 Median : 1910 Median : 7618
Mean : 3.4 Mean :2.11 Mean : 2080 Mean : 15107
3rd Qu.: 4.0 3rd Qu.:2.50 3rd Qu.: 2550 3rd Qu.: 10688
Max. :33.0 Max. :8.00 Max. :13540 Max. :1651359
floors waterfront view condition
Min. :1.00 Min. :0.000 Min. :0.00 Min. :1.00
1st Qu.:1.00 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3.00
Median :1.50 Median :0.000 Median :0.00 Median :3.00
Mean :1.49 Mean :0.008 Mean :0.23 Mean :3.41
3rd Qu.:2.00 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:4.00
Max. :3.50 Max. :1.000 Max. :4.00 Max. :5.00
grade sqft_above sqft_basement yr_built
Min. : 1.00 Min. : 290 Min. : 0 Min. :1900
1st Qu.: 7.00 1st Qu.:1190 1st Qu.: 0 1st Qu.:1951
Median : 7.00 Median :1560 Median : 0 Median :1975
Mean : 7.66 Mean :1788 Mean : 292 Mean :1971
3rd Qu.: 8.00 3rd Qu.:2210 3rd Qu.: 560 3rd Qu.:1997
Max. :13.00 Max. :9410 Max. :4820 Max. :2015
yr_renovated zipcode lat long
Min. : 0 Min. :98001 Min. :47.2 Min. :-122
1st Qu.: 0 1st Qu.:98033 1st Qu.:47.5 1st Qu.:-122
Median : 0 Median :98065 Median :47.6 Median :-122
Mean : 84 Mean :98078 Mean :47.6 Mean :-122
3rd Qu.: 0 3rd Qu.:98118 3rd Qu.:47.7 3rd Qu.:-122
Max. :2015 Max. :98199 Max. :47.8 Max. :-121
sqft_living15 sqft_lot15
Min. : 399 Min. : 651
1st Qu.:1490 1st Qu.: 5100
Median :1840 Median : 7620
Mean :1987 Mean : 12768
3rd Qu.:2360 3rd Qu.: 10083
Max. :6210 Max. :871200
head(kc_house)
# A tibble: 6 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
# … with 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
# condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
# yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>,
# long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>
# View(kc_house)
Look at the names of the columns in the kc_house data with names().

names(kc_house)
[1] "id" "date" "price" "bedrooms"
[5] "bathrooms" "sqft_living" "sqft_lot" "floors"
[9] "waterfront" "view" "condition" "grade"
[13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
[17] "zipcode" "lat" "long" "sqft_living15"
[21] "sqft_lot15"
Using rename(), change the following column names in the kc_house data, as shown in the table below (a sketch of one solution follows the table).

New Name | Old Name |
---|---|
living_sqft | sqft_living |
lot_sqft | sqft_lot |
above_sqft | sqft_above |
basement_sqft | sqft_basement |
built_yr | yr_built |
renovated_yr | yr_renovated |
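One way to do this (a sketch; you could also rename the columns one at a time) is to pass all of the new = old pairs to a single rename() call:

# Rename the square-footage and year columns
kc_house <- kc_house %>%
  rename(living_sqft   = sqft_living,
         lot_sqft      = sqft_lot,
         above_sqft    = sqft_above,
         basement_sqft = sqft_basement,
         built_yr      = yr_built,
         renovated_yr  = yr_renovated)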
Using mutate(), create three new columns living_sqm, above_sqm and basement_sqm which show the respective room sizes in square meters rather than square feet (Hint: multiply each by 0.093).

kc_house <- kc_house %>%
mutate(living_sqm = living_sqft * 0.093,
XXX = XXX * XXX,
XXX = XXX * XXX)
kc_house <- kc_house %>%
mutate(living_sqm = living_sqft * 0.093,
above_sqm = above_sqft * 0.093,
basement_sqm = basement_sqft * 0.093)
Using mutate(), create a new column total_sqm which is the sum of living_sqm, above_sqm and basement_sqm.

kc_house <- kc_house %>%
mutate(total_sqm = living_sqm + above_sqm + basement_sqm)
Using mutate() and case_when(), create a new column mansion which is "Yes" when total_sqm is greater than 750, and "No" when total_sqm is less than or equal to 750.

kc_house <- kc_house %>%
mutate(XXX = case_when(XXX > XXX ~ "XXX",
XXX <= XXX ~ "XXX"))
kc_house <- kc_house %>%
mutate(mansion = case_when(
total_sqm > 750 ~ "Yes",
    total_sqm <= 750 ~ "No"))
Check your new mansion column by creating a table of its values with table(). You should find 751 mansions and 20862 non-mansions!

table(kc_house$mansion)

Using df$col notation, calculate the mean price of all houses.

mean(XXX$XXX)
mean(kc_house$price)
[1] 540088
Now calculate the same mean price, but this time with summarise(), using the following template. Do you get the same answer? What is different about the output from summarise() versus using the dollar sign?

kc_house %>%
summarise(
price_mean = mean(XXX)
)
kc_house %>%
summarise(
price_mean = mean(price)
)
# A tibble: 1 x 1
price_mean
<dbl>
1 540088.
What is the median house price? Calculate it with summarise() and the median() function!

kc_house %>%
summarise(
price_median = median(price)
)
# A tibble: 1 x 1
price_median
<dbl>
1 450000
What is the highest house price? Calculate it with summarise() and the max() function!

kc_house %>%
summarise(
price_max = max(price)
)
# A tibble: 1 x 1
price_max
<dbl>
1 7700000
Using arrange(), sort the kc_house data in descending order of price so that the most expensive houses come first, then print the result. Use the following template.

kc_house <- kc_house %>%
arrange(desc(XXX))
kc_house
kc_house <- kc_house %>%
arrange(desc(price))
kc_house
# A tibble: 21,613 x 26
id date price bedrooms bathrooms living_sqft lot_sqft
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6762… 2014-10-13 00:00:00 7.70e6 6 8 12050 27600
2 9808… 2014-06-11 00:00:00 7.06e6 5 4.5 10040 37325
3 9208… 2014-09-19 00:00:00 6.88e6 6 7.75 9890 31374
4 2470… 2014-08-04 00:00:00 5.57e6 5 5.75 9200 35069
5 8907… 2015-04-13 00:00:00 5.35e6 5 5 8000 23985
6 7558… 2015-04-13 00:00:00 5.30e6 6 6 7390 24829
7 1247… 2014-10-20 00:00:00 5.11e6 5 5.25 8010 45517
8 1924… 2014-06-17 00:00:00 4.67e6 5 6.75 9640 13068
9 7738… 2014-08-15 00:00:00 4.50e6 5 5.5 6640 40014
10 3835… 2014-06-18 00:00:00 4.49e6 4 3 6430 27517
# … with 21,603 more rows, and 19 more variables: floors <dbl>,
# waterfront <dbl>, view <dbl>, condition <dbl>, grade <dbl>,
# above_sqft <dbl>, basement_sqft <dbl>, built_yr <dbl>,
# renovated_yr <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
# sqft_living15 <dbl>, sqft_lot15 <dbl>, living_sqm <dbl>,
# above_sqm <dbl>, basement_sqm <dbl>, total_sqm <dbl>, mansion <chr>
What proportion of houses sold for more than 1,000,000? Calculate it with summarise() using the following template.

kc_house %>%
summarise(mil_p = mean(XXX > 1000000))
kc_house %>%
summarise(mil_p = mean(price > 1000000))
# A tibble: 1 x 1
mil_p
<dbl>
1 0.0678
What is the mean number of floors and bathrooms among mansions? (Hint: use filter() to only select mansions!)

kc_house %>%
filter(mansion == "Yes") %>%
summarise(
floors_mean = mean(floors),
bathrooms_mean = mean(bathrooms)
)
# A tibble: 1 x 2
floors_mean bathrooms_mean
<dbl> <dbl>
1 1.92 3.68
How many mansions and non-mansions are there? To answer this, use group_by() to group the dataset by the mansion column, then use the n() function to count the number of cases.

kc_house %>%
group_by(XXX) %>%
summarise(XXX = n())
kc_house %>%
group_by(mansion) %>%
summarise(N = n())
# A tibble: 2 x 2
mansion N
<chr> <int>
1 No 20862
2 Yes 751
Now calculate the mean price of mansions and non-mansions by adding another argument to the summarise() function!

kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
XXX = XXX(XXX))
kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
price_mean = mean(price))
# A tibble: 2 x 3
mansion N price_mean
<chr> <int> <dbl>
1 No 20862 504024.
2 Yes 751 1541915.
Using group_by() and summarise(), create a dataframe showing the same results as the following table (one possible solution is sketched after the table).

mansion | N | price_min | price_mean | price_median | price_max |
---|---|---|---|---|---|
No | 20862 | 75000 | 504024 | 441000 | 3100000 |
Yes | 751 | 404000 | 1541915 | 1300000 | 7700000 |
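One way to reproduce this table (a sketch; your code may differ) is to group by mansion and compute each statistic inside a single summarise() call:

# Summary statistics of price, grouped by mansion status
kc_house %>%
  group_by(mansion) %>%
  summarise(N = n(),
            price_min = min(price),
            price_mean = mean(price),
            price_median = median(price),
            price_max = max(price))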
Group the kc_house data by built_yr, and then calculate the mean number of living square meters. Be sure to also include the number of houses built in each year!

kc_house %>%
group_by(built_yr) %>%
summarise(N = n(),
living = mean(living_sqm))
# A tibble: 116 x 3
built_yr N living
<dbl> <int> <dbl>
1 1900 87 161.
2 1901 29 164.
3 1902 27 179.
4 1903 46 140.
5 1904 45 149.
6 1905 74 183.
7 1906 92 168.
8 1907 65 177.
9 1908 86 158.
10 1909 94 177.
# … with 106 more rows
Now repeat the analysis, but this time group the houses by the decade in which they were built, using the following template (a filled-in version is sketched after the template).

kc_house %>%
mutate(built_decade = floor(built_yr / 10)) %>%
group_by(built_decade) %>%
summarise(XX = XX,
XX = XX(XX))
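Filling in the XXs, one possible version looks like this (a sketch; note that floor(built_yr / 10) returns values like 199 and 201, which still identify the decades):

# Mean living space by decade of construction
kc_house %>%
  mutate(built_decade = floor(built_yr / 10)) %>%
  group_by(built_decade) %>%
  summarise(N = n(),
            living_sqm_mean = mean(living_sqm))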
Create a new dataframe called brumhilda that only contains data from houses in the zip codes 98001, 98109, 98117 and 98199 (hint: use filter() combined with the %in% operator) as follows:

brumhilda <- kc_house %>%
filter(XXX %in% c(XXX, XXX, XXX, XXX))
brumhilda <- kc_house %>%
filter(zipcode %in% c(98001, 98109, 98117, 98199))
Calculate the mean() and median() selling price (as well as the number of houses) in each of these zip codes.

brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 4 x 4
zipcode price_mean price_median N
<dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 362
2 98109 879624. 736000 109
3 98117 576795. 544000 553
4 98199 791821. 689800 317
Now repeat this analysis, but this time also group by mansion (as well as zipcode), calculating the same summary statistics as before.

brumhilda %>%
group_by(zipcode, mansion) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 8 x 5
# Groups: zipcode [4]
zipcode mansion price_mean price_median N
<dbl> <chr> <dbl> <dbl> <int>
1 98001 No 277589. 260000 359
2 98001 Yes 665667. 637000 3
3 98109 No 833528. 730500 106
4 98109 Yes 2508333. 2900000 3
5 98117 No 575626. 543000 551
6 98117 Yes 898750 898750 2
7 98199 No 753625. 675000 305
8 98199 Yes 1762618. 1425000 12
Now extend the zip code summary to also include the minimum and maximum number of floors of houses in each zip code.

brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
floors_min = min(floors),
floors_max = max(floors),
N = n())
# A tibble: 4 x 6
zipcode price_mean price_median floors_min floors_max N
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 1 2.5 362
2 98109 879624. 736000 1 3 109
3 98117 576795. 544000 1 3 553
4 98199 791821. 689800 1 3 317
Do houses on the waterfront sell for different prices than houses not on the waterfront? Test this using the t.test() function and assign the result to waterfront_htest. Fill in the XXXs in the code below, such that the dependent variable is price and the independent variable is waterfront.

waterfront_htest <- t.test(formula = XXX ~ XXX,
data = XXX)
waterfront_htest <- t.test(formula = price ~ waterfront,
data = kc_house)
Print your waterfront_htest object to see a printout of the main results.

waterfront_htest
Welch Two Sample t-test
data: price by waterfront
t = -10, df = 200, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1303662 -956963
sample estimates:
mean in group 0 mean in group 1
531564 1661876
Look at all of the named elements in your waterfront_htest object with names().

names(waterfront_htest)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
Using $, print the test statistic (statistic) from your waterfront_htest object.

waterfront_htest$statistic
t
-12.9
Using $, print only the p-value (p.value) from the object.

waterfront_htest$p.value
[1] 1.38e-26
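Because the broom package is loaded, you can also collect these pieces in one place. A minimal sketch: broom's tidy() function returns the group means, test statistic, p-value, and confidence interval of the t-test as a single one-row dataframe.

# Collect the t-test results in a tidy dataframe
tidy(waterfront_htest)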
Which of the variables bedrooms, bathrooms, living_sqm, waterfront, lat and long best predict a house's price (price)? Answer this by conducting a regression analysis using the glm() function and assign the result to price_glm. Fill in the XXXs in the code below.

price_glm <- glm(formula = XXX ~ XXX + XXX + XXX + XXX + XXX + XXX,
data = XXX)
price_glm <- glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront + lat + long,
data = kc_house)
Print your price_glm object. What do you see?

price_glm
Call: glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront +
lat + long, data = kc_house)
Coefficients:
(Intercept) bedrooms bathrooms living_sqm waterfront
-65680578 -46146 17422 3157 790264
lat long
675209 -275008
Degrees of Freedom: 21612 Total (i.e. Null); 21606 Residual
Null Deviance: 2.91e+15
Residual Deviance: 1.09e+15 AIC: 594000
Look at all of the named elements in your price_glm object with names().

names(price_glm)
[1] "coefficients" "residuals" "fitted.values"
[4] "effects" "R" "rank"
[7] "qr" "family" "linear.predictors"
[10] "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights"
[16] "df.residual" "df.null" "y"
[19] "converged" "boundary" "model"
[22] "call" "formula" "terms"
[25] "data" "offset" "control"
[28] "method" "contrasts" "xlevels"
Using $, print the coefficients from your analysis.

price_glm$coefficients
(Intercept) bedrooms bathrooms living_sqm waterfront lat
-65680578 -46146 17422 3157 790264 675209
long
-275008
Using the summary() function, show summary results from your price_glm object.

summary(price_glm)
Call:
glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront +
lat + long, data = kc_house)
Deviance Residuals:
Min 1Q Median 3Q Max
-1563708 -116127 -14138 84523 4180382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.57e+07 1.41e+06 -46.51 < 2e-16 ***
bedrooms -4.61e+04 2.05e+03 -22.53 < 2e-16 ***
bathrooms 1.74e+04 3.07e+03 5.67 1.4e-08 ***
living_sqm 3.16e+03 2.95e+01 107.17 < 2e-16 ***
waterfront 7.90e+05 1.79e+04 44.14 < 2e-16 ***
lat 6.75e+05 1.12e+04 60.20 < 2e-16 ***
long -2.75e+05 1.14e+04 -24.15 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 5.06e+10)
Null deviance: 2.9129e+15 on 21612 degrees of freedom
Residual deviance: 1.0943e+15 on 21606 degrees of freedom
AIC: 594064
Number of Fisher Scoring iterations: 2
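As with the t-test, the broom and rsq packages (both listed in the Functions section) give compact summaries of a regression object. A minimal sketch, assuming rsq is installed:

# Coefficient table as a tidy dataframe
tidy(price_glm)

# R-squared of the model (requires the rsq package)
library(rsq)
rsq(price_glm)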
The $residuals element in your price_glm object contains the residuals: the differences between the model's predicted values and the true values of the dependent variable. Add this vector as a new column called residuals to your kc_house object using mutate().
kc_house <- kc_house %>%
mutate(residuals = price_glm$residuals)
Plot a histogram of the residuals with the following ggplot2 code. What does the distribution of residuals look like?

ggplot(data = kc_house,
aes(x = residuals)) +
geom_histogram(col = "white") +
labs(title = "Residual regression")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What is the mean of the residuals? Calculate it with summarise().

kc_house %>%
summarise(resid_mean = mean(residuals))
# A tibble: 1 x 1
resid_mean
<dbl>
1 -0.000000151
Create a new column called residuals_abs which shows the absolute value of the residuals (hint: use the abs() function), and plot a histogram of the new column.

kc_house <- kc_house %>%
mutate(residuals_abs = abs(price_glm$residuals))
ggplot(data = kc_house,
aes(x = residuals_abs)) +
geom_histogram(col = "white") +
labs(title = "Residual regression")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What is the mean of the absolute residuals? What does this tell you about how far off the model's predictions are, on average?

kc_house %>%
summarise(resid_abs_mean = mean(residuals_abs))
# A tibble: 1 x 1
resid_abs_mean
<dbl>
1 144018.
# On average, the model's fitted values are off by 144,018. So the model isn't terribly accurate on average.
Which zip code has the highest proportion of houses on the waterfront? (Hint: group the data by zipcode, calculate the proportion with mean(), then sort the data in descending order with arrange(), then select the first row with slice().) Once you find it, try searching for that zipcode on Google Maps and see if its location makes sense!

kc_house %>%
group_by(zipcode) %>%
summarise(waterfront_p = mean(waterfront)) %>%
arrange(desc(waterfront_p)) %>%
slice(1)
# A tibble: 1 x 2
zipcode waterfront_p
<dbl> <dbl>
1 98070 0.203
Which house has the highest price relative to its living space? To find out, create a new column called price_to_living that takes price / living_sqm. Then, sort the data in descending order of this variable, and select the first row with slice()! What id value do you get?

kc_house %>%
mutate(price_to_living = price / living_sqm) %>%
arrange(desc(price_to_living)) %>%
slice(1)
# A tibble: 1 x 29
id date price bedrooms bathrooms living_sqft lot_sqft
<chr> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6021… 2015-04-07 00:00:00 874950 2 1 1080 4000
# … with 22 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
# condition <dbl>, grade <dbl>, above_sqft <dbl>, basement_sqft <dbl>,
# built_yr <dbl>, renovated_yr <dbl>, zipcode <dbl>, lat <dbl>,
# long <dbl>, sqft_living15 <dbl>, sqft_lot15 <dbl>, living_sqm <dbl>,
# above_sqm <dbl>, basement_sqm <dbl>, total_sqm <dbl>, mansion <chr>,
# residuals <dbl>, residuals_abs <dbl>, price_to_living <dbl>
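The printed tibble truncates the id column. If you want to see the full id value, one option (a sketch using dplyr's pull(), which is not otherwise used in this practical) is:

# Extract only the id of the top row
kc_house %>%
  mutate(price_to_living = price / living_sqm) %>%
  arrange(desc(price_to_living)) %>%
  slice(1) %>%
  pull(id)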
Which 10 zip codes have the highest mean selling prices? Use group_by(), summarise(), arrange() and slice() to find out.

kc_house %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price)) %>%
arrange(desc(price_mean)) %>%
slice(1:10)
# A tibble: 10 x 2
zipcode price_mean
<dbl> <dbl>
1 98039 2160607.
2 98004 1355927.
3 98040 1194230.
4 98112 1095499.
5 98102 901258.
6 98109 879624.
7 98105 862825.
8 98006 859685.
9 98119 849448.
10 98005 810165.
Using filter(), group_by(), summarise() and knitr::kable(), create a table of houses built in the years 1990 through 1998 that matches the following output.

built_yr | N | price_mean | price_max | living_sqm_mean |
---|---|---|---|---|
1990 | 320 | 563966 | 3640900 | 234 |
1991 | 224 | 630441 | 5300000 | 244 |
1992 | 198 | 548169 | 2480000 | 223 |
1993 | 202 | 556612 | 3120000 | 226 |
1994 | 249 | 486834 | 2880500 | 209 |
1995 | 169 | 577771 | 3200000 | 224 |
1996 | 195 | 639534 | 3100000 | 240 |
1997 | 177 | 606058 | 3800000 | 234 |
1998 | 239 | 594159 | 1960000 | 241 |
kc_house %>%
filter(built_yr >= 1990 & built_yr < 1999) %>%
group_by(built_yr) %>%
summarise(N = n(),
price_mean = mean(price),
price_max = max(price),
living_sqm_mean = mean(living_sqm)) %>%
knitr::kable(digits = 0)
built_yr | N | price_mean | price_max | living_sqm_mean |
---|---|---|---|---|
1990 | 320 | 563966 | 3640900 | 234 |
1991 | 224 | 630441 | 5300000 | 244 |
1992 | 198 | 548169 | 2480000 | 223 |
1993 | 202 | 556612 | 3120000 | 226 |
1994 | 249 | 486834 | 2880500 | 209 |
1995 | 169 | 577771 | 3200000 | 224 |
1996 | 195 | 639534 | 3100000 | 240 |
1997 | 177 | 606058 | 3800000 | 234 |
1998 | 239 | 594159 | 1960000 | 241 |
Conduct a regression analysis called living_glm predicting the amount of living space (living_sqm) as a function of bedrooms, bathrooms, waterfront, and built_yr. Explore the object with names() and summary(). Which variables seem to predict living space?

living_glm <- glm(formula = living_sqm ~ bedrooms + bathrooms + waterfront + built_yr,
data = kc_house)
names(living_glm)
[1] "coefficients" "residuals" "fitted.values"
[4] "effects" "R" "rank"
[7] "qr" "family" "linear.predictors"
[10] "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights"
[16] "df.residual" "df.null" "y"
[19] "converged" "boundary" "model"
[22] "call" "formula" "terms"
[25] "data" "offset" "control"
[28] "method" "contrasts" "xlevels"
summary(living_glm)
Call:
glm(formula = living_sqm ~ bedrooms + bathrooms + waterfront +
built_yr, data = kc_house)
Deviance Residuals:
Min 1Q Median 3Q Max
-705.7 -32.0 -4.9 25.7 566.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 219.6403 27.7140 7.93 2.4e-15 ***
bedrooms 23.1621 0.4532 51.11 < 2e-16 ***
bathrooms 71.3214 0.6294 113.31 < 2e-16 ***
waterfront 62.5121 4.1476 15.07 < 2e-16 ***
built_yr -0.1297 0.0143 -9.09 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 2750)
Null deviance: 157675161 on 21612 degrees of freedom
Residual deviance: 59420739 on 21608 degrees of freedom
AIC: 232503
Number of Fisher Scoring iterations: 2
The chisq.test() function allows you to conduct a chi-squared test of the relationship between two nominal variables. Look at the help menu to see how the function works. Then, conduct a chi-squared test to see if there is a relationship between whether a house is on the waterfront and the grade of the house. Do houses on the waterfront tend to have higher (or lower) grades than houses not on the waterfront?

# First look at a table
table(kc_house$waterfront, kc_house$grade)
1 3 4 5 6 7 8 9 10 11 12 13
0 1 3 29 238 2026 8958 6028 2590 1106 379 79 13
1 0 0 0 4 12 23 40 25 28 20 11 0
chisq.test(table(kc_house$waterfront, kc_house$grade))
Pearson's Chi-squared test
data: table(kc_house$waterfront, kc_house$grade)
X-squared = 300, df = 10, p-value <2e-16
Examples

# Analysing with dplyr and tidyr ---------------------------
library(tidyverse) # Load tidyverse for dplyr
library(broom) # For tidy results
library(rsq) # For r-squared values
# Load baselers data
baselers <- read_csv("1_Data/baselers.txt")
# No grouping variables
bas <- baselers %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
# One grouping variable
bas_sex <- baselers %>%
group_by(sex) %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
bas_sex
# Two grouping variables
bas_sex_ed <- baselers %>%
group_by(sex, education) %>%
summarise(
age_mean = mean(age, na.rm = TRUE),
income_median = median(income, na.rm = TRUE),
N = n()
)
# Advanced scoping
# Calculate mean of ALL numeric variables
baselers %>%
group_by(sex, education) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
# Examples of hypothesis tests on the diamonds -------------
# First few rows of the diamonds data
diamonds
# 2-sample t-test ---------------------------
# Q: Is there a difference in the carats of color = E and color = I diamonds?
htest_B <- t.test(formula = carat ~ color, # DV ~ IV
alternative = "two.sided", # Two-sided test
data = diamonds, # The data
subset = color %in% c("E", "I")) # Only compare colors E and I
htest_B # Print result
# Regression ----------------------------
# Q: Create regression equation predicting price by carat, depth, table, and x
price_glm <- glm(formula = price ~ carat + depth + table + x,
data = diamonds)
# Print coefficients
price_glm$coefficients
# Tidy version
tidy(price_glm)
# Extract R-Squared
rsq(price_glm)
Datasets

File | Rows | Columns | Description |
---|---|---|---|
kc_house.csv | 21613 | 21 | House sale prices for King County between May 2014 and May 2015. |
First 5 rows and 5 columns of kc_house.csv
id | date | price | bedrooms | bathrooms |
---|---|---|---|---|
6762700020 | 2014-10-13 | 7700000 | 6 | 8.00 |
9808700762 | 2014-06-11 | 7062500 | 5 | 4.50 |
9208900037 | 2014-09-19 | 6885000 | 6 | 7.75 |
2470100110 | 2014-08-04 | 5570000 | 5 | 5.75 |
8907500070 | 2015-04-13 | 5350000 | 5 | 5.00 |
Functions

Packages

Package | Installation | Description |
---|---|---|
tidyverse | install.packages("tidyverse") | A collection of packages for data science, including dplyr, ggplot2 and others |
broom | install.packages("broom") | Provides 'tidy' outputs from statistical objects. |
rsq | install.packages("rsq") | Returns r-squared statistics from glm objects. |
Wrangling
Function | Package | Description |
---|---|---|
rename() | dplyr | Rename columns |
select() | dplyr | Select columns based on name or index |
filter() | dplyr | Select rows based on some logical criteria |
arrange() | dplyr | Sort rows |
mutate() | dplyr | Add new columns |
case_when() | dplyr | Recode values of a column |
group_by(), summarise() | dplyr | Group data and then calculate summary statistics |
left_join() | dplyr | Combine multiple data sets using a key column |
Statistical Tests
Function | Package | Hypothesis Test |
---|---|---|
glm(), lm() | base | Generalized linear model and linear model |
t.test() | base | One and two sample t-tests |
wilcox.test() | base | Wilcoxon test |
cor.test() | base | Correlation test |
chisq.test() | base | Chi-square test for frequency tables |
Other
Function | Package | Description |
---|---|---|
tidy() | broom | Converts statistical objects to 'tidy' dataframes of key results |
rsq() | rsq | Returns r-squared statistics from glm objects |
dplyr vignette: see the full dplyr vignette from RStudio, with lots of wrangling tips and tricks.