A - Setup
- Open your
baselrbootcamp
R project. It should already have the folders 1_Data
and 2_Code
. Make sure that the data files listed in the Datasets
section above are in your 1_Data
folder.
# Done!
- Open a new R script. At the top of the script, using comments, write your name and the date. Save it as a new file called
analysing_practical.R
in the 2_Code
folder.
Using library()
load the set of packages for this practical listed in the packages section above.
For this practical, we’ll use the kc_house.csv
data. This dataset contains house sale prices for King County, Washington. It includes homes sold between May 2014 and May 2015. Using the following template, load the data into R and store it as a new object called kc_house
.
kc_house <- read_csv(file = "XX")
- Using
print()
, summary()
, and head()
, explore the data to make sure it was loaded correctly.
kc_house
# A tibble: 21,613 x 21
id date price bedrooms bathrooms sqft_living
<chr> <dttm> <dbl> <int> <dbl> <int>
1 7129300520 2014-10-13 00:00:00 221900 3 1 1180
2 6414100192 2014-12-09 00:00:00 538000 3 2.25 2570
3 5631500400 2015-02-25 00:00:00 180000 2 1 770
4 2487200875 2014-12-09 00:00:00 604000 4 3 1960
5 1954400510 2015-02-18 00:00:00 510000 3 2 1680
6 7237550310 2014-05-12 00:00:00 1225000 4 4.5 5420
7 1321400060 2014-06-27 00:00:00 257500 3 2.25 1715
8 2008000270 2015-01-15 00:00:00 291850 3 1.5 1060
9 2414600126 2015-04-15 00:00:00 229500 3 1 1780
10 3793500160 2015-03-12 00:00:00 323000 3 2.5 1890
# ... with 21,603 more rows, and 15 more variables: sqft_lot <int>,
# floors <dbl>, waterfront <int>, view <int>, condition <int>,
# grade <int>, sqft_above <int>, sqft_basement <int>, yr_built <int>,
# yr_renovated <int>, zipcode <int>, lat <dbl>, long <dbl>,
# sqft_living15 <int>, sqft_lot15 <int>
summary(kc_house)
id date price
Length:21613 Min. :2014-05-02 00:00:00 Min. : 75000
Class :character 1st Qu.:2014-07-22 00:00:00 1st Qu.: 321950
Mode :character Median :2014-10-16 00:00:00 Median : 450000
Mean :2014-10-29 04:38:01 Mean : 540088
3rd Qu.:2015-02-17 00:00:00 3rd Qu.: 645000
Max. :2015-05-27 00:00:00 Max. :7700000
bedrooms bathrooms sqft_living sqft_lot
Min. : 0.0 Min. :0.00 Min. : 290 Min. : 520
1st Qu.: 3.0 1st Qu.:1.75 1st Qu.: 1427 1st Qu.: 5040
Median : 3.0 Median :2.25 Median : 1910 Median : 7618
Mean : 3.4 Mean :2.11 Mean : 2080 Mean : 15107
3rd Qu.: 4.0 3rd Qu.:2.50 3rd Qu.: 2550 3rd Qu.: 10688
Max. :33.0 Max. :8.00 Max. :13540 Max. :1651359
floors waterfront view condition
Min. :1.00 Min. :0.000 Min. :0.00 Min. :1.00
1st Qu.:1.00 1st Qu.:0.000 1st Qu.:0.00 1st Qu.:3.00
Median :1.50 Median :0.000 Median :0.00 Median :3.00
Mean :1.49 Mean :0.008 Mean :0.23 Mean :3.41
3rd Qu.:2.00 3rd Qu.:0.000 3rd Qu.:0.00 3rd Qu.:4.00
Max. :3.50 Max. :1.000 Max. :4.00 Max. :5.00
grade sqft_above sqft_basement yr_built
Min. : 1.00 Min. : 290 Min. : 0 Min. :1900
1st Qu.: 7.00 1st Qu.:1190 1st Qu.: 0 1st Qu.:1951
Median : 7.00 Median :1560 Median : 0 Median :1975
Mean : 7.66 Mean :1788 Mean : 292 Mean :1971
3rd Qu.: 8.00 3rd Qu.:2210 3rd Qu.: 560 3rd Qu.:1997
Max. :13.00 Max. :9410 Max. :4820 Max. :2015
yr_renovated zipcode lat long
Min. : 0 Min. :98001 Min. :47.2 Min. :-122
1st Qu.: 0 1st Qu.:98033 1st Qu.:47.5 1st Qu.:-122
Median : 0 Median :98065 Median :47.6 Median :-122
Mean : 84 Mean :98078 Mean :47.6 Mean :-122
3rd Qu.: 0 3rd Qu.:98118 3rd Qu.:47.7 3rd Qu.:-122
Max. :2015 Max. :98199 Max. :47.8 Max. :-121
sqft_living15 sqft_lot15
Min. : 399 Min. : 651
1st Qu.:1490 1st Qu.: 5100
Median :1840 Median : 7620
Mean :1987 Mean : 12768
3rd Qu.:2360 3rd Qu.: 10083
Max. :6210 Max. :871200
head(kc_house)
# A tibble: 6 x 21
id date price bedrooms bathrooms sqft_living sqft_lot
<chr> <dttm> <dbl> <int> <dbl> <int> <int>
1 7129… 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
2 6414… 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
3 5631… 2015-02-25 00:00:00 1.80e5 2 1 770 10000
4 2487… 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
5 1954… 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
6 7237… 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
# ... with 14 more variables: floors <dbl>, waterfront <int>, view <int>,
# condition <int>, grade <int>, sqft_above <int>, sqft_basement <int>,
# yr_built <int>, yr_renovated <int>, zipcode <int>, lat <dbl>,
# long <dbl>, sqft_living15 <int>, sqft_lot15 <int>
B - Recap
- Print the names of the
kc_house
data with names()
.
names(kc_house)
[1] "id" "date" "price" "bedrooms"
[5] "bathrooms" "sqft_living" "sqft_lot" "floors"
[9] "waterfront" "view" "condition" "grade"
[13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
[17] "zipcode" "lat" "long" "sqft_living15"
[21] "sqft_lot15"
- Change the following column names using
rename()
.
living_sqft |
sqft_living |
lot_sqft |
sqft_lot |
above_sqft |
sqft_above |
basement_sqft |
sqft_basement |
built_yr |
yr_built |
renovated_yr |
yr_renovated |
- Create new column(s)
living_sqm
, lot_sqm
, above_sqm
and basement_sqm
which show the respective room sizes in square meters rather than square feet (Hint: Multiply each by 0.093).
kc_house <- kc_house %>%
mutate(living_sqm = XXX * XXX,
lot_sqm = XXX * XXX,
XXX = XXX,
XXX = XXX)
kc_house <- kc_house %>%
mutate(living_sqm = living_sqft * 0.093,
lot_sqm = lot_sqft * 0.093,
above_sqm = above_sqft * 0.093,
basement_sqm = basement_sqft * 0.093)
- Add a new variable to the dataframe called
mansion
which is “Yes” when the sum of the house’s living, above, and basement space is above 750 square meters.
kc_house <- kc_house %>%
mutate(mansion = case_when(
living_sqm + above_sqm + basement_sqm > 750 ~ "Yes",
living_sqm + above_sqm + basement_sqm <= 750 ~ "No"))
C - Simple summaries
- Using the base-R
df$col
notation, calculate the mean price of all houses.
mean(XXX$XXX)
mean(kc_house$price)
[1] 540088
- Now, do the same using
summarise()
using the following template. Do you get the same answer? What is different about the output from summarise()
versus using the dollar sign?
kc_house %>%
summarise(
price_mean = mean(XXX)
)
kc_house %>%
summarise(
price_mean = mean(price)
)
# A tibble: 1 x 1
price_mean
<dbl>
1 540088.
- What is the median price of all houses? Use the
median()
function!
kc_house %>%
summarise(
price_median = median(price)
)
# A tibble: 1 x 1
price_median
<dbl>
1 450000
- What was the highest selling price? Use the
max()
function!
kc_house %>%
summarise(
price_max = max(price)
)
# A tibble: 1 x 1
price_max
<dbl>
1 7700000
- Using the following template, sort the data frame in descending order of price. Then, print it. Do you see the house with the highest selling price at the top?
kc_house <- kc_house %>%
arrange(desc(XXX))
kc_house
kc_house <- kc_house %>%
arrange(desc(price))
kc_house
# A tibble: 21,613 x 26
id date price bedrooms bathrooms living_sqft
<chr> <dttm> <dbl> <int> <dbl> <int>
1 6762700020 2014-10-13 00:00:00 7700000 6 8 12050
2 9808700762 2014-06-11 00:00:00 7062500 5 4.5 10040
3 9208900037 2014-09-19 00:00:00 6885000 6 7.75 9890
4 2470100110 2014-08-04 00:00:00 5570000 5 5.75 9200
5 8907500070 2015-04-13 00:00:00 5350000 5 5 8000
6 7558700030 2015-04-13 00:00:00 5300000 6 6 7390
7 1247600105 2014-10-20 00:00:00 5110800 5 5.25 8010
8 1924059029 2014-06-17 00:00:00 4668000 5 6.75 9640
9 7738500731 2014-08-15 00:00:00 4500000 5 5.5 6640
10 3835500195 2014-06-18 00:00:00 4489000 4 3 6430
# ... with 21,603 more rows, and 20 more variables: lot_sqft <int>,
# floors <dbl>, waterfront <int>, view <int>, condition <int>,
# grade <int>, above_sqft <int>, basement_sqft <int>, built_yr <int>,
# renovated_yr <int>, zipcode <int>, lat <dbl>, long <dbl>,
# sqft_living15 <int>, sqft_lot15 <int>, living_sqm <dbl>,
# lot_sqm <dbl>, above_sqm <dbl>, basement_sqm <dbl>, mansion <chr>
- What percentage of houses sold for more than 1,000,000? Let’s answer this with
summarise()
.
kc_house %>%
summarise(mil_p = mean(XXX > 1000000))
kc_house %>%
summarise(mil_p = mean(price > 1000000))
# A tibble: 1 x 1
mil_p
<dbl>
1 0.0678
- For mansions only, calculate the mean number of floors and bathrooms (hint: before summarising the data, use
filter()
to only select mansions!)
kc_house %>%
filter(mansion == "Yes") %>%
summarise(
floors_mean = mean(floors),
bathrooms_mean = mean(bathrooms)
)
# A tibble: 1 x 2
floors_mean bathrooms_mean
<dbl> <dbl>
1 1.92 3.68
D - Simple grouped summaries
- How many mansions are there? To do this, use
group_by()
to group the dataset by the mansions
column, then use the n()
function to count the number of cases.
kc_house %>%
group_by(XXX) %>%
summarise(XXX = n())
kc_house %>%
group_by(mansion) %>%
summarise(N = n())
# A tibble: 2 x 2
mansion N
<chr> <int>
1 No 20862
2 Yes 751
- What is the mean selling price of mansions versus non-mansions? To do this, just add another argument to your
summarise()
function!
kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
XXX = XXX(XXX))
kc_house %>%
group_by(mansion) %>%
summarise(N = n(),
price_mean = mean(price))
# A tibble: 2 x 3
mansion N price_mean
<chr> <int> <dbl>
1 No 20862 504024.
2 Yes 751 1541915.
- Using
group_by()
and summarise()
, create a dataframe showing the same results as the following table.
No |
20862 |
75000 |
504024 |
441000 |
3100000 |
Yes |
751 |
404000 |
1541915 |
1300000 |
7700000 |
- Do houses built in later years tend to have more living space? To answer this, group the data by
built_yr
, and then calculate the mean number of living square meters. Be sure to also include the number of houses built in each year!
kc_house %>%
group_by(built_yr) %>%
summarise(N = n(),
living = mean(living_sqm))
# A tibble: 116 x 3
built_yr N living
<int> <int> <dbl>
1 1900 87 161.
2 1901 29 164.
3 1902 27 179.
4 1903 46 140.
5 1904 45 149.
6 1905 74 183.
7 1906 92 168.
8 1907 65 177.
9 1908 86 158.
10 1909 94 177.
# ... with 106 more rows
- Was that table too big? Try using the following code to get the results for each decade rather than each year!
kc_house %>%
mutate(built_decade = floor(built_yr / 10)) %>%
group_by(built_decade) %>%
summarise(XX = XX,
XX = XX(XX))
- A friend of yours who is getting into Seattle real estate wants to know how the number of floors a house has affects its selling price. Create a table for her showing the minimum, mean, and maximum price for houses separated by the number of floors they have.
E - Multiple groups
- Your friend Brumhilda is interested in statistics on houses in the following 4 zipcodes only: 98001, 98109, 98117, 98199. Create a new dataframe called
brumhilda
that only contains data from houses in those zipcode (hint: use filter()
combined with the %in%
operator as follows:
brumhilda <- kc_house %>%
filter(XXX %in% c(XXX, XXX, XXX, XXX))
brumhilda <- kc_house %>%
filter(zipcode %in% c(98001, 98109, 98117, 98199))
- For each of the zip codes, calculate the
mean()
and median()
selling price (as well as the number of houses) in each zip code.
brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 4 x 4
zipcode price_mean price_median N
<int> <dbl> <dbl> <int>
1 98001 280805. 260000 362
2 98109 879624. 736000 109
3 98117 576795. 544000 553
4 98199 791821. 689800 317
- Now Brumhilda wants the same data separated by whether or not the house is a mansion or not. Include these results by also grouping the data by
mansion
(as well as zipcode
), and calculating the same summary statistics as before.
brumhilda %>%
group_by(zipcode, mansion) %>%
summarise(price_mean = mean(price),
price_median = median(price),
N = n())
# A tibble: 8 x 5
# Groups: zipcode [?]
zipcode mansion price_mean price_median N
<int> <chr> <dbl> <dbl> <int>
1 98001 No 277589. 260000 359
2 98001 Yes 665667. 637000 3
3 98109 No 833528. 730500 106
4 98109 Yes 2508333. 2900000 3
5 98117 No 575626. 543000 551
6 98117 Yes 898750 898750 2
7 98199 No 753625. 675000 305
8 98199 Yes 1762618. 1425000 12
- Ok that was good, but now she also wants to know what the maximum and minimum number of floors were in each group. Add these summary statistics!
brumhilda %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price),
price_median = median(price),
floors_min = min(floors),
floors_max = max(floors),
N = n())
# A tibble: 4 x 6
zipcode price_mean price_median floors_min floors_max N
<int> <dbl> <dbl> <dbl> <dbl> <int>
1 98001 280805. 260000 1 2.5 362
2 98109 879624. 736000 1 3 109
3 98117 576795. 544000 1 3 553
4 98199 791821. 689800 1 3 317
F - Statistics
- Let’s see if there is a significant difference between the selling prices of houses on the waterfront versus those not on the waterfront. To do this, we’ll conduct a t-test using the
t.test()
function and assign the result to waterfront_htest
. Fill in the XXXs in the code below, such that the dependent variable is price
and the independent variable is waterfront
waterfront_htest <- t.test(formula = XXX ~ XXX,
data = XXX)
waterfront_htest <- t.test(formula = price ~ waterfront,
data = kc_house)
- Print your
waterfront_htest
object to see a printout of the main results.
waterfront_htest
Welch Two Sample t-test
data: price by waterfront
t = -10, df = 200, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1303662 -956963
sample estimates:
mean in group 0 mean in group 1
531564 1661876
- Look at the names of your
waterfront_htest
object with names()
.
names(waterfront_htest)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
- Using the
$
, print the test statistic (statistic
) from your waterfront_htest
object.
waterfront_htest$statistic
t
-12.9
- Now using
$
, print only the p-value (p.value
) from the object.
waterfront_htest$p.value
[1] 1.38e-26
- Which of the variables
bedrooms
, bathrooms
, living_sqm
, waterfront
, lat
and long
best predict a house’s price (price
)? Answer this by conducting a regression analysis using the glm()
function and assign the result to price_glm
. Fill in the XXXs in the code below.
price_glm <- glm(formula = XXX ~ XXX + XXX + XXX + XXX + XXX + XXX,
data = XXX)
price_glm <- glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront + lat + long,
data = kc_house)
- Print your
price_glm
object. What do you see?
price_glm
Call: glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront +
lat + long, data = kc_house)
Coefficients:
(Intercept) bedrooms bathrooms living_sqm waterfront
-65680578 -46146 17422 3157 790264
lat long
675209 -275008
Degrees of Freedom: 21612 Total (i.e. Null); 21606 Residual
Null Deviance: 2.91e+15
Residual Deviance: 1.09e+15 AIC: 594000
- Look at the names of your
price_glm
object with names()
.
names(price_glm)
[1] "coefficients" "residuals" "fitted.values"
[4] "effects" "R" "rank"
[7] "qr" "family" "linear.predictors"
[10] "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights"
[16] "df.residual" "df.null" "y"
[19] "converged" "boundary" "model"
[22] "call" "formula" "terms"
[25] "data" "offset" "control"
[28] "method" "contrasts" "xlevels"
- Using
$
print the coefficients from your analysis.
price_glm$coefficients
(Intercept) bedrooms bathrooms living_sqm waterfront lat
-65680578 -46146 17422 3157 790264 675209
long
-275008
- Using the
summary()
function, show summary results from your price_glm
object.
summary(price_glm)
Call:
glm(formula = price ~ bedrooms + bathrooms + living_sqm + waterfront +
lat + long, data = kc_house)
Deviance Residuals:
Min 1Q Median 3Q Max
-1563708 -116127 -14138 84523 4180382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.57e+07 1.41e+06 -46.51 < 2e-16 ***
bedrooms -4.61e+04 2.05e+03 -22.53 < 2e-16 ***
bathrooms 1.74e+04 3.07e+03 5.67 1.4e-08 ***
living_sqm 3.16e+03 2.95e+01 107.17 < 2e-16 ***
waterfront 7.90e+05 1.79e+04 44.14 < 2e-16 ***
lat 6.75e+05 1.12e+04 60.20 < 2e-16 ***
long -2.75e+05 1.14e+04 -24.15 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 5.06e+10)
Null deviance: 2.9129e+15 on 21612 degrees of freedom
Residual deviance: 1.0943e+15 on 21606 degrees of freedom
AIC: 594064
Number of Fisher Scoring iterations: 2
- The
$residuals
element in your price_glm
object contains the residuals (difference from the predicted and true dependent variable values). Add this vector as a new column called residuals
to your kc_house
object using mutate()
kc_house <- kc_house %>%
mutate(residuals = price_glm$residuals)
- Using the following template, create a histogram of the residuals.
ggplot(data = kc_house,
aes(x = residuals)) +
geom_histogram(col = "white") +
labs(title = "Residual regression")
ggplot(data = kc_house,
aes(x = residuals)) +
geom_histogram(col = "white") +
labs(title = "Residual regression")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- What is the mean of the residuals? How do you interpret this?
kc_house %>%
summarise(resid_mean = mean(residuals))
# A tibble: 1 x 1
resid_mean
<dbl>
1 -0.000000151
- Now, create a new column called
residuals_abs
which shows the absolute value of the residuals (hint: use the abs()
function).
kc_house <- kc_house %>%
mutate(residuals_abs = abs(price_glm$residuals))
- Create a histogram of the absolute value of the residuals.
ggplot(data = kc_house,
aes(x = residuals_abs)) +
geom_histogram(col = "white") +
labs(title = "Residual regression")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- What is the mean of the absolute value of the residuals? How do you interpret this? In general, do you think you can predict the price of a house very well based on the features in your regression model?
kc_house %>%
summarise(resid_abs_mean = mean(residuals_abs))
# A tibble: 1 x 1
resid_abs_mean
<dbl>
1 144018.
# On average, the model's fitted values are off by 144,018. So the model isn't terribly accurate on average.
X - Challenges
- Which zipcode has the highest percentage of houses on the waterfront? (Hint: group by zipcode, calculate the percentage of houses on the waterfront using
mean()
, then sort the data in descending order) with arrange()
, then select the first row with slice()
. Once you find it, try searching for that zipcode on Google Maps and see if it’s location makes sense!
kc_house %>%
group_by(zipcode) %>%
summarise(waterfront_p = mean(waterfront)) %>%
arrange(desc(waterfront_p)) %>%
slice(1)
# A tibble: 1 x 2
zipcode waterfront_p
<int> <dbl>
1 98070 0.203
- Which house had the highest price to living space ratio? To answer this, create a new variable called
price_to_living
that takes price / living_sqm
. Then, sort the data in descending order of this variable, and select the first row with slice()
! What id value do you get?
kc_house %>%
mutate(price_to_living = price / living_sqm) %>%
arrange(desc(price_to_living)) %>%
slice(1)
# A tibble: 1 x 29
id date price bedrooms bathrooms living_sqft lot_sqft
<chr> <dttm> <dbl> <int> <dbl> <int> <int>
1 6021… 2015-04-07 00:00:00 874950 2 1 1080 4000
# ... with 22 more variables: floors <dbl>, waterfront <int>, view <int>,
# condition <int>, grade <int>, above_sqft <int>, basement_sqft <int>,
# built_yr <int>, renovated_yr <int>, zipcode <int>, lat <dbl>,
# long <dbl>, sqft_living15 <int>, sqft_lot15 <int>, living_sqm <dbl>,
# lot_sqm <dbl>, above_sqm <dbl>, basement_sqm <dbl>, mansion <chr>,
# residuals <dbl>, residuals_abs <dbl>, price_to_living <dbl>
- Which are the top 10 zip codes in terms of mean housing prices? To answer this, group the data by zipcode, calculate the mean price, arrange the dataset in descending order of mean price, then select the top 10 rows!
kc_house %>%
group_by(zipcode) %>%
summarise(price_mean = mean(price)) %>%
arrange(desc(price_mean)) %>%
slice(1:10)
# A tibble: 10 x 2
zipcode price_mean
<int> <dbl>
1 98039 2160607.
2 98004 1355927.
3 98040 1194230.
4 98112 1095499.
5 98102 901258.
6 98109 879624.
7 98105 862825.
8 98006 859685.
9 98119 849448.
10 98005 810165.
- Create the following dataframe exactly as it appears.
1990 |
320 |
563966 |
3640900 |
234 |
1991 |
224 |
630441 |
5300000 |
244 |
1992 |
198 |
548169 |
2480000 |
223 |
1993 |
202 |
556612 |
3120000 |
226 |
1994 |
249 |
486834 |
2880500 |
209 |
1995 |
169 |
577771 |
3200000 |
224 |
1996 |
195 |
639534 |
3100000 |
240 |
1997 |
177 |
606058 |
3800000 |
234 |
1998 |
239 |
594159 |
1960000 |
241 |
kc_house %>%
filter(built_yr >= 1990 & built_yr < 1999) %>%
group_by(built_yr) %>%
summarise(N = n(),
price_mean = mean(price),
price_max = max(price),
living_sqm_mean = mean(living_sqm)) %>%
knitr::kable(digits = 0)
1990 |
320 |
563966 |
3640900 |
234 |
1991 |
224 |
630441 |
5300000 |
244 |
1992 |
198 |
548169 |
2480000 |
223 |
1993 |
202 |
556612 |
3120000 |
226 |
1994 |
249 |
486834 |
2880500 |
209 |
1995 |
169 |
577771 |
3200000 |
224 |
1996 |
195 |
639534 |
3100000 |
240 |
1997 |
177 |
606058 |
3800000 |
234 |
1998 |
239 |
594159 |
1960000 |
241 |
- Create a regression object called
living_glm
predicting the amount of living space (living_sqm
) as a function of bedrooms
, bathrooms
, waterfront
, and built_yr
. Explore the object with names()
and summary()
. Which variables seem to predict living space?
living_glm <- glm(formula = living_sqm ~ bedrooms + bathrooms + waterfront + built_yr,
data = kc_house)
names(living_glm)
[1] "coefficients" "residuals" "fitted.values"
[4] "effects" "R" "rank"
[7] "qr" "family" "linear.predictors"
[10] "deviance" "aic" "null.deviance"
[13] "iter" "weights" "prior.weights"
[16] "df.residual" "df.null" "y"
[19] "converged" "boundary" "model"
[22] "call" "formula" "terms"
[25] "data" "offset" "control"
[28] "method" "contrasts" "xlevels"
summary(living_glm)
Call:
glm(formula = living_sqm ~ bedrooms + bathrooms + waterfront +
built_yr, data = kc_house)
Deviance Residuals:
Min 1Q Median 3Q Max
-705.7 -32.0 -4.9 25.7 566.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 219.6403 27.7140 7.93 2.4e-15 ***
bedrooms 23.1621 0.4532 51.11 < 2e-16 ***
bathrooms 71.3214 0.6294 113.31 < 2e-16 ***
waterfront 62.5121 4.1476 15.07 < 2e-16 ***
built_yr -0.1297 0.0143 -9.09 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 2750)
Null deviance: 157675161 on 21612 degrees of freedom
Residual deviance: 59420739 on 21608 degrees of freedom
AIC: 232503
Number of Fisher Scoring iterations: 2
- The
chisq.test()
function allows you to do conduct a chi square test testing the relationship between two nominal variables. Look at the help menu to see how the function works. Then, conduct a chi-square test to see if there is a relationship between whether a house is on the waterfront and the grade of the house. Do houses on the waterfront tend to have higher (or lower) grades than houses not on the waterfront?
# First look at a table
table(kc_house$waterfront, kc_house$grade)
1 3 4 5 6 7 8 9 10 11 12 13
0 1 3 29 238 2026 8958 6028 2590 1106 379 79 13
1 0 0 0 4 12 23 40 25 28 20 11 0
chisq.test(table(kc_house$waterfront, kc_house$grade))
Pearson's Chi-squared test
data: table(kc_house$waterfront, kc_house$grade)
X-squared = 300, df = 10, p-value <2e-16