Zillow is the most-visited real estate website in the United States which offer customers services of selling, buying, and renting homes. It is essential for Zillow to predict housing prices to improve the level of service. In this project, we are going to build a predictive model of home prices to help Zillow forecast housing market in Boulder County, CO. The goal is to train a model with recent transactions and test whether the model is accurate and generalizable enough to predict the home price of properties that have not recently sold.
Firstly, we would collect data covering houses’ internal characteristic, their nearby public services, information about places they are in and so on. After compiling the data into one dataset, we would do some data wrangling, variables transformation, and independent variables selection. In the next step, we would develop a model with OLS regression. Lastly, we will assess whether our model’s accuracy and generalizability with methods including cross validation, spatial lag error and Moran’s I.
Data collection is one of inconvenient difficulties. Firstly, some useful data might be inaccessible due to the concern of security, privacy, and the cost of collection. Secondly, there might be some errors in data we collect, especially those from unofficial departments like OpenStreetMap. Thirdly, we might subjectively omit datasets which are important in predicting. These kinds of situation would all lead to decrease of our model’s efficiency. Besides, methods that would be used to training data in our research has its inherent limitation, which will also have an influence on successful machine learning.
Our result shows, we trained a model with a moderate accuracy, with an overall MAPE of 0.19 and R-square of 0.75. Despite some limitation in generalizability, our model can act as an reference when studying the house price in boulder county.
There are three main sources of data. First one is open boulder data, which is an official website offering many sorts of city data. It provides dataset including traffic stations, fire stations, and police offenses. The second one is Boulder’s census tract data which can be accessed with ‘tidycensus’ package in r. we also scrape school data, hospital data, transit data, and commercial data from open street map. It is noteworthy that since people can edit OpenStreetMap collaboratively, so it may lead to incomplete and wrong data.
# transform variables
st_c <- st_coordinates
boulder_data <-
boulder_data %>%
mutate(
fire_stations_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(fire_stations)), 3),
commercial_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(commercial))[,c(1,2)], 3),
traffic_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(traffic_stations)), 3),
schools_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(schools))[,c(1,2)], 3),
crime_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(crimes)[,c(1,2)]), 3)
)
# load api key
census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
# variance of interest in census data
census_vars <- c( "B01001_001E", # ACS total Pop estimate
"B25002_001E", # Estimate of total housing units
"B25002_003E", # Number of vacant housing units
"B19013_001E", # Median HH Income ($)
"B02001_002E", # People describing themselves as "white alone"
"B06009_006E", # Total graduate or professional degree
"B06012_002E", # Number of below 100 percent of the poverty level
"B23025_002E", # Number of people in labor force
"B23025_005E") # Number of people in labor force unemployed
# get access to census data of 2019
acsTractsBD19 <- get_acs(geography = "tract",
year = 2019,
variables = census_vars,
geometry = TRUE,
state = "Colorado",
county = "Boulder",
output = "wide")
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# drop extra data columns
acsTractsBD19 <- acsTractsBD19[-c(4,6,8,10,12,14,16,18,20)]
# rename columns
acsTractsBD19 <- acsTractsBD19 %>%
rename (total_pop = B01001_001E,
total_hu = B25002_001E,
total_vacant = B25002_003E,
med_hh_income = B19013_001E,
total_white = B02001_002E,
total_graddeg = B06009_006E,
total_belpov100 = B06012_002E,
total_labor_f = B23025_002E,
total_unemployed = B23025_005E)
# add new columns
acsTractsBD19 <- acsTractsBD19 %>%
mutate(pctvacancy = total_vacant/total_hu,
pctwhite = total_white/total_pop,
pctunemployed = total_unemployed/total_labor_f,
pctedu = total_graddeg/total_pop) %>%
dplyr::select(-total_vacant,-total_hu,-total_white,-total_unemployed,-total_labor_f,-total_graddeg) %>%
st_transform('ESRI:102254')
# add census data to boulder data
boulder_data <- st_join(boulder_data, acsTractsBD19, join = st_intersects)
training_data <-
filter(boulder_data,toPredict == 0) %>%
dplyr::select(-toPredict)
testing_data <-
filter(boulder_data,toPredict == 1) %>%
dplyr::select(-toPredict)
In our analysis, there are four kinds of variables in our analysis.
houses’ internal characteristics: “saleDate”,“year”,“year_quarter”,“address”,“bld_num”,“section_num”,“designCode”,“designCodeDscr”, “qualityCode”,“qualityCodeDscr”,“bldgClass”,bldgClassDscr“,”ConstCode“,”ConstCodeDscr“,”builtYear“,”CompCode“,”EffectiveYear"“bsmtSF”,“bsmtType”,“bsmtTypeDscr”,“carStorageSF”,“carStorageType”, “carStorageTypeDscr”,“nbrBedRoom”,“nbrRoomsNobath”,“mainfloorSF”,“nbrThreeQtrBaths”,“nbrFullBaths”, “nbrHalfBaths”,“TotalFinishedSF”,“Ac”,“AcDscr”,“Heating”,“HeatingDscr”,“ExtWallPrim”,“ExtWallDscrPrim”,
“ExtWallSec”,“ExtWallDscrSec”,“IntWall”,“IntWallDscr”,“Roof_Cover”,“Roof_CoverDscr”,“Stories”,
“UnitCount”,“status_cd”,“MUSA_ID”
public amenities near houses: We can’t measure the exposure of each house to public services directly, so we would calculate the average distance from each home sale to its three nearest sepecific public service with ‘nn_function’. “fire_stations_nn3”,“commercial_nn3”,“traffic_nn3”,“schools_nn3”,“crime_nn3”
spatial structure: According to the thumb rule that those which are near to each other are more related when compared to those that are further away, we added spatial structure type of data. “GEOID”,“NAME”
the socioeconomical data on the census tracts in which each house is located in: “total_pop”,“total_belpov100”,“pctvacancy”,“pctwhite”, “pctunemployed”,“pctedu”
Firstly, we develop a statistic summary table of variables.
stargazer(st_drop_geometry(training_data),
type = 'text',
title = "Summary Statistics")
##
## Summary Statistics
## ==========================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ------------------------------------------------------------------------------------------
## price 11,264 749,112.900 616,228.300 10,000 452,700 831,580.8 31,500,000
## year 11,264 2,019.791 0.729 2,019 2,019 2,020 2,021
## bld_num 11,264 1.008 0.137 1 1 1 9
## section_num 11,264 1.009 0.140 1 1 1 9
## qualityCode 11,264 37.301 7.864 10 30 41 80
## bldgClass 11,264 1,212.000 0.000 1,212 1,212 1,212 1,212
## ConstCode 11,264 310.592 18.436 0 310 310 420
## builtYear 11,264 1,987.169 26.797 1,860 1,972 2,006 2,021
## CompCode 11,264 0.988 0.096 0 1 1 1
## EffectiveYear 11,264 1,997.698 17.475 1,879 1,989 2,010 2,021
## bsmtSF 11,264 744.117 627.701 0 135.5 1,126 4,534
## carStorageSF 11,264 471.566 235.759 0 380 600 3,040
## nbrBedRoom 11,264 3.444 1.051 0 3 4 24
## nbrRoomsNobath 11,264 7.390 2.136 0 6 9 28
## mainfloorSF 11,264 1,313.107 607.499 0 922.8 1,619 7,795
## nbrThreeQtrBaths 11,264 0.617 0.730 0 0 1 9
## nbrFullBaths 11,264 1.781 0.872 0 1 2 12
## nbrHalfBaths 11,264 0.555 0.540 0 0 1 3
## TotalFinishedSF 11,264 1,954.525 891.631 0 1,276 2,443 10,188
## Ac 7,423 210.116 1.261 200.000 210.000 210.000 220.000
## Heating 11,160 814.046 14.791 800.000 810.000 810.000 1,000.000
## ExtWallPrim 11,260 46.238 49.098 0.000 10.000 40.000 140.000
## ExtWallSec 3,418 529.596 39.341 500.000 500.000 530.000 630.000
## IntWall 11,186 1,130.171 4.769 1,110.000 1,130.000 1,130.000 1,160.000
## Roof_Cover 8,031 1,214.176 14.994 1,210.000 1,210.000 1,210.000 1,290.000
## Stories 4 1.000 0.000 1.000 1.000 1.000 1.000
## UnitCount 4 2.500 1.000 2.000 2.000 2.500 4.000
## MUSA_ID 11,264 5,680.483 3,281.191 1 2,833.8 8,522.2 11,364
## fire_stations_nn3 11,264 12,072.240 6,412.017 1,367.279 7,671.262 17,222.100 34,142.450
## commercial_nn3 11,264 1,756.156 2,499.654 24.553 624.674 1,745.199 23,878.360
## traffic_nn3 11,264 2,165.635 1,074.113 101.082 1,472.874 2,703.550 6,870.880
## schools_nn3 11,264 1,776.347 3,390.153 30.079 514.459 1,536.743 25,693.050
## crime_nn3 11,264 7,602.171 4,902.370 15.768 4,324.914 10,879.090 30,610.350
## total_pop 11,264 6,122.057 2,911.499 867 4,089 7,365 12,923
## med_hh_income 11,264 97,710.940 25,115.450 23,699 82,068.5 120,717 173,500
## total_belpov100 11,264 483.164 394.325 40 227 724 3,202
## pctvacancy 11,264 0.055 0.081 0.000 0.014 0.063 0.713
## pctwhite 11,264 0.897 0.047 0.749 0.877 0.928 0.978
## pctunemployed 11,264 0.040 0.016 0.007 0.031 0.052 0.096
## pctedu 11,264 0.191 0.074 0.025 0.136 0.238 0.383
## ------------------------------------------------------------------------------------------
Based on summary statistic table, Stories and UnitCount which have only 4 observations are dropped for their low-impact number.
bldgClass variable has the same value in every observation, so it has no influence on our model.
There are also several parings of data, one represents the code while the other is for description.We only remain the description one.
boulder_data <- boulder_data %>%
dplyr::select(-Stories, -UnitCount, -bldgClass, -bldgClassDscr,
-designCode, -qualityCode, -ConstCode,
-carStorageType, -bsmtType, -Ac,
-Heating, -IntWall, -Roof_Cover, -ExtWallSec,
-ExtWallPrim, -status_cd)
training_data <-
filter(boulder_data,toPredict == 0) %>%
dplyr::select(-toPredict)
testing_data <-
filter(boulder_data,toPredict == 1) %>%
dplyr::select(-toPredict)
We develop histograms of all numeric variables. For variables that are not normally distributed, we would log-transform them, which is helpful to train a greater model.
# select numeric variables
numericVars <-
select_if(st_drop_geometry(training_data), is.numeric) %>% na.omit()
# plot histograms of numeric variables
ggplot(gather(numericVars), aes(value)) +
geom_histogram(bins = 50) +
facet_wrap(~key, scales = 'free_x')
# log tranform some variables.
boulder_data <-
boulder_data %>%
mutate(
lncommercial_nn3 = log(commercial_nn3),
lnpctedu = log(pctedu),
lnpctunemployed = log(pctunemployed),
lnpctwhite = log(pctwhite),
lnschools_nn3 = log(schools_nn3),
lntotal_belpov100 = log(total_belpov100),
lntotal_pop = log(total_pop)
) %>%
dplyr::select(-commercial_nn3, -pctedu,
-pctunemployed, -pctwhite,
-schools_nn3, -total_belpov100, -total_pop)
training_data <-
filter(boulder_data,toPredict == 0) %>%
dplyr::select(-toPredict)
testing_data <-
filter(boulder_data,toPredict == 1) %>%
dplyr::select(-toPredict)
The below graph is a correlation matrix of numeric variables. The darker the color, the higher the r correlation, and the closely two variables are correlated.
ggcorrplot(
round(cor(numericVars), 1),
p.mat = cor_pmat(numericVars),
colors = c("#25CB10", "white", "#FA7800"),
type="lower",
show.diag = TRUE,
insig = "blank") +
labs(title = "Correlation across numeric variables") +
theme(plot.title = element_text(hjust = 0.5)) +
plotTheme()
To represent the correlation between variables more explicitly, we make a correlation matrix in the type of table.
cormax <- cor(numericVars)
cormax.df <- as.data.frame(t(cormax))
cormax.df
## price year bld_num section_num
## price 1.000000000 0.111253820 0.0888673114 0.090224206
## year 0.111253820 1.000000000 -0.0053461517 -0.006970878
## bld_num 0.088867311 -0.005346152 1.0000000000 0.951453969
## section_num 0.090224206 -0.006970878 0.9514539695 1.000000000
## builtYear -0.047831765 -0.040693891 -0.0375865251 -0.048836741
## CompCode 0.001904995 -0.124878234 -0.0024582397 0.002978419
## EffectiveYear 0.083457052 -0.121552435 -0.0455018549 -0.054556577
## bsmtSF 0.183016210 -0.007669119 -0.0619467986 -0.066863293
## carStorageSF 0.224488400 -0.005520495 -0.0377596597 -0.044517786
## nbrBedRoom 0.267387067 -0.038135723 -0.1128380364 -0.122073670
## nbrRoomsNobath 0.354603271 -0.037689118 -0.1076506926 -0.117248170
## mainfloorSF 0.381421772 -0.003643487 -0.0525176835 -0.056852125
## nbrThreeQtrBaths 0.225803590 -0.002381863 -0.0284590683 -0.032438552
## nbrFullBaths 0.207105343 -0.033200198 -0.0727423408 -0.078814287
## nbrHalfBaths 0.154215592 -0.010019878 -0.0556786509 -0.058062627
## TotalFinishedSF 0.434325553 -0.012742931 -0.0649334034 -0.071723745
## MUSA_ID 0.005416397 -0.007824222 -0.0039809684 -0.002420071
## fire_stations_nn3 -0.428148834 -0.026190646 -0.0008059242 0.003609749
## commercial_nn3 -0.046209931 -0.023648979 0.0078526482 0.015048607
## traffic_nn3 -0.152217255 -0.006343534 -0.0153601223 -0.015950689
## schools_nn3 -0.043480929 -0.004061844 0.0107329735 0.017493709
## crime_nn3 -0.403490027 -0.030238350 -0.0012866508 0.003491714
## total_pop -0.081133841 -0.035278751 -0.0193047037 -0.019570450
## med_hh_income 0.157113248 -0.025201288 -0.0097293025 -0.010337217
## total_belpov100 0.011663661 0.010558039 0.0101050077 0.009560429
## pctvacancy 0.004271884 0.011277044 0.0191063919 0.022220980
## pctwhite 0.038154772 0.020351973 -0.0055927083 -0.002458372
## pctunemployed -0.036498085 -0.007109133 0.0040888823 0.011149888
## pctedu 0.401077539 -0.001609232 -0.0097692023 -0.010338321
## builtYear CompCode EffectiveYear bsmtSF
## price -0.047831765 0.001904995 0.083457052 0.183016210
## year -0.040693891 -0.124878234 -0.121552435 -0.007669119
## bld_num -0.037586525 -0.002458240 -0.045501855 -0.061946799
## section_num -0.048836741 0.002978419 -0.054556577 -0.066863293
## builtYear 1.000000000 -0.149228825 0.728546369 0.332641305
## CompCode -0.149228825 1.000000000 -0.160820312 -0.046625580
## EffectiveYear 0.728546369 -0.160820312 1.000000000 0.298094656
## bsmtSF 0.332641305 -0.046625580 0.298094656 1.000000000
## carStorageSF 0.382186160 -0.049932442 0.293586189 0.425459961
## nbrBedRoom 0.179274991 0.013871247 0.221956962 0.391412303
## nbrRoomsNobath 0.210382878 0.016230011 0.257890037 0.431571053
## mainfloorSF 0.149299094 -0.035435056 0.189512638 0.542358525
## nbrThreeQtrBaths -0.027766157 -0.036736226 0.019050070 0.101601740
## nbrFullBaths 0.418784859 -0.024380223 0.415760961 0.402939496
## nbrHalfBaths 0.326621007 -0.051345610 0.290546466 0.134037315
## TotalFinishedSF 0.389284395 -0.075513371 0.413392153 0.486149516
## MUSA_ID 0.004776114 0.007953813 0.004952922 0.009993530
## fire_stations_nn3 0.207025222 -0.016542160 0.088261376 0.045715196
## commercial_nn3 0.028324106 0.007217295 -0.019470052 0.011331548
## traffic_nn3 -0.094645609 0.007812952 -0.086474887 -0.187944281
## schools_nn3 -0.077915760 0.003770609 -0.116944716 -0.091321424
## crime_nn3 0.221817255 -0.017157589 0.101137693 0.049844257
## total_pop 0.384518061 -0.049656243 0.351445959 0.263836571
## med_hh_income 0.355006960 -0.059805725 0.288473115 0.209361206
## total_belpov100 -0.106619526 0.003277886 0.001947383 0.029506614
## pctvacancy -0.210387411 0.018452842 -0.193783309 -0.167605138
## pctwhite -0.102582622 0.037261540 -0.093923305 0.103710833
## pctunemployed -0.155523374 0.040649051 -0.098695198 0.001849994
## pctedu -0.088891278 0.007283817 -0.026584035 0.008115751
## carStorageSF nbrBedRoom nbrRoomsNobath mainfloorSF
## price 0.224488400 0.267387067 0.354603271 0.381421772
## year -0.005520495 -0.038135723 -0.037689118 -0.003643487
## bld_num -0.037759660 -0.112838036 -0.107650693 -0.052517683
## section_num -0.044517786 -0.122073670 -0.117248170 -0.056852125
## builtYear 0.382186160 0.179274991 0.210382878 0.149299094
## CompCode -0.049932442 0.013871247 0.016230011 -0.035435056
## EffectiveYear 0.293586189 0.221956962 0.257890037 0.189512638
## bsmtSF 0.425459961 0.391412303 0.431571053 0.542358525
## carStorageSF 1.000000000 0.356762111 0.444982166 0.492626205
## nbrBedRoom 0.356762111 1.000000000 0.766152939 0.383561760
## nbrRoomsNobath 0.444982166 0.766152939 1.000000000 0.468928168
## mainfloorSF 0.492626205 0.383561760 0.468928168 1.000000000
## nbrThreeQtrBaths 0.139994693 0.288322245 0.289731875 0.173128468
## nbrFullBaths 0.388357354 0.506984065 0.525035071 0.352513873
## nbrHalfBaths 0.234875665 0.181022682 0.286311664 0.081776017
## TotalFinishedSF 0.575771612 0.566611457 0.670092102 0.705306200
## MUSA_ID -0.001421327 0.007441575 0.004461081 0.001721248
## fire_stations_nn3 0.042034563 -0.127488152 -0.177593298 -0.047914644
## commercial_nn3 0.017396487 -0.102540782 -0.071627826 0.074026677
## traffic_nn3 -0.142229641 -0.069636634 -0.100029939 -0.237029844
## schools_nn3 -0.095414537 -0.181055784 -0.150579415 -0.007046662
## crime_nn3 0.023829710 -0.139662511 -0.172689363 -0.056994704
## total_pop 0.185443848 0.120835077 0.117849737 0.164453885
## med_hh_income 0.274380864 0.191350528 0.262999762 0.185243576
## total_belpov100 -0.089757042 0.011181961 -0.015044431 0.014139594
## pctvacancy -0.183288321 -0.214413411 -0.198212964 -0.072254920
## pctwhite 0.044536401 0.001093382 0.018387623 0.166292192
## pctunemployed -0.071282544 -0.017174116 -0.054153463 0.010134360
## pctedu 0.016774251 0.100128144 0.174743351 0.091457423
## nbrThreeQtrBaths nbrFullBaths nbrHalfBaths TotalFinishedSF
## price 0.225803590 0.2071053426 0.1542155915 0.434325553
## year -0.002381863 -0.0332001978 -0.0100198778 -0.012742931
## bld_num -0.028459068 -0.0727423408 -0.0556786509 -0.064933403
## section_num -0.032438552 -0.0788142867 -0.0580626273 -0.071723745
## builtYear -0.027766157 0.4187848595 0.3266210071 0.389284395
## CompCode -0.036736226 -0.0243802234 -0.0513456101 -0.075513371
## EffectiveYear 0.019050070 0.4157609609 0.2905464663 0.413392153
## bsmtSF 0.101601740 0.4029394957 0.1340373154 0.486149516
## carStorageSF 0.139994693 0.3883573536 0.2348756648 0.575771612
## nbrBedRoom 0.288322245 0.5069840651 0.1810226824 0.566611457
## nbrRoomsNobath 0.289731875 0.5250350709 0.2863116642 0.670092102
## mainfloorSF 0.173128468 0.3525138733 0.0817760174 0.705306200
## nbrThreeQtrBaths 1.000000000 -0.2979752825 -0.0880992240 0.167976831
## nbrFullBaths -0.297975282 1.0000000000 0.2159457950 0.598088498
## nbrHalfBaths -0.088099224 0.2159457950 1.0000000000 0.372232281
## TotalFinishedSF 0.167976831 0.5980884984 0.3722322811 1.000000000
## MUSA_ID 0.017333595 -0.0098074503 0.0001713499 -0.003364032
## fire_stations_nn3 -0.165341961 -0.0165938211 -0.0619557790 -0.073840103
## commercial_nn3 -0.053440381 0.0005218684 -0.0421739411 0.056201938
## traffic_nn3 -0.038189482 -0.1191261707 -0.0374090398 -0.200999519
## schools_nn3 -0.046936701 -0.0796766729 -0.0954472819 -0.047506574
## crime_nn3 -0.173551979 0.0024802686 -0.0663684685 -0.057889159
## total_pop -0.093173078 0.2582160809 0.1347894180 0.267850856
## med_hh_income 0.061178421 0.2849435288 0.1944129248 0.333520933
## total_belpov100 -0.052751368 0.0069381992 -0.0105743905 0.010709888
## pctvacancy -0.032205380 -0.1542865406 -0.1191805004 -0.133593414
## pctwhite 0.006138100 -0.0085438926 -0.0738004358 0.073120340
## pctunemployed -0.038148793 -0.0439511339 -0.0599807322 -0.068191112
## pctedu 0.129625422 0.0829293464 0.0620296035 0.138958869
## MUSA_ID fire_stations_nn3 commercial_nn3 traffic_nn3
## price 0.0054163967 -0.4281488341 -0.0462099313 -0.152217255
## year -0.0078242220 -0.0261906464 -0.0236489794 -0.006343534
## bld_num -0.0039809684 -0.0008059242 0.0078526482 -0.015360122
## section_num -0.0024200710 0.0036097495 0.0150486071 -0.015950689
## builtYear 0.0047761138 0.2070252219 0.0283241064 -0.094645609
## CompCode 0.0079538127 -0.0165421605 0.0072172955 0.007812952
## EffectiveYear 0.0049529223 0.0882613764 -0.0194700525 -0.086474887
## bsmtSF 0.0099935298 0.0457151962 0.0113315478 -0.187944281
## carStorageSF -0.0014213270 0.0420345628 0.0173964870 -0.142229641
## nbrBedRoom 0.0074415753 -0.1274881522 -0.1025407818 -0.069636634
## nbrRoomsNobath 0.0044610805 -0.1775932976 -0.0716278261 -0.100029939
## mainfloorSF 0.0017212478 -0.0479146441 0.0740266772 -0.237029844
## nbrThreeQtrBaths 0.0173335955 -0.1653419612 -0.0534403812 -0.038189482
## nbrFullBaths -0.0098074503 -0.0165938211 0.0005218684 -0.119126171
## nbrHalfBaths 0.0001713499 -0.0619557790 -0.0421739411 -0.037409040
## TotalFinishedSF -0.0033640324 -0.0738401034 0.0562019384 -0.200999519
## MUSA_ID 1.0000000000 0.0012464439 -0.0062147082 0.004806899
## fire_stations_nn3 0.0012464439 1.0000000000 0.2402103855 0.164076133
## commercial_nn3 -0.0062147082 0.2402103855 1.0000000000 -0.088804894
## traffic_nn3 0.0048068987 0.1640761326 -0.0888048944 1.000000000
## schools_nn3 -0.0035661357 0.2755032375 0.6455045455 -0.147325570
## crime_nn3 0.0004416048 0.9295680400 0.3648543085 0.186258775
## total_pop -0.0103363674 0.1231064285 0.1589664885 -0.171127540
## med_hh_income 0.0077329969 -0.2170400084 0.1830949627 -0.131714638
## total_belpov100 -0.0131581292 -0.0449116524 0.0219337031 -0.022406358
## pctvacancy -0.0055773892 0.1454118502 0.6572442141 -0.057846869
## pctwhite -0.0229831079 0.0936764518 0.2665317421 -0.515922551
## pctunemployed 0.0033199817 0.1097087952 -0.0208129708 0.003859160
## pctedu 0.0025029339 -0.7379720443 0.1050347304 -0.174191994
## schools_nn3 crime_nn3 total_pop med_hh_income
## price -0.043480929 -0.4034900273 -0.08113384 0.157113248
## year -0.004061844 -0.0302383503 -0.03527875 -0.025201288
## bld_num 0.010732974 -0.0012866508 -0.01930470 -0.009729302
## section_num 0.017493709 0.0034917142 -0.01957045 -0.010337217
## builtYear -0.077915760 0.2218172553 0.38451806 0.355006960
## CompCode 0.003770609 -0.0171575888 -0.04965624 -0.059805725
## EffectiveYear -0.116944716 0.1011376930 0.35144596 0.288473115
## bsmtSF -0.091321424 0.0498442565 0.26383657 0.209361206
## carStorageSF -0.095414537 0.0238297104 0.18544385 0.274380864
## nbrBedRoom -0.181055784 -0.1396625112 0.12083508 0.191350528
## nbrRoomsNobath -0.150579415 -0.1726893634 0.11784974 0.262999762
## mainfloorSF -0.007046662 -0.0569947041 0.16445388 0.185243576
## nbrThreeQtrBaths -0.046936701 -0.1735519790 -0.09317308 0.061178421
## nbrFullBaths -0.079676673 0.0024802686 0.25821608 0.284943529
## nbrHalfBaths -0.095447282 -0.0663684685 0.13478942 0.194412925
## TotalFinishedSF -0.047506574 -0.0578891594 0.26785086 0.333520933
## MUSA_ID -0.003566136 0.0004416048 -0.01033637 0.007732997
## fire_stations_nn3 0.275503237 0.9295680400 0.12310643 -0.217040008
## commercial_nn3 0.645504546 0.3648543085 0.15896649 0.183094963
## traffic_nn3 -0.147325570 0.1862587749 -0.17112754 -0.131714638
## schools_nn3 1.000000000 0.4403893888 -0.08838762 0.059069158
## crime_nn3 0.440389389 1.0000000000 0.18194066 -0.107062259
## total_pop -0.088387624 0.1819406639 1.00000000 0.173236161
## med_hh_income 0.059069158 -0.1070622595 0.17323616 1.000000000
## total_belpov100 -0.044140657 -0.0514527439 0.42218524 -0.444950608
## pctvacancy 0.750902972 0.2389511771 -0.23488207 -0.118141715
## pctwhite 0.241968010 0.1023546404 0.18898876 0.070667395
## pctunemployed -0.043116074 -0.0252017930 -0.11344722 -0.314300628
## pctedu 0.106129018 -0.6002637073 -0.12084785 0.578432548
## total_belpov100 pctvacancy pctwhite pctunemployed
## price 0.011663661 0.004271884 0.038154772 -0.036498085
## year 0.010558039 0.011277044 0.020351973 -0.007109133
## bld_num 0.010105008 0.019106392 -0.005592708 0.004088882
## section_num 0.009560429 0.022220980 -0.002458372 0.011149888
## builtYear -0.106619526 -0.210387411 -0.102582622 -0.155523374
## CompCode 0.003277886 0.018452842 0.037261540 0.040649051
## EffectiveYear 0.001947383 -0.193783309 -0.093923305 -0.098695198
## bsmtSF 0.029506614 -0.167605138 0.103710833 0.001849994
## carStorageSF -0.089757042 -0.183288321 0.044536401 -0.071282544
## nbrBedRoom 0.011181961 -0.214413411 0.001093382 -0.017174116
## nbrRoomsNobath -0.015044431 -0.198212964 0.018387623 -0.054153463
## mainfloorSF 0.014139594 -0.072254920 0.166292192 0.010134360
## nbrThreeQtrBaths -0.052751368 -0.032205380 0.006138100 -0.038148793
## nbrFullBaths 0.006938199 -0.154286541 -0.008543893 -0.043951134
## nbrHalfBaths -0.010574390 -0.119180500 -0.073800436 -0.059980732
## TotalFinishedSF 0.010709888 -0.133593414 0.073120340 -0.068191112
## MUSA_ID -0.013158129 -0.005577389 -0.022983108 0.003319982
## fire_stations_nn3 -0.044911652 0.145411850 0.093676452 0.109708795
## commercial_nn3 0.021933703 0.657244214 0.266531742 -0.020812971
## traffic_nn3 -0.022406358 -0.057846869 -0.515922551 0.003859160
## schools_nn3 -0.044140657 0.750902972 0.241968010 -0.043116074
## crime_nn3 -0.051452744 0.238951177 0.102354640 -0.025201793
## total_pop 0.422185241 -0.234882069 0.188988763 -0.113447222
## med_hh_income -0.444950608 -0.118141715 0.070667395 -0.314300628
## total_belpov100 1.000000000 0.008382288 0.014012686 0.190553212
## pctvacancy 0.008382288 1.000000000 0.160124543 0.073779007
## pctwhite 0.014012686 0.160124543 1.000000000 -0.083657590
## pctunemployed 0.190553212 0.073779007 -0.083657590 1.000000000
## pctedu -0.263963828 0.088536514 0.049643744 -0.226605012
## pctedu
## price 0.401077539
## year -0.001609232
## bld_num -0.009769202
## section_num -0.010338321
## builtYear -0.088891278
## CompCode 0.007283817
## EffectiveYear -0.026584035
## bsmtSF 0.008115751
## carStorageSF 0.016774251
## nbrBedRoom 0.100128144
## nbrRoomsNobath 0.174743351
## mainfloorSF 0.091457423
## nbrThreeQtrBaths 0.129625422
## nbrFullBaths 0.082929346
## nbrHalfBaths 0.062029604
## TotalFinishedSF 0.138958869
## MUSA_ID 0.002502934
## fire_stations_nn3 -0.737972044
## commercial_nn3 0.105034730
## traffic_nn3 -0.174191994
## schools_nn3 0.106129018
## crime_nn3 -0.600263707
## total_pop -0.120847845
## med_hh_income 0.578432548
## total_belpov100 -0.263963828
## pctvacancy 0.088536514
## pctwhite 0.049643744
## pctunemployed -0.226605012
## pctedu 1.000000000
For those strongly correlated with each other (r > 0.8), we would choose only one of the two to increase model’s accuracy. According to the table, r square between fire station and crime data is 0.93, and r square between bld_num and section_num is 0.95,so we won’t include fire station and section_num in regression.
# a loop to look which two variables are highly correlated
for (i in colnames(cormax.df)) {
for (j in colnames(cormax.df)){
if (cormax.df[j,i] >= 0.8){
if(i != j){
print(i)
print(j)
print(' ')}
}
}
}
## [1] "bld_num"
## [1] "section_num"
## [1] " "
## [1] "section_num"
## [1] "bld_num"
## [1] " "
## [1] "fire_stations_nn3"
## [1] "crime_nn3"
## [1] " "
## [1] "crime_nn3"
## [1] "fire_stations_nn3"
## [1] " "
# for those r > 0.8, only choose one of the two.
boulder_data <-
boulder_data %>%
dplyr::select(-fire_stations_nn3,-section_num)
training_data <-
filter(boulder_data,toPredict == 0) %>%
dplyr::select(-toPredict)
testing_data <-
filter(boulder_data,toPredict == 1) %>%
dplyr::select(-toPredict)
There are four correlation scatterplot plots of home price as functions of 4 independent variables, namely exposure to transit stations, age of houses, area of main floor and natural log of percent of unemployed in the census tract.
In all scatter plots, there is no apparent linear correlation between them.
st_drop_geometry(training_data) %>%
mutate(Age = 2022 - builtYear) %>%
dplyr::select(price, traffic_nn3, Age, mainfloorSF,lnpctunemployed) %>%
filter(price <= 1000000, Age < 500) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
facet_wrap(~Variable, ncol = 2, scales = "free") +
labs(title = "Price as a function of continuous variables") +
theme(plot.title = element_text(hjust = 0.5)) +
plotTheme()
## `geom_smooth()` using formula 'y ~ x'
The correlation matrix and correlation scatterplots can only include numeric variables, so we choose number of bedroom and level of quality which are categorical and develop bar plots between them and house price.
It seems that houses with more bedrooms have higher prices. Similarly, houses with higher level of quality have higher prices.
cate <- dplyr::select(boulder_data, price, qualityCodeDscr, designCodeDscr, ConstCodeDscr,
HeatingDscr, carStorageTypeDscr, AcDscr)
cate <- st_drop_geometry(cate)
cate %>%
mutate(qualityCodeDscr = as.factor(qualityCodeDscr)) %>%
mutate(designCodeDscr = as.factor(designCodeDscr)) %>%
mutate(ConstCodeDscr = as.factor(ConstCodeDscr)) %>%
mutate(HeatingDscr = as.factor(HeatingDscr)) %>%
mutate(carStorageTypeDscr = as.factor(carStorageTypeDscr)) %>%
mutate(AcDscr = as.factor(AcDscr)) %>%
filter(price <= 1000000) %>%
gather(Variable, Value, -price) %>%
ggplot(aes(Value, price)) +
geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Price as a function of categorical variables", y = "Mean_Price") +
plotTheme() +
theme(axis.text.x = element_text(angle = 45, hjust = 0.5))+
theme(plot.title = element_text(hjust = 0.5))
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
The map below shows the distribution of houses and their sale prices in Boulder County. The warmer the spot’s color, the higher the sale price. Spots with high values mainly concentrated in the center and the southeastern part of Boulder County where Boulder City is in, while houses with low prices are mainly distributed in northeastern Boulder as well as east Boulder which are mountain areas.
ggplot() +
geom_sf(data = acsTractsBD19, fill = "grey60") +
geom_sf(data = training_data, aes(colour = q5(price)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(training_data,"price"),
name="Quintile\nBreaks") +
labs(title="Price of houses, Boulder") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
Next three graphs are maps of three independent variables, respectively area of main floor, exposure to crimes and percent of vacant houses in the census tract that each house is located in. In all three maps, orange spots represent higher value while the green spots represent lower value.
ggplot() +
geom_sf(data = acsTractsBD19, fill = "grey60") +
geom_sf(data = training_data, aes(colour = q5(mainfloorSF)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(training_data,"mainfloorSF"),
name="Quintile\nBreaks") +
labs(title="Area of mainfloor of houses, Boulder") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
ggplot() +
geom_sf(data = acsTractsBD19, fill = "grey60") +
geom_sf(data = training_data, aes(colour = q5(crime_nn3)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(training_data,"crime_nn3"),
name="Quintile\nBreaks") +
labs(title="Exposure to crimes of houses, Boulder") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
vacant_data <- training_data %>%
mutate(pctvacancy1 = pctvacancy*100) %>%
dplyr::select(pctvacancy1)
ggplot() +
geom_sf(data = acsTractsBD19, fill = "grey60") +
geom_sf(data = vacant_data, aes(colour = q5(pctvacancy1)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(vacant_data,"pctvacancy1"),
name="Quintile\nBreaks") +
labs(title="pct of vacant houses in the census tract that each home is located in") +
theme(plot.title = element_text(hjust = 0.5))+
mapTheme()
Bases on maps, it seems that the area of main floor distribute randomly. Houses with lower exposure to crimes are mainly located in center districts of Boulder County. The East Boulder has higher percent of vacant houses while the West Boulder has lower percent.
Basically speaking, there are five stages in machine learning process, namely data wrangling, exploratory analysis, feature engineering, feature selection as well as model estimation and validation.
In data wrangling step, we would firstly compile all sorts of data we have collected into one dataset. Secondly, we would do some data cleaning, including deleting the data that is apparently impossible. In exploratory analysis step, a few visualization tools will be used to help understand correlations between variables of interested and underlying spatial process, which would be useful to create an efficient model. In feature engineering part, we would measure exposure from a house sale to public amenities and do logarithm transformation to variables that are not normal distributed. In the feature selection stage, we will select and include some variables in a model based on testing the collinearity of independent variables. We have done those steps already in the Data part. Then we come to model estimation and validation stage.
We are going to predict the price of house as a function of all other variables remained now using OLS regression. Then ‘summary’ function is used to view the results of the regression. In the result table, P-value and (adjusted) R-square are especially paid attention to. P-value is used to measure whether the coefficient is variable. When p-value is smaller than 0.05, we can reject the ‘null hypothesis’ that there is no relationship between this independent variable and house price. R2 is an important and common indictor to describe how well all features in the model explain the model. It represents how the proportion of variation in the price variable is explained by the independent variables. Consequentially, the higher the R2, the better the model.
We are going to evaluate a model based on its accuracy and generalizability.
When testing the model’s accuracy, we are going to randomly splitting our current data into a 75% training dataset and a 25% testing dataset with ‘createDataPartition’ function. Then we would create a model based on training dataset and predict home sale price of test data with this model. Secondly, we will calculate the error and absolute error which are differences in predicted and observed prices. Thirdly, absolute percent error will be calculated on a percentage basis. After those steps, we then calculate the average of all absolute percent errors (MAPE) and all absolute errors (MAE). Those statistics are powerful tools to measure how well the model predicts for data it has never seen before. The smaller the MAPE, the better the model. Plotting the observed price as a function of predicted price with another line representing a perfect fit is also a useful visualization method to reflect how well our model predicts. It shows gaps between prediction and reality directly.
There are two definitions of generalizability. One is whether the model can accurately predict on new data, another is whether the model can accurately predict across form different groups. K-fold cross validation is a good way to gauge model’s performance on new data. During this process, all data will be partitioned into 100 equal sized folds. For each given fold, we will train on a subset of observations, predict on a test set, then measure its goodness of fit (MAE). Making a histogram of MAE of each fold is a useful visualization to gauge whether the model is generalizable to new data. If the distribution of errors would cluster tightly together, then the model generalizes well. To test the ability of the model to predict across different group context, we would firstly map MAPE by census tract. If values of MAPE are comparable in each census tract, then the model is generalizable. More specifically, we would select median household income as a context to look whether the model is as efficient in high-income census tracts as in low-income census tracts. Firstly, we would load household income data from census data to define an income text. Tracts with incomes greater than the countywide mean receive a High-Income designation. Based on calculating the difference between their corresponding MAPE, we could assess model’s generalizability. The smaller the gap, the better the model.
We are also going to check whether the model is severely affected by spatial autocorrelation. Firstly, we are going to calculate the spatial lag in errors with ‘spdep’ package. During this process, we calculate average difference between predicted price and observed price of each home sale’s five nearest neighbors and look whether it is correlated to its own price. If there is, then spatial autocorrelation exists. Moran’s I is a useful method to test whether model errors cluster or not. A positive Moran’s I near 1 means that values are spatially clustered, while one near 0 means values are randomly distributed.
In the end, we also create a stepwise model to optimize our regression model based on OLS regression.
We split the dataset into a training dataset and a testing dataset, and the ratio is 75/25. The training dataset is used to train the OLS regression model, and the testing dataset is used for accuracy testing. The testing dataset will not be included in the training process, so that it can better interpret the model accuracy.
For those categorical variables, the parameter y is applied to ensure each category of a variable can be included in the training dataset to achieve better accuracy.
# ensure reproducible partition
set.seed(123)
# train on 75% data and test on the other 25% data
inTrain <- createDataPartition(
y = paste(training_data$designCodeDscr,
training_data$qualityCodeDscr,
training_data$NAME,
training_data$HeatingDscr,
training_data$ExtWallDscrPrim
),
p = .75, list = FALSE)
boulder.train <- training_data[inTrain,]
boulder.test <- training_data[-inTrain,]
Then we train a OLS regression model with the training dataset. Results are showed as below.
all_reg <- lm(price ~.,data = st_drop_geometry(boulder.train) %>%
dplyr::select(price, NAME,year_quarter,bld_num,
designCodeDscr, qualityCodeDscr, ConstCodeDscr,
builtYear, CompCode, EffectiveYear, bsmtSF,
bsmtTypeDscr,carStorageTypeDscr,mainfloorSF,
nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
TotalFinishedSF, HeatingDscr, ExtWallDscrPrim,
ExtWallDscrSec, Roof_CoverDscr, lnschools_nn3, crime_nn3))
stargazer(all_reg, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
##
## Regression Results
## ============================================================================================
## Dependent variable:
## ---------------------------
## price
## --------------------------------------------------------------------------------------------
## NAMECensus Tract 121.02, Boulder County, Colorado -450,394.400***
## (30,997.200)
## NAMECensus Tract 121.03, Boulder County, Colorado -451,292.400***
## (37,607.980)
## NAMECensus Tract 121.04, Boulder County, Colorado -471,544.300***
## (38,158.620)
## NAMECensus Tract 121.05, Boulder County, Colorado -567,423.800***
## (36,423.210)
## NAMECensus Tract 122.01, Boulder County, Colorado -113,742.400***
## (40,503.160)
## NAMECensus Tract 122.02, Boulder County, Colorado -607,236.400***
## (51,969.590)
## NAMECensus Tract 122.03, Boulder County, Colorado -906,274.700***
## (40,702.060)
## NAMECensus Tract 122.04, Boulder County, Colorado 65,817.290
## (72,797.390)
## NAMECensus Tract 123, Boulder County, Colorado -227,066.100
## (204,802.800)
## NAMECensus Tract 124.01, Boulder County, Colorado -551,881.100***
## (44,596.630)
## NAMECensus Tract 125.01, Boulder County, Colorado -840,532.300***
## (48,395.790)
## NAMECensus Tract 125.05, Boulder County, Colorado -407,428.200***
## (34,790.600)
## NAMECensus Tract 125.07, Boulder County, Colorado -820,130.400***
## (39,525.980)
## NAMECensus Tract 125.08, Boulder County, Colorado -791,899.000***
## (47,402.310)
## NAMECensus Tract 125.09, Boulder County, Colorado -650,136.400***
## (38,628.020)
## NAMECensus Tract 125.10, Boulder County, Colorado -406,741.500***
## (36,969.210)
## NAMECensus Tract 125.11, Boulder County, Colorado -839,994.400***
## (48,369.020)
## NAMECensus Tract 126.03, Boulder County, Colorado -774,492.500***
## (37,928.760)
## NAMECensus Tract 126.05, Boulder County, Colorado -824,645.900***
## (119,777.200)
## NAMECensus Tract 126.07, Boulder County, Colorado -801,590.700***
## (76,202.950)
## NAMECensus Tract 126.08, Boulder County, Colorado -844,373.300***
## (44,263.990)
## NAMECensus Tract 127.01, Boulder County, Colorado -788,175.300***
## (34,529.710)
## NAMECensus Tract 127.05, Boulder County, Colorado -975,402.800***
## (50,869.820)
## NAMECensus Tract 127.07, Boulder County, Colorado -528,159.800***
## (53,511.160)
## NAMECensus Tract 127.08, Boulder County, Colorado -994,951.700***
## (31,872.470)
## NAMECensus Tract 127.09, Boulder County, Colorado -946,988.800***
## (47,019.450)
## NAMECensus Tract 127.10, Boulder County, Colorado -714,539.400***
## (36,946.700)
## NAMECensus Tract 128, Boulder County, Colorado -1,052,407.000***
## (33,120.720)
## NAMECensus Tract 129.03, Boulder County, Colorado -1,024,345.000***
## (42,411.730)
## NAMECensus Tract 129.04, Boulder County, Colorado -919,533.400***
## (35,662.220)
## NAMECensus Tract 129.05, Boulder County, Colorado -857,147.700***
## (46,897.170)
## NAMECensus Tract 129.07, Boulder County, Colorado -879,470.000***
## (38,671.260)
## NAMECensus Tract 130.03, Boulder County, Colorado -855,585.500***
## (32,521.840)
## NAMECensus Tract 130.04, Boulder County, Colorado -834,790.300***
## (40,027.230)
## NAMECensus Tract 130.05, Boulder County, Colorado -772,037.400***
## (39,067.840)
## NAMECensus Tract 130.06, Boulder County, Colorado -786,723.400***
## (34,804.810)
## NAMECensus Tract 132.01, Boulder County, Colorado -867,551.100***
## (53,760.580)
## NAMECensus Tract 132.02, Boulder County, Colorado -462,936.900***
## (58,519.170)
## NAMECensus Tract 132.05, Boulder County, Colorado -974,509.100***
## (33,641.140)
## NAMECensus Tract 132.07, Boulder County, Colorado -1,038,645.000***
## (42,755.810)
## NAMECensus Tract 132.08, Boulder County, Colorado -1,069,317.000***
## (38,745.700)
## NAMECensus Tract 132.10, Boulder County, Colorado -1,018,447.000***
## (38,820.450)
## NAMECensus Tract 132.11, Boulder County, Colorado -1,065,777.000***
## (32,402.400)
## NAMECensus Tract 132.12, Boulder County, Colorado -997,339.600***
## (33,930.180)
## NAMECensus Tract 132.13, Boulder County, Colorado -1,107,830.000***
## (31,387.100)
## NAMECensus Tract 133.02, Boulder County, Colorado -937,273.600***
## (36,331.240)
## NAMECensus Tract 133.05, Boulder County, Colorado -1,009,640.000***
## (40,525.210)
## NAMECensus Tract 133.06, Boulder County, Colorado -995,495.300***
## (45,087.490)
## NAMECensus Tract 133.07, Boulder County, Colorado -1,016,554.000***
## (43,648.060)
## NAMECensus Tract 133.08, Boulder County, Colorado -975,068.600***
## (42,222.170)
## NAMECensus Tract 134.01, Boulder County, Colorado -984,756.300***
## (42,913.780)
## NAMECensus Tract 134.02, Boulder County, Colorado -1,026,995.000***
## (39,152.860)
## NAMECensus Tract 135.03, Boulder County, Colorado -961,547.600***
## (42,545.300)
## NAMECensus Tract 135.05, Boulder County, Colorado -990,366.100***
## (53,474.300)
## NAMECensus Tract 135.06, Boulder County, Colorado -1,041,835.000***
## (41,444.730)
## NAMECensus Tract 135.07, Boulder County, Colorado -1,024,478.000***
## (44,648.370)
## NAMECensus Tract 135.08, Boulder County, Colorado -1,017,531.000***
## (39,540.940)
## NAMECensus Tract 136.01, Boulder County, Colorado -852,311.000***
## (44,050.550)
## NAMECensus Tract 136.02, Boulder County, Colorado -808,817.900***
## (54,812.410)
## NAMECensus Tract 137.01, Boulder County, Colorado -853,285.100***
## (31,454.220)
## NAMECensus Tract 137.02, Boulder County, Colorado -879,857.600***
## (40,062.690)
## NAMECensus Tract 606, Boulder County, Colorado -960,852.800***
## (35,119.110)
## NAMECensus Tract 607, Boulder County, Colorado -878,529.300***
## (41,766.500)
## NAMECensus Tract 608, Boulder County, Colorado -855,957.300***
## (40,170.140)
## NAMECensus Tract 609, Boulder County, Colorado -931,612.700***
## (38,183.340)
## NAMECensus Tract 613, Boulder County, Colorado -985,189.500***
## (42,042.250)
## NAMECensus Tract 614, Boulder County, Colorado -968,232.800***
## (39,400.310)
## year_quarter2019-2 12,745.340
## (14,076.000)
## year_quarter2019-3 24,777.940*
## (14,378.760)
## year_quarter2019-4 31,890.320**
## (14,853.830)
## year_quarter2020-1 34,568.680**
## (15,339.960)
## year_quarter2020-2 42,115.910***
## (14,827.530)
## year_quarter2020-3 57,399.350***
## (13,818.660)
## year_quarter2020-4 94,564.210***
## (14,374.520)
## year_quarter2021-1 168,700.900***
## (15,458.320)
## year_quarter2021-2 248,502.300***
## (14,794.620)
## bld_num 534,235.600***
## (23,921.370)
## designCodeDscr2-3 Story -47,779.670***
## (12,839.580)
## designCodeDscrBi-level 16,593.810
## (29,641.500)
## designCodeDscrMULTI STORY- TOWNHOUSE -176,977.700***
## (16,156.310)
## designCodeDscrSplit-level -23,097.800
## (23,336.470)
## qualityCodeDscrAVERAGE + -50,854.000***
## (13,402.510)
## qualityCodeDscrAVERAGE ++ -21,064.660
## (13,928.750)
## qualityCodeDscrEXCELLENT 1,015,678.000***
## (31,668.570)
## qualityCodeDscrEXCELLENT + 1,201,716.000***
## (63,782.090)
## qualityCodeDscrEXCELLENT++ 1,879,708.000***
## (52,964.460)
## qualityCodeDscrEXCEPTIONAL 1 1,043,698.000***
## (68,155.240)
## qualityCodeDscrEXCEPTIONAL 2 1,400,096.000***
## (176,430.500)
## qualityCodeDscrFAIR -53,062.510
## (35,055.940)
## qualityCodeDscrGOOD 53,689.110***
## (10,211.720)
## qualityCodeDscrGOOD + 77,697.070***
## (16,296.610)
## qualityCodeDscrGOOD ++ 147,808.500***
## (15,373.950)
## qualityCodeDscrLOW -47,121.860
## (73,483.040)
## qualityCodeDscrVERY GOOD 242,957.700***
## (16,582.470)
## qualityCodeDscrVERY GOOD + 450,203.300***
## (27,354.120)
## qualityCodeDscrVERY GOOD ++ 574,939.700***
## (23,810.950)
## ConstCodeDscrBrick -65,151.000
## (53,484.680)
## ConstCodeDscrConcrete -22,286.610
## (159,847.700)
## ConstCodeDscrFrame -62,219.220
## (42,981.370)
## ConstCodeDscrMasonry -38,829.580
## (44,673.640)
## ConstCodeDscrPrecast -1,012,995.000***
## (216,072.700)
## ConstCodeDscrVeneer -702,027.600**
## (296,601.200)
## ConstCodeDscrWood -137,671.000
## (190,756.300)
## builtYear -708.076***
## (229.792)
## CompCode 311,488.400***
## (34,623.510)
## EffectiveYear 3,104.635***
## (301.751)
## bsmtSF -31.856***
## (8.191)
## bsmtTypeDscrGARDEN BASEMENT FINISHED AREA 89,629.710***
## (20,591.370)
## bsmtTypeDscrGARDEN BASEMENT UNFINISHED AREA 51,775.520
## (36,646.980)
## bsmtTypeDscrLOWER LVL GARDEN FINISHED (BI-SPLIT LVL) 80,478.950***
## (26,159.610)
## bsmtTypeDscrLOWER LVL GARDEN UNFINISHED (BI-SPLIT LVL) -106,935.100
## (93,994.280)
## bsmtTypeDscrLOWER LVL WALKOUT FINISHED (BI-SPLIT LVL) 174,805.600***
## (52,048.330)
## bsmtTypeDscrLOWER LVL WALKOUT UNFINISHED (BI-SPLIT LVL) 61,242.490
## (204,438.300)
## bsmtTypeDscrSUBTERRANEAN BASEMENT FINISHED AREA 57,314.250***
## (12,082.640)
## bsmtTypeDscrSUBTERRANEAN BASEMENT UNFINISHED AREA 21,603.880*
## (12,165.840)
## bsmtTypeDscrWALK-OUT BASEMENT FINISHED AREA 128,212.700***
## (17,000.810)
## bsmtTypeDscrWALK-OUT BASEMENT UNFINISHED AREA 60,816.490**
## (24,803.570)
## carStorageTypeDscrATTACHED GARAGE AREA 36,889.720***
## (13,461.180)
## carStorageTypeDscrBASEMENT GARAGE AREA 142,794.600***
## (27,820.840)
## carStorageTypeDscrCARPORT AREA 12,367.600
## (24,306.950)
## carStorageTypeDscrDETACHED GARAGE 75,386.290***
## (14,439.890)
## carStorageTypeDscrGARAGE SET UP AS A WORKSHOP (ELEC., ETC.) AREA -68,240.050
## (110,127.400)
## carStorageTypeDscrGARAGE W/ FINISHED WALLS AREA 20,009.120
## (102,887.500)
## mainfloorSF 38.972***
## (12.103)
## nbrThreeQtrBaths 45,954.330***
## (5,829.389)
## nbrFullBaths 20,259.050***
## (5,999.818)
## nbrHalfBaths 35,646.690***
## (6,964.377)
## TotalFinishedSF 146.102***
## (10.202)
## HeatingDscrElectric -127,394.200***
## (39,503.940)
## HeatingDscrElectric Wall Heat (1500W) 335,639.900
## (227,433.400)
## HeatingDscrForced Air -76,556.240**
## (36,358.550)
## HeatingDscrGravity -17,915.400
## (67,783.420)
## HeatingDscrHeat Pump -202,837.000*
## (109,378.300)
## HeatingDscrHot Water -16,465.450
## (37,643.850)
## HeatingDscrNo HVAC -180,050.600
## (207,895.700)
## HeatingDscrPackage Unit 233,782.300
## (332,578.100)
## HeatingDscrRadiant Floor 135,722.900***
## (47,724.450)
## HeatingDscrVentilation Only -393,033.000
## (309,120.200)
## HeatingDscrWall Furnace 43,472.560
## (48,409.320)
## ExtWallDscrPrimBlock Stucco 197,458.000**
## (79,114.350)
## ExtWallDscrPrimBrick on Block 291,843.100***
## (74,717.180)
## ExtWallDscrPrimBrick Veneer 123,407.800*
## (69,670.540)
## ExtWallDscrPrimCedar 75,703.450
## (127,893.400)
## ExtWallDscrPrimCement Board 133,029.900*
## (68,689.930)
## ExtWallDscrPrimFaux Stone 259,013.500***
## (80,904.120)
## ExtWallDscrPrimFrame Asbestos 210,880.100
## (296,889.700)
## ExtWallDscrPrimFrame Stucco 110,940.600
## (69,543.820)
## ExtWallDscrPrimFrame Wood/Shake 188,433.000***
## (68,431.630)
## ExtWallDscrPrimLog 208,744.600***
## (76,302.070)
## ExtWallDscrPrimMetal 211,603.100*
## (123,119.600)
## ExtWallDscrPrimMoss Rock/Flagstone 169,501.800**
## (73,688.950)
## ExtWallDscrPrimPainted Block 140,232.100
## (91,173.060)
## ExtWallDscrPrimStrawbale 232,766.200
## (227,928.600)
## ExtWallDscrPrimVinyl 189,328.100**
## (77,527.780)
## ExtWallDscrSecBlock Stucco 776,115.300***
## (110,499.800)
## ExtWallDscrSecBrick on Block -24,248.880
## (93,390.680)
## ExtWallDscrSecBrick Veneer -16,336.050
## (13,236.000)
## ExtWallDscrSecCedar 415,054.600**
## (167,870.600)
## ExtWallDscrSecCement Board 4,992.113
## (71,259.240)
## ExtWallDscrSecFaux Stone 23,282.950
## (15,410.530)
## ExtWallDscrSecFrame Asbestos 138,613.900
## (118,416.300)
## ExtWallDscrSecFrame Stucco -48,032.770*
## (28,795.460)
## ExtWallDscrSecFrame Wood/Shake 36,012.830***
## (13,204.140)
## ExtWallDscrSecLog 115,407.000
## (145,045.100)
## ExtWallDscrSecMetal 32,746.370
## (85,785.170)
## ExtWallDscrSecMoss Rock/Flagstone 44,611.370
## (32,128.740)
## ExtWallDscrSecPainted Block 197,007.400
## (168,714.600)
## ExtWallDscrSecVinyl -16,726.080
## (97,060.320)
## Roof_CoverDscrAsphalt -7,238.716
## (7,740.046)
## Roof_CoverDscrBuilt-Up 481,071.400***
## (121,856.700)
## Roof_CoverDscrClay Tile -71,444.940
## (47,479.230)
## Roof_CoverDscrConcrete Tile -88,007.710***
## (24,973.460)
## Roof_CoverDscrMetal 180,195.200***
## (26,062.230)
## Roof_CoverDscrRoll -118,481.800
## (288,023.400)
## Roof_CoverDscrRubber Membrane 140,318.600***
## (27,660.920)
## Roof_CoverDscrShake -75,356.520
## (48,688.470)
## Roof_CoverDscrTar and Gravel 170,389.000
## (146,434.900)
## lnschools_nn3 -4,639.976
## (5,389.377)
## crime_nn3 -13.958***
## (1.960)
## Constant -4,493,725.000***
## (609,639.000)
## --------------------------------------------------------------------------------------------
## Observations 9,475
## R2 0.754
## Adjusted R2 0.749
## Residual Std. Error 286,444.400 (df = 9295)
## F Statistic 159.170*** (df = 179; 9295)
## ============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The adjusted R-square of the regression model reaches 0.749, which means about 75% of the result can be explained by the model.
23 predictors are included in the regression model training, including 10 categorical variables.
Combining the first model result and stepwise regression results below, most predictors is statistically significant (p-value < 0.001) and one numeric predictor (lnschool_nn3) is not significant (p-value > 0.1). So we decided to remove that variable and train a new model based on the other variables.
fit <- lm(price~ ., data = st_drop_geometry(boulder.train) %>%
dplyr::select(price, NAME,year_quarter,bld_num,
designCodeDscr, qualityCodeDscr, ConstCodeDscr,
builtYear, CompCode, EffectiveYear, bsmtSF,
bsmtTypeDscr,carStorageTypeDscr,mainfloorSF,
nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
TotalFinishedSF, HeatingDscr, ExtWallDscrPrim,
ExtWallDscrSec, Roof_CoverDscr, lnschools_nn3, crime_nn3))
step <- stepAIC(fit, direction="both")
## Start: AIC=238290.7
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr +
## ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF +
## bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths +
## nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr +
## ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + lnschools_nn3 +
## crime_nn3
##
## Df Sum of Sq RSS AIC
## - lnschools_nn3 1 6.0818e+10 7.6272e+14 238289
## <none> 7.6266e+14 238291
## - builtYear 1 7.7906e+11 7.6344e+14 238298
## - mainfloorSF 1 8.5076e+11 7.6351e+14 238299
## - nbrFullBaths 1 9.3550e+11 7.6359e+14 238300
## - bsmtSF 1 1.2411e+12 7.6390e+14 238304
## - ConstCodeDscr 7 2.5211e+12 7.6518e+14 238308
## - nbrHalfBaths 1 2.1496e+12 7.6481e+14 238315
## - carStorageTypeDscr 6 3.8632e+12 7.6652e+14 238327
## - ExtWallDscrSec 14 6.2670e+12 7.6893e+14 238340
## - crime_nn3 1 4.1623e+12 7.6682e+14 238340
## - nbrThreeQtrBaths 1 5.0990e+12 7.6776e+14 238352
## - bsmtTypeDscr 10 6.6564e+12 7.6931e+14 238353
## - ExtWallDscrPrim 15 7.7421e+12 7.7040e+14 238356
## - HeatingDscr 11 7.4707e+12 7.7013e+14 238361
## - CompCode 1 6.6408e+12 7.6930e+14 238371
## - Roof_CoverDscr 9 9.6003e+12 7.7226e+14 238391
## - EffectiveYear 1 8.6857e+12 7.7134e+14 238396
## - designCodeDscr 4 1.1591e+13 7.7425e+14 238426
## - TotalFinishedSF 1 1.6828e+13 7.7949e+14 238495
## - bld_num 1 4.0924e+13 8.0358e+14 238784
## - year_quarter 9 4.6558e+13 8.0922e+14 238834
## - qualityCodeDscr 15 1.9295e+14 9.5561e+14 240398
## - NAME 67 2.3608e+14 9.9874e+14 240712
##
## Step: AIC=238289.5
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr +
## ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF +
## bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths +
## nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr +
## ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + crime_nn3
##
## Df Sum of Sq RSS AIC
## <none> 7.6272e+14 238289
## + lnschools_nn3 1 6.0818e+10 7.6266e+14 238291
## - builtYear 1 7.9061e+11 7.6351e+14 238297
## - mainfloorSF 1 8.2664e+11 7.6355e+14 238298
## - nbrFullBaths 1 9.2872e+11 7.6365e+14 238299
## - bsmtSF 1 1.2476e+12 7.6397e+14 238303
## - ConstCodeDscr 7 2.5178e+12 7.6524e+14 238307
## - nbrHalfBaths 1 2.1577e+12 7.6488e+14 238314
## - carStorageTypeDscr 6 3.8557e+12 7.6657e+14 238325
## - ExtWallDscrSec 14 6.2526e+12 7.6897e+14 238339
## - crime_nn3 1 4.6916e+12 7.6741e+14 238346
## - nbrThreeQtrBaths 1 5.1152e+12 7.6783e+14 238351
## - bsmtTypeDscr 10 6.6723e+12 7.6939e+14 238352
## - ExtWallDscrPrim 15 7.7401e+12 7.7046e+14 238355
## - HeatingDscr 11 7.4503e+12 7.7017e+14 238360
## - CompCode 1 6.6460e+12 7.6937e+14 238370
## - Roof_CoverDscr 9 9.5749e+12 7.7229e+14 238390
## - EffectiveYear 1 8.7360e+12 7.7146e+14 238395
## - designCodeDscr 4 1.1530e+13 7.7425e+14 238424
## - TotalFinishedSF 1 1.6897e+13 7.7962e+14 238495
## - bld_num 1 4.0969e+13 8.0369e+14 238783
## - year_quarter 9 4.6671e+13 8.0939e+14 238834
## - qualityCodeDscr 15 1.9313e+14 9.5585e+14 240398
## - NAME 67 2.3604e+14 9.9876e+14 240710
# display results
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr +
## ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF +
## bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths +
## nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr +
## ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + lnschools_nn3 +
## crime_nn3
##
## Final Model:
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr +
## ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF +
## bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths +
## nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr +
## ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + crime_nn3
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 9295 7.626583e+14 238290.7
## 2 - lnschools_nn3 1 60818410273 9296 7.627191e+14 238289.5
The new regression model we built is shown as follows.
all_reg_fin <- lm(price ~.,data = st_drop_geometry(boulder.train) %>%
dplyr::select(price, NAME,year_quarter, bld_num,
designCodeDscr, qualityCodeDscr,
builtYear, CompCode, EffectiveYear, bsmtSF,
bsmtTypeDscr,carStorageTypeDscr,mainfloorSF,
nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
TotalFinishedSF, ExtWallDscrPrim, HeatingDscr,
ExtWallDscrSec, Roof_CoverDscr, crime_nn3))
stargazer(all_reg_fin, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
##
## Regression Results
## ============================================================================================
## Dependent variable:
## ---------------------------
## price
## --------------------------------------------------------------------------------------------
## NAMECensus Tract 121.02, Boulder County, Colorado -443,119.600***
## (30,538.960)
## NAMECensus Tract 121.03, Boulder County, Colorado -450,442.600***
## (37,461.770)
## NAMECensus Tract 121.04, Boulder County, Colorado -474,141.100***
## (38,143.220)
## NAMECensus Tract 121.05, Boulder County, Colorado -564,486.200***
## (36,360.070)
## NAMECensus Tract 122.01, Boulder County, Colorado -107,092.700***
## (40,279.420)
## NAMECensus Tract 122.02, Boulder County, Colorado -604,179.900***
## (51,578.850)
## NAMECensus Tract 122.03, Boulder County, Colorado -908,887.200***
## (40,672.780)
## NAMECensus Tract 122.04, Boulder County, Colorado 68,755.930
## (72,576.240)
## NAMECensus Tract 123, Boulder County, Colorado -224,758.200
## (204,781.900)
## NAMECensus Tract 124.01, Boulder County, Colorado -547,880.200***
## (44,598.890)
## NAMECensus Tract 125.01, Boulder County, Colorado -832,136.600***
## (48,173.570)
## NAMECensus Tract 125.05, Boulder County, Colorado -403,155.100***
## (34,708.800)
## NAMECensus Tract 125.07, Boulder County, Colorado -811,439.400***
## (38,952.480)
## NAMECensus Tract 125.08, Boulder County, Colorado -785,693.600***
## (47,156.880)
## NAMECensus Tract 125.09, Boulder County, Colorado -641,933.500***
## (38,218.910)
## NAMECensus Tract 125.10, Boulder County, Colorado -406,190.600***
## (36,716.310)
## NAMECensus Tract 125.11, Boulder County, Colorado -831,142.900***
## (48,005.860)
## NAMECensus Tract 126.03, Boulder County, Colorado -782,932.400***
## (37,688.170)
## NAMECensus Tract 126.05, Boulder County, Colorado -823,293.800***
## (119,922.700)
## NAMECensus Tract 126.07, Boulder County, Colorado -794,936.700***
## (75,989.360)
## NAMECensus Tract 126.08, Boulder County, Colorado -835,831.600***
## (43,799.460)
## NAMECensus Tract 127.01, Boulder County, Colorado -793,634.500***
## (34,457.280)
## NAMECensus Tract 127.05, Boulder County, Colorado -977,923.000***
## (50,891.030)
## NAMECensus Tract 127.07, Boulder County, Colorado -527,560.200***
## (53,357.910)
## NAMECensus Tract 127.08, Boulder County, Colorado -1,000,356.000***
## (31,733.700)
## NAMECensus Tract 127.09, Boulder County, Colorado -956,894.000***
## (46,843.480)
## NAMECensus Tract 127.10, Boulder County, Colorado -714,038.100***
## (36,904.710)
## NAMECensus Tract 128, Boulder County, Colorado -1,049,945.000***
## (32,834.850)
## NAMECensus Tract 129.03, Boulder County, Colorado -1,026,141.000***
## (42,445.040)
## NAMECensus Tract 129.04, Boulder County, Colorado -916,862.700***
## (35,028.150)
## NAMECensus Tract 129.05, Boulder County, Colorado -854,026.800***
## (46,646.240)
## NAMECensus Tract 129.07, Boulder County, Colorado -878,942.200***
## (38,673.800)
## NAMECensus Tract 130.03, Boulder County, Colorado -855,484.100***
## (32,498.650)
## NAMECensus Tract 130.04, Boulder County, Colorado -834,528.100***
## (39,450.380)
## NAMECensus Tract 130.05, Boulder County, Colorado -768,116.300***
## (38,333.000)
## NAMECensus Tract 130.06, Boulder County, Colorado -785,315.600***
## (34,524.040)
## NAMECensus Tract 132.01, Boulder County, Colorado -864,235.600***
## (53,772.710)
## NAMECensus Tract 132.02, Boulder County, Colorado -468,161.900***
## (58,466.230)
## NAMECensus Tract 132.05, Boulder County, Colorado -977,960.000***
## (33,472.010)
## NAMECensus Tract 132.07, Boulder County, Colorado -1,035,867.000***
## (42,728.120)
## NAMECensus Tract 132.08, Boulder County, Colorado -1,064,972.000***
## (38,454.710)
## NAMECensus Tract 132.10, Boulder County, Colorado -1,009,256.000***
## (37,958.490)
## NAMECensus Tract 132.11, Boulder County, Colorado -1,067,702.000***
## (32,352.920)
## NAMECensus Tract 132.12, Boulder County, Colorado -998,199.800***
## (33,912.990)
## NAMECensus Tract 132.13, Boulder County, Colorado -1,105,218.000***
## (31,034.140)
## NAMECensus Tract 133.02, Boulder County, Colorado -931,591.800***
## (35,561.400)
## NAMECensus Tract 133.05, Boulder County, Colorado -1,010,090.000***
## (40,351.290)
## NAMECensus Tract 133.06, Boulder County, Colorado -991,054.600***
## (44,735.560)
## NAMECensus Tract 133.07, Boulder County, Colorado -1,006,140.000***
## (42,341.790)
## NAMECensus Tract 133.08, Boulder County, Colorado -966,873.400***
## (41,257.400)
## NAMECensus Tract 134.01, Boulder County, Colorado -983,564.500***
## (42,695.800)
## NAMECensus Tract 134.02, Boulder County, Colorado -1,024,232.000***
## (38,887.100)
## NAMECensus Tract 135.03, Boulder County, Colorado -952,902.900***
## (41,432.020)
## NAMECensus Tract 135.05, Boulder County, Colorado -980,680.100***
## (52,829.210)
## NAMECensus Tract 135.06, Boulder County, Colorado -1,035,724.000***
## (40,427.010)
## NAMECensus Tract 135.07, Boulder County, Colorado -1,013,692.000***
## (43,429.810)
## NAMECensus Tract 135.08, Boulder County, Colorado -1,011,534.000***
## (38,753.720)
## NAMECensus Tract 136.01, Boulder County, Colorado -844,444.900***
## (43,457.960)
## NAMECensus Tract 136.02, Boulder County, Colorado -811,031.700***
## (54,623.040)
## NAMECensus Tract 137.01, Boulder County, Colorado -857,892.200***
## (31,000.510)
## NAMECensus Tract 137.02, Boulder County, Colorado -887,061.900***
## (39,435.060)
## NAMECensus Tract 606, Boulder County, Colorado -962,767.900***
## (35,090.790)
## NAMECensus Tract 607, Boulder County, Colorado -879,919.200***
## (41,654.380)
## NAMECensus Tract 608, Boulder County, Colorado -852,642.900***
## (39,792.880)
## NAMECensus Tract 609, Boulder County, Colorado -929,643.800***
## (37,861.730)
## NAMECensus Tract 613, Boulder County, Colorado -982,603.100***
## (41,392.490)
## NAMECensus Tract 614, Boulder County, Colorado -965,569.100***
## (38,996.000)
## year_quarter2019-2 12,616.910
## (14,091.770)
## year_quarter2019-3 23,693.930*
## (14,387.170)
## year_quarter2019-4 31,147.090**
## (14,869.550)
## year_quarter2020-1 34,298.120**
## (15,356.700)
## year_quarter2020-2 42,321.820***
## (14,843.090)
## year_quarter2020-3 57,282.720***
## (13,835.280)
## year_quarter2020-4 94,036.190***
## (14,389.950)
## year_quarter2021-1 168,304.500***
## (15,464.510)
## year_quarter2021-2 247,198.600***
## (14,808.640)
## bld_num 535,797.900***
## (23,898.800)
## designCodeDscr2-3 Story -50,482.490***
## (12,819.180)
## designCodeDscrBi-level 12,498.260
## (29,598.120)
## designCodeDscrMULTI STORY- TOWNHOUSE -178,860.800***
## (16,127.810)
## designCodeDscrSplit-level -25,825.390
## (23,331.500)
## qualityCodeDscrAVERAGE + -51,043.520***
## (13,409.330)
## qualityCodeDscrAVERAGE ++ -21,157.950
## (13,940.900)
## qualityCodeDscrEXCELLENT 1,014,861.000***
## (31,576.420)
## qualityCodeDscrEXCELLENT + 1,205,749.000***
## (63,750.040)
## qualityCodeDscrEXCELLENT++ 1,852,756.000***
## (52,550.330)
## qualityCodeDscrEXCEPTIONAL 1 1,044,320.000***
## (68,128.260)
## qualityCodeDscrEXCEPTIONAL 2 1,399,918.000***
## (176,559.100)
## qualityCodeDscrFAIR -54,942.740
## (35,073.940)
## qualityCodeDscrGOOD 54,058.410***
## (10,198.260)
## qualityCodeDscrGOOD + 78,145.890***
## (16,287.510)
## qualityCodeDscrGOOD ++ 148,599.100***
## (15,368.160)
## qualityCodeDscrLOW -58,446.710
## (72,748.980)
## qualityCodeDscrVERY GOOD 242,727.800***
## (16,543.990)
## qualityCodeDscrVERY GOOD + 451,555.200***
## (27,344.760)
## qualityCodeDscrVERY GOOD ++ 571,376.600***
## (23,765.010)
## builtYear -713.599***
## (228.302)
## CompCode 308,362.900***
## (34,592.690)
## EffectiveYear 3,125.189***
## (301.797)
## bsmtSF -30.826***
## (8.186)
## bsmtTypeDscrGARDEN BASEMENT FINISHED AREA 90,927.880***
## (20,603.910)
## bsmtTypeDscrGARDEN BASEMENT UNFINISHED AREA 51,633.710
## (36,688.580)
## bsmtTypeDscrLOWER LVL GARDEN FINISHED (BI-SPLIT LVL) 82,890.740***
## (26,155.640)
## bsmtTypeDscrLOWER LVL GARDEN UNFINISHED (BI-SPLIT LVL) -107,186.500
## (94,109.090)
## bsmtTypeDscrLOWER LVL WALKOUT FINISHED (BI-SPLIT LVL) 173,312.200***
## (52,108.140)
## bsmtTypeDscrLOWER LVL WALKOUT UNFINISHED (BI-SPLIT LVL) 65,171.950
## (204,684.300)
## bsmtTypeDscrSUBTERRANEAN BASEMENT FINISHED AREA 58,274.030***
## (12,079.670)
## bsmtTypeDscrSUBTERRANEAN BASEMENT UNFINISHED AREA 22,612.160*
## (12,164.890)
## bsmtTypeDscrWALK-OUT BASEMENT FINISHED AREA 127,795.700***
## (17,007.160)
## bsmtTypeDscrWALK-OUT BASEMENT UNFINISHED AREA 62,771.790**
## (24,822.120)
## carStorageTypeDscrATTACHED GARAGE AREA 36,752.290***
## (13,449.260)
## carStorageTypeDscrBASEMENT GARAGE AREA 141,826.700***
## (27,833.650)
## carStorageTypeDscrCARPORT AREA 10,727.650
## (24,318.130)
## carStorageTypeDscrDETACHED GARAGE 76,190.090***
## (14,443.760)
## carStorageTypeDscrGARAGE SET UP AS A WORKSHOP (ELEC., ETC.) AREA -66,109.580
## (110,180.000)
## carStorageTypeDscrGARAGE W/ FINISHED WALLS AREA 22,351.180
## (102,894.400)
## mainfloorSF 36.292***
## (12.074)
## nbrThreeQtrBaths 45,235.750***
## (5,815.871)
## nbrFullBaths 17,658.330***
## (5,931.780)
## nbrHalfBaths 35,506.410***
## (6,971.469)
## TotalFinishedSF 147.823***
## (10.188)
## ExtWallDscrPrimBlock Stucco 167,480.000**
## (72,844.240)
## ExtWallDscrPrimBrick on Block 267,953.100***
## (67,983.460)
## ExtWallDscrPrimBrick Veneer 94,757.290
## (62,421.650)
## ExtWallDscrPrimCedar 34,328.910
## (124,041.700)
## ExtWallDscrPrimCement Board 93,149.210
## (61,362.260)
## ExtWallDscrPrimFaux Stone 201,153.800***
## (74,682.410)
## ExtWallDscrPrimFrame Asbestos 171,854.400
## (295,639.000)
## ExtWallDscrPrimFrame Stucco 73,069.180
## (62,405.930)
## ExtWallDscrPrimFrame Wood/Shake 147,071.600**
## (61,313.670)
## ExtWallDscrPrimLog 168,914.800**
## (69,718.270)
## ExtWallDscrPrimMetal 171,517.900
## (119,286.400)
## ExtWallDscrPrimMoss Rock/Flagstone 135,905.800**
## (67,151.330)
## ExtWallDscrPrimPainted Block 112,289.900
## (85,733.910)
## ExtWallDscrPrimStrawbale 246,197.000
## (214,830.200)
## ExtWallDscrPrimVinyl 147,140.600**
## (71,182.630)
## HeatingDscrElectric -132,959.500***
## (39,291.580)
## HeatingDscrElectric Wall Heat (1500W) 275,842.300
## (206,672.500)
## HeatingDscrForced Air -80,408.020**
## (36,108.330)
## HeatingDscrGravity -20,625.360
## (67,712.130)
## HeatingDscrHeat Pump -196,400.400*
## (109,438.900)
## HeatingDscrHot Water -20,764.100
## (37,418.470)
## HeatingDscrNo HVAC -188,088.500
## (208,099.800)
## HeatingDscrPackage Unit 231,031.600
## (293,667.900)
## HeatingDscrRadiant Floor 126,866.000***
## (47,575.510)
## HeatingDscrVentilation Only -398,786.000
## (309,432.300)
## HeatingDscrWall Furnace 36,175.080
## (47,702.750)
## ExtWallDscrSecBlock Stucco 774,196.800***
## (110,633.500)
## ExtWallDscrSecBrick on Block -10,576.010
## (93,210.560)
## ExtWallDscrSecBrick Veneer -14,796.800
## (13,230.290)
## ExtWallDscrSecCedar 418,255.100**
## (168,070.200)
## ExtWallDscrSecCement Board 6,972.315
## (71,306.360)
## ExtWallDscrSecFaux Stone 19,862.340
## (15,398.080)
## ExtWallDscrSecFrame Asbestos 135,124.200
## (118,539.200)
## ExtWallDscrSecFrame Stucco -47,876.670*
## (28,780.820)
## ExtWallDscrSecFrame Wood/Shake 29,783.440**
## (12,689.530)
## ExtWallDscrSecLog 117,830.900
## (145,220.400)
## ExtWallDscrSecMetal -35,892.170
## (84,261.590)
## ExtWallDscrSecMoss Rock/Flagstone 44,313.640
## (32,156.520)
## ExtWallDscrSecPainted Block 192,898.000
## (168,895.900)
## ExtWallDscrSecVinyl -19,062.270
## (97,143.770)
## Roof_CoverDscrAsphalt -6,984.258
## (7,720.393)
## Roof_CoverDscrBuilt-Up 476,225.500***
## (122,000.900)
## Roof_CoverDscrClay Tile -71,847.950
## (47,515.480)
## Roof_CoverDscrConcrete Tile -84,741.840***
## (24,966.070)
## Roof_CoverDscrMetal 176,562.000***
## (26,065.220)
## Roof_CoverDscrRoll -101,376.000
## (288,217.000)
## Roof_CoverDscrRubber Membrane 130,813.100***
## (27,620.800)
## Roof_CoverDscrShake -72,925.380
## (48,736.270)
## Roof_CoverDscrTar and Gravel 169,176.700
## (146,610.200)
## crime_nn3 -14.710***
## (1.901)
## Constant -4,560,358.000***
## (604,719.400)
## --------------------------------------------------------------------------------------------
## Observations 9,475
## R2 0.753
## Adjusted R2 0.749
## Residual Std. Error 286,804.800 (df = 9303)
## F Statistic 166.015*** (df = 171; 9303)
## ============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The new model shows an adjusted R-square of 0.749, at a same level of the previous.
boulder.test <-
boulder.test %>%
mutate(price.Predict = predict(all_reg_fin, boulder.test),
price.Error = price.Predict - price,
price.AbsError = abs(price.Predict - price),
price.APE = (abs(price.Predict - price)) / price.Predict) %>%
filter(price < 5000000)
MAE <- mean(boulder.test$price.AbsError, na.rm = T)
MAPE <- mean(boulder.test$price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)
kable(acc) %>% kable_styling(full_width = F)
MAE | MAPE |
---|---|
107989.5 | 0.1968083 |
The MAE of the model is 107989.5, which means the mean absolute error. The MAPE is 0.1968083, which means there is a 19.68% error between the predicted value and true value.
Cross validation test is applied to the regression model.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
all_reg_fin.cv <-
train(price ~ ., data = st_drop_geometry(boulder.train) %>%
dplyr::select(price, NAME,year_quarter, bld_num,
designCodeDscr, qualityCodeDscr,
builtYear, CompCode, EffectiveYear, bsmtSF,
bsmtTypeDscr,carStorageTypeDscr,mainfloorSF,
nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
TotalFinishedSF, HeatingDscr, ExtWallDscrPrim,
ExtWallDscrSec, Roof_CoverDscr, crime_nn3),
method = "lm", trControl = fitControl, na.action = na.pass)
all_reg_fin.cv
## Linear Regression
##
## 9475 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 9381, 9379, 9381, 9380, 9381, 9381, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 283134.8 0.7588802 153884.2
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
all_reg_fin.cv$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 183483.3 0.8288958 128012.3 Fold001
## 2 167515.6 0.8415570 123110.9 Fold002
## 3 400506.9 0.6638319 191935.7 Fold003
## 4 271663.7 0.7967200 160424.5 Fold004
## 5 167155.8 0.8580706 125578.2 Fold005
sd(all_reg_fin.cv$resample[,3])
## [1] 24030.65
mean(all_reg_fin.cv$resample[,3])
## [1] 153884.2
hist(all_reg_fin.cv$resample[,3],
breaks = 50,
main = 'Distribution of MAE \nK fold cross validation;k=100',
xlab = 'Mean Absolute Mile',
ylab = 'count')
The mean of the MAEs is 24030.65 and the standard deviation is 153884.2. The histogram below shows the distribution of the MAEs of the 100 folds in k fold cross validation. As is shown in the graph, the MAEs is not very tightly distributed, which means the generalizability of the model is not very good.
Below is the graph of predicted price as a function of observed price. The red line represents a perfect prediction according to the observed house price, and the orange line represents our average prediction. Our prediction is quite near the perfect prediction, although there is a small gap between them.
ggplot(
boulder.test, aes(price.Predict, price)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "#FA7800") +
geom_abline(intercept = 0, slope = 1, color='red',size=1) +
labs(title = 'Predicted sale price as a function of observed price', subtitle = 'Red line represents a perfect prediction \nOrange line represents an average prediction') +
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
The graph of lag price error and price error shows a slight positive relationship between them, which also indicates the existence of spatial autocorrelation.
coords.test <- st_coordinates(boulder.test)
tractList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(tractList.test, style="W")
boulder.test <- boulder.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error))
ggplot(data=boulder.test, aes(lagPriceError, price.Error)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "#FA7800") +
labs(title = 'price error as a function of lag price errors') +
theme(plot.title = element_text(hjust = 0.5)) +
plotTheme()
## `geom_smooth()` using formula 'y ~ x'
ggplot() +
geom_sf(data = acsTractsBD19, fill = "grey60") +
geom_sf(data = boulder.test, aes(colour = q5(lagPriceError)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder.test,"lagPriceError"),
name="Quintile\nBreaks") +
labs(title="a map of spatial lag in errors") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
Below is the graph of Moran’s I test. The histogram means the randomly permuted Moran’s I and the orange line means the observed Moran’s I. The graph shows that there is somewhat spatial autocorrelation in the residuals’ distribution, but the autocorrelation is not very high. Ideally, there should be no spatial autocorrelation if the model is absolutely generalized.
moranTest <- moran.mc(boulder.test$price.Error,
spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange",
x="Moran's I",
y="Count") +
plotTheme()
Below is a list of mean MAPE of each census tract. There are differences between the mean MAPE in each census tract, but the majority lies between 0.10 to 0.25. The max value of MAPE is 0.84 and occurs in Census Tract 132.01, which means our model does not fit well in this particular region. As is shown in the map, the MAPE in the center of the region is lower than the outside of the region, which means our model fit better in center city.
test_map = st_drop_geometry(boulder.test) %>%
group_by(NAME) %>%
summarize(MAPE = mean(price.APE, na.rm = T))
test_map
## # A tibble: 63 x 2
## NAME MAPE
## <chr> <dbl>
## 1 Census Tract 121.01, Boulder County, Colorado 0.197
## 2 Census Tract 121.02, Boulder County, Colorado 0.235
## 3 Census Tract 121.03, Boulder County, Colorado 0.206
## 4 Census Tract 121.04, Boulder County, Colorado 0.207
## 5 Census Tract 121.05, Boulder County, Colorado 0.182
## 6 Census Tract 122.01, Boulder County, Colorado 0.605
## 7 Census Tract 122.02, Boulder County, Colorado 0.236
## 8 Census Tract 122.03, Boulder County, Colorado 0.136
## 9 Census Tract 124.01, Boulder County, Colorado 0.162
## 10 Census Tract 125.01, Boulder County, Colorado 0.120
## # ... with 53 more rows
Below is a map of the predicted price of our regression model. As is shown in the map, predicted prices tend to cluster. Most high-price houses are in the center city and suburb, while low-price houses locate in the mountainous area in the west.
boulder_data_1 <-
boulder_data %>%
mutate(price.Predict = predict(all_reg_fin, boulder_data))%>%
filter(price < 5000000)
ggplot() +
geom_sf(data = acsTractsBD19, fill = "grey60") +
geom_sf(data = boulder_data_1, aes(colour = q5(price.Predict)),
show.legend = "point", size = .75) +
scale_colour_manual(values = palette5,
labels=qBr(boulder_data_1,"price.Predict"),
name="Quintile\nBreaks") +
labs(title="predicted price with model") +
theme(plot.title = element_text(hjust = 0.5)) +
mapTheme()
The scatter plot shows that mean MAPE of each census tract is not that relevant with mean house price.
mm_table <- left_join(
st_drop_geometry(boulder.test) %>%
group_by(NAME) %>%
summarize(meanPrice = mean(price, na.rm = T)),
st_drop_geometry(boulder.test) %>%
group_by(NAME) %>%
summarize(MAPE = mean(price.APE, na.rm = T)))
## Joining, by = "NAME"
mm_table <- merge(acsTractsBD19, mm_table,by.x = 'NAME',by.y = 'NAME') # add geometry of each census tracts
mm_table <- mm_table[,c('NAME','MAPE','meanPrice')]
# mm_table
ggplot(data=st_drop_geometry(mm_table), aes(meanPrice, MAPE)) +
geom_point(size = 1.0, colour = "#FA7800") +
labs(title = 'MAPE as a function of mean price in each census tract') +
theme(plot.title = element_text(hjust = 0.5))+
plotTheme()
ggplot() +
geom_sf(data = mm_table, aes(fill = MAPE)) +
scale_colour_manual(values = palette5,
labels=qBr(mm_table, "MAPE"),
name="Quintile\nBreaks") +
theme(plot.title = element_text(hjust = 0.5))+
labs(title="MAPE of Each Census Tract") +
mapTheme()
We select income as the factor to split the census tracts into two groups, high-income group and low-income group. The MAPE of high-income regions is 20% and that of low-income regions is 19%. There is little difference in MAPE for both income groups, showing our model has a good generalizability in different income groups.
# set income context
median_hhincome <- median(acsTractsBD19$med_hh_income)
acsTractsBD19 <- acsTractsBD19 %>%
mutate(incomeContext = ifelse(med_hh_income > median_hhincome, "High Income", "Low Income"))
#plot incime context
ggplot() + geom_sf(data = na.omit(acsTractsBD19), aes(fill = incomeContext)) +
scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
labs(title = "Income Context in Boulder County") +
theme(plot.title = element_text(hjust = 0.5),legend.position="bottom") +
mapTheme()
st_join(boulder.test, acsTractsBD19) %>%
group_by(incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Test set MAPE by census Tracts income context") %>% kable_styling(full_width = F)
High Income | Low Income |
---|---|
20% | 19% |
# testing_data["price"] <- predict(all_reg_fin, testing_data)
# write.csv(testing_data,"predictions.csv", row.names = FALSE)
In general, our model is an effective one. Our R-square reaches 0.75 which means the major variation in house price can be interpreted by our model. The MAPE is at the level of 0.19, showing a moderate accuracy in local house price prediction. Our model performs especially well in the downtown area, with an average MAPE of about 0.12.
However, there is also limitation in our model that there is spatial autocorrelation in our residuals, showing an unsatisfying generalizability. In the center and suburb of the boulder city, our model performs quite well, with a lower MAPE than average. But in some specific outer regions, the high MAPE value indicates a bad model adoption. The reason of this might be that there are more houses in the center city and suburbs, so most of our training data come from these regions. Due to the lack of training data, predicted house price in the outer region varies quite a lot from the observed price. What’s more, houses in the mountainous areas are less dense and therefore farther from the surrounding public facilities, some of our predictors do not apply when predicting this kind of houses.
The map of our prediction result shows that the lower house prices cluster in the downtown, while those houses in the suburb have a greater value. This pattern is quite common in US cities.
There are some interesting predictors when training the regression model. We introduced a couple of census data to the model as predictors and we can see level of income, level of education and poverty rate all have an impact on regional house price. What surprises us when doing the training part is that the accessibility to schools (calculated as 3 nearest schools’ distance) is not significant in influencing the house price, and neither is the accessibility to hospitals. This is quite different from my ‘knowledge’ and surprised me at first.
In general, we’d like to recommend our model to Zillow. Firstly, our model reaches a moderate accuracy and can be applied especially in downtown house price prediction with a MAPE about 12% on average.
Furthermore, we consider a wide variety of predictors in the process of model training. The predictors come from the field of houses’ internal characteristics, public amenities near houses, spatial structure, and socioeconomical characteristics and have been carefully selected to avoid correlation and multi-collinearity.
Different methods, like creating data partition, are also introduced to reduce errors when training our model. Thus, our model is likely to be helpful in house price prediction in boulder county.
There are also a few limitations in our model. Our model does not reach a satisfying generalizability, and is quite restricted in the downtown area of boulder county. We’d like to improve it with more generalizable and useful datasets and pay more attention on model training.