Introduction

Zillow is the most-visited real estate website in the United States which offer customers services of selling, buying, and renting homes. It is essential for Zillow to predict housing prices to improve the level of service. In this project, we are going to build a predictive model of home prices to help Zillow forecast housing market in Boulder County, CO. The goal is to train a model with recent transactions and test whether the model is accurate and generalizable enough to predict the home price of properties that have not recently sold.

Firstly, we would collect data covering houses’ internal characteristic, their nearby public services, information about places they are in and so on. After compiling the data into one dataset, we would do some data wrangling, variables transformation, and independent variables selection. In the next step, we would develop a model with OLS regression. Lastly, we will assess whether our model’s accuracy and generalizability with methods including cross validation, spatial lag error and Moran’s I.

Data collection is one of inconvenient difficulties. Firstly, some useful data might be inaccessible due to the concern of security, privacy, and the cost of collection. Secondly, there might be some errors in data we collect, especially those from unofficial departments like OpenStreetMap. Thirdly, we might subjectively omit datasets which are important in predicting. These kinds of situation would all lead to decrease of our model’s efficiency. Besides, methods that would be used to training data in our research has its inherent limitation, which will also have an influence on successful machine learning.

Our result shows, we trained a model with a moderate accuracy, with an overall MAPE of 0.19 and R-square of 0.75. Despite some limitation in generalizability, our model can act as an reference when studying the house price in boulder county.

Data

Data collection

There are three main sources of data. First one is open boulder data, which is an official website offering many sorts of city data. It provides dataset including traffic stations, fire stations, and police offenses. The second one is Boulder’s census tract data which can be accessed with ‘tidycensus’ package in r. we also scrape school data, hospital data, transit data, and commercial data from open street map. It is noteworthy that since people can edit OpenStreetMap collaboratively, so it may lead to incomplete and wrong data.

# transform variables
st_c <- st_coordinates
boulder_data <-
  boulder_data %>%
  mutate(
      fire_stations_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(fire_stations)), 3),
      commercial_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(commercial))[,c(1,2)], 3),
      traffic_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(traffic_stations)), 3),
      schools_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(schools))[,c(1,2)], 3),
      crime_nn3 = nn_function(na.omit(st_c(boulder_data)),na.omit(st_c(crimes)[,c(1,2)]), 3)
      )
# load api key
census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
# variance of interest in census data
census_vars <- c( "B01001_001E", # ACS total Pop estimate
                  "B25002_001E", # Estimate of total housing units
                  "B25002_003E", # Number of vacant housing units
                  "B19013_001E", # Median HH Income ($)
                  "B02001_002E", # People describing themselves as "white alone"
                  "B06009_006E", # Total graduate or professional degree
                  "B06012_002E", # Number of below 100 percent of the poverty level
                  "B23025_002E", # Number of people in labor force
                  "B23025_005E") # Number of people in labor force unemployed

# get access to census data of 2019
acsTractsBD19 <- get_acs(geography = "tract",
                         year = 2019, 
                         variables = census_vars, 
                         geometry = TRUE, 
                         state = "Colorado", 
                         county = "Boulder",
                         output = "wide") 
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# drop extra data columns
acsTractsBD19 <- acsTractsBD19[-c(4,6,8,10,12,14,16,18,20)] 

# rename columns
acsTractsBD19 <- acsTractsBD19 %>%
  rename (total_pop = B01001_001E,
          total_hu = B25002_001E,
          total_vacant = B25002_003E,
          med_hh_income = B19013_001E,
          total_white = B02001_002E,
          total_graddeg = B06009_006E,
          total_belpov100 = B06012_002E,
          total_labor_f = B23025_002E,
          total_unemployed = B23025_005E)

# add new columns
acsTractsBD19 <- acsTractsBD19 %>%
  mutate(pctvacancy = total_vacant/total_hu,
         pctwhite = total_white/total_pop,
         pctunemployed = total_unemployed/total_labor_f,
         pctedu = total_graddeg/total_pop) %>%
    dplyr::select(-total_vacant,-total_hu,-total_white,-total_unemployed,-total_labor_f,-total_graddeg) %>%
    st_transform('ESRI:102254')

# add census data to boulder data
boulder_data <- st_join(boulder_data, acsTractsBD19, join = st_intersects)
training_data <- 
  filter(boulder_data,toPredict == 0) %>% 
  dplyr::select(-toPredict)
  
testing_data <- 
  filter(boulder_data,toPredict == 1) %>% 
  dplyr::select(-toPredict)

Categories of data

In our analysis, there are four kinds of variables in our analysis.

  1. houses’ internal characteristics: “saleDate”,“year”,“year_quarter”,“address”,“bld_num”,“section_num”,“designCode”,“designCodeDscr”, “qualityCode”,“qualityCodeDscr”,“bldgClass”,bldgClassDscr“,”ConstCode“,”ConstCodeDscr“,”builtYear“,”CompCode“,”EffectiveYear"“bsmtSF”,“bsmtType”,“bsmtTypeDscr”,“carStorageSF”,“carStorageType”, “carStorageTypeDscr”,“nbrBedRoom”,“nbrRoomsNobath”,“mainfloorSF”,“nbrThreeQtrBaths”,“nbrFullBaths”, “nbrHalfBaths”,“TotalFinishedSF”,“Ac”,“AcDscr”,“Heating”,“HeatingDscr”,“ExtWallPrim”,“ExtWallDscrPrim”,
    “ExtWallSec”,“ExtWallDscrSec”,“IntWall”,“IntWallDscr”,“Roof_Cover”,“Roof_CoverDscr”,“Stories”,
    “UnitCount”,“status_cd”,“MUSA_ID”

  2. public amenities near houses: We can’t measure the exposure of each house to public services directly, so we would calculate the average distance from each home sale to its three nearest sepecific public service with ‘nn_function’. “fire_stations_nn3”,“commercial_nn3”,“traffic_nn3”,“schools_nn3”,“crime_nn3”

  3. spatial structure: According to the thumb rule that those which are near to each other are more related when compared to those that are further away, we added spatial structure type of data. “GEOID”,“NAME”

  4. the socioeconomical data on the census tracts in which each house is located in: “total_pop”,“total_belpov100”,“pctvacancy”,“pctwhite”, “pctunemployed”,“pctedu”

Exploring data

Firstly, we develop a statistic summary table of variables.

stargazer(st_drop_geometry(training_data), 
          type = 'text', 
          title = "Summary Statistics")
## 
## Summary Statistics
## ==========================================================================================
## Statistic           N       Mean      St. Dev.      Min    Pctl(25)   Pctl(75)     Max    
## ------------------------------------------------------------------------------------------
## price             11,264 749,112.900 616,228.300  10,000    452,700  831,580.8  31,500,000
## year              11,264  2,019.791     0.729      2,019     2,019     2,020      2,021   
## bld_num           11,264    1.008       0.137        1         1         1          9     
## section_num       11,264    1.009       0.140        1         1         1          9     
## qualityCode       11,264   37.301       7.864       10        30         41         80    
## bldgClass         11,264  1,212.000     0.000      1,212     1,212     1,212      1,212   
## ConstCode         11,264   310.592     18.436        0        310       310        420    
## builtYear         11,264  1,987.169    26.797      1,860     1,972     2,006      2,021   
## CompCode          11,264    0.988       0.096        0         1         1          1     
## EffectiveYear     11,264  1,997.698    17.475      1,879     1,989     2,010      2,021   
## bsmtSF            11,264   744.117     627.701       0       135.5     1,126      4,534   
## carStorageSF      11,264   471.566     235.759       0        380       600       3,040   
## nbrBedRoom        11,264    3.444       1.051        0         3         4          24    
## nbrRoomsNobath    11,264    7.390       2.136        0         6         9          28    
## mainfloorSF       11,264  1,313.107    607.499       0       922.8     1,619      7,795   
## nbrThreeQtrBaths  11,264    0.617       0.730        0         0         1          9     
## nbrFullBaths      11,264    1.781       0.872        0         1         2          12    
## nbrHalfBaths      11,264    0.555       0.540        0         0         1          3     
## TotalFinishedSF   11,264  1,954.525    891.631       0       1,276     2,443      10,188  
## Ac                7,423    210.116      1.261     200.000   210.000   210.000    220.000  
## Heating           11,160   814.046     14.791     800.000   810.000   810.000   1,000.000 
## ExtWallPrim       11,260   46.238      49.098      0.000    10.000     40.000    140.000  
## ExtWallSec        3,418    529.596     39.341     500.000   500.000   530.000    630.000  
## IntWall           11,186  1,130.171     4.769    1,110.000 1,130.000 1,130.000  1,160.000 
## Roof_Cover        8,031   1,214.176    14.994    1,210.000 1,210.000 1,210.000  1,290.000 
## Stories             4       1.000       0.000      1.000     1.000     1.000      1.000   
## UnitCount           4       2.500       1.000      2.000     2.000     2.500      4.000   
## MUSA_ID           11,264  5,680.483   3,281.191      1      2,833.8   8,522.2     11,364  
## fire_stations_nn3 11,264 12,072.240   6,412.017  1,367.279 7,671.262 17,222.100 34,142.450
## commercial_nn3    11,264  1,756.156   2,499.654   24.553    624.674  1,745.199  23,878.360
## traffic_nn3       11,264  2,165.635   1,074.113   101.082  1,472.874 2,703.550  6,870.880 
## schools_nn3       11,264  1,776.347   3,390.153   30.079    514.459  1,536.743  25,693.050
## crime_nn3         11,264  7,602.171   4,902.370   15.768   4,324.914 10,879.090 30,610.350
## total_pop         11,264  6,122.057   2,911.499     867      4,089     7,365      12,923  
## med_hh_income     11,264 97,710.940  25,115.450   23,699   82,068.5   120,717    173,500  
## total_belpov100   11,264   483.164     394.325      40        227       724       3,202   
## pctvacancy        11,264    0.055       0.081      0.000     0.014     0.063      0.713   
## pctwhite          11,264    0.897       0.047      0.749     0.877     0.928      0.978   
## pctunemployed     11,264    0.040       0.016      0.007     0.031     0.052      0.096   
## pctedu            11,264    0.191       0.074      0.025     0.136     0.238      0.383   
## ------------------------------------------------------------------------------------------

Based on summary statistic table, Stories and UnitCount which have only 4 observations are dropped for their low-impact number.

bldgClass variable has the same value in every observation, so it has no influence on our model.

There are also several parings of data, one represents the code while the other is for description.We only remain the description one.

boulder_data <- boulder_data %>%
  dplyr::select(-Stories, -UnitCount, -bldgClass, -bldgClassDscr,
                -designCode, -qualityCode, -ConstCode,
                -carStorageType, -bsmtType, -Ac,
                -Heating, -IntWall, -Roof_Cover, -ExtWallSec,
                -ExtWallPrim, -status_cd)

training_data <- 
  filter(boulder_data,toPredict == 0) %>% 
  dplyr::select(-toPredict)
  
testing_data <- 
  filter(boulder_data,toPredict == 1) %>% 
  dplyr::select(-toPredict)

We develop histograms of all numeric variables. For variables that are not normally distributed, we would log-transform them, which is helpful to train a greater model.

# select numeric variables
numericVars <-
  select_if(st_drop_geometry(training_data), is.numeric) %>% na.omit()

# plot histograms of numeric variables
ggplot(gather(numericVars), aes(value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~key, scales = 'free_x')

# log tranform some variables.
boulder_data <-
  boulder_data %>%
  mutate(
      lncommercial_nn3 = log(commercial_nn3),
      lnpctedu = log(pctedu),
      lnpctunemployed = log(pctunemployed),
      lnpctwhite = log(pctwhite),
      lnschools_nn3 = log(schools_nn3),
      lntotal_belpov100 = log(total_belpov100),
      lntotal_pop = log(total_pop)
      ) %>%
  dplyr::select(-commercial_nn3, -pctedu,
         -pctunemployed,  -pctwhite,
         -schools_nn3, -total_belpov100, -total_pop)

training_data <- 
  filter(boulder_data,toPredict == 0) %>% 
  dplyr::select(-toPredict)
  
testing_data <- 
  filter(boulder_data,toPredict == 1) %>% 
  dplyr::select(-toPredict)

The below graph is a correlation matrix of numeric variables. The darker the color, the higher the r correlation, and the closely two variables are correlated.

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  show.diag = TRUE,
  insig = "blank") +  
  labs(title = "Correlation across numeric variables") +
  theme(plot.title = element_text(hjust = 0.5)) +
  plotTheme()

To represent the correlation between variables more explicitly, we make a correlation matrix in the type of table.

cormax <- cor(numericVars)
cormax.df <- as.data.frame(t(cormax))
cormax.df
##                          price         year       bld_num  section_num
## price              1.000000000  0.111253820  0.0888673114  0.090224206
## year               0.111253820  1.000000000 -0.0053461517 -0.006970878
## bld_num            0.088867311 -0.005346152  1.0000000000  0.951453969
## section_num        0.090224206 -0.006970878  0.9514539695  1.000000000
## builtYear         -0.047831765 -0.040693891 -0.0375865251 -0.048836741
## CompCode           0.001904995 -0.124878234 -0.0024582397  0.002978419
## EffectiveYear      0.083457052 -0.121552435 -0.0455018549 -0.054556577
## bsmtSF             0.183016210 -0.007669119 -0.0619467986 -0.066863293
## carStorageSF       0.224488400 -0.005520495 -0.0377596597 -0.044517786
## nbrBedRoom         0.267387067 -0.038135723 -0.1128380364 -0.122073670
## nbrRoomsNobath     0.354603271 -0.037689118 -0.1076506926 -0.117248170
## mainfloorSF        0.381421772 -0.003643487 -0.0525176835 -0.056852125
## nbrThreeQtrBaths   0.225803590 -0.002381863 -0.0284590683 -0.032438552
## nbrFullBaths       0.207105343 -0.033200198 -0.0727423408 -0.078814287
## nbrHalfBaths       0.154215592 -0.010019878 -0.0556786509 -0.058062627
## TotalFinishedSF    0.434325553 -0.012742931 -0.0649334034 -0.071723745
## MUSA_ID            0.005416397 -0.007824222 -0.0039809684 -0.002420071
## fire_stations_nn3 -0.428148834 -0.026190646 -0.0008059242  0.003609749
## commercial_nn3    -0.046209931 -0.023648979  0.0078526482  0.015048607
## traffic_nn3       -0.152217255 -0.006343534 -0.0153601223 -0.015950689
## schools_nn3       -0.043480929 -0.004061844  0.0107329735  0.017493709
## crime_nn3         -0.403490027 -0.030238350 -0.0012866508  0.003491714
## total_pop         -0.081133841 -0.035278751 -0.0193047037 -0.019570450
## med_hh_income      0.157113248 -0.025201288 -0.0097293025 -0.010337217
## total_belpov100    0.011663661  0.010558039  0.0101050077  0.009560429
## pctvacancy         0.004271884  0.011277044  0.0191063919  0.022220980
## pctwhite           0.038154772  0.020351973 -0.0055927083 -0.002458372
## pctunemployed     -0.036498085 -0.007109133  0.0040888823  0.011149888
## pctedu             0.401077539 -0.001609232 -0.0097692023 -0.010338321
##                      builtYear     CompCode EffectiveYear       bsmtSF
## price             -0.047831765  0.001904995   0.083457052  0.183016210
## year              -0.040693891 -0.124878234  -0.121552435 -0.007669119
## bld_num           -0.037586525 -0.002458240  -0.045501855 -0.061946799
## section_num       -0.048836741  0.002978419  -0.054556577 -0.066863293
## builtYear          1.000000000 -0.149228825   0.728546369  0.332641305
## CompCode          -0.149228825  1.000000000  -0.160820312 -0.046625580
## EffectiveYear      0.728546369 -0.160820312   1.000000000  0.298094656
## bsmtSF             0.332641305 -0.046625580   0.298094656  1.000000000
## carStorageSF       0.382186160 -0.049932442   0.293586189  0.425459961
## nbrBedRoom         0.179274991  0.013871247   0.221956962  0.391412303
## nbrRoomsNobath     0.210382878  0.016230011   0.257890037  0.431571053
## mainfloorSF        0.149299094 -0.035435056   0.189512638  0.542358525
## nbrThreeQtrBaths  -0.027766157 -0.036736226   0.019050070  0.101601740
## nbrFullBaths       0.418784859 -0.024380223   0.415760961  0.402939496
## nbrHalfBaths       0.326621007 -0.051345610   0.290546466  0.134037315
## TotalFinishedSF    0.389284395 -0.075513371   0.413392153  0.486149516
## MUSA_ID            0.004776114  0.007953813   0.004952922  0.009993530
## fire_stations_nn3  0.207025222 -0.016542160   0.088261376  0.045715196
## commercial_nn3     0.028324106  0.007217295  -0.019470052  0.011331548
## traffic_nn3       -0.094645609  0.007812952  -0.086474887 -0.187944281
## schools_nn3       -0.077915760  0.003770609  -0.116944716 -0.091321424
## crime_nn3          0.221817255 -0.017157589   0.101137693  0.049844257
## total_pop          0.384518061 -0.049656243   0.351445959  0.263836571
## med_hh_income      0.355006960 -0.059805725   0.288473115  0.209361206
## total_belpov100   -0.106619526  0.003277886   0.001947383  0.029506614
## pctvacancy        -0.210387411  0.018452842  -0.193783309 -0.167605138
## pctwhite          -0.102582622  0.037261540  -0.093923305  0.103710833
## pctunemployed     -0.155523374  0.040649051  -0.098695198  0.001849994
## pctedu            -0.088891278  0.007283817  -0.026584035  0.008115751
##                   carStorageSF   nbrBedRoom nbrRoomsNobath  mainfloorSF
## price              0.224488400  0.267387067    0.354603271  0.381421772
## year              -0.005520495 -0.038135723   -0.037689118 -0.003643487
## bld_num           -0.037759660 -0.112838036   -0.107650693 -0.052517683
## section_num       -0.044517786 -0.122073670   -0.117248170 -0.056852125
## builtYear          0.382186160  0.179274991    0.210382878  0.149299094
## CompCode          -0.049932442  0.013871247    0.016230011 -0.035435056
## EffectiveYear      0.293586189  0.221956962    0.257890037  0.189512638
## bsmtSF             0.425459961  0.391412303    0.431571053  0.542358525
## carStorageSF       1.000000000  0.356762111    0.444982166  0.492626205
## nbrBedRoom         0.356762111  1.000000000    0.766152939  0.383561760
## nbrRoomsNobath     0.444982166  0.766152939    1.000000000  0.468928168
## mainfloorSF        0.492626205  0.383561760    0.468928168  1.000000000
## nbrThreeQtrBaths   0.139994693  0.288322245    0.289731875  0.173128468
## nbrFullBaths       0.388357354  0.506984065    0.525035071  0.352513873
## nbrHalfBaths       0.234875665  0.181022682    0.286311664  0.081776017
## TotalFinishedSF    0.575771612  0.566611457    0.670092102  0.705306200
## MUSA_ID           -0.001421327  0.007441575    0.004461081  0.001721248
## fire_stations_nn3  0.042034563 -0.127488152   -0.177593298 -0.047914644
## commercial_nn3     0.017396487 -0.102540782   -0.071627826  0.074026677
## traffic_nn3       -0.142229641 -0.069636634   -0.100029939 -0.237029844
## schools_nn3       -0.095414537 -0.181055784   -0.150579415 -0.007046662
## crime_nn3          0.023829710 -0.139662511   -0.172689363 -0.056994704
## total_pop          0.185443848  0.120835077    0.117849737  0.164453885
## med_hh_income      0.274380864  0.191350528    0.262999762  0.185243576
## total_belpov100   -0.089757042  0.011181961   -0.015044431  0.014139594
## pctvacancy        -0.183288321 -0.214413411   -0.198212964 -0.072254920
## pctwhite           0.044536401  0.001093382    0.018387623  0.166292192
## pctunemployed     -0.071282544 -0.017174116   -0.054153463  0.010134360
## pctedu             0.016774251  0.100128144    0.174743351  0.091457423
##                   nbrThreeQtrBaths  nbrFullBaths  nbrHalfBaths TotalFinishedSF
## price                  0.225803590  0.2071053426  0.1542155915     0.434325553
## year                  -0.002381863 -0.0332001978 -0.0100198778    -0.012742931
## bld_num               -0.028459068 -0.0727423408 -0.0556786509    -0.064933403
## section_num           -0.032438552 -0.0788142867 -0.0580626273    -0.071723745
## builtYear             -0.027766157  0.4187848595  0.3266210071     0.389284395
## CompCode              -0.036736226 -0.0243802234 -0.0513456101    -0.075513371
## EffectiveYear          0.019050070  0.4157609609  0.2905464663     0.413392153
## bsmtSF                 0.101601740  0.4029394957  0.1340373154     0.486149516
## carStorageSF           0.139994693  0.3883573536  0.2348756648     0.575771612
## nbrBedRoom             0.288322245  0.5069840651  0.1810226824     0.566611457
## nbrRoomsNobath         0.289731875  0.5250350709  0.2863116642     0.670092102
## mainfloorSF            0.173128468  0.3525138733  0.0817760174     0.705306200
## nbrThreeQtrBaths       1.000000000 -0.2979752825 -0.0880992240     0.167976831
## nbrFullBaths          -0.297975282  1.0000000000  0.2159457950     0.598088498
## nbrHalfBaths          -0.088099224  0.2159457950  1.0000000000     0.372232281
## TotalFinishedSF        0.167976831  0.5980884984  0.3722322811     1.000000000
## MUSA_ID                0.017333595 -0.0098074503  0.0001713499    -0.003364032
## fire_stations_nn3     -0.165341961 -0.0165938211 -0.0619557790    -0.073840103
## commercial_nn3        -0.053440381  0.0005218684 -0.0421739411     0.056201938
## traffic_nn3           -0.038189482 -0.1191261707 -0.0374090398    -0.200999519
## schools_nn3           -0.046936701 -0.0796766729 -0.0954472819    -0.047506574
## crime_nn3             -0.173551979  0.0024802686 -0.0663684685    -0.057889159
## total_pop             -0.093173078  0.2582160809  0.1347894180     0.267850856
## med_hh_income          0.061178421  0.2849435288  0.1944129248     0.333520933
## total_belpov100       -0.052751368  0.0069381992 -0.0105743905     0.010709888
## pctvacancy            -0.032205380 -0.1542865406 -0.1191805004    -0.133593414
## pctwhite               0.006138100 -0.0085438926 -0.0738004358     0.073120340
## pctunemployed         -0.038148793 -0.0439511339 -0.0599807322    -0.068191112
## pctedu                 0.129625422  0.0829293464  0.0620296035     0.138958869
##                         MUSA_ID fire_stations_nn3 commercial_nn3  traffic_nn3
## price              0.0054163967     -0.4281488341  -0.0462099313 -0.152217255
## year              -0.0078242220     -0.0261906464  -0.0236489794 -0.006343534
## bld_num           -0.0039809684     -0.0008059242   0.0078526482 -0.015360122
## section_num       -0.0024200710      0.0036097495   0.0150486071 -0.015950689
## builtYear          0.0047761138      0.2070252219   0.0283241064 -0.094645609
## CompCode           0.0079538127     -0.0165421605   0.0072172955  0.007812952
## EffectiveYear      0.0049529223      0.0882613764  -0.0194700525 -0.086474887
## bsmtSF             0.0099935298      0.0457151962   0.0113315478 -0.187944281
## carStorageSF      -0.0014213270      0.0420345628   0.0173964870 -0.142229641
## nbrBedRoom         0.0074415753     -0.1274881522  -0.1025407818 -0.069636634
## nbrRoomsNobath     0.0044610805     -0.1775932976  -0.0716278261 -0.100029939
## mainfloorSF        0.0017212478     -0.0479146441   0.0740266772 -0.237029844
## nbrThreeQtrBaths   0.0173335955     -0.1653419612  -0.0534403812 -0.038189482
## nbrFullBaths      -0.0098074503     -0.0165938211   0.0005218684 -0.119126171
## nbrHalfBaths       0.0001713499     -0.0619557790  -0.0421739411 -0.037409040
## TotalFinishedSF   -0.0033640324     -0.0738401034   0.0562019384 -0.200999519
## MUSA_ID            1.0000000000      0.0012464439  -0.0062147082  0.004806899
## fire_stations_nn3  0.0012464439      1.0000000000   0.2402103855  0.164076133
## commercial_nn3    -0.0062147082      0.2402103855   1.0000000000 -0.088804894
## traffic_nn3        0.0048068987      0.1640761326  -0.0888048944  1.000000000
## schools_nn3       -0.0035661357      0.2755032375   0.6455045455 -0.147325570
## crime_nn3          0.0004416048      0.9295680400   0.3648543085  0.186258775
## total_pop         -0.0103363674      0.1231064285   0.1589664885 -0.171127540
## med_hh_income      0.0077329969     -0.2170400084   0.1830949627 -0.131714638
## total_belpov100   -0.0131581292     -0.0449116524   0.0219337031 -0.022406358
## pctvacancy        -0.0055773892      0.1454118502   0.6572442141 -0.057846869
## pctwhite          -0.0229831079      0.0936764518   0.2665317421 -0.515922551
## pctunemployed      0.0033199817      0.1097087952  -0.0208129708  0.003859160
## pctedu             0.0025029339     -0.7379720443   0.1050347304 -0.174191994
##                    schools_nn3     crime_nn3   total_pop med_hh_income
## price             -0.043480929 -0.4034900273 -0.08113384   0.157113248
## year              -0.004061844 -0.0302383503 -0.03527875  -0.025201288
## bld_num            0.010732974 -0.0012866508 -0.01930470  -0.009729302
## section_num        0.017493709  0.0034917142 -0.01957045  -0.010337217
## builtYear         -0.077915760  0.2218172553  0.38451806   0.355006960
## CompCode           0.003770609 -0.0171575888 -0.04965624  -0.059805725
## EffectiveYear     -0.116944716  0.1011376930  0.35144596   0.288473115
## bsmtSF            -0.091321424  0.0498442565  0.26383657   0.209361206
## carStorageSF      -0.095414537  0.0238297104  0.18544385   0.274380864
## nbrBedRoom        -0.181055784 -0.1396625112  0.12083508   0.191350528
## nbrRoomsNobath    -0.150579415 -0.1726893634  0.11784974   0.262999762
## mainfloorSF       -0.007046662 -0.0569947041  0.16445388   0.185243576
## nbrThreeQtrBaths  -0.046936701 -0.1735519790 -0.09317308   0.061178421
## nbrFullBaths      -0.079676673  0.0024802686  0.25821608   0.284943529
## nbrHalfBaths      -0.095447282 -0.0663684685  0.13478942   0.194412925
## TotalFinishedSF   -0.047506574 -0.0578891594  0.26785086   0.333520933
## MUSA_ID           -0.003566136  0.0004416048 -0.01033637   0.007732997
## fire_stations_nn3  0.275503237  0.9295680400  0.12310643  -0.217040008
## commercial_nn3     0.645504546  0.3648543085  0.15896649   0.183094963
## traffic_nn3       -0.147325570  0.1862587749 -0.17112754  -0.131714638
## schools_nn3        1.000000000  0.4403893888 -0.08838762   0.059069158
## crime_nn3          0.440389389  1.0000000000  0.18194066  -0.107062259
## total_pop         -0.088387624  0.1819406639  1.00000000   0.173236161
## med_hh_income      0.059069158 -0.1070622595  0.17323616   1.000000000
## total_belpov100   -0.044140657 -0.0514527439  0.42218524  -0.444950608
## pctvacancy         0.750902972  0.2389511771 -0.23488207  -0.118141715
## pctwhite           0.241968010  0.1023546404  0.18898876   0.070667395
## pctunemployed     -0.043116074 -0.0252017930 -0.11344722  -0.314300628
## pctedu             0.106129018 -0.6002637073 -0.12084785   0.578432548
##                   total_belpov100   pctvacancy     pctwhite pctunemployed
## price                 0.011663661  0.004271884  0.038154772  -0.036498085
## year                  0.010558039  0.011277044  0.020351973  -0.007109133
## bld_num               0.010105008  0.019106392 -0.005592708   0.004088882
## section_num           0.009560429  0.022220980 -0.002458372   0.011149888
## builtYear            -0.106619526 -0.210387411 -0.102582622  -0.155523374
## CompCode              0.003277886  0.018452842  0.037261540   0.040649051
## EffectiveYear         0.001947383 -0.193783309 -0.093923305  -0.098695198
## bsmtSF                0.029506614 -0.167605138  0.103710833   0.001849994
## carStorageSF         -0.089757042 -0.183288321  0.044536401  -0.071282544
## nbrBedRoom            0.011181961 -0.214413411  0.001093382  -0.017174116
## nbrRoomsNobath       -0.015044431 -0.198212964  0.018387623  -0.054153463
## mainfloorSF           0.014139594 -0.072254920  0.166292192   0.010134360
## nbrThreeQtrBaths     -0.052751368 -0.032205380  0.006138100  -0.038148793
## nbrFullBaths          0.006938199 -0.154286541 -0.008543893  -0.043951134
## nbrHalfBaths         -0.010574390 -0.119180500 -0.073800436  -0.059980732
## TotalFinishedSF       0.010709888 -0.133593414  0.073120340  -0.068191112
## MUSA_ID              -0.013158129 -0.005577389 -0.022983108   0.003319982
## fire_stations_nn3    -0.044911652  0.145411850  0.093676452   0.109708795
## commercial_nn3        0.021933703  0.657244214  0.266531742  -0.020812971
## traffic_nn3          -0.022406358 -0.057846869 -0.515922551   0.003859160
## schools_nn3          -0.044140657  0.750902972  0.241968010  -0.043116074
## crime_nn3            -0.051452744  0.238951177  0.102354640  -0.025201793
## total_pop             0.422185241 -0.234882069  0.188988763  -0.113447222
## med_hh_income        -0.444950608 -0.118141715  0.070667395  -0.314300628
## total_belpov100       1.000000000  0.008382288  0.014012686   0.190553212
## pctvacancy            0.008382288  1.000000000  0.160124543   0.073779007
## pctwhite              0.014012686  0.160124543  1.000000000  -0.083657590
## pctunemployed         0.190553212  0.073779007 -0.083657590   1.000000000
## pctedu               -0.263963828  0.088536514  0.049643744  -0.226605012
##                         pctedu
## price              0.401077539
## year              -0.001609232
## bld_num           -0.009769202
## section_num       -0.010338321
## builtYear         -0.088891278
## CompCode           0.007283817
## EffectiveYear     -0.026584035
## bsmtSF             0.008115751
## carStorageSF       0.016774251
## nbrBedRoom         0.100128144
## nbrRoomsNobath     0.174743351
## mainfloorSF        0.091457423
## nbrThreeQtrBaths   0.129625422
## nbrFullBaths       0.082929346
## nbrHalfBaths       0.062029604
## TotalFinishedSF    0.138958869
## MUSA_ID            0.002502934
## fire_stations_nn3 -0.737972044
## commercial_nn3     0.105034730
## traffic_nn3       -0.174191994
## schools_nn3        0.106129018
## crime_nn3         -0.600263707
## total_pop         -0.120847845
## med_hh_income      0.578432548
## total_belpov100   -0.263963828
## pctvacancy         0.088536514
## pctwhite           0.049643744
## pctunemployed     -0.226605012
## pctedu             1.000000000

For those strongly correlated with each other (r > 0.8), we would choose only one of the two to increase model’s accuracy. According to the table, r square between fire station and crime data is 0.93, and r square between bld_num and section_num is 0.95,so we won’t include fire station and section_num in regression.

# a loop to look which two variables are highly correlated
for (i in colnames(cormax.df)) {
  for (j in colnames(cormax.df)){
    if (cormax.df[j,i] >= 0.8){
      if(i != j){
      print(i)
      print(j)
      print(' ')}
    }
  }
}
## [1] "bld_num"
## [1] "section_num"
## [1] " "
## [1] "section_num"
## [1] "bld_num"
## [1] " "
## [1] "fire_stations_nn3"
## [1] "crime_nn3"
## [1] " "
## [1] "crime_nn3"
## [1] "fire_stations_nn3"
## [1] " "
# for those r > 0.8, only choose one of the two.
boulder_data <- 
  boulder_data %>%
  dplyr::select(-fire_stations_nn3,-section_num)

training_data <- 
  filter(boulder_data,toPredict == 0) %>% 
  dplyr::select(-toPredict)
  
testing_data <- 
  filter(boulder_data,toPredict == 1) %>% 
  dplyr::select(-toPredict)

There are four correlation scatterplot plots of home price as functions of 4 independent variables, namely exposure to transit stations, age of houses, area of main floor and natural log of percent of unemployed in the census tract.

In all scatter plots, there is no apparent linear correlation between them.

st_drop_geometry(training_data) %>% 
  mutate(Age = 2022 - builtYear) %>%
  dplyr::select(price, traffic_nn3, Age, mainfloorSF,lnpctunemployed) %>%
  filter(price <= 1000000, Age < 500) %>%
  gather(Variable, Value, -price) %>% 
  ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     theme(plot.title = element_text(hjust = 0.5)) +
     plotTheme()
## `geom_smooth()` using formula 'y ~ x'

The correlation matrix and correlation scatterplots can only include numeric variables, so we choose number of bedroom and level of quality which are categorical and develop bar plots between them and house price.

It seems that houses with more bedrooms have higher prices. Similarly, houses with higher level of quality have higher prices.

cate <- dplyr::select(boulder_data, price, qualityCodeDscr, designCodeDscr, ConstCodeDscr,
                      HeatingDscr, carStorageTypeDscr, AcDscr) 
cate <- st_drop_geometry(cate)

cate %>% 
  mutate(qualityCodeDscr = as.factor(qualityCodeDscr)) %>%
  mutate(designCodeDscr = as.factor(designCodeDscr)) %>%
  mutate(ConstCodeDscr = as.factor(ConstCodeDscr)) %>%
  mutate(HeatingDscr = as.factor(HeatingDscr)) %>%
  mutate(carStorageTypeDscr = as.factor(carStorageTypeDscr)) %>%
  mutate(AcDscr = as.factor(AcDscr)) %>%
  filter(price <= 1000000) %>%
  gather(Variable, Value, -price) %>%     
   ggplot(aes(Value, price)) +
     geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of categorical variables", y = "Mean_Price") +
     plotTheme() + 
     theme(axis.text.x = element_text(angle = 45, hjust = 0.5))+
     theme(plot.title = element_text(hjust = 0.5))
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

The map below shows the distribution of houses and their sale prices in Boulder County. The warmer the spot’s color, the higher the sale price. Spots with high values mainly concentrated in the center and the southeastern part of Boulder County where Boulder City is in, while houses with low prices are mainly distributed in northeastern Boulder as well as east Boulder which are mountain areas.

ggplot() +
  geom_sf(data = acsTractsBD19, fill = "grey60") +
  geom_sf(data = training_data, aes(colour = q5(price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(training_data,"price"),    
                   name="Quintile\nBreaks") +
  labs(title="Price of houses, Boulder") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

Next three graphs are maps of three independent variables, respectively area of main floor, exposure to crimes and percent of vacant houses in the census tract that each house is located in. In all three maps, orange spots represent higher value while the green spots represent lower value.

ggplot() +
  geom_sf(data = acsTractsBD19, fill = "grey60") +
  geom_sf(data = training_data, aes(colour = q5(mainfloorSF)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(training_data,"mainfloorSF"),    
                   name="Quintile\nBreaks") +
  labs(title="Area of mainfloor of houses, Boulder") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

ggplot() +
  geom_sf(data = acsTractsBD19, fill = "grey60") +
  geom_sf(data = training_data, aes(colour = q5(crime_nn3)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(training_data,"crime_nn3"),    
                   name="Quintile\nBreaks") +
  labs(title="Exposure to crimes of houses, Boulder") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

vacant_data <- training_data %>% 
  mutate(pctvacancy1 = pctvacancy*100) %>%
  dplyr::select(pctvacancy1)
ggplot() +
  geom_sf(data = acsTractsBD19, fill = "grey60") +
  geom_sf(data = vacant_data, aes(colour = q5(pctvacancy1)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(vacant_data,"pctvacancy1"),    
                   name="Quintile\nBreaks") +
  labs(title="pct of vacant houses in the census tract that each home is located in") +
  theme(plot.title = element_text(hjust = 0.5))+
  mapTheme()

Bases on maps, it seems that the area of main floor distribute randomly. Houses with lower exposure to crimes are mainly located in center districts of Boulder County. The East Boulder has higher percent of vacant houses while the West Boulder has lower percent.

Method

Basically speaking, there are five stages in machine learning process, namely data wrangling, exploratory analysis, feature engineering, feature selection as well as model estimation and validation.

In data wrangling step, we would firstly compile all sorts of data we have collected into one dataset. Secondly, we would do some data cleaning, including deleting the data that is apparently impossible. In exploratory analysis step, a few visualization tools will be used to help understand correlations between variables of interested and underlying spatial process, which would be useful to create an efficient model. In feature engineering part, we would measure exposure from a house sale to public amenities and do logarithm transformation to variables that are not normal distributed. In the feature selection stage, we will select and include some variables in a model based on testing the collinearity of independent variables. We have done those steps already in the Data part. Then we come to model estimation and validation stage.

We are going to predict the price of house as a function of all other variables remained now using OLS regression. Then ‘summary’ function is used to view the results of the regression. In the result table, P-value and (adjusted) R-square are especially paid attention to. P-value is used to measure whether the coefficient is variable. When p-value is smaller than 0.05, we can reject the ‘null hypothesis’ that there is no relationship between this independent variable and house price. R2 is an important and common indictor to describe how well all features in the model explain the model. It represents how the proportion of variation in the price variable is explained by the independent variables. Consequentially, the higher the R2, the better the model.

We are going to evaluate a model based on its accuracy and generalizability.

When testing the model’s accuracy, we are going to randomly splitting our current data into a 75% training dataset and a 25% testing dataset with ‘createDataPartition’ function. Then we would create a model based on training dataset and predict home sale price of test data with this model. Secondly, we will calculate the error and absolute error which are differences in predicted and observed prices. Thirdly, absolute percent error will be calculated on a percentage basis. After those steps, we then calculate the average of all absolute percent errors (MAPE) and all absolute errors (MAE). Those statistics are powerful tools to measure how well the model predicts for data it has never seen before. The smaller the MAPE, the better the model. Plotting the observed price as a function of predicted price with another line representing a perfect fit is also a useful visualization method to reflect how well our model predicts. It shows gaps between prediction and reality directly.

There are two definitions of generalizability. One is whether the model can accurately predict on new data, another is whether the model can accurately predict across form different groups. K-fold cross validation is a good way to gauge model’s performance on new data. During this process, all data will be partitioned into 100 equal sized folds. For each given fold, we will train on a subset of observations, predict on a test set, then measure its goodness of fit (MAE). Making a histogram of MAE of each fold is a useful visualization to gauge whether the model is generalizable to new data. If the distribution of errors would cluster tightly together, then the model generalizes well. To test the ability of the model to predict across different group context, we would firstly map MAPE by census tract. If values of MAPE are comparable in each census tract, then the model is generalizable. More specifically, we would select median household income as a context to look whether the model is as efficient in high-income census tracts as in low-income census tracts. Firstly, we would load household income data from census data to define an income text. Tracts with incomes greater than the countywide mean receive a High-Income designation. Based on calculating the difference between their corresponding MAPE, we could assess model’s generalizability. The smaller the gap, the better the model.

We are also going to check whether the model is severely affected by spatial autocorrelation. Firstly, we are going to calculate the spatial lag in errors with ‘spdep’ package. During this process, we calculate average difference between predicted price and observed price of each home sale’s five nearest neighbors and look whether it is correlated to its own price. If there is, then spatial autocorrelation exists. Moran’s I is a useful method to test whether model errors cluster or not. A positive Moran’s I near 1 means that values are spatially clustered, while one near 0 means values are randomly distributed.

In the end, we also create a stepwise model to optimize our regression model based on OLS regression.

Results

We split the dataset into a training dataset and a testing dataset, and the ratio is 75/25. The training dataset is used to train the OLS regression model, and the testing dataset is used for accuracy testing. The testing dataset will not be included in the training process, so that it can better interpret the model accuracy.

For those categorical variables, the parameter y is applied to ensure each category of a variable can be included in the training dataset to achieve better accuracy.

# ensure reproducible partition
set.seed(123) 


# train on 75% data and test on the other 25% data
inTrain <- createDataPartition(
              y = paste(training_data$designCodeDscr, 
                        training_data$qualityCodeDscr, 
                        training_data$NAME, 
                        training_data$HeatingDscr, 
                        training_data$ExtWallDscrPrim
                        ),  
              p = .75, list = FALSE)  
boulder.train <- training_data[inTrain,]
boulder.test <- training_data[-inTrain,]  

Then we train a OLS regression model with the training dataset. Results are showed as below.

all_reg <- lm(price ~.,data = st_drop_geometry(boulder.train) %>%
                  dplyr::select(price, NAME,year_quarter,bld_num,
                                designCodeDscr, qualityCodeDscr, ConstCodeDscr,
                                builtYear, CompCode, EffectiveYear, bsmtSF,
                                bsmtTypeDscr,carStorageTypeDscr,mainfloorSF, 
                                nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
                                TotalFinishedSF, HeatingDscr, ExtWallDscrPrim,
                                ExtWallDscrSec, Roof_CoverDscr, lnschools_nn3, crime_nn3))

stargazer(all_reg, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
## 
## Regression Results
## ============================================================================================
##                                                                      Dependent variable:    
##                                                                  ---------------------------
##                                                                             price           
## --------------------------------------------------------------------------------------------
## NAMECensus Tract 121.02, Boulder County, Colorado                      -450,394.400***      
##                                                                         (30,997.200)        
## NAMECensus Tract 121.03, Boulder County, Colorado                      -451,292.400***      
##                                                                         (37,607.980)        
## NAMECensus Tract 121.04, Boulder County, Colorado                      -471,544.300***      
##                                                                         (38,158.620)        
## NAMECensus Tract 121.05, Boulder County, Colorado                      -567,423.800***      
##                                                                         (36,423.210)        
## NAMECensus Tract 122.01, Boulder County, Colorado                      -113,742.400***      
##                                                                         (40,503.160)        
## NAMECensus Tract 122.02, Boulder County, Colorado                      -607,236.400***      
##                                                                         (51,969.590)        
## NAMECensus Tract 122.03, Boulder County, Colorado                      -906,274.700***      
##                                                                         (40,702.060)        
## NAMECensus Tract 122.04, Boulder County, Colorado                        65,817.290         
##                                                                         (72,797.390)        
## NAMECensus Tract 123, Boulder County, Colorado                          -227,066.100        
##                                                                         (204,802.800)       
## NAMECensus Tract 124.01, Boulder County, Colorado                      -551,881.100***      
##                                                                         (44,596.630)        
## NAMECensus Tract 125.01, Boulder County, Colorado                      -840,532.300***      
##                                                                         (48,395.790)        
## NAMECensus Tract 125.05, Boulder County, Colorado                      -407,428.200***      
##                                                                         (34,790.600)        
## NAMECensus Tract 125.07, Boulder County, Colorado                      -820,130.400***      
##                                                                         (39,525.980)        
## NAMECensus Tract 125.08, Boulder County, Colorado                      -791,899.000***      
##                                                                         (47,402.310)        
## NAMECensus Tract 125.09, Boulder County, Colorado                      -650,136.400***      
##                                                                         (38,628.020)        
## NAMECensus Tract 125.10, Boulder County, Colorado                      -406,741.500***      
##                                                                         (36,969.210)        
## NAMECensus Tract 125.11, Boulder County, Colorado                      -839,994.400***      
##                                                                         (48,369.020)        
## NAMECensus Tract 126.03, Boulder County, Colorado                      -774,492.500***      
##                                                                         (37,928.760)        
## NAMECensus Tract 126.05, Boulder County, Colorado                      -824,645.900***      
##                                                                         (119,777.200)       
## NAMECensus Tract 126.07, Boulder County, Colorado                      -801,590.700***      
##                                                                         (76,202.950)        
## NAMECensus Tract 126.08, Boulder County, Colorado                      -844,373.300***      
##                                                                         (44,263.990)        
## NAMECensus Tract 127.01, Boulder County, Colorado                      -788,175.300***      
##                                                                         (34,529.710)        
## NAMECensus Tract 127.05, Boulder County, Colorado                      -975,402.800***      
##                                                                         (50,869.820)        
## NAMECensus Tract 127.07, Boulder County, Colorado                      -528,159.800***      
##                                                                         (53,511.160)        
## NAMECensus Tract 127.08, Boulder County, Colorado                      -994,951.700***      
##                                                                         (31,872.470)        
## NAMECensus Tract 127.09, Boulder County, Colorado                      -946,988.800***      
##                                                                         (47,019.450)        
## NAMECensus Tract 127.10, Boulder County, Colorado                      -714,539.400***      
##                                                                         (36,946.700)        
## NAMECensus Tract 128, Boulder County, Colorado                        -1,052,407.000***     
##                                                                         (33,120.720)        
## NAMECensus Tract 129.03, Boulder County, Colorado                     -1,024,345.000***     
##                                                                         (42,411.730)        
## NAMECensus Tract 129.04, Boulder County, Colorado                      -919,533.400***      
##                                                                         (35,662.220)        
## NAMECensus Tract 129.05, Boulder County, Colorado                      -857,147.700***      
##                                                                         (46,897.170)        
## NAMECensus Tract 129.07, Boulder County, Colorado                      -879,470.000***      
##                                                                         (38,671.260)        
## NAMECensus Tract 130.03, Boulder County, Colorado                      -855,585.500***      
##                                                                         (32,521.840)        
## NAMECensus Tract 130.04, Boulder County, Colorado                      -834,790.300***      
##                                                                         (40,027.230)        
## NAMECensus Tract 130.05, Boulder County, Colorado                      -772,037.400***      
##                                                                         (39,067.840)        
## NAMECensus Tract 130.06, Boulder County, Colorado                      -786,723.400***      
##                                                                         (34,804.810)        
## NAMECensus Tract 132.01, Boulder County, Colorado                      -867,551.100***      
##                                                                         (53,760.580)        
## NAMECensus Tract 132.02, Boulder County, Colorado                      -462,936.900***      
##                                                                         (58,519.170)        
## NAMECensus Tract 132.05, Boulder County, Colorado                      -974,509.100***      
##                                                                         (33,641.140)        
## NAMECensus Tract 132.07, Boulder County, Colorado                     -1,038,645.000***     
##                                                                         (42,755.810)        
## NAMECensus Tract 132.08, Boulder County, Colorado                     -1,069,317.000***     
##                                                                         (38,745.700)        
## NAMECensus Tract 132.10, Boulder County, Colorado                     -1,018,447.000***     
##                                                                         (38,820.450)        
## NAMECensus Tract 132.11, Boulder County, Colorado                     -1,065,777.000***     
##                                                                         (32,402.400)        
## NAMECensus Tract 132.12, Boulder County, Colorado                      -997,339.600***      
##                                                                         (33,930.180)        
## NAMECensus Tract 132.13, Boulder County, Colorado                     -1,107,830.000***     
##                                                                         (31,387.100)        
## NAMECensus Tract 133.02, Boulder County, Colorado                      -937,273.600***      
##                                                                         (36,331.240)        
## NAMECensus Tract 133.05, Boulder County, Colorado                     -1,009,640.000***     
##                                                                         (40,525.210)        
## NAMECensus Tract 133.06, Boulder County, Colorado                      -995,495.300***      
##                                                                         (45,087.490)        
## NAMECensus Tract 133.07, Boulder County, Colorado                     -1,016,554.000***     
##                                                                         (43,648.060)        
## NAMECensus Tract 133.08, Boulder County, Colorado                      -975,068.600***      
##                                                                         (42,222.170)        
## NAMECensus Tract 134.01, Boulder County, Colorado                      -984,756.300***      
##                                                                         (42,913.780)        
## NAMECensus Tract 134.02, Boulder County, Colorado                     -1,026,995.000***     
##                                                                         (39,152.860)        
## NAMECensus Tract 135.03, Boulder County, Colorado                      -961,547.600***      
##                                                                         (42,545.300)        
## NAMECensus Tract 135.05, Boulder County, Colorado                      -990,366.100***      
##                                                                         (53,474.300)        
## NAMECensus Tract 135.06, Boulder County, Colorado                     -1,041,835.000***     
##                                                                         (41,444.730)        
## NAMECensus Tract 135.07, Boulder County, Colorado                     -1,024,478.000***     
##                                                                         (44,648.370)        
## NAMECensus Tract 135.08, Boulder County, Colorado                     -1,017,531.000***     
##                                                                         (39,540.940)        
## NAMECensus Tract 136.01, Boulder County, Colorado                      -852,311.000***      
##                                                                         (44,050.550)        
## NAMECensus Tract 136.02, Boulder County, Colorado                      -808,817.900***      
##                                                                         (54,812.410)        
## NAMECensus Tract 137.01, Boulder County, Colorado                      -853,285.100***      
##                                                                         (31,454.220)        
## NAMECensus Tract 137.02, Boulder County, Colorado                      -879,857.600***      
##                                                                         (40,062.690)        
## NAMECensus Tract 606, Boulder County, Colorado                         -960,852.800***      
##                                                                         (35,119.110)        
## NAMECensus Tract 607, Boulder County, Colorado                         -878,529.300***      
##                                                                         (41,766.500)        
## NAMECensus Tract 608, Boulder County, Colorado                         -855,957.300***      
##                                                                         (40,170.140)        
## NAMECensus Tract 609, Boulder County, Colorado                         -931,612.700***      
##                                                                         (38,183.340)        
## NAMECensus Tract 613, Boulder County, Colorado                         -985,189.500***      
##                                                                         (42,042.250)        
## NAMECensus Tract 614, Boulder County, Colorado                         -968,232.800***      
##                                                                         (39,400.310)        
## year_quarter2019-2                                                       12,745.340         
##                                                                         (14,076.000)        
## year_quarter2019-3                                                       24,777.940*        
##                                                                         (14,378.760)        
## year_quarter2019-4                                                      31,890.320**        
##                                                                         (14,853.830)        
## year_quarter2020-1                                                      34,568.680**        
##                                                                         (15,339.960)        
## year_quarter2020-2                                                      42,115.910***       
##                                                                         (14,827.530)        
## year_quarter2020-3                                                      57,399.350***       
##                                                                         (13,818.660)        
## year_quarter2020-4                                                      94,564.210***       
##                                                                         (14,374.520)        
## year_quarter2021-1                                                     168,700.900***       
##                                                                         (15,458.320)        
## year_quarter2021-2                                                     248,502.300***       
##                                                                         (14,794.620)        
## bld_num                                                                534,235.600***       
##                                                                         (23,921.370)        
## designCodeDscr2-3 Story                                                -47,779.670***       
##                                                                         (12,839.580)        
## designCodeDscrBi-level                                                   16,593.810         
##                                                                         (29,641.500)        
## designCodeDscrMULTI STORY- TOWNHOUSE                                   -176,977.700***      
##                                                                         (16,156.310)        
## designCodeDscrSplit-level                                                -23,097.800        
##                                                                         (23,336.470)        
## qualityCodeDscrAVERAGE +                                               -50,854.000***       
##                                                                         (13,402.510)        
## qualityCodeDscrAVERAGE ++                                                -21,064.660        
##                                                                         (13,928.750)        
## qualityCodeDscrEXCELLENT                                              1,015,678.000***      
##                                                                         (31,668.570)        
## qualityCodeDscrEXCELLENT +                                            1,201,716.000***      
##                                                                         (63,782.090)        
## qualityCodeDscrEXCELLENT++                                            1,879,708.000***      
##                                                                         (52,964.460)        
## qualityCodeDscrEXCEPTIONAL 1                                          1,043,698.000***      
##                                                                         (68,155.240)        
## qualityCodeDscrEXCEPTIONAL 2                                          1,400,096.000***      
##                                                                         (176,430.500)       
## qualityCodeDscrFAIR                                                      -53,062.510        
##                                                                         (35,055.940)        
## qualityCodeDscrGOOD                                                     53,689.110***       
##                                                                         (10,211.720)        
## qualityCodeDscrGOOD +                                                   77,697.070***       
##                                                                         (16,296.610)        
## qualityCodeDscrGOOD ++                                                 147,808.500***       
##                                                                         (15,373.950)        
## qualityCodeDscrLOW                                                       -47,121.860        
##                                                                         (73,483.040)        
## qualityCodeDscrVERY GOOD                                               242,957.700***       
##                                                                         (16,582.470)        
## qualityCodeDscrVERY GOOD +                                             450,203.300***       
##                                                                         (27,354.120)        
## qualityCodeDscrVERY GOOD ++                                            574,939.700***       
##                                                                         (23,810.950)        
## ConstCodeDscrBrick                                                       -65,151.000        
##                                                                         (53,484.680)        
## ConstCodeDscrConcrete                                                    -22,286.610        
##                                                                         (159,847.700)       
## ConstCodeDscrFrame                                                       -62,219.220        
##                                                                         (42,981.370)        
## ConstCodeDscrMasonry                                                     -38,829.580        
##                                                                         (44,673.640)        
## ConstCodeDscrPrecast                                                  -1,012,995.000***     
##                                                                         (216,072.700)       
## ConstCodeDscrVeneer                                                    -702,027.600**       
##                                                                         (296,601.200)       
## ConstCodeDscrWood                                                       -137,671.000        
##                                                                         (190,756.300)       
## builtYear                                                                -708.076***        
##                                                                           (229.792)         
## CompCode                                                               311,488.400***       
##                                                                         (34,623.510)        
## EffectiveYear                                                           3,104.635***        
##                                                                           (301.751)         
## bsmtSF                                                                   -31.856***         
##                                                                            (8.191)          
## bsmtTypeDscrGARDEN BASEMENT FINISHED AREA                               89,629.710***       
##                                                                         (20,591.370)        
## bsmtTypeDscrGARDEN BASEMENT UNFINISHED AREA                              51,775.520         
##                                                                         (36,646.980)        
## bsmtTypeDscrLOWER LVL GARDEN FINISHED (BI-SPLIT LVL)                    80,478.950***       
##                                                                         (26,159.610)        
## bsmtTypeDscrLOWER LVL GARDEN UNFINISHED (BI-SPLIT LVL)                  -106,935.100        
##                                                                         (93,994.280)        
## bsmtTypeDscrLOWER LVL WALKOUT FINISHED (BI-SPLIT LVL)                  174,805.600***       
##                                                                         (52,048.330)        
## bsmtTypeDscrLOWER LVL WALKOUT UNFINISHED (BI-SPLIT LVL)                  61,242.490         
##                                                                         (204,438.300)       
## bsmtTypeDscrSUBTERRANEAN BASEMENT FINISHED AREA                         57,314.250***       
##                                                                         (12,082.640)        
## bsmtTypeDscrSUBTERRANEAN BASEMENT UNFINISHED AREA                        21,603.880*        
##                                                                         (12,165.840)        
## bsmtTypeDscrWALK-OUT BASEMENT FINISHED AREA                            128,212.700***       
##                                                                         (17,000.810)        
## bsmtTypeDscrWALK-OUT BASEMENT UNFINISHED AREA                           60,816.490**        
##                                                                         (24,803.570)        
## carStorageTypeDscrATTACHED GARAGE AREA                                  36,889.720***       
##                                                                         (13,461.180)        
## carStorageTypeDscrBASEMENT GARAGE AREA                                 142,794.600***       
##                                                                         (27,820.840)        
## carStorageTypeDscrCARPORT AREA                                           12,367.600         
##                                                                         (24,306.950)        
## carStorageTypeDscrDETACHED GARAGE                                       75,386.290***       
##                                                                         (14,439.890)        
## carStorageTypeDscrGARAGE SET UP AS A WORKSHOP (ELEC., ETC.) AREA         -68,240.050        
##                                                                         (110,127.400)       
## carStorageTypeDscrGARAGE W/ FINISHED WALLS AREA                          20,009.120         
##                                                                         (102,887.500)       
## mainfloorSF                                                               38.972***         
##                                                                           (12.103)          
## nbrThreeQtrBaths                                                        45,954.330***       
##                                                                          (5,829.389)        
## nbrFullBaths                                                            20,259.050***       
##                                                                          (5,999.818)        
## nbrHalfBaths                                                            35,646.690***       
##                                                                          (6,964.377)        
## TotalFinishedSF                                                          146.102***         
##                                                                           (10.202)          
## HeatingDscrElectric                                                    -127,394.200***      
##                                                                         (39,503.940)        
## HeatingDscrElectric Wall Heat (1500W)                                    335,639.900        
##                                                                         (227,433.400)       
## HeatingDscrForced Air                                                   -76,556.240**       
##                                                                         (36,358.550)        
## HeatingDscrGravity                                                       -17,915.400        
##                                                                         (67,783.420)        
## HeatingDscrHeat Pump                                                    -202,837.000*       
##                                                                         (109,378.300)       
## HeatingDscrHot Water                                                     -16,465.450        
##                                                                         (37,643.850)        
## HeatingDscrNo HVAC                                                      -180,050.600        
##                                                                         (207,895.700)       
## HeatingDscrPackage Unit                                                  233,782.300        
##                                                                         (332,578.100)       
## HeatingDscrRadiant Floor                                               135,722.900***       
##                                                                         (47,724.450)        
## HeatingDscrVentilation Only                                             -393,033.000        
##                                                                         (309,120.200)       
## HeatingDscrWall Furnace                                                  43,472.560         
##                                                                         (48,409.320)        
## ExtWallDscrPrimBlock Stucco                                             197,458.000**       
##                                                                         (79,114.350)        
## ExtWallDscrPrimBrick on Block                                          291,843.100***       
##                                                                         (74,717.180)        
## ExtWallDscrPrimBrick Veneer                                             123,407.800*        
##                                                                         (69,670.540)        
## ExtWallDscrPrimCedar                                                     75,703.450         
##                                                                         (127,893.400)       
## ExtWallDscrPrimCement Board                                             133,029.900*        
##                                                                         (68,689.930)        
## ExtWallDscrPrimFaux Stone                                              259,013.500***       
##                                                                         (80,904.120)        
## ExtWallDscrPrimFrame Asbestos                                            210,880.100        
##                                                                         (296,889.700)       
## ExtWallDscrPrimFrame Stucco                                              110,940.600        
##                                                                         (69,543.820)        
## ExtWallDscrPrimFrame Wood/Shake                                        188,433.000***       
##                                                                         (68,431.630)        
## ExtWallDscrPrimLog                                                     208,744.600***       
##                                                                         (76,302.070)        
## ExtWallDscrPrimMetal                                                    211,603.100*        
##                                                                         (123,119.600)       
## ExtWallDscrPrimMoss Rock/Flagstone                                      169,501.800**       
##                                                                         (73,688.950)        
## ExtWallDscrPrimPainted Block                                             140,232.100        
##                                                                         (91,173.060)        
## ExtWallDscrPrimStrawbale                                                 232,766.200        
##                                                                         (227,928.600)       
## ExtWallDscrPrimVinyl                                                    189,328.100**       
##                                                                         (77,527.780)        
## ExtWallDscrSecBlock Stucco                                             776,115.300***       
##                                                                         (110,499.800)       
## ExtWallDscrSecBrick on Block                                             -24,248.880        
##                                                                         (93,390.680)        
## ExtWallDscrSecBrick Veneer                                               -16,336.050        
##                                                                         (13,236.000)        
## ExtWallDscrSecCedar                                                     415,054.600**       
##                                                                         (167,870.600)       
## ExtWallDscrSecCement Board                                                4,992.113         
##                                                                         (71,259.240)        
## ExtWallDscrSecFaux Stone                                                 23,282.950         
##                                                                         (15,410.530)        
## ExtWallDscrSecFrame Asbestos                                             138,613.900        
##                                                                         (118,416.300)       
## ExtWallDscrSecFrame Stucco                                              -48,032.770*        
##                                                                         (28,795.460)        
## ExtWallDscrSecFrame Wood/Shake                                          36,012.830***       
##                                                                         (13,204.140)        
## ExtWallDscrSecLog                                                        115,407.000        
##                                                                         (145,045.100)       
## ExtWallDscrSecMetal                                                      32,746.370         
##                                                                         (85,785.170)        
## ExtWallDscrSecMoss Rock/Flagstone                                        44,611.370         
##                                                                         (32,128.740)        
## ExtWallDscrSecPainted Block                                              197,007.400        
##                                                                         (168,714.600)       
## ExtWallDscrSecVinyl                                                      -16,726.080        
##                                                                         (97,060.320)        
## Roof_CoverDscrAsphalt                                                    -7,238.716         
##                                                                          (7,740.046)        
## Roof_CoverDscrBuilt-Up                                                 481,071.400***       
##                                                                         (121,856.700)       
## Roof_CoverDscrClay Tile                                                  -71,444.940        
##                                                                         (47,479.230)        
## Roof_CoverDscrConcrete Tile                                            -88,007.710***       
##                                                                         (24,973.460)        
## Roof_CoverDscrMetal                                                    180,195.200***       
##                                                                         (26,062.230)        
## Roof_CoverDscrRoll                                                      -118,481.800        
##                                                                         (288,023.400)       
## Roof_CoverDscrRubber Membrane                                          140,318.600***       
##                                                                         (27,660.920)        
## Roof_CoverDscrShake                                                      -75,356.520        
##                                                                         (48,688.470)        
## Roof_CoverDscrTar and Gravel                                             170,389.000        
##                                                                         (146,434.900)       
## lnschools_nn3                                                            -4,639.976         
##                                                                          (5,389.377)        
## crime_nn3                                                                -13.958***         
##                                                                            (1.960)          
## Constant                                                              -4,493,725.000***     
##                                                                         (609,639.000)       
## --------------------------------------------------------------------------------------------
## Observations                                                                9,475           
## R2                                                                          0.754           
## Adjusted R2                                                                 0.749           
## Residual Std. Error                                                286,444.400 (df = 9295)  
## F Statistic                                                      159.170*** (df = 179; 9295)
## ============================================================================================
## Note:                                                            *p<0.1; **p<0.05; ***p<0.01

The adjusted R-square of the regression model reaches 0.749, which means about 75% of the result can be explained by the model.

23 predictors are included in the regression model training, including 10 categorical variables.

Combining the first model result and stepwise regression results below, most predictors is statistically significant (p-value < 0.001) and one numeric predictor (lnschool_nn3) is not significant (p-value > 0.1). So we decided to remove that variable and train a new model based on the other variables.

fit <- lm(price~ ., data = st_drop_geometry(boulder.train) %>%
                     dplyr::select(price, NAME,year_quarter,bld_num,
                                designCodeDscr, qualityCodeDscr, ConstCodeDscr,
                                builtYear, CompCode, EffectiveYear, bsmtSF,
                                bsmtTypeDscr,carStorageTypeDscr,mainfloorSF, 
                                nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
                                TotalFinishedSF, HeatingDscr, ExtWallDscrPrim,
                                ExtWallDscrSec, Roof_CoverDscr, lnschools_nn3, crime_nn3))
step <- stepAIC(fit, direction="both")
## Start:  AIC=238290.7
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr + 
##     ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF + 
##     bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths + 
##     nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr + 
##     ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + lnschools_nn3 + 
##     crime_nn3
## 
##                      Df  Sum of Sq        RSS    AIC
## - lnschools_nn3       1 6.0818e+10 7.6272e+14 238289
## <none>                             7.6266e+14 238291
## - builtYear           1 7.7906e+11 7.6344e+14 238298
## - mainfloorSF         1 8.5076e+11 7.6351e+14 238299
## - nbrFullBaths        1 9.3550e+11 7.6359e+14 238300
## - bsmtSF              1 1.2411e+12 7.6390e+14 238304
## - ConstCodeDscr       7 2.5211e+12 7.6518e+14 238308
## - nbrHalfBaths        1 2.1496e+12 7.6481e+14 238315
## - carStorageTypeDscr  6 3.8632e+12 7.6652e+14 238327
## - ExtWallDscrSec     14 6.2670e+12 7.6893e+14 238340
## - crime_nn3           1 4.1623e+12 7.6682e+14 238340
## - nbrThreeQtrBaths    1 5.0990e+12 7.6776e+14 238352
## - bsmtTypeDscr       10 6.6564e+12 7.6931e+14 238353
## - ExtWallDscrPrim    15 7.7421e+12 7.7040e+14 238356
## - HeatingDscr        11 7.4707e+12 7.7013e+14 238361
## - CompCode            1 6.6408e+12 7.6930e+14 238371
## - Roof_CoverDscr      9 9.6003e+12 7.7226e+14 238391
## - EffectiveYear       1 8.6857e+12 7.7134e+14 238396
## - designCodeDscr      4 1.1591e+13 7.7425e+14 238426
## - TotalFinishedSF     1 1.6828e+13 7.7949e+14 238495
## - bld_num             1 4.0924e+13 8.0358e+14 238784
## - year_quarter        9 4.6558e+13 8.0922e+14 238834
## - qualityCodeDscr    15 1.9295e+14 9.5561e+14 240398
## - NAME               67 2.3608e+14 9.9874e+14 240712
## 
## Step:  AIC=238289.5
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr + 
##     ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF + 
##     bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths + 
##     nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr + 
##     ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + crime_nn3
## 
##                      Df  Sum of Sq        RSS    AIC
## <none>                             7.6272e+14 238289
## + lnschools_nn3       1 6.0818e+10 7.6266e+14 238291
## - builtYear           1 7.9061e+11 7.6351e+14 238297
## - mainfloorSF         1 8.2664e+11 7.6355e+14 238298
## - nbrFullBaths        1 9.2872e+11 7.6365e+14 238299
## - bsmtSF              1 1.2476e+12 7.6397e+14 238303
## - ConstCodeDscr       7 2.5178e+12 7.6524e+14 238307
## - nbrHalfBaths        1 2.1577e+12 7.6488e+14 238314
## - carStorageTypeDscr  6 3.8557e+12 7.6657e+14 238325
## - ExtWallDscrSec     14 6.2526e+12 7.6897e+14 238339
## - crime_nn3           1 4.6916e+12 7.6741e+14 238346
## - nbrThreeQtrBaths    1 5.1152e+12 7.6783e+14 238351
## - bsmtTypeDscr       10 6.6723e+12 7.6939e+14 238352
## - ExtWallDscrPrim    15 7.7401e+12 7.7046e+14 238355
## - HeatingDscr        11 7.4503e+12 7.7017e+14 238360
## - CompCode            1 6.6460e+12 7.6937e+14 238370
## - Roof_CoverDscr      9 9.5749e+12 7.7229e+14 238390
## - EffectiveYear       1 8.7360e+12 7.7146e+14 238395
## - designCodeDscr      4 1.1530e+13 7.7425e+14 238424
## - TotalFinishedSF     1 1.6897e+13 7.7962e+14 238495
## - bld_num             1 4.0969e+13 8.0369e+14 238783
## - year_quarter        9 4.6671e+13 8.0939e+14 238834
## - qualityCodeDscr    15 1.9313e+14 9.5585e+14 240398
## - NAME               67 2.3604e+14 9.9876e+14 240710
# display results
step$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr + 
##     ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF + 
##     bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths + 
##     nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr + 
##     ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + lnschools_nn3 + 
##     crime_nn3
## 
## Final Model:
## price ~ NAME + year_quarter + bld_num + designCodeDscr + qualityCodeDscr + 
##     ConstCodeDscr + builtYear + CompCode + EffectiveYear + bsmtSF + 
##     bsmtTypeDscr + carStorageTypeDscr + mainfloorSF + nbrThreeQtrBaths + 
##     nbrFullBaths + nbrHalfBaths + TotalFinishedSF + HeatingDscr + 
##     ExtWallDscrPrim + ExtWallDscrSec + Roof_CoverDscr + crime_nn3
## 
## 
##              Step Df    Deviance Resid. Df   Resid. Dev      AIC
## 1                                     9295 7.626583e+14 238290.7
## 2 - lnschools_nn3  1 60818410273      9296 7.627191e+14 238289.5

The new regression model we built is shown as follows.

all_reg_fin <- lm(price ~.,data = st_drop_geometry(boulder.train) %>%
                  dplyr::select(price, NAME,year_quarter, bld_num, 
                                designCodeDscr, qualityCodeDscr, 
                                builtYear, CompCode, EffectiveYear, bsmtSF,
                                bsmtTypeDscr,carStorageTypeDscr,mainfloorSF, 
                                nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
                                TotalFinishedSF,  ExtWallDscrPrim, HeatingDscr, 
                                ExtWallDscrSec, Roof_CoverDscr, crime_nn3))
stargazer(all_reg_fin, type = "text",title="Regression Results",align=TRUE,no.space=TRUE)
## 
## Regression Results
## ============================================================================================
##                                                                      Dependent variable:    
##                                                                  ---------------------------
##                                                                             price           
## --------------------------------------------------------------------------------------------
## NAMECensus Tract 121.02, Boulder County, Colorado                      -443,119.600***      
##                                                                         (30,538.960)        
## NAMECensus Tract 121.03, Boulder County, Colorado                      -450,442.600***      
##                                                                         (37,461.770)        
## NAMECensus Tract 121.04, Boulder County, Colorado                      -474,141.100***      
##                                                                         (38,143.220)        
## NAMECensus Tract 121.05, Boulder County, Colorado                      -564,486.200***      
##                                                                         (36,360.070)        
## NAMECensus Tract 122.01, Boulder County, Colorado                      -107,092.700***      
##                                                                         (40,279.420)        
## NAMECensus Tract 122.02, Boulder County, Colorado                      -604,179.900***      
##                                                                         (51,578.850)        
## NAMECensus Tract 122.03, Boulder County, Colorado                      -908,887.200***      
##                                                                         (40,672.780)        
## NAMECensus Tract 122.04, Boulder County, Colorado                        68,755.930         
##                                                                         (72,576.240)        
## NAMECensus Tract 123, Boulder County, Colorado                          -224,758.200        
##                                                                         (204,781.900)       
## NAMECensus Tract 124.01, Boulder County, Colorado                      -547,880.200***      
##                                                                         (44,598.890)        
## NAMECensus Tract 125.01, Boulder County, Colorado                      -832,136.600***      
##                                                                         (48,173.570)        
## NAMECensus Tract 125.05, Boulder County, Colorado                      -403,155.100***      
##                                                                         (34,708.800)        
## NAMECensus Tract 125.07, Boulder County, Colorado                      -811,439.400***      
##                                                                         (38,952.480)        
## NAMECensus Tract 125.08, Boulder County, Colorado                      -785,693.600***      
##                                                                         (47,156.880)        
## NAMECensus Tract 125.09, Boulder County, Colorado                      -641,933.500***      
##                                                                         (38,218.910)        
## NAMECensus Tract 125.10, Boulder County, Colorado                      -406,190.600***      
##                                                                         (36,716.310)        
## NAMECensus Tract 125.11, Boulder County, Colorado                      -831,142.900***      
##                                                                         (48,005.860)        
## NAMECensus Tract 126.03, Boulder County, Colorado                      -782,932.400***      
##                                                                         (37,688.170)        
## NAMECensus Tract 126.05, Boulder County, Colorado                      -823,293.800***      
##                                                                         (119,922.700)       
## NAMECensus Tract 126.07, Boulder County, Colorado                      -794,936.700***      
##                                                                         (75,989.360)        
## NAMECensus Tract 126.08, Boulder County, Colorado                      -835,831.600***      
##                                                                         (43,799.460)        
## NAMECensus Tract 127.01, Boulder County, Colorado                      -793,634.500***      
##                                                                         (34,457.280)        
## NAMECensus Tract 127.05, Boulder County, Colorado                      -977,923.000***      
##                                                                         (50,891.030)        
## NAMECensus Tract 127.07, Boulder County, Colorado                      -527,560.200***      
##                                                                         (53,357.910)        
## NAMECensus Tract 127.08, Boulder County, Colorado                     -1,000,356.000***     
##                                                                         (31,733.700)        
## NAMECensus Tract 127.09, Boulder County, Colorado                      -956,894.000***      
##                                                                         (46,843.480)        
## NAMECensus Tract 127.10, Boulder County, Colorado                      -714,038.100***      
##                                                                         (36,904.710)        
## NAMECensus Tract 128, Boulder County, Colorado                        -1,049,945.000***     
##                                                                         (32,834.850)        
## NAMECensus Tract 129.03, Boulder County, Colorado                     -1,026,141.000***     
##                                                                         (42,445.040)        
## NAMECensus Tract 129.04, Boulder County, Colorado                      -916,862.700***      
##                                                                         (35,028.150)        
## NAMECensus Tract 129.05, Boulder County, Colorado                      -854,026.800***      
##                                                                         (46,646.240)        
## NAMECensus Tract 129.07, Boulder County, Colorado                      -878,942.200***      
##                                                                         (38,673.800)        
## NAMECensus Tract 130.03, Boulder County, Colorado                      -855,484.100***      
##                                                                         (32,498.650)        
## NAMECensus Tract 130.04, Boulder County, Colorado                      -834,528.100***      
##                                                                         (39,450.380)        
## NAMECensus Tract 130.05, Boulder County, Colorado                      -768,116.300***      
##                                                                         (38,333.000)        
## NAMECensus Tract 130.06, Boulder County, Colorado                      -785,315.600***      
##                                                                         (34,524.040)        
## NAMECensus Tract 132.01, Boulder County, Colorado                      -864,235.600***      
##                                                                         (53,772.710)        
## NAMECensus Tract 132.02, Boulder County, Colorado                      -468,161.900***      
##                                                                         (58,466.230)        
## NAMECensus Tract 132.05, Boulder County, Colorado                      -977,960.000***      
##                                                                         (33,472.010)        
## NAMECensus Tract 132.07, Boulder County, Colorado                     -1,035,867.000***     
##                                                                         (42,728.120)        
## NAMECensus Tract 132.08, Boulder County, Colorado                     -1,064,972.000***     
##                                                                         (38,454.710)        
## NAMECensus Tract 132.10, Boulder County, Colorado                     -1,009,256.000***     
##                                                                         (37,958.490)        
## NAMECensus Tract 132.11, Boulder County, Colorado                     -1,067,702.000***     
##                                                                         (32,352.920)        
## NAMECensus Tract 132.12, Boulder County, Colorado                      -998,199.800***      
##                                                                         (33,912.990)        
## NAMECensus Tract 132.13, Boulder County, Colorado                     -1,105,218.000***     
##                                                                         (31,034.140)        
## NAMECensus Tract 133.02, Boulder County, Colorado                      -931,591.800***      
##                                                                         (35,561.400)        
## NAMECensus Tract 133.05, Boulder County, Colorado                     -1,010,090.000***     
##                                                                         (40,351.290)        
## NAMECensus Tract 133.06, Boulder County, Colorado                      -991,054.600***      
##                                                                         (44,735.560)        
## NAMECensus Tract 133.07, Boulder County, Colorado                     -1,006,140.000***     
##                                                                         (42,341.790)        
## NAMECensus Tract 133.08, Boulder County, Colorado                      -966,873.400***      
##                                                                         (41,257.400)        
## NAMECensus Tract 134.01, Boulder County, Colorado                      -983,564.500***      
##                                                                         (42,695.800)        
## NAMECensus Tract 134.02, Boulder County, Colorado                     -1,024,232.000***     
##                                                                         (38,887.100)        
## NAMECensus Tract 135.03, Boulder County, Colorado                      -952,902.900***      
##                                                                         (41,432.020)        
## NAMECensus Tract 135.05, Boulder County, Colorado                      -980,680.100***      
##                                                                         (52,829.210)        
## NAMECensus Tract 135.06, Boulder County, Colorado                     -1,035,724.000***     
##                                                                         (40,427.010)        
## NAMECensus Tract 135.07, Boulder County, Colorado                     -1,013,692.000***     
##                                                                         (43,429.810)        
## NAMECensus Tract 135.08, Boulder County, Colorado                     -1,011,534.000***     
##                                                                         (38,753.720)        
## NAMECensus Tract 136.01, Boulder County, Colorado                      -844,444.900***      
##                                                                         (43,457.960)        
## NAMECensus Tract 136.02, Boulder County, Colorado                      -811,031.700***      
##                                                                         (54,623.040)        
## NAMECensus Tract 137.01, Boulder County, Colorado                      -857,892.200***      
##                                                                         (31,000.510)        
## NAMECensus Tract 137.02, Boulder County, Colorado                      -887,061.900***      
##                                                                         (39,435.060)        
## NAMECensus Tract 606, Boulder County, Colorado                         -962,767.900***      
##                                                                         (35,090.790)        
## NAMECensus Tract 607, Boulder County, Colorado                         -879,919.200***      
##                                                                         (41,654.380)        
## NAMECensus Tract 608, Boulder County, Colorado                         -852,642.900***      
##                                                                         (39,792.880)        
## NAMECensus Tract 609, Boulder County, Colorado                         -929,643.800***      
##                                                                         (37,861.730)        
## NAMECensus Tract 613, Boulder County, Colorado                         -982,603.100***      
##                                                                         (41,392.490)        
## NAMECensus Tract 614, Boulder County, Colorado                         -965,569.100***      
##                                                                         (38,996.000)        
## year_quarter2019-2                                                       12,616.910         
##                                                                         (14,091.770)        
## year_quarter2019-3                                                       23,693.930*        
##                                                                         (14,387.170)        
## year_quarter2019-4                                                      31,147.090**        
##                                                                         (14,869.550)        
## year_quarter2020-1                                                      34,298.120**        
##                                                                         (15,356.700)        
## year_quarter2020-2                                                      42,321.820***       
##                                                                         (14,843.090)        
## year_quarter2020-3                                                      57,282.720***       
##                                                                         (13,835.280)        
## year_quarter2020-4                                                      94,036.190***       
##                                                                         (14,389.950)        
## year_quarter2021-1                                                     168,304.500***       
##                                                                         (15,464.510)        
## year_quarter2021-2                                                     247,198.600***       
##                                                                         (14,808.640)        
## bld_num                                                                535,797.900***       
##                                                                         (23,898.800)        
## designCodeDscr2-3 Story                                                -50,482.490***       
##                                                                         (12,819.180)        
## designCodeDscrBi-level                                                   12,498.260         
##                                                                         (29,598.120)        
## designCodeDscrMULTI STORY- TOWNHOUSE                                   -178,860.800***      
##                                                                         (16,127.810)        
## designCodeDscrSplit-level                                                -25,825.390        
##                                                                         (23,331.500)        
## qualityCodeDscrAVERAGE +                                               -51,043.520***       
##                                                                         (13,409.330)        
## qualityCodeDscrAVERAGE ++                                                -21,157.950        
##                                                                         (13,940.900)        
## qualityCodeDscrEXCELLENT                                              1,014,861.000***      
##                                                                         (31,576.420)        
## qualityCodeDscrEXCELLENT +                                            1,205,749.000***      
##                                                                         (63,750.040)        
## qualityCodeDscrEXCELLENT++                                            1,852,756.000***      
##                                                                         (52,550.330)        
## qualityCodeDscrEXCEPTIONAL 1                                          1,044,320.000***      
##                                                                         (68,128.260)        
## qualityCodeDscrEXCEPTIONAL 2                                          1,399,918.000***      
##                                                                         (176,559.100)       
## qualityCodeDscrFAIR                                                      -54,942.740        
##                                                                         (35,073.940)        
## qualityCodeDscrGOOD                                                     54,058.410***       
##                                                                         (10,198.260)        
## qualityCodeDscrGOOD +                                                   78,145.890***       
##                                                                         (16,287.510)        
## qualityCodeDscrGOOD ++                                                 148,599.100***       
##                                                                         (15,368.160)        
## qualityCodeDscrLOW                                                       -58,446.710        
##                                                                         (72,748.980)        
## qualityCodeDscrVERY GOOD                                               242,727.800***       
##                                                                         (16,543.990)        
## qualityCodeDscrVERY GOOD +                                             451,555.200***       
##                                                                         (27,344.760)        
## qualityCodeDscrVERY GOOD ++                                            571,376.600***       
##                                                                         (23,765.010)        
## builtYear                                                                -713.599***        
##                                                                           (228.302)         
## CompCode                                                               308,362.900***       
##                                                                         (34,592.690)        
## EffectiveYear                                                           3,125.189***        
##                                                                           (301.797)         
## bsmtSF                                                                   -30.826***         
##                                                                            (8.186)          
## bsmtTypeDscrGARDEN BASEMENT FINISHED AREA                               90,927.880***       
##                                                                         (20,603.910)        
## bsmtTypeDscrGARDEN BASEMENT UNFINISHED AREA                              51,633.710         
##                                                                         (36,688.580)        
## bsmtTypeDscrLOWER LVL GARDEN FINISHED (BI-SPLIT LVL)                    82,890.740***       
##                                                                         (26,155.640)        
## bsmtTypeDscrLOWER LVL GARDEN UNFINISHED (BI-SPLIT LVL)                  -107,186.500        
##                                                                         (94,109.090)        
## bsmtTypeDscrLOWER LVL WALKOUT FINISHED (BI-SPLIT LVL)                  173,312.200***       
##                                                                         (52,108.140)        
## bsmtTypeDscrLOWER LVL WALKOUT UNFINISHED (BI-SPLIT LVL)                  65,171.950         
##                                                                         (204,684.300)       
## bsmtTypeDscrSUBTERRANEAN BASEMENT FINISHED AREA                         58,274.030***       
##                                                                         (12,079.670)        
## bsmtTypeDscrSUBTERRANEAN BASEMENT UNFINISHED AREA                        22,612.160*        
##                                                                         (12,164.890)        
## bsmtTypeDscrWALK-OUT BASEMENT FINISHED AREA                            127,795.700***       
##                                                                         (17,007.160)        
## bsmtTypeDscrWALK-OUT BASEMENT UNFINISHED AREA                           62,771.790**        
##                                                                         (24,822.120)        
## carStorageTypeDscrATTACHED GARAGE AREA                                  36,752.290***       
##                                                                         (13,449.260)        
## carStorageTypeDscrBASEMENT GARAGE AREA                                 141,826.700***       
##                                                                         (27,833.650)        
## carStorageTypeDscrCARPORT AREA                                           10,727.650         
##                                                                         (24,318.130)        
## carStorageTypeDscrDETACHED GARAGE                                       76,190.090***       
##                                                                         (14,443.760)        
## carStorageTypeDscrGARAGE SET UP AS A WORKSHOP (ELEC., ETC.) AREA         -66,109.580        
##                                                                         (110,180.000)       
## carStorageTypeDscrGARAGE W/ FINISHED WALLS AREA                          22,351.180         
##                                                                         (102,894.400)       
## mainfloorSF                                                               36.292***         
##                                                                           (12.074)          
## nbrThreeQtrBaths                                                        45,235.750***       
##                                                                          (5,815.871)        
## nbrFullBaths                                                            17,658.330***       
##                                                                          (5,931.780)        
## nbrHalfBaths                                                            35,506.410***       
##                                                                          (6,971.469)        
## TotalFinishedSF                                                          147.823***         
##                                                                           (10.188)          
## ExtWallDscrPrimBlock Stucco                                             167,480.000**       
##                                                                         (72,844.240)        
## ExtWallDscrPrimBrick on Block                                          267,953.100***       
##                                                                         (67,983.460)        
## ExtWallDscrPrimBrick Veneer                                              94,757.290         
##                                                                         (62,421.650)        
## ExtWallDscrPrimCedar                                                     34,328.910         
##                                                                         (124,041.700)       
## ExtWallDscrPrimCement Board                                              93,149.210         
##                                                                         (61,362.260)        
## ExtWallDscrPrimFaux Stone                                              201,153.800***       
##                                                                         (74,682.410)        
## ExtWallDscrPrimFrame Asbestos                                            171,854.400        
##                                                                         (295,639.000)       
## ExtWallDscrPrimFrame Stucco                                              73,069.180         
##                                                                         (62,405.930)        
## ExtWallDscrPrimFrame Wood/Shake                                         147,071.600**       
##                                                                         (61,313.670)        
## ExtWallDscrPrimLog                                                      168,914.800**       
##                                                                         (69,718.270)        
## ExtWallDscrPrimMetal                                                     171,517.900        
##                                                                         (119,286.400)       
## ExtWallDscrPrimMoss Rock/Flagstone                                      135,905.800**       
##                                                                         (67,151.330)        
## ExtWallDscrPrimPainted Block                                             112,289.900        
##                                                                         (85,733.910)        
## ExtWallDscrPrimStrawbale                                                 246,197.000        
##                                                                         (214,830.200)       
## ExtWallDscrPrimVinyl                                                    147,140.600**       
##                                                                         (71,182.630)        
## HeatingDscrElectric                                                    -132,959.500***      
##                                                                         (39,291.580)        
## HeatingDscrElectric Wall Heat (1500W)                                    275,842.300        
##                                                                         (206,672.500)       
## HeatingDscrForced Air                                                   -80,408.020**       
##                                                                         (36,108.330)        
## HeatingDscrGravity                                                       -20,625.360        
##                                                                         (67,712.130)        
## HeatingDscrHeat Pump                                                    -196,400.400*       
##                                                                         (109,438.900)       
## HeatingDscrHot Water                                                     -20,764.100        
##                                                                         (37,418.470)        
## HeatingDscrNo HVAC                                                      -188,088.500        
##                                                                         (208,099.800)       
## HeatingDscrPackage Unit                                                  231,031.600        
##                                                                         (293,667.900)       
## HeatingDscrRadiant Floor                                               126,866.000***       
##                                                                         (47,575.510)        
## HeatingDscrVentilation Only                                             -398,786.000        
##                                                                         (309,432.300)       
## HeatingDscrWall Furnace                                                  36,175.080         
##                                                                         (47,702.750)        
## ExtWallDscrSecBlock Stucco                                             774,196.800***       
##                                                                         (110,633.500)       
## ExtWallDscrSecBrick on Block                                             -10,576.010        
##                                                                         (93,210.560)        
## ExtWallDscrSecBrick Veneer                                               -14,796.800        
##                                                                         (13,230.290)        
## ExtWallDscrSecCedar                                                     418,255.100**       
##                                                                         (168,070.200)       
## ExtWallDscrSecCement Board                                                6,972.315         
##                                                                         (71,306.360)        
## ExtWallDscrSecFaux Stone                                                 19,862.340         
##                                                                         (15,398.080)        
## ExtWallDscrSecFrame Asbestos                                             135,124.200        
##                                                                         (118,539.200)       
## ExtWallDscrSecFrame Stucco                                              -47,876.670*        
##                                                                         (28,780.820)        
## ExtWallDscrSecFrame Wood/Shake                                          29,783.440**        
##                                                                         (12,689.530)        
## ExtWallDscrSecLog                                                        117,830.900        
##                                                                         (145,220.400)       
## ExtWallDscrSecMetal                                                      -35,892.170        
##                                                                         (84,261.590)        
## ExtWallDscrSecMoss Rock/Flagstone                                        44,313.640         
##                                                                         (32,156.520)        
## ExtWallDscrSecPainted Block                                              192,898.000        
##                                                                         (168,895.900)       
## ExtWallDscrSecVinyl                                                      -19,062.270        
##                                                                         (97,143.770)        
## Roof_CoverDscrAsphalt                                                    -6,984.258         
##                                                                          (7,720.393)        
## Roof_CoverDscrBuilt-Up                                                 476,225.500***       
##                                                                         (122,000.900)       
## Roof_CoverDscrClay Tile                                                  -71,847.950        
##                                                                         (47,515.480)        
## Roof_CoverDscrConcrete Tile                                            -84,741.840***       
##                                                                         (24,966.070)        
## Roof_CoverDscrMetal                                                    176,562.000***       
##                                                                         (26,065.220)        
## Roof_CoverDscrRoll                                                      -101,376.000        
##                                                                         (288,217.000)       
## Roof_CoverDscrRubber Membrane                                          130,813.100***       
##                                                                         (27,620.800)        
## Roof_CoverDscrShake                                                      -72,925.380        
##                                                                         (48,736.270)        
## Roof_CoverDscrTar and Gravel                                             169,176.700        
##                                                                         (146,610.200)       
## crime_nn3                                                                -14.710***         
##                                                                            (1.901)          
## Constant                                                              -4,560,358.000***     
##                                                                         (604,719.400)       
## --------------------------------------------------------------------------------------------
## Observations                                                                9,475           
## R2                                                                          0.753           
## Adjusted R2                                                                 0.749           
## Residual Std. Error                                                286,804.800 (df = 9303)  
## F Statistic                                                      166.015*** (df = 171; 9303)
## ============================================================================================
## Note:                                                            *p<0.1; **p<0.05; ***p<0.01

The new model shows an adjusted R-square of 0.749, at a same level of the previous.

boulder.test <-
  boulder.test %>%
  mutate(price.Predict = predict(all_reg_fin, boulder.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict) %>%
  filter(price < 5000000)
MAE <- mean(boulder.test$price.AbsError, na.rm = T)
MAPE <- mean(boulder.test$price.APE, na.rm = T)
acc <- data.frame(MAE, MAPE)
kable(acc) %>% kable_styling(full_width = F)
MAE MAPE
107989.5 0.1968083

The MAE of the model is 107989.5, which means the mean absolute error. The MAPE is 0.1968083, which means there is a 19.68% error between the predicted value and true value.

Cross validation test is applied to the regression model.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

all_reg_fin.cv <- 
  train(price ~ ., data = st_drop_geometry(boulder.train) %>% 
          dplyr::select(price, NAME,year_quarter, bld_num, 
                        designCodeDscr, qualityCodeDscr, 
                        builtYear, CompCode, EffectiveYear, bsmtSF,
                        bsmtTypeDscr,carStorageTypeDscr,mainfloorSF, 
                        nbrThreeQtrBaths, nbrFullBaths, nbrHalfBaths,
                        TotalFinishedSF, HeatingDscr, ExtWallDscrPrim,
                        ExtWallDscrSec, Roof_CoverDscr, crime_nn3),
        method = "lm", trControl = fitControl, na.action = na.pass)

all_reg_fin.cv
## Linear Regression 
## 
## 9475 samples
##   21 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 9381, 9379, 9381, 9380, 9381, 9381, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   283134.8  0.7588802  153884.2
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
all_reg_fin.cv$resample[1:5,]
##       RMSE  Rsquared      MAE Resample
## 1 183483.3 0.8288958 128012.3  Fold001
## 2 167515.6 0.8415570 123110.9  Fold002
## 3 400506.9 0.6638319 191935.7  Fold003
## 4 271663.7 0.7967200 160424.5  Fold004
## 5 167155.8 0.8580706 125578.2  Fold005
sd(all_reg_fin.cv$resample[,3])
## [1] 24030.65
mean(all_reg_fin.cv$resample[,3])
## [1] 153884.2
hist(all_reg_fin.cv$resample[,3], 
     breaks = 50, 
     main = 'Distribution of MAE \nK fold cross validation;k=100', 
     xlab = 'Mean Absolute Mile', 
     ylab = 'count')

The mean of the MAEs is 24030.65 and the standard deviation is 153884.2. The histogram below shows the distribution of the MAEs of the 100 folds in k fold cross validation. As is shown in the graph, the MAEs is not very tightly distributed, which means the generalizability of the model is not very good.

Below is the graph of predicted price as a function of observed price. The red line represents a perfect prediction according to the observed house price, and the orange line represents our average prediction. Our prediction is quite near the perfect prediction, although there is a small gap between them.

ggplot(
  boulder.test, aes(price.Predict, price)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  geom_abline(intercept = 0, slope = 1, color='red',size=1) +
  labs(title = 'Predicted sale price as a function of observed price', subtitle = 'Red line represents a perfect prediction \nOrange line represents an average prediction') +
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'

The graph of lag price error and price error shows a slight positive relationship between them, which also indicates the existence of spatial autocorrelation.

coords.test <-  st_coordinates(boulder.test)   

tractList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(tractList.test, style="W")

boulder.test <- boulder.test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, price.Error))

ggplot(data=boulder.test, aes(lagPriceError, price.Error)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "#FA7800") +
  labs(title = 'price error as a function of lag price errors') +
  theme(plot.title = element_text(hjust = 0.5)) +
  plotTheme()
## `geom_smooth()` using formula 'y ~ x'

ggplot() +
  geom_sf(data = acsTractsBD19, fill = "grey60") +
  geom_sf(data = boulder.test, aes(colour = q5(lagPriceError)),
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder.test,"lagPriceError"),
                   name="Quintile\nBreaks") +
  labs(title="a map of spatial lag in errors") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

Below is the graph of Moran’s I test. The histogram means the randomly permuted Moran’s I and the orange line means the observed Moran’s I. The graph shows that there is somewhat spatial autocorrelation in the residuals’ distribution, but the autocorrelation is not very high. Ideally, there should be no spatial autocorrelation if the model is absolutely generalized.

moranTest <- moran.mc(boulder.test$price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

Below is a list of mean MAPE of each census tract. There are differences between the mean MAPE in each census tract, but the majority lies between 0.10 to 0.25. The max value of MAPE is 0.84 and occurs in Census Tract 132.01, which means our model does not fit well in this particular region. As is shown in the map, the MAPE in the center of the region is lower than the outside of the region, which means our model fit better in center city.

test_map = st_drop_geometry(boulder.test) %>%
    group_by(NAME) %>%
    summarize(MAPE = mean(price.APE, na.rm = T))
test_map
## # A tibble: 63 x 2
##    NAME                                           MAPE
##    <chr>                                         <dbl>
##  1 Census Tract 121.01, Boulder County, Colorado 0.197
##  2 Census Tract 121.02, Boulder County, Colorado 0.235
##  3 Census Tract 121.03, Boulder County, Colorado 0.206
##  4 Census Tract 121.04, Boulder County, Colorado 0.207
##  5 Census Tract 121.05, Boulder County, Colorado 0.182
##  6 Census Tract 122.01, Boulder County, Colorado 0.605
##  7 Census Tract 122.02, Boulder County, Colorado 0.236
##  8 Census Tract 122.03, Boulder County, Colorado 0.136
##  9 Census Tract 124.01, Boulder County, Colorado 0.162
## 10 Census Tract 125.01, Boulder County, Colorado 0.120
## # ... with 53 more rows

Below is a map of the predicted price of our regression model. As is shown in the map, predicted prices tend to cluster. Most high-price houses are in the center city and suburb, while low-price houses locate in the mountainous area in the west.

boulder_data_1 <-
  boulder_data %>%
  mutate(price.Predict = predict(all_reg_fin, boulder_data))%>%
  filter(price < 5000000)
ggplot() +
  geom_sf(data = acsTractsBD19, fill = "grey60") +
  geom_sf(data = boulder_data_1, aes(colour = q5(price.Predict)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(boulder_data_1,"price.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="predicted price with model") +
  theme(plot.title = element_text(hjust = 0.5)) +
  mapTheme()

The scatter plot shows that mean MAPE of each census tract is not that relevant with mean house price.

mm_table <- left_join(
  st_drop_geometry(boulder.test) %>%
    group_by(NAME) %>%
    summarize(meanPrice = mean(price, na.rm = T)),
 st_drop_geometry(boulder.test) %>%
    group_by(NAME) %>%
    summarize(MAPE = mean(price.APE, na.rm = T)))
## Joining, by = "NAME"
mm_table <- merge(acsTractsBD19, mm_table,by.x = 'NAME',by.y = 'NAME') # add geometry of each census tracts
mm_table <- mm_table[,c('NAME','MAPE','meanPrice')]
# mm_table
ggplot(data=st_drop_geometry(mm_table), aes(meanPrice, MAPE)) +
  geom_point(size = 1.0, colour = "#FA7800") +
  labs(title = 'MAPE as a function of mean price in each census tract') +
  theme(plot.title = element_text(hjust = 0.5))+
  plotTheme()

ggplot() +
  geom_sf(data = mm_table, aes(fill = MAPE)) +
  scale_colour_manual(values = palette5,
                   labels=qBr(mm_table, "MAPE"),    
                   name="Quintile\nBreaks") +
  theme(plot.title = element_text(hjust = 0.5))+
  labs(title="MAPE of Each Census Tract") +
  mapTheme()

We select income as the factor to split the census tracts into two groups, high-income group and low-income group. The MAPE of high-income regions is 20% and that of low-income regions is 19%. There is little difference in MAPE for both income groups, showing our model has a good generalizability in different income groups.

# set income context
median_hhincome <- median(acsTractsBD19$med_hh_income)
acsTractsBD19 <- acsTractsBD19 %>%
  mutate(incomeContext = ifelse(med_hh_income > median_hhincome, "High Income", "Low Income"))

#plot incime context
ggplot() + geom_sf(data = na.omit(acsTractsBD19), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context in Boulder County") +
    theme(plot.title = element_text(hjust = 0.5),legend.position="bottom") +
    mapTheme()

st_join(boulder.test, acsTractsBD19) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by census Tracts income context") %>% kable_styling(full_width = F)
Test set MAPE by census Tracts income context
High Income Low Income
20% 19%
# testing_data["price"] <- predict(all_reg_fin, testing_data)
# write.csv(testing_data,"predictions.csv", row.names = FALSE)

Discussion

In general, our model is an effective one. Our R-square reaches 0.75 which means the major variation in house price can be interpreted by our model. The MAPE is at the level of 0.19, showing a moderate accuracy in local house price prediction. Our model performs especially well in the downtown area, with an average MAPE of about 0.12.

However, there is also limitation in our model that there is spatial autocorrelation in our residuals, showing an unsatisfying generalizability. In the center and suburb of the boulder city, our model performs quite well, with a lower MAPE than average. But in some specific outer regions, the high MAPE value indicates a bad model adoption. The reason of this might be that there are more houses in the center city and suburbs, so most of our training data come from these regions. Due to the lack of training data, predicted house price in the outer region varies quite a lot from the observed price. What’s more, houses in the mountainous areas are less dense and therefore farther from the surrounding public facilities, some of our predictors do not apply when predicting this kind of houses.

The map of our prediction result shows that the lower house prices cluster in the downtown, while those houses in the suburb have a greater value. This pattern is quite common in US cities.

There are some interesting predictors when training the regression model. We introduced a couple of census data to the model as predictors and we can see level of income, level of education and poverty rate all have an impact on regional house price. What surprises us when doing the training part is that the accessibility to schools (calculated as 3 nearest schools’ distance) is not significant in influencing the house price, and neither is the accessibility to hospitals. This is quite different from my ‘knowledge’ and surprised me at first.

Conclusion

In general, we’d like to recommend our model to Zillow. Firstly, our model reaches a moderate accuracy and can be applied especially in downtown house price prediction with a MAPE about 12% on average.

Furthermore, we consider a wide variety of predictors in the process of model training. The predictors come from the field of houses’ internal characteristics, public amenities near houses, spatial structure, and socioeconomical characteristics and have been carefully selected to avoid correlation and multi-collinearity.

Different methods, like creating data partition, are also introduced to reduce errors when training our model. Thus, our model is likely to be helpful in house price prediction in boulder county.

There are also a few limitations in our model. Our model does not reach a satisfying generalizability, and is quite restricted in the downtown area of boulder county. We’d like to improve it with more generalizable and useful datasets and pay more attention on model training.