1 Introduction

This project aims to develop a model to predict home prices in Boulder County, CO. Home price prediction is an integral part of the real estate industry, yet its accuracy has long been a challenge: the business-as-usual method does not produce predictions as accurate as they could be.

In this project, we develop an OLS model by gathering local data and engineering relevant variables. The resulting prediction model achieves an average percentage error of 19%, and is markedly more accurate (an error of around 6%) in the county’s three urban cores.

2 Data & Method

2.1 Data sources

We have collected data from three main sources:

  1. The Boulder County Assessor Public Dataset, which includes the prices and characteristics (e.g., number of rooms, year built, facilities) of homes within Boulder County.

  2. OpenStreetMap and the Boulder County Open Data Portal, which provide open data on the recreational, commercial, transportation, and other amenities/facilities within Boulder County.

  3. Census data from the American Community Survey (ACS).

2.2 Data collection

#-----Import necessary packages-----
library(tidyverse)
library(ggplot2)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(osmdata)
library(tigris)
library(osmextract)
library(curl)
library(reshape2)
library(glue)
library(dismo)
library(spgwr)
library(MASS)
library(lme4)
library(data.table)
library(kableExtra)
library(stargazer) # Source: ^[Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer]

options(tigris_class = "sf")

#-----Import external functions & a palette-----
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

plotTheme <- function(base_size = 9, title_size = 10){
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5), 
    plot.subtitle = element_text( face = 'italic',
                                 size = base_size, colour = "black", hjust = 0.5),
    plot.caption = element_text( hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.01),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=.5),
    strip.background = element_blank(),
    strip.text = element_text( size=9),
    axis.title = element_text( size=9),
    axis.text = element_text( size=7),
    axis.text.y = element_text( size=7),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text( colour = "black", face = "italic", size = 9),
    legend.text = element_text( colour = "black", face = "italic", size = 7),
    strip.text.x = element_text( size = 9),
    legend.key.size = unit(.5, 'line')
  )
}

mapTheme <- function(base_size = 9, title_size = 10){
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5), 
    plot.subtitle = element_text( face = 'italic',
                                 size = base_size, colour = "black", hjust = 0.5),
    plot.caption = element_text( hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_blank(),
    strip.text = element_text(size=base_size),
    axis.title = element_text( size=9),
    axis.text = element_blank(),
    axis.text.y = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text( colour = "black", face = "italic", size = 9),
    legend.text = element_text( colour = "black", face = "italic", size = 7),
    strip.text.x = element_text(size=base_size),
    legend.key.size = unit(.5, 'line')
  )
}

palette5 <- c("#2166ac","#67a9cf","#d1e5f0","#fddbc7","#ef8a62")

#-----Geographical boundaries and CRS-----

# Bounding box for getting OSM & CRS
q0 <- opq(bbox = c(-105.7,39.9,-105.05,40.25))
crs = "ESRI:102254"

# Boulder County boundary
boulder <- counties(state = 'CO', cb = FALSE) %>%
  filter(NAME == "Boulder") %>%
  st_as_sf() %>%
  st_transform(crs)

# Boulder County's municipalities
muni = st_read("https://opendata.arcgis.com/datasets/9597d3916aba47e887ca563d5ac15938_0.geojson") %>%
  st_transform(crs)

#-----Self-define functions-----

# Get data from OSM: query one key/value pair and return an sf object
get_osm_1 <- function(bbox, key, value, geom, crs){
  object <- add_osm_feature(opq = bbox, key = key, value = value) %>%
    osmdata_sf(.)
  # Pull the requested geometry type, e.g., "osm_points" or "osm_polygons"
  geom_name <- paste("osm_", geom, sep = "")
  object.sf <- st_geometry(object[[geom_name]]) %>%
    st_transform(crs) %>%
    st_sf() %>%
    # Attach the tag values; cbind produces an unwieldy auto-generated column name,
    # which is renamed to NAME
    cbind(., object[[geom_name]][[key]]) %>%
    rename(NAME = "object..geom_name....key..")
  return(object.sf)
}

# Buffer Count - Count the number of appearance within the buffer of houses
# Input: home data (sf), object/amenity type, buffer radius, object data
buffer_count = function(trainData, type, radius, data){
  houseBuffer = st_buffer(trainData, radius)
  trainData[type] = st_intersects(houseBuffer, data) %>%
    sapply(., length)
  return(trainData)
}

# Buffer Area - for polygon amenities, calculate the aggregate area of all entries that intersect with the buffer of a home
# Input: home data (sf), object/amenity type, buffer radius, object data
buffer_area = function(trainData, type, radius, data){
  data <- mutate(data, area = st_area(data))
  trainData[type] = st_buffer(trainData, radius) %>%
    aggregate(data %>% dplyr:: select (geometry, area),., sum) %>%
    pull(area)
  return(trainData)
}

# K-near - Calculate the average distance of a home to its nearest (1 to 5) amenity object(s)
# Input: home data, object/amenity data, object/amenity type, geometry type (point or non-point)
knear <- function(trainData, data, type, geom){
  if (geom != "points") {
    data =  data %>% dplyr::select(geometry) %>%st_centroid(.) %>% st_coordinates
  }
  else {
    data = data %>% dplyr::select(geometry) %>% st_coordinates
  }
  trainDataCoords = st_coordinates(trainData)
  nn_1 = nn_function(trainDataCoords, data, 1)
  nn_2 = nn_function(trainDataCoords, data, 2)
  nn_3 = nn_function(trainDataCoords, data, 3)
  nn_4 = nn_function(trainDataCoords, data, 4)
  nn_5 = nn_function(trainDataCoords, data, 5)
  trainData[paste(type, "_nn1", sep = "")] = nn_1
  trainData[paste(type, "_nn2", sep = "")] = nn_2
  trainData[paste(type, "_nn3", sep = "")] = nn_3
  trainData[paste(type, "_nn4", sep = "")] = nn_4
  trainData[paste(type, "_nn5", sep = "")] = nn_5
  return(trainData)
}

#-----Import home data and wrangle them-----

# Import all home data
trainData = st_read("studentData.geojson", crs = "ESRI:102254")

# Wrangle home data
trainData <- dplyr::select(
  trainData,
  -saleDate, -year, -year_quarter, -address, -bld_num, -section_num, -designCode,
  -qualityCode, -bldgClass, -bldgClassDscr, -ConstCode, -CompCode, -builtYear,
  -bsmtTypeDscr, -carStorageTypeDscr, -Ac, -Heating, -ExtWallPrim, -ExtWallSec,
  -IntWall, -Roof_Cover, -Stories, -UnitCount, -status_cd) %>%
  # Collapse the three bathroom counts into one weighted measure
  mutate(nbrBaths = nbrFullBaths + 0.5 * nbrHalfBaths + 0.75 * nbrThreeQtrBaths) %>%
  dplyr::select(-nbrFullBaths, -nbrHalfBaths, -nbrThreeQtrBaths)
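
As a quick worked example of the bathroom weighting above: a home with two full baths, one half bath, and one three-quarter bath counts as 2 + 0.5 × 1 + 0.75 × 1 = 3.25 bathrooms.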

Using this dataset, we map the sale prices of homes (Figure 1). Home prices follow distinct spatial patterns: higher prices are concentrated in the Municipality of Boulder and the southeastern municipalities of Louisville, Superior, and Erie, while prices in Longmont and the rural areas are much lower.

ggplot()+
  geom_sf(data = CensusData2, color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent", color="black",size=2)+
  geom_sf(data = trainData, aes(color = q5(price)), size = 3,alpha=0.3)+
  scale_color_manual(values = palette5, 
                    labels = qBr(trainData, "price"),
                    name = "Home sale prices")+
  labs(title = "Map of home sale prices in Boulder County, CO", 
       subtitle = "Real Dollars; In tract unit")+
  mapTheme()
Figure 1. Map of home sale prices in Boulder County, CO


While the dataset of homes (trainData) includes the intrinsic features of each home, we still need external features, that is, the accessibility of each home to recreation, services, transport, and other amenities. Depending on the characteristics of each type of amenity, we calculate accessibility using one or more of the following methods (implemented by the self-defined functions in the code chunk above; a sketch of the nearest-neighbor helper they rely on follows the list):

  1. Buffer-count method. We count how many objects there are within a buffer (usually a 1,000-meter buffer) of each home.

  2. Buffer-area method. We calculate the aggregate area of objects within a buffer (usually a 1,000-meter buffer) of each home.

  3. K-nearest-neighbor method. We calculate the average distance of each home to its one to five nearest amenity objects.
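
The knear function above delegates the distance computation to nn_function(), which comes from the sourced functions.r rather than being defined here. The following is a minimal sketch of what it computes, assuming coordinate matrices as produced by st_coordinates() and the FNN package loaded above:

# Sketch of nn_function() (the canonical version lives in the sourced functions.r):
# the mean distance from each origin point to its k nearest destination points.
nn_function_sketch <- function(measureFrom, measureTo, k){
  nn_dist <- FNN::get.knnx(data = measureTo, query = measureFrom, k = k)$nn.dist
  rowMeans(nn_dist) # average distance to the k nearest destinations
}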

#-----Amenity: parks-----

# Get park data, same below
parks = get_osm_1(q0, "leisure", "park", "polygons", "ESRI:102254")

## How many parks within 1000 m buffer of each home, same below
trainData = buffer_count(trainData, "parkCount", 1000, parks)
## Aggregate park area of all intersected park, same below
trainData = buffer_area(trainData, "parkArea", 1000, parks)

#-----Amenity: restaurants-----

restaurants = get_osm_1(q0, "amenity", "restaurant", "points", crs)
trainData = buffer_count(trainData, "restaurantsCount", 1000, restaurants)

# Average distance to 1~5 nearest restaurant(s), same below
trainData = knear(trainData, restaurants, "restaurants", "points")

#-----Amenity: schools-----

schools = get_osm_1(q0, 'amenity', 'school', "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "schoolCount", 1000, schools)
trainData = knear(trainData, schools, "schools", "polygons")

# Assign each home to its nearest school
schoolDistrict = voronoi(st_coordinates(schools %>% st_centroid())) %>% 
  st_as_sf() %>% 
  st_set_crs("ESRI:102254") %>%
  rename(schoolNo = id) %>%
  mutate(schoolNo = as.character(schoolNo))
trainData = st_join(trainData, schoolDistrict)

#-----Amenity: universities-----

# Get university data
universities = get_osm_1(q0, 'amenity', 'university', "polygons", "ESRI:102254")

# Whether university is within 1000 m buffer of each home
trainData = buffer_count(trainData, "universitiesCount", 1000, universities)

#-----Amenity: parking-----

parking = get_osm_1(q0, "amenity", "parking", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "parkingCount", 1000, parking)
trainData = buffer_area(trainData, "parkingArea", 1000, parking)
trainData = knear(trainData, parking, "parking", "polygons")

#-----Amenity: clinics-----

clinics = get_osm_1(q0, "amenity", "clinic", "points", "ESRI:102254")
trainData = knear(trainData, clinics, "clinics", "points")

#-----Amenity: hospitals-----

hospitals = get_osm_1(q0, "amenity", "hospital", "polygons", "ESRI:102254")
trainData = knear(trainData, hospitals, "hospitals", "polygons")

#-----Amenity: cinemas-----
cinemas = get_osm_1(q0, "amenity", "cinema", "points", "ESRI:102254")
trainData = buffer_count(trainData, "cinemasCount", 1000, cinemas)
trainData = buffer_count(trainData, "cinemasCount2", 2000, cinemas)

#-----Amenity: stadiums-----
stadiums = get_osm_1(q0, "building", "stadium", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "stadiumsCount", 1000, stadiums)

#-----Amenity: commerce-----
commercial = get_osm_1(q0, "building", "commercial", "polygons", "ESRI:102254")
trainData = knear(trainData, commercial, "commerce", "polygons")
trainData = buffer_count(trainData, "commerceCount", 1000, commercial)

#-----Amenity: retail-----
retail = get_osm_1(q0, "building", "retail", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "retailCount", 1000, retail)
trainData = knear(trainData, retail, "retail", "polygons")

#-----Amenity: common leisure space-----
commonleisure = get_osm_1(q0, "leisure", "common", "polygons", "ESRI:102254")
trainData = knear(trainData, commonleisure, "commonLeisure", "polygons")

#-----Amenity: fitness centers-----
fitness_centre = get_osm_1(q0, "leisure", "fitness_centre", "points", "ESRI:102254")
trainData = buffer_count(trainData, "fitnessCenterCount", 1000, fitness_centre)
trainData = knear(trainData, fitness_centre, "fitnessCenter", "points")

#-----Amenity: public gardens-----
garden = get_osm_1(q0, "leisure", "garden", "polygons", "ESRI:102254")
trainData = knear(trainData, garden, "gardens", "polygons")
trainData = buffer_count(trainData, "gardensCount", 1000, garden)

#-----Jobs: companies-----
companies = get_osm_1(q0, "office", "company", "points", "ESRI:102254")
trainData = knear(trainData, companies, "companies", "points")

#-----Transport: public transport-----
public_transport = get_osm_1(q0, "public_transport", "station", "points", "ESRI:102254")
trainData <-
  trainData %>% 
  mutate(
    public_transport_nn1 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 1),
    public_transport_nn2 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 2))


#-----Transport: highway-----

highway = get_osm_1(q0, "highway", "primary", "lines", "ESRI:102254")
# Distance from each home to primary highways.
highwayunion <- st_buffer(st_union(highway),500)%>%st_sf()
highway_MRBuffer <-
  st_join(trainData,multipleRingBuffer(highwayunion,40000,1000)) %>%
  st_drop_geometry() %>%
  replace_na(list(distance = 0))  %>%
  mutate(distance = distance+500) %>%
  rename(.,highwaydistance = distance)%>%
  dplyr::select(MUSA_ID,highwaydistance)
trainData <- left_join(trainData,highway_MRBuffer,by="MUSA_ID")
rm(highway_MRBuffer,highwayunion)
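
Like nn_function(), multipleRingBuffer() comes from the sourced functions.r. Conceptually, it draws concentric rings around the input geometry at a fixed interval, each ring carrying a distance attribute, so the st_join above assigns every home the distance band it falls into. A minimal sketch under that reading:

# Sketch of multipleRingBuffer() (the canonical version lives in the sourced
# functions.r): concentric rings around a geometry, each tagged with its outer distance.
rings_sketch <- function(geom, maxDistance, interval){
  distances <- seq(interval, maxDistance, by = interval)
  do.call(rbind, lapply(distances, function(d){
    ring <- st_difference(st_buffer(geom, d), st_buffer(geom, d - interval))
    st_sf(distance = d, geometry = st_geometry(ring))
  }))
}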

#-----Nature reserves-----
natureReserves = get_osm_1(q0, "leisure", "nature_reserve", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "natureReservesCount", 1000, natureReserves)

#-----Water bodies-----
water = get_osm_1(q0, "natural", "water", "polygons", "ESRI:102254")
trainData <-trainData %>% mutate(water_nn1 = nn_function(st_coordinates(trainData), st_coordinates(water%>%dplyr::select(geometry)%>%st_centroid(.)), 1))

In addition to the OpenStreetMap data, we have also collected block-group-level data from the ACS (American Community Survey, 5-year estimates, 2019). We then attach to each home the census tract it falls in and, finally, the average price of its five nearest homes (a spatial lag).

#-----API-----
census_api_key("b33ec1cb4da108659efd12b3c15412988646cbd8", overwrite = TRUE)

#-----Get census data-----

# Get block-group level information data regarding racial composition, population density, income, resident education levels etc.
CensusData <- 
  get_acs(geography = "block group", 
          variables = c("B01003_001E","B02001_002E","B19013_001E","B25058_001E",'B02001_003E','B02001_005E'), 
          year=2019, state=08, county=013, geometry=T, output="wide") %>%
  st_transform('ESRI:102254') %>%
  rename(Census_TotalPop = B01003_001E, 
         Census_Whites = B02001_002E,
         Census_AfricanAmericans = B02001_003E,
         Census_Asians = B02001_005E,
         Census_MedHHInc = B19013_001E, 
         Census_MedRent = B25058_001E) %>%
  dplyr::select(-NAME,-starts_with("B")) %>%
  mutate(Census_pctWhite = ifelse(Census_TotalPop > 0, 100*Census_Whites / Census_TotalPop,0),
         Census_pctAfricanAmericans = ifelse(Census_TotalPop > 0, 100*Census_AfricanAmericans / Census_TotalPop,0),
         Census_pctAsians = ifelse(Census_TotalPop > 0, 100*Census_Asians / Census_TotalPop,0),
         Census_blockgroupareasqm = as.numeric(st_area(.)),
         Census_areaperpeople = ifelse(Census_blockgroupareasqm > 0,Census_blockgroupareasqm/Census_TotalPop,0),
         year = "2019") %>%
  dplyr::select(-Census_Whites,-Census_AfricanAmericans,-Census_Asians, -GEOID)

# Get tract-level census data for tract geo-info

CensusData2 <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001E"), 
          year=2019, state=08, county=013, geometry=T, output="wide") %>%
  st_transform('ESRI:102254') %>%
  dplyr::select(-NAME,-starts_with("B"))

#-----Join trainData to census data-----

trainData <- st_join(trainData, dplyr::select(CensusData, -Census_blockgroupareasqm, -year, -Census_TotalPop))
# The geometry column is sticky in sf, so it need not be deselected
trainData <- st_join(trainData, dplyr::select(CensusData2, GEOID))

#-----Join trainData to municipality data-----
trainData <-st_join(trainData,dplyr::select(muni, ZONEDESC)) %>%
  rename(tractID = GEOID, muni = ZONEDESC)
trainData$muni[is.na(trainData$muni)] = 'Other'

#-----Other modification-----

trainData$parkingArea = as.numeric(trainData$parkingArea)
trainData$parkArea = as.numeric(trainData$parkArea)

#-----Spatial lag information-----
# Average price of five nearest homes of each home
# For "toPredict == 1" observations, the nearest five homes are defined as the nearest five homes in the "toPredict == 0" set. 

# First split data by toPredict
dataTest = trainData[trainData$toPredict == 1,]
dataTrain = trainData[trainData$toPredict == 0,]
rownames(dataTest) <- seq(length = nrow(dataTest))
rownames(dataTrain) <- seq(length = nrow(dataTrain))

# Spatial lag for "toPredict == 1" homes
dataTest = cbind(dataTest, 
                 as.data.frame(get.knnx(st_coordinates(dataTrain), 
                                        st_coordinates(dataTest), 5)$nn.index) %>%
                   rownames_to_column(var = "thisPoint") %>%
                   gather(points, point_index, V1:ncol(.)) %>%
                   arrange(as.numeric(thisPoint)) %>% 
                   left_join(cbind(dataTrain$price, seq(length = length(dataTrain$price))) %>%
                               as.data.frame()%>%
                               rename(price = V1, index = V2),
                             by = c("point_index" = "index")) %>% 
                   group_by(thisPoint)%>%
                   summarize(spatialLag = mean(price))%>%
                   arrange(as.numeric(thisPoint))%>%
                   dplyr::select(-thisPoint))
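
# An equivalent, more direct computation of the same spatial lag (a sketch with
# hypothetical names, left commented out because it would add an extra column):
# nnIdx <- FNN::get.knnx(st_coordinates(dataTrain), st_coordinates(dataTest), 5)$nn.index
# dataTest$spatialLagCheck <- apply(nnIdx, 1, function(ix) mean(dataTrain$price[ix]))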

# Adding spatial lag indicator into "dataTrain"
dataTrain$spatialLag = lag.listw(nb2listw(knn2nb(knearneigh(st_coordinates(dataTrain), 5)), style = "W"),dataTrain$price)

# Un-split data, because we still need to wrangle the data-set as a whole
trainData = rbind(dataTrain, dataTest)

3 Model Building

3.1 Feature engineering

The OLS model performs best when the predictors are approximately Gaussian. Therefore, we have plotted scatter plots of home prices against all the numeric variables (see Figure 2), and histograms of all the numeric variables (see Figure 3).

#-----Select Numeric Variables-----
numericColumns = unlist(lapply(trainData, is.numeric))
trainDataNumeric = trainData[,numericColumns] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>% 
  st_drop_geometry() %>% 
  gather(key,value, -price)

#-----Make scatter-plots of all numeric variables----

ggplot(trainDataNumeric, aes(x = value, y = price))+
  geom_point(size=.05,alpha=0.5)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=0.5)+
  facet_wrap(~key, scales = "free",ncol = 8)+
  plotTheme()
Figure 2. Scatter plots of home prices against all the numeric variables

The scatter plots show that the linear relationships between home prices and many predictors are weak. We therefore consider applying a log transformation to some of the numeric predictors. Figure 3 shows each numeric predictor's histogram before and after logarithmic transformation.

# Possible distribution transformation: histogram of each numeric predictor
# before and after log(1 + x)
trainDataNumeric2 = trainData[,numericColumns] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>%
  st_drop_geometry()
i = 1
for (column in trainDataNumeric2) {
  name = names(trainDataNumeric2)[i]
  par(mfrow = c(1, 2))
  hist(as.numeric(column), breaks = 80, main = c(name, "Before"))
  hist(log(1 + as.numeric(column)), breaks = 80, main = c(name, "After"))
  i = i + 1
}
dev.off()
Figure 3. Histograms before and after logarithmic transformation of every numeric predictor

These plots help us identify the variables suitable for a logarithmic transformation: those whose distributions become closer to Gaussian after the transformation. A less visual version of the same check is sketched below.
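
As a complement to eyeballing the histograms, one could compare each variable's sample skewness before and after the transformation. This snippet is a sketch of that check rather than part of the original pipeline (skew() is a helper defined here; values nearer zero indicate a more symmetric, more nearly Gaussian distribution):

# Sample skewness before vs. after log(1 + x), per numeric predictor
# (columns with zero variance return NaN/Inf and can be ignored)
skew <- function(x){
  x <- as.numeric(x); x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
sapply(trainDataNumeric2, function(col) c(before = skew(col), after = skew(log(1 + as.numeric(col)))))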

# Variables List
LNVariables <- c(
  "Census_areaperpeople",
  "clinics_nn1","clinics_nn2","clinics_nn3","clinics_nn4","clinics_nn5",
  "commerce_nn1","commerce_nn2","commerce_nn3","commerce_nn4","commerce_nn5",
  "commonLeisure_nn1","commonLeisure_nn2","commonLeisure_nn3","commonLeisure_nn4","commonLeisure_nn5",
  "companies_nn1","companies_nn2","companies_nn3","companies_nn4","companies_nn5",
  "gardens_nn1","gardens_nn2","gardens_nn3","gardens_nn4","gardens_nn5",
  "hospitals_nn1","hospitals_nn2","hospitals_nn3","hospitals_nn4","hospitals_nn5",
  "parking_nn1","parking_nn2","parking_nn3","parking_nn4","parking_nn5",
  "restaurants_nn1","restaurants_nn2","restaurants_nn3","restaurants_nn4","restaurants_nn5",
  "retail_nn1","retail_nn2","retail_nn3","retail_nn4","retail_nn5",
  "schools_nn1","schools_nn2","schools_nn3","schools_nn4","schools_nn5",
  "parkArea","public_transport_nn1","public_transport_nn2","water_nn1")

# Geometry has to be dropped before logarithmic transformation
geo_trainData <- trainData %>% dplyr::select(MUSA_ID)
trainData1 = trainData %>% st_drop_geometry

# Logarithmic transformation
trainData1$parkingArea = as.numeric(trainData1$parkingArea)
trainData1$parkArea = as.numeric(trainData1$parkArea)
for (i in LNVariables) {
  a = glue('LN_{i}')
  b = dplyr::select(trainData1,glue('{i}'))
  c = log(1+b)
  trainData1[a] <- c
  trainData1 <- dplyr::select(trainData1,-glue('{i}'))
}

# Join back geometry
trainData1 = trainData1 %>% left_join(geo_trainData, by = "MUSA_ID") %>% st_sf()
trainData1[is.na(trainData1)] = 0

3.2 Feature selection

Our model involves a large number of interrelated predictors, violating the assumptions of an OLS model. Feature selection is required to make the model more concise and mathematically tenable. A correlation matrix (see Figure 4) was plotted to identify which predictors are highly interrelated.

numericColumns3 = unlist(lapply(trainData1, is.numeric))
trainDataNumeric3 = trainData1[,numericColumns3] %>% 
  dplyr::select(-MUSA_ID,-toPredict)%>%
  na.omit()%>%
  st_drop_geometry()
Matrix <- select_if(trainDataNumeric3, is.numeric) %>% na.omit()
ggcorrplot(
  cor(Matrix),
  p.mat = cor_pmat(Matrix),
  colors = c("#EF8A62", "white", "#67A9CF"),
  type="lower",
  lab = TRUE,
  lab_size = 1.5,
  tl.cex = 5,
  insig = "blank")
Figure 4. Correlation matrix of all the engineered features

From the correlation matrix, we have identified 11 groups of highly inter-correlated predictors. In what follows, we keep only one predictor from each group.

# Define the 11 groups of highly inter-correlated predictors
LNgardens_nn <- c("LN_gardens_nn2","LN_gardens_nn3","LN_gardens_nn4","LN_gardens_nn5","gardensCount")
LNhospitals_nn <- c("LN_hospitals_nn1","LN_hospitals_nn3")
# Use a more generic name for easier later access. Same below
cbindvars1 <- c(LNgardens_nn,LNhospitals_nn)

LNcommonLeisure_nn <- c("LN_commonLeisure_nn3","LN_commonLeisure_nn4","LN_commonLeisure_nn5")
cbindvars2 <- LNcommonLeisure_nn

LNschools_nn <- c("LN_schools_nn2","LN_schools_nn3","LN_schools_nn4","LN_schools_nn5")
parking_decision <- c("parkingCount","parkingArea","LN_parking_nn1")
cbindvars3 <- c(LNschools_nn,parking_decision)

LNrestaurants_nn <- c("LN_restaurants_nn1","LN_restaurants_nn2","LN_restaurants_nn4","restaurantsCount")
LNcommerce_nn <- c("LN_commerce_nn4","LN_commerce_nn5","commerceCount")
cbindvars4 <- c(LNrestaurants_nn,LNcommerce_nn)

LNretail_nn <- c("LN_retail_nn1","LN_retail_nn4","LN_retail_nn5")
LNclinics_nn <- c("LN_clinics_nn1","LN_clinics_nn2","LN_clinics_nn3","LN_clinics_nn5")
cbindvars5 <- c(LNretail_nn,LNclinics_nn)

LNpublic_transport_nn <- c("LN_public_transport_nn1","LN_public_transport_nn2")
cbindvars6 <- LNpublic_transport_nn

park_decision <- c("parkCount","LN_parkArea")
cbindvars7 <- park_decision

bsmt_decision <- c("bsmtSF","bsmtType")
cbindvars8 <- bsmt_decision

LNcompanies_nn <- c("LN_companies_nn2","LN_companies_nn3","LN_companies_nn4","LN_companies_nn5")
cbindvars9 <- LNcompanies_nn

ExtWall_decision <- c("ExtWallDscrPrim","ExtWallDscrSec") 
cbindvars10 <- ExtWall_decision

The second step is to run a stepwise regression. Stepwise regression iteratively adds and removes predictors, at each step keeping the model with the lowest AIC (AIC = 2k - 2ln(L), where k is the number of parameters and L the maximized likelihood), which trades goodness of fit against model size. It serves as a preliminary filter of weak predictors.

#-----A simplified dataset of only the selected variables-----

trainData3 <- dplyr::select(
  trainData1, 
  price,designCodeDscr,qualityCodeDscr,ConstCodeDscr,EffectiveYear,carStorageType,TotalFinishedSF,
  AcDscr,HeatingDscr,IntWallDscr,Roof_CoverDscr,nbrBaths,schoolCount,universitiesCount,stadiumsCount,
  highwaydistance,natureReservesCount,Census_MedHHInc,Census_MedRent,Census_pctWhite,Census_pctAsians,
  tractID,LN_Census_areaperpeople,LN_water_nn1,LN_hospitals_nn1,LN_commonLeisure_nn4,LN_schools_nn2,
  LN_restaurants_nn1,LN_clinics_nn1,LN_public_transport_nn2,LN_parkArea,bsmtSF,LN_companies_nn5,ExtWallDscrPrim,fitnessCenter_nn3,gardensCount, toPredict, muni)

#-----Split dataset by the toPredict variable-----
dataTest = trainData3[trainData3$toPredict == 1,]
dataTrain = trainData3[trainData3$toPredict == 0,]
rownames(dataTest) <- seq(length = nrow(dataTest))
rownames(dataTrain) <- seq(length = nrow(dataTrain))
dropList1 = c("geometry", "MUSA_ID", "toPredict", "muni")
# Running a step-wise model
reg1Data = dataTrain[, !colnames(dataTrain) %in% dropList1] %>% 
  st_drop_geometry
reg.baseline1= lm(price ~ ., data = reg1Data)
reg.baseline1.step = stepAIC(reg.baseline1, direction = "both",trace = FALSE)
reg.baseline1.step$anova

# Examine the step-wise model
summary(reg.baseline1.step)

# The chosen variables are:
# designCodeDscr, qualityCodeDscr, ConstCodeDscr, EffectiveYear, bsmtSF, bsmtType,
# carStorageType, AcDscr, HeatingDscr, ExtWallDscrPrim, ExtWallDscrSec, Roof_CoverDscr,
# nbrBaths, parkCount, restaurantsCount, schoolCount, schoolNo, universitiesCount,
# parkingCount, parkingArea, cinemasCount, stadiumsCount, commerceCount,
# fitnessCenter_nn3, gardensCount, Census_MedRent, Census_pctAfricanAmericans,
# tractName, LNclinics_nn1, LNclinics_nn3, LNclinics_nn5, LNcommerce_nn3,
# LNcommerce_nn4, LNcommerce_nn5, LNcommonLeisure_nn1, LNcommonLeisure_nn3,
# LNcommonLeisure_nn4, LNcommonLeisure_nn5, LNcompanies_nn1, LNcompanies_nn4,
# LNgardens_nn2, LNgardens_nn5, LNhospitals_nn1, LNhospitals_nn4, LNhospitals_nn5,
# LNparking_nn1, LNparking_nn2, LNparking_nn5, LNrestaurants_nn1, LNrestaurants_nn2,
# LNrestaurants_nn4, LNretail_nn1, LNretail_nn4, LNretail_nn5, LNschools_nn2,
# LNschools_nn3, LNschools_nn4, LNparkArea, LNpublic_transport_nn1,
# LNpublic_transport_nn2, LNwater_nn1, parkAreaIntersected, TotalFinishedSF

Next, we manually select one predictor from each group of highly interrelated predictors. To do this, we swap each candidate from each of the 11 groups into an otherwise fixed model, refitting and testing prediction accuracy after every swap.

# Making a splitting process
inTrain = createDataPartition(y = paste(dataTrain$ConstCodeDscr,
                                        dataTrain$Roof_CoverDscr,
                                        dataTrain$ExtWallDscrPrim,
                                        dataTrain$tractID),
                              p = .75, list = FALSE)

# Split data into a training set and a test set
dataTrain.training = dataTrain[inTrain,]
dataTrain.test = dataTrain[-inTrain,]
#-----Construct comparison dataframe-----
varcomparison = dataTrain.test %>% 
  dplyr::select(price) %>% 
  st_drop_geometry()

#-----Sequentially run models-----
# Every candidate group goes through the identical swap-and-refit routine,
# so all groups are processed in a single loop rather than one copy per group.
allCandidates = c(cbindvars1, cbindvars2, cbindvars3, cbindvars4, cbindvars5,
                  cbindvars6, cbindvars7, cbindvars8, cbindvars9, cbindvars10)
for (i in allCandidates) {
  # Temporarily rename the candidate so the fixed formula below can reference it
  trainingmed <- rename(dataTrain.training, "medname_var1" = i)
  testingmed <- rename(dataTrain.test, "medname_var1" = i)
  reg.test1 = lm(price ~
                   designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + carStorageType + TotalFinishedSF + AcDscr + HeatingDscr + 
                   IntWallDscr + Roof_CoverDscr + nbrBaths + schoolCount + universitiesCount + stadiumsCount + highwaydistance + natureReservesCount + 
                   Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_pctAsians + tractID + LN_Census_areaperpeople + LN_water_nn1 +
                   medname_var1, data = trainingmed)
  varcomparison = varcomparison %>% mutate(predict.price = predict(reg.test1, newdata = testingmed))
  # Absolute percentage error for this candidate (predicted price in the denominator)
  varcomparison[i] = abs(varcomparison$predict.price - varcomparison$price) / varcomparison$predict.price
}

#-----Summarize comparison-----
varcomparison_summary <- colMeans(varcomparison,dims=1)
varcomparison_summary <- as.data.frame.list(varcomparison_summary)
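
With the summary in hand, the winner within each group is simply the candidate with the lowest mean absolute percentage error. A sketch of that lookup, using the park group (park_decision) as an example:

# Sketch: pick the candidate in one group with the lowest mean error
groupErrors <- unlist(varcomparison_summary[, park_decision])
names(which.min(groupErrors))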

Finally, our final selection of variables is as follows. The first group of predictors comprises the intrinsic features of each home (see Table 1).

IntrinsicFeatures = model.matrix(~ TotalFinishedSF+
                                   designCodeDscr+
                                   qualityCodeDscr+
                                   ConstCodeDscr+
                                   EffectiveYear+
                                   bsmtSF+
                                   carStorageType+
                                   AcDscr+
                                   HeatingDscr+
                                   ExtWallDscrSec+
                                   Roof_CoverDscr+
                                   nbrBaths - 1,
                                 data = trainData1) %>%
  as.data.frame()

stargazer(IntrinsicFeatures,
          type = "html",
          digits = 1,
          title = "Table 1. Summary statistics of predictors - intrinsic features",
          out.header = TRUE,
          covariate.labels = c("Total finished area square feet",
                               "Design type - Ranch",
                               "Design type -  2-3 story",
                               "Design type - Bi-level",
                               "Design type - Multi-story-townhouse",
                               "Design type - Split-level",
                               "Quality - Average +",
                               "Quality - Average ++",
                               "Quality - Excellent",
                               "Quality - Excellent +",
                               "Quality - Excellent ++",
                               "Quality - Exceptional 1",
                               "Quality - Exceptional 2",
                               "Quality - Fair",
                               "Quality - Good",
                               "Quality - Good +",
                               "Quality - Good ++",
                               "Quality - Low",
                               "Quality - Very good",
                               "Quality - Very good +",
                               "Quality - Very good ++",
                               "Construction type - Brick",
                               "Construction type - Concrete",
                               "Construction type - Frame",
                               "Construction type - Masonry",
                               "Construction type - Precast",
                               "Construction type - Veneer",
                               "Construction type - Wood",
                               "Year of construction or last major change",
                               "Car storage type - GRA",
                               "Car storage type - GRB",
                               "Car storage type - GRC",
                               "Car storage type - GRD",
                               "Car storage type - GRF",
                               "Car storage type - GRW",
                               "Air conditioning - Fan",
                               "Air conditioning - Cooler",
                               "Air conditioning - Whole house",
                               "Heating - Electric",
                               "Heating - Electric wall heat",
                               "Heating - Forced air",
                               "Heating - Gravity",
                               "Heating - Pump",
                               "Heating - Water",
                               "Heating - HVAC",
                               "Heating - Package unit",
                               "Heating - Radient floor",
                               "Heating - Ventilation only",
                               "Heating - Wall furnace",
                               "External wall - Block stucco",
                               "External wall - Brick on block",
                               "External wall - Brick veneer",
                               "External wall - Cedar",
                               "External wall - Cement board",
                               "External wall - Faux stone",
                               "External wall - Frame asbestos",
                               "External wall - Frame stucco",
                               "External wall - Frame wood/shake",
                               "External wall - Log",
                               "External wall - Log",
                               "External wall - Metal",
                               "External wall - Moss rock/flagstone",
                               "External wall - Painted block",
                               "External wall - Vinyl",
                               "Roof - Asphalt",
                               "Roof - Built-up",
                               "Roof - Clay tile",
                               "Roof - Concrete tile",
                               "Roof - Metal",
                               "Roof - Roll",
                               "Roof - Rubber membrane",
                               "Roof - Shake",
                               "Roof - Tar and gravel",
                               "Number of bathrooms"),
          flip = FALSE)
Table 1. Summary statistics of predictors - intrinsic features
Statistic N Mean St. Dev. Min Max
Total finished area square feet 11,363 1,955.5 892.6 0 10,188
Design type - Ranch 11,363 0.3 0.5 0 1
Design type - 2-3 story 11,363 0.5 0.5 0 1
Design type - Bi-level 11,363 0.03 0.2 0 1
Design type - Multi-story-townhouse 11,363 0.1 0.3 0 1
Design type - Split-level 11,363 0.1 0.3 0 1
Quality - Average + 11,363 0.1 0.2 0 1
Quality - Average ++ 11,363 0.1 0.2 0 1
Quality - Excellent 11,363 0.01 0.1 0 1
Quality - Excellent + 11,363 0.002 0.04 0 1
Quality - Excellent ++ 11,363 0.003 0.1 0 1
Quality - Exceptional 1 11,363 0.002 0.05 0 1
Quality - Exceptional 2 11,363 0.000 0.02 0 1
Quality - Fair 11,363 0.01 0.1 0 1
Quality - Good 11,363 0.3 0.5 0 1
Quality - Good + 11,363 0.1 0.2 0 1
Quality - Good ++ 11,363 0.1 0.3 0 1
Quality - Low 11,363 0.002 0.04 0 1
Quality - Very good 11,363 0.1 0.3 0 1
Quality - Very good + 11,363 0.02 0.1 0 1
Quality - Very good ++ 11,363 0.03 0.2 0 1
Construction type - Brick 11,363 0.01 0.1 0 1
Construction type - Concrete 11,363 0.001 0.02 0 1
Construction type - Frame 11,363 0.8 0.4 0 1
Construction type - Masonry 11,363 0.1 0.3 0 1
Construction type - Precast 11,363 0.000 0.01 0 1
Construction type - Veneer 11,363 0.000 0.01 0 1
Construction type - Wood 11,363 0.000 0.02 0 1
Year of construction or last major change 11,363 1,997.7 17.5 1,879 2,021
Car storage type - GRA 11,363 743.8 627.4 0 4,534
Car storage type - GRB 11,363 0.8 0.4 0 1
Car storage type - GRC 11,363 0.01 0.1 0 1
Car storage type - GRD 11,363 0.02 0.1 0 1
Car storage type - GRF 11,363 0.1 0.3 0 1
Car storage type - GRW 11,363 0.001 0.03 0 1
Air conditioning - Fan 11,363 0.001 0.03 0 1
Air conditioning - Cooler 11,363 0.002 0.04 0 1
Air conditioning - Whole house 11,363 0.02 0.1 0 1
Heating - Electric 11,363 0.6 0.5 0 1
Heating - Electric wall heat 11,363 0.02 0.2 0 1
Heating - Forced air 11,363 0.000 0.01 0 1
Heating - Gravity 11,363 0.9 0.3 0 1
Heating - Pump 11,363 0.002 0.05 0 1
Heating - Water 11,363 0.001 0.03 0 1
Heating - HVAC 11,363 0.1 0.2 0 1
Heating - Package unit 11,363 0.000 0.01 0 1
Heating - Radiant floor 11,363 0.000 0.01 0 1
Heating - Ventilation only 11,363 0.01 0.1 0 1
Heating - Wall furnace 11,363 0.000 0.01 0 1
External wall - Block stucco 11,363 0.01 0.1 0 1
External wall - Brick on block 11,363 0.001 0.02 0 1
External wall - Brick veneer 11,363 0.001 0.03 0 1
External wall - Cedar 11,363 0.1 0.2 0 1
External wall - Cement board 11,363 0.000 0.02 0 1
External wall - Faux stone 11,363 0.002 0.04 0 1
External wall - Frame asbestos 11,363 0.1 0.2 0 1
External wall - Frame stucco 11,363 0.001 0.02 0 1
External wall - Frame wood/shake 11,363 0.01 0.1 0 1
External wall - Log 11,363 0.2 0.4 0 1
External wall - Log 11,363 0.000 0.02 0 1
External wall - Metal 11,363 0.001 0.03 0 1
External wall - Moss rock/flagstone 11,363 0.01 0.1 0 1
External wall - Painted block 11,363 0.000 0.02 0 1
External wall - Vinyl 11,363 0.001 0.03 0 1
Roof - Asphalt 11,363 0.7 0.5 0 1
Roof - Built-up 11,363 0.001 0.02 0 1
Roof - Clay tile 11,363 0.004 0.1 0 1
Roof - Concrete tile 11,363 0.02 0.1 0 1
Roof - Metal 11,363 0.01 0.1 0 1
Roof - Roll 11,363 0.000 0.01 0 1
Roof - Rubber membrane 11,363 0.01 0.1 0 1
Roof - Shake 11,363 0.004 0.1 0 1
Roof - Tar and gravel 11,363 0.000 0.02 0 1
Number of bathrooms 11,363 2.5 1.0 0.0 12.0


The second group of predictors concerns accessibility to recreation, services, transportation, and other types of amenities (see Table 2).

Amenities = model.matrix(~ cinemasCount+
                                   stadiumsCount+
                                   fitnessCenter_nn3+
                                   LN_commonLeisure_nn1+
                                   parkingArea+
                                   LN_restaurants_nn1+
                                   LN_hospitals_nn1+
                                   LN_retail_nn4+
                                   LN_parkArea+
                                   LN_water_nn1+
                                   LN_public_transport_nn2
                                   - 1,
                                 data = trainData1) %>%
  as.data.frame()


stargazer(Amenities,
          type = "html",
          title = "Table 2. Summary statistics of predictors - accessibility to recreation, services and other amenities",
          digits = 1,
          out.header = TRUE,
          covariate.labels = c("# of cinemas within 1000 m buffer",
                               "# of stadiums within 1000 m buffer",
                               "Avg. distance to 3 nearest fitness centers",
                               "Log. distance to nearest common leisure ground",
                               "Total area of parking within 1000 m buffer",
                               "Log. distance to nearest restaurant",
                               "Log. distance to nearest hospital",
                               "Avg. distance to four nearest retail points, log.",
                               "Total area of parks within 1000 m buffer",
                               "Log. distance to nearest water body",
                               "Average distance to the two public transit stations, log."),
          flip = FALSE)
Table 2. Summary statistics of predictors - accessibility to recreation, services and other amenities
Statistic N Mean St. Dev. Min Max
# of cinemas within 1000 m buffer 11,363 0.2 2.0 0 40
# of stadiums within 1000 m buffer 11,363 0.1 1.3 0 22
Avg. distance to 3 nearest fitness centers 11,363 3,187.2 3,612.7 38.8 34,166.8
Log. distance to nearest common leisure ground 11,363 7.6 1.0 3.1 10.4
Total area of parking within 1000 m buffer 11,363 37,985.3 47,920.3 0.0 450,407.2
Log. distance to nearest restaurant 11,363 6.9 0.8 2.4 9.4
Log. distance to nearest hospital 11,363 8.3 0.7 4.8 10.5
Avg. distance to four nearest retail points, log. 11,363 7.2 0.9 2.2 10.2
Total area of parks within 1000 m buffer, log. 11,363 10.3 3.8 0.0 14.0
Log. distance to nearest water body 11,363 6.4 0.7 3.4 8.7
Avg. distance to two nearest public transit stations, log. 11,363 9.1 0.9 4.7 10.5


The third group of predictors is demographic and socioeconomic: the block-group-level median rent and the block-group-level percentage of African American residents.

Social = model.matrix(~ Census_MedRent+
                        Census_pctAfricanAmericans
                        - 1,
                      data = trainData1) %>%
  as.data.frame()


stargazer(Social,
          type = "html",
          title = "Table 3. Summary statistics of predictors - demographic and socioeconomic",
          digits = 1,
          out.header = TRUE,
          covariate.labels = c("Block-group median rent",
                               "Block-group percentage of African Americans"),
          flip = FALSE)
Table 3. Summary statistics of predictors - demographic and socioeconomic
Statistic N Mean St. Dev. Min Max
Block-group median rent 11,363 1,403.9 663.0 0 3,281
Block-group percentage of African Americans 11,363 0.7 1.5 0.0 17.6


Finally, we have predictors related to the spatial structure. The first is the census tract that a home belongs to. The second is the “spatial lag”, that is, the average price of the five nearest homes. We include only one of these two variables in any given prediction model.
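
As a quick sketch of how such a spatial lag can be computed with spdep's k-nearest-neighbor helpers (the same functions we use later to map the spatial lag of errors), where `homes` is a hypothetical sf point layer with a `price` column, not an object from our actual pipeline:

# Sketch: 5-nearest-neighbor spatial lag of price with spdep
# (`homes` is a hypothetical sf point layer with a `price` column)
coords = st_coordinates(homes)
neighborList = knn2nb(knearneigh(coords, k = 5))    # 5 nearest neighbors
weights = nb2listw(neighborList, style = "W")       # row-standardized weights
homes$spatialLag = lag.listw(weights, homes$price)  # mean price of the 5 neighbors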

After feature selection, we can now plot a correlation matrix of the selected numeric variables (Figure 5).

#-----Numeric columns-----
numericColumns4 = unlist(lapply(trainData3, is.numeric))
trainDataNumeric4 = trainData3[,numericColumns4] %>% 
  dplyr::select(-toPredict)%>%
  na.omit()%>%
  st_drop_geometry()

#-----Plot correlation matrix-----
Matrix <- select_if(trainDataNumeric4, is.numeric) %>% na.omit()
ggcorrplot(
  cor(Matrix),
  p.mat = cor_pmat(Matrix),
  colors = c("#EF8A62", "white", "#67A9CF"),
  type="lower",
  lab = TRUE,
  lab_size = 2.5,
  tl.cex = 9,
  insig = "blank")

Figure 5. Correlation matrix of the selected numeric variables

3.3 Plotting and mapping variables of interest

Figure 6 shows scatter plots of home sale prices against four predictors of interest. The first two plots depict the relationship between home sale prices and racial composition: while a high percentage of white residents is strongly correlated with higher home prices, the percentages of several other racial groups (e.g., Asian residents) are negatively correlated with prices. The third scatter plot shows the relationship between accessibility to hospitals and home prices; the relationship is not entirely linear, as being either very near or very far from a hospital may indicate lower home prices. The last plot indicates that there is no straightforward relationship between rent and home prices.

trainDataNumeric5 = trainDataNumeric4 %>%
  gather(key,value, -price)

# 1
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "Census_pctAsians")
ggplot(trainDataNumeric5, aes(x = value, y = price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="Scatter plot of price to block-group percentage of Asian residents") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

# 2
trainDataNumeric5 = trainDataNumeric4 %>%
  gather(key,value, -price)
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "Census_pctWhite")
ggplot(trainDataNumeric5, aes(x = value, y = price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="Scatter plot of price to block-group percentage of white residents") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

# 3
trainDataNumeric5 = trainDataNumeric4 %>%
  gather(key,value, -price)
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "LN_hospitals_nn1")%>%
  mutate(EXP=exp(value))
ggplot(trainDataNumeric5, aes(x = EXP, y = price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="Scatter plot of price to logarithmic distance to nearest hospital") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

# 4
trainDataNumeric5 = trainDataNumeric4 %>%
  gather(key,value, -price)
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "Census_MedRent")
ggplot(trainDataNumeric5, aes(x = value, y = price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="Scatter plot of price to block-group median rent") +
  plotTheme()+
  theme(axis.text=element_text(size=12),axis.title=element_text(size=2))

Figure 6. Scatter plots of price to four predictors of interest

Figure 7 shows the spatial distribution of three predictors of interest. The first map shows the distribution of total finished home area, which is clearly clustered in space. The second and third maps show that living near highways or water bodies confers no particular price advantage.

plotuse = trainData[trainData$toPredict == 0,]
ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = plotuse, aes(color = q5(TotalFinishedSF)), size = 2,alpha=0.3)+
  scale_color_manual(values = palette5,
                    labels = qBr(plotuse, "TotalFinishedSF"),
                    name = "Total finished area sf") +
  labs(
    title = "Distribution of total finished area square feet",
    subtitle = "Square Meters; In tract unit") +
  mapTheme() + 
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

highwayunion <- st_buffer(st_union(highway),500)%>%st_sf()
highwayunion2 <- st_buffer(st_union(highway),2000)%>%st_sf()
highwayunion3 <- st_buffer(st_union(highway),4000)%>%st_sf()
ggplot()+
  geom_sf(data = highwayunion3, color="#DD1144", size = 1,fill="white",linetype = "dashed")+
  geom_sf(data = highwayunion2, color="#DD1144", size = 1.25,fill="white",linetype = "dashed")+
  geom_sf(data = highwayunion, color="#DD1144", size = 1.35,fill="white",linetype = "dashed")+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = plotuse, aes(color = q5(price)), size = 3,alpha=0.5)+
  scale_color_manual(values = palette5, 
                    labels = qBr(trainData, "price"),
                    name = "Home sale prices")+
  labs(
    title = "Home prices with distance to highways",
    subtitle = "Real Dollars; In tract unit") +
  mapTheme() + 
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

water_plot <-st_intersection(boulder,water)
ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = trainData, aes(color = q5(price)), size = 3,alpha=0.5)+
  geom_sf(data = water_plot,fill = alpha("#2166ac", 0.4),color="darkblue")+
  scale_color_manual(values = palette5, 
                    labels = qBr(trainData, "price"),
                    name = "Home sale prices")+
  labs(
    title = "Home prices with distance to water bodies",
    subtitle = "Real Dollars; In tract unit") +
  mapTheme() + 
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

Figure 7. Distribution map of three variables of interest

4 Results

4.1 Linear regression using tract or spatial lag as geospatial variable

First, we built two OLS models using the predictors selected in the previous section. The first model uses the tract ID as its geospatial component; the second uses the spatial lag (the average price of the five nearest homes). The first model fits the data better (adjusted R² of 0.723 versus 0.676; see Table 4).

#-----Use Tract as geospatial variable-----
dropList2 = c("geometry", "MUSA_ID", "spatialLag", "muni","toPredict")
reg2Data = dataTrain.training[, !colnames(dataTrain.training) %in% dropList2]%>%
  st_drop_geometry()
reg.baseline.fixedEffect = lm(price ~ ., data = reg2Data)

#-----Use spatial lag as geospatial variable-----

dropList3 = c("geometry", "MUSA_ID", "tractID", "muni", "toPredict")
reg3Data = dataTrain.training[, !colnames(dataTrain.training) %in% dropList3]%>%
  st_drop_geometry()

reg.baseline.spatialLag = lm(price ~ ., data = reg3Data)
stargazer(reg.baseline.fixedEffect, reg.baseline.spatialLag, 
          type = "html",
          title = "Table 4. Model summary of the first two linear models",
          dep.var.labels = "Home Price")
Table 4. Model summary of the first two linear models
Dependent variable:
Home Price
(1) (2)
designCodeDscr2-3 Story -61,525.080*** -73,337.190***
(9,682.589) (10,262.450)
designCodeDscrBi-level 29,626.900 46,117.420**
(20,563.460) (21,795.010)
designCodeDscrMULTI STORY- TOWNHOUSE -192,607.200*** -216,920.900***
(14,095.190) (14,435.240)
designCodeDscrSplit-level -241.016 11,943.390
(13,861.060) (14,682.620)
qualityCodeDscrAVERAGE + -71,554.890*** -51,193.150***
(15,006.470) (15,659.110)
qualityCodeDscrAVERAGE ++ -13,692.520 -2,230.711
(15,335.820) (16,141.990)
qualityCodeDscrEXCELLENT 1,010,949.000*** 1,117,186.000***
(34,615.000) (36,682.630)
qualityCodeDscrEXCELLENT + 1,266,494.000*** 1,369,965.000***
(71,917.300) (77,133.140)
qualityCodeDscrEXCELLENT++ 1,876,650.000*** 2,016,477.000***
(58,643.420) (62,669.390)
qualityCodeDscrEXCEPTIONAL 1 1,012,442.000*** 1,048,199.000***
(78,669.580) (84,083.580)
qualityCodeDscrEXCEPTIONAL 2 1,404,832.000*** 1,458,094.000***
(181,896.500) (195,170.800)
qualityCodeDscrFAIR -35,318.250 -20,918.690
(39,286.270) (42,066.970)
qualityCodeDscrGOOD 42,293.880*** 62,876.260***
(11,038.170) (11,240.700)
qualityCodeDscrGOOD + 67,923.690*** 102,103.000***
(17,729.940) (18,496.620)
qualityCodeDscrGOOD ++ 154,544.300*** 174,285.700***
(16,882.780) (17,258.450)
qualityCodeDscrLOW -110,504.400 -80,936.020
(77,416.260) (83,038.960)
qualityCodeDscrVERY GOOD 240,317.000*** 248,371.000***
(17,834.630) (18,682.310)
qualityCodeDscrVERY GOOD + 469,899.200*** 554,168.600***
(30,146.120) (31,973.170)
qualityCodeDscrVERY GOOD ++ 635,278.200*** 679,564.700***
(25,831.300) (27,272.360)
ConstCodeDscrBrick -28,723.510 6,391.292
(53,861.490) (57,599.320)
ConstCodeDscrConcrete 38,692.340 51,261.480
(147,490.100) (158,346.700)
ConstCodeDscrFrame -41,347.150 -30,860.920
(43,304.590) (46,587.930)
ConstCodeDscrMasonry -18,182.580 16,126.440
(45,057.910) (48,359.640)
ConstCodeDscrPrecast -948,140.600*** -994,463.100***
(220,728.900) (237,563.300)
ConstCodeDscrVeneer -909,458.500*** -845,330.300**
(308,020.500) (331,729.800)
ConstCodeDscrWood -50,321.780 -69,557.380
(168,066.600) (181,118.600)
EffectiveYear 1,926.039*** 1,328.303***
(314.811) (333.272)
carStorageTypeGRA 41,655.400*** 11,350.240
(14,297.150) (14,934.700)
carStorageTypeGRB 176,000.300*** 216,155.200***
(30,496.730) (32,622.710)
carStorageTypeGRC -7,766.679 -53,163.760*
(26,124.030) (27,812.100)
carStorageTypeGRD 76,801.220*** 86,583.570***
(15,633.650) (16,645.780)
carStorageTypeGRF 99,265.880 110,511.600
(107,078.000) (115,150.400)
carStorageTypeGRW -59,842.100 -85,015.040
(115,108.600) (123,848.400)
TotalFinishedSF 146.352*** 145.518***
(7.334) (7.732)
AcDscrAttic Fan -117,810.800 -50,550.540
(75,771.260) (81,406.260)
AcDscrEvaporative Cooler -19,990.290 6,904.208
(23,603.440) (25,256.210)
AcDscrWhole House 9,053.811 -1,673.603
(8,720.422) (9,307.403)
HeatingDscrElectric -143,900.300*** -130,918.400***
(44,264.180) (47,411.080)
HeatingDscrElectric Wall Heat (1500W) 155,979.100 122,377.700
(230,159.300) (248,163.400)
HeatingDscrForced Air -116,005.600*** -126,335.900***
(40,683.710) (43,730.760)
HeatingDscrGravity -102,913.800 -113,328.100
(74,320.670) (79,844.520)
HeatingDscrHeat Pump -397,416.500*** -400,860.700***
(121,129.800) (130,180.100)
HeatingDscrHot Water -50,252.650 -36,486.070
(42,137.740) (45,258.220)
HeatingDscrNo HVAC 230,464.500 43,722.310
(217,392.700) (234,400.900)
HeatingDscrPackage Unit 157,142.800 -2,086.943
(338,048.500) (363,979.600)
HeatingDscrRadiant Floor 123,114.800** 208,913.300***
(54,038.300) (58,011.220)
HeatingDscrVentilation Only -588,290.200* -592,421.800*
(320,701.000) (345,672.900)
HeatingDscrWall Furnace 62,108.080 56,366.310
(55,362.130) (59,286.460)
IntWallDscrDrywall 10,679.120 38,266.890
(49,063.880) (52,695.610)
IntWallDscrHardwood Panel -20,622.780 -778.694
(55,280.360) (59,303.110)
IntWallDscrPlaster -57,775.070 -5,065.372
(52,620.060) (55,983.550)
IntWallDscrPlywood -18,345.880 -1,174.350
(81,618.320) (87,675.190)
IntWallDscrUnfinished -31,997.780 9,643.200
(73,826.300) (79,478.860)
IntWallDscrWall Board 553.697 12,592.140
(56,943.750) (61,068.800)
Roof_CoverDscrAsphalt -17,921.740** -3,794.725
(8,272.463) (8,789.940)
Roof_CoverDscrBuilt-Up 483,310.700*** 359,775.700***
(126,684.900) (136,170.300)
Roof_CoverDscrClay Tile -56,501.210 -36,517.790
(49,024.820) (52,636.530)
Roof_CoverDscrConcrete Tile -91,679.200*** -177,392.000***
(26,518.510) (26,465.580)
Roof_CoverDscrMetal 181,614.000*** 220,596.400***
(27,406.370) (29,264.030)
Roof_CoverDscrRoll -67,168.080 -68,023.340
(299,594.500) (322,834.500)
Roof_CoverDscrRubber Membrane 134,316.100*** 102,802.000***
(29,153.580) (30,674.920)
Roof_CoverDscrShake -48,848.030 -22,533.350
(47,479.270) (51,065.590)
Roof_CoverDscrTar and Gravel 105,732.500 179,902.400
(152,062.600) (163,326.100)
nbrBaths 33,868.490*** 36,804.610***
(5,781.562) (6,183.466)
schoolCount -962.571 10,294.260**
(4,791.532) (4,481.424)
universitiesCount -77,909.390*** -119,547.700***
(18,188.290) (10,960.880)
stadiumsCount 4,447.665 7,069.038**
(3,625.631) (3,213.090)
highwaydistance -3.069 -20.605***
(3.270) (1.834)
natureReservesCount -5,882.494 -16,536.050***
(9,524.478) (5,520.931)
Census_MedHHInc 0.378** -0.173
(0.152) (0.131)
Census_MedRent -15.297** -44.334***
(7.218) (6.077)
Census_pctWhite 2,604.623** 1,900.289**
(1,096.884) (885.101)
Census_pctAsians 3,269.440** -4,679.027***
(1,628.097) (1,301.103)
tractID08013012102 -448,631.800***
(36,296.300)
tractID08013012103 -382,307.800***
(44,340.770)
tractID08013012104 -385,133.800***
(48,444.720)
tractID08013012105 -543,179.300***
(45,152.300)
tractID08013012201 -49,498.930
(46,762.680)
tractID08013012202 -422,942.000***
(70,185.190)
tractID08013012203 -768,888.800***
(49,275.480)
tractID08013012204 209,585.500**
(84,550.600)
tractID08013012300 -168,037.400
(226,633.900)
tractID08013012401 -421,112.800***
(63,564.240)
tractID08013012501 -654,764.400***
(61,342.540)
tractID08013012505 -357,433.700***
(47,852.230)
tractID08013012507 -628,223.500***
(56,630.930)
tractID08013012508 -614,846.400***
(62,589.260)
tractID08013012509 -594,605.500***
(51,660.510)
tractID08013012510 -434,298.500***
(52,450.810)
tractID08013012511 -571,052.800***
(65,529.860)
tractID08013012603 -524,040.800***
(56,965.920)
tractID08013012605 -406,341.200***
(145,939.900)
tractID08013012607 -426,905.800***
(101,051.700)
tractID08013012608 -554,549.500***
(65,167.220)
tractID08013012701 -788,645.000***
(44,755.900)
tractID08013012705 -742,742.700***
(70,404.100)
tractID08013012707 -390,046.800***
(70,961.600)
tractID08013012708 -819,907.200***
(54,783.970)
tractID08013012709 -788,277.700***
(65,942.700)
tractID08013012710 -536,211.900***
(52,390.100)
tractID08013012800 -1,043,787.000***
(58,048.600)
tractID08013012903 -935,673.600***
(64,940.990)
tractID08013012904 -793,029.400***
(60,463.700)
tractID08013012905 -746,718.700***
(69,086.170)
tractID08013012907 -856,652.300***
(63,893.740)
tractID08013013003 -770,952.500***
(54,257.500)
tractID08013013004 -779,147.700***
(62,637.610)
tractID08013013005 -746,650.900***
(55,508.960)
tractID08013013006 -675,833.200***
(57,053.800)
tractID08013013201 -804,474.400***
(78,689.960)
tractID08013013202 -338,277.500***
(77,967.460)
tractID08013013205 -910,670.800***
(55,886.960)
tractID08013013207 -770,118.300***
(84,001.400)
tractID08013013208 -722,828.100***
(78,778.180)
tractID08013013210 -771,401.700***
(67,479.380)
tractID08013013211 -877,989.800***
(60,909.200)
tractID08013013212 -727,098.400***
(65,718.830)
tractID08013013213 -867,579.700***
(60,071.950)
tractID08013013302 -670,544.300***
(74,608.710)
tractID08013013305 -773,391.800***
(80,955.350)
tractID08013013306 -767,759.400***
(78,501.370)
tractID08013013307 -669,963.600***
(87,155.060)
tractID08013013308 -729,376.600***
(79,847.910)
tractID08013013401 -741,330.800***
(79,113.780)
tractID08013013402 -865,159.800***
(72,177.160)
tractID08013013503 -787,201.700***
(76,199.850)
tractID08013013505 -795,870.600***
(80,672.650)
tractID08013013506 -916,214.200***
(72,304.540)
tractID08013013507 -907,912.300***
(76,833.040)
tractID08013013508 -930,338.400***
(71,371.270)
tractID08013013601 -686,284.000***
(69,950.810)
tractID08013013602 -648,712.700***
(79,811.450)
tractID08013013701 -768,835.200***
(45,767.080)
tractID08013013702 -891,851.600***
(55,528.660)
tractID08013060600 -945,194.100***
(55,693.720)
tractID08013060700 -728,928.400***
(65,179.130)
tractID08013060800 -744,254.100***
(62,324.780)
tractID08013060900 -796,688.600***
(66,105.560)
tractID08013061300 -1,081,866.000***
(58,729.380)
tractID08013061400 -1,021,191.000***
(56,103.650)
LN_Census_areaperpeople 7,684.633 7,459.608*
(4,875.449) (3,893.895)
LN_water_nn1 -10,268.310 -8,694.154
(6,489.389) (5,779.318)
LN_hospitals_nn1 88,406.900*** 64,484.300***
(18,045.600) (8,446.433)
LN_commonLeisure_nn4 36,197.650*** 144,066.900***
(13,267.680) (5,751.646)
LN_schools_nn2 -48,317.100*** -56,434.100***
(11,748.620) (10,235.480)
LN_restaurants_nn1 895.498 15,547.390***
(6,664.239) (5,861.988)
LN_clinics_nn1 15,331.590 -38,827.820***
(10,845.390) (6,855.422)
LN_public_transport_nn2 -116,324.600*** -190,805.600***
(21,546.910) (6,155.473)
LN_parkArea -6,932.354*** -1,871.241
(1,486.819) (1,322.534)
bsmtSF -7.662 -19.474***
(6.835) (7.220)
LN_companies_nn5 27,005.380* 17,890.070**
(13,840.510) (7,456.896)
ExtWallDscrPrimBlock Stucco 184,720.800** 195,557.200**
(89,088.040) (95,824.730)
ExtWallDscrPrimBrick on Block 332,069.500*** 280,010.400***
(85,204.800) (91,335.230)
ExtWallDscrPrimBrick Veneer 128,156.200 50,301.600
(79,860.080) (85,952.310)
ExtWallDscrPrimCedar 83,571.080 5,448.035
(137,576.000) (148,163.000)
ExtWallDscrPrimCement Board 81,967.420 -16,892.040
(79,638.740) (85,672.390)
ExtWallDscrPrimFaux Stone 234,193.200** 139,217.000
(90,929.110) (97,841.970)
ExtWallDscrPrimFrame Asbestos 312,646.900 106,297.800
(311,419.500) (335,380.900)
ExtWallDscrPrimFrame Stucco 129,250.400 66,118.900
(80,414.020) (86,522.730)
ExtWallDscrPrimFrame Wood/Shake 168,519.700** 92,364.070
(79,254.870) (85,272.770)
ExtWallDscrPrimLog 189,890.500** 140,692.600
(88,010.530) (94,446.580)
ExtWallDscrPrimMetal 107,363.000 94,608.930
(132,831.500) (142,834.600)
ExtWallDscrPrimMoss Rock/Flagstone 161,191.100* 118,441.700
(83,406.930) (89,693.060)
ExtWallDscrPrimPainted Block 139,408.000 25,325.950
(99,646.170) (107,314.800)
ExtWallDscrPrimStrawbale 123,930.300 -53,479.770
(238,308.400) (255,697.400)
ExtWallDscrPrimVinyl 181,857.500** 92,318.990
(88,287.310) (94,936.110)
fitnessCenter_nn3 -21.114*** -9.621***
(4.563) (2.311)
gardensCount -1,227.576 198.988
(1,891.882) (1,476.265)
Constant -2,828,696.000*** -1,758,924.000***
(687,591.900) (681,401.700)
Observations 8,837 8,837
R2 0.728 0.680
Adjusted R2 0.723 0.676
Residual Std. Error 297,420.200 (df = 8668) 321,558.000 (df = 8735)
F Statistic 138.069*** (df = 168; 8668) 183.411*** (df = 101; 8735)
Note: *p<0.1; **p<0.05; ***p<0.01


We then use the two models to predict home prices in the test set. In the first model, the mean absolute error (MAE) is about $136,000 and the mean absolute percentage error (MAPE) is about 23%. In the second model, the MAE is about $159,000 and the MAPE is about 27% (see Table 5). The first model is therefore the more accurate of the two.
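
For reference, writing $y_i$ for the observed price and $\hat{y}_i$ for the predicted price of test home $i$ (with $n$ test homes), the two metrics computed below are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i}$$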

baseline.test.error = dataTrain.test %>%
  mutate(
    # Predicted price
    price.predict.fixedEffect = predict(reg.baseline.fixedEffect, dataTrain.test),
    price.predict.spatialLag = predict(reg.baseline.spatialLag, dataTrain.test),
    # Residuals: predicted minus observed price
    price.error.fixedEffect = price.predict.fixedEffect - price,
    price.error.spatialLag = price.predict.spatialLag - price,
    # Absolute residuals
    price.absError.fixedEffect = abs(price.predict.fixedEffect - price),
    price.absError.spatialLag = abs(price.predict.spatialLag - price),
    # Absolute percentage errors
    price.ape.fixedEffect = (abs(price.predict.fixedEffect - price)) / price,
    price.ape.spatialLag = (abs(price.predict.spatialLag - price)) / price
  ) %>% dplyr::select(starts_with("price"))

MAE1 = mean(baseline.test.error$price.absError.fixedEffect, na.rm = T)
MAE2 = mean(baseline.test.error$price.absError.spatialLag, na.rm = T)
MAPE1 = mean(baseline.test.error$price.ape.fixedEffect, na.rm = T)
MAPE2 = mean(baseline.test.error$price.ape.spatialLag, na.rm = T)

MAE = c(MAE1, MAE2) %>% format(., digits = 3)
MAPE = c(MAPE1, MAPE2) %>% format(., digits = 3)
Model = c("Model with Tract ID", "Model with Spatial Lag")
summaryTable1 = cbind(Model, MAE, MAPE)

kable(summaryTable1, digits = 1, caption = "Table 5. Prediction precision of the first two models") %>%
  kable_classic(full_width = T)%>%
  footnote()
Table 5. Prediction precision of the first two models
Model MAE MAPE
Model with Tract ID 135877 0.231
Model with Spatial Lag 158782 0.274


We plotted the distribution of prediction errors of the first two models (see Figure 8). Although both models account for geospatial factors, the maps show that the errors remain strongly clustered in space.

ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = baseline.test.error, aes(color = q5(price.absError.fixedEffect)), size = 3,alpha=0.5)+
  scale_color_manual(values = palette5,
                     labels = qBr(baseline.test.error, "price.absError.fixedEffect"),
                     name = "MAE")+
  labs(subtitle = "In Tract") +
  mapTheme() 



ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = baseline.test.error, aes(color = q5(price.absError.spatialLag)), size = 3,alpha=0.5)+
  scale_color_manual(values = palette5,
                     labels = qBr(baseline.test.error, "price.absError.spatialLag"),
                     name = "MAE")+
  labs(subtitle = "In Spatial Lag") +
  mapTheme()

Figure 8. Distribution of absolute prediction error of the first two linear models

An analysis with Moran’s I confirms this spatial clustering: both models suffer from significant spatial autocorrelation of errors (Figure 9). This suggests that our models still need improvement in how they capture spatial relationships.

# Moran's I of fixedEffect model
coords.test = st_coordinates(baseline.test.error)
neighborList.test = knn2nb(knearneigh(coords.test, 5))
spatialWeights.test = nb2listw(neighborList.test, style = "W")
moranTest <- moran.mc(baseline.test.error$price.absError.fixedEffect, spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)]))+
  geom_histogram(fill="#69b3a2",binwidth = 0.01)+
  geom_vline(aes(xintercept = moranTest$statistic), color = "#DC986D", size = 2) +
  scale_x_continuous(limits = c(-1, 1))+
  labs(x = "Model using tract ID")+
  plotTheme()

# Moran's I of Spatiallag model
coords.test = st_coordinates(baseline.test.error)
neighborList.test = knn2nb(knearneigh(coords.test, 5))
spatialWeights.test = nb2listw(neighborList.test, style = "W")
moranTest <- moran.mc(baseline.test.error$price.absError.spatialLag, spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)]))+
  geom_histogram(fill="#69b3a2",binwidth = 0.01)+
  geom_vline(aes(xintercept = moranTest$statistic), color = "#DC986D", size = 2) +
  scale_x_continuous(limits = c(-1, 1))+
  labs(x = "Model using spatial lag")+
  plotTheme()

Figure 9. Moran test for the first two linear models

4.2 An HLM model

Rather than having a fixed, independent effect on home prices, the spatial structure may alter the way in which other factors influence home prices. We therefore turn to hierarchical linear modeling (HLM), which is designed to account for the hierarchical, or nested, structure of the data: here, homes nested within census tracts.
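
To make the grouping syntax concrete, here is a minimal, self-contained sketch of a random-intercept model in lme4 (the data frame `toy` and its columns are invented for illustration only). The term (1 | tract) gives each tract its own intercept, drawn from a shared normal distribution, which is how the full model below treats tractID:

# Minimal random-intercept sketch with lme4 (toy data, for illustration only)
set.seed(1)
toy = data.frame(
  price = rnorm(200, mean = 500000, sd = 100000),
  sqft  = runif(200, 800, 4000),
  tract = factor(sample(LETTERS[1:10], 200, replace = TRUE))
)
# One fixed slope for sqft; tract-specific intercepts drawn from a common distribution
toyModel = lmer(price ~ sqft + (1 | tract), data = toy)
summary(toyModel)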

dataTrain = trainData3[trainData3$toPredict == 0,]
dataTrain$muni[is.na(dataTrain$muni)] = 'Other'

# To avoid errors when fitting the HLM model, relabel rare categories
# (each rare level is prefixed with "other")

dataTrain_merge = dataTrain

list1 = c("Electric Wall Heat (1500W)", "No HVAC", "Package Unit", "Heat Pump", "Gravity",
          "Radiant Floor", "Ventilation Only", "Wall Furnace", "")
for (i in list1){
  dataTrain_merge$HeatingDscr[dataTrain_merge$HeatingDscr == i] = glue("other{i}")
}

list2 = c("Concrete", "Precast", "Veneer", "Wood", "", "Brick")
for (i in list2){
  dataTrain_merge$ConstCodeDscr[dataTrain_merge$ConstCodeDscr == i] = glue("other{i}")
}

list3 = c("", "Block Stucco", "Cedar", "Faux Stone", "Frame Asbestos",
          "Log", "Metal", "Painted Block", "Strawbale", "Vinyl")
for (i in list3){
  dataTrain_merge$ExtWallDscrPrim[dataTrain_merge$ExtWallDscrPrim == i] = glue("other{i}")
}

list4 = c("", "Built-Up", "Clay Tile", "Roll", "Shake", "Tar and Gravel")
for (i in list4){
  dataTrain_merge$Roof_CoverDscr[dataTrain_merge$Roof_CoverDscr == i] = glue("other{i}")
}

list5 = c("08013012204","08013012300","08013012202","08013012605",
          "08013012607","08013012705","08013012707","08013012709",
          "08013013201","08013013202","08013013205","08013013602")
for (i in list5){
  dataTrain_merge$tractID[dataTrain_merge$tractID == i] = glue("other{i}")
}

dataTrain = dataTrain_merge

# Split training and test data-sets
rownames(dataTrain) <- seq(length = nrow(dataTrain))
inTrain = createDataPartition(y = paste(
  dataTrain$ConstCodeDscr,
  dataTrain$qualityCodeDscr,
  dataTrain$tractID),
                              p = .75, list = FALSE)

dataTrain.training = dataTrain[inTrain,]
dataTrain.test = dataTrain[-inTrain,]
# HLM model using tract ID
reg.lmer.fixedEffect = lmer(price ~ 
                              designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + bsmtSF + carStorageType + AcDscr + HeatingDscr + ExtWallDscrSec + Roof_CoverDscr + nbrBaths + schoolNo + universitiesCount +  parkingArea + cinemasCount + stadiumsCount + fitnessCenter_nn3 + Census_MedRent + Census_pctAfricanAmericans + LN_commonLeisure_nn1 + LN_hospitals_nn1 +LN_restaurants_nn1 + LN_retail_nn4 + LN_parkArea + LN_public_transport_nn2 +LN_water_nn1 + TotalFinishedSF+ 
                              (1|tractID), data = dataTrain.training)

lmer.test.error = dataTrain.test %>%
  mutate(
    price.predict = predict(reg.lmer.fixedEffect, newdata = dataTrain.test),
    price.error = price.predict - price,
    price.absError = abs(price.predict - price),
    price.ape = (abs(price.predict - price)) / price.predict)

MAPE3 = mean(lmer.test.error$price.ape, na.rm = T)
stargazer(reg.lmer.fixedEffect, 
          type = "html",
          title = "Table 5. Model summary of the HLM model",
          dep.var.labels = "Home Price")
Table 6. Model summary of the HLM model
Dependent variable:
Home Price
designCodeDscr2-3 Story -72,545.540***
(9,711.402)
designCodeDscrBi-level 30,283.060
(20,624.210)
designCodeDscrMULTI STORY- TOWNHOUSE -197,077.800***
(14,922.650)
designCodeDscrSplit-level -162.388
(14,124.740)
qualityCodeDscrAVERAGE + -46,276.920***
(14,774.230)
qualityCodeDscrAVERAGE ++ -20,164.300
(15,342.600)
qualityCodeDscrEXCELLENT 1,023,333.000***
(33,629.360)
qualityCodeDscrEXCELLENT + 1,191,657.000***
(67,703.320)
qualityCodeDscrEXCELLENT++ 1,911,265.000***
(56,798.760)
qualityCodeDscrEXCEPTIONAL 1 1,063,068.000***
(73,317.910)
qualityCodeDscrEXCEPTIONAL 2 1,563,574.000***
(188,391.100)
qualityCodeDscrFAIR -33,157.400
(38,129.800)
qualityCodeDscrGOOD 40,466.930***
(11,253.580)
qualityCodeDscrGOOD + 64,423.320***
(17,715.860)
qualityCodeDscrGOOD ++ 158,647.700***
(16,967.090)
qualityCodeDscrLOW -130,494.400
(80,036.390)
qualityCodeDscrVERY GOOD 236,203.600***
(17,994.590)
qualityCodeDscrVERY GOOD + 465,059.800***
(29,285.550)
qualityCodeDscrVERY GOOD ++ 603,071.200***
(25,559.450)
ConstCodeDscrBrick -12,859.100
(50,952.010)
ConstCodeDscrConcrete 28,872.220
(143,797.500)
ConstCodeDscrFrame -10,576.460
(39,058.650)
ConstCodeDscrMasonry 13,341.520
(40,653.920)
ConstCodeDscrPrecast -821,225.600***
(231,648.100)
ConstCodeDscrVeneer -926,917.100***
(322,246.800)
ConstCodeDscrWood -42,665.240
(172,498.100)
EffectiveYear 1,584.046***
(297.922)
bsmtSF -7.040
(6.916)
carStorageTypeGRA 37,405.360***
(14,317.640)
carStorageTypeGRB 147,117.600***
(28,955.770)
carStorageTypeGRC -4,879.090
(26,240.690)
carStorageTypeGRD 61,707.630***
(15,850.470)
carStorageTypeGRF 107,532.100
(110,101.100)
carStorageTypeGRW -58,636.150
(104,053.800)
AcDscrAttic Fan -79,899.990
(71,283.740)
AcDscrEvaporative Cooler -13,343.290
(23,587.530)
AcDscrWhole House 10,361.860
(8,878.835)
HeatingDscrElectric -105,542.900***
(40,041.750)
HeatingDscrElectric Wall Heat (1500W) 49,495.130
(236,278.800)
HeatingDscrForced Air -88,593.730**
(35,943.780)
HeatingDscrGravity -82,785.920
(71,243.340)
HeatingDscrHeat Pump -318,031.600***
(115,319.100)
HeatingDscrHot Water -28,341.360
(37,468.830)
HeatingDscrNo HVAC 213,396.900
(219,931.900)
HeatingDscrPackage Unit 3,865.141
(338,362.700)
HeatingDscrRadiant Floor 111,868.300**
(48,674.100)
HeatingDscrVentilation Only -501,601.400
(308,246.300)
HeatingDscrWall Furnace 63,687.660
(51,203.990)
ExtWallDscrSecBlock Stucco 908,666.000***
(127,702.100)
ExtWallDscrSecBrick on Block -42,264.380
(98,955.410)
ExtWallDscrSecBrick Veneer -1,366.410
(14,352.200)
ExtWallDscrSecCedar 433,608.800**
(178,897.700)
ExtWallDscrSecCement Board -12,606.980
(80,053.230)
ExtWallDscrSecFaux Stone -6,093.838
(15,999.100)
ExtWallDscrSecFrame Asbestos 186,674.500
(125,694.400)
ExtWallDscrSecFrame Stucco -46,967.420
(30,382.230)
ExtWallDscrSecFrame Wood/Shake -4,179.883
(10,205.090)
ExtWallDscrSecLog 40,038.290
(154,805.600)
ExtWallDscrSecMetal 43,124.020
(95,520.170)
ExtWallDscrSecMoss Rock/Flagstone 10,256.060
(35,255.790)
ExtWallDscrSecPainted Block 120,912.400
(179,075.700)
ExtWallDscrSecVinyl -19,079.120
(116,907.800)
Roof_CoverDscrAsphalt -19,904.090**
(8,385.809)
Roof_CoverDscrBuilt-Up 114,449.500
(144,809.900)
Roof_CoverDscrClay Tile -81,109.460
(53,747.290)
Roof_CoverDscrConcrete Tile -116,049.400***
(26,820.320)
Roof_CoverDscrMetal 156,150.500***
(27,813.380)
Roof_CoverDscrRoll -285,926.300
(309,616.900)
Roof_CoverDscrRubber Membrane 119,353.700***
(30,067.690)
Roof_CoverDscrShake -109,701.900**
(52,658.250)
Roof_CoverDscrTar and Gravel 108,187.400
(154,584.700)
nbrBaths 39,554.760***
(5,786.099)
schoolNo1 192,300.900
(192,251.800)
schoolNo12 -119,864.300
(196,499.400)
schoolNo13 -161,541.300
(195,477.900)
schoolNo14 -195,113.600
(184,817.600)
schoolNo15 -300,800.400
(184,928.300)
schoolNo16 105,217.800
(175,728.500)
schoolNo17 29,316.720
(174,740.800)
schoolNo18 156,905.000
(173,374.900)
schoolNo2 66,824.150
(175,009.600)
schoolNo21 174,364.800
(183,585.900)
schoolNo22 180,331.200
(174,869.100)
schoolNo23 393,430.100**
(173,845.100)
schoolNo24 133,414.400
(168,335.900)
schoolNo25 113,651.400
(171,733.300)
schoolNo26 150,971.800
(179,127.600)
schoolNo27 316,918.100*
(185,175.600)
schoolNo28 336,067.500*
(182,302.200)
schoolNo29 426,510.200**
(176,021.400)
schoolNo3 28,220.400
(176,339.600)
schoolNo30 693,750.100***
(180,523.300)
schoolNo31 105,340.200
(174,045.400)
schoolNo32 37,449.280
(176,749.000)
schoolNo33 274,278.600
(173,243.300)
schoolNo34 24,006.640
(170,669.400)
schoolNo35 615,575.900***
(176,757.100)
schoolNo36 171,007.900
(176,174.300)
schoolNo37 192,857.900
(181,847.200)
schoolNo38 144,846.000
(176,851.200)
schoolNo39 21,149.090
(173,714.100)
schoolNo4 68,037.020
(175,572.000)
schoolNo40 275,455.800
(183,348.000)
schoolNo41 173,047.300
(174,123.900)
schoolNo42 81,388.530
(175,637.700)
schoolNo43 110,996.100
(176,977.800)
schoolNo44 428,906.800**
(199,764.000)
schoolNo45 349,675.900**
(164,821.700)
schoolNo46 369,797.700*
(210,955.900)
schoolNo47 186,814.600
(184,736.000)
schoolNo48 177,218.800
(173,133.500)
schoolNo49 249,096.900
(180,152.200)
schoolNo5 61,257.920
(181,081.800)
schoolNo50 571,839.500***
(198,270.200)
schoolNo51 686,919.600***
(180,082.600)
schoolNo54 405,471.500**
(185,366.500)
schoolNo55 278,577.300
(172,424.000)
schoolNo56 211,352.400
(181,742.600)
schoolNo58 372,964.500*
(190,294.700)
schoolNo59 739,832.200***
(179,449.900)
schoolNo6 19,266.380
(174,192.100)
schoolNo61 -47,656.880
(234,528.700)
schoolNo63 134,924.200
(174,353.300)
schoolNo64 -12,895.180
(192,549.500)
schoolNo67 203,809.200
(167,637.800)
schoolNo68 323,488.300**
(164,651.400)
schoolNo69 295,910.300*
(171,662.700)
schoolNo7 82,359.650
(185,490.900)
schoolNo70 566,610.500***
(178,149.600)
schoolNo71 193,386.100
(175,177.400)
schoolNo72 142,885.600
(180,158.900)
schoolNo73 475,764.200**
(199,578.000)
schoolNo74 337,472.200**
(168,337.000)
schoolNo75 492,803.900***
(181,046.300)
schoolNo76 504,156.000***
(173,229.000)
schoolNo77 518,184.800***
(180,048.900)
schoolNo78 587,631.000***
(192,509.300)
schoolNo79 102,339.000
(174,270.600)
schoolNo8 -279.539
(182,162.100)
schoolNo80 20,427.450
(176,424.700)
schoolNo81 108,261.000
(176,972.600)
schoolNo82 562,526.800***
(180,760.000)
schoolNo83 333,340.700*
(170,383.800)
schoolNo84 249,962.200
(171,229.400)
schoolNo85 57,035.700
(182,273.400)
schoolNo86 306,487.900*
(164,455.300)
schoolNo87 -19,465.510
(172,936.500)
schoolNo88 178,433.800
(171,346.600)
schoolNo89 303,843.100*
(176,713.400)
schoolNo91 1,210,779.000***
(254,248.700)
schoolNo92 159,554.500
(170,914.000)
schoolNo93 128,712.200
(174,927.500)
schoolNo94 490,524.800***
(181,760.100)
schoolNo95 163,329.900
(175,701.800)
schoolNo96 117,419.700
(174,023.400)
schoolNo97 358,121.400**
(169,468.100)
universitiesCount -112,973.500***
(18,643.440)
parkingArea -0.048
(0.160)
cinemasCount -3,921.342
(2,632.090)
stadiumsCount 9,597.537**
(4,327.427)
fitnessCenter_nn3 -16.570***
(4.767)
Census_MedRent -30.053***
(7.201)
Census_pctAfricanAmericans -6,216.680**
(2,976.965)
LN_commonLeisure_nn1 -7,396.233
(9,287.533)
LN_hospitals_nn1 64,322.760***
(24,152.100)
LN_restaurants_nn1 -8,720.972
(8,198.562)
LN_retail_nn4 8,447.126
(12,011.010)
LN_parkArea -6,953.213***
(1,579.352)
LN_public_transport_nn2 -163,364.500***
(25,394.020)
LN_water_nn1 -15,003.790**
(6,955.407)
TotalFinishedSF 156.830***
(7.206)
Constant -1,630,137.000**
(663,991.200)
Observations 9,202
Log Likelihood -127,310.600
Akaike Inf. Crit. 254,969.200
Bayesian Inf. Crit. 256,209.400
Note: *p<0.1; **p<0.05; ***p<0.01
# Plot observed vs. predicted home prices in the test set
ggplot(data = lmer.test.error, aes(x=price.predict,y=price))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.5)+
  geom_abline(intercept = 0, slope = 1, size = 2,color = "#7A986D")+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="Predicted sale price as a function of observed price") +
  plotTheme()+
  theme(plot.title = element_text(size=22),
        legend.position="right",
        axis.text=element_text(size=22),
        axis.title=element_text(size=22))

Figure 10. Scatter-plot of observed home prices to predicted home prices (test set)

We ran the HLM model using the census tract as the grouping level. Although the Moran’s I test shows only minor improvement in the spatial clustering of errors, this model produces a much lower MAPE of 0.19 on the test set. Table 6 summarizes the HLM model.

#Moran's I
coords.test = st_coordinates(lmer.test.error)
neighborList.test = knn2nb(knearneigh(coords.test, 5))
spatialWeights.test = nb2listw(neighborList.test, style = "W")
moranTest <- moran.mc(lmer.test.error$price.absError, spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)]))+
  geom_histogram(fill="#69b3a2",binwidth = 0.01)+
  geom_vline(aes(xintercept = moranTest$statistic), color = "#DC986D", size = 2) +
  scale_x_continuous(limits = c(-1, 1))+
  labs(title="Moran's I", x = "Moran Test") +
  plotTheme()+
  theme(plot.title = element_text(size=22),
        legend.position="right",
        axis.text=element_text(size=22),
        axis.title=element_text(size=22))

Figure 11. Moran test for HLM model

Figure 12 maps the residuals on the test set, and Figure 13 maps the spatial lag of the errors. Although spatial clustering of errors still exists, the clustering of positive and negative errors has diminished compared with our first two models.

# a map of residuals for test set.
ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = lmer.test.error, aes(color = q5(price.error)), size = 3,alpha=0.5)+
  scale_color_manual(values = palette5,
                    labels = qBr(lmer.test.error, "price.error"),
                    name = "price.error\n(Quintile Breaks)")+
  labs(title = "Map of errors of HLM model") +
  mapTheme() + 
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

Figure 12. Map of errors of HLM model

lmer.test.error$spatialLag = lag.listw(nb2listw(knn2nb(knearneigh(st_coordinates(lmer.test.error), 5)), style = "W"),lmer.test.error$price.absError)

ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = lmer.test.error, aes(color = q5(spatialLag)), size = 3,alpha=0.5)+
  scale_color_manual(values = palette5,
                    labels = qBr(lmer.test.error, "spatialLag"),
                    name = "spatialLag\n(Quintile Breaks)")+
  labs(
    title = "Map of errors' spatial lag of HLM mode") +
  mapTheme()+
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

Figure 13. Map of the spatial lag of errors of HLM model

Figure 14 maps the absolute percent error of our HLM model across census tracts. In the three dense urban areas (Boulder, Longmont, and Louisville/Superior/Erie), the MAPE is markedly lower.

Figure 15 is the scatter-plot of MAPE by census tract as a function of mean price by census tract. It shows that our model is slightly less precise for higher-priced homes.

# a map of mape by GEOID.
dataTrain.test2 <- st_join(dataTrain.test,CensusData2,left=TRUE)
lmer.tract_MAPE = dataTrain.test2 %>%
  mutate(
    price.predict = predict(reg.lmer.fixedEffect, newdata = dataTrain.test2),
    price.ape = (abs(price.predict - price)) / price.predict)%>%
  st_drop_geometry()
  
tract_MAPE <- lmer.tract_MAPE%>%group_by(GEOID)%>%summarize(MAPE = 100*mean(price.ape, na.rm = T))
tract_MAPE <- left_join(tract_MAPE,CensusData2,by="GEOID")%>%
  st_sf()

ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = tract_MAPE, aes(fill = q5(MAPE)),alpha=0.7)+
  scale_fill_manual(values = palette5,
                    labels = qBr(tract_MAPE, "MAPE"),
                    name = "MAPE%\n(Quintile Breaks)")+
  labs(
    title = "Map of absolute percent error of HLM model") +
  mapTheme() + 
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

Figure 14. Map of absolute percent error of HLM model

# Reuse lmer.tract_MAPE computed above
  
tract_MAPE_PRICE <- lmer.tract_MAPE%>%group_by(GEOID)%>%summarize(MAPE = mean(price.ape, na.rm = T),MPRICE=mean(price, na.rm = T))

ggplot(tract_MAPE_PRICE, aes(x = MPRICE, y = MAPE))+
  geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.5)+
  geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
  labs(title="MAPE by neighborhood as a function of mean price by neighborhood") +
  plotTheme()+
  theme(plot.title = element_text(size=22),
        legend.position="right",
        axis.text=element_text(size=22),
        axis.title=element_text(size=22))

Figure 15. Scatter-plot of MAPE by census tract as a function of mean price by census tract

4.3 Model cross-validation

Until now, we have only tested our model on a single test set. To further validate it, we perform 100-fold cross-validation.

fitControl <- trainControl(method = "cv", number = 100)

# Merge small categories to facilitate k-fold validation
dataTrain_merge = dataTrain
list1 = c("Electric Wall Heat (1500W)", "No HVAC", "Package Unit", "Heat Pump", "Gravity",
          "Radiant Floor", "Ventilation Only", "Wall Furnace", "")
dataTrain_merge$HeatingDscr[dataTrain_merge$HeatingDscr %in% list1] = "Other"

list2 = c("Concrete", "Precast", "Veneer", "Wood", "", "Brick")
dataTrain_merge$ConstCodeDscr[dataTrain_merge$ConstCodeDscr %in% list2] = "Other"

list3 = c("", "Block Stucco", "Cedar", "Faux Stone", "Frame Asbestos",
          "Log", "Metal", "Painted Block", "Strawbale", "Vinyl")
dataTrain_merge$ExtWallDscrPrim[dataTrain_merge$ExtWallDscrPrim %in% list3] = "Other"

list4 = c("", "Built-Up", "Clay Tile", "Roll", "Shake", "Tar and Gravel")
dataTrain_merge$Roof_CoverDscr[dataTrain_merge$Roof_CoverDscr %in% list4] = "Other"


fld = cvFolds(nrow(dataTrain), K = 100, R = 1) # cvFolds() is assumed to come from the cvTools package
fldTable = cbind(fld$which, fld$subsets) %>% as.data.frame()
MAE = c()

for (i in 1:100) {
  testIndex = fldTable[fldTable$V1 == i,]$V2
  kTest = dataTrain_merge[testIndex,]
  kTrain = dataTrain_merge[-testIndex,]
  reg.k = lmer(price ~ 
                 designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + 
                 bsmtSF + carStorageType + AcDscr + HeatingDscr + 
                 ExtWallDscrSec + Roof_CoverDscr + nbrBaths + 
                 schoolNo + universitiesCount +  parkingArea + cinemasCount + stadiumsCount + 
                 fitnessCenter_nn3 + Census_MedRent + 
                 Census_pctAfricanAmericans + LN_commonLeisure_nn1 + LN_hospitals_nn1 + 
                 LN_restaurants_nn1 + LN_retail_nn4 + LN_parkArea + LN_public_transport_nn2 + 
                 LN_water_nn1 + TotalFinishedSF+ (1|tractID), data = kTrain)
  err = kTest %>%
    mutate(
      price.predict = predict(reg.k, newdata = kTest),
      price.absError = abs(price.predict - price),
      price.ape = (price.absError) / price.predict,
    )
  MAE <- append(MAE, mean(err$price.absError, na.rm = T))
}

# Collect the per-fold MAEs into a data frame for plotting
MAE.df = data.frame(MAE = MAE)
ggplot()+
  geom_histogram(data = MAE.df, aes(x = MAE),fill = "#69b3a2",alpha=0.5, color = "white",bins=35)+
  labs(title = "Distribution of MAE in k-fold cross-validation")+
  plotTheme()

Figure 16. Distribution of MAE in k-fold cross-validation

4.4 Model generalizability

Next, we test whether the model generalizes across different socioeconomic contexts. We divide the census tracts within Boulder County into two categories: high-income tracts, with a median household income above 80,000 dollars, and low-income tracts, with a median household income below 80,000 dollars. We then calculate the mean absolute percent error for each group (Table 7). The two values are similar, suggesting relatively good generalizability of our model.

Census <- CensusData%>%
  mutate(incomeContext = ifelse(Census_MedHHInc > 80000,"High Income", "Low Income"))

kable2 <- st_join(lmer.test.error, Census) %>% na.omit()%>%
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(price.ape,na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Table 6. MAPE by neighborhood income context")

kable2 %>% 
  kable_classic(full_width = T)%>%
  footnote()

4.5 Mapping of final price prediction

Finally, Figure 17 shows our final prediction on the entire data-set.

# Map predicted values for all homes, i.e., where 'toPredict' is 0 and where it is 1
usetopredictall <- trainData1
Predict.all = usetopredictall %>%
  mutate(price.predict = predict(reg.lmer.fixedEffect, newdata = usetopredictall))%>%
  dplyr::select(starts_with("price"))

ggplot()+
  geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
  geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
  geom_sf(data = Predict.all, aes(color = q5(price.predict)),size = 3,alpha=0.2)+
  scale_color_manual(values = palette5,
                    labels = qBr(Predict.all, "price.predict"),
                    name = "Predicted Price\n(Quintile Breaks)")+
 labs(
    title = "Final prediction on the entire data-set")+
  mapTheme()+
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle=element_text(size = 15,face="italic"),
    panel.border = element_blank(),
    axis.text = element_blank(),
    panel.background = element_blank(),
    panel.grid = element_blank(),
    axis.ticks = element_blank())

Figure 17. Final prediction on the entire data-set

5. Discussion and conclusion

In this project, we have developed a model for home price prediction that performs well in terms of MAE, especially in the central urban areas. We grouped our predictors into three categories: intrinsic features, accessibility to amenities, and community effects.

The spatial clustering of home prices poses a major challenge to home price prediction. Our HLM model still does not fully account for the spatial variation of prices, but it reduces the overall error and performs particularly well in the county’s three urban cores.

However, the model is less accurate for homes in rural areas, likely because there are fewer observations outside the urban cores.