This project develops a model to predict home prices in Boulder County, CO for Zillow. Home price prediction is integral to Zillow's functionality, yet its accuracy has always been a challenge: the current Zillow method does not produce predictions as accurate as they could be, because it does not engage enough local intelligence. In this project, we develop an OLS prediction model by gathering local data and incorporating numerous related variables, then refine the model until it attains adequate accuracy and simplicity. The resulting model reduces the mean absolute percentage error to 19% overall, and to around 10% in the county's three urban cores.
Our prediction model is based on machine learning: we identify a series of predictors that potentially influence home prices and learn the relationships between these predictors and actual home prices from a training set. Applying these relationships, we use the predictors to predict home prices in a new dataset. In particular, the OLS (Ordinary Least Squares) model that we employ assumes a linear relationship between the predictors and home prices.
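To make this train-then-predict pattern concrete, here is a minimal sketch in base R. The data are synthetic and the predictor names (sqft, dist_to_park) are hypothetical, not the actual Boulder County variables:

```r
set.seed(1)
# Synthetic training data: price depends linearly on two hypothetical predictors
train <- data.frame(sqft = runif(200, 800, 3500),
                    dist_to_park = runif(200, 50, 2000))
train$price <- 100000 + 150 * train$sqft - 30 * train$dist_to_park +
  rnorm(200, sd = 20000)

# Fit OLS: lm() estimates the linear relationship on the training set
fit <- lm(price ~ sqft + dist_to_park, data = train)

# Apply the fitted relationship to a new, unseen "home"
newHome <- data.frame(sqft = 2000, dist_to_park = 300)
predict(fit, newdata = newHome)
```

The same two calls, lm() on the training set and predict() on new observations, underlie the much larger model built below.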
The OLS model requires that we collect a series of data to prepare the predictors. We have collected data from three main sources:
The Boulder County Assessor Public Dataset, which includes the prices and the characteristics (e.g., number of rooms, built year, facilities) of homes within Boulder County. From this dataset, we have extracted 11,263 entries for the training and validation of our model.
Open data on the recreational, commercial, transportation, and other amenities/facilities within Boulder County. We collect these data from OpenStreetMap via an API and from the Boulder County Open Data Portal, then calculate each home's accessibility to these amenities and treat it as a factor influencing home prices. Geographic boundary data are also collected from these open sources.
Census data from the American Community Survey (ACS).
The model-building process consists of three stages: 1) environment and data preparation, 2) feature engineering and feature selection, and 3) model validation. The latter two steps may be repeated several times to refine the model.
In the first step, we set up the R environment and import all potentially useful packages and data.
#-----Import necessary packages-----
library(tidyverse)
library(ggplot2)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(osmdata)
library(tigris)
library(osmextract)
library(curl)
library(reshape2)
library(glue)
library(dismo)
library(spgwr)
library(MASS)
library(lme4)
library(data.table)
library(kableExtra)
library(stargazer) # Source: ^[Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2.2. https://CRAN.R-project.org/package=stargazer]
options(tigris_class = "sf")
#-----Import external functions & a palette-----
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
palette5 <- c("#2166ac","#67a9cf","#d1e5f0","#fddbc7","#ef8a62")
#-----Geographical boundaries and CRS-----
# Bounding box for getting OSM & CRS
q0 <- opq(bbox = c(-105.7,39.9,-105.05,40.25))
crs = "ESRI:102254"
# Boulder County boundary
boulder <- counties(state = 'CO', cb = FALSE) %>%
filter(NAME == "Boulder") %>%
st_as_sf() %>%
st_transform(crs)
# Boulder County's municipalities
muni = st_read("https://opendata.arcgis.com/datasets/9597d3916aba47e887ca563d5ac15938_0.geojson") %>%
st_transform(crs)
#-----Self-defined functions-----
#Get data from OSM
get_osm_1 <- function(bbox, key, value, geom, crs){
object <- add_osm_feature(opq = bbox, key = key, value = value) %>%
osmdata_sf(.)
geom_name <- paste("osm_", geom, sep = "")
object.sf <- st_geometry(object[[geom_name]]) %>%
st_transform(crs) %>%
st_sf() %>%
cbind(., object[[geom_name]][[key]]) %>%
rename(NAME = "object..geom_name....key..")
return(object.sf)
}
# Buffer Count - Count the number of amenity objects within a buffer around each home
# Input: home data (sf), object/amenity type, buffer radius, object data
buffer_count = function(trainData, type, radius, data){
houseBuffer = st_buffer(trainData, radius)
trainData[type] = st_intersects(houseBuffer, data) %>%
sapply(., length)
return(trainData)
}
# Buffer Area - for polygon amenities, calculate the aggregate area of all entries that intersect with the buffer of a home
# Input: home data (sf), object/amenity type, buffer radius, object data
buffer_area = function(trainData, type, radius, data){
data <- mutate(data, area = st_area(data))
trainData[type] = st_buffer(trainData, radius) %>%
aggregate(data %>% dplyr::select(geometry, area), ., sum) %>%
pull(area)
return(trainData)
}
# K-near - Calculate the average distance of a home to its nearest (1 to 5) amenity object(s)
# Input: home data, object/amenity data, object/amenity type, geometry type (point or non-point)
knear <- function(trainData, data, type, geom){
if (geom != "points") {
data = data %>% dplyr::select(geometry) %>% st_centroid(.) %>% st_coordinates
}
else {
data = data %>% dplyr::select(geometry) %>% st_coordinates
}
trainDataCoords = st_coordinates(trainData)
nn_1 = nn_function(trainDataCoords, data, 1)
nn_2 = nn_function(trainDataCoords, data, 2)
nn_3 = nn_function(trainDataCoords, data, 3)
nn_4 = nn_function(trainDataCoords, data, 4)
nn_5 = nn_function(trainDataCoords, data, 5)
trainData[paste(type, "_nn1", sep = "")] = nn_1
trainData[paste(type, "_nn2", sep = "")] = nn_2
trainData[paste(type, "_nn3", sep = "")] = nn_3
trainData[paste(type, "_nn4", sep = "")] = nn_4
trainData[paste(type, "_nn5", sep = "")] = nn_5
return(trainData)
}
#-----Import home data and wrangle them-----
# Import all home data
trainData = st_read("studentData.geojson", crs = "ESRI:102254")
# Wrangle home data
trainData <- dplyr::select(
trainData,
-saleDate,-year,-year_quarter,-address,-bld_num,-section_num,-designCode,-qualityCode,-bldgClass,-bldgClassDscr,-ConstCode,-CompCode,-builtYear,-bsmtTypeDscr,-carStorageTypeDscr,-Ac,-Heating,-ExtWallPrim,-ExtWallSec, -IntWall, -Roof_Cover, -Stories,-UnitCount,-status_cd)%>%
mutate(nbrBaths = nbrFullBaths + 0.5 * nbrHalfBaths + 0.75 * nbrThreeQtrBaths)%>%
dplyr::select(-nbrFullBaths, -nbrHalfBaths, -nbrThreeQtrBaths)
Using this dataset, we can map the sale prices of homes (Figure 1). Home prices follow distinct spatial patterns in Boulder County, as evident in Figure 1: higher prices are concentrated in the Municipality of Boulder and the southeastern municipalities of Louisville, Superior, and Erie, while prices in Longmont and rural areas are much lower.
ggplot()+
geom_sf(data = CensusData2, color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent", color="black",size=2)+
geom_sf(data = trainData, aes(color = q5(price)), size = 3,alpha=0.3)+
scale_color_manual(values = palette5,
labels = qBr(trainData, "price"),
name = "Home sale prices")+
labs(title = "Map of home sale prices in Boulder County, CO",
subtitle = "Real Dollars; In tract unit")+
mapTheme()+
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
While the dataset of homes (trainData) includes the intrinsic features of each home, we still need external features: the accessibility of each home to recreation, services, transport, and other amenities. Depending on the characteristics of each type of amenity, we calculate accessibility using one or more of the following methods (implemented by the self-defined functions in the earlier code chunk):
Buffer-count method. We count how many objects there are within a buffer (usually a 1,000-meter buffer) of each home.
Buffer-area method. We calculate the aggregate area of objects within a buffer (usually a 1,000-meter buffer) of each home.
K-nearest-neighbor method. We calculate the average distance of each home to its 1~5 nearest amenity object(s).
#-----Amenity: parks-----
# Get park data, same below
parks = get_osm_1(q0, "leisure", "park", "polygons", "ESRI:102254")
## How many parks within 1000 m buffer of each home, same below
trainData = buffer_count(trainData, "parkCount", 1000, parks)
## Aggregate park area of all intersected park, same below
trainData = buffer_area(trainData, "parkArea", 1000, parks)
#-----Amenity: restaurants-----
restaurants = get_osm_1(q0, "amenity", "restaurant", "points", crs)
trainData = buffer_count(trainData, "restaurantsCount", 1000, restaurants)
# Average distance to 1~5 nearest restaurant(s), same below
trainData = knear(trainData, restaurants, "restaurants", "points")
#-----Amenity: schools-----
schools = get_osm_1(q0, 'amenity', 'school', "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "schoolCount", 1000, schools)
trainData = knear(trainData, schools, "schools", "polygons")
# Assign each home to its nearest school
schoolDistrict = voronoi(st_coordinates(schools %>% st_centroid())) %>%
st_as_sf() %>%
st_set_crs("ESRI:102254") %>%
rename(schoolNo = id) %>%
mutate(schoolNo = as.character(schoolNo))
trainData = st_join(trainData, schoolDistrict)
#-----Amenity: universities-----
# Get university data
universities = get_osm_1(q0, 'amenity', 'university', "polygons", "ESRI:102254")
# Number of universities within 1000 m buffer of each home
trainData = buffer_count(trainData, "universitiesCount", 1000, universities)
#-----Amenity: parking-----
parking = get_osm_1(q0, "amenity", "parking", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "parkingCount", 1000, parking)
trainData = buffer_area(trainData, "parkingArea", 1000, parking)
trainData = knear(trainData, parking, "parking", "polygons")
#-----Amenity: clinics-----
clinics = get_osm_1(q0, "amenity", "clinic", "points", "ESRI:102254")
trainData = knear(trainData, clinics, "clinics", "points")
#-----Amenity: hospitals-----
hospitals = get_osm_1(q0, "amenity", "hospital", "polygons", "ESRI:102254")
trainData = knear(trainData, hospitals, "hospitals", "polygons")
#-----Amenity: cinemas-----
cinemas = get_osm_1(q0, "amenity", "cinema", "points", "ESRI:102254")
trainData = buffer_count(trainData, "cinemasCount", 1000, cinemas)
trainData = buffer_count(trainData, "cinemasCount2", 2000, cinemas)
#-----Amenity: stadiums-----
stadiums = get_osm_1(q0, "building", "stadium", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "stadiumsCount", 1000, stadiums)
#-----Amenity: commerce-----
commercial = get_osm_1(q0, "building", "commercial", "polygons", "ESRI:102254")
trainData = knear(trainData, commercial, "commerce", "polygons")
trainData = buffer_count(trainData, "commerceCount", 1000, commercial)
#-----Amenity: retail-----
retail = get_osm_1(q0, "building", "retail", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "retailCount", 1000, retail)
trainData = knear(trainData, retail, "retail", "polygons")
#-----Amenity: common leisure space-----
commonleisure = get_osm_1(q0, "leisure", "common", "polygons", "ESRI:102254")
trainData = knear(trainData, commonleisure, "commonLeisure", "polygons")
#-----Amenity: fitness centers-----
fitness_centre = get_osm_1(q0, "leisure", "fitness_centre", "points", "ESRI:102254")
trainData = buffer_count(trainData, "fitnessCenterCount", 1000, fitness_centre)
trainData = knear(trainData, fitness_centre, "fitnessCenter", "points")
#-----Amenity: public gardens-----
garden = get_osm_1(q0, "leisure", "garden", "polygons", "ESRI:102254")
trainData = knear(trainData, garden, "gardens", "polygons")
trainData = buffer_count(trainData, "gardensCount", 1000, garden)
#-----Jobs: companies-----
companies = get_osm_1(q0, "office", "company", "points", "ESRI:102254")
trainData = knear(trainData, companies, "companies", "points")
#-----Transport: public transport-----
public_transport = get_osm_1(q0, "public_transport", "station", "points", "ESRI:102254")
trainData <-
trainData %>%
mutate(
public_transport_nn1 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 1),
public_transport_nn2 = nn_function(st_coordinates(trainData), st_coordinates(public_transport%>%dplyr::select(geometry)%>%st_centroid(.)), 2))
#-----Transport: highway-----
highway = get_osm_1(q0, "highway", "primary", "lines", "ESRI:102254")
# Distance from each home to primary highways.
highwayunion <- st_buffer(st_union(highway),500)%>%st_sf()
highway_MRBuffer <-
st_join(trainData,multipleRingBuffer(highwayunion,40000,1000)) %>%
st_drop_geometry() %>%
replace_na(list(distance = 0)) %>%
mutate(distance = distance+500) %>%
rename(.,highwaydistance = distance)%>%
dplyr::select(MUSA_ID,highwaydistance)
trainData <- left_join(trainData,highway_MRBuffer,by="MUSA_ID")
rm(highway_MRBuffer,highwayunion)
#-----Nature reserves-----
natureReserves = get_osm_1(q0, "leisure", "nature_reserve", "polygons", "ESRI:102254")
trainData = buffer_count(trainData, "natureReservesCount", 1000, natureReserves)
#-----Water bodies-----
water = get_osm_1(q0, "natural", "water", "polygons", "ESRI:102254")
trainData <-trainData %>% mutate(water_nn1 = nn_function(st_coordinates(trainData), st_coordinates(water%>%dplyr::select(geometry)%>%st_centroid(.)), 1))
In addition to data from OpenStreetMap, we have also collected block-group-level data from the ACS (American Community Survey, 5-year estimates, 2019). Each home is then tagged with its census tract and municipality, and, finally, with the average price of its five nearest homes.
The complete table of summary statistics for all variables is too lengthy to present here; a table of summary statistics for the selected variables appears later in this report.
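The spatial-lag idea, averaging the prices of a home's nearest neighbors, can be sketched in base R with toy data. The six homes, their prices, and k = 2 below are invented for illustration; the actual model uses the five nearest homes via the FNN package:

```r
# Toy coordinates and prices for six hypothetical homes
homes <- data.frame(x = c(0, 1, 2, 10, 11, 12),
                    y = c(0, 1, 2, 10, 11, 12),
                    price = c(100, 110, 120, 300, 310, 320))
# For each home, average the prices of its k nearest neighbors (excluding itself)
k <- 2
spatial_lag <- sapply(seq_len(nrow(homes)), function(i) {
  d <- sqrt((homes$x - homes$x[i])^2 + (homes$y - homes$y[i])^2)
  nearest <- order(d)[2:(k + 1)]  # drop index 1, the home itself
  mean(homes$price[nearest])
})
spatial_lag
```

Because nearby homes have similar prices, the lag for each home lands close to its own price, which is exactly why it is such a strong predictor.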
#-----API-----
census_api_key("b33ec1cb4da108659efd12b3c15412988646cbd8", overwrite = TRUE)
#-----Get census data-----
# Get block-group level information data regarding racial composition, population density, income, resident education levels etc.
CensusData <-
get_acs(geography = "block group",
variables = c("B01003_001E","B02001_002E","B19013_001E","B25058_001E",'B02001_003E','B02001_005E'),
year=2019, state=08, county=013, geometry=T, output="wide") %>%
st_transform('ESRI:102254') %>%
rename(Census_TotalPop = B01003_001E,
Census_Whites = B02001_002E,
Census_AfricanAmericans = B02001_003E,
Census_Asians = B02001_005E,
Census_MedHHInc = B19013_001E,
Census_MedRent = B25058_001E) %>%
dplyr::select(-NAME,-starts_with("B")) %>%
mutate(Census_pctWhite = ifelse(Census_TotalPop > 0, 100*Census_Whites / Census_TotalPop,0),
Census_pctAfricanAmericans = ifelse(Census_TotalPop > 0, 100*Census_AfricanAmericans / Census_TotalPop,0),
Census_pctAsians = ifelse(Census_TotalPop > 0, 100*Census_Asians / Census_TotalPop,0),
Census_blockgroupareasqm = as.numeric(st_area(.)),
Census_areaperpeople = ifelse(Census_blockgroupareasqm > 0,Census_blockgroupareasqm/Census_TotalPop,0),
year = "2019") %>%
dplyr::select(-Census_Whites,-Census_AfricanAmericans,-Census_Asians, -GEOID)
# Get tract-level census data for tract geo-info
CensusData2 <-
get_acs(geography = "tract",
variables = c("B01003_001E"),
year=2019, state=08, county=013, geometry=T, output="wide") %>%
st_transform('ESRI:102254') %>%
dplyr::select(-NAME,-starts_with("B"))
#-----Join trainData to census data-----
trainData <-st_join(trainData,dplyr::select(CensusData,-Census_blockgroupareasqm,-year,-geometry,-Census_TotalPop))
trainData <- st_join(trainData, dplyr::select(CensusData2, GEOID))
#-----Join trainData to municipality data-----
trainData <-st_join(trainData,dplyr::select(muni, ZONEDESC)) %>%
rename(tractID = GEOID, muni = ZONEDESC)
trainData$muni[is.na(trainData$muni)] = 'Other'
#-----Other modification-----
trainData$parkingArea = as.numeric(trainData$parkingArea)
trainData$parkArea = as.numeric(trainData$parkArea)
#-----Spatial lag information-----
# Average price of five nearest homes of each home
# For "toPredict == 1" observations, the nearest five homes are defined as the nearest five homes in the "toPredict == 0" set.
# First split data by toPredict
dataTest = trainData[trainData$toPredict == 1,]
dataTrain = trainData[trainData$toPredict == 0,]
rownames(dataTest) <- seq(length = nrow(dataTest))
rownames(dataTrain) <- seq(length = nrow(dataTrain))
# Spatial lag for "toPredict == 1" homes
dataTest = cbind(dataTest,
as.data.frame(get.knnx(st_coordinates(dataTrain),
st_coordinates(dataTest), 5)$nn.index) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_index, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
left_join(cbind(dataTrain$price, seq(length = length(dataTrain$price))) %>%
as.data.frame()%>%
rename(price = V1, index = V2),
by = c("point_index" = "index")) %>%
group_by(thisPoint)%>%
summarize(spatialLag = mean(price))%>%
arrange(as.numeric(thisPoint))%>%
dplyr::select(-thisPoint))
# Adding spatial lag indicator into "dataTrain"
dataTrain$spatialLag = lag.listw(nb2listw(knn2nb(knearneigh(st_coordinates(dataTrain), 5)), style = "W"),dataTrain$price)
# Un-split data, because we still need to wrangle the data-set as a whole
trainData = rbind(dataTrain, dataTest)
The OLS model works best when the input variables (predictors) are approximately normally distributed and relate roughly linearly to the outcome. We have therefore plotted scatter-plots of home prices against all the numeric variables (see Figure 2), as well as histograms of all the numeric variables (see Figure 3).
#-----Select Numeric Variables-----
numericColumns = unlist(lapply(trainData, is.numeric))
trainDataNumeric = trainData[,numericColumns] %>%
dplyr::select(-MUSA_ID,-toPredict)%>%
st_drop_geometry() %>%
gather(key,value, -price)
#-----Make scatter-plots of all numeric variables----
ggplot(trainDataNumeric, aes(x = value, y = price))+
geom_point(size=.05,alpha=0.5)+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=0.5)+
facet_wrap(~key, scales = "free")+
theme(axis.text=element_text(size=5), axis.title=element_text(size=2),
panel.background = element_blank(),
strip.background = element_rect(fill = "white", color = "white"),
strip.text = element_text(size=7),
panel.border = element_rect(colour = "black", fill=NA, size=1))
The scatter-plots show that there is not a strong linear relationship between home prices and many predictors, and the variance of home prices often does not stay constant across predictor values. Therefore, we consider applying a logarithmic transformation to some of the numeric predictors—that is, transforming the values of predictors to their logarithms. Figure 3 shows the histograms of every numeric predictor before and after logarithmic transformation.
# Possible Distribution Transformation
trainDataNumeric2 = trainData[,numericColumns] %>%
dplyr::select(-MUSA_ID,-toPredict)%>%
st_drop_geometry()
par(mfrow = c(1, 2))
for (i in seq_along(trainDataNumeric2)) {
name <- names(trainDataNumeric2)[i]
column <- as.numeric(trainDataNumeric2[[i]])
hist(column, breaks = 80, main = c(name, "Before"))
hist(log(1 + column), breaks = 80, main = c(name, "After"))
}
dev.off()
The plots above help us identify the variables suitable for a logarithmic transformation: those whose distributions become substantially closer to Gaussian after the transformation.
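One way to make "closer to Gaussian" measurable is to compare moment-based skewness before and after the transformation (values nearer 0 indicate a more symmetric, more Gaussian-like distribution). The skewness() helper below is our own small sketch rather than a package function, applied to a synthetic right-skewed sample:

```r
# Moment-based skewness: third central moment scaled by the cubed SD
skewness <- function(x) {
  x <- x[is.finite(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)  # strongly right-skewed sample
skewness(x)           # well above 0 before the transformation
skewness(log(1 + x))  # much closer to 0 afterwards
```

Variables whose skewness shrinks markedly under log(1 + x) are the ones collected into the LNVariables list below.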
# Variables List
LNVariables <- c("Census_areaperpeople","clinics_nn1",'clinics_nn2','clinics_nn3','clinics_nn4','clinics_nn5',"commerce_nn1","commerce_nn2","commerce_nn3","commerce_nn4","commerce_nn5","commonLeisure_nn1","commonLeisure_nn2","commonLeisure_nn3","commonLeisure_nn4","commonLeisure_nn5","companies_nn1","companies_nn2","companies_nn3","companies_nn4","companies_nn5","gardens_nn1","gardens_nn2","gardens_nn3","gardens_nn4","gardens_nn5","hospitals_nn1","hospitals_nn2","hospitals_nn3","hospitals_nn4","hospitals_nn5","parking_nn1","parking_nn2","parking_nn3","parking_nn4","parking_nn5","restaurants_nn1","restaurants_nn2","restaurants_nn3","restaurants_nn4","restaurants_nn5","retail_nn1","retail_nn2","retail_nn3","retail_nn4","retail_nn5","schools_nn1","schools_nn2","schools_nn3","schools_nn4","schools_nn5","parkArea","public_transport_nn1","public_transport_nn2","water_nn1")
# Geometry has to be dropped before logarithmic transformation
geo_trainData <- trainData %>% dplyr::select(MUSA_ID)
trainData1 = trainData %>% st_drop_geometry
# Logarithmic transformation
trainData1$parkingArea = as.numeric(trainData1$parkingArea)
trainData1$parkArea = as.numeric(trainData1$parkArea)
for (i in LNVariables) {
a = glue('LN_{i}')
b = dplyr::select(trainData1,glue('{i}'))
c = log(1+b)
trainData1[a] <- c
trainData1 <- dplyr::select(trainData1,-glue('{i}'))
}
# Join back geometry
trainData1 = trainData1 %>% left_join(geo_trainData, by = "MUSA_ID") %>% st_sf()
trainData1[is.na(trainData1)] = 0
After feature engineering, our model consists of a large number of interrelated predictors, violating the assumptions of an OLS model. Feature selection is required to make the model more concise and mathematically tenable. As a first step, we plot a correlation matrix (see Figure 4) to identify which predictors are highly interrelated.
numericColumns3 = unlist(lapply(trainData1, is.numeric))
trainDataNumeric3 = trainData1[,numericColumns3] %>%
dplyr::select(-MUSA_ID,-toPredict)%>%
na.omit()%>%
st_drop_geometry()
Matrix <- select_if(trainDataNumeric3, is.numeric) %>% na.omit()
ggcorrplot(
cor(Matrix),
p.mat = cor_pmat(Matrix),
colors = c("#EF8A62", "white", "#67A9CF"),
type="lower",
lab = TRUE,
lab_size = 1,
tl.cex = 5,
insig = "blank")
From the correlation matrix, we have identified 11 groups of highly inter-correlated predictors. Within each group, only one predictor will remain in the final model.
# Define the 11 groups of highly inter-correlated predictors
LNgardens_nn <- c("LN_gardens_nn2","LN_gardens_nn3","LN_gardens_nn4","LN_gardens_nn5","gardensCount")
LNhospitals_nn <- c("LN_hospitals_nn1","LN_hospitals_nn3")
# Use a more generic name for easier later access. Same below
cbindvars1 <- c(LNgardens_nn,LNhospitals_nn)
LNcommonLeisure_nn <- c("LN_commonLeisure_nn3","LN_commonLeisure_nn4","LN_commonLeisure_nn5")
cbindvars2 <- LNcommonLeisure_nn
LNschools_nn <- c("LN_schools_nn2","LN_schools_nn3","LN_schools_nn4","LN_schools_nn5")
parking_decision <- c("parkingCount","parkingArea","LN_parking_nn1")
cbindvars3 <- c(LNschools_nn,parking_decision)
LNrestaurants_nn <- c("LN_restaurants_nn1","LN_restaurants_nn2","LN_restaurants_nn4","restaurantsCount")
LNcommerce_nn <- c("LN_commerce_nn4","LN_commerce_nn5","commerceCount")
cbindvars4 <- c(LNrestaurants_nn,LNcommerce_nn)
LNretail_nn <- c("LN_retail_nn1","LN_retail_nn4","LN_retail_nn5")
LNclinics_nn <- c("LN_clinics_nn1","LN_clinics_nn2","LN_clinics_nn3","LN_clinics_nn5")
cbindvars5 <- c(LNretail_nn,LNclinics_nn)
LNpublic_transport_nn <- c("LN_public_transport_nn1","LN_public_transport_nn2")
cbindvars6 <- LNpublic_transport_nn
park_decision <- c("parkCount","LN_parkArea")
cbindvars7 <- park_decision
bsmt_decision <- c("bsmtSF","bsmtType")
cbindvars8 <- bsmt_decision
LNcompanies_nn <- c("LN_companies_nn2","LN_companies_nn3","LN_companies_nn4","LN_companies_nn5")
cbindvars9 <- LNcompanies_nn
ExtWall_decision <- c("ExtWallDscrPrim","ExtWallDscrSec")
cbindvars10 <- ExtWall_decision
The second step is to run a step-wise linear model. Step-wise regression is an iterative process of adding and removing predictors, testing statistical significance after each iteration; it serves as a preliminary filter of the insignificant predictors.
#-----A simplified dataset of only the selected variables-----
trainData3 <- dplyr::select(
trainData1,
price,designCodeDscr,qualityCodeDscr,ConstCodeDscr,EffectiveYear,carStorageType,TotalFinishedSF,
AcDscr,HeatingDscr,IntWallDscr,Roof_CoverDscr,nbrBaths,schoolCount,universitiesCount,stadiumsCount,
highwaydistance,natureReservesCount,Census_MedHHInc,Census_MedRent,Census_pctWhite,Census_pctAsians,
tractID,LN_Census_areaperpeople,LN_water_nn1,LN_hospitals_nn1,LN_commonLeisure_nn4,LN_schools_nn2,
LN_restaurants_nn1,LN_clinics_nn1,LN_public_transport_nn2,LN_parkArea,bsmtSF,LN_companies_nn5,ExtWallDscrPrim,fitnessCenter_nn3,gardensCount, toPredict, muni)
#-----Split dataset by the toPredict variable-----
dataTest = trainData3[trainData3$toPredict == 1,]
dataTrain = trainData3[trainData3$toPredict == 0,]
rownames(dataTest) <- seq(length = nrow(dataTest))
rownames(dataTrain) <- seq(length = nrow(dataTrain))
dropList1 = c("geometry", "MUSA_ID", "toPredict", "muni")
# Running a step-wise model
reg1Data = dataTrain[, !colnames(dataTrain) %in% dropList1] %>%
st_drop_geometry
reg.baseline1= lm(price ~ ., data = reg1Data)
reg.baseline1.step = stepAIC(reg.baseline1, direction = "both",trace = FALSE)
reg.baseline1.step$anova
# Examine the step-wise model
summary(reg.baseline1.step)
# The chosen variables are:
# designCodeDscr, qualityCodeDscr, ConstCodeDscr, EffectiveYear, bsmtSF, bsmtType, carStorageType, AcDscr, HeatingDscr, ExtWallDscrPrim, ExtWallDscrSec, Roof_CoverDscr, nbrBaths, parkCount, restaurantsCount, schoolCount, schoolNo, universitiesCount, parkingCount, parkingArea, cinemasCount, stadiumsCount, commerceCount, fitnessCenter_nn3, gardensCount, Census_MedRent, Census_pctAfricanAmericans, tractName, LNclinics_nn1, LNclinics_nn3, LNclinics_nn5, LNcommerce_nn3, LNcommerce_nn4, LNcommerce_nn5, LNcommonLeisure_nn1, LNcommonLeisure_nn3, LNcommonLeisure_nn4, LNcommonLeisure_nn5, LNcompanies_nn1, LNcompanies_nn4, LNgardens_nn2, LNgardens_nn5, LNhospitals_nn1, LNhospitals_nn4, LNhospitals_nn5, LNparking_nn1, LNparking_nn2, LNparking_nn5, LNrestaurants_nn1, LNrestaurants_nn2, LNrestaurants_nn4, LNretail_nn1, LNretail_nn4, LNretail_nn5, LNschools_nn2, LNschools_nn3, LNschools_nn4, LNparkArea, LNpublic_transport_nn1, LNpublic_transport_nn2, LNwater_nn1, parkAreaIntersected, TotalFinishedSF
However, the step-wise model does not eliminate all instances of inter-correlation among predictors. For the next step, we manually select one predictor from each group of highly interrelated predictors. To do this, we sequentially add every predictor in each of the 11 groups to a baseline model, testing prediction accuracy with every move. Accuracy is measured by the mean absolute percentage error (MAPE): the prediction error as a percentage of the observed price. To calculate MAPE, we randomly split the dataset into a training set (75% of observations) and a test set (25% of observations), fit the model on the training set, and use it to predict home prices for the test set.
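As a self-contained sketch of the metric (the mape() helper and the toy prices are illustrative, not part of the model code):

```r
# Mean absolute percentage error: average |error| relative to the observed value
mape <- function(observed, predicted) {
  mean(abs(predicted - observed) / observed)
}

# Toy example: errors of 10%, 10%, and 0% average to one-fifteenth
mape(c(100, 200, 400), c(110, 180, 400))
```

Dividing by the observed price (not the predicted one) is what makes errors comparable across cheap and expensive homes.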
# Create a 75/25 split, stratified by several categorical variables
inTrain = createDataPartition(y = paste(dataTrain$ConstCodeDscr,
dataTrain$Roof_CoverDscr,
dataTrain$ExtWallDscrPrim,
dataTrain$tractID),
p = .75, list = FALSE)
# Split data into a training set and a test set
dataTrain.training = dataTrain[inTrain,]
dataTrain.test = dataTrain[-inTrain,]
#-----Construct comparison dataframe-----
varcomparison = dataTrain.test %>%
dplyr::select(price) %>%
st_drop_geometry()
#-----Sequentially run models-----
for( i in cbindvars1) {
trainingmed <- rename(dataTrain.training, "medname_var1" = i)
testingmed <- rename(dataTrain.test,"medname_var1" = i)
reg.test1 = lm(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + carStorageType + TotalFinishedSF + AcDscr + HeatingDscr +
IntWallDscr + Roof_CoverDscr + nbrBaths + + schoolCount + universitiesCount + stadiumsCount + highwaydistance + natureReservesCount +
Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_pctAsians + tractID + LN_Census_areaperpeople + LN_water_nn1 +
medname_var1,data = trainingmed)
varcomparison = varcomparison %>% mutate(predict.price = predict(reg.test1, newdata = testingmed))
varcomparison[i] = abs(varcomparison$predict.price - varcomparison$price) / varcomparison$predict.price
}
for( i in cbindvars2) {
trainingmed <- rename(dataTrain.training, "medname_var1" = i)
testingmed <- rename(dataTrain.test,"medname_var1" = i)
reg.test1 = lm(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + carStorageType + TotalFinishedSF + AcDscr + HeatingDscr +
IntWallDscr + Roof_CoverDscr + nbrBaths + + schoolCount + universitiesCount + stadiumsCount + highwaydistance + natureReservesCount +
Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_pctAsians + tractID + LN_Census_areaperpeople + LN_water_nn1 +
medname_var1,data = trainingmed)
varcomparison = varcomparison %>% mutate(predict.price = predict(reg.test1, newdata = testingmed))
varcomparison[i] = abs(varcomparison$predict.price - varcomparison$price) / varcomparison$predict.price
}
for( i in cbindvars3) {
trainingmed <- rename(dataTrain.training, "medname_var1" = i)
testingmed <- rename(dataTrain.test,"medname_var1" = i)
reg.test1 = lm(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + carStorageType + TotalFinishedSF + AcDscr + HeatingDscr +
IntWallDscr + Roof_CoverDscr + nbrBaths + + schoolCount + universitiesCount + stadiumsCount + highwaydistance + natureReservesCount +
Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_pctAsians + tractID + LN_Census_areaperpeople + LN_water_nn1 +
medname_var1,data = trainingmed)
varcomparison = varcomparison %>% mutate(predict.price = predict(reg.test1, newdata = testingmed))
varcomparison[i] = abs(varcomparison$predict.price - varcomparison$price) / varcomparison$predict.price
}
for( i in cbindvars4) {
trainingmed <- rename(dataTrain.training, "medname_var1" = i)
testingmed <- rename(dataTrain.test,"medname_var1" = i)
reg.test1 = lm(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + carStorageType + TotalFinishedSF + AcDscr + HeatingDscr +
IntWallDscr + Roof_CoverDscr + nbrBaths + + schoolCount + universitiesCount + stadiumsCount + highwaydistance + natureReservesCount +
Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_pctAsians + tractID + LN_Census_areaperpeople + LN_water_nn1 +
medname_var1,data = trainingmed)
varcomparison = varcomparison %>% mutate(predict.price = predict(reg.test1, newdata = testingmed))
varcomparison[i] = abs(varcomparison$predict.price - varcomparison$price) / varcomparison$predict.price
}
#-----The remaining candidate-variable loops are identical, so run them in one pass-----
for( i in c(cbindvars5, cbindvars6, cbindvars7, cbindvars8, cbindvars9, cbindvars10)) {
trainingmed <- rename(dataTrain.training, "medname_var1" = i)
testingmed <- rename(dataTrain.test, "medname_var1" = i)
reg.test1 = lm(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + carStorageType + TotalFinishedSF + AcDscr + HeatingDscr +
IntWallDscr + Roof_CoverDscr + nbrBaths + schoolCount + universitiesCount + stadiumsCount + highwaydistance + natureReservesCount +
Census_MedHHInc + Census_MedRent + Census_pctWhite + Census_pctAsians + tractID + LN_Census_areaperpeople + LN_water_nn1 +
medname_var1, data = trainingmed)
varcomparison = varcomparison %>% mutate(predict.price = predict(reg.test1, newdata = testingmed))
varcomparison[i] = abs(varcomparison$predict.price - varcomparison$price) / varcomparison$predict.price
}
#-----Summarize comparison-----
varcomparison_summary <- colMeans(varcomparison,dims=1)
varcomparison_summary <- as.data.frame.list(varcomparison_summary)
Finally, we present our final selection of variables as follows. The first group of predictors comprises the intrinsic features of each home (see Table 1). Many of them are categorical and are therefore converted into multiple dummy variables. These data come directly from the home price dataset.
IntrinsicFeatures = model.matrix(~ TotalFinishedSF+
designCodeDscr+
qualityCodeDscr+
ConstCodeDscr+
EffectiveYear+
bsmtSF+
carStorageType+
AcDscr+
HeatingDscr+
ExtWallDscrSec+
Roof_CoverDscr+
nbrBaths - 1,
data = trainData1) %>%
as.data.frame()
stargazer(IntrinsicFeatures,
type = "html",
digits = 1,
title = "Table 1. Summary statistics of predictors - intrinsic features",
out.header = TRUE,
covariate.labels = c("Total finished area square feet",
"Design type - Ranch",
"Design type - 2-3 story",
"Design type - Bi-level",
"Design type - Multi-story-townhouse",
"Design type - Split-level",
"Quality - Average +",
"Quality - Average ++",
"Quality - Excellent",
"Quality - Excellent +",
"Quality - Excellent ++",
"Quality - Exceptional 1",
"Quality - Exceptional 2",
"Quality - Fair",
"Quality - Good",
"Quality - Good +",
"Quality - Good ++",
"Quality - Low",
"Quality - Very good",
"Quality - Very good +",
"Quality - Very good ++",
"Construction type - Brick",
"Construction type - Concrete",
"Construction type - Frame",
"Construction type - Masonry",
"Construction type - Precast",
"Construction type - Veneer",
"Construction type - Wood",
"Year of construction or last major change",
"Basement square feet",
"Car storage type - GRA",
"Car storage type - GRB",
"Car storage type - GRC",
"Car storage type - GRD",
"Car storage type - GRF",
"Car storage type - GRW",
"Air conditioning - Attic fan",
"Air conditioning - Evaporative cooler",
"Air conditioning - Whole house",
"Heating - Electric",
"Heating - Electric wall heat",
"Heating - Forced air",
"Heating - Gravity",
"Heating - Heat pump",
"Heating - Hot water",
"Heating - No HVAC",
"Heating - Package unit",
"Heating - Radiant floor",
"Heating - Ventilation only",
"Heating - Wall furnace",
"External wall - Block stucco",
"External wall - Brick on block",
"External wall - Brick veneer",
"External wall - Cedar",
"External wall - Cement board",
"External wall - Faux stone",
"External wall - Frame asbestos",
"External wall - Frame stucco",
"External wall - Frame wood/shake",
"External wall - Log",
"External wall - Metal",
"External wall - Moss rock/flagstone",
"External wall - Painted block",
"External wall - Vinyl",
"Roof - Asphalt",
"Roof - Built-up",
"Roof - Clay tile",
"Roof - Concrete tile",
"Roof - Metal",
"Roof - Roll",
"Roof - Rubber membrane",
"Roof - Shake",
"Roof - Tar and gravel",
"Number of bathrooms"),
flip = FALSE)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
Total finished area square feet | 11,363 | 1,955.5 | 892.6 | 0 | 1,276 | 2,443.5 | 10,188 |
Design type - Ranch | 11,363 | 0.3 | 0.5 | 0 | 0 | 1 | 1 |
Design type - 2-3 story | 11,363 | 0.5 | 0.5 | 0 | 0 | 1 | 1 |
Design type - Bi-level | 11,363 | 0.03 | 0.2 | 0 | 0 | 0 | 1 |
Design type - Multi-story-townhouse | 11,363 | 0.1 | 0.3 | 0 | 0 | 0 | 1 |
Design type - Split-level | 11,363 | 0.1 | 0.3 | 0 | 0 | 0 | 1 |
Quality - Average + | 11,363 | 0.1 | 0.2 | 0 | 0 | 0 | 1 |
Quality - Average ++ | 11,363 | 0.1 | 0.2 | 0 | 0 | 0 | 1 |
Quality - Excellent | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Quality - Excellent + | 11,363 | 0.002 | 0.04 | 0 | 0 | 0 | 1 |
Quality - Excellent ++ | 11,363 | 0.003 | 0.1 | 0 | 0 | 0 | 1 |
Quality - Exceptional 1 | 11,363 | 0.002 | 0.05 | 0 | 0 | 0 | 1 |
Quality - Exceptional 2 | 11,363 | 0.000 | 0.02 | 0 | 0 | 0 | 1 |
Quality - Fair | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Quality - Good | 11,363 | 0.3 | 0.5 | 0 | 0 | 1 | 1 |
Quality - Good + | 11,363 | 0.1 | 0.2 | 0 | 0 | 0 | 1 |
Quality - Good ++ | 11,363 | 0.1 | 0.3 | 0 | 0 | 0 | 1 |
Quality - Low | 11,363 | 0.002 | 0.04 | 0 | 0 | 0 | 1 |
Quality - Very good | 11,363 | 0.1 | 0.3 | 0 | 0 | 0 | 1 |
Quality - Very good + | 11,363 | 0.02 | 0.1 | 0 | 0 | 0 | 1 |
Quality - Very good ++ | 11,363 | 0.03 | 0.2 | 0 | 0 | 0 | 1 |
Construction type - Brick | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Construction type - Concrete | 11,363 | 0.001 | 0.02 | 0 | 0 | 0 | 1 |
Construction type - Frame | 11,363 | 0.8 | 0.4 | 0 | 1 | 1 | 1 |
Construction type - Masonry | 11,363 | 0.1 | 0.3 | 0 | 0 | 0 | 1 |
Construction type - Precast | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Construction type - Veneer | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Construction type - Wood | 11,363 | 0.000 | 0.02 | 0 | 0 | 0 | 1 |
Year of construction or last major change | 11,363 | 1,997.7 | 17.5 | 1,879 | 1,989 | 2,010 | 2,021 |
Basement square feet | 11,363 | 743.8 | 627.4 | 0 | 134 | 1,125 | 4,534 |
Car storage type - GRA | 11,363 | 0.8 | 0.4 | 0 | 1 | 1 | 1 |
Car storage type - GRB | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Car storage type - GRC | 11,363 | 0.02 | 0.1 | 0 | 0 | 0 | 1 |
Car storage type - GRD | 11,363 | 0.1 | 0.3 | 0 | 0 | 0 | 1 |
Car storage type - GRF | 11,363 | 0.001 | 0.03 | 0 | 0 | 0 | 1 |
Car storage type - GRW | 11,363 | 0.001 | 0.03 | 0 | 0 | 0 | 1 |
Air conditioning - Attic fan | 11,363 | 0.002 | 0.04 | 0 | 0 | 0 | 1 |
Air conditioning - Evaporative cooler | 11,363 | 0.02 | 0.1 | 0 | 0 | 0 | 1 |
Air conditioning - Whole house | 11,363 | 0.6 | 0.5 | 0 | 0 | 1 | 1 |
Heating - Electric | 11,363 | 0.02 | 0.2 | 0 | 0 | 0 | 1 |
Heating - Electric wall heat | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Heating - Forced air | 11,363 | 0.9 | 0.3 | 0 | 1 | 1 | 1 |
Heating - Gravity | 11,363 | 0.002 | 0.05 | 0 | 0 | 0 | 1 |
Heating - Heat pump | 11,363 | 0.001 | 0.03 | 0 | 0 | 0 | 1 |
Heating - Hot water | 11,363 | 0.1 | 0.2 | 0 | 0 | 0 | 1 |
Heating - No HVAC | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Heating - Package unit | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Heating - Radiant floor | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Heating - Ventilation only | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Heating - Wall furnace | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
External wall - Block stucco | 11,363 | 0.001 | 0.02 | 0 | 0 | 0 | 1 |
External wall - Brick on block | 11,363 | 0.001 | 0.03 | 0 | 0 | 0 | 1 |
External wall - Brick veneer | 11,363 | 0.1 | 0.2 | 0 | 0 | 0 | 1 |
External wall - Cedar | 11,363 | 0.000 | 0.02 | 0 | 0 | 0 | 1 |
External wall - Cement board | 11,363 | 0.002 | 0.04 | 0 | 0 | 0 | 1 |
External wall - Faux stone | 11,363 | 0.1 | 0.2 | 0 | 0 | 0 | 1 |
External wall - Frame asbestos | 11,363 | 0.001 | 0.02 | 0 | 0 | 0 | 1 |
External wall - Frame stucco | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
External wall - Frame wood/shake | 11,363 | 0.2 | 0.4 | 0 | 0 | 0 | 1 |
External wall - Log | 11,363 | 0.000 | 0.02 | 0 | 0 | 0 | 1 |
External wall - Metal | 11,363 | 0.001 | 0.03 | 0 | 0 | 0 | 1 |
External wall - Moss rock/flagstone | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
External wall - Painted block | 11,363 | 0.000 | 0.02 | 0 | 0 | 0 | 1 |
External wall - Vinyl | 11,363 | 0.001 | 0.03 | 0 | 0 | 0 | 1 |
Roof - Asphalt | 11,363 | 0.7 | 0.5 | 0 | 0 | 1 | 1 |
Roof - Built-up | 11,363 | 0.001 | 0.02 | 0 | 0 | 0 | 1 |
Roof - Clay tile | 11,363 | 0.004 | 0.1 | 0 | 0 | 0 | 1 |
Roof - Concrete tile | 11,363 | 0.02 | 0.1 | 0 | 0 | 0 | 1 |
Roof - Metal | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Roof - Roll | 11,363 | 0.000 | 0.01 | 0 | 0 | 0 | 1 |
Roof - Rubber membrane | 11,363 | 0.01 | 0.1 | 0 | 0 | 0 | 1 |
Roof - Shake | 11,363 | 0.004 | 0.1 | 0 | 0 | 0 | 1 |
Roof - Tar and gravel | 11,363 | 0.000 | 0.02 | 0 | 0 | 0 | 1 |
Number of bathrooms | 11,363 | 2.5 | 1.0 | 0 | 1.8 | 3.2 | 12 |
The second group of predictors concerns accessibility to recreation, services, transportation, and other types of amenities (see Table 2). The amenity data were downloaded from OpenStreetMap via an API.
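As a sketch of how such accessibility features can be derived (the exact helper used in our script is not shown here, and the object names `homes` and `restaurants` are assumptions), the log distance from each home to its nearest restaurant can be computed with sf, provided both layers share a projected CRS measured in meters:

```r
# Illustrative sketch only -- `homes` and `restaurants` are assumed sf point layers
# in a projected CRS (meters), not objects from the original script.
library(sf)
dist_matrix <- st_distance(homes, restaurants)          # homes x restaurants distances
LN_restaurants_nn1 <- log(apply(dist_matrix, 1, min))   # log distance to nearest restaurant
```

The same pattern (minimum, or mean of the k smallest distances) yields the other `LN_*_nn*` predictors.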
Amenities = model.matrix(~ cinemasCount+
stadiumsCount+
fitnessCenter_nn3+
LN_commonLeisure_nn1+
parkingArea+
LN_restaurants_nn1+
LN_hospitals_nn1+
LN_retail_nn4+
LN_parkArea+
LN_water_nn1+
LN_public_transport_nn2
- 1,
data = trainData1) %>%
as.data.frame()
stargazer(Amenities,
type = "html",
title = "Table 2. Summary statistics of predictors - accessibility to recreation, services and other amenities",
digits = 1,
out.header = TRUE,
covariate.labels = c("# of cinemas within 1000 m buffer",
"# of stadiums within 1000 m buffer",
"Avg. distance to 3 nearest fitness centers",
"Log. distance to nearest common leisure ground",
"Total area of parking within 1000 m buffer",
"Log. distance to nearest restaurant",
"Log. distance to nearest hospital",
"Avg. distance to four nearest retail points, log.",
"Total area of parks within 1000 m buffer, log.",
"Log. distance to nearest water body",
"Avg. distance to two nearest public transit stations, log."),
flip = FALSE)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
# of cinemas within 1000 m buffer | 11,363 | 0.2 | 2.0 | 0 | 0 | 0 | 40 |
# of stadiums within 1000 m buffer | 11,363 | 0.1 | 1.3 | 0 | 0 | 0 | 22 |
Avg. distance to 3 nearest fitness centers | 11,363 | 3,187.2 | 3,612.7 | 38.8 | 1,265.6 | 3,554.3 | 34,166.8 |
Log. distance to nearest common leisure ground | 11,363 | 7.6 | 1.0 | 3.1 | 7.0 | 8.2 | 10.4 |
Total area of parking within 1000 m buffer | 11,363 | 37,985.3 | 47,920.3 | 0.0 | 1,912.8 | 53,898.5 | 450,407.2 |
Log. distance to nearest restaurant | 11,363 | 6.9 | 0.8 | 2.4 | 6.4 | 7.4 | 9.4 |
Log. distance to nearest hospital | 11,363 | 8.3 | 0.7 | 4.8 | 7.9 | 8.7 | 10.5 |
Avg. distance to four nearest retail points, log. | 11,363 | 7.2 | 0.9 | 2.2 | 6.7 | 7.7 | 10.2 |
Total area of parks within 1000 m buffer, log. | 11,363 | 10.3 | 3.8 | 0 | 10.8 | 12.4 | 14 |
Log. distance to nearest water body | 11,363 | 6.4 | 0.7 | 3.4 | 6.0 | 6.9 | 8.7 |
Avg. distance to two nearest public transit stations, log. | 11,363 | 9.1 | 0.9 | 4.7 | 8.5 | 9.9 | 10.5 |
The third group of predictors is demographic/socioeconomic: the block-group-level median rent and the block-group-level percentage of African American residents. These data come from the ACS (5-year estimates, 2019).
Social = model.matrix(~ Census_MedRent+
Census_pctAfricanAmericans
- 1,
data = trainData1) %>%
as.data.frame()
stargazer(Social,
type = "html",
title = "Table 3. Summary statistics of predictors - demographic and socioeconomic",
digits = 1,
out.header = TRUE,
covariate.labels = c("Block-group median rent",
"Block-group percentage of African Americans"),
flip = FALSE)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
Block-group median rent | 11,363 | 1,403.9 | 663.0 | 0 | 1,154 | 1,806 | 3,281 |
Block-group percentage of African Americans | 11,363 | 0.7 | 1.5 | 0 | 0 | 0.9 | 18 |
Finally, we include predictors related to spatial structure. The first is the census tract that a home belongs to. The second is the "spatial lag", that is, the average price of the five nearest homes. Because home prices tend to form high- and low-price clusters in space, these spatial variables capture that phenomenon. We include only one of the two in any given prediction model.
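As a sketch of how the spatial lag can be computed (the objects `coords`, a coordinate matrix, and `price`, the sale-price vector, are assumptions; the FNN package provides the nearest-neighbor search):

```r
# Illustrative sketch -- `coords` and `price` are assumed objects, not from the original script.
library(FNN)
knn <- get.knnx(coords, coords, k = 6)       # k = 6 because each home matches itself first
neighbors <- knn$nn.index[, -1]              # drop the self-match in column 1
spatialLag <- apply(neighbors, 1, function(idx) mean(price[idx]))  # mean price of 5 nearest homes
```

For homes in the prediction set, the query points would instead be searched against the training homes so that the lag never uses a home's own (unknown) price.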
After feature selection, we can now plot a correlation matrix of the selected numeric variables (Figure 5).
#-----Numeric columns-----
numericColumns4 = unlist(lapply(trainData3, is.numeric))
trainDataNumeric4 = trainData3[,numericColumns4] %>%
dplyr::select(-toPredict)%>%
na.omit()%>%
st_drop_geometry()
#-----Plot correlation matrix-----
Matrix <- select_if(trainDataNumeric4, is.numeric) %>% na.omit()
ggcorrplot(
cor(Matrix),
p.mat = cor_pmat(Matrix),
colors = c("#EF8A62", "white", "#67A9CF"),
type="lower",
lab = TRUE,
lab_size = 2.5,
tl.cex = 9,
insig = "blank")
Figure 6 shows scatter plots of home sale price against four predictors of interest. The first two depict the relationship between sale prices and racial composition: a high percentage of white residents is strongly correlated with higher home prices, while the percentages of several other racial groups (e.g., Asian residents) are negatively correlated with prices. The third plot shows the relationship between hospital accessibility and prices; it is not an entirely linear relationship, as being either very near or very far from a hospital can indicate lower home prices. The last plot suggests no straightforward relationship between rent and home prices.
trainDataNumeric5 = trainDataNumeric4 %>%
gather(key,value, -price)
# 1
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "Census_pctAsians")
ggplot(trainDataNumeric5, aes(x = value, y = price))+
geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
labs(title="Scatter plot of price to block-group percentage of Asian residents") +
plotTheme()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=2))
# 2
trainDataNumeric5 = trainDataNumeric4 %>%
gather(key,value, -price)
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "Census_pctWhite")
ggplot(trainDataNumeric5, aes(x = value, y = price))+
geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
labs(title="Scatter plot of price to block-group percentage of white residents") +
plotTheme()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=2))
# 3
trainDataNumeric5 = trainDataNumeric4 %>%
gather(key,value, -price)
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "LN_hospitals_nn1")%>%
mutate(EXP=exp(value))
ggplot(trainDataNumeric5, aes(x = EXP, y = price))+
geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
labs(title="Scatter plot of price to distance to nearest hospital") +
plotTheme()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=2))
# 4
trainDataNumeric5 = trainDataNumeric4 %>%
gather(key,value, -price)
trainDataNumeric5 <- trainDataNumeric5 %>% filter(key == "Census_MedRent")
ggplot(trainDataNumeric5, aes(x = value, y = price))+
geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.3)+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
labs(title="Scatter plot of price to block-group median rent") +
plotTheme()+
theme(axis.text=element_text(size=12),axis.title=element_text(size=2))
Figure 7 maps the spatial distribution of three predictors of interest. The first map shows total finished home area, which is clearly clustered in space. The second and third maps suggest that living near highways or water bodies confers no particular price advantage.
plotuse = trainData[trainData$toPredict == 0,]
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = plotuse, aes(color = q5(TotalFinishedSF)), size = 2,alpha=0.3)+
scale_color_manual(values = palette5,
labels = qBr(plotuse, "TotalFinishedSF"),
name = "Total finished area sf") +
labs(
title = "Distribution of total finished area square feet",
subtitle = "Square Feet; In tract unit") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
highwayunion <- st_buffer(st_union(highway),500)%>%st_sf()
highwayunion2 <- st_buffer(st_union(highway),2000)%>%st_sf()
highwayunion3 <- st_buffer(st_union(highway),4000)%>%st_sf()
ggplot()+
geom_sf(data = highwayunion3, color="#DD1144", size = 1,fill="white",linetype = "dashed")+
geom_sf(data = highwayunion2, color="#DD1144", size = 1.25,fill="white",linetype = "dashed")+
geom_sf(data = highwayunion, color="#DD1144", size = 1.35,fill="white",linetype = "dashed")+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = plotuse, aes(color = q5(price)), size = 3,alpha=0.5)+
scale_color_manual(values = palette5,
labels = qBr(trainData, "price"),
name = "Home sale prices")+
labs(
title = "Home prices with distance to highways",
subtitle = "Real Dollars; In tract unit") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
water_plot <-st_intersection(boulder,water)
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = plotuse, aes(color = q5(price)), size = 3,alpha=0.5)+
geom_sf(data = water_plot,fill = alpha("#2166ac", 0.4),color="darkblue")+
scale_color_manual(values = palette5,
labels = qBr(trainData, "price"),
name = "Home sale prices")+
labs(
title = "Home prices with distance to water bodies",
subtitle = "Real Dollars; In tract unit") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
First, we built two OLS models using the predictors selected in Section 3.3 (Feature selection). The first model uses tract ID as its spatial component, while the second uses the spatial lag (the average price of the five nearest homes). Both models prove tenable, with significant F-statistics and adequate goodness of fit, but the first model is statistically stronger.
#-----Use Tract as geospatial variable-----
dropList2 = c("geometry", "MUSA_ID", "spatialLag", "muni","toPredict")
reg2Data = dataTrain.training[, !colnames(dataTrain.training) %in% dropList2]%>%
st_drop_geometry()
reg.baseline.fixedEffect = lm(price ~ ., data = reg2Data)
#-----Use spatial lag as geospatial variable-----
dropList3 = c("geometry", "MUSA_ID", "tractID", "muni", "toPredict")
reg3Data = dataTrain.training[, !colnames(dataTrain.training) %in% dropList3]%>%
st_drop_geometry()
reg.baseline.spatialLag = lm(price ~ ., data = reg3Data)
stargazer(reg.baseline.fixedEffect, reg.baseline.spatialLag,
type = "html",
title = "Table 4. Model summary of the first two linear models",
dep.var.labels = "Home Price")
Dependent variable: Home Price | (1) | (2) |
designCodeDscr2-3 Story | -67,265.080*** | -77,150.940*** |
(9,666.445) | (10,203.880) | |
designCodeDscrBi-level | 43,116.650** | 60,018.260*** |
(20,787.460) | (21,941.210) | |
designCodeDscrMULTI STORY- TOWNHOUSE | -195,140.900*** | -214,211.800*** |
(14,205.440) | (14,462.970) | |
designCodeDscrSplit-level | 6,803.756 | 24,175.500* |
(13,903.520) | (14,655.490) | |
qualityCodeDscrAVERAGE + | -61,061.020*** | -42,830.950*** |
(14,917.640) | (15,588.400) | |
qualityCodeDscrAVERAGE ++ | -24,284.220 | -14,395.580 |
(15,416.160) | (16,252.000) | |
qualityCodeDscrEXCELLENT | 1,045,202.000*** | 1,145,270.000*** |
(34,146.120) | (36,028.290) | |
qualityCodeDscrEXCELLENT + | 1,255,809.000*** | 1,371,288.000*** |
(70,747.750) | (75,599.850) | |
qualityCodeDscrEXCELLENT++ | 2,141,649.000*** | 2,278,119.000*** |
(60,513.450) | (64,538.440) | |
qualityCodeDscrEXCEPTIONAL 1 | 1,192,994.000*** | 1,234,443.000*** |
(75,564.810) | (80,438.780) | |
qualityCodeDscrEXCEPTIONAL 2 | 1,750,635.000*** | 1,844,428.000*** |
(227,366.900) | (242,478.100) | |
qualityCodeDscrFAIR | -68,085.910* | -58,087.280 |
(39,642.070) | (42,332.650) | |
qualityCodeDscrGOOD | 49,411.010*** | 64,122.750*** |
(11,207.680) | (11,351.330) | |
qualityCodeDscrGOOD + | 73,595.990*** | 95,644.880*** |
(18,049.580) | (18,838.630) | |
qualityCodeDscrGOOD ++ | 155,902.700*** | 169,788.000*** |
(16,965.380) | (17,295.230) | |
qualityCodeDscrLOW | -131,013.500 | -110,461.800 |
(82,047.980) | (87,968.320) | |
qualityCodeDscrVERY GOOD | 243,143.800*** | 251,717.600*** |
(18,227.420) | (19,032.230) | |
qualityCodeDscrVERY GOOD + | 461,494.100*** | 539,076.900*** |
(29,566.560) | (31,255.870) | |
qualityCodeDscrVERY GOOD ++ | 647,132.300*** | 684,105.900*** |
(25,968.190) | (27,260.030) | |
ConstCodeDscrBrick | -27,786.770 | 4,811.586 |
(54,143.470) | (57,757.630) | |
ConstCodeDscrConcrete | 41,465.310 | 59,838.590 |
(148,239.700) | (158,730.100) | |
ConstCodeDscrFrame | -47,796.420 | -43,168.010 |
(43,549.860) | (46,723.150) | |
ConstCodeDscrMasonry | -15,204.260 | 14,970.310 |
(45,293.530) | (48,481.400) | |
ConstCodeDscrPrecast | -1,037,887.000*** | -1,083,050.000*** |
(221,952.300) | (238,206.600) | |
ConstCodeDscrVeneer | -959,315.300*** | -895,479.900*** |
(309,357.100) | (332,372.600) | |
ConstCodeDscrWood | -154,771.400 | -171,664.800 |
(182,918.200) | (196,540.700) | |
EffectiveYear | 1,740.945*** | 1,185.249*** |
(314.359) | (331.305) | |
carStorageTypeGRA | 21,273.260 | 1,116.081 |
(14,434.770) | (15,082.820) | |
carStorageTypeGRB | 172,914.700*** | 231,393.900*** |
(29,835.480) | (31,822.620) | |
carStorageTypeGRC | -46,923.640* | -75,262.190*** |
(26,405.090) | (28,059.280) | |
carStorageTypeGRD | 75,319.010*** | 91,998.000*** |
(15,839.030) | (16,796.260) | |
carStorageTypeGRF | 31,654.000 | 46,173.570 |
(107,726.900) | (115,504.300) | |
carStorageTypeGRW | -96,763.050 | -119,198.200 |
(124,177.100) | (133,111.900) | |
TotalFinishedSF | 148.850*** | 150.943*** |
(7.339) | (7.713) | |
AcDscrAttic Fan | -113,612.800 | -80,483.710 |
(75,681.780) | (81,202.840) | |
AcDscrEvaporative Cooler | -11,090.340 | 20,542.310 |
(24,033.510) | (25,643.900) | |
AcDscrWhole House | 13,057.710 | 1,958.156 |
(8,834.608) | (9,411.818) | |
HeatingDscrElectric | -155,160.600*** | -154,427.200*** |
(45,022.800) | (48,126.510) | |
HeatingDscrElectric Wall Heat (1500W) | 461,577.900 | 467,516.900 |
(355,230.400) | (382,479.400) | |
HeatingDscrForced Air | -133,427.500*** | -147,714.100*** |
(41,217.570) | (44,191.930) | |
HeatingDscrGravity | -103,215.400 | -132,309.500 |
(77,602.240) | (83,173.440) | |
HeatingDscrHeat Pump | -437,700.400*** | -453,707.400*** |
(122,420.000) | (131,190.900) | |
HeatingDscrHot Water | -71,249.180* | -72,332.070 |
(42,676.500) | (45,704.120) | |
HeatingDscrNo HVAC | 93,647.100 | -146,636.300 |
(308,176.200) | (331,797.400) | |
HeatingDscrPackage Unit | 121,777.100 | -37,404.890 |
(339,834.300) | (364,940.300) | |
HeatingDscrRadiant Floor | 86,490.880 | 139,402.800** |
(53,929.390) | (57,784.330) | |
HeatingDscrVentilation Only | -575,388.500* | -578,355.800* |
(322,338.700) | (346,507.200) | |
HeatingDscrWall Furnace | 69,532.930 | 62,015.280 |
(54,752.750) | (58,555.260) | |
IntWallDscrDrywall | 11,707.580 | 47,702.370 |
(47,013.250) | (50,276.890) | |
IntWallDscrHardwood Panel | -14,591.660 | 14,678.100 |
(53,548.530) | (57,199.440) | |
IntWallDscrPlaster | -58,804.930 | -5,127.843 |
(50,751.920) | (53,766.460) | |
IntWallDscrPlywood | 7,808.966 | 31,978.380 |
(76,437.730) | (81,749.600) | |
IntWallDscrUnfinished | -16,712.270 | 21,446.430 |
(72,025.050) | (77,236.660) | |
IntWallDscrWall Board | -4,333.807 | 24,577.960 |
(55,116.160) | (58,873.550) | |
Roof_CoverDscrAsphalt | -18,821.520** | -3,139.079 |
(8,344.362) | (8,843.004) | |
Roof_CoverDscrBuilt-Up | 441,918.200*** | 305,235.200** |
(129,299.900) | (138,552.400) | |
Roof_CoverDscrClay Tile | -73,781.680 | -54,477.910 |
(49,259.860) | (52,736.190) | |
Roof_CoverDscrConcrete Tile | -116,475.300*** | -198,037.600*** |
(26,701.200) | (26,645.560) | |
Roof_CoverDscrMetal | 184,652.600*** | 221,708.900*** |
(27,579.260) | (29,381.460) | |
Roof_CoverDscrRoll | -55,269.810 | -86,787.420 |
(301,098.200) | (323,591.400) | |
Roof_CoverDscrRubber Membrane | 123,838.500*** | 95,807.980*** |
(29,322.110) | (30,768.920) | |
Roof_CoverDscrShake | -55,876.160 | -24,969.230 |
(47,719.680) | (51,188.980) | |
Roof_CoverDscrTar and Gravel | 110,205.700 | 206,619.100 |
(152,767.900) | (163,660.000) | |
nbrBaths | 37,552.900*** | 37,398.950*** |
(5,810.065) | (6,193.296) | |
schoolCount | -3,180.243 | 7,430.762* |
(4,853.008) | (4,500.771) | |
universitiesCount | -77,800.660*** | -112,032.000*** |
(18,199.630) | (10,882.920) | |
stadiumsCount | 5,304.644 | 6,996.879** |
(3,691.543) | (3,316.121) | |
highwaydistance | -5.460* | -21.942*** |
(3.260) | (1.845) | |
natureReservesCount | -9,535.534 | -15,683.950*** |
(9,650.467) | (5,570.419) | |
Census_MedHHInc | 0.405*** | -0.170 |
(0.151) | (0.130) | |
Census_MedRent | -24.607*** | -51.306*** |
(7.171) | (6.048) | |
Census_pctWhite | 2,085.728* | 1,885.058** |
(1,102.564) | (884.525) | |
Census_pctAsians | 3,924.150** | -3,836.992*** |
(1,624.937) | (1,304.387) | |
tractID08013012102 | -418,740.700*** | |
(36,585.710) | ||
tractID08013012103 | -379,483.300*** | |
(44,666.860) | ||
tractID08013012104 | -378,629.700*** | |
(49,055.400) | ||
tractID08013012105 | -530,923.000*** | |
(45,462.870) | ||
tractID08013012201 | -28,978.910 | |
(46,989.370) | ||
tractID08013012202 | -397,826.500*** | |
(70,256.720) | ||
tractID08013012203 | -743,343.200*** | |
(49,563.910) | ||
tractID08013012204 | 326,435.200*** | |
(84,418.530) | ||
tractID08013012300 | -153,458.300 | |
(228,450.700) | ||
tractID08013012401 | -384,177.900*** | |
(63,821.090) | ||
tractID08013012501 | -665,005.800*** | |
(61,490.850) | ||
tractID08013012505 | -364,354.000*** | |
(48,419.950) | ||
tractID08013012507 | -608,439.200*** | |
(56,806.440) | ||
tractID08013012508 | -613,007.100*** | |
(63,205.530) | ||
tractID08013012509 | -603,515.600*** | |
(52,192.670) | ||
tractID08013012510 | -409,037.200*** | |
(52,640.320) | ||
tractID08013012511 | -550,204.600*** | |
(65,612.770) | ||
tractID08013012603 | -544,724.900*** | |
(57,572.960) | ||
tractID08013012605 | -410,273.100*** | |
(147,491.600) | ||
tractID08013012607 | -427,588.300*** | |
(101,764.800) | ||
tractID08013012608 | -521,065.700*** | |
(65,564.520) | ||
tractID08013012701 | -733,190.200*** | |
(44,668.220) | ||
tractID08013012705 | -763,648.800*** | |
(70,582.130) | ||
tractID08013012707 | -345,834.200*** | |
(71,701.920) | ||
tractID08013012708 | -834,672.200*** | |
(54,920.570) | ||
tractID08013012709 | -790,427.600*** | |
(66,281.990) | ||
tractID08013012710 | -578,812.300*** | |
(53,243.120) | ||
tractID08013012800 | -1,077,702.000*** | |
(58,604.450) | ||
tractID08013012903 | -997,037.300*** | |
(65,778.630) | ||
tractID08013012904 | -827,024.100*** | |
(61,232.560) | ||
tractID08013012905 | -794,779.700*** | |
(69,817.560) | ||
tractID08013012907 | -850,862.900*** | |
(64,763.720) | ||
tractID08013013003 | -817,530.300*** | |
(54,861.260) | ||
tractID08013013004 | -813,165.700*** | |
(63,344.960) | ||
tractID08013013005 | -755,164.700*** | |
(56,139.790) | ||
tractID08013013006 | -705,957.900*** | |
(57,441.300) | ||
tractID08013013201 | -834,223.600*** | |
(79,700.820) | ||
tractID08013013202 | -377,458.400*** | |
(78,885.100) | ||
tractID08013013205 | -904,112.600*** | |
(56,345.410) | ||
tractID08013013207 | -866,987.900*** | |
(84,685.050) | ||
tractID08013013208 | -793,986.400*** | |
(79,630.560) | ||
tractID08013013210 | -825,718.600*** | |
(68,195.420) | ||
tractID08013013211 | -939,934.900*** | |
(61,903.830) | ||
tractID08013013212 | -774,618.000*** | |
(66,619.980) | ||
tractID08013013213 | -907,936.100*** | |
(60,814.830) | ||
tractID08013013302 | -737,533.600*** | |
(75,332.700) | ||
tractID08013013305 | -843,661.600*** | |
(81,950.110) | ||
tractID08013013306 | -831,020.100*** | |
(79,318.740) | ||
tractID08013013307 | -753,824.700*** | |
(88,113.390) | ||
tractID08013013308 | -778,162.100*** | |
(80,774.680) | ||
tractID08013013401 | -846,293.700*** | |
(80,252.760) | ||
tractID08013013402 | -924,350.400*** | |
(73,095.610) | ||
tractID08013013503 | -883,155.500*** | |
(77,118.890) | ||
tractID08013013505 | -855,633.800*** | |
(81,522.650) | ||
tractID08013013506 | -971,831.100*** | |
(73,375.000) | ||
tractID08013013507 | -961,319.600*** | |
(77,866.650) | ||
tractID08013013508 | -978,675.800*** | |
(72,390.300) | ||
tractID08013013601 | -731,484.200*** | |
(70,771.560) | ||
tractID08013013602 | -674,953.800*** | |
(79,019.540) | ||
tractID08013013701 | -774,020.300*** | |
(46,089.190) | ||
tractID08013013702 | -888,737.800*** | |
(55,055.930) | ||
tractID08013060600 | -943,009.300*** | |
(56,045.910) | ||
tractID08013060700 | -743,598.100*** | |
(65,471.530) | ||
tractID08013060800 | -763,505.000*** | |
(63,246.270) | ||
tractID08013060900 | -806,999.300*** | |
(66,997.740) | ||
tractID08013061300 | -1,028,000.000*** | |
(58,934.820) | ||
tractID08013061400 | -1,017,687.000*** | |
(56,446.270) | ||
LN_Census_areaperpeople | 7,776.529 | 9,370.015** |
(4,959.982) | (3,928.262) | |
LN_water_nn1 | -9,870.036 | -6,671.373 |
(6,562.767) | (5,838.722) | |
LN_hospitals_nn1 | 85,739.700*** | 66,229.150*** |
(18,099.700) | (8,431.987) | |
LN_commonLeisure_nn4 | 22,355.520* | 140,926.100*** |
(13,436.220) | (5,797.134) | |
LN_schools_nn2 | -48,298.500*** | -62,143.940*** |
(11,842.080) | (10,323.740) | |
LN_restaurants_nn1 | -2,077.858 | 12,205.340** |
(6,742.155) | (5,916.300) | |
LN_clinics_nn1 | 16,407.610 | -31,743.490*** |
(10,916.310) | (6,892.401) | |
LN_public_transport_nn2 | -88,451.570*** | -189,045.900*** |
(21,977.290) | (6,178.566) | |
LN_parkArea | -5,692.841*** | -1,107.513 |
(1,490.685) | (1,322.252) | |
bsmtSF | -11.448* | -24.669*** |
(6.833) | (7.183) | |
LN_companies_nn5 | 28,083.780** | 16,869.700** |
(13,740.640) | (7,443.528) | |
ExtWallDscrPrimBlock Stucco | 198,781.800** | 209,105.200** |
(89,678.590) | (96,180.410) | |
ExtWallDscrPrimBrick on Block | 364,518.500*** | 314,994.900*** |
(85,734.000) | (91,635.500) | |
ExtWallDscrPrimBrick Veneer | 151,944.000* | 73,422.470 |
(80,363.470) | (86,230.570) | |
ExtWallDscrPrimCedar | 107,901.300 | 33,215.790 |
(138,346.900) | (148,536.900) | |
ExtWallDscrPrimCement Board | 110,175.300 | 10,306.900 |
(80,106.540) | (85,923.450) | |
ExtWallDscrPrimFaux Stone | 245,983.300*** | 152,098.000 |
(91,457.300) | (98,150.040) | |
ExtWallDscrPrimFrame Asbestos | 323,728.300 | 113,814.600 |
(312,968.000) | (336,163.500) | |
ExtWallDscrPrimFrame Stucco | 136,169.400* | 75,203.220 |
(80,917.060) | (86,801.970) | |
ExtWallDscrPrimFrame Wood/Shake | 196,182.100** | 120,214.100 |
(79,783.450) | (85,585.950) | |
ExtWallDscrPrimLog | 178,046.100** | 126,497.200 |
(88,641.160) | (94,804.180) | |
ExtWallDscrPrimMetal | 135,744.100 | 117,659.200 |
(133,552.000) | (143,261.900) | |
ExtWallDscrPrimMoss Rock/Flagstone | 197,348.100** | 154,026.500* |
(83,942.060) | (90,020.550) | |
ExtWallDscrPrimPainted Block | 167,901.700* | 52,411.890 |
(100,341.700) | (107,781.900) | |
ExtWallDscrPrimStrawbale | 166,766.700 | -11,265.470 |
(239,300.800) | (256,105.900) | |
ExtWallDscrPrimVinyl | 202,586.200** | 113,999.800 |
(88,821.210) | (95,257.130) | |
fitnessCenter_nn3 | -19.541*** | -8.873*** |
(4.542) | (2.319) | |
gardensCount | -282.591 | 1,033.850 |
(1,876.980) | (1,460.147) | |
Constant | -2,507,530.000*** | -1,490,994.000** |
(686,431.000) | (677,506.900) | |
Observations | 8,837 | 8,837 |
R2 | 0.732 | 0.686 |
Adjusted R2 | 0.727 | 0.683 |
Residual Std. Error | 298,911.400 (df = 8668) | 322,308.300 (df = 8735) |
F Statistic | 141.025*** (df = 168; 8668) | 189.085*** (df = 101; 8735) |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
We then use the two models to predict home prices in the test set and compare the predicted prices with the observed prices. In the first model, the mean absolute error (MAE) is around 139,000 and the mean absolute percentage error (MAPE) is around 22%. In the second model, the MAE is about 164,000 and the MAPE is about 26% (Table 5). The first model is therefore the more accurate of the two.
baseline.test.error = dataTrain.test %>%
mutate(
# Predicted price
price.predict.fixedEffect = predict(reg.baseline.fixedEffect, dataTrain.test),
price.predict.spatialLag = predict(reg.baseline.spatialLag, dataTrain.test),
# Residues between predicted price and real price
price.error.fixedEffect = price.predict.fixedEffect - price,
price.error.spatialLag = price.predict.spatialLag - price,
# Absolute Residues between predicted price and real price
price.absError.fixedEffect = abs(price.predict.fixedEffect - price),
price.absError.spatialLag = abs(price.predict.spatialLag - price),
# Absolute Percentage of Residues between predicted price and real price
price.ape.fixedEffect = (abs(price.predict.fixedEffect - price)) / price,
price.ape.spatialLag = (abs(price.predict.spatialLag - price)) / price
) %>% dplyr::select(starts_with("price"))
MAE1 = mean(baseline.test.error$price.absError.fixedEffect, na.rm = T)
MAE2 = mean(baseline.test.error$price.absError.spatialLag, na.rm = T)
MAPE1 = mean(baseline.test.error$price.ape.fixedEffect, na.rm = T)
MAPE2 = mean(baseline.test.error$price.ape.spatialLag, na.rm = T)
MAE = c(MAE1, MAE2) %>% format(., digits = 3)
MAPE = c(MAPE1, MAPE2) %>% format(., digits = 3)
Model = c("Model with Tract ID", "Model with Spatial Lag")
summaryTable1 = cbind(Model, MAE, MAPE)
kable(summaryTable1, digits = 1, caption = "Table 5. Prediction precision of the first two models") %>%
kable_classic(full_width = T)%>%
footnote()
Model | MAE | MAPE |
---|---|---|
Model with Tract ID | 139179 | 0.221 |
Model with Spatial Lag | 163573 | 0.263 |
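As a sanity check on the two metrics reported above, they can be reproduced on a toy example (the prices below are hypothetical): MAE averages the absolute residuals in dollars, while MAPE scales each absolute residual by the observed price before averaging.

```r
# Toy sanity check of the two error metrics (hypothetical prices)
observed  = c(400000, 650000, 900000)
predicted = c(450000, 600000, 990000)

MAE  = mean(abs(predicted - observed))             # 63,333.33
MAPE = mean(abs(predicted - observed) / observed)  # ~0.10
```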
We have plotted the distribution of prediction errors of the first two models (see Figure 8). The plots show that, although we have accounted for geospatial factors, the prediction errors remain strongly clustered in space.
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = baseline.test.error, aes(color = q5(price.absError.fixedEffect)), size = 3,alpha=0.5)+
scale_color_manual(values = palette5,
labels = qBr(baseline.test.error, "price.absError.fixedEffect"),
name = "MAE")+
labs(subtitle = "In Tract") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = baseline.test.error, aes(color = q5(price.absError.spatialLag)), size = 3,alpha=0.5)+
scale_color_manual(values = palette5,
labels = qBr(baseline.test.error, "price.absError.spatialLag"),
name = "MAE")+
labs(subtitle = "In Spatial Lag") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
An analysis with Moran's I confirms this spatial clustering. Both models suffer from significant spatial clustering of errors (Figure 9), which suggests that our models still need improvement in how they capture spatial relationships.
# Moran's I of fixedEffect model
coords.test = st_coordinates(baseline.test.error)
neighborList.test = knn2nb(knearneigh(coords.test, 5))
spatialWeights.test = nb2listw(neighborList.test, style = "W")
moranTest <- moran.mc(baseline.test.error$price.absError.fixedEffect, spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)]))+
geom_histogram(fill="#69b3a2",binwidth = 0.01)+
geom_vline(aes(xintercept = moranTest$statistic), color = "#DC986D", size = 2) +
scale_x_continuous(limits = c(-1, 1))+
labs(x = "Model using tract ID")+
plotTheme()+
theme(plot.title = element_text(size=10),
legend.position="right",
axis.text=element_text(size=10),
axis.title=element_text(size=10))
# Moran's I of spatial-lag model (spatial weights reused from above)
moranTest <- moran.mc(baseline.test.error$price.absError.spatialLag, spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)]))+
geom_histogram(fill="#69b3a2",binwidth = 0.01)+
geom_vline(aes(xintercept = moranTest$statistic), color = "#DC986D", size = 2) +
scale_x_continuous(limits = c(-1, 1))+
labs(x = "Model using spatial lag")+
plotTheme()+
theme(plot.title = element_text(size=22),
legend.position="right",
axis.text=element_text(size=22),
axis.title=element_text(size=22))
On this basis, the models suggest that, rather than having a fixed and independent effect on home prices, the spatial structure may alter the way other factors influence home prices. We therefore propose a hierarchical linear model, a more flexible extension of the OLS model. Hierarchical linear modeling (HLM), also known as mixed-effects modeling, is designed to take into account the hierarchical or nested structure of the data.
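A minimal sketch of this specification may help (toy data with hypothetical tract labels; the actual model fitted later in this section uses our full predictor set). In lme4, the `(1 | tractID)` term gives each tract its own intercept drawn from a common distribution, while the slopes remain shared across tracts:

```r
library(lme4)

# Toy data: three hypothetical tracts with different baseline price levels
set.seed(1)
toy = data.frame(
  sqft    = rep(seq(1200, 2600, length.out = 10), 3),
  tractID = rep(c("A", "B", "C"), each = 10)
)
toy$price = 100000 + c(A = 0, B = 200000, C = 400000)[toy$tractID] +
  250 * toy$sqft + rnorm(30, sd = 20000)

# Random-intercept model: one shared slope, tract-specific intercepts
m = lmer(price ~ sqft + (1 | tractID), data = toy)
fixef(m)            # shared slope and grand intercept
ranef(m)$tractID    # per-tract deviations from the grand intercept
```

Compared with tract dummies in plain OLS, the tract intercepts here are treated as draws from a common distribution, which shrinks estimates for tracts with few observations toward the overall mean.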
dataTrain = trainData3[trainData3$toPredict == 0,]
dataTrain$muni[is.na(dataTrain$muni)] = 'Other'
# To avoid errors in running the HLM model, merge some small categories
dataTrain_merge = dataTrain
list1 = c("Electric Wall Heat (1500W)", "No HVAC", "Package Unit", "Heat Pump", "Gravity","Radiant Floor", "Ventilation Only", "Wall Furnace", "")
for (i in list1){
dataTrain_merge$HeatingDscr[dataTrain_merge$HeatingDscr == i] = glue("other{i}")
}
list2 = c("Concrete", "Precast", "Veneer", "Wood", "", "Brick")
for (i in list2){
dataTrain_merge$ConstCodeDscr[dataTrain_merge$ConstCodeDscr == i] = glue("other{i}")
}
list3 = c("", "Block Stucco", "Cedar", "Faux Stone", "Frame Asbestos",
"Log", "Metal", "Painted Block", "Strawbale", "Vinyl")
for (i in list3){
dataTrain_merge$ExtWallDscrPrim[dataTrain_merge$ExtWallDscrPrim == i] = glue("other{i}")
}
list4 = c("", "Built-Up", "Clay Tile", "Roll", "Shake", "Tar and Gravel")
for (i in list4){
dataTrain_merge$Roof_CoverDscr[dataTrain_merge$Roof_CoverDscr == i] = glue("other{i}")
}
list5 =c("08013012204","08013012300","08013012202","08013012605",
"08013012607", "08013012705","08013012707","08013012709",
"08013013201","08013013202","08013013205","08013013602")
for (i in list5){
dataTrain_merge$tractID[dataTrain_merge$tractID == i] = glue("other{i}")
}
dataTrain = dataTrain_merge
# Split training and test data-sets
rownames(dataTrain) <- seq_len(nrow(dataTrain))
inTrain = createDataPartition(y = paste(
dataTrain$ConstCodeDscr,
dataTrain$qualityCodeDscr,
dataTrain$tractID),
p = .75, list = FALSE)
dataTrain.training = dataTrain[inTrain,]
dataTrain.test = dataTrain[-inTrain,]
# HLM model using tract ID
reg.lmer.fixedEffect = lmer(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear + bsmtSF + carStorageType + AcDscr + HeatingDscr + ExtWallDscrSec + Roof_CoverDscr + nbrBaths + schoolNo + universitiesCount + parkingArea + cinemasCount + stadiumsCount + fitnessCenter_nn3 + Census_MedRent + Census_pctAfricanAmericans + LN_commonLeisure_nn1 + LN_hospitals_nn1 +LN_restaurants_nn1 + LN_retail_nn4 + LN_parkArea + LN_public_transport_nn2 +LN_water_nn1 + TotalFinishedSF+
(1|tractID), data = dataTrain.training)
lmer.test.error = dataTrain.test %>%
mutate(
price.predict = predict(reg.lmer.fixedEffect, newdata = dataTrain.test),
price.error = price.predict - price,
price.absError = abs(price.predict - price),
price.ape = (abs(price.predict - price)) / price.predict)
MAPE3 = mean(lmer.test.error$price.ape, na.rm = T)
stargazer(reg.lmer.fixedEffect,
type = "html",
title = "Table 5. Model summary of the HLM model",
dep.var.labels = "Home Price")
Dependent variable: | |
Home Price | |
designCodeDscr2-3 Story | -72,545.540*** |
(9,711.402) | |
designCodeDscrBi-level | 30,283.060 |
(20,624.210) | |
designCodeDscrMULTI STORY- TOWNHOUSE | -197,077.800*** |
(14,922.650) | |
designCodeDscrSplit-level | -162.388 |
(14,124.740) | |
qualityCodeDscrAVERAGE + | -46,276.920*** |
(14,774.230) | |
qualityCodeDscrAVERAGE ++ | -20,164.300 |
(15,342.600) | |
qualityCodeDscrEXCELLENT | 1,023,333.000*** |
(33,629.360) | |
qualityCodeDscrEXCELLENT + | 1,191,657.000*** |
(67,703.320) | |
qualityCodeDscrEXCELLENT++ | 1,911,265.000*** |
(56,798.760) | |
qualityCodeDscrEXCEPTIONAL 1 | 1,063,068.000*** |
(73,317.910) | |
qualityCodeDscrEXCEPTIONAL 2 | 1,563,574.000*** |
(188,391.100) | |
qualityCodeDscrFAIR | -33,157.400 |
(38,129.800) | |
qualityCodeDscrGOOD | 40,466.930*** |
(11,253.580) | |
qualityCodeDscrGOOD + | 64,423.320*** |
(17,715.860) | |
qualityCodeDscrGOOD ++ | 158,647.700*** |
(16,967.090) | |
qualityCodeDscrLOW | -130,494.400 |
(80,036.390) | |
qualityCodeDscrVERY GOOD | 236,203.600*** |
(17,994.590) | |
qualityCodeDscrVERY GOOD + | 465,059.800*** |
(29,285.550) | |
qualityCodeDscrVERY GOOD ++ | 603,071.200*** |
(25,559.450) | |
ConstCodeDscrBrick | -12,859.100 |
(50,952.010) | |
ConstCodeDscrConcrete | 28,872.220 |
(143,797.500) | |
ConstCodeDscrFrame | -10,576.460 |
(39,058.650) | |
ConstCodeDscrMasonry | 13,341.520 |
(40,653.920) | |
ConstCodeDscrPrecast | -821,225.600*** |
(231,648.100) | |
ConstCodeDscrVeneer | -926,917.100*** |
(322,246.800) | |
ConstCodeDscrWood | -42,665.240 |
(172,498.100) | |
EffectiveYear | 1,584.046*** |
(297.922) | |
bsmtSF | -7.040 |
(6.916) | |
carStorageTypeGRA | 37,405.360*** |
(14,317.640) | |
carStorageTypeGRB | 147,117.600*** |
(28,955.770) | |
carStorageTypeGRC | -4,879.090 |
(26,240.690) | |
carStorageTypeGRD | 61,707.630*** |
(15,850.470) | |
carStorageTypeGRF | 107,532.100 |
(110,101.100) | |
carStorageTypeGRW | -58,636.150 |
(104,053.800) | |
AcDscrAttic Fan | -79,899.990 |
(71,283.740) | |
AcDscrEvaporative Cooler | -13,343.290 |
(23,587.530) | |
AcDscrWhole House | 10,361.860 |
(8,878.835) | |
HeatingDscrElectric | -105,542.900*** |
(40,041.750) | |
HeatingDscrElectric Wall Heat (1500W) | 49,495.130 |
(236,278.800) | |
HeatingDscrForced Air | -88,593.730** |
(35,943.780) | |
HeatingDscrGravity | -82,785.920 |
(71,243.340) | |
HeatingDscrHeat Pump | -318,031.600*** |
(115,319.100) | |
HeatingDscrHot Water | -28,341.360 |
(37,468.830) | |
HeatingDscrNo HVAC | 213,396.900 |
(219,931.900) | |
HeatingDscrPackage Unit | 3,865.141 |
(338,362.700) | |
HeatingDscrRadiant Floor | 111,868.300** |
(48,674.100) | |
HeatingDscrVentilation Only | -501,601.400 |
(308,246.300) | |
HeatingDscrWall Furnace | 63,687.660 |
(51,203.990) | |
ExtWallDscrSecBlock Stucco | 908,666.000*** |
(127,702.100) | |
ExtWallDscrSecBrick on Block | -42,264.380 |
(98,955.410) | |
ExtWallDscrSecBrick Veneer | -1,366.410 |
(14,352.200) | |
ExtWallDscrSecCedar | 433,608.800** |
(178,897.700) | |
ExtWallDscrSecCement Board | -12,606.980 |
(80,053.230) | |
ExtWallDscrSecFaux Stone | -6,093.838 |
(15,999.100) | |
ExtWallDscrSecFrame Asbestos | 186,674.500 |
(125,694.400) | |
ExtWallDscrSecFrame Stucco | -46,967.420 |
(30,382.230) | |
ExtWallDscrSecFrame Wood/Shake | -4,179.883 |
(10,205.090) | |
ExtWallDscrSecLog | 40,038.290 |
(154,805.600) | |
ExtWallDscrSecMetal | 43,124.020 |
(95,520.170) | |
ExtWallDscrSecMoss Rock/Flagstone | 10,256.060 |
(35,255.790) | |
ExtWallDscrSecPainted Block | 120,912.400 |
(179,075.700) | |
ExtWallDscrSecVinyl | -19,079.120 |
(116,907.800) | |
Roof_CoverDscrAsphalt | -19,904.090** |
(8,385.809) | |
Roof_CoverDscrBuilt-Up | 114,449.500 |
(144,809.900) | |
Roof_CoverDscrClay Tile | -81,109.460 |
(53,747.290) | |
Roof_CoverDscrConcrete Tile | -116,049.400*** |
(26,820.320) | |
Roof_CoverDscrMetal | 156,150.500*** |
(27,813.380) | |
Roof_CoverDscrRoll | -285,926.300 |
(309,616.900) | |
Roof_CoverDscrRubber Membrane | 119,353.700*** |
(30,067.690) | |
Roof_CoverDscrShake | -109,701.900** |
(52,658.250) | |
Roof_CoverDscrTar and Gravel | 108,187.400 |
(154,584.700) | |
nbrBaths | 39,554.760*** |
(5,786.099) | |
schoolNo1 | 192,300.900 |
(192,251.800) | |
schoolNo12 | -119,864.300 |
(196,499.400) | |
schoolNo13 | -161,541.300 |
(195,477.900) | |
schoolNo14 | -195,113.600 |
(184,817.600) | |
schoolNo15 | -300,800.400 |
(184,928.300) | |
schoolNo16 | 105,217.800 |
(175,728.500) | |
schoolNo17 | 29,316.720 |
(174,740.800) | |
schoolNo18 | 156,905.000 |
(173,374.900) | |
schoolNo2 | 66,824.150 |
(175,009.600) | |
schoolNo21 | 174,364.800 |
(183,585.900) | |
schoolNo22 | 180,331.200 |
(174,869.100) | |
schoolNo23 | 393,430.100** |
(173,845.100) | |
schoolNo24 | 133,414.400 |
(168,335.900) | |
schoolNo25 | 113,651.400 |
(171,733.300) | |
schoolNo26 | 150,971.800 |
(179,127.600) | |
schoolNo27 | 316,918.100* |
(185,175.600) | |
schoolNo28 | 336,067.500* |
(182,302.200) | |
schoolNo29 | 426,510.200** |
(176,021.400) | |
schoolNo3 | 28,220.400 |
(176,339.600) | |
schoolNo30 | 693,750.100*** |
(180,523.300) | |
schoolNo31 | 105,340.200 |
(174,045.400) | |
schoolNo32 | 37,449.280 |
(176,749.000) | |
schoolNo33 | 274,278.600 |
(173,243.300) | |
schoolNo34 | 24,006.640 |
(170,669.400) | |
schoolNo35 | 615,575.900*** |
(176,757.100) | |
schoolNo36 | 171,007.900 |
(176,174.300) | |
schoolNo37 | 192,857.900 |
(181,847.200) | |
schoolNo38 | 144,846.000 |
(176,851.200) | |
schoolNo39 | 21,149.090 |
(173,714.100) | |
schoolNo4 | 68,037.020 |
(175,572.000) | |
schoolNo40 | 275,455.800 |
(183,348.000) | |
schoolNo41 | 173,047.300 |
(174,123.900) | |
schoolNo42 | 81,388.530 |
(175,637.700) | |
schoolNo43 | 110,996.100 |
(176,977.800) | |
schoolNo44 | 428,906.800** |
(199,764.000) | |
schoolNo45 | 349,675.900** |
(164,821.700) | |
schoolNo46 | 369,797.700* |
(210,955.900) | |
schoolNo47 | 186,814.600 |
(184,736.000) | |
schoolNo48 | 177,218.800 |
(173,133.500) | |
schoolNo49 | 249,096.900 |
(180,152.200) | |
schoolNo5 | 61,257.920 |
(181,081.800) | |
schoolNo50 | 571,839.500*** |
(198,270.200) | |
schoolNo51 | 686,919.600*** |
(180,082.600) | |
schoolNo54 | 405,471.500** |
(185,366.500) | |
schoolNo55 | 278,577.300 |
(172,424.000) | |
schoolNo56 | 211,352.400 |
(181,742.600) | |
schoolNo58 | 372,964.500* |
(190,294.700) | |
schoolNo59 | 739,832.200*** |
(179,449.900) | |
schoolNo6 | 19,266.380 |
(174,192.100) | |
schoolNo61 | -47,656.880 |
(234,528.700) | |
schoolNo63 | 134,924.200 |
(174,353.300) | |
schoolNo64 | -12,895.180 |
(192,549.500) | |
schoolNo67 | 203,809.200 |
(167,637.800) | |
schoolNo68 | 323,488.300** |
(164,651.400) | |
schoolNo69 | 295,910.300* |
(171,662.700) | |
schoolNo7 | 82,359.650 |
(185,490.900) | |
schoolNo70 | 566,610.500*** |
(178,149.600) | |
schoolNo71 | 193,386.100 |
(175,177.400) | |
schoolNo72 | 142,885.600 |
(180,158.900) | |
schoolNo73 | 475,764.200** |
(199,578.000) | |
schoolNo74 | 337,472.200** |
(168,337.000) | |
schoolNo75 | 492,803.900*** |
(181,046.300) | |
schoolNo76 | 504,156.000*** |
(173,229.000) | |
schoolNo77 | 518,184.800*** |
(180,048.900) | |
schoolNo78 | 587,631.000*** |
(192,509.300) | |
schoolNo79 | 102,339.000 |
(174,270.600) | |
schoolNo8 | -279.539 |
(182,162.100) | |
schoolNo80 | 20,427.450 |
(176,424.700) | |
schoolNo81 | 108,261.000 |
(176,972.600) | |
schoolNo82 | 562,526.800*** |
(180,760.000) | |
schoolNo83 | 333,340.700* |
(170,383.800) | |
schoolNo84 | 249,962.200 |
(171,229.400) | |
schoolNo85 | 57,035.700 |
(182,273.400) | |
schoolNo86 | 306,487.900* |
(164,455.300) | |
schoolNo87 | -19,465.510 |
(172,936.500) | |
schoolNo88 | 178,433.800 |
(171,346.600) | |
schoolNo89 | 303,843.100* |
(176,713.400) | |
schoolNo91 | 1,210,779.000*** |
(254,248.700) | |
schoolNo92 | 159,554.500 |
(170,914.000) | |
schoolNo93 | 128,712.200 |
(174,927.500) | |
schoolNo94 | 490,524.800*** |
(181,760.100) | |
schoolNo95 | 163,329.900 |
(175,701.800) | |
schoolNo96 | 117,419.700 |
(174,023.400) | |
schoolNo97 | 358,121.400** |
(169,468.100) | |
universitiesCount | -112,973.500*** |
(18,643.440) | |
parkingArea | -0.048 |
(0.160) | |
cinemasCount | -3,921.342 |
(2,632.090) | |
stadiumsCount | 9,597.537** |
(4,327.427) | |
fitnessCenter_nn3 | -16.570*** |
(4.767) | |
Census_MedRent | -30.053*** |
(7.201) | |
Census_pctAfricanAmericans | -6,216.680** |
(2,976.965) | |
LN_commonLeisure_nn1 | -7,396.233 |
(9,287.533) | |
LN_hospitals_nn1 | 64,322.760*** |
(24,152.100) | |
LN_restaurants_nn1 | -8,720.972 |
(8,198.562) | |
LN_retail_nn4 | 8,447.126 |
(12,011.010) | |
LN_parkArea | -6,953.213*** |
(1,579.352) | |
LN_public_transport_nn2 | -163,364.500*** |
(25,394.020) | |
LN_water_nn1 | -15,003.790** |
(6,955.407) | |
TotalFinishedSF | 156.830*** |
(7.206) | |
Constant | -1,630,137.000** |
(663,991.200) | |
Observations | 9,202 |
Log Likelihood | -127,310.600 |
Akaike Inf. Crit. | 254,969.200 |
Bayesian Inf. Crit. | 256,209.400 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
# Plot observed vs. predicted home prices in the test set
ggplot(data = lmer.test.error, aes(x=price.predict,y=price))+
geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.5)+
geom_abline(intercept = 0, slope = 1, size = 2,color = "#7A986D")+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
labs(title="Predicted sale price as a function of observed price") +
plotTheme()+
theme(plot.title = element_text(size=22),
legend.position="right",
axis.text=element_text(size=22),
axis.title=element_text(size=22))
We have run the HLM model using the census tract as the grouping level. That is, we expect that the tract a home belongs to can, besides having a fixed effect, influence how other factors affect its price. Although the Moran's I test shows only minor improvement in the spatial clustering of errors, this model produces a much lower MAPE of 0.19 on the test set, an improvement over the first two linear models. Table 5 summarizes the HLM model.
#Moran's I
coords.test = st_coordinates(lmer.test.error)
neighborList.test = knn2nb(knearneigh(coords.test, 5))
spatialWeights.test = nb2listw(neighborList.test, style = "W")
moranTest <- moran.mc(lmer.test.error$price.absError, spatialWeights.test, nsim = 999)
ggplot(as.data.frame(moranTest$rec[c(1:999)]), aes(moranTest$res[c(1:999)]))+
geom_histogram(fill="#69b3a2",binwidth = 0.01)+
geom_vline(aes(xintercept = moranTest$statistic), color = "#DC986D", size = 2) +
scale_x_continuous(limits = c(-1, 1))+
labs(title="Moran's I", x = "Moran Test") +
plotTheme()+
theme(plot.title = element_text(size=22),
legend.position="right",
axis.text=element_text(size=22),
axis.title=element_text(size=22))
Figure 12 maps the residuals (differences between predicted and observed prices) on the test set. Figure 13 maps the spatial lag of the errors. Although some spatial clustering of errors persists, the clustering of positive or negative errors has diminished compared to our first two models.
# a map of residuals for test set.
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = lmer.test.error, aes(color = q5(price.error)), size = 3,alpha=0.5)+
scale_color_manual(values = palette5,
labels = qBr(lmer.test.error, "price.error"),
name = "price.error\n(Quintile Breaks)")+
labs(title = "Map of errors of HLM model") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
lmer.test.error$spatialLag = lag.listw(nb2listw(knn2nb(knearneigh(st_coordinates(lmer.test.error), 5)), style = "W"),lmer.test.error$price.absError)
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = lmer.test.error, aes(color = q5(spatialLag)), size = 3,alpha=0.5)+
scale_color_manual(values = palette5,
labels = qBr(lmer.test.error, "spatialLag"),
name = "spatialLag\n(Quintile Breaks)")+
labs(
title = "Map of errors' spatial lag of HLM mode") +
mapTheme()+
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
Figure 14 maps the distribution of the absolute percent error of our HLM model across census tracts. In the three dense urban areas (Boulder, Longmont, and Louisville/Superior/Erie), the MAPE is relatively low. Figure 15 plots MAPE by census tract as a function of mean price by census tract; it shows that our model is slightly less precise for higher-priced homes.
# a map of mape by GEOID.
dataTrain.test2 <- st_join(dataTrain.test,CensusData2,left=TRUE)
lmer.tract_MAPE = dataTrain.test2 %>%
mutate(
price.predict = predict(reg.lmer.fixedEffect, newdata = dataTrain.test2),
price.ape = (abs(price.predict - price)) / price.predict)%>%
st_drop_geometry()
tract_MAPE <- lmer.tract_MAPE%>%group_by(GEOID)%>%summarize(MAPE = 100*mean(price.ape, na.rm = T))
tract_MAPE <- left_join(tract_MAPE,CensusData2,by="GEOID")%>%
st_sf()
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = tract_MAPE, aes(fill = q5(MAPE)),alpha=0.7)+
scale_fill_manual(values = palette5,
labels = qBr(tract_MAPE, "MAPE"),
name = "MAPE%\n(Quintile Breaks)")+
labs(
title = "Map of absolute percent error of HLM model") +
mapTheme() +
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
lmer.tract_MAPE = dataTrain.test2 %>%
mutate(
price.predict = predict(reg.lmer.fixedEffect, newdata = dataTrain.test2),
price.ape = (abs(price.predict - price)) / price.predict)%>%
st_drop_geometry()
tract_MAPE_PRICE <- lmer.tract_MAPE%>%group_by(GEOID)%>%summarize(MAPE = mean(price.ape, na.rm = T),MPRICE=mean(price, na.rm = T))
ggplot(tract_MAPE_PRICE, aes(x = MPRICE, y = MAPE))+
geom_point(size=5,shape = 21, fill = "#69b3a2",alpha=0.5)+
geom_smooth(method = lm, se=F,colour = "#DC986D",size=2)+
labs(title="MAPE by neighborhood as a function of mean price by neighborhood") +
plotTheme()+
theme(plot.title = element_text(size=22),
legend.position="right",
axis.text=element_text(size=22),
axis.title=element_text(size=22))
Until now, we have only tested our model on a single test set. To further cross-validate it, we split the data-set into 100 equal folds and train and test the model 100 times, each time holding out one fold. We then examine the distribution of the mean absolute error across these 100 iterations.
fitControl <- trainControl(method = "cv", number = 100)
# Merge small categories to facilitate k-fold validation
dataTrain_merge = dataTrain
list1 = c("Electric Wall Heat (1500W)", "No HVAC", "Package Unit", "Heat Pump", "Gravity",
"Radiant Floor", "Ventilation Only", "Wall Furnace", "")
dataTrain_merge$HeatingDscr[dataTrain_merge$HeatingDscr %in% list1] = "Other"
list2 = c("Concrete", "Precast", "Veneer", "Wood", "", "Brick")
dataTrain_merge$ConstCodeDscr[dataTrain_merge$ConstCodeDscr %in% list2] = "Other"
list3 = c("", "Block Stucco", "Cedar", "Faux Stone", "Frame Asbestos",
"Log", "Metal", "Painted Block", "Strawbale", "Vinyl")
dataTrain_merge$ExtWallDscrPrim[dataTrain_merge$ExtWallDscrPrim %in% list3] = "Other"
list4 = c("", "Built-Up", "Clay Tile", "Roll", "Shake", "Tar and Gravel")
dataTrain_merge$Roof_CoverDscr[dataTrain_merge$Roof_CoverDscr %in% list4] = "Other"
fld = cvFolds(nrow(dataTrain), K = 100, R = 1)
fldTable = cbind(fld$which, fld$subsets) %>% as.data.frame()
MAE = c()
for (i in 1:100) {
testIndex = fldTable[fldTable$V1 == i,]$V2
kTest = dataTrain_merge[testIndex,]
kTrain = dataTrain_merge[-testIndex,]
reg.k = lmer(price ~
designCodeDscr + qualityCodeDscr + ConstCodeDscr + EffectiveYear +
bsmtSF + carStorageType + AcDscr + HeatingDscr +
ExtWallDscrSec + Roof_CoverDscr + nbrBaths +
schoolNo + universitiesCount + parkingArea + cinemasCount + stadiumsCount +
fitnessCenter_nn3 + Census_MedRent +
Census_pctAfricanAmericans + LN_commonLeisure_nn1 + LN_hospitals_nn1 +
LN_restaurants_nn1 + LN_retail_nn4 + LN_parkArea + LN_public_transport_nn2 +
LN_water_nn1 + TotalFinishedSF+ (1|tractID), data = kTrain)
err = kTest %>%
mutate(
price.predict = predict(reg.k, newdata = kTest),
price.absError = abs(price.predict - price),
price.ape = price.absError / price.predict
)
MAE <- append(MAE, mean(err$price.absError, na.rm = T))
}
MAE.df = data.frame(MAE = MAE) # assemble the fold-level MAEs for plotting
ggplot()+
geom_histogram(data = MAE.df, aes(x = MAE),fill = "#69b3a2",alpha=0.5, color = "white",bins=35)+
labs(title = "Distribution of MAE in k-fold cross-validation")+plotTheme()+
theme(plot.title = element_text(size=22),
legend.position="right",
axis.text=element_text(size=22),
axis.title=element_text(size=22))
Next, we test whether the model remains tenable across different socioeconomic contexts. We have divided the census tracts within Boulder County into two categories: high income, with a median household income above 80,000 dollars, and low income, with a median household income below 80,000 dollars. We then calculate the mean absolute percent error in high- and low-income tracts (Table 6). The results are similar, suggesting relatively good generalizability of our model.
Census <- CensusData%>%
mutate(incomeContext = ifelse(Census_MedHHInc > 80000,"High Income", "Low Income"))
kable2 <- st_join(lmer.test.error, Census) %>% na.omit()%>%
group_by(incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(price.ape,na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = "Table 6. MAPE by neighborhood income context")
kable2 %>%
kable_classic(full_width = T)%>%
footnote()
Finally, Figure 17 shows our final prediction on the entire data-set.
# Map predicted values where 'toPredict' is both 0 and 1
usetopredictall <- trainData1
Predict.all = usetopredictall %>%
mutate(price.predict = predict(reg.lmer.fixedEffect, newdata = usetopredictall))%>%
dplyr::select(starts_with("price"))
ggplot()+
geom_sf(data = CensusData2,color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = boulder,fill = "transparent",color="black",size=2)+
geom_sf(data = Predict.all, aes(color = q5(price.predict)),size = 3,alpha=0.2)+
scale_color_manual(values = palette5,
labels = qBr(Predict.all, "price.predict"),
name = "Predicted Price\n(Quintile Breaks)")+
labs(
title = "Final prediction on the entire data-set")+
mapTheme()+
theme(
plot.title = element_text(size = 20),
plot.subtitle=element_text(size = 15,face="italic"),
panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
In this project, we have developed a model that predicts home prices effectively in terms of mean absolute percent error, especially in central urban areas. We grouped our predictors into four categories: intrinsic features, accessibility to amenities, demographic and socioeconomic status, and spatial structure. Although the second category produces interesting findings and has refined our model, we find that the intrinsic features and the spatial structure carry the most predictive power.
The spatial clustering of home prices poses a major challenge for home price prediction. In our first two models, adding a geospatial component to the OLS model was not enough to account for the spatial variation of prices. Our third model, the HLM, still does not fully account for this variation, but it does reduce the overall errors. In particular, the model performs well in the three urban cores and for homes whose prices are neither very high nor very low. For homes in suburban or rural areas, or homes with extreme prices, it becomes less accurate. This may stem from the fundamental assumption of an OLS-type model: a linear relationship between price and its predictors, which may not hold for homes with extremely high or low prices. It could also be that there are fewer observations outside the urban cores, leading to less accurate predictions in those areas.
We would recommend this model to Zillow based on its good MAPE performance in central urban areas, where most transactions occur. However, the model could be further improved by finding a better way to account for the spatial clustering and variation of home prices.