For the public sectors, predictive policing is one of the prevailed machine learning algorithms, for it can determine which are is in high risk and which are is in low risk. In this way, a limited resource can be much more efficiently allocated to achieve a better effect.
In this project, a model to predict the occurrence of narcotics-heroine in Chicago is built. The hypothesis is that crime risk (the narcotics-heroine is specified here) is a function of exposure to a series of geospatial risk and protective factors. As exposure to risk factors increases, so does crime risk. The model utilizes indicators like unemployment data to attempt to build up the relationship between exposure to these risks and narcotics-heroine occurrence given the hope that it obtains a well generalizability in space and time.
However, a consensus is shared that a biased machine learning will lead a irrational police resources allocation and yield biased effects exaggerating the biased in a iterative way. We have to admit that the police narcotics-heroine data set suffers from selection bias against some communities. The growing awareness of Black Lives Matter is the best example.
To account for this selection bias and achieve a better accuracy and generalization, spatial features are introduced. And also the risking modeling is compared with traditional kernel-density model. It turns out the risking modeling model of narcotics-heroine is better than kernel-density traditional model, for it has a better accuracy and generalizatility.
In the following sections of this document, eight different predictors are finally selected, after iterative attempts until it attains reasonable accuracy and generalizability, to predict narcotics-heroine occurrence in Chicago in the year of 2018 based on data from 2017.
In the first step, municipal units of different scales (Police Districts, Police Beats, and Neighborhoods) are introduced from Chicago Open Data Platform. These data of different scales are prepared for further spatial regression attempts to figure out in which scale, the risk modeling can achieve the best effect. Further more, Police Districts, Police Beats are two policing units which have more references value when we try to apply the modeling outcome in practice.
#Municipal Units
policeDistricts <-
st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = dist_num)
policeBeats <-
st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
st_transform('ESRI:102271') %>%
dplyr::select(District = beat_num)
bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"),
mutate(policeBeats, Legend = "Police Beats"))
chicagoBoundary <-
st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
st_transform('ESRI:102271')
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform("ESRI:102271")
Chicago was once notorious for its drug problem. To have a comprehensive understanding of its current situation and advices on how to improve the situation, the crime type, narcotics-heroine, is selected as the dependent variable.
crime_meta <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr")
NARCOTICS <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr") %>%
filter(Primary.Type == "NARCOTICS" & Description == "POSS: HEROIN(WHITE)") %>%
mutate(x = gsub("[()]", "", Location))%>%
separate(x,into= c("Y","X"), sep=",")%>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
ggplot()+
geom_sf(data = st_intersection(policeDistricts,chicagoBoundary), color="gray50",fill = "transparent",size=1,linetype = "dashed")+
geom_sf(data = NARCOTICS, color="#83A3C9", size = 2.5,alpha=1)+
geom_sf(data = chicagoBoundary,fill = "transparent", color="black",size=3)+
labs(title = "Map of reported Narcotics-Heroine in Chicago - 2017",
subtitle = "One dot indicates a reportted narcotics case")+
mapTheme()+
theme(
plot.title = element_text(size = 30),
plot.subtitle=element_text(size = 25,face="italic"),
#panel.border = element_blank(),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank())
Figure 1 shows the distribution of reported narcotics-heroine cases in Chicago 2017. From the plot, a highly spatial clustering is observed. Most reported cases are located in the mid-west area of Chicago. Some of them also clustered at the south of Chicago.Selection bias here might be an issue, if the clustering is caused simply because cases happened in the center city are more easily to be recorded. Even though we can tell the degree of bias baked in the risk prediction, we should keep reminding ourselves of the possibility of selection bias in the modeling process.
The police administration units are similar to grids, to make the prediction model beneficial to allocate the police resources, the narcotics-heroine data set is aggregated into a grids net (each grid size is 500 *500 meter).
# Establish fishnet
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
# Aggregate the narcotics data into the established fishnet
Narcotics_net <-
dplyr::select(NARCOTICS) %>%
mutate(count = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(count = replace_na(count, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
ggplot()+
geom_sf(data = Narcotics_net, aes(fill = count), color="transparent",size=0)+
geom_sf(data = chicagoBoundary,fill = "transparent", color="black",size=3)+
scale_fill_gradient(low = "#EDEDED", high = "#206CC9") +
labs(title = "Map of Count of Narcotics for the fishnet - 2017",
subtitle = "Cellsize = 500 METERs")+
mapTheme()+
theme(
plot.title = element_text(size = 30),
plot.subtitle=element_text(size = 25,face="italic"),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = c(.86,.86),
legend.title = element_blank(),
legend.key.size = unit(2.5, "line"),
legend.text = element_text(color = "black", size = 20))
Eight indicators (“Abandoned Cars”, “Abandoned Buildings”, “Graffiti”, “Sanitation”, “Liquor Retail”, “Street Lights Out”, “Unemployment”, and “Potholes”) are finally chose to model narcotics-heroine in Chicago.
abandonCars <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd")) %>%
mutate(year=substr(date_service_request_was_received,1,4))%>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
filter(where_is_the_graffiti_located_ %in%
c("Front","Rear","Side")) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
sanitation <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata(paste0("https://data.cityofchicago.org/resource/nrmj-3kcf.json")) %>%
filter(business_activity ==
"Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
streetLightsOut <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
acs_variable_list.2017 <- load_variables(2017,"acs5",cache = TRUE)
unemployment <-
get_acs(geography = "tract",
variables = c("B23025_005"),
year=2017, state=17, county=031, geometry=T, output="wide") %>%
st_transform(st_crs(fishnet)) %>%
rename(unemployment = B23025_005E)%>%
mutate( legend = "unemployment")%>%
dplyr::select(-NAME,-starts_with("B"))
potholes <-
read.socrata(paste0("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Pot-Holes-Reported-Historical/7as2-ds3y")) %>%
mutate(year = substr(creation_date,1,4)) %>%
filter(year == "2017") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X","Y"), crs=4326, agr="constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "potholes")
abandonCars=st_join(abandonCars,chicagoBoundary)%>%na.omit()
abandonBuildings=st_join(abandonBuildings,chicagoBoundary)%>%na.omit()
graffiti=st_join(graffiti,chicagoBoundary)%>%na.omit()
sanitation=st_join(sanitation,chicagoBoundary)%>%na.omit()
streetLightsOut=st_join(streetLightsOut,chicagoBoundary)%>%na.omit()
liquorRetail=st_join(liquorRetail,chicagoBoundary)%>%na.omit()
potholes=st_join(potholes,chicagoBoundary)%>%na.omit()
unemployment=st_intersection(unemployment,chicagoBoundary)%>%na.omit()
abandonCars_plot =
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=abandonCars,size=0.5,color="#83A3C9")+
labs(title = "Abandon Cars-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,10), "line"),
plot.title = element_text(size = 30))
abandonBuildings_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=abandonBuildings,size=0.5,color="#83A3C9")+
labs(title = "Abandon Buildings-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,0), "line"),
plot.title = element_text(size = 30))
graffiti_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=graffiti,size=0.5,color="#83A3C9")+
labs(title = "Graffitis-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,-10), "line"),
plot.title = element_text(size = 30))
sanitation_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=sanitation,size=0.5,color="#83A3C9")+
labs(title = "Sanitation-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,-20), "line"),
plot.title = element_text(size = 30))
streetLightsOut_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=streetLightsOut,size=0.5,color="#83A3C9")+
labs(title = "Street Lightout-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,10), "line"),
plot.title = element_text(size = 30))
liquorRetail_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=liquorRetail,size=0.5,color="#83A3C9")+
labs(title = "Liquo Retail-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,0), "line"),
plot.title = element_text(size = 30))
potholes_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=potholes,size=0.5,color="#83A3C9")+
labs(title = "Pot holes-2017")+
mapTheme()+
theme(plot.margin = unit(c(1,0,0,-10), "line"),
plot.title = element_text(size = 30))
unemployment_plot=
ggplot()+
geom_sf(data = chicagoBoundary,fill = "transparent",size=1,color="black")+
geom_sf(data=unemployment,aes(fill=unemployment),color="transparent")+
scale_fill_gradient(low = "#EDEDED", high = "#387AC9") +
labs(title = "Unemployment-2017")+
mapTheme()+
theme(legend.position = c(.83,.86),
plot.margin = unit(c(1,0,0,-20), "line"),
legend.title = element_blank(),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15),
plot.title = element_text(size = 30))
grid.arrange(ncol = 4, nrow = 2,
abandonCars_plot, abandonBuildings_plot,
graffiti_plot,sanitation_plot,streetLightsOut_plot,
liquorRetail_plot,potholes_plot,unemployment_plot)
These predictors are chose because they indicate, to some extent, the safety conditions in the community level. Their 2017 data in Chicago Open Data platform are loaded. However, there might be problems in these features. Since the data sets come from 311 complaints, there might also be some selection bias in these data sets. Some communities tend to take part in the community building and will report these situations, while others might not.
To make these indicators and municipal units at the same level, these indicators are aggregated into the grid net. The value of each grid is the number that dots (Predictors’ Occurrence) fall into each grid.
# Add unemployment data, which is in tract unit, into grids
unemployment_fishnet <-
st_join(st_centroid(fishnet),unemployment,join = st_within)%>%
st_drop_geometry()%>%
dplyr::select(uniqueID,unemployment)
# Add the rest indicators into grids
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail, graffiti, sanitation,potholes) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
# Combine to one dataset
vars_net2 <- vars_net%>%
merge(.,unemployment_fishnet,by="uniqueID")%>%
na.omit()
vars_net.long <-gather(vars_net2, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i),
aes(fill=value),
color="transparent") +
scale_fill_gradient(low = "#EDEDED", high = "#387AC9") +
labs(title=i) +
mapTheme()+
theme(plot.margin = unit(c(1,0,0,-10), "line"),
legend.position = c(.83,.86),
legend.title = element_blank(),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15),
plot.title = element_text(size = 30))}
do.call(grid.arrange,c(mapList, ncol=4))
Because the number of the predictors’ occurrence in each fixed grid may contain scale bias, that says different scales may display different pattern to influence the interpretation, Nearest Neighbor Features are introduced here.
Nearest Neighbor Features mean the average distance of each centroid of the grid to its ‘K’ nearest occurrence. This is the other way to calculate the exposure. Nearest Neighbor Features highlight the effect of distance instead of the effect of the number.
# Abbreviate functions
st_c <- st_coordinates
st_coid <- st_centroid
# Add above indicators
a=st_c(potholes)
vars_net3.nn <- vars_net2 %>%
mutate(
Abandoned_Buildings.nn =nn_function(st_c(st_coid(vars_net2)),st_c(abandonBuildings),3),
Abandoned_Cars.nn =nn_function(st_c(st_coid(vars_net2)),st_c(abandonCars),3),
Graffiti.nn =nn_function(st_c(st_coid(vars_net2)),st_c(graffiti),3),
Liquor_Retail.nn =nn_function(st_c(st_coid(vars_net2)),st_c(liquorRetail),3),
Street_Lights_Out.nn =nn_function(st_c(st_coid(vars_net2)),st_c(streetLightsOut),3),
Sanitation.nn =nn_function(st_c(st_coid(vars_net2)),st_c(sanitation),3),
potholes.nn = nn_function(st_c(st_coid(vars_net2)),a%>%na.omit(),3))
# Add 'loop distance' indicators
loopPoint <-
filter(neighborhoods, name == "Loop") %>%
st_centroid()
vars_net4.loop <- vars_net3.nn
vars_net4.loop$loopDistance =st_distance(st_centroid(vars_net4.loop), loopPoint) %>%
as.numeric()
vars_net.long.nn <-
dplyr::select(vars_net4.loop, ends_with(".nn"),unemployment,loopDistance) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
vars <- vars[vars != "unemployment"]
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i),
aes(fill=value), color="transparent") +
scale_fill_gradient(low = "#EDEDED", high = "#387AC9") +
labs(title=i) +
mapTheme()+
theme(plot.margin = unit(c(1,0,0,-10), "line"),
legend.position = c(.83,.86),
legend.title = element_blank(),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15),
plot.title = element_text(size = 30))}
do.call(grid.arrange,c(mapList, ncol=4))
Figure 5 illustrates the distribution. From the graphs, we can know that south of Chicago may have a relatively better living environment. And some indicators like Abandoned Cars and Sanitation may have a relatively even spatially distribution. This reminds us to make a choice between “Count Number” and “Nearest Distance”. This selection will be carried out in the following.
For further regression calculations, here we combine dependent data, predictors data, areal data “neighborhoods” & “policeDistricts” into one data set, using uniqueID to make sure every data is allocated into its correct location.
final_net <-
left_join(Narcotics_net, st_drop_geometry(vars_net4.loop), by="uniqueID")
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name), by = "uniqueID") %>%
st_join(dplyr::select(policeDistricts, District), by = "uniqueID") %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
rename(narcotics_count = count)%>%
na.omit()
Crimes (Narcotics-Heroine) tend to cluster in space. So far, however, each grid cell is considered to be an independent observation with certain exposure to risk factors; the spatial variation of assault occurrences is not fully accounted for.
In this section, Local Moran I is used to test for spatial autocorrelation (clustering) of Narcotics-Heroine. This information provided insights into the spatial process of Narcotics-Heroine.
# To build neighborhood weights
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W",zero.policy=TRUE)
# To Calculate local Moran I
local_morans <- localmoran(final_net$narcotics_count, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
# To add results back to final dataset
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Narcotics_Count = narcotics_count,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(final_net.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = filter(final_net.localMorans, Variable == i),
aes(fill = Value), color="transparent") +
scale_fill_gradient(low = "#EDEDED", high = "#387AC9") +
labs(title=i) +
mapTheme() +
theme(plot.margin = unit(c(1,0,0,-10), "line"),
legend.position = c(.83,.86),
legend.title = element_blank(),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15),
plot.title = element_text(size = 30))}
do.call(grid.arrange,c(varList, ncol = 4))
In the Figure 6, we can learn that the hotspots of Narcotics-Heroine are significantly concentrated in the center city. With the outcome of Local Moran I, we define the hotspots of Narcotics-Heroine. In the following section,we add a character variable “isSig” into the data set to determine whether the grid is a hotspot. And also we add a new predictor ‘Distance to Significant Narcotics Clusters’ to accout for its spatial clustering.
final_net_Sigdisct <- final_net %>%
mutate(narcotics.isSig = ifelse(local_morans[,5] <= 0.001, 1, 0)) %>%
mutate(narcotics.isSig.dist = nn_function(st_c(st_coid(final_net)),
st_c(st_coid(filter(final_net, narcotics.isSig == 1))),
k = 1))
ggplot()+
geom_sf(data=final_net_Sigdisct,aes(fill=narcotics.isSig.dist), color="transparent")+
geom_sf(data = st_intersection(policeDistricts,chicagoBoundary), color="white",fill = "transparent",size=1,linetype = "dashed")+
scale_fill_gradient(low = "#206CC9", high = "#EDEDED") +
labs(title="Distance to significant narcotics hotspots") +
mapTheme() +
theme(legend.position = c(.83,.86),
legend.title = element_blank(),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15),
plot.title = element_text(size = 30))
In the Figure 7, we can learn that except the city center area, the mid-south area of Chicago is also close to Narcotics-Heroine hotspots. In the following section, the hotspot predictors will be examined in different models to find out whether these indicators actually help improve the model.
Correlation gives important contexts while also providing intuition on features that may predict narcotics, this approach can help with feature selection. For a given risk factor, avoid colinearity by selecting either the count or nearest neighbor feature.
In this section, correlation tests are done for each indicator and their different forms. “Count Number” & “Nearest Neighbor” of each indicators are plotted with Narcotics-Heroine to find out in which form there is a higher correlation with the Narcotics-Heroine.
correlation.long <-
st_drop_geometry(final_net_Sigdisct) %>%
dplyr::select(-cvID, -name,-District,-unemployment,-loopDistance) %>%
dplyr::select(-uniqueID) %>%
gather(Variable, Value, -narcotics_count)
correlation.long2 <-
st_drop_geometry(final_net_Sigdisct) %>%
dplyr::select(narcotics_count,unemployment,loopDistance) %>%
gather(Variable, Value, -narcotics_count)
correlation.cor <-correlation.long %>%
group_by(Variable) %>%
summarize(correlation =cor(Value, narcotics_count, use = "complete.obs"))
correlation.cor2 <-correlation.long2 %>%
group_by(Variable) %>%
summarize(correlation =cor(Value, narcotics_count, use = "complete.obs"))
ggplot(correlation.long, aes(Value, narcotics_count)) +
geom_point(size = 2.5, color = "#83A3C9") +
geom_text(data = correlation.cor,aes(label = paste("r =", round(correlation, 2))),x=-Inf, y=Inf, vjust = 1.5, hjust = -.1,size=6) +
geom_smooth(method = "lm", se = FALSE, colour = "#FFA596",size=2.5) +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Narcotics-Heroine as a function of risk factors") +
plotTheme()+
theme(
plot.title = element_text(hjust = 0.5,size=25),
panel.spacing = unit(3, 'lines'),
strip.text.x = element_text(size=15,face = "bold"))
ggplot(correlation.long2, aes(Value, narcotics_count)) +
geom_point(size = 2.5, color = "#83A3C9") +
geom_text(data = correlation.cor2,aes(label = paste("r =", round(correlation, 2))),x=-Inf, y=Inf, vjust = 1.5, hjust = -.1,size=6) +
geom_smooth(method = "lm", se = FALSE, colour = "#FFA596",size=2.5) +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "") +
plotTheme()+
theme(
plot.title = element_text(hjust = 0.5,size=25),
panel.spacing = unit(3, 'lines'),
strip.text.x = element_text(size=15,face = "bold"))
In the Figure 9, different correlations are showed. Based on the correlation results, a specific form of each indicator is selected. These indicators are “Abandoned_Buildings”, “Abandoned_Cars.nn”, “Graffiti.nn”, “Liquor_Retail.nn”, “Street_Lights_Out”, “Sanitation”, “unemployment”, “potholes”, and “loopDistance”. These indicators will be used in the following section to test each regression method.
When data distribution is skewed seriously toward the left, an OLS regression is inappropriate. Here, a Poisson regression is used, which is based on a Poisson distribution and uniquely suited to skewed distribution, simulated in the right-most histogram of Figure 10.
# Narcotics Distribution
narcoticsDistribution =
ggplot(data = final_net_Sigdisct,aes(narcotics_count)) +
geom_histogram(bins =30, color = 'black', fill = "#83A3C9") +
labs(title = 'Narcotics distribution')+
plotTheme()+
theme(plot.title = element_text(hjust = 0.5))
# A regular poisson distribution
set.seed(123)
poisson = as.data.frame(rpois(10000, 1))
names(poisson)[1] = 'poisson'
poissonDistribution = ggplot(data = poisson,
aes(poisson)) +
geom_histogram(bins = 8, color = 'black', fill = "#83A3C9") +
labs(title = 'Poisson distribution')+
plotTheme()+
theme(plot.title = element_text(hjust = 0.5))
grid.arrange(ncol = 2, narcoticsDistribution, poissonDistribution)
Take a look at the skewed distribution of narcotics distribution in the left-most histogram of Figure 10. Given narcotics is a relatively rare event, it is reasonable for most grid cells to contain no crime events.
In this section, regressions are made. Because geospatial risk models are purely spatial, spatial cross-validation becomes an important option. So we move directly to cross-validation.
Below, goodness of fit metrics are generated for six regressions - three including Just Risk Factors (reg.vars), and a third (reg.ss.vars) includes risk factors plus the Local Moran’s I Spatial Process features created.
These six types of regression are divided into three types. The first two are using ‘K-FOLD’ as a cross validation method. The second two are using ‘Spatial LOGO - in Police Beats’ as a cross validation method. And the third two are using ‘Spatial LOGO - in Police Districts’ as a cross validation method.
reg.vars <-c("Abandoned_Buildings","Abandoned_Cars.nn","Graffiti.nn","Liquor_Retail.nn","Street_Lights_Out","Sanitation","unemployment","potholes","loopDistance")
reg.ss.vars <-c("Abandoned_Buildings","Abandoned_Cars.nn","Graffiti.nn","Liquor_Retail.nn","Street_Lights_Out","Sanitation","unemployment","potholes","loopDistance","narcotics.isSig", "narcotics.isSig.dist")
final_net_prediction <- final_net_Sigdisct
reg.cv_cvid <- crossValidate(
dataset = final_net_prediction,
id = "cvID",
dependentVariable = "narcotics_count",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID,
narcotics_count,
Prediction,
geometry)
reg.ss.cv_cvid <- crossValidate(
dataset = final_net_prediction,
id = "cvID",
dependentVariable = "narcotics_count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID,
narcotics_count,
Prediction,
geometry)
reg.spatialCV_name <- crossValidate(
dataset = final_net_prediction,
id = "name",
dependentVariable = "narcotics_count",
indVariables = reg.vars) %>%
dplyr::select(cvID = name,
narcotics_count,
Prediction,
geometry)
reg.ss.spatialCV_name <- crossValidate(
dataset = final_net_prediction,
id = "name",
dependentVariable = "narcotics_count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name,
narcotics_count,
Prediction,
geometry)
reg.spatialCV_district <- crossValidate(
dataset = final_net_prediction,
id = "District",
dependentVariable = "narcotics_count",
indVariables = reg.vars) %>%
dplyr::select(cvID = District,
narcotics_count,
Prediction,
geometry)
reg.ss.spatialCV_district <- crossValidate(
dataset = final_net_prediction,
id = "District",
dependentVariable = "narcotics_count",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = District,
narcotics_count,
Prediction,
geometry)
In this section, the accuracy and generalzability of the models are estimated. In the Figure 11, some interesting insights are concluded. First, models using spatial process indicators have a relatively small MAE. Second, models using random K-fold have less outliers, making the models more stable when testing the new data.
# Combine Regression Errors together
reg.summary <-
rbind(
mutate(reg.cv_cvid,
Error = Prediction - narcotics_count,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv_cvid,
Error = Prediction - narcotics_count,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV_name,
Error = Prediction - narcotics_count,
Regression = "Spatial LOGO-CV-Police Beats: Just Risk Factors"),
mutate(reg.ss.spatialCV_name,
Error = Prediction - narcotics_count,
Regression = "Spatial LOGO-CV-Police Beats: Spatial Process"),
mutate(reg.spatialCV_district,
Error = Prediction - narcotics_count,
Regression = "Spatial LOGO-CV-Police Districts: Just Risk Factors"),
mutate(reg.ss.spatialCV_district,
Error = Prediction - narcotics_count,
Regression = "Spatial LOGO-CV-Police Districts: Spatial Process"))%>%
st_sf()
# Calculate the error
error_by_reg_and_fold <-reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - narcotics_count,na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
ungroup()
# Visulize the comparison
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 40, colour="black", fill="#83A3C9") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0,color="#FFA596",size=2.5) +
scale_x_continuous(breaks = seq(0, 12, by = 1)) +
labs(title="Distribution of MAE - k-fold cross-validation vs. LOGO-CV",
x="Mean Absolute Error", y="Count") +
plotTheme()+
theme(axis.text=element_text(size=15),
axis.title=element_text(size=15),
strip.text.x = element_text(size=15,face = "bold"),
panel.border = element_rect(colour = "black", fill=NA, size=1),
plot.title = element_text(hjust = 0.5,size=25),
panel.spacing = unit(3, 'lines'))
st_drop_geometry(reg.summary) %>%
group_by(Regression) %>%
mutate(narcotics_Decile = ntile(narcotics_count, 10)) %>%
group_by(Regression, narcotics_Decile) %>%
summarize(meanObserved = mean(narcotics_count, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -narcotics_Decile) %>%
ggplot(aes(narcotics_Decile, Value, shape = Variable)) +
geom_path(aes(group = narcotics_Decile), size=1,colour = "#FFA596") +
geom_point(size = 2.5,color="#83A3C9") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) +
xlim(0,10) +
ylim(0,10)+
labs(title = "Predicted and observed narcotics by observed narcotics decile") +
plotTheme()+
theme(axis.text=element_text(size=12),
axis.title=element_text(size=12),
strip.text.x = element_text(size=12,face = "bold"),
panel.border = element_rect(colour = "black", fill=NA, size=1),
plot.title = element_text(hjust = 0.5,size=25),
panel.spacing = unit(3, 'lines'),
legend.position="bottom")
Figure 12 shows that all models over-predict in low Narcotics-Heroine areas and under-predict in hotspot areas. And all models have a very high deviation in high Narcotics-Heroine area. Over-predictions in lower burglary areas may highlight areas of latent risk. Under-prediction in higher burglary areas may reflect difficulty predicting the hotspots. Besides, Random k-fold CV:Spatial Process and Spatial LOGO-CV_Police Beats:Spatial PRocesshas relatively small deviations in nearly every narcotics decile area.
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(caption = '<center>Table 1. Average errors for each model</center>') %>%
kable_classic(full_width = F, html_font = "Cambria")%>%
row_spec(2, color = "black", background = "#83A3C9") %>%
row_spec(4, color = "black", background = "#83A3C9")%>%
row_spec(6, color = "black", background = "#83A3C9")
Regression | Mean_MAE | SD_MAE |
---|---|---|
Random k-fold CV: Just Risk Factors | 0.66 | 0.82 |
Random k-fold CV: Spatial Process | 0.42 | 0.54 |
Spatial LOGO-CV-Police Beats: Just Risk Factors | 1.05 | 1.80 |
Spatial LOGO-CV-Police Beats: Spatial Process | 0.30 | 0.42 |
Spatial LOGO-CV-Police Districts: Just Risk Factors | 1.68 | 2.87 |
Spatial LOGO-CV-Police Districts: Spatial Process | 0.43 | 0.80 |
Figure 13 tells a similar story as Figure 12. Due to the Spatial Process efficiently accounts for its spatial clustering, all regression models with Spatial Process display a smaller MAE & SD_MAE compared to models without Spatial Process. Among the three Spatial Process regression models, Spatial LOGO-CV_Police Beats:Spatial Process has smallest MAE & SD_MAE. Considering all mathematical analysis above, Spatial LOGO-CV_Police Beats:Spatial Process is selected to used in the following analysis.
error_by_reg_and_fold %>%
filter(str_detect(Regression, "Spatial Process")) %>%
ggplot() +
geom_sf(aes(fill = MAE), color="transparent") +
facet_wrap(~Regression) +
scale_fill_gradient(low = "#EDEDED", high = "#387AC9") +
labs(title = "Narcotics errors by K-fold and LOGO-CV Regression") +
mapTheme() +
theme(legend.position="bottom",
panel.spacing = unit(10, 'lines'),
strip.text.x = element_text(size=15,face = "bold"),
legend.title = element_blank(),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15),
plot.title = element_text(size = 30,hjust = 0.5))
Figure 14 tells that even though Spatial LOGO-CV_Police Beats:Spatial Process has the smallest MAE and SD_MAE, its spatial clustering effect still exists. The city center of Chicago still displays a relatively high absolute error, which needs a further improve. Due to the small grid of Random k-fold CV:Spatial Process, its errors spred more evenly and do not display spatial clustering.
To test the generalizability of our regression models across different racial contexts, percentWhite is calculated and tracts are split into two groups, Majority_White and Majority_Non_White. After that, errors are subtracted from the previous calculation outcome.
ggplot()+
geom_sf(data = tracts18, aes(fill = raceContext),color="white",size=1,linetype = "dashed")+
geom_sf(data=st_union(tracts18),fill="transparent",color="black",size=2.5)+
scale_fill_manual(values = c("#83A3C9", "#FFA596"),
labels = c('<50% white', '>50% white'),
name = '')+
labs(title = 'Race Neighborhood - 2018',
subtitle='Majority-white and majority-nnon-white neighborhoods, Chicago')+
mapTheme()+
theme(
plot.title = element_text(size = 30),
plot.subtitle=element_text(size = 25,face="italic"),
axis.text = element_blank(),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
legend.position = c(.9,0.9),
legend.key.size = unit(1.5, "line"),
legend.text = element_text(color = "black", size = 15))
reg.summary %>%
filter(str_detect(Regression, "Spatial Process")) %>%
st_centroid() %>%
st_join(tracts18) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = round(mean(Error, na.rm = T),2)) %>%
spread(raceContext, mean.Error) %>%
kable(caption ='<center>Table 2. Mean Error by neighborhood racial context</center>') %>%
kable_classic(full_width = F, html_font = "Cambria")%>%
row_spec(1, color = "black", background = "#83A3C9")
Regression | Majority_Non_White | Majority_White |
---|---|---|
Random k-fold CV: Spatial Process | -0.11 | 0.13 |
Spatial LOGO-CV-Police Beats: Spatial Process | -0.10 | 0.13 |
Spatial LOGO-CV-Police Districts: Spatial Process | -0.21 | 0.14 |
The outcome is so interesting. The model on average, under-predicts in Majority_Non_White neighborhoods and overpredicts in Majority_White neighborhoods. Even the Spatial LOGO-CV_Police Beats:Spatial Process reports lower errors overall, it has a larger difference in errors across neighborhood context. On the contrary, the Random k-fold CV:Spatial Process has a smaller difference in errors across neighborhood context, indicating a relatively better generalzabilty when considering the racial context.
Police departments all over the world use hotspot policing to target police resources to the places where crime is most concentrated, that is what we call “kernel density method”. Because of its reliance on nearby points, we can think of kernel density as one making ‘predictions’ based purely on spatial autocorrelation.
In this section, we ask whether risk predictions, with added predictors making it not only rely on spatial autocorrelation but also other indepedent variables, outperform traditional ‘kernel density’ hotspot mapping.
narcotics_ppp <- as.ppp(st_coordinates(NARCOTICS), W = st_bbox(final_net_Sigdisct))
narcotics_KD.2000 <- spatstat.core::density.ppp(narcotics_ppp, 2000)
# Load 2018 narcotics data
NARCOTICS18 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>%
filter(Primary.Type == "NARCOTICS" & Description == "POSS: HEROIN(WHITE)") %>%
mutate(x = gsub("[()]", "", Location))%>%
separate(x,into= c("Y","X"), sep=",")%>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()%>%
.[fishnet,]
# Kernal Density Method - Traditonal way
narcotics_KDE_sf <- as.data.frame(narcotics_KD.2000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net_Sigdisct)) %>%
aggregate(., final_net_Sigdisct, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(aggregate(dplyr::select(NARCOTICS18) %>% mutate(narcotics_count = 1), ., sum) %>%
mutate(narcotics_count = replace_na(narcotics_count, 0))) %>%
dplyr::select(label, Risk_Category, narcotics_count)
# Risk Prediction - Our way
narcotics_risk_sf <-
reg.ss.spatialCV_name %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(NARCOTICS18) %>% mutate(narcotics_count = 1), ., sum) %>%
mutate(narcotics_count = replace_na(narcotics_count, 0))) %>%
dplyr::select(label, Risk_Category, narcotics_count)
narcotics_KDE_RISK <- rbind(narcotics_KDE_sf, narcotics_risk_sf) %>%
na.omit()
narcotics_KDE_RISK%>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), color="transparent") +
geom_sf(data = sample_n(NARCOTICS18, 2390), color="BLACK",size =2) +
facet_wrap(~label, ) +
scale_fill_manual(values = palette5_3,
name = "Risk_Category") +
labs(title="Comparison of Kernel Density and Risk Predictions Model",
subtitle="2017 NARCOTICS predictions; 2018 NARCOTICS") +
mapTheme() +
theme(legend.position="bottom",
panel.spacing = unit(10, 'lines'),
panel.background = element_blank(),
strip.text.x = element_text(size=25,face = "bold"),
legend.title = element_blank(),
legend.key.size = unit(2.5, "line"),
legend.text = element_text(color = "black", size = 30),
plot.title = element_text(size = 30,hjust = 0.5),
plot.subtitle=element_text(size = 25,face="italic",hjust = 0.5))
Figure 16 tells that compared to the traditional “kernel density method”, the risk modeling is more precise. It not only indicates the city center as a hotspot but also indicates other locations where narcotics-heroine also may happens.
However, through Figure 17, we can know that risk modeling method under predict narcotics-heroine in high risk area, even though it may have a better performance in medium risk area.
rbind(narcotics_KDE_sf, narcotics_risk_sf) %>%
st_set_geometry(NULL) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(narcotics_count = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = narcotics_count / sum(narcotics_count)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label),position="dodge", stat="identity") +
scale_fill_manual(values = c("#83A3C9","#FFA596"),name = "") +
labs(title = "Risk prediction vs. Kernel density, 2018 narcotics") +
plotTheme()+
xlab('Cells\' risk quintile')+
ylab('Rate of observed assaults')+
theme(legend.position="bottom",
legend.title = element_blank(),
legend.key.size = unit(0.5, "line"),
legend.text = element_text(color = "black", size = 10),
plot.title = element_text(size = 15, hjust = 0.5),
axis.text.x = element_text(angle = 15, vjust = 0.5))
I believe the this model should be applied into practice. Overall, this geospatial risk prediction model borrows the Narcotics-Heroine experience in places where it has been observed, and tests whether that experience generalizes to places where Narcotics-Heroine risk may be high. The resulting predictions can be thought of as ‘latent risk’ for Narcotics-Heroine and can be used to allocate limited police response across space. This improvement in the efficiency is so vital to the government and to the citizens. So this model has its unique value in this aspect.
At the smae time, new and powerful feature engineering strategies are introduced to discribe and predict Narcotics-Heroine. Despite the model generalizes not so well across different neighborhood contexts, and there may even be some selection bias in the data set, this model successfully reduces the MAE, comapred to the traditional model. And it successfully predicts the mid-risks areas, which is more difficult to predict. To improve its generalizability, some explorations should be carried out like scaling the geospatial scale in the model to a smaller level. Nevertheless, this model has demonstrated that even a simple risk prediction algorithm may outperform traditional hotspot analysis.