1 Motivation

A bike share program is a shared transport service in which bicycles are made available for shared use to individuals on a short-term basis for a price. Many bike share systems allow people to borrow a bike from a “dock” and return it at another dock belonging to the same system. One of the most difficult operational problems for urban bike share systems is the need to ‘re-balance’ bicycles across the network.

Because the direction of bike-share trips is unevenly distributed at different times of day, bikes and open docks tend to become highly unbalanced in space. Bike share is not useful if a dock has no bikes to pick up, nor if there are no open docking spaces to deposit a bike. Re-balancing is the practice of anticipating (or predicting) bike share demand for all docks at all times and manually redistributing bikes so that a bike or a docking space is available when needed.

In this project, the NYC Citi Bike system is chosen as the case for studying re-balancing strategies. An algorithm is developed to predict trip starts over time at each bike dock. The results reveal the serial and spatial patterns of the Citi Bike system and inform manual redistribution advice. The re-balancing process relies on two mechanisms:

First, a small fleet of trucks is operated to move bikes from one dock station to another. Second, to plan these moves, time-lag features are used to build a model that predicts future bike share demand.

library(tidyverse)
library(ggplot2)
library(sf)
library(tidycensus)
library(lubridate)
library(tigris)
library(osmdata)
library(gganimate)
library(riem)
library(FNN)
library(gridExtra)
library(grid)
library(knitr)
library(gifski)
library(rjson)
library(kableExtra)
options(tigris_class = "sf")
root.dir =
paste0("https://raw.githubusercontent.com/urbanSpatial",
"/Public-Policy-Analytics-Landing/master/DATA/")
source(
paste0("https://raw.githubusercontent.com/urbanSpatial/",
"Public-Policy-Analytics-Landing/master/functions.r"))

palette5 <- c("#f9b294","#f2727f","#c06c86","#6d5c7e","#315d7f")
palette4 <- c("#f9b294","#f2727f","#c06c86","#6d5c7e")
palette2 <- c("#f9b294","#f2727f")
palette1_main <- "#F2727F"
palette1_assist <- '#F9B294'

mapTheme <- function(base_size = 35, title_size = 45) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(hjust = 0.5, size = title_size, colour = "black",face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5,size=base_size,face="italic"),
    plot.caption = element_text(size=base_size,hjust=0),
    axis.ticks = element_blank(),
    panel.spacing = unit(6, 'lines'),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=base_size),
    axis.title = element_text(face = "bold",size=base_size),
    axis.text = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(size=base_size,colour = "black", face = "italic"),
    legend.text = element_text(size=base_size,colour = "black", face = "italic"),
    strip.text.x = element_text(size = base_size,face = "bold")
  )
}

plotTheme <- function(base_size = 35, title_size = 45) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(hjust = 0.5, size = title_size, colour = "black",face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5,size=base_size,face="italic"),
    plot.caption = element_text(size=base_size,hjust=0),
    axis.ticks = element_blank(),
    panel.spacing = unit(6, 'lines'),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=6),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=base_size),
    axis.title = element_text(face = "bold",size=base_size),
    axis.text = element_text(size=base_size),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(size=base_size,colour = "black", face = "italic"),
    legend.text = element_text(size=base_size,colour = "black", face = "italic"),
    strip.text.x = element_text(size = base_size,face = "bold")
  )
}
ride_0_1 <- read.csv(file="D:/OneDrive/WORK/Upenn/MUSA_508_Public_Policy_Analytics_Michael/Assignment/Assignment5/202108-citibike-tripdata.csv/202108-citibike-tripdata.csv")%>%na.omit()

ride_0_2 <- read.csv(file="D:/OneDrive/WORK/Upenn/MUSA_508_Public_Policy_Analytics_Michael/Assignment/Assignment5/202109-citibike-tripdata.csv/202109-citibike-tripdata.csv")%>%na.omit()

ride <- rbind(ride_0_1,ride_0_2)
ride2 <-ride %>%
  mutate(interval_hour =ymd_h(substr(started_at, 1, 13)),
         week = week(interval_hour),
         dayotw = wday(interval_hour, label=TRUE),
         start_station_name = as.character(start_station_name),
         end_station_name = as.character(end_station_name))%>%
  filter(week %in% c(32:36))
write.csv(ride2, "ride2.csv", row.names = FALSE)

The Week 32 - Week 36 subset of the Citi Bike data is used for this research. One reason is that the research also aims to examine New Yorkers’ travel patterns after COVID-19; another is that the Citi Bike system generates a very large volume of trips, producing a huge dataset. The trimmed dataset supports a more efficient analysis workflow while preserving accuracy and generalizability.
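
As a quick size check (a minimal sketch, assuming the ride and ride2 objects from the block above), the raw two-month file can be compared with the trimmed five-week subset:

# Number of trips in the raw August-September file, and trips per retained week
nrow(ride)
ride2 %>% count(week)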



2 Feature Engineering

2.1 Research Sites

The research focuses on Manhattan Island, the most densely populated and geographically smallest of the five boroughs of New York City. Manhattan serves as the city’s economic and administrative center, cultural identifier, and historical birthplace; with abundant transportation activity and heavy Citi Bike use, it is an ideal research area.

ride_rsch <- ride2%>%
  group_by(start_station_name)%>%
  summarize(lat=first(start_lat),lng=first(start_lng))

NYCTracts <-
  tigris::tracts(state = "New York", county = "New York") %>%
  dplyr::select(GEOID)%>%
  filter(GEOID!="36061000100")


ride2_sf = st_as_sf(ride_rsch, coords = c("lng", "lat"), 
                 crs = st_crs(NYCTracts), agr = "constant")

ride_station <- st_join(ride2_sf,NYCTracts,left=TRUE)%>%na.omit()
ride_tract <- st_join(NYCTracts,ride2_sf,left=TRUE)%>%na.omit()

station_list <- ride_station[["start_station_name"]]

ride3 <- ride2 %>%
  filter(start_station_name %in% station_list,end_station_name %in% station_list)
ggplot()+
  geom_sf(data=st_union(NYCTracts),color="black",size=5,fill = "transparent")+
  geom_sf(data=NYCTracts,color="black",size=0.5,linetype ="dashed",fill = "transparent")+
  geom_sf(data=ride_station,size=4,color=palette1_main)+
  labs(title = "Map of Share Bike Stations, Manhattan Island", 
       subtitle = "Red dots are stations")+
  mapTheme()

For practical purposes, since re-balancing happens among bike share stations, the unit of analysis is the station rather than the census tract. In total, there are 602 bike share stations on Manhattan Island; they are displayed above.
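
As a quick sanity check (assuming the ride_station object built above, with one row per station), the station count can be verified directly:

# Each row of ride_station is one station that fell inside a Manhattan tract;
# the count should match the 602 stations reported above.
nrow(ride_station)
length(unique(ride_station$start_station_name))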



2.2 Expand Grid/panel

study.panel <-expand.grid(
  interval_hour = unique(ride3$interval_hour),
  start_station_name =
    unique(ride3$start_station_name))

ride3.panel <-ride3 %>%
  mutate(Trip_Counter = 1) %>%
  group_by(interval_hour, start_station_name) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  right_join(study.panel) %>%
  replace(is.na(.),0) %>%
  left_join(ride_station,by = "start_station_name") %>%
  mutate(week = week(interval_hour),dayotw = wday(interval_hour, label = TRUE)) %>%
  st_sf()

Because we are modeling time-lag patterns, the dataset for this analysis must be a complete panel with an observation for every possible space/time combination. The raw data frame is incomplete, since some space/time intervals saw no trips. In the code block above, the panel is completed by finding all unique space/time intervals and inserting 0 trips where necessary.
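
A minimal completeness check (assuming ride3 and ride3.panel from the block above): the panel should contain exactly one row for every station/hour combination.

# TRUE if every unique hour is paired with every unique station exactly once
nrow(ride3.panel) ==
  length(unique(ride3$interval_hour)) * length(unique(ride3$start_station_name))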



2.3 Weather Features

One might reasonably assume that inclement weather on Manhattan Island would discourage bike share trips. Weather is therefore an important feature to add to the prediction. In the following, NYC weather data is used to summarize temperature, precipitation, and wind speed for every hour between August and September.

weather.Data <-
  riem_measures(station = "NYC", date_start = "2021-08-01",
                date_end = "2021-10-01")

weather.Panel <-
  weather.Data %>%
  mutate_if(is.character,
            list(~replace(as.character(.), is.na(.), "0"))) %>%
  replace(is.na(.), 0) %>%
  mutate(interval_hour = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval_hour),
         dayotw = wday(interval_hour, label=TRUE)) %>%
  filter(week %in% c(32:36))%>%
  group_by(interval_hour,week,dayotw) %>%
  summarize(Temp = max(tmpf),
            Precip = sum(p01i),
            Wind = max(sknt)) %>%
  mutate(Temp = ifelse(Temp == 0, 42, Temp))%>%
  replace(is.na(.),0)
grid.arrange(
ggplot(weather.Panel, aes(interval_hour,Precip)) + 
  geom_line(color=palette1_main,size=2)+
  labs(x="",
       title = "Weather Data - NYC - Week 32 - Week 36, 2021")+
  plotTheme(),
ggplot(weather.Panel, aes(interval_hour,Wind)) + 
  geom_line(color=palette1_main,size=2)+
  labs(x="",
       title = "")+
  plotTheme(),
ggplot(weather.Panel, aes(interval_hour,Temp)) + 
  geom_line(color=palette1_main,size=2)+
  labs(x="Time",
       title = "")+
  plotTheme())

From the precipitation graph, we can see severe rainfall around Aug. 23 and Sept. 2; the Sept. 2 rainfall was caused by the remnants of Hurricane Ida. We assume that heavy rainfall decreases citizens’ willingness to use bike share, and this assumption is examined later. From the Wind and Temperature graphs, rough periodic patterns can be observed, a reminder that these two indicators may correlate with the time features. Their correlations are examined below.
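
To locate the heaviest-rainfall hours noted above, a quick check on the weather panel (assuming weather.Panel from the preceding block) lists the hours with the most precipitation:

# Hours with the highest hourly precipitation totals
weather.Panel %>%
  ungroup() %>%
  arrange(desc(Precip)) %>%
  dplyr::select(interval_hour, Precip, Wind, Temp) %>%
  head(5)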



2.4 Population

Users of the Citi Bike system are residents, so it is reasonable to assume that higher population density may increase demand for Citi Bike. In the following, population data is obtained from the Census at the census-tract level. Every bike dock is assigned the population density of the census tract that contains it.

census_api_key("e9466e56db1114acc85df98003dfcf59c1b3c0af", overwrite = TRUE)
CensusData <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001E"), 
          year=2019, state="New York", county="New York",geometry = T, output="wide") %>%
  st_transform(st_crs(NYCTracts)) %>%
  rename(Census_TotalPop = B01003_001E) %>%
  dplyr::select(-NAME,-starts_with("B")) %>%
  mutate(Census_areasqm = as.numeric(st_area(.)),
         Census_PopDensity = Census_TotalPop/Census_areasqm) %>%
  dplyr::select(-Census_TotalPop,-Census_areasqm)%>%
  replace(is.na(.),0)
st_write(CensusData, "CensusData.geojson")
ggplot()+
  geom_sf(data=st_union(NYCTracts),color="black",size=5,fill = "transparent")+
  geom_sf(data=CensusData,aes(fill=q5(Census_PopDensity)))+
  geom_sf(data=ride_station,color="blue")+
  scale_fill_manual(values = palette5,
                    labels = qBr(CensusData, "Census_PopDensity"),
                    name = "Per1000sqm")+
  labs(title = "Map of Population Density, Manhattan Island", 
       subtitle = "Population Per 1000sqm")+
  mapTheme()

From the above graph, we can see the spatial pattern of population clustering: residents tend to concentrate around the Central Park area, and population density decreases as the distance to Central Park increases.



2.5 Apartments & Park

Besides population, will proximity to apartments and parks influence bike share activity? It is reasonable to assume that the floating population is larger where apartments and parks are denser, and a larger floating population means more chances of using the bike share system. In the following, these two datasets are obtained from OpenStreetMap, and their spatial distributions are plotted below.

q0 <- opq(bbox = c(-74.025203,40.699282,-73.901210,40.881792))
get_osm_1 <- function(bbox, key, value, geom, crs){
  object <- add_osm_feature(opq = bbox, key = key, value = value) %>%
    osmdata_sf(.)
  geom_name <- paste("osm_", geom, sep = "")
  object.sf <- st_geometry(object[[geom_name]]) %>%
    st_transform(crs) %>%
    st_sf() %>%
    cbind(., object[[geom_name]][[key]]) %>%
    rename(NAME = "object..geom_name....key..")
  return(object.sf)
}

apartments = get_osm_1(q0, "building", "apartments", "points", st_crs(NYCTracts))
apartments_select <- st_intersects(st_union(NYCTracts)%>%st_sf(),apartments)
apartments2 <- apartments_select[[1]]
apartments3 <- apartments[apartments2,]

park = get_osm_1(q0, "leisure", "park", "polygons", st_crs(NYCTracts))%>%
  st_centroid()
park_select <- st_intersects(st_union(NYCTracts)%>%st_sf(),park)
park2 <- park_select[[1]]
park3 <- park[park2,]

#st_write(apartments3, "apartments3.geojson")
#st_write(park3, "park3.geojson")
grid.arrange(ncol=2,
ggplot()+
  geom_sf(data=st_union(NYCTracts),color="black",size=5,fill = "transparent")+
  geom_sf(data=NYCTracts,color="black",size=0.5,linetype ="dashed",fill = "transparent")+
  geom_sf(data=park3,size=4,color=palette1_main)+
  geom_sf(data=ride_station,color="blue")+
  labs(title = "Map of Parks, Manhattan Island", 
       subtitle = "Red dots are parks")+
  mapTheme(),
ggplot()+
  geom_sf(data=st_union(NYCTracts),color="black",size=5,fill = "transparent")+
  geom_sf(data=NYCTracts,color="black",size=0.5,linetype ="dashed",fill = "transparent")+
  geom_sf(data=apartments3,size=2,color=palette1_main)+
  geom_sf(data=ride_station,color="blue")+
  labs(title = "Map of Apartments, Manhattan Island", 
       subtitle = "Red dots are parks")+
  mapTheme()
)

From the graphs, we can see a relatively even distribution of parks across Manhattan (setting aside the sheer scale of Central Park) and a distribution of apartments concentrated in the south.



2.6 Add Amenity Features to dataframe

ride4 <-ride3.panel%>%
  left_join(weather.Panel%>%dplyr::select(-dayotw), by = "interval_hour")%>%
  rename("week"="week.x")%>%
  dplyr::select(-week.y)

apartments3_coord =  apartments3 %>% dplyr::select(geometry)%>% st_coordinates
park3_coord =  park3 %>% dplyr::select(geometry)%>% st_coordinates
ride4_coord = st_coordinates(ride4)

ride4$park_nn5 <- nn_function(ride4_coord, park3_coord, 5)
ride4$apartment_nn5 <- nn_function(ride4_coord, apartments3_coord, 5)
ride4$trans_apartment_nn5 <- log(1 + sqrt(ride4$apartment_nn5))

ride4 <- ride4%>%replace(is.na(.),0)

In the code above, these features (weather, parks, and apartments) are added to the rideshare dataset. The weather data for a given time slot is attached to every station record in that slot. The average distance to the 5 nearest parks is calculated and attached to every station, as is the average distance to the 5 nearest apartments, along with a transformed version of the latter (a square root followed by a log transformation).



2.7 Time Lag

ride5 <-
  ride4 %>%
  arrange(start_station_name, interval_hour) %>%
  group_by(start_station_name) %>%
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
ungroup()

Common sense suggests that citizens’ activities have strong time patterns: most people commute to work in the morning and return home at dusk, and this floating population increases use of the bike share system. To capture serial (temporal) correlation, additional feature engineering creates time lags of one, two, three, four, twelve, and twenty-four hours.
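
For intuition, the lag columns for a single station can be inspected side by side (a minimal sketch assuming ride5 from the block above; the station is chosen arbitrarily for illustration):

example_station <- ride5$start_station_name[1]  # any station; picked only for illustration
ride5 %>%
  st_drop_geometry() %>%
  filter(start_station_name == example_station) %>%
  dplyr::select(interval_hour, Trip_Count, lagHour, lag2Hours, lag1day) %>%
  head(6)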



3 Exploratory Analysis

3.1 Split datasets

ride5 <-ride5%>%
  left_join(CensusData%>%st_drop_geometry()%>%dplyr::select(GEOID,Census_PopDensity), by = "GEOID")%>%
  mutate(isPrecip = ifelse(Precip > 0,"Rain/Snow", "None"))
ride.Train <- filter(ride5, week < 35)
ride.Test <- filter(ride5, week >= 35)
ride.cross <- ride5

Here, a time series approach is taken: the model is trained on three weeks of data, Week 32 - Week 34, and tested on the following two weeks, Week 35 - Week 36. If the temporal experience generalizes, it can be used to project into the near future.



3.2 Time Lag

As discussed above, citizens’ activities follow time patterns, and increased activity should correlate with increased bike share use. Is this assumption correct? In the following, the relationship between bike share trips and time is plotted.

Fridays <-
  mutate(ride5,friday = ifelse(dayotw == "Fri" & hour(interval_hour) == 1,interval_hour, 0)) %>%
  filter(friday != 0)

Hurricane_Henri <- as.POSIXct("2021-08-22 01:00:00 UTC")

rbind(
  mutate(ride.Train, Legend = "Training"),
  mutate(ride.Test, Legend = "Testing")) %>%
  group_by(Legend, interval_hour) %>%
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ungroup() %>%
  ggplot(aes(interval_hour, Trip_Count, colour = Legend)) +
  geom_line(size=2) +
  scale_colour_manual(values = palette2) +
  geom_vline(data = Fridays, aes(xintercept = interval_hour),linetype = "dashed",size=1) +
  geom_vline(xintercept = Hurricane_Henri,linetype = "dotted",size=3,color=palette5[5]) +
  labs(
    title="Rideshare trips by week: week32- week36",
    subtitle="Dashed lines for every Friday, Dotted line for Hurrican Henri",
    x="Day", y="Trip Count") +
  plotTheme()+ theme(panel.grid.major = element_blank())

The vertical lines mark Fridays (dashed) and Hurricane Henri (dotted). There are some interesting trends to note. Most weeks exhibit remarkably comparable time series patterns, with consistent peaks and troughs, which suggests the presence of serial correlation. The days surrounding Hurricane Henri experience an overall decrease in trips.

plotData.lag <-
  filter(as.data.frame(ride5), week == 34) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable,"lagHour","lag2Hours","lag3Hours","lag4Hours","lag12Hours","lag1day"))

correlation.lag <-
  group_by(plotData.lag, Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count,
                                    use = "complete.obs"), 2))
ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_point(size = 2.5, color = palette1_assist) +
  geom_text(data = correlation.lag,aes(label = paste("r =", round(correlation, 2))),x=-Inf, y=Inf, vjust = 2, hjust = -.5,size=10) +
  geom_smooth(method = "lm", se = FALSE, colour = palette1_main,size=2.5) +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(
    x="Lag Trip Count",
    title = "Rideshare trip count as a function of time lags") +
  plotTheme()

At the same time, each time lag is plotted against trip count to examine their relationships. From the plots above, we can see that the time lags are highly correlated with trip count; the correlation decreases with each additional lag hour, but predictive power returns with the 1-day lag (because the daily cycle repeats). These features should be strong predictors in a model.
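
For reference, the lag correlations computed above can also be printed as a small table (assuming correlation.lag from the preceding block):

correlation.lag %>%
  kable(caption = 'Table. Correlation between time lags and Trip_Count', align = rep('c', 2)) %>%
  kable_classic(full_width = F, html_font = "Cambria")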



3.3 Spatial Autocorrelation

Besides time lags, spatial autocorrelation is an important pattern that needs to be examined. In the following, bike share trips are plotted spatially at two scales, week and day of week, to see the relationship between space and trip count.

ride5_sf = st_as_sf(ride5, sf_column_name = "geometry" ,
                 crs = st_crs(NYCTracts), agr = "constant")

group_by(ride5_sf, week, start_station_name) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  ungroup() %>%
  ggplot() + 
  geom_sf(data=NYCTracts,color="grey50",size=0.5,fill = "transparent")+
  geom_sf(aes(color = q5(Sum_Trip_Count)),size=2) +
  geom_sf(data=st_union(NYCTracts),color="black",size=3,fill = "transparent")+
  facet_wrap(~week, ncol = 5) +
  scale_color_manual(values = palette5,
                     labels = c("10","196","529","826","1283"),
                    name = "Trip_Count") +
  labs(title="Sum of rideshare trips by tract and week") +
  mapTheme(25,35) + 
  theme(legend.position = "bottom",
        panel.border = element_rect(color = "black",fill = "transparent", size = 6))

In this plot, we can see consistent spatial clustering at the weekly scale. The clustering is concentrated in the south of Manhattan Island, which corresponds to the population density and apartment patterns described above.

test <- group_by(ride5_sf,dayotw,start_station_name) %>%
  summarize(Sum_Trip_Count = sum(Trip_Count)) %>%
  ungroup()

ggplot() + 
  geom_sf(data=NYCTracts,color="grey50",size=0.5,fill = "transparent")+
  geom_sf(data=test,aes(color = q5(Sum_Trip_Count)),size=1.25) +
  geom_sf(data=st_union(NYCTracts),color="black",size=2,fill = "transparent")+
  facet_wrap(~dayotw, ncol = 7) +
  scale_color_manual(values = palette5,
                     labels = c("12","140","375","596","916"),
                    name = "Trip_Count") +
  labs(title="Sum of rideshare trips by tract and week") +
  mapTheme(25,35) + 
  theme(legend.position = "bottom",
        panel.border = element_rect(color = "black",fill = "transparent", size = 6))

In this plot, patterns similar to the weekly plot are observed. Even during the weekend, the spatial pattern appears consistent, with trips concentrated in the south of Manhattan Island.



3.4 Spatial & Seriel Correlation

ride.animation.data <- ride5_sf%>%
  filter(week == 32)


ride.animation.data2 <-ride.animation.data%>%
  mutate(Trips =
           case_when(
             Trip_Count == 0 ~ "0 trips",
             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
             Trip_Count > 3 & Trip_Count <= 11 ~ "4-11 trips",
             Trip_Count > 11 & Trip_Count <= 28 ~ "12-28 trips",
             Trip_Count > 28 ~ "28+ trips")) %>%
  mutate(Trips = fct_relevel(Trips,"0 trips","1-3 trips","4-11 trips","12-28 trips","28+ trips"))
mapTheme2 <- function(base_size = 12, title_size = 15) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(hjust = 0.5, size = title_size, colour = "black",face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5,size=base_size,face="italic"),
    plot.caption = element_text(size=base_size,hjust=0),
    axis.ticks = element_blank(),
    panel.spacing = unit(6, 'lines'),
    panel.background = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=base_size),
    axis.title = element_text(face = "bold",size=base_size),
    axis.text = element_blank(),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(size=base_size,colour = "black", face = "italic"),
    legend.text = element_text(size=base_size,colour = "black", face = "italic"),
    strip.text.x = element_text(size = base_size,face = "bold")
  )
}

rideshare_animation <-ggplot() +
  geom_sf(data=NYCTracts,color="grey80",size=0.01,fill = "transparent")+
  geom_sf(data = ride.animation.data2, aes(color = Trips),size=1.32) +
  geom_sf(data=st_union(NYCTracts),color="black",size=1.2,fill = "transparent")+
  scale_color_manual(values = palette5) +
  labs(title ="Rideshare pickups for week 32, Manhattan Island",
       subtitle = "One hour intervals: {current_frame}") +
  transition_manual(interval_hour) +
  mapTheme2() + 
  theme(legend.position = "bottom",
        panel.border = element_rect(color = "black",fill = "transparent", size = 1))

animate(rideshare_animation, duration=10,renderer = gifski_renderer())
knitr::include_graphics("gif.gif")

Above, an animation displays the spatial and temporal correlation together: each frame shows the spatial distribution of trips at a specific hour. Around midnight, when citizens’ activities decrease dramatically, the spatial clustering of trip counts disappears. In the daytime, however, and especially at peak hours, the spatial clustering is very pronounced.



3.5 Other features Correlation

In the following, the effect of each additional feature (Precipitation, Wind, Temperature, Population Density, Distance to the five nearest parks, and Distance to the five nearest apartments) is examined in turn.

group_by(ride5,interval_hour) %>%
  summarize(Trip_Count = mean(Trip_Count),Precip = first(Precip)) %>%
  mutate(isPrecip = ifelse(Precip > 0,"Rain/Snow", "None")) %>%
  na.omit()%>%
  group_by(isPrecip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
  ggplot(aes(isPrecip, Mean_Trip_Count)) +
  geom_bar(stat = "identity",aes(fill=isPrecip)) +
  scale_fill_manual(values=palette2)+
  labs(title="Does ridership vary with precipitation?",
       x="Precipitation", y="Mean Trip Count") +
  plotTheme()

To examine the effect of precipitation, a dummy variable, isPrecip, is created to ask whether the presence of precipitation affects mean Trip_Count. From the plot above, we see a moderate difference between Rain/Snow and None, unlike in the textbook; that may be because people in different cities have different tolerances for weather. As a precaution, the categorical predictor isPrecip is still included in the regression model.

Temp <- ride5%>%
  mutate(hour=hour(interval_hour))%>%
  filter(hour==9)%>%
  group_by(interval_hour) %>%summarize(Trip_Count = mean(Trip_Count),Temp = first(Temp)) %>%na.omit()

Wind <- ride5%>%
  mutate(hour=hour(interval_hour))%>%
  filter(hour==9)%>%
  group_by(interval_hour) %>%summarize(Trip_Count = mean(Trip_Count),Wind = first(Wind)) %>%na.omit()


correlation.Temp <-as.data.frame(round(cor(Temp$Temp, Temp$Trip_Count,use = "complete.obs"), 2))

correlation.Wind <-as.data.frame(round(cor(Wind$Wind, Wind$Trip_Count,use = "complete.obs"), 2))

Temp_plot <- ggplot(data=Temp,aes(Temp, Trip_Count)) +
  geom_point(size = 10, color = palette1_assist) +
  geom_text(data=correlation.Temp,aes(label = paste("r =",correlation.Temp[1,]),x=-Inf, y=Inf, vjust = 2, hjust = -.25),size=15) +
  geom_smooth(method = "lm", se = FALSE, colour = palette1_main,size=2.5) +
  labs(
    title = "Trip Count as a fuction of Temp",
    x="Temperature at 9:00 am") +
  plotTheme()+theme(legend.text = element_blank())

Wind_plot <- ggplot(data=Wind,aes(Wind, Trip_Count)) +
  geom_point(size = 10, color = palette1_assist) +
  geom_text(data=correlation.Wind,aes(label = paste("r =",correlation.Wind[1,]),x=-Inf, y=Inf, vjust = 2, hjust = -.25),size=15) +
  geom_smooth(method = "lm", se = FALSE, colour = palette1_main,size=2.5) +
  labs(
    title = "Trip Count as a fuction of Wind",
    y="",
    x="Wind at 9:00 am") +
  plotTheme()+theme(legend.text = element_blank())

grid.arrange(ncol=2,Temp_plot,Wind_plot)

Since temperature and wind both vary with the hour of the day, a single time slot, 9:00 am, is selected to examine their effect. From the two graphs above, the correlation for temperature is 0.05 while that for wind is -0.16, suggesting that neither temperature nor wind has much effect on people’s willingness to use bike share. However, as shown later, the situation is different when extreme weather occurs.

Pop <- group_by(ride5,start_station_name) %>%
  summarize(Trip_Count = mean(Trip_Count),Census_PopDensity = first(Census_PopDensity)) %>%na.omit()

Park <- group_by(ride5,start_station_name) %>%
  summarize(Trip_Count = mean(Trip_Count),park_nn5 = first(park_nn5)) %>%na.omit()

Apt <- group_by(ride5,start_station_name) %>%
  summarize(Trip_Count = mean(Trip_Count),trans_apartment_nn5 = first(trans_apartment_nn5)) %>%na.omit()

correlation.pop <-as.data.frame(round(cor(Pop$Census_PopDensity, Pop$Trip_Count,use = "complete.obs"), 2))

correlation.park <-as.data.frame(round(cor(Park$park_nn5, Park$Trip_Count,use = "complete.obs"), 2))

correlation.apt <-as.data.frame(round(cor(Apt$trans_apartment_nn5, Apt$Trip_Count,use = "complete.obs"), 2))

Pop_plot <- ggplot(data=Pop,aes(Census_PopDensity, Trip_Count)) +
  geom_point(size = 2.5, color = palette1_assist) +
  geom_text(data=correlation.pop,aes(label = paste("r =",correlation.pop[1,]),x=-Inf, y=Inf, vjust = 2, hjust = -.25),size=15) +
  geom_smooth(method = "lm", se = FALSE, colour = palette1_main,size=2.5) +
  labs(
    title = "Trip Count - Population",
    x="Census Population") +
  plotTheme()+theme(legend.text = element_blank())

Park_plot <- ggplot(data=Park,aes(park_nn5, Trip_Count)) +
  geom_point(size = 2.5, color = palette1_assist) +
  geom_text(data=correlation.park,aes(label = paste("r =",correlation.park[1,]),x=-Inf, y=Inf, vjust = 2, hjust = -.25),size=15) +
  geom_smooth(method = "lm", se = FALSE, colour = palette1_main,size=2.5) +
  labs(
    title = "Trip Count - Park",
    x="Distance to five nearest parks",
    y="") +
  plotTheme()+theme(legend.text = element_blank())

Apt_plot <- ggplot(data=Apt,aes(trans_apartment_nn5, Trip_Count)) +
  geom_point(size = 2.5, color = palette1_assist) +
  geom_text(data=correlation.apt,aes(label = paste("r =",correlation.apt[1,]),x=-Inf, y=Inf, vjust = 2, hjust = -.25),size=15) +
  geom_smooth(method = "lm", se = FALSE, colour = palette1_main,size=2.5) +
  labs(
    title = "Trip Count - Apartment",
    x="Distance to five nearest apartments",
    y="") +
  plotTheme()+theme(legend.text = element_blank())

grid.arrange(ncol=3,Pop_plot,Park_plot,Apt_plot)

In the plots above, the correlation between population density and trip count is -0.06, park_nn5’s is -0.13, and trans_apartment_nn5’s is -0.4, suggesting that among these three predictors the apartment feature has the strongest association. Although the correlations are modest, all three predictors are still included in the regression model.



4 Regression

4.1 Regression Models

#no time, no spatial
reg0 <- lm(Trip_Count ~dayotw+Temp + isPrecip + Wind + Census_PopDensity + park_nn5 + trans_apartment_nn5,data=ride.Train)
#yes time, no spatial
reg1 <- lm(Trip_Count ~hour(interval_hour) + dayotw + Temp + isPrecip + Wind + Census_PopDensity + park_nn5 + trans_apartment_nn5,data=ride.Train)
#no time, yes spatial
reg2 <- lm(Trip_Count ~start_station_name + dayotw + Temp + isPrecip + Wind + Census_PopDensity + park_nn5 + trans_apartment_nn5,data=ride.Train)
#yes time, yes spatial
reg3 <- lm(Trip_Count ~start_station_name + hour(interval_hour) + dayotw + Temp + isPrecip + Wind + Census_PopDensity + park_nn5 + trans_apartment_nn5,data=ride.Train)
#yes time&lag, yes spatial
reg4 <- lm(Trip_Count ~start_station_name + hour(interval_hour) + dayotw + Temp + isPrecip + Wind + Census_PopDensity + park_nn5 + trans_apartment_nn5 + lagHour + lag2Hours + lag3Hours + lag4Hours +lag12Hours + lag1day,
           data=ride.Train)

ride.Test.weekNest <-as.data.frame(ride.Test) %>%nest(-week)

model_pred <- function(dat, fit){pred <- predict(fit, newdata = dat)}

week_predictions <-ride.Test.weekNest %>%
  mutate(
    A_Baseline = map(.x = data, fit = reg0, .f = model_pred),
    B_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
    C_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
    D_Space_Time_FE = map(.x = data, fit = reg3,.f = model_pred),
    E_Space_Time_Lags = map(.x = data, fit = reg4,.f = model_pred))

In this section, five different linear regressions are estimated with different sets of fixed effects:

  • reg0 focuses on just the additional features

  • reg1 focuses on just time, including additional features

  • reg2 focuses on just space fixed effects, including additional features

  • reg3 focuses on time, space fixed effects and additional features

  • reg4 adds the time lag features.

The time feature hour is treated as continuous, so its interpretation is the estimated change in Trip_Count associated with a one-hour increase. Spatial fixed effects for start_station_name are also included to account for across-station differences, such as nearby amenities, access to transit, and distance to the urban core.

Ordinary least squares (OLS) regression is used, and its assumptions are taken to hold reasonably well for this analysis.
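
For example, the hour term can be read directly from the fitted model (a minimal sketch, assuming reg4 has been estimated as above); its estimate is the expected change in Trip_Count associated with a one-hour increase, holding the other predictors constant.

# Coefficient, standard error, t value, and p value for the continuous hour term in reg4;
# the row name follows lm's naming of the hour(interval_hour) term in the formula above.
coef(summary(reg4))["hour(interval_hour)", ]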



4.2 Validation by time

week_predictions <- week_predictions %>%
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean),
         sd_AE = map_dbl(Absolute_Error, sd))

week_predictions2 <- week_predictions
week_predictions2$week <- as.character(week_predictions2$week)

week_predictions2 %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) +
  geom_bar(aes(fill = Regression),
           position = "dodge", stat="identity") +
  scale_fill_manual(values = palette5) +
  labs(title = "Mean Absolute Errors by model and week") +
  plotTheme()+
  theme(legend.position = "bottom")

In this section, mean absolute error (MAE) is calculated for each of the five models. The graph shows that with increasing sophistication, the models become more accurate and more generalizable. Incorporating the short time-lag features significantly improves predictive power, and the time and space fixed effects also improve the models by decreasing the MAE.

week_predictions2 %>%
  mutate(interval_hour = map(data, pull, interval_hour),
         start_station_name =map(data, pull, start_station_name)) %>%
  dplyr::select(interval_hour, start_station_name,
                Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval_hour,-start_station_name) %>%
  group_by(Regression, Variable, interval_hour) %>%
  summarize(Value = mean(Value)) %>%
  ggplot(aes(interval_hour, Value, colour=Variable)) +
  geom_line(size = 3) +
  facet_wrap(~Regression, ncol=1) +
  scale_colour_manual(values = palette2) +
  labs(
    title = "Mean Predicted/Observed rideshare by hour",
    x = "Hour", y= "Rideshare Trips") +
  plotTheme()+
  theme(legend.position = "bottom")

For each model, predicted and observed Trip_Count are taken out of the spatial context and their means plotted over time. Consistent with the previous observations, greater sophistication brings the ability to predict the highest peaks, and the time lags help make this happen. The time series also shows very large errors on Sept. 2 in the A, B, C, and D models. A search of the news reveals the reason, described below.

knitr::include_graphics("Mamaroneck, N.Y.Mike SegarReuters.jpg")
Mamaroneck, N.Y. (Mike Segar / Reuters)

Shortly after midnight on Sept. 2, the remnants of Hurricane Ida struck the East Coast, dumping flooding rain, heavy precipitation, and strong wind. By daytime the storm had passed and daily life in the city was largely unaffected, yet the extreme overnight rainfall and wind remain in the weather dataset. Because weather recorded around midnight is used to represent conditions for predictions later in the day, the errors appear. This is a reminder to be cautious when filtering data, especially when weather from one specific time is used to represent conditions for the whole day.
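
As a rough check on this explanation (a sketch assuming week_predictions2 from Section 4.2; D_Space_Time_FE is used only for illustration), the mean absolute error on Sept. 2 can be compared with the rest of the test window:

# Compare MAE on the day of the Ida remnants against all other test-set hours
week_predictions2 %>%
  filter(Regression == "D_Space_Time_FE") %>%
  mutate(interval_hour = map(data, pull, interval_hour)) %>%
  dplyr::select(interval_hour, Observed, Prediction) %>%
  unnest() %>%
  mutate(isIdaDay = as.Date(interval_hour) == as.Date("2021-09-02")) %>%
  group_by(isIdaDay) %>%
  summarize(MAE = mean(abs(Observed - Prediction)))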



4.3 Validation by space

error.byWeek <-
  filter(week_predictions2, Regression != "A_Baseline") %>%
  unnest() %>% 
  st_sf() %>%
  dplyr::select(start_station_name, Absolute_Error,week, geometry,Regression) %>%
  gather(Variable, Value, -start_station_name,-week, -geometry,-Regression) %>%
  group_by(Variable, start_station_name, week,Regression) %>%
  summarize(MAE = mean(Value))

What do these model errors look like in space? In this section, mean absolute error is mapped by station for Weeks 35 and 36. A second plot maps mean absolute error by station and by hour for the Monday of Week 35.

q4 <- function(variable) {as.factor(ntile(variable, 4))}
ggplot() +
  geom_sf(data=st_union(NYCTracts),color="black",size=3,fill = "transparent")+
  geom_sf(data=NYCTracts,color="black",size=0.5,linetype ="dashed",fill = "transparent")+
  geom_sf(data = error.byWeek, aes(color = q4(MAE)), size = 2) +
  scale_color_manual(values=palette5,
                     labels = c("0","1","2","3"),
                     name = "MAE") +
  facet_wrap(week~Regression, ncol = 4)+
  labs(title = 'Mean absolute error by station and by week') +
  mapTheme(25,35)+ 
  theme(
    legend.position = "bottom",
    panel.border = element_rect(color = "black",fill = "transparent", size = 6))

In this graph, the E_Space_Time_Lags model shows the least error clustering. Overall, most of the large MAEs concentrate in the south of Manhattan Island, where trip counts are high, suggesting the models predict less accurately in high-trip-count areas.

error.byHour <-
  filter(week_predictions2, Regression == "E_Space_Time_Lags" & week==35) %>%
  unnest() %>%
  filter(dayotw == "Mon")%>%
  st_sf() %>%
  dplyr::select(start_station_name, Absolute_Error,
                geometry, interval_hour) %>%
  gather(Variable, Value, -interval_hour,-start_station_name, -geometry) %>%
  group_by(hour = hour(interval_hour), start_station_name) %>%
  summarize(MAE = mean(Value))
ggplot() +
  geom_sf(data=st_union(NYCTracts),color="black",size=2,fill = "transparent")+
  geom_sf(data=NYCTracts,color="grey70",size=1,linetype ="dashed",fill = "transparent")+
  geom_sf(data = error.byHour, aes(color = q4(MAE)), size = 2) +
  scale_color_manual(values=palette4,
                     labels = c("0","1","2","2+"),
                     name = "MAE") +
  facet_wrap(~hour, ncol = 8)+
  labs(title = 'Mean absolute error by station and by hour') +
  mapTheme()+ 
  theme(
    legend.position = "bottom",
    panel.border = element_rect(color = "black",fill = "transparent", size = 4))

In this graph, the MAEs still cluster in the south of Manhattan Island. Moreover, high MAEs consistently occur at NYC peak times, from 7:00 to 8:00 am and from 5:00 to 7:00 pm.



4.4 Cross Validation by Time

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry,interval_hour, indVariables, dependentVariable)
    
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry,interval_hour, indVariables, dependentVariable)
    
    regression <- lm(paste0(dependentVariable,"~."), 
    data = fold.train %>% dplyr::select(-geometry))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
    gc()
  }
  return(st_sf(allPredictions))
}

ride.cross <- ride.cross%>%mutate(hour=hour(interval_hour))
reg.logo_vars <- c("hour","start_station_name","dayotw","Temp","isPrecip","Wind", "Census_PopDensity","park_nn5","trans_apartment_nn5","lagHour","lag2Hours", "lag3Hours","lag4Hours","lag12Hours","lag1day")

reg.logo_hour <- crossValidate(
  dataset = ride.cross,
  id = "hour",
  dependentVariable = "Trip_Count",
  indVariables = reg.logo_vars)

To further validate the generalizability of the model, I cross-validate the E_Space_Time_Lags specification (time and space fixed effects, the additional features, and the time lags) by hour of the day. Each hour takes a turn as the hold-out set: the model is trained on the other hours of the day and tested on that hour.

reg.logo_hour <- reg.logo_hour%>%
  mutate(MAE = abs(Prediction - Trip_Count))%>%
  na.omit()

reg.logo_hour %>%
  ggplot(aes(MAE)) +
  geom_histogram(bins = 40, colour="white", fill=palette1_main) +
  scale_x_continuous(breaks = seq(0, 10, by = 1)) +
  labs(title="Distribution of MAE - LOGO-CV",
       x="Mean Absolute Error", y="Trip Count") +
  plotTheme()

st_drop_geometry(reg.logo_hour) %>%
  mutate(Count_Decile = ntile(Trip_Count, 10)) %>%
  group_by(Count_Decile) %>%
  summarize(meanObserved = mean(Trip_Count, na.rm=T),
            meanPrediction = mean(Prediction, na.rm=T)) %>%
  gather(Variable, Value, -Count_Decile) %>%
  ggplot(aes(Count_Decile, Value, shape = Variable)) +
  geom_path(aes(group = Count_Decile), size=2,colour = palette1_assist) +
  geom_point(size = 6,color=palette1_main) +
  scale_shape_manual(values = c(2, 17)) +
  xlim(0,10) +
  ylim(0,10)+
  labs(title = "Predicted and observed count by decile") +
  plotTheme()+
  theme(legend.position="bottom")

st_drop_geometry(reg.logo_hour) %>%
  summarize(Mean_MAE = round(mean(MAE), 2),
            SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = '<center>Table. MAE & SD- LOGO-CV ', align=rep('c', 2)) %>%
  kable_classic(full_width = T, html_font = "Cambria")
reg.logo_hour <- reg.logo_hour%>%
  mutate(MAE = abs(Prediction - Trip_Count))%>%
  na.omit()

grid.arrange(ncol=2,
reg.logo_hour %>%
  ggplot(aes(MAE)) +
  geom_histogram(bins = 40, colour="white", fill=palette1_main) +
  scale_x_continuous(breaks = seq(0, 3, by = 1)) +
  labs(title="Distribution of MAE, LOGO-CV",
       x="Mean Absolute Error", y="Trip Count") +
  plotTheme()+
  theme(legend.position="bottom",
    axis.text = element_text(size=25)),
st_drop_geometry(reg.logo_hour) %>%
  mutate(Count_Decile = ntile(Trip_Count, 10)) %>%
  group_by(Count_Decile) %>%
  summarize(meanObserved = mean(Trip_Count, na.rm=T),
            meanPrediction = mean(Prediction, na.rm=T)) %>%
  gather(Variable, Value, -Count_Decile) %>%
  ggplot(aes(Count_Decile, Value, shape = Variable)) +
  geom_path(aes(group = Count_Decile), size=2,colour = palette1_assist) +
  geom_point(size = 6,color=palette1_main) +
  scale_shape_manual(values = c(2, 17)) +
  xlim(0,10) +
  ylim(0,10)+
  labs(title = "Predicted and observed") +
  plotTheme()+
  theme(legend.position = c(.3,.8)))

Table. MAE & SD- LOGO-CV
Mean_MAE SD_MAE Regression
2.32 0.02 Time_Space_Time Lags
2.32 2.86 LOGO-CV

In these plots, we can see that the MAEs mainly fall in the 0-3 range. The LOGO-CV model tends to over-predict low trip counts and performs relatively well for high trip counts. Even though the mean MAE stays the same, the standard deviation of the MAE increases under LOGO-CV.

logo_error2 %>%
  dplyr::select(hour,Mean_MAE, SD_MAE) %>%
  gather(Variable, value, -hour) %>%
  ggplot(aes(hour, value, fill=palette1_main)) +
  geom_bar(stat ="summary", fun ="mean") +
  facet_wrap(~Variable, scales = "free",ncol=2) +
  labs(x="Hour", y="Value",
       title ="LOGO Error DIstribution") +
  plotTheme() + 
  theme(legend.position = "none")

reg.logo_hour %>%
  st_drop_geometry()%>%
  dplyr::select(interval_hour, start_station_name,
                Trip_Count, Prediction) %>%
  gather(Variable, Value, -interval_hour,-start_station_name) %>%
  group_by(Variable, interval_hour) %>%
  summarize(Value = mean(Value))%>%
  ggplot(aes(interval_hour, Value, colour=Variable)) +
  geom_line(size = 2) +
  scale_colour_manual(values = palette2) +
  labs(
    title = "Mean Predicted/Observed rideshare by time",
    x = "Time", y= "Rideshare Trips") +
  plotTheme()+
  theme(legend.position = "bottom")

Plotting the MAEs by hour confirms the earlier observation that high MAEs consistently occur at NYC peak times, from 7:00 to 8:00 am and from 5:00 to 7:00 pm.

LOGO_SPD <- reg.logo_hour %>%
  st_drop_geometry()%>%
  filter(dayotw == "Mon")%>%
  dplyr::select(hour, start_station_name,MAE) %>%
  gather(Variable, Value, -hour,-start_station_name) %>%
  group_by(Variable, hour,start_station_name,) %>%
  summarize(Value = mean(Value))%>%
  left_join(ride_station%>%dplyr::select(-GEOID),by="start_station_name")%>%
  st_sf()

ggplot() +
  geom_sf(data=st_union(NYCTracts),color="black",size=2,fill = "transparent")+
  geom_sf(data=NYCTracts,color="grey80",size=0.5,fill = "transparent")+
  geom_sf(data = LOGO_SPD, aes(color = q4(Value)), size = 2) +
  scale_color_manual(values=palette4,
                     labels = c("0","1","2","2+"),
                     name = "MAE") +
  facet_wrap(~hour, ncol = 8)+
  labs(title = 'Mean absolute error by station and by hour - LOGO') +
  mapTheme()+ 
  theme(
    legend.position = "bottom",
    panel.border = element_rect(color = "black",fill = "transparent", size = 3))

In this spatial distribution, the error clustering in the south of Manhattan Island becomes more pronounced, which means the LOGO-CV model has weaker spatial prediction ability. Consistent with the bar plot above, MAEs are also larger during the daytime.

Overall, the accuracy and generalizability of this LOGO-CV model are not as good as those of the E_Space_Time_Lags model. As Professor Ken Steif notes, LOGO-CV makes the rigid assumption that the experience of all other time slots generalizes to the held-out slot, even when that slot has a particularly unique temporal experience.



5 Conclusion

In this project, I have developed an algorithm to facilitate bike re-balancing for New York’s Citi Bike system on Manhattan Island. Using spatial, temporal, and time-lag features, the model is successful at predicting the trip count of each station at a particular time. The research also highlights the caution needed when weather data from one specific time is used to represent the whole day, especially under extreme weather conditions. The LOGO-CV regression model used here does not achieve good prediction accuracy because of its rigid assumption. Although the prediction errors of the E_Space_Time_Lags model (which performs best among these models) are still concentrated in the south of Manhattan Island, the model proves useful enough to guide Citi Bike’s re-balancing, since its absolute errors mostly fall within a small range (roughly 0-3 trips).