Cycling is now one of the major ways people get around, and more and more riders are turning to shared bikes because of their convenience. Sometimes, however, there is no working bike to pick up at a station, or there is no open dock to return one. These unpleasant experiences hurt the bike share company as well as its riders. To better allocate bikes across the network, the operator needs a ‘re-balancing’ plan. Re-balancing is the practice of anticipating (or predicting) bike share demand at every station at every time and manually redistributing bikes so that a bike, or an open dock, is available when needed.
As for the re-balancing plan, I recommend that the company combine two strategies: using trucks to move bikes between stations, and offering rewards that encourage riders themselves to redistribute bikes.
In this assignment, I analyze shared bike rides in the city of Philadelphia. I aim to build a model with reasonable accuracy and generalizability, predict bike demand both spatially and temporally from the given data, and offer brief suggestions for Indego’s re-balancing plan. The goal is to predict demand two to three weeks into the future, a horizon that should fully meet the operational needs of a bike share company.
In this project, I use Indego bike share data to analyze and predict bike share demand in Philadelphia. The data cover April 16, 2019 through May 20, 2019, five weeks in total. The Indego data comes from its official website https://www.rideindego.com/about/data/.
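The code throughout this report assumes that a setup chunk has already loaded the required packages. The list below is inferred from the functions called later and is not part of the original report, so treat it as a minimal sketch:

library(tidyverse)    # dplyr, tidyr, purrr, ggplot2, forcats
library(sf)           # spatial features and spatial joins
library(lubridate)    # floor_date, ymd_hm, week, wday
library(tidycensus)   # get_acs
library(riem)         # riem_measures for weather
library(gridExtra)    # grid.arrange
library(knitr)        # kable
library(kableExtra)   # kable_styling
library(caret)        # trainControl, train
library(gganimate)    # transition_manual, animate
library(gifski)       # gifski_renderer
library(viridis)      # scale_colour_viridis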
The Indego bike share data is processed to add columns of 15-minute (interval15) and 60-minute (interval60) intervals, week, and day of the week (dotw) features.
# philly indego data
indego <- read.csv("indego_416_520.csv")
# glimpse(indego)

indego <-
  indego %>%
  mutate(interval60 = floor_date(ymd_hm(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hm(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label = TRUE))
# glimpse(indego)
Bus stop and school data are imported as amenity features to enrich the later models. Both data sets come from Open Data Philly https://www.opendataphilly.org/dataset.
Nearest neighbor features for schools and bus stops are calculated as predictors to be put into the regression models.
bus_stops <- read.csv("bus_stops.csv")

bus_stops <-
  bus_stops %>%
  st_as_sf(., coords = c("stop_lon", "stop_lat"), crs = 4326)

schools <-
  st_read("Schools.geojson") %>%
  st_set_crs('4326')

start_stat <-
  indego %>%
  filter(is.na(start_lon) == FALSE &
           is.na(start_lat) == FALSE &
           is.na(end_lon) == FALSE &
           is.na(end_lat) == FALSE) %>%
  st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)

st_c <- st_coordinates

indego <-
  indego %>%
  filter(is.na(start_lon) == FALSE &
           is.na(start_lat) == FALSE &
           is.na(end_lon) == FALSE &
           is.na(end_lat) == FALSE) %>%
  mutate(
    schools_nn3 = nn_function(na.omit(st_c(start_stat)), na.omit(st_c(schools)), 3),
    busstops_nn3 = nn_function(na.omit(st_c(start_stat)), na.omit(st_c(bus_stops)), 3)
  )
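The nn_function() used above is not defined in this report; it comes from helper scripts used in the course. A minimal sketch of what such an average k-nearest-neighbor distance helper might look like, assuming the FNN package is available:

# average distance from each origin point to its k nearest amenity points
nn_function <- function(measureFrom, measureTo, k) {
  dists <- FNN::get.knnx(measureTo, measureFrom, k)$nn.dist
  rowMeans(dists)
}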
Census data is also brought into the process. It captures the socioeconomic characteristics of each tract, which are related to bike share demand. Census tracts are also useful geographic study units, although they are not the unit of analysis in this assignment.
phillyCensus <-
  get_acs(geography = "tract",
          variables = c("B01003_001", "B19013_001",
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001",
                        "B08301_010", "B01002_001"),
          year = 2019,
          state = "PA",
          geometry = TRUE,
          county = c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop = B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
                Means_of_Transport, Total_Public_Trans,
                Med_Age, GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

phillyTracts <-
  phillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>%
  st_sf()

bike_census <- st_join(indego %>%
                         filter(is.na(start_lon) == FALSE &
                                  is.na(start_lat) == FALSE &
                                  is.na(end_lon) == FALSE &
                                  is.na(end_lat) == FALSE) %>%
                         st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
                       phillyTracts %>%
                         st_transform(crs = 4326),
                       join = st_intersects,
                       left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2))) %>%
  as.data.frame() %>%
  dplyr::select(-geometry) %>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phillyTracts %>%
            st_transform(crs = 4326),
          join = st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID) %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2))) %>%
  as.data.frame() %>%
  dplyr::select(-geometry)
Weather data is essential for predicting whether people decide to ride a bike. When it rains heavily, for example, few people are willing to ride outside. The weather data for the PHL station is imported with the riem_measures() function and processed into hourly interval, temperature, precipitation, and wind speed features, which are plotted below.
# philly weather data
weather.phl <-
  riem_measures(station = "PHL", date_start = "2019-04-16", date_end = "2019-05-20") %>%
  dplyr::select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.phl)
## Rows: 816
## Columns: 4
## $ interval60 <dttm> 2019-04-16 00:00:00, 2019-04-16 01:00:00, 2019-04-16 02~
## $ Temperature <dbl> 51.1, 48.9, 48.0, 46.9, 46.0, 44.1, 44.1, 46.0, 45.0, 43~
## $ Precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Wind_Speed <dbl> 26, 23, 22, 16, 11, 9, 10, 11, 11, 11, 13, 14, 13, 16, 1~
grid.arrange(
ggplot(weather.phl, aes(interval60, Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") ,
ggplot(weather.phl, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") ,
ggplot(weather.phl, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") ,
top="Weather Data - PHL - April-May, 2019")
In this part, a space-time panel is created for the regression models. The panel ensures that every unique combination of bike station and time interval exists in the data set, with each row recording whether a trip took place at that station during that hour. All other features needed for model training are also joined to the panel.
# length(unique(bike_census$interval60)) * length(unique(bike_census$start_station))
bike.panel <-
  expand.grid(interval60 = unique(bike_census$interval60),
              start_station = unique(bike_census$start_station)) %>%
  left_join(., bike_census %>%
              dplyr::select(start_station, Origin.Tract, start_lon, start_lat, schools_nn3, busstops_nn3) %>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

bike.panel2 <-
  bike_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(bike.panel) %>%
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat, schools_nn3, busstops_nn3) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>%
  left_join(weather.phl, by = "interval60") %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)
# glimpse(bike.panel2)

bike.panel2 <-
  left_join(bike.panel2, phillyCensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))
Time lags are created to add additional information about the demand during a given time period. Time lags of 1 hour, 2 hours, 3 hours, 4 hours, 12 hours, and 1 day are created in the panel.
The correlations between these lags and the trip count are shown in the table below. The 1-hour and 1-day lags have the two largest positive correlations, while the 12-hour lag has the largest negative correlation. In other words, bike demand is most similar in adjacent hours and at the same hour on the following day, but tends to look very different at the same location 12 hours later, when people’s needs have changed considerably.
bike.panel.lag <-
  bike.panel2 %>%
  arrange(start_station, interval60) %>%
  mutate(lagHour = dplyr::lag(Trip_Count, 1),
         lag2Hours = dplyr::lag(Trip_Count, 2),
         lag3Hours = dplyr::lag(Trip_Count, 3),
         lag4Hours = dplyr::lag(Trip_Count, 4),
         lag12Hours = dplyr::lag(Trip_Count, 12),
         lag1day = dplyr::lag(Trip_Count, 24)) %>%
  mutate(day = yday(interval60))

as.data.frame(bike.panel.lag) %>%
  group_by(interval60) %>%
  summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels = c("lagHour", "lag2Hours", "lag3Hours", "lag4Hours",
                                                "lag12Hours", "lag1day"))) %>%
  group_by(Variable) %>%
  summarize(correlation = round(cor(Value, Trip_Count), 2)) %>%
  kable(caption = "Time Lag Correlation") %>%
  kable_styling("striped", full_width = F)
| Variable   | correlation |
|------------|-------------|
| lagHour    | 0.84        |
| lag2Hours  | 0.62        |
| lag3Hours  | 0.42        |
| lag4Hours  | 0.26        |
| lag12Hours | -0.41       |
| lag1day    | 0.75        |
In this part, I visualize the bike trip data. First, to preserve the temporal pattern of the data, I split the space-time panel (with time lags) into three weeks of training data and two weeks of testing data. A plot of bike share trips per hour is then made. As the plot shows, a fairly similar pattern repeats every week and every day, which supports the decision to include the time lags.
# 3 weeks of training data, 2 weeks of testing data
bike.Train <- filter(bike.panel.lag, week <= 18)
bike.Test  <- filter(bike.panel.lag, week > 18)
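The plots below reference color palettes (palette2, palette4_2, palette5) that are presumably defined in an earlier setup chunk not shown here; a purely illustrative sketch:

# illustrative palettes only; the original report defines its own
palette2   <- c("#6baed6", "#08519c")
palette5   <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
palette4_2 <- c("#f7fcb9", "#addd8e", "#41ab5d", "#238443", "#005a32")  # five trip-count bins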
rbind(
mutate(bike.Train, Legend = "Training"),
mutate(bike.Test, Legend = "Testing")) %>%
group_by(Legend, interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
ungroup() %>%
ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
scale_colour_manual(values = palette2) +
labs(title="Bike share trips per hour, Philly, April&May, 2019",
x="Date", y="Number of trips")
## `summarise()` has grouped output by 'Legend'. You can override using the `.groups` argument.
Histograms of the mean number of trips per station are made for four time periods. The mean trips per station show different patterns during the AM rush, PM rush, mid-day, and overnight periods; people are most likely to ride during the PM rush.
bike_census %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(interval60, start_station, time_of_day) %>%
tally()%>%
group_by(start_station, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1)+
labs(title="Mean Number of Hourly Trips Per Station. Philly, April&May, 2019",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)
Below is the histogram of bike share trips per hour by station. Most stations see fewer than 10 rides per hour.
ggplot(bike_census %>%
group_by(interval60, start_station) %>%
tally())+
geom_histogram(aes(n), binwidth = 5)+
labs(title="Bike share trips per hour by station. Philly, April&May, 2019",
x="Trip Counts",
y="Number of Stations")
Below are plots of the distribution of bike share trips over the course of the day, split by day of the week and by weekday versus weekend. Weekdays show a bimodal curve, with peaks during the AM and PM rush hours. Weekends show a single peak, roughly in the afternoon, when many people go out to relax.
ggplot(bike_census %>% mutate(hour = hour(start_time)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
labs(title="Bike share trips in Philly, by day of the week, April&May, 2019",
x="Hour",
y="Trip Counts")
ggplot(bike_census %>%
mutate(hour = hour(start_time),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
labs(title="Bike share trips in Philly - weekend vs weekday, April&May, 2019",
x="Hour",
y="Trip Counts")
I also map the origins of the bike trips, faceted by weekday/weekend and by time of day. In general, more trips are generated on weekdays than on weekends, and more during the PM rush. Bike trips are concentrated in Center City Philadelphia.
bike_points <-
  bike_census %>%
  mutate(hour = hour(start_time),
         weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, start_lat, start_lon, weekend, time_of_day)
ggplot()+
geom_sf(data = phillyTracts %>%
st_transform(crs=4326))+
geom_point(data = bike_points %>%
tally(),
aes(x=start_lon, y = start_lat, color = n),
fill = "transparent", alpha = 0.4, size = 1.0)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bike_points$start_lat), max(bike_points$start_lat))+
xlim(min(bike_points$start_lon), max(bike_points$start_lon))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hour by station. Philly, April&May, 2019")
To better illustrate how the number of trips at each station changes over time, an animated map is generated. Trip counts are at the 60-minute interval level, and the entire week 19 is shown in the map.
# animation
stations <-
  indego %>%
  dplyr::select(start_station, start_lon, start_lat) %>%
  group_by(start_station) %>%
  distinct() %>%
  filter(is.na(start_lon) == FALSE &
           is.na(start_lat) == FALSE) %>%
  st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)

week19 <-
  filter(bike_census, week == 19)

week19.panel <-
  expand.grid(
    interval60 = unique(week19$interval60),
    Origin.Tract = unique(bike.panel2$Origin.Tract))

bike.animation.data <-
  mutate(week19, Trip_Counter = 1) %>%
  right_join(week19) %>%
  group_by(interval60, start_station) %>%
  summarise(Trip_Count = sum(Trip_Counter, na.rm = TRUE)) %>%
  ungroup() %>%
  left_join(stations, by = c("start_station" = "start_station")) %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count == 1 ~ "1 trips",
                           Trip_Count == 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
                           Trip_Count > 4 ~ "5+ trips")) %>%
  mutate(Trips = fct_relevel(Trips, "0 trips", "1 trips", "2 trips",
                             "3-4 trips", "5+ trips"))

bike_animation <-
  ggplot() +
  geom_sf(data = phillyTracts %>%
            st_transform(crs = 4326)) +
  geom_sf(data = bike.animation.data,
          aes(color = Trips), size = 2) +
  scale_color_manual(values = palette4_2) +
  ylim(min(bike_points$start_lat), max(bike_points$start_lat)) +
  xlim(min(bike_points$start_lon), max(bike_points$start_lon)) +
  labs(title = "Bike Trips by Start Stations for Week 19 in May, 2019",
       subtitle = "60-minute intervals: {current_frame}") +
  transition_manual(interval60)
animate(bike_animation, duration=50, renderer = gifski_renderer())
In this part, I start building my prediction model. I build 5 models for predicting bike rides in Philadelphia.
Model 1 predictors: time interval, day of the week, temperature
Model 2 predictors: start station, day of the week, temperature
Model 3 predictors: start station, time interval, day of the week, temperature, precipitation
Model 4 predictors: start station, time interval, day of the week, temperature, precipitation, time lags
Model 5 predictors: start station, time interval, day of the week, temperature, precipitation, time lags, amenity features
reg1 <-
  lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data = bike.Train)

reg2 <-
  lm(Trip_Count ~ start_station + dotw + Temperature, data = bike.Train)

reg3 <-
  lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation,
     data = bike.Train)

reg4 <-
  lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day,
     data = bike.Train)

reg5 <-
  lm(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
       lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day + schools_nn3 + busstops_nn3,
     data = bike.Train)
I make predictions with the five models above; the results are shown in the table below.
bike.Test.weekNest <-
  bike.Test %>%
  nest(-week)

model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

week_predictions <-
  bike.Test.weekNest %>%
  mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
         BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
         CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
         DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
         ETime_Space_FE_timeLags_amenity = map(.x = data, fit = reg5, .f = model_pred)) %>%
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions
## # A tibble: 10 x 8
## week data Regression Prediction Observed Absolute_Error MAE sd_AE
## <dbl> <list> <chr> <list> <list> <list> <dbl> <dbl>
## 1 19 <tibble~ ATime_FE <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.750 0.883
## 2 20 <tibble~ ATime_FE <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.847 1.05
## 3 19 <tibble~ BSpace_FE <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.754 0.882
## 4 20 <tibble~ BSpace_FE <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.881 1.03
## 5 19 <tibble~ CTime_Space_FE <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.733 0.890
## 6 20 <tibble~ CTime_Space_FE <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.847 1.04
## 7 19 <tibble~ DTime_Space_F~ <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.607 0.795
## 8 20 <tibble~ DTime_Space_F~ <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.701 0.919
## 9 19 <tibble~ ETime_Space_F~ <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.607 0.795
## 10 20 <tibble~ ETime_Space_F~ <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.701 0.919
As the MAE results in the bar plot below show, models 4 and 5 perform similarly well and much better than the first three models in predicting bike rides. This reveals that the time lags are highly effective predictors here, more so than the time, space, weather, and amenity features. Models 4 and 5 perform almost identically, indicating that the amenity features add little.
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week")
The time series of predicted and observed bike trips is shown in the plot below; the predictions from models 4 and 5 fit the observed trips quite well.
week_predictions %>%
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station)) %>%
  dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
  unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; test set of 2 weeks", x = "Hour", y= "Station Trips")
I also map the spatial distribution of the MAEs. The map shows that higher MAEs concentrate in Center City Philadelphia, while stations on the periphery tend to be predicted more accurately. This spatial variation in MAE indicates limited spatial generalizability.
week_predictions %>%
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station),
         start_lat = map(data, pull, start_lat),
         start_lon = map(data, pull, start_lon)) %>%
  dplyr::select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
  unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags") %>%
group_by(start_station, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", alpha = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
labs(title="Mean Abs Error, Test Set, Model 4")
To further test the generalizability of the models, I cross-validate all five of them. Cross-validation ensures that the goodness-of-fit results from a single hold-out are not a fluke. Here I use k-fold cross-validation with 100 folds.
# cross validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv1 <-
  train(Trip_Count ~ ., data = bike.Train %>%
          dplyr::select(Trip_Count,
                        interval60, dotw, Temperature),
        method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv1
## Linear Regression
##
## 67635 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 66959, 66958, 66959, 66959, 66959, 66959, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.244738 0.02406023 0.8239749
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv1$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.246781 0.01354562 0.8351608 Fold001
## 2 1.184233 0.02038483 0.8080574 Fold002
## 3 1.438124 0.02742306 0.8726988 Fold003
## 4 1.272523 0.02510258 0.7964387 Fold004
## 5 1.154037 0.02941682 0.7985907 Fold005
reg.cv2 <-
  train(Trip_Count ~ ., data = bike.Train %>%
          dplyr::select(Trip_Count,
                        start_station, dotw, Temperature),
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv2
reg.cv2
## Linear Regression
##
## 67635 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 66959, 66959, 66959, 66959, 66959, 66957, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.240909 0.0307076 0.8167002
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv2$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.250656 0.02997857 0.8104710 Fold001
## 2 1.307995 0.05359102 0.8166103 Fold002
## 3 1.059737 0.04928481 0.7683359 Fold003
## 4 1.221404 0.03296341 0.8269587 Fold004
## 5 1.073397 0.03115525 0.7748588 Fold005
reg.cv3 <-
  train(Trip_Count ~ ., data = bike.Train %>%
          dplyr::select(Trip_Count,
                        start_station, interval60, dotw, Temperature, Precipitation),
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv3
reg.cv3
## Linear Regression
##
## 67635 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 66959, 66959, 66959, 66959, 66958, 66959, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.240422 0.03254132 0.8146031
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv3$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.204926 0.03389480 0.8156071 Fold001
## 2 1.299039 0.05672315 0.8286128 Fold002
## 3 1.105991 0.02645551 0.7880938 Fold003
## 4 1.111435 0.05652394 0.7772887 Fold004
## 5 1.386299 0.01848572 0.8593479 Fold005
reg.cv4 <-
  train(Trip_Count ~ ., data = bike.Train %>%
          dplyr::select(Trip_Count,
                        start_station, interval60, dotw, Temperature, Precipitation,
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day),
        method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv4
## Linear Regression
##
## 67635 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 66957, 66959, 66958, 66958, 66959, 66959, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.071432 0.2781287 0.6623675
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv4$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 0.9869258 0.3878815 0.6265984 Fold001
## 2 1.1162556 0.3028182 0.6736958 Fold002
## 3 1.1075544 0.1430809 0.6710457 Fold003
## 4 1.1666896 0.3972810 0.6699220 Fold004
## 5 1.0242816 0.3191940 0.6563819 Fold005
reg.cv5 <-
  train(Trip_Count ~ ., data = bike.Train %>%
          dplyr::select(Trip_Count,
                        start_station, interval60, dotw, Temperature, Precipitation,
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day,
                        schools_nn3, busstops_nn3),
        method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv5
## Linear Regression
##
## 67635 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 66958, 66957, 66959, 66959, 66959, 66959, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.071338 0.2784893 0.6619877
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv5$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 1.280631 0.2484158 0.7143398 Fold001
## 2 1.109740 0.3707316 0.6597857 Fold002
## 3 1.086604 0.3297630 0.6472452 Fold003
## 4 1.115532 0.2744072 0.6408160 Fold004
## 5 1.100181 0.2382145 0.6885531 Fold005
The MAEs of model 4 across folds cluster quite tightly (most lie roughly between 0.61 and 0.75), which suggests that model 4 generalizes quite well.
MAE4 <- reg.cv4$resample

ggplot(MAE4, aes(x = MAE)) +
geom_histogram(color="black", fill="white") +
xlim(0, 1) +
labs(title = "Histogram of MAE of Cross Validation", subtitle = "Model: DTime_Space_FE_timeLags", x = "MAE", y= "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at the scatter plot of observed versus predicted trips, the predictions fall slightly below the observed values, so the red line of fitted predictions sits under the black line of a perfect model. In terms of time of day, there is no significant difference in model errors.
week_predictions %>%
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station),
         start_lat = map(data, pull, start_lat),
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  dplyr::select(interval60, start_station, start_lon,
                start_lat, Observed, Prediction, Regression,
                dotw) %>%
  unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted Trips",
x="Observed trips",
y="Predicted trips")
However, there are significant differences in the spatial distribution of MAE across time periods. As shown below, on weekdays the higher-MAE stations sit in Center City during the AM and PM rush hours, while MAEs do not differ much during mid-day and overnight. On weekends, higher-MAE stations appear in Center City during mid-day and the PM rush, while MAEs do not differ much during the AM rush and overnight.
week_predictions %>%
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station),
         start_lat = map(data, pull, start_lat),
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  dplyr::select(interval60, start_station, start_lon,
                start_lat, Observed, Prediction, Regression,
                dotw) %>%
  unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
geom_point(aes(x = start_lon, y = start_lat, color = MAE),
fill = "transparent", size = 0.5, alpha = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")
The plots below show the relationship between model MAE and several socio-economic factors. As tract median household income or percent white population rises, the model MAE also rises. As the share of residents taking public transit rises, the MAE tends to fall.
week_predictions %>%
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station),
         start_lat = map(data, pull, start_lat),
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White)) %>%
  dplyr::select(interval60, start_station, start_lon,
                start_lat, Observed, Prediction, Regression,
                dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start_station, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")
Generally speaking, my model (model 4: DTime_Space_FE_timeLags) is useful for predicting bike demand at the start stations, and it can support bike re-balancing based on those predictions. First, the model reaches acceptable overall accuracy, with a mean MAE of roughly 0.6 to 0.7: when predicting hourly demand at a station, the error is less than one bike, a small share of the bikes docked at a typical station, so the error is acceptable. Second, the model tracks the changing bike demand over time quite well, as shown in the ‘Predicted/Observed bike share time series’ plot. Moreover, the cross-validation results show good generalizability, which means the model should continue to give reliable predictions for bike re-balancing.
However, the model still has shortcomings. It could be refined to predict bike demand more accurately, and the spatial variation in MAE means that it may be less accurate in certain areas, Center City in particular.
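As an illustration of how the hourly predictions could feed a re-balancing plan, the sketch below flags stations whose predicted demand for the next hour exceeds the bikes currently docked. The next_hour_forecast and station_status data frames (and their values) are hypothetical, not outputs of this analysis:

# hypothetical inputs: next-hour predicted trips per station and current dock counts
next_hour_forecast <- data.frame(start_station = c(3004, 3010, 3021),
                                 predicted_trips = c(5.2, 0.8, 3.6))
station_status     <- data.frame(start_station = c(3004, 3010, 3021),
                                 docked_bikes = c(2, 6, 4))

rebalance_list <-
  next_hour_forecast %>%
  left_join(station_status, by = "start_station") %>%
  mutate(shortfall = pmax(ceiling(predicted_trips) - docked_bikes, 0)) %>%
  filter(shortfall > 0) %>%
  arrange(desc(shortfall))   # stations to prioritize for truck delivery or rider rewards

rebalance_list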