0. Introduction

Nowadays, riding a bike is one of the major ways people get around, and more and more riders are turning to bike share because of its convenience. Sometimes, however, there is no bike available to pick up at a station, or no open dock to return one. These unpleasant experiences hurt the bike share company as well as its riders. To better allocate shared bikes across the network, operators need a 're-balancing' plan. Re-balancing is the practice of anticipating (or predicting) bike share demand for all stations at all times and redistributing bikes so that a bike, or an open dock, is available when needed.

As for the re-balancing plan, I recommend that the company combine two approaches: using trucks to move bikes between stations, and offering rewards that encourage riders themselves to re-allocate bikes.

In this assignment, I analyze bike share trips in the city of Philadelphia. I aim to build a model with moderate accuracy and generalizability, predict bike demand spatially and temporally from the given data, and offer brief re-balancing suggestions to Indego. The goal is to predict demand two to three weeks into the future, a horizon that should meet a bike share operator's planning needs.

1. Data Wrangling

1.1 Indego Data

In this project, I use Indego bike share data to analyze and predict bike share demand in Philadelphia. The data cover April 16, 2019 to May 20, 2019, five weeks in total, and come from Indego's official website https://www.rideindego.com/about/data/.
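
The code in this report assumes a setup chunk, not shown here, that loads the required packages and defines the color palettes used in the plots. A minimal sketch follows; the palette colors are placeholders, as the original values are not shown.

# assumed setup chunk
library(tidyverse)   # dplyr, ggplot2, tidyr, purrr, forcats
library(lubridate)   # dates and time intervals
library(sf)          # spatial data
library(tidycensus)  # ACS census data (get_acs)
library(riem)        # ASOS weather data (riem_measures)
library(gridExtra)   # grid.arrange
library(knitr)       # kable
library(kableExtra)  # kable_styling
library(viridis)     # scale_colour_viridis
library(caret)       # train / cross-validation
library(gganimate)   # animated map (rendering also requires the gifski package)

# placeholder palettes (original colors not shown)
palette2   <- c("#6baed6", "#08519c")
palette4_2 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
palette5   <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")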

The Indego trip data are processed to add 15-minute (interval15) and 60-minute (interval60) time intervals, week of the year, and day of the week (dotw) features.

# philly indego data
indego <-
  read.csv("indego_416_520.csv")
# glimpse(indego)
indego <- 
  indego %>%
  mutate(interval60 = floor_date(ymd_hm(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hm(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

# glimpse(indego)

1.2 Amenity Data

Bus stop and school locations are imported as amenity features to enrich the models that follow. Both data sets come from OpenDataPhilly: https://www.opendataphilly.org/dataset.

Average distance to the three nearest schools and bus stops is computed for each trip origin and used as predictors in the regression models; a sketch of the nn_function() helper follows the code below.

bus_stops <-
  read.csv("bus_stops.csv")

bus_stops <-
  bus_stops %>%
  st_as_sf(., coords = c("stop_lon", "stop_lat"), crs = 4326)

schools <-
  st_read("Schools.geojson") %>%
  st_set_crs(4326)

# trip origins as sf points (rows with missing coordinates dropped)
start_stat <-
  indego %>%
  filter(!is.na(start_lon), !is.na(start_lat),
         !is.na(end_lon), !is.na(end_lat)) %>%
  st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)

st_c <- st_coordinates   # shorthand for extracting coordinate matrices

indego <-
  indego %>%
  filter(!is.na(start_lon), !is.na(start_lat),
         !is.na(end_lon), !is.na(end_lat)) %>%
  mutate(
      # mean distance from each trip origin to its 3 nearest schools / bus stops
      schools_nn3 = nn_function(na.omit(st_c(start_stat)), na.omit(st_c(schools)), 3),
      busstops_nn3 = nn_function(na.omit(st_c(start_stat)), na.omit(st_c(bus_stops)), 3)
      )
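
nn_function() is not defined in this report; it is assumed to come from a course helper script. A minimal equivalent, which returns the mean distance from each point to its k nearest neighbors using FNN::get.knnx(), might look like the sketch below. Note that with crs = 4326 these distances are in decimal degrees; projecting to a planar CRS first would yield distances in meters.

# assumed helper: mean distance from each point in measureFrom
# to its k nearest neighbors in measureTo (both coordinate matrices)
nn_function <- function(measureFrom, measureTo, k) {
  nn <- FNN::get.knnx(measureTo, measureFrom, k)$nn.dist  # n-by-k distance matrix
  rowMeans(nn)                                            # average over the k neighbors
}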

1.3 Census Data

Census data are also brought into the process. Census variables capture the socioeconomic characteristics of a region, which are related to bike share demand. Census tracts also serve as useful geographic units for mapping, although they are not the unit of analysis in this assignment.

phillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2019, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age, GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
phillyTracts <- 
  phillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf
# tag each trip with its origin tract, then with its destination tract
bike_census <- st_join(indego %>% 
          filter(!is.na(start_lon), !is.na(start_lat),
                 !is.na(end_lon), !is.na(end_lat)) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        phillyTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phillyTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)

1.4 Weather Data

Weather is essential in predicting whether people choose to ride: when it rains heavily, for example, few people want to bike. Hourly weather observations for Philadelphia International Airport (station PHL) are imported with the riem_measures() function and summarized into temperature, precipitation, and wind speed features, which are plotted below.

# philly weather data
weather.phl <- 
  riem_measures(station = "PHL", date_start = "2019-04-16", date_end = "2019-05-20") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) # a 0 marks a missing reading; impute a typical value

glimpse(weather.phl)
## Rows: 816
## Columns: 4
## $ interval60    <dttm> 2019-04-16 00:00:00, 2019-04-16 01:00:00, 2019-04-16 02~
## $ Temperature   <dbl> 51.1, 48.9, 48.0, 46.9, 46.0, 44.1, 44.1, 46.0, 45.0, 43~
## $ Precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Wind_Speed    <dbl> 26, 23, 22, 16, 11, 9, 10, 11, 11, 11, 13, 14, 13, 16, 1~
grid.arrange(
  ggplot(weather.phl, aes(interval60, Precipitation)) + geom_line() + 
  labs(title="Precipitation", x="Hour", y="Precipitation") ,
  ggplot(weather.phl, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") ,
  ggplot(weather.phl, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") ,
  top="Weather Data - PHL - April-May, 2019")

2. Create Space-Time Panel

2.1 Basic Panel

In this part, a space-time panel is created for the regression models that follow. The panel ensures that every unique combination of bike station and hour exists in the data set; each row records how many trips (possibly zero) started at that station during that hour. All other features needed for model training are joined onto the panel.

# length(unique(bike_census$interval60)) * length(unique(bike_census$start_station))

bike.panel <- 
  expand.grid(interval60=unique(bike_census$interval60), 
              start_station = unique(bike_census$start_station)) %>%
  left_join(., bike_census %>%
              dplyr::select(start_station, Origin.Tract, start_lon, start_lat, schools_nn3, busstops_nn3)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))
bike.panel2 <- 
  bike_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(bike.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat, schools_nn3, busstops_nn3) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.phl, by = "interval60") %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

# glimpse(bike.panel2)
bike.panel2 <- 
  left_join(bike.panel2, phillyCensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

2.2 Panel with Time Lags

Time lags add information about demand in preceding periods. Lags of 1 hour, 2 hours, 3 hours, 4 hours, 12 hours, and 1 day are added to the panel.

The correlations of these lags with trip counts are shown in the table below. The 1-hour and 1-day lags have the two largest positive correlations, while the 12-hour lag has the largest negative correlation. In other words, demand is most similar in adjacent hours and at the same hour on the next day, while demand 12 hours apart tends to be opposite (e.g., morning versus evening), because people's travel needs change over the course of a day.

bike.panel.lag <- 
  bike.panel2 %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24)) %>%
   mutate(day = yday(interval60))
as.data.frame(bike.panel.lag) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
    kable(caption = "Time Lag Correlation") %>%
    kable_styling("striped", full_width = F)
Time Lag Correlation

Variable      correlation
lagHour              0.84
lag2Hours            0.62
lag3Hours            0.42
lag4Hours            0.26
lag12Hours          -0.41
lag1day              0.75

3. Exploratory Analysis

3.1 Static Charts

In this part, I visualize the bike trip data. First, to preserve the temporal structure of the data, I split the lagged space-time panel into 3 weeks of training data and 2 weeks of testing data. A plot of bike share trips per hour follows; it shows a broadly similar pattern each week and each day, which supports the decision to include time-lag features.

# 3 weeks of training data, 2 weeks of testing data
bike.Train <- filter(bike.panel.lag, week <= 18)
bike.Test <- filter(bike.panel.lag, week > 18)
rbind(
  mutate(bike.Train, Legend = "Training"), 
  mutate(bike.Test, Legend = "Testing")) %>%
  group_by(Legend, interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ungroup() %>% 
  ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
  scale_colour_manual(values = palette2) +
  labs(title="Bike share trips per hour, Philly, April&May, 2019",
       x="Date", y="Number of trips") 
## `summarise()` has grouped output by 'Legend'. You can override using the `.groups` argument.

Histograms of the mean number of hourly trips per station are shown for four time periods. The distribution differs across the AM rush, PM rush, mid-day, and overnight periods; people are most likely to ride during the PM rush.

bike_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. Philly, April&May, 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)

Below is the histogram of bike share trips per hour by station. Most stations see fewer than 10 rides per hour.

ggplot(bike_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5)+
  labs(title="Bike share trips per hour by station. Philly, April&May, 2019",
       x="Trip Counts", 
       y="Number of Stations")

Below, the distribution of bike share trips over the hours of the day is plotted, split by day of the week and by weekday versus weekend. Weekdays show a bimodal curve with peaks in the AM and PM rush hours, while weekends show a single peak in the afternoon, when many people go out to relax.

ggplot(bike_census %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philly, by day of the week, April&May, 2019",
       x="Hour", 
       y="Trip Counts")

ggplot(bike_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philly - weekend vs weekday, April&May, 2019",
       x="Hour", 
       y="Trip Counts")

I also map trip origins, faceted by weekday/weekend and by time of day. In general, more trips are generated on weekdays than on weekends, and more during the PM rush. Bike trips are concentrated in Center City, Philadelphia.

bike_points <-
  bike_census %>% 
  mutate(hour = hour(start_time),
         weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(start_station, start_lat, start_lon, weekend, time_of_day)

ggplot()+
  geom_sf(data = phillyTracts %>%
            st_transform(crs=4326))+
  geom_point(data = bike_points %>% 
             tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.4, size = 1.0)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(bike_points$start_lat), max(bike_points$start_lat))+
  xlim(min(bike_points$start_lon), max(bike_points$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hour by station. Philly, April&May, 2019")

3.2 Animated Map

To better illustrate how the number of trips at each station changes over time, an animated map is generated. Trip counts use 60-minute intervals, and the full week of week 19 is mapped.

# animation
stations<-
  indego %>% 
  dplyr::select(start_station, start_lon, start_lat) %>%
  group_by(start_station) %>%
  distinct()%>%
  filter(is.na(start_lon) == FALSE &
           is.na(start_lat) == FALSE) %>%
  st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)

week19 <-
  filter(bike_census , week == 19)

week19.panel <-
  expand.grid(
    interval60 = unique(week19$interval60),
    Origin.Tract = unique(bike.panel2$Origin.Tract))

bike.animation.data <-
  mutate(week19, Trip_Counter = 1) %>%
  right_join(week19) %>% 
  group_by(interval60, start_station) %>%
  summarise(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  left_join(stations, by=c("start_station" = "start_station")) %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count == 1 ~ "1 trips",
                           Trip_Count == 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
                           Trip_Count > 4 ~ "5+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trips","2 trips",
                              "3-4 trips","5+ trips"))

bike_animation <-
  ggplot() +
  geom_sf(data = phillyTracts %>%
            st_transform(crs=4326))+
  geom_sf(data = bike.animation.data, 
             aes(color = Trips), size = 2) +
  scale_color_manual(values = palette4_2) +
  ylim(min(bike_points$start_lat), max(bike_points$start_lat))+
  xlim(min(bike_points$start_lon), max(bike_points$start_lon))+
  labs(title = "Bike Trips by Start Stations for Week 19 in May, 2019",
       subtitle = "60-minute intervals: {current_frame}") +
  transition_manual(interval60)

animate(bike_animation, duration=50, renderer = gifski_renderer())

4. Regression Models

4.1 Models

In this part, I build five linear regression models to predict bike trips in Philadelphia, with the following predictor sets:

  • Model 1 predictors: time interval, day of the week, temperature

  • Model 2 predictors: start station, day of the week, temperature

  • Model 3 predictors: start station, time interval, day of the week, temperature, precipitation

  • Model 4 predictors: start station, time interval, day of the week, temperature, precipitation, time lags

  • Model 5 predictors: start station, time interval, day of the week, temperature, precipitation, time lags, amenity features

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=bike.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=bike.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=bike.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=bike.Train)

reg5 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day + schools_nn3 + busstops_nn3,
     data=bike.Train)

4.2 Predictions

I make predictions with the five models above; the results are shown in the table below, where MAE is the mean of the absolute differences between observed and predicted trip counts.

bike.Test.weekNest <- 
  bike.Test %>%
  nest(data = -week) 
model_pred <- function(dat, fit){
   predict(fit, newdata = dat)
}
week_predictions <- 
  bike.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_amenity = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
## # A tibble: 10 x 8
##     week data     Regression     Prediction Observed  Absolute_Error   MAE sd_AE
##    <dbl> <list>   <chr>          <list>     <list>    <list>         <dbl> <dbl>
##  1    19 <tibble~ ATime_FE       <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.750 0.883
##  2    20 <tibble~ ATime_FE       <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.847 1.05 
##  3    19 <tibble~ BSpace_FE      <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.754 0.882
##  4    20 <tibble~ BSpace_FE      <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.881 1.03 
##  5    19 <tibble~ CTime_Space_FE <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.733 0.890
##  6    20 <tibble~ CTime_Space_FE <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.847 1.04 
##  7    19 <tibble~ DTime_Space_F~ <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.607 0.795
##  8    20 <tibble~ DTime_Space_F~ <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.701 0.919
##  9    19 <tibble~ ETime_Space_F~ <dbl [22,~ <dbl [22~ <dbl [22,410]> 0.607 0.795
## 10    20 <tibble~ ETime_Space_F~ <dbl [19,~ <dbl [19~ <dbl [19,305]> 0.701 0.919

5. Model Evaluation

5.1 General Model Accuracy

As the MAE results in the bar plot below show, models 4 and 5 perform best, and similarly so, with much lower errors than the first three models. This shows that time lags are highly effective predictors of bike trips here, more so than the time, space, weather, and amenity features. Models 4 and 5 perform almost identically, indicating that the amenity features add little.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") 

The time series of predicted and observed bike trips is plotted below; the predictions of models 4 and 5 track the observed trips quite well.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Philly; test set of 2 weeks",  x = "Hour", y= "Station Trips") 

I also map the spatial distribution of MAE. Higher MAEs concentrate in Center City, while outlying stations tend to be predicted more accurately. This spatial variation in error points to limited spatial generalizability.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    dplyr::select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 4")

5.2 Cross Validation

To further test the generalizability of the models, I cross-validate all five of them. Cross-validation ensures that the goodness-of-fit results from a single hold-out are not a fluke. Here I use k-fold cross-validation with 100 folds.

# cross validation
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)

reg.cv1 <- 
  train(Trip_Count ~ ., data = bike.Train %>% 
          dplyr::select(Trip_Count, 
                        interval60 , dotw ,  Temperature), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv1
## Linear Regression 
## 
## 67635 samples
##     3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 66959, 66958, 66959, 66959, 66959, 66959, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.244738  0.02406023  0.8239749
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv1$resample[1:5,]
##       RMSE   Rsquared       MAE Resample
## 1 1.246781 0.01354562 0.8351608  Fold001
## 2 1.184233 0.02038483 0.8080574  Fold002
## 3 1.438124 0.02742306 0.8726988  Fold003
## 4 1.272523 0.02510258 0.7964387  Fold004
## 5 1.154037 0.02941682 0.7985907  Fold005
reg.cv2 <- 
  train(Trip_Count ~ ., data = bike.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station , dotw ,  Temperature), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv2
## Linear Regression 
## 
## 67635 samples
##     3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 66959, 66959, 66959, 66959, 66959, 66957, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.240909  0.0307076  0.8167002
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv2$resample[1:5,]
##       RMSE   Rsquared       MAE Resample
## 1 1.250656 0.02997857 0.8104710  Fold001
## 2 1.307995 0.05359102 0.8166103  Fold002
## 3 1.059737 0.04928481 0.7683359  Fold003
## 4 1.221404 0.03296341 0.8269587  Fold004
## 5 1.073397 0.03115525 0.7748588  Fold005
reg.cv3 <- 
  train(Trip_Count ~ ., data = bike.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw ,  Temperature, Precipitation), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv3
## Linear Regression 
## 
## 67635 samples
##     5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 66959, 66959, 66959, 66959, 66958, 66959, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.240422  0.03254132  0.8146031
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv3$resample[1:5,]
##       RMSE   Rsquared       MAE Resample
## 1 1.204926 0.03389480 0.8156071  Fold001
## 2 1.299039 0.05672315 0.8286128  Fold002
## 3 1.105991 0.02645551 0.7880938  Fold003
## 4 1.111435 0.05652394 0.7772887  Fold004
## 5 1.386299 0.01848572 0.8593479  Fold005
reg.cv4 <- 
  train(Trip_Count ~ ., data = bike.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw, Temperature, Precipitation, 
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv4
## Linear Regression 
## 
## 67635 samples
##    10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 66957, 66959, 66958, 66958, 66959, 66959, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.071432  0.2781287  0.6623675
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv4$resample[1:5,]
##        RMSE  Rsquared       MAE Resample
## 1 0.9869258 0.3878815 0.6265984  Fold001
## 2 1.1162556 0.3028182 0.6736958  Fold002
## 3 1.1075544 0.1430809 0.6710457  Fold003
## 4 1.1666896 0.3972810 0.6699220  Fold004
## 5 1.0242816 0.3191940 0.6563819  Fold005
reg.cv5 <- 
  train(Trip_Count ~ ., data = bike.Train %>% 
          dplyr::select(Trip_Count, 
                        start_station ,interval60, dotw, Temperature, Precipitation, 
                        lagHour, lag2Hours, lag3Hours, lag12Hours, lag1day, schools_nn3, busstops_nn3), 
        method = "lm", trControl = fitControl, na.action = na.pass)

reg.cv5
## Linear Regression 
## 
## 67635 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 66958, 66957, 66959, 66959, 66959, 66959, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.071338  0.2784893  0.6619877
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv5$resample[1:5,]
##       RMSE  Rsquared       MAE Resample
## 1 1.280631 0.2484158 0.7143398  Fold001
## 2 1.109740 0.3707316 0.6597857  Fold002
## 3 1.086604 0.3297630 0.6472452  Fold003
## 4 1.115532 0.2744072 0.6408160  Fold004
## 5 1.100181 0.2382145 0.6885531  Fold005

The MAEs of model 4 across folds cluster quite tightly (most lie roughly between 0.61 and 0.75), which suggests that model 4 generalizes quite well.

MAE4 <- reg.cv4$resample
ggplot(MAE4, aes(x=MAE)) + 
  geom_histogram(color="black", fill="white") +
  xlim(0, 1) +
  labs(title = "Histogram of MAE of Cross Validation", subtitle = "Model: DTime_Space_FE_timeLags",  x = "MAE", y= "Count") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

5.3 Space-Time Error Evaluation

The scatter plot of observed versus predicted bike trips shows that predictions tend to fall slightly below observed values, so the red regression line lies below the black line that a perfect model would follow. Across time periods, model errors do not differ significantly.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    dplyr::select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted Trips",
       x="Observed trips", 
       y="Predicted trips")

However, the spatial distribution of MAE differs markedly across time periods. On weekdays, higher-MAE stations appear in Center City during the AM and PM rush hours, while MAEs vary little during the mid-day and overnight periods. On weekends, higher-MAE stations appear in Center City during mid-day and the PM rush, while MAEs vary little during the AM rush and overnight periods.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    dplyr::select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y =  start_lat, color = MAE), 
             fill = "transparent", size = 0.5, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
  xlim(min(bike_census$start_lon), max(bike_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")

The plots below relate model MAE to socioeconomic factors. As a tract's median household income or share of white population rises, model MAE rises as well; as the share of residents commuting by public transit rises, MAE tends to fall.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    dplyr::select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")

6. Conclusions

Generally speaking, my preferred model (model 4: DTime_Space_FE_timeLags) is useful for predicting bike demand at the start stations, and its predictions can inform re-balancing. First, the model reaches acceptable overall accuracy, with a mean MAE of roughly 0.6-0.7 trips: when predicting hourly demand at a station, the typical error is less than one bike, which is a small share of a station's stock. Second, the model tracks changing demand over time quite well, as the 'Predicted/Observed bike share time series' plot shows. Moreover, the cross-validation results indicate good generalizability, so the model should continue to give reliable predictions for bike re-balancing.
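
As a sketch of how these predictions might feed a re-balancing plan, one could flag the station-hours where predicted demand exceeds the bikes expected to be on hand and dispatch trucks (or target rider rewards) there first. The example below is hypothetical: bikes_available would come from Indego's real-time station-status feed, which this report does not use, and station_forecast stands in for the model's predictions.

# hypothetical sketch: prioritize stations for re-balancing
# (toy inputs; real ones would be model predictions plus a station-status feed)
station_forecast <- tibble::tibble(
  start_station   = c(3004, 3007, 3010),
  interval60      = as.POSIXct("2019-05-21 08:00:00", tz = "UTC"),
  Predicted_Trips = c(6.2, 1.4, 3.8),
  bikes_available = c(3, 5, 2)
)

rebalance_plan <-
  station_forecast %>%
  mutate(shortfall = Predicted_Trips - bikes_available) %>%  # predicted bikes short
  filter(shortfall > 0) %>%                                  # under-supplied station-hours
  arrange(desc(shortfall))                                   # largest shortfalls first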

However, the model still has shortcomings. Its accuracy in predicting bike demand can be further refined, and the remaining spatial variation in MAE means it may be less accurate in specific areas, most notably Center City.