In this project, I received a dataset of 4.5 million Uber ride orders in New York City. Uber connects drivers with people who need a ride.
The dataset covered April through July 2014 and included the location and time of each order.
Based on this dataset, I was asked to predict the number of orders in each 15-minute interval of a future month, September 2014.
First, I explored the data to find patterns and connections; then, based on those insights, I cross-checked it against additional data.
Finally, I built a model that predicts future ride demand.
In the following chapters, I describe the research steps, the insights, and the model I chose to implement.
This document contains images that illustrate the milestones of my research; most of them link to higher-resolution versions.
First, I cleaned the data according to the project specifications: I kept only rides within a 1-kilometer radius of the New York Stock Exchange, between 17:00 and midnight, and made sure there were no missing data cells.
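A minimal sketch of this cleaning step, written in Python for illustration only. The NYSE coordinates are approximate, the `(timestamp, latitude, longitude)` tuple layout and the helper names are hypothetical, and dropping rows with missing values is an assumption about how missing cells were handled.

```python
import math
from datetime import datetime

# Approximate coordinates of the New York Stock Exchange (assumption, not taken from the dataset).
NYSE_LAT, NYSE_LON = 40.7069, -74.0113
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def keep_order(order, radius_km=1.0, start_hour=17):
    """True if the order has no missing field, is inside the radius, and falls in 17:00-23:59."""
    ts, lat, lon = order
    if ts is None or lat is None or lon is None:
        return False  # assumption: rows with missing cells are dropped
    if ts.hour < start_hour:
        return False
    return haversine_km(lat, lon, NYSE_LAT, NYSE_LON) <= radius_km


# Hypothetical sample orders: (timestamp, latitude, longitude).
orders = [
    (datetime(2014, 4, 1, 18, 5), 40.7075, -74.0100),  # near NYSE, evening
    (datetime(2014, 4, 1, 9, 30), 40.7075, -74.0100),  # near NYSE, morning
    (datetime(2014, 4, 1, 19, 0), 40.7580, -73.9855),  # Times Square, evening
    (datetime(2014, 4, 1, 20, 0), None, -74.0100),     # missing latitude
]
cleaned = [o for o in orders if keep_order(o)]
print(len(cleaned), "of", len(orders), "orders kept")
```

Only the first sample order passes all three checks; the others fail on time of day, distance, or a missing value.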
Then, the orders were divided into 15-minute time slots.
I created the following graph, which shows how many 15-minute time slots contained a given number of orders.
(For example, the x-axis value 20 stands for time slots with 20 orders, and the y-axis shows how many such time slots there were.)
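The slotting and the distribution behind the graph can be sketched as follows, in Python for illustration. The sample timestamps and the `slot_start` helper are hypothetical, and time slots with zero orders are not counted in this simplified version.

```python
from collections import Counter
from datetime import datetime


def slot_start(ts, minutes=15):
    """Floor a timestamp to the start of its time slot."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


# Hypothetical order timestamps.
timestamps = [
    datetime(2014, 4, 1, 17, 2), datetime(2014, 4, 1, 17, 14),
    datetime(2014, 4, 1, 17, 16), datetime(2014, 4, 1, 17, 29),
    datetime(2014, 4, 1, 17, 47),
]

# Number of orders in each 15-minute time slot.
orders_per_slot = Counter(slot_start(ts) for ts in timestamps)

# x-axis: number of orders in a slot; y-axis: how many slots had that many orders.
slots_per_order_count = Counter(orders_per_slot.values())

for n_orders in sorted(slots_per_order_count):
    print(n_orders, "orders:", slots_per_order_count[n_orders], "slot(s)")
```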
As you can see, the distribution peaks close to its mean and drops sharply afterward.
The tail of the distribution consists mostly of anomalous time slots, in which the demand for rides soared.
This pattern remains consistent across the other months as well.
This phenomenon of time slots with an extraordinary number of orders triggered my curiosity, so I organized them in the following table, which lists major events and weather conditions.
The table is sorted by descending number of rides and is meant to reveal whether major events and rainy weather correlate with the number of pickups.
As you can see, in 52% of the time slots with more than 40 rides, and in 48% of the time slots with 30-40 rides, there was indeed cold weather or a major event, which may explain the soaring demand for rides.
Although this can explain some of the anomalous time slots, these causes (future weather or major events) cannot be predicted, so special care was taken to handle them when devising the model.
The research continued with a heatmap that shows the number of orders in every full hour of each day of the week.
As you can see, demand rises between 17:00 and 19:00 on workdays (probably due to commuting from work) and drops during the late-night hours of workdays.
This pattern remains consistent across the other months as well.
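The table behind such an hour-by-weekday heatmap can be sketched as below, in Python for illustration. The timestamps are hypothetical, and the text output stands in for the colored plot.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical order timestamps.
timestamps = [
    datetime(2014, 4, 1, 17, 5),   # Tuesday
    datetime(2014, 4, 1, 17, 40),  # Tuesday
    datetime(2014, 4, 8, 17, 10),  # Tuesday, one week later
    datetime(2014, 4, 5, 23, 30),  # Saturday
]

# Key: (weekday index with Monday = 0, hour of day); value: number of orders.
heat = Counter((ts.weekday(), ts.hour) for ts in timestamps)

# Print one row per hour between 17:00 and midnight, one column per weekday.
print("      " + " ".join(f"{name:>3}" for name in DAY_NAMES))
for hour in range(17, 24):
    row = " ".join(f"{heat[(day, hour)]:3d}" for day in range(7))
    print(f"{hour:02d}:00 {row}")
```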
In addition, the trend in the total number of orders per month shows Uber's growth from April to July, with an incline during May.
In addition, I researched high-demand areas and their patterns.
The following heatmap marks the areas with the highest order counts in warmer colors, for different days of the week.
As you can see, demand changes between days of the week, most significantly between workdays and the weekend.
However, the "warmer" areas remain stationary across months and show lower entropy than the daily heatmaps.
These warm areas correlate with attractions and points of interest in Manhattan.
Another interesting observation is that the warm areas are not close to train stations, where trains pass frequently during those hours.
It is reasonable to believe that trains are a substitute good for Uber rides in some cases.
Lastly, a correlation matrix has been created.
The developed model is a clustering model.
Each order was assigned to a cluster, and the cluster centers fell at the centers of the warmest areas.
This division is meant to learn the patterns of each area independently, since different areas get warm on different days and hours, yet the cluster centers (centroids) are almost stationary.
To choose the right number of clusters, I created a Total Within-Cluster Sum of Squares graph.
This graph is used to determine a reasonable number of clusters (denoted by K) using the "elbow method" heuristic: pick the cutoff point beyond which diminishing returns are no longer worth the additional cost (that is, stop raising K when an extra cluster explains only a negligible additional share of the variance).
The chosen number of clusters (configurable in code) is K=8, whose centroids matched the warmer areas mentioned above.
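To illustrate how such a curve is produced, here is a small Python sketch using plain Lloyd's k-means. It is not the project's code: the points are hypothetical 2-D coordinates, and the `kmeans` helper and its random initialization are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, total within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[i])  # keep the old centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    total = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, total


# Hypothetical pickup locations forming three tight groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

# Total within-cluster sum of squares for each candidate K.
curve = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, wcss in curve.items():
    print(k, round(wcss, 2))
```

The "elbow" is the value of K after which the printed totals stop dropping substantially. Because Lloyd's algorithm depends on its starting centroids, several random starts per K are advisable in practice.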
To build the model, after dividing the orders into clusters, a nested table (a table of tables) was built for the clusters using the dplyr library developed by Hadley Wickham.
Each row in the main table holds the data of one cluster, and a linear regression was fitted to it.
In this way, the model learns the patterns of each cluster independently, and the regression yields different coefficients for each cluster based on its unique characteristics.
The desired prediction for a future 15-minute time slot is the sum of the predictions of all clusters.
The linear regression model is:
pick_num ~ minute + hour + day_in_week + hour*day_in_week
The interaction between the hour and the day of the week was added to capture the effects of their different combinations.
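In R formula syntax, `hour*day_in_week` already expands to `hour + day_in_week + hour:day_in_week`, so the formula above has an intercept, three main effects, and one interaction term. The Python sketch below shows how one time slot maps to such a row and how the per-cluster predictions add up to the total. It assumes all predictors are treated as numeric (if they were factors in R, each would expand into several dummy columns), and the coefficient values are made up for illustration, not results from the project.

```python
def design_row(minute, hour, day_in_week):
    """Intercept, main effects, and the hour x day_in_week interaction (numeric coding assumed)."""
    return [1.0, minute, hour, day_in_week, hour * day_in_week]


def predict(coefficients, row):
    """Linear prediction: dot product of coefficients and the design row."""
    return sum(c * x for c, x in zip(coefficients, row))


# Hypothetical fitted coefficients per cluster (made-up numbers).
cluster_coefficients = {
    "cluster_1": [2.0, 0.0, 0.1, 0.05, 0.01],
    "cluster_2": [1.0, 0.0, 0.2, 0.0, 0.0],
}

# One future time slot: 18:15 on day 5 of the week.
row = design_row(minute=15, hour=18, day_in_week=5)

# Each cluster predicts its own pickups; the total is the sum over clusters.
per_cluster = {name: predict(coefs, row) for name, coefs in cluster_coefficients.items()}
total = sum(per_cluster.values())
print(per_cluster, total)
```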
In addition, as described above, every month contains anomalous time slots with an extraordinary number of orders.
To tackle this issue and cancel the adverse effects of weather or unpredictable events, a threshold on the number of orders was introduced, threshold=9 for each cluster, which is configurable in code.
Applying the threshold per cluster suppresses anomalies in one cluster without side effects on the others, and is more flexible than setting a global threshold.
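One possible reading of the per-cluster threshold is sketched below in Python. It assumes that counts above the threshold are capped at the threshold value; dropping such time slots entirely would be an equally plausible interpretation. The `cap_counts` helper and the sample counts are hypothetical.

```python
def cap_counts(counts_by_cluster, threshold=9):
    """Cap the order count of every time slot at the threshold, separately for each cluster."""
    return {
        cluster: [min(count, threshold) for count in counts]
        for cluster, counts in counts_by_cluster.items()
    }


# Hypothetical orders per time slot in two clusters; cluster_1 has one anomalous slot.
counts_by_cluster = {
    "cluster_1": [3, 4, 25, 5],
    "cluster_2": [1, 2, 2, 1],
}

capped = cap_counts(counts_by_cluster)
print(capped)
```

Only the anomalous slot in `cluster_1` changes, while `cluster_2` is left untouched, which is the per-cluster behavior described above.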
To sum up, I started by exploring the data and recognizing patterns across days and hours, and continued by researching patterns across months and warm areas.
I found that the warm areas remain almost stationary between months.
However, the warm areas shift between days of the week, especially between workdays and weekends.
Then, I researched the anomalous time slots, in which the number of orders soared.
A correlation between cold weather, major events, and those anomalies was observed.
Armed with those insights, I developed a cluster model that matched the warm areas and learned the patterns of each cluster independently.
In addition to the developed model, simple linear models (although it is clear they cannot capture the whole picture) and random forest models were tested, as well as combinations with the cluster model, yet none of them exceeded the R^2 achieved by the model described above.
Finally, I presented a model that divides the city into areas of interest, learns unique coefficients for each of them, and minimizes the adverse effect of anomalies.