The objective of the project is to compute a KPI that demonstrates how attractive a city is for travelers.
Price is a relevant signal for this. As a Blablacar driver myself, I know that the prices drivers enter are based on what they think passengers are ready to pay. The price needs to be attractive, because otherwise the car does not fill up. For instance, one driver might have 3 passengers at 10€ each (30€ in total) while another has 2 passengers at 12.5€ each (25€ in total). The first case is better for both the driver and the passengers. The only risk with the lower price is ending up with only 2 passengers (20€ in total), in which case the second option would have been better. But in practice this almost never happens. Why? Because the cheapest trips are usually full.
I think we can now agree that the price depends on what passengers and drivers agree to be "the worth of the city". We can expect a price distribution where most prices are really cheap, with a tail of higher prices that are usually mistakes (prices suggested by the app, people trying the app for the first time, people who didn't check what competitors charge for the trip and therefore set their price far too high). These mistakes usually result in nobody booking the trip. We will study the following metrics to build our KPI:
- how many trips? (by date)
- what is the average/median price? (by date)
- what is the price distribution? (all time)
I think that combining these metrics can be a good starting point for building this KPI. Let's see how the EDA goes.
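As a rough illustration, the three metrics listed above could be computed along these lines. This is only a sketch: the field names (`date`, `price`) and the sample values are assumptions made for the example, not the actual schema or data of the project.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical sample: one record per trip (field names and values are made up).
trips = [
    {"date": "2020-09-01", "price": 10.0},
    {"date": "2020-09-01", "price": 12.5},
    {"date": "2020-09-01", "price": 30.0},  # a suspiciously high price
    {"date": "2020-09-02", "price": 9.0},
    {"date": "2020-09-02", "price": 11.0},
]


def daily_metrics(trips):
    """Return {date: {"trips": n, "mean": ..., "median": ...}}."""
    prices_by_date = defaultdict(list)
    for trip in trips:
        prices_by_date[trip["date"]].append(trip["price"])
    return {
        date: {
            "trips": len(prices),
            "mean": mean(prices),
            "median": median(prices),
        }
        for date, prices in sorted(prices_by_date.items())
    }


# Trip count and average/median price, by date.
metrics = daily_metrics(trips)

# All-time price distribution: here simply the sorted list of prices.
all_prices = sorted(trip["price"] for trip in trips)

for date, values in metrics.items():
    print(date, values)
```

On the first sample day the single high price pulls the mean up to 17.5 while the median stays at 12.5, which is one reason to look at both when overpriced "mistake" trips are expected.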
We will build a database of routes and keep track of all trips and their prices for a month.
At first, I wanted to query the Blablacar API to check which routes are the busiest. But that wasn't possible, because the API requires the origin and the destination to be cities. After some research, it was still impossible to get this data. So I decided to create my own map, based on the top 25 cities in France.
I cannot study every city because I am limited to 1000 queries per day to Blablacar, and I also have a trickier limit to respect on Google Cloud Functions. Therefore I will build routes from these 25 cities, selected through an EDA.
We will first run a study to select which routes between those 25 cities we keep.
We chose the top 25 cities, which means 25 × 24 = 600 possible routes, counting each direction separately. We will select the major routes among them.
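The count of 600 comes from taking every ordered pair of distinct cities. A minimal sketch, using placeholder city names since the actual list of 25 cities is not given here:

```python
from itertools import permutations

# Placeholder names: the real list of 25 cities is an assumption left open.
cities = [f"city_{i:02d}" for i in range(1, 26)]

# Ordered pairs, because A -> B and B -> A are two distinct routes.
routes = list(permutations(cities, 2))

print(len(routes))  # 25 * 24 = 600
```

With one query per route per day, 600 routes would fit under the limit of 1000 daily queries, but selecting only the major routes leaves more headroom.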
We can only get info for 100 trips per query. So far, however, we haven't seen any route with more than 100 trips in a day, so we will not deal with this problem for now. We will still keep a record of the number of trips we get per route.
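One simple way to keep that record is to store the trip count next to each query and flag the cases that reach the cap. The function and field names below are hypothetical and only illustrate the idea:

```python
MAX_TRIPS_PER_QUERY = 100  # cap on trips returned per query, as stated above


def summarize_query(route, trips):
    """Record how many trips a query returned and whether it hit the cap."""
    count = len(trips)
    return {
        "route": route,
        "trip_count": count,
        # A count at the cap may mean some trips were not returned.
        "possibly_truncated": count >= MAX_TRIPS_PER_QUERY,
    }


# Usage with made-up results for one route.
fake_trips = [{"trip_id": i} for i in range(42)]
print(summarize_query(("Paris", "Lyon"), fake_trips))
```

If a route ever shows `possibly_truncated` as true, that would be the signal to revisit the 100-trip limit.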
We also need to drop duplicate trip ids, which are caused by Blablacar's search engine. It matches trip requests with sub-segments of a driver's planned route, and these sub-segments are not the actual route. Their price is generated by Blablacar's pricing algorithm, cannot be modified, and comes with additional costs, so I don't want to take it into account. For each trip id, I will only keep the entry with the longest distance.
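The deduplication rule can be sketched as follows. The field names (`trip_id`, `distance_km`, `price`) and the sample values are assumptions for the example and may not match the fields actually returned by the API:

```python
def keep_longest_per_trip(results):
    """Keep, for each trip id, the entry with the longest distance."""
    longest = {}
    for result in results:
        trip_id = result["trip_id"]
        current = longest.get(trip_id)
        if current is None or result["distance_km"] > current["distance_km"]:
            longest[trip_id] = result
    return list(longest.values())


# Hypothetical search results: trip "a" appears twice, once as a sub-segment.
results = [
    {"trip_id": "a", "distance_km": 310, "price": 21.0},  # sub-segment match
    {"trip_id": "a", "distance_km": 465, "price": 28.0},  # full route
    {"trip_id": "b", "distance_km": 465, "price": 25.0},
]

deduplicated = keep_longest_per_trip(results)
for entry in deduplicated:
    print(entry)
```

Keeping the longest entry is a proxy for "the driver's full route", which is the one whose price the driver actually chose.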
- ✅ Get API.
- ✅ Build the architecture.
- ✅ Write the code for one route only.
- ✅ Integrate in the architecture, and test the CRON.
- ✅ Build a list of all the routes we will study.
- 🔁 Data collection in a bucket. (To check 01/10/2020)
- ❌ Study the results: statistics on the dataset. (✅ code ready)
- ❌ See the results... How does it go?
For now we have :
Our results will be heavily biased, to the point that they will not let us compare cities with each other. There are far too many biases: how well connected is the city (is it on a train line? does it have an airport?), and how expensive is it to own a car there (is parking expensive? is fuel expensive?). It could be interesting to build a KPI that is free of these biases, but that would be for another study.
The KPI we built has to be tracked over time, city by city.
This study gave me ideas for computing an eco-friendliness KPI. Blablacar is an eco-friendly means of transport, so the number of Blablacar trips can be useful data for studying how eco-friendly the population of a city is. That study will be more complicated, as I will have to gather other sources of data.