Unified New York City Taxi and For-Hire Vehicle Data

Code in support of this post: "Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance"

This repo provides scripts to download, process, and analyze data for over 1.8 billion taxi and for-hire vehicle (Uber, Lyft, etc.) trips originating in New York City since 2009. The data is stored in a PostgreSQL database, and uses PostGIS for spatial calculations, in particular mapping latitude/longitude coordinates to census tracts.

Most of the raw data comes from the NYC Taxi & Limousine Commission. The 2014 Uber data comes via FiveThirtyEight, who obtained it via a FOIL request. In August 2016, the TLC began providing for-hire vehicle trip records in addition to taxi trips.

Instructions

Your mileage may vary, but on my MacBook Air, this process took about 3 days to complete. The unindexed database takes up over 300 GB on disk.

1. Install PostgreSQL and PostGIS

Both are available via Homebrew on Mac OS X

2. Download raw data

./download_raw_data.sh

3. Initialize database and set up schema

./initialize_database.sh

4. Import taxi and FHV data

./import_trip_data.sh
./import_fhv_trip_data.sh

5. Optional: download and import 2014 Uber data

The FiveThirtyEight Uber dataset contains Uber trip records from Apr–Sep 2014. Uber and other FHV (Lyft, Juno, Via, etc.) data is available since Jan 2015 in the TLC's data.

./download_raw_2014_uber_data.sh
./import_2014_uber_trip_data.sh

6. Analysis

Additional Postgres and R scripts for analysis are in the analysis/ folder, or you can do your own!

Schema

trips table contains all yellow and green taxi trips. Each trip has a cab_type_id, which references the cab_types table and refers to one of yellow or green
fhv_trips table contains all for-hire vehicle trip records made available by the TLC, including ride-hailing apps Uber, Lyft, Via, and Juno
fhv_bases maps fhv_trips to base names and "doing business as" labels, which include ride-hailing app names
nyct2010 table contains NYC census tracts plus the Newark Airport. It also maps census tracts to NYC's official neighborhood tabulation areas
taxi_zones table contains the TLC's official taxi zone boundaries. Starting in July 2016, the TLC no longer provides pickup and dropoff coordinates. Instead, each trip comes with taxi zone pickup and dropoff location IDs
central_park_weather_observations has summary weather data by date

Other data sources

These are bundled with the repository, so no need to download separately, but:

Shapefile for NYC census tracts and neighborhood tabulation areas comes from Bytes of the Big Apple
Shapefile for taxi zone locations comes from the TLC
Mapping of FHV base numbers to names comes from the TLC
Central Park weather data comes from the National Climatic Data Center

Data issues encountered

Remove carriage returns and empty lines from TLC data before passing to Postgres COPY command
Some raw data files have extra columns with empty data, had to create dummy columns junk1 and junk2 to absorb them
Two of the yellow taxi raw data files had a small number of rows containing extra columns. I discarded these rows
The official NYC neighborhood tabulation areas (NTAs) included in the census tracts shapefile are not exactly what I would have expected. Some of them are bizarrely large and contain more than one neighborhood, e.g. "Hudson Yards-Chelsea-Flat Iron-Union Square", while others are confusingly named, e.g. "North Side-South Side" for what I'd call "Williamsburg", and "Williamsburg" for what I'd call "South Williamsburg". In a few instances I modified NTA names, but I kept the NTA geographic definitions
The shapefile includes only NYC census tracts. Trips to New Jersey, Long Island, Westchester, and Connecticut are not mapped to census tracts, with the exception of the Newark Airport

Why not use BigQuery or Redshift?

Google BigQuery and Amazon Redshift would probably provide significant performance improvements over PostgreSQL. A lot of the data is already available on BigQuery, but in scattered tables, and each trip has only latitude and longitude coordinates, not census tracts and neighborhoods. PostGIS seemed like the easiest way to map coordinates to census tracts. Once the mapping is complete, it might make sense to load the data back into BigQuery or Redshift to make the analysis faster. Note that BigQuery and Redshift cost some amount of money, while PostgreSQL and PostGIS are free.

Mark Litwintschik has used the taxi dataset to benchmark performance of many different technology stacks, his summary is here: http://tech.marksblogg.com/benchmarks.html

TLC summary statistics

There's a Ruby script in the tlc_statistics/ folder to import data from the TLC's summary statistics reports:

ruby import_statistics_data.rb

Taxi vs. Citi Bike comparison

Code in support of the post "When Are Citi Bikes Faster Than Taxis in New York City?" lives in the citibike_comparison/ folder

2017 update

Code in support of the 2017 update to the original post lives in the analysis/2017_update/ folder

Questions/issues/contact

todd@toddwschneider.com, or open a GitHub issue

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
analysis		analysis
citibike_comparison		citibike_comparison
data		data
nyct2010_15b		nyct2010_15b
taxi_zones		taxi_zones
tlc_statistics		tlc_statistics
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
add_newark_airport.sql		add_newark_airport.sql
add_tract_to_zone_mapping.sql		add_tract_to_zone_mapping.sql
create_nyc_taxi_schema.sql		create_nyc_taxi_schema.sql
download_raw_2014_uber_data.sh		download_raw_2014_uber_data.sh
download_raw_data.sh		download_raw_data.sh
import_2014_uber_trip_data.sh		import_2014_uber_trip_data.sh
import_fhv_trip_data.sh		import_fhv_trip_data.sh
import_trip_data.sh		import_trip_data.sh
initialize_database.sh		initialize_database.sh
populate_green_trips.sql		populate_green_trips.sql
populate_uber_trips.sql		populate_uber_trips.sql
populate_yellow_trips.sql		populate_yellow_trips.sql
raw_2014_uber_data_urls.txt		raw_2014_uber_data_urls.txt
raw_data_urls.txt		raw_data_urls.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Unified New York City Taxi and For-Hire Vehicle Data

Instructions

1. Install PostgreSQL and PostGIS

2. Download raw data

3. Initialize database and set up schema

4. Import taxi and FHV data

5. Optional: download and import 2014 Uber data

6. Analysis

Schema

Other data sources

Data issues encountered

Why not use BigQuery or Redshift?

TLC summary statistics

Taxi vs. Citi Bike comparison

2017 update

Questions/issues/contact

About

Releases

Packages

Languages

License

weiishann/nyc-taxi-data

Folders and files

Latest commit

History

Repository files navigation

Unified New York City Taxi and For-Hire Vehicle Data

Instructions

1. Install PostgreSQL and PostGIS

2. Download raw data

3. Initialize database and set up schema

4. Import taxi and FHV data

5. Optional: download and import 2014 Uber data

6. Analysis

Schema

Other data sources

Data issues encountered

Why not use BigQuery or Redshift?

TLC summary statistics

Taxi vs. Citi Bike comparison

2017 update

Questions/issues/contact

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages