New York Times articles dashboard

ETL pipeline on the New York Times data with Looker for data visualization

Motivation

I've always enjoyed reading the news, but it is hard to find data analysis about the articles that get published. I chose the New York Times because its data is easy to access through an API.

Data source

I'm using metadata about historical articles from the New York Times. The archive goes all the way back to 1851, but I only used data from 2000 onward.

https://developer.nytimes.com/docs/archive-product/1/overview
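
As a rough illustration of how the Archive API can be queried, the sketch below builds the request URL for one month and pulls the article records out of the response. The endpoint pattern and the `response.docs` layout are taken from the Archive API documentation linked above; the function names are illustrative assumptions and not the code used in this project.

```python
import json
import urllib.request

# URL pattern for the Archive API: one request returns one month of articles.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"


def build_archive_url(year: int, month: int, api_key: str) -> str:
    """Return the Archive API URL for one month (month is 1-12, not zero-padded)."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return ARCHIVE_URL.format(year=year, month=month, key=api_key)


def extract_docs(payload: dict) -> list:
    """Pull the list of article records out of a decoded API response."""
    return payload.get("response", {}).get("docs", [])


def fetch_month(year: int, month: int, api_key: str) -> list:
    """Download one month of article metadata (needs network access and a valid key)."""
    with urllib.request.urlopen(build_archive_url(year, month, api_key)) as resp:
        return extract_docs(json.load(resp))
```

Calling `fetch_month(2000, 1, api_key)` with a valid key would return the article metadata for January 2000 as a list of dictionaries.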

Instructions:

The setup steps are too long to include here; see RUN_PROJECT.md in this repository.

If you run into a problem while running the project, don't hesitate to open an issue on GitHub so I can take a look and help.

Solution

Tools and infrastructure
Data lake: Google Cloud Storage
Data warehouse: BigQuery
Data pipeline: Python in Mage AI
Analytics engineering: dbt
Orchestration: Mage AI (triggers and backfills for historical data)
Data visualization: Looker
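
Because the Archive API serves one month per request, a backfill from 2000 onward amounts to one pipeline run per month. The helper below is a hypothetical sketch of how that list of months could be enumerated in plain Python; it does not use Mage AI's own backfill feature, and the name `months_between` is an assumption made for illustration.

```python
from datetime import date


def months_between(start: date, end: date):
    """Yield (year, month) pairs from start's month through end's month, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over to January of the next year.
            year, month = year + 1, 1
```

For example, `list(months_between(date(2000, 1, 1), date.today()))` gives every month that a backfill starting in January 2000 would need to load.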

Data pipeline

ETL pipeline

Data modeling

I modeled my data in three tables:

a fact table
a keyword dimension table (many-to-many relationship)
an author dimension table (many-to-many relationship)
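
To make the many-to-many relationships concrete, here is a minimal sketch of how one article record could be split into a fact row plus one row per keyword and per author. The input field names (`_id`, `headline.main`, `keywords`, `byline.person`) are assumed from the Archive API response format, and the output column names are hypothetical; the actual dbt models in this project may be structured differently.

```python
def split_article(doc: dict):
    """Split one article record into a fact row, keyword rows, and author rows."""
    article_id = doc["_id"]

    # One row per article for the fact table.
    fact = {
        "article_id": article_id,
        "headline": (doc.get("headline") or {}).get("main"),
        "pub_date": doc.get("pub_date"),
        "section_name": doc.get("section_name"),
        "word_count": doc.get("word_count"),
    }

    # One row per (article, keyword) pair: an article has many keywords,
    # and a keyword appears on many articles.
    keyword_rows = [
        {"article_id": article_id, "keyword": kw["value"], "keyword_type": kw.get("name")}
        for kw in doc.get("keywords") or []
    ]

    # One row per (article, author) pair, built from the byline's person list.
    author_rows = []
    for person in (doc.get("byline") or {}).get("person") or []:
        parts = (person.get("firstname"), person.get("lastname"))
        author_rows.append(
            {"article_id": article_id, "author": " ".join(p for p in parts if p)}
        )

    return fact, keyword_rows, author_rows
```

Joining the keyword or author rows back to the fact row on `article_id` recovers, for instance, every article tagged with a given keyword.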

Fact table:

Data lineage:

Results

Dashboard
