This is a project I worked on in the summer of 2023. Cronicle is a multi-stage system for collecting articles from the web, uploading them to the cloud, and ultimately serving curated emails based on AI predictions. This repo is arranged largely as a collection of Jupyter notebooks, Python files, and scripts, covering every stage from data collection to the final product.
Key technologies/libraries:
Further details below!
- Scrapers collect data from a number of sites
- After initial processing, data is dumped to object storage (GCS)
- Further processing: a Dataflow job is triggered via Pub/Sub to perform ETL, uploading the results to BigQuery (BQ)
- A containerized process can then be run to:
  - query BQ
  - run AI model inference
  - publish emails based on model results
I selected a number of sites covering topics I enjoy (tech news, movies, books) and wrote a scraping script for each. The experimental process can be viewed in the notebooks. I then compiled the scripts into a module that makes scraping easy and accepts arguments to customize each run. The scripts run on a schedule, scraping each site and then uploading the results to GCS.
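As a rough illustration of what "some args to customize the process" could look like, here is a minimal sketch of a command-line entry point. The flag names (`--site`, `--max-articles`, `--bucket`), the default bucket, and the object path layout are all hypothetical and are not taken from the actual module.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """CLI for one scheduled scrape run; every flag name here is hypothetical."""
    parser = argparse.ArgumentParser(description="Scrape one site and upload the results.")
    parser.add_argument("--site", required=True, help="which site scraper to run")
    parser.add_argument("--max-articles", type=int, default=50, help="cap on articles per run")
    parser.add_argument("--bucket", default="example-bucket", help="destination GCS bucket")
    return parser


def object_name(site: str, run_date: date) -> str:
    """Build a date-partitioned object path so each run writes a distinct file."""
    return f"raw/{site}/{run_date.isoformat()}.json"


# Parse an explicit argument list here; a real script would call parse_args().
args = build_parser().parse_args(["--site", "hackernews", "--max-articles", "10"])
print(args.site, args.max_articles, object_name(args.site, date(2023, 7, 1)))
```

A scheduler (cron, for example) could then invoke the script once per site with different arguments.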
Uploading to Cloud Storage triggers a Cloud Function (via Pub/Sub) that launches a Dataflow job, which uploads the processed articles as rows to BQ tables. Because each upload launches its own job and jobs run in parallel, the pipeline scales with upload volume. The scripts and templates used are in the etl folder.
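The trigger logic could be sketched along these lines. This is not the code in the etl folder: it only shows how a Pub/Sub-triggered function might decode a Cloud Storage notification and assemble the request body for a Dataflow template launch. The template parameter names (`input`, `output_table`), the table ID, and the job name prefix are assumptions, and the actual API call is left out.

```python
import base64
import json
import string
from typing import Tuple

_ALLOWED = set(string.ascii_lowercase + string.digits)


def parse_gcs_notification(event: dict) -> Tuple[str, str]:
    """Return (bucket, object name) from a Pub/Sub-delivered GCS notification.

    The Pub/Sub message body arrives base64-encoded in event["data"] and
    decodes to the JSON description of the uploaded object.
    """
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    return payload["bucket"], payload["name"]


def build_launch_body(bucket: str, name: str) -> dict:
    """Assemble a template launch request body for the uploaded file.

    The "input" and "output_table" parameter names are hypothetical; they
    have to match whatever the Dataflow template itself declares.
    """
    # Keep the job name to lowercase letters, digits, and hyphens.
    safe = "".join(c if c in _ALLOWED else "-" for c in name.lower()).strip("-")
    return {
        "jobName": f"etl-{safe}",
        "parameters": {
            "input": f"gs://{bucket}/{name}",
            "output_table": "example-project:articles.raw",  # hypothetical table
        },
    }


# Simulate the event a Pub/Sub-triggered function would receive.
message = json.dumps({"bucket": "example-bucket", "name": "raw/hackernews/2023-07-01.json"})
event = {"data": base64.b64encode(message.encode("utf-8")).decode("ascii")}
bucket, name = parse_gcs_notification(event)
print(build_launch_body(bucket, name))
```

Keeping the parsing and request-building steps as pure functions makes them easy to test locally without touching any cloud services.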
First, the inference script queries BQ. From there, it is set up so that almost any NLP inference model from the Hugging Face transformers library can be dropped in. The available implementation uses a BERT-based model to measure the prominent emotions among the comments on Hacker News posts. After filtering, an email is published with links to the articles selected during inference.
These scripts can be found in the publish folder.
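To make the "drop in any model" idea concrete, here is a minimal sketch of a filtering step that takes the classifier as a plain callable. The selection rule (keep an article when at least half of its comments carry a chosen emotion label), the function name `select_articles`, and the article dictionary layout are assumptions for illustration, not the logic in the publish folder. The stand-in classifier below returns one `{"label", "score"}` dictionary per input text, which is the shape a transformers text-classification pipeline returns for a list of strings.

```python
from typing import Callable, Dict, List


def select_articles(
    articles: List[Dict],
    classify: Callable[[List[str]], List[Dict]],
    wanted_label: str,
    min_share: float = 0.5,
) -> List[str]:
    """Return URLs of articles where enough comments carry the wanted emotion."""
    selected = []
    for article in articles:
        comments = article["comments"]
        if not comments:
            continue  # nothing to score for this article
        predictions = classify(comments)  # one {"label", "score"} dict per comment
        hits = sum(1 for p in predictions if p["label"] == wanted_label)
        if hits / len(comments) >= min_share:
            selected.append(article["url"])
    return selected


def fake_classify(texts: List[str]) -> List[Dict]:
    """Stand-in for a real model so the sketch runs without downloads."""
    return [{"label": "joy" if "great" in t else "anger", "score": 0.9} for t in texts]


articles = [
    {"url": "https://example.com/a", "comments": ["great read", "great work", "meh"]},
    {"url": "https://example.com/b", "comments": ["awful", "bad", "great"]},
    {"url": "https://example.com/c", "comments": []},
]
print(select_articles(articles, fake_classify, "joy"))  # ['https://example.com/a']
```

With the transformers library installed, `fake_classify` could be replaced by a `pipeline("text-classification", model=...)` object, since such a pipeline is callable on a list of strings.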