Unit and integration tests could be implemented to check that each component of this project works as expected, and that the components work together without issue.
Data validation checks could also be utilised to ensure the data we're receiving is of the expected quality and volume.
Tools such as Great Expectations and pytest could be useful here.
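As a minimal sketch of what this might look like (the `pipeline` module and its `extract_batch`/`transform` functions are hypothetical stand-ins for this project's own code):

```python
# test_pipeline.py -- a pytest sketch; `pipeline`, `extract_batch`, and
# `transform` are hypothetical stand-ins for this project's own modules.
import pandas as pd

from pipeline import extract_batch, transform


def test_transform_preserves_row_count():
    # Unit test: transform should not silently drop records.
    raw = pd.DataFrame({"id": [1, 2, 3], "price": [9.99, 5.00, 12.50]})
    assert len(transform(raw)) == len(raw)


def test_extract_returns_expected_columns():
    # Integration test: pulls a real batch and checks quality and volume,
    # the kind of expectation Great Expectations can formalise as a suite.
    df = extract_batch()
    assert {"id", "price"} <= set(df.columns)
    assert df["price"].notna().all()  # no missing values
    assert len(df) > 0                # we actually received data
```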
Using cron works, but it makes extending this pipeline more challenging and limits our ability to backfill missed runs. Other schedulers could be explored, such as Airflow or Prefect.
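For example, a minimal Airflow DAG (2.x API) could replace the cron entry; `run_batch` here is a hypothetical wrapper around the existing batch job:

```python
# dags/batch_pipeline.py -- a minimal Airflow 2.x sketch; `run_batch` is a
# hypothetical entry point wrapping this project's existing batch job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline import run_batch

with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=True,  # unlike cron, Airflow can backfill missed intervals
) as dag:
    PythonOperator(task_id="run_batch", python_callable=run_batch)
```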
We could also look into utilising AWS serverless functionality such as Lambda, which we could trigger on a schedule with CloudWatch Events (EventBridge).
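A sketch of the batch job as a Lambda handler, assuming the 5-minute schedule (e.g. `rate(5 minutes)`) would be configured on an EventBridge rule outside the code:

```python
# lambda_function.py -- a minimal sketch; `run_batch` is a hypothetical
# entry point wrapping the existing batch job.
from pipeline import run_batch


def handler(event, context):
    # Invoked on schedule by CloudWatch Events/EventBridge;
    # no always-on cron host to maintain.
    run_batch()
    return {"status": "ok"}
```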
In a real production environment, we'd most likely want to load our data into a proper data warehouse such as Redshift, or a managed database service like RDS.
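As a rough sketch of what the load step could look like against a Postgres RDS instance (the connection string, table name, and `load` helper are all illustrative, not real project configuration):

```python
# load.py -- an illustrative sketch; the connection string and table name
# are placeholders rather than real project configuration.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:password@example.rds.amazonaws.com:5432/warehouse"
)


def load(df: pd.DataFrame) -> None:
    # Append each batch to a warehouse table instead of local storage.
    df.to_sql("prices", engine, if_exists="append", index=False)
```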
Right now we're running batch jobs every 5 minutes. It would make more sense to implement streaming, perhaps with something like Kafka. This would be more complex, but would let us extract real-time data without the 5-minute delay.
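A minimal producer sketch using the kafka-python client; the `prices` topic and the `fetch_records` generator are hypothetical illustrations:

```python
# producer.py -- a minimal kafka-python sketch; the "prices" topic and the
# fetch_records() generator are hypothetical illustrations.
import json

from kafka import KafkaProducer

from pipeline import fetch_records

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish each record as it arrives instead of batching every 5 minutes,
# so downstream consumers see new data with sub-second latency.
for record in fetch_records():
    producer.send("prices", value=record)
producer.flush()
```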