Unit and integration tests could be implemented to check that each component of this project works as expected, and that the components work together without issue.
Data validation checks could also be utilised to ensure the data we're receiving is of the expected quality and volume.
Tools such as Great Expectations and pytest could be useful here.
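As a minimal sketch of what this might look like (the `pipeline` module and its `extract_batch`/`transform` functions are hypothetical stand-ins for this project's own code):

```python
# test_pipeline.py -- a pytest sketch; `pipeline`, `extract_batch`, and
# `transform` are hypothetical stand-ins for this project's own modules.
import pandas as pd

from pipeline import extract_batch, transform


def test_transform_preserves_row_count():
    # Unit test: transform should not silently drop records.
    raw = pd.DataFrame({"id": [1, 2, 3], "price": [9.99, 5.00, 12.50]})
    assert len(transform(raw)) == len(raw)


def test_extract_returns_expected_columns():
    # Integration test: pulls a real batch and checks quality and volume,
    # the kind of expectation Great Expectations can formalise as a suite.
    df = extract_batch()
    assert {"id", "price"} <= set(df.columns)
    assert df["price"].notna().all()  # no missing values
    assert len(df) > 0                # we actually received data
```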
Using cron works, but it makes extending this pipeline more challenging and limits our ability to backfill missed runs. Other schedulers could be explored, such as Airflow or Prefect.
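For example, a minimal Airflow DAG (2.x API) could replace the cron entry; `run_batch` here is a hypothetical wrapper around the existing batch job:

```python
# dags/batch_pipeline.py -- a minimal Airflow 2.x sketch; `run_batch` is a
# hypothetical entry point wrapping this project's existing batch job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from pipeline import run_batch

with DAG(
    dag_id="batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=True,  # unlike cron, Airflow can backfill missed intervals
) as dag:
    PythonOperator(task_id="run_batch", python_callable=run_batch)
```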
We could also look into utilising AWS serverless functionality such as Lambda, which we could trigger on a schedule with CloudWatch Events (EventBridge).
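A sketch of the batch job as a Lambda handler, assuming the 5-minute schedule (e.g. `rate(5 minutes)`) would be configured on an EventBridge rule outside the code:

```python
# lambda_function.py -- a minimal sketch; `run_batch` is a hypothetical
# entry point wrapping the existing batch job.
from pipeline import run_batch


def handler(event, context):
    # Invoked on schedule by CloudWatch Events/EventBridge;
    # no always-on cron host to maintain.
    run_batch()
    return {"status": "ok"}
```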
In a real production environment, we'd most likely want to load our data into a proper data warehouse such as Redshift, or a managed database service like RDS.
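As a rough sketch of what the load step could look like against a Postgres RDS instance (the connection string, table name, and `load` helper are all illustrative, not real project configuration):

```python
# load.py -- an illustrative sketch; the connection string and table name
# are placeholders rather than real project configuration.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://user:password@example.rds.amazonaws.com:5432/warehouse"
)


def load(df: pd.DataFrame) -> None:
    # Append each batch to a warehouse table instead of local storage.
    df.to_sql("prices", engine, if_exists="append", index=False)
```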
Right now we're running batch jobs every 5 minutes. It would make more sense to implement streaming, perhaps with something like Kafka. This would be more complex, but would let us extract real-time data without the 5-minute delay.
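A minimal producer sketch using the kafka-python client; the `prices` topic and the `fetch_records` generator are hypothetical illustrations:

```python
# producer.py -- a minimal kafka-python sketch; the "prices" topic and the
# fetch_records() generator are hypothetical illustrations.
import json

from kafka import KafkaProducer

from pipeline import fetch_records

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish each record as it arrives instead of batching every 5 minutes,
# so downstream consumers see new data with sub-second latency.
for record in fetch_records():
    producer.send("prices", value=record)
producer.flush()
```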