If you have a Kinesis Firehose delivery stream set up to send incoming data to Redshift and the COPY fails at some point for any reason, the delivery stream keeps saving the data in S3 and creates manifest files pointing to the files that failed to be copied.
Since there can be many manifest files, manually running the COPY command for each of them may not be practical. That's why this repository was created.
- AWS CLI
- Docker or Python 3
First, create a text file listing the manifest files you want to reprocess. It will serve as the index for the reprocessing run.
Using the AWS CLI, run the following command:
aws s3api list-objects-v2 --bucket <bucket> --prefix "<prefix>" --query "Contents[].[Key]" --output text > manifests.txt
This prints only the object keys found under the prefix in S3 and saves them to a file called manifests.txt.
The parameters are:
- <bucket>: the S3 bucket where your data is located
- <prefix>: the path within the bucket, e.g. ses-logs/errors/manifests/2020/03
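If you would rather build the index from Python than from the AWS CLI, the sketch below shows one way it could be done. It is not part of this repository: the function names are hypothetical, and it assumes boto3 is installed and AWS credentials are already configured.

```python
from typing import Iterable, Iterator


def manifest_keys(pages: Iterable[dict]) -> Iterator[str]:
    """Yield every object key from a sequence of list_objects_v2 result pages."""
    for page in pages:
        # "Contents" is missing when a page (or the whole prefix) has no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


def write_index(keys: Iterable[str], path: str = "manifests.txt") -> int:
    """Write one key per line to `path` and return the number of keys written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for key in keys:
            fh.write(key + "\n")
            count += 1
    return count


def s3_pages(bucket: str, prefix: str):
    """Return the paginated listing for `prefix` (assumes boto3 and credentials)."""
    import boto3  # imported here so the helpers above work without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return paginator.paginate(Bucket=bucket, Prefix=prefix)
```

Calling `write_index(manifest_keys(s3_pages("<bucket>", "<prefix>")))` should produce a manifests.txt with one key per line, like the CLI command above.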
Clone this repository and place the manifests.txt file created previously in the same folder as the cloned files.
Edit the firehose.yaml configuration file as needed. You will need to set the Redshift connection parameters and the COPY parameters (which you can find in your Firehose delivery stream settings).
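To make the role of the COPY parameters concrete, here is a minimal sketch of how a COPY statement for a single manifest could be assembled. It is an illustration, not the code in this repository. The function and argument names are hypothetical and do not reflect the actual keys of firehose.yaml. It also assumes the cluster authorizes COPY through an IAM role.

```python
def build_copy_sql(table: str, bucket: str, manifest_key: str,
                   iam_role: str, copy_options: str = "") -> str:
    """Return a Redshift COPY statement that loads the files listed in one manifest."""
    manifest_url = f"s3://{bucket}/{manifest_key}"
    # This sketch does no SQL escaping, so refuse values that contain quotes.
    if "'" in manifest_url or "'" in iam_role:
        raise ValueError("single quotes are not supported in this sketch")
    parts = [
        f"COPY {table}",
        f"FROM '{manifest_url}'",
        f"IAM_ROLE '{iam_role}'",
        "MANIFEST",  # tells COPY the S3 object is a manifest listing the data files
    ]
    if copy_options:
        # Extra options, e.g. the ones shown in the delivery stream settings.
        parts.append(copy_options)
    return " ".join(parts) + ";"


# Example with made-up table, bucket, and role names.
print(build_copy_sql(
    "ses_logs",
    "my-bucket",
    "ses-logs/errors/manifests/2020/03/m1",
    "arn:aws:iam::123456789012:role/example",
    "json 'auto' gzip",
))
```

The MANIFEST keyword is what makes Redshift treat the S3 object as a list of files to load rather than as the data itself.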
The Docker image makes it easier to set up the environment needed to run the COPY commands.
Run this command to build the image:
docker build -t <name> .
where <name> is the name given to the Docker image. It can be anything.
To run the application using Docker, execute the following command:
docker run -it --rm --name my-running-app -v "$PWD":/usr/src/myapp -w /usr/src/myapp <image_name>
replacing the parameter image_name with the name used in the previous step.
The application will show the commands it's sending to Redshift, and stop once it's finished.
If you don't want to use Docker, you can run the application directly with Python.
Install the necessary libraries by running:
pip install --no-cache-dir -r requirements.txt
and execute the application by running:
python ./load_manifests.py
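For orientation, the overall flow of such a run (read the index, print each command, send it to Redshift) might resemble the sketch below. This is not the actual content of load_manifests.py. The function names are hypothetical, and `execute` stands in for whatever sends SQL to Redshift, for example a database cursor's `execute` method. Continuing after a failed manifest is a design choice of this sketch, not documented behavior of the application.

```python
from typing import Callable, Iterable, List, Tuple


def read_index(lines: Iterable[str]) -> List[str]:
    """Return the stripped, non-empty manifest keys from the index lines."""
    return [line.strip() for line in lines if line.strip()]


def process_manifests(keys: Iterable[str],
                      make_sql: Callable[[str], str],
                      execute: Callable[[str], object]) -> List[Tuple[str, str]]:
    """Run one statement per manifest key and return (key, error) for each failure."""
    failures: List[Tuple[str, str]] = []
    for key in keys:
        sql = make_sql(key)
        print(sql)  # show the command before sending it
        try:
            execute(sql)
        except Exception as exc:
            # Record the failure and continue with the remaining manifests.
            failures.append((key, str(exc)))
    return failures
```

Returning the failed keys makes it possible to write them to a new index file and retry only those manifests.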