The San Francisco Controller's Office maintains a database of spending and revenue data sourced from it's citywide financial system. Transactions data is scraped from the DataSF official website and conformed into a dimensional model in a data warehouse. The purpose of this data engineering is to help support dashboarding and ad-hoc query from business users.
- Transfer Family: Assumes role on the client's behalf as a SSH user to transfer data to a remote AWS server through the SSH File Transfer Protocol (SFTP). SFTP enables secured data transfer from local SFTP-enabled server to the AWS environment over the internet. A pair of SSH keys (public & private) will be required to support user authentication with the SFTP server.
- S3: The SFTP server establishes a relationship with the S3 bucket which serves as its persistent file storage system so that data transfered through SFTP will be stored directly S3. All CSV data will be transfered and stored in S3.
- Redshift: The Redshift cluster assumes role on the client's behalf to pull data from S3 and load it into the data warehouse reporting area. To enable working with Redshift remotely, a VPC security group is attached to the cluster to route all incoming traffic to port 5439, which is the default port on which Redsfhit database is exposed.
- S3 Policies: Defined the set of actions (e.g GET, DELETE, PUT, etc) allowed to be performed on the S3 bucket by Transfer Family and Redshift assuming roles on the client's behalf, respectively. Transfer Family has permissions to put transfered objects into the bucket, whereas Redshift has permissions to fetch and copy data from the bucket over to the cluster.
- Program: A program identifies the service under which a transaction is recorded. Each program belongs to a department under an organization group.
- Type: Specify the detail of resources involved in a transaction. Sub-object indicates the most granular level of detail, and falls under object and character in that order.
- Fund: An activity receives governmental funding under a particular category based on its characteristics. Fund category is the lowest level under the hierarchy, falling under fund and fund type in that order.
- Finance: Indicates whether a transaction is a spending or a revenue.
- Transaction: The amount of dollars processed in a transaction within an accounting period that begins on July 1 and ends on June 30 the following year.
Make sure your local server is SFTP-enabled
1. Preliminary Setup
pip3 install -r requirements.txt
- Make sure AWS account has full access to work with S3, Transfer Famiy, Redshift, and IAM
- Follow the boto3 configuration link if boto3 isn't already set up
- Complete the params.cfg file where a 'TODO: Replace the value below' indicates
2. Generate an SSH Key Pair
- Create a folder called ssh in the project root directory
- Generate an RSA 4096-bit key pair in the ssh directory
ssh-keygen -t rsa -b 4096 -N "" -m PEM -f <key_name> # TODO: Replace <key_name> with a key name of your choice
3. Build AWS Infrastructures
python3 infrastructures.py # Make sure in the project root directory
- After all AWS resources have been provisioned, the SFTP server endpoint will show up on the terminal
4. Connect to the Transfer Family SFTP Server
- Refer to the sftp_server_username in params.cfg
sftp -i ssh/<key_name> <sftp_server_username>@<sftp_server_endpoint>
- Enter 'yes' if prompted for connection, you shall be directed to the SFTP terminal
5. Transfer Local Datasets to S3 Buckets over SFTP
put data/*.csv
- Disconnect from the SFTP server, or open a new terminal
6. Load the Datasets into the Redshift Data Warehouse
python3 load_tables.py
7. Query Dimensional Model in Redshift Query Editor V2
8. Tear Down AWS Infrastructures
python3 clean_up.py