This solution enables ingestion of medical device data into a data lake. The sources of data can be diverse; the current solution is designed for file-based ingestion. It is recommended that you run the HIPAA QuickStart before running the scripts here.
Data encryption provides a strong layer of security to protect data that you store within AWS services. AWS services can help you achieve ubiquitous encryption for data in transit as well as data at rest.
In the Ingestion segment, this solution creates the following component:
- Staging S3 Bucket: Used to ingest the raw dataset from the source.
In the Data Processing segment, this solution creates the following components:
- SQS Queue: Holds the list of files that are pending or currently being processed.
- AWS Lambda: Processes one file at a time from the SQS queue (a sketch follows this list).
- DynamoDB: Holds the list of file names already processed; used for duplicate file detection.
- SNS: Creates a topic and subscription for error notifications.
- Glue: A sample Glue job is created based on the script you will download and save to an S3 location later.
- S3 Processed Bucket: Holds the raw files that are moved out of the Staging Bucket once processing succeeds.
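To make the processing flow concrete, here is a minimal sketch of what the Lambda handler could look like, assuming an SQS queue fed by S3 event notifications and a DynamoDB table keyed on file name. The table name and the follow-on steps are hypothetical placeholders, not values from the actual template.

```python
import json

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
# Hypothetical table name; the real one is created by the CloudFormation stack.
processed_table = dynamodb.Table("ProcessedFiles")

def handler(event, context):
    # Each SQS record wraps an S3 event notification in its body.
    for record in event["Records"]:
        body = json.loads(record["body"])
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            try:
                # Conditional write: fails if this file name was seen before,
                # which is how duplicate files are detected and skipped.
                processed_table.put_item(
                    Item={"file_name": key},
                    ConditionExpression="attribute_not_exists(file_name)",
                )
            except ClientError as err:
                if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                    continue  # duplicate file; do not process again
                raise
            # Hypothetical next steps: start the Glue job for this file, then
            # move the raw object from the Staging Bucket to the Processed Bucket.
```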
In the Data Lake segment, this solution creates the following component:
- S3 Data Lake Bucket: Holds the content of the data lake in the partition scheme Metric/year/month/day/patient (an example path follows this list).
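For illustration, the partition scheme could place a single reading at a key like the one built below; the metric name, date values, patient id, and file name are all hypothetical.

```python
# Hypothetical example of the Metric/year/month/day/patient partition layout.
metric, year, month, day, patient = "HeartRate", "2023", "06", "15", "patient-1234"
key = f"{metric}/{year}/{month}/{day}/{patient}/reading.parquet"
print(key)  # HeartRate/2023/06/15/patient-1234/reading.parquet
```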
The AWS Lake Formation component can be created using the instructions from here.
In the Analytics segment, this solution doesn't create any components; the architecture diagram shows possible ways to consume data from the Data Lake.
This architecture uses IAM policies for service-based access and KMS keys to encrypt the S3 buckets, DynamoDB, and Parameter Store. The Parameter Store is used to save the following values (a retrieval sketch follows this list):
- Data Lake Bucket Name
- Data Lake Folder Name
- Processed Bucket Name
- Processed Bucket Folder Name
- SQS Queue Name
- Failure Notification ARN
- Athena Database to use for creating the tables
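As a minimal sketch, a processing component could look these values up with the SSM API at runtime. The parameter names below are assumptions for illustration; the real names are defined by the CloudFormation template.

```python
import boto3

ssm = boto3.client("ssm")

def get_param(name: str) -> str:
    """Fetch one configuration value from Parameter Store."""
    return ssm.get_parameter(Name=name)["Parameter"]["Value"]

# Hypothetical parameter names; check the deployed stack for the real ones.
datalake_bucket = get_param("/medical-device/datalake-bucket-name")
processed_bucket = get_param("/medical-device/processed-bucket-name")
failure_topic_arn = get_param("/medical-device/failure-notification-arn")
```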
Processing logs are recorded in CloudWatch Logs.
This script does not create the VPC, subnets, route tables, etc.
- Please ensure that you have run the HIPAA QuickStart
To get started, sign in to your AWS account and create a stack based on the criteria below.
- Clone the repository:
git clone git@github.com:aws-samples/analysis-of-medical-device-data-using-data-lake.git
cd analysis-of-medical-device-data-using-data-lake
- Upload the sample heart_rate_job.py to your S3 bucket:
aws s3 cp heart_rate_job.py s3://[YOUR-BUCKET-NAME-HERE]
- Copy the location of the job file: s3://[YOUR-BUCKET-NAME-HERE]/heart_rate_job.py
- If you want to ensure that all traffic to your AWS resources stays within the AWS network, use the script "Cloudformation_WithVPC.json". It will create VPC endpoints for SQS, S3, DynamoDB, Glue, Athena, and SSM.
- If you don't need to restrict traffic to the AWS network, you can use the script "Cloudformation_WithoutVPC.json". It will create the same resources as "Cloudformation_WithVPC.json", minus the VPC endpoints.
- Supply all the parameters as required.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License. See the LICENSE file.