This repository is a sample guide to building a serverless document processing application that can make intelligent flow-control decisions after classifying the input document type.
A video presentation of the architecture and a demo is available in my AWS Innovate AI/ML Edition 2021 talk.
These prerequisites should be installed first:
- Install and configure Apache Maven
- Install and configure the AWS Cloud Development Kit
- Install and configure git
- Install the JDK
Once those are installed, clone this repository:
git clone https://github.com/aws-samples/amazon-textract-comprehend-stepfunctions-example
Then import the project into your IDE and review the sample code. There are two modules, cdk and lambda. The cdk module contains the application and infrastructure CDK code: it builds and deploys the serverless Lambda functions and defines the Step Functions flow.
The lambda module contains the application runtime code as Lambda functions. These functions show examples of extracting a single page from a PDF and calling Textract synchronously, classifying its content using a Comprehend custom classifier, and making an asynchronous Textract call with an Amazon SNS notification on completion. The initial flow is triggered by an upload to S3, which starts a Step Functions execution.
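The flow-control decision that follows classification can be illustrated in plain Java. This is only a sketch of the idea, not the sample's implementation (the branching in the sample is presumably expressed in the Step Functions flow). The class name, the label names, the branch names and the confidence threshold are all assumptions made up for this example.

```java
import java.util.Map;

/** Hypothetical sketch: choose the next step from classifier scores. */
final class RoutingSketch {

    /** Branch names are invented for this example. */
    enum Branch { FULL_TEXTRACT, SKIP, MANUAL_REVIEW }

    /**
     * Picks the highest-scoring label and maps it to a branch.
     * Falls back to MANUAL_REVIEW when no label is confident enough.
     */
    static Branch route(Map<String, Double> scores, double minConfidence) {
        Map.Entry<String, Double> best = null;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (best == null || entry.getValue() > best.getValue()) {
                best = entry;
            }
        }
        if (best == null || best.getValue() < minConfidence) {
            return Branch.MANUAL_REVIEW;
        }
        // "APPLICATION" is an assumed label; use whatever labels you trained with.
        if (best.getKey().equals("APPLICATION")) {
            return Branch.FULL_TEXTRACT;
        }
        return Branch.SKIP;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("APPLICATION", 0.93, "INVOICE", 0.07);
        System.out.println(route(scores, 0.8)); // prints FULL_TEXTRACT
    }
}
```

Routing low-confidence results to a separate branch is a design choice of this sketch; it keeps uncertain classifications from triggering the more expensive asynchronous Textract call.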
Before deploying the sample, you will need to train a Comprehend classifier. A very simple reference example is located in /samples/trainer.csv. (It contains the plain text of three sample documents, one per row; in practice you will train with many more samples of each document type.) To train the classifier:
- Visit the Comprehend console (switch to the region you wish to use)
- Click Train classifier
- Give it a name and check the other details (the defaults are fine to start - use a Multi-class classifier)
- Specify the S3 location of the training file (upload it first)
- Click Train classifier
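If you build your own training file, a multi-class Comprehend training CSV has no header row and puts the class label in the first column and the document text in the second. The sketch below shows one way to produce such rows in Java. The class name, the labels and the sample text are assumptions for illustration, and flattening each document to a single line is a conservative choice of this sketch rather than something taken from the sample.

```java
import java.util.List;

/** Hypothetical sketch: format rows for a label,text training CSV. */
final class TrainingCsvSketch {

    /** Quotes a field CSV-style: wrap in double quotes, double any embedded quotes. */
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Builds one row: label first, then the text collapsed onto a single line. */
    static String row(String label, String documentText) {
        // Collapse newlines and runs of whitespace so each document stays on one row.
        String singleLine = documentText.replaceAll("\\s+", " ").trim();
        // Labels are assumed to be simple tokens without commas or quotes.
        return label + "," + quote(singleLine);
    }

    public static void main(String[] args) {
        List<String[]> samples = List.of(
                new String[] {"INVOICE", "Invoice No. 42\nTotal due: \"100.00\""},
                new String[] {"APPLICATION", "Application form\nApplicant name: Jane Doe"});
        for (String[] sample : samples) {
            System.out.println(row(sample[0], sample[1]));
        }
    }
}
```

Quoting every text field keeps commas inside the document text from being read as column separators.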
Once it's trained (it will take a few minutes), start an endpoint by clicking Create endpoint from the classifier's console page. (You can also use batch-mode classification, which does not require a running endpoint but is not suitable for real-time processing.) Note the ARN of the classifier.
From the project root directory, run the command mvn install to compile the project and create the deployable Lambda artefacts. This will download all of the project dependencies first, so the initial run may take a few minutes. You should re-run this command each time you make an application change.
Once that is successful, you will need to bootstrap the CDK environment:
cdk bootstrap --profile YOUR_AWS_CLI_PROFILE_HERE
Make sure to update the command with your AWS CLI profile.
The CDK application needs some permissions to be able to create the stack. Visit the IAM console to set these up. You can run the CDK deploy step iteratively and correct any missing permissions as they surface. I defined a group with the policies below and attached it to a user restricted by a permissions boundary (so that it can only manage resources it created itself). You may restrict these further as appropriate.
- AmazonComprehendServicePolicy-InnovateClassifier
- AWSLambdaFullAccess
- IAMFullAccess
- AmazonS3FullAccess
- AmazonTextractFullAccess
- AmazonSNSFullAccess
- AWSStepFunctionsFullAccess
- AWSCloudFormationFullAccess
Note that these are only build-time permissions, required when running the CDK. The run-time permissions of the application are configured (and more restricted) in the CDK stack.
Once that's done, run the following command, replacing the placeholders with your Comprehend ARN and AWS CLI profile:
cdk --context ComprehendArn=YOUR_COMPREHEND_ARN_HERE --profile YOUR_AWS_CLI_PROFILE_HERE deploy
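Before running the deploy command, it can help to sanity-check the value you pass as ComprehendArn. ARNs follow the general shape arn:partition:service:region:account-id:resource, and Comprehend uses the resource prefixes document-classifier/ for a trained classifier and document-classifier-endpoint/ for an endpoint. The sketch below, with a hypothetical class name and a made-up example ARN, reports which kind of resource an ARN refers to. Which of the two the sample's stack expects is not stated here, so check the CDK code if in doubt.

```java
/** Hypothetical sketch: inspect a Comprehend ARN before passing it to the CDK. */
final class ComprehendArnSketch {

    /**
     * Returns the resource type of a Comprehend ARN (the part before the first '/'),
     * or null if the string does not look like a Comprehend ARN.
     */
    static String resourceType(String arn) {
        // arn:partition:service:region:account-id:resource
        String[] parts = arn.split(":", 6);
        if (parts.length != 6 || !parts[0].equals("arn") || !parts[2].equals("comprehend")) {
            return null;
        }
        int slash = parts[5].indexOf('/');
        return slash > 0 ? parts[5].substring(0, slash) : null;
    }

    public static void main(String[] args) {
        // Made-up region, account id and endpoint name, for illustration only.
        String arn = "arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/example";
        System.out.println(resourceType(arn)); // prints document-classifier-endpoint
    }
}
```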
Once the deployment completes, you'll see output like:
textract-comprehend-sample: deploying...
[0%] start: Publishing 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current
[100%] success: Published 12b0d822082799daf4b7651c9ab809985e721f2575d572943da86f53a055f29b:current
textract-comprehend-sample: creating CloudFormation changeset...
0/11 | 12:53:56 pm | UPDATE_IN_PROGRESS | AWS::CloudFormation::Stack | textract-comprehend-sample User Initiated
0/11 Currently in progress: textract-comprehend-sample
2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS | AWS::Lambda::Function | FirstPageFunction (FirstPageFunctionBEE9EA11)
2/11 | 12:54:35 pm | UPDATE_IN_PROGRESS | AWS::Lambda::Function | StartTextractFunction (StartTextractFunctionCD0185FF)
2/11 | 12:54:36 pm | UPDATE_COMPLETE | AWS::Lambda::Function | FirstPageFunction (FirstPageFunctionBEE9EA11)
2/11 | 12:54:36 pm | UPDATE_COMPLETE | AWS::Lambda::Function | StartTextractFunction (StartTextractFunctionCD0185FF)
3/11 | 12:54:39 pm | UPDATE_IN_PROGRESS | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)
3/11 | 12:54:41 pm | UPDATE_COMPLETE | AWS::StepFunctions::StateMachine | UploadFlow (UploadFlow6C932FD7)
5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS | AWS::Lambda::Function | S3UploadListener (S3UploadListener4E242122)
5/11 | 12:54:45 pm | UPDATE_IN_PROGRESS | AWS::Lambda::Function | TextractCompletion (TextractCompletion5AE7AEDD)
5/11 | 12:54:46 pm | UPDATE_COMPLETE | AWS::Lambda::Function | S3UploadListener (S3UploadListener4E242122)
5/11 | 12:54:47 pm | UPDATE_COMPLETE | AWS::Lambda::Function | TextractCompletion (TextractCompletion5AE7AEDD)
6/11 | 12:54:49 pm | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack | textract-comprehend-sample
6/11 | 12:54:50 pm | UPDATE_COMPLETE | AWS::CloudFormation::Stack | textract-comprehend-sample
✅ textract-comprehend-sample
After deployment, visit the deployed stack in the CloudFormation console. Check the different tabs for details on the stack including the different resources created.
When you upload a PDF (examples are in the samples directory) to the source S3 bucket that was created, a Step Functions execution will start if everything was set up correctly. Open that execution to follow the document through the flow and to view the Lambda logs.
Congratulations! You have successfully created a scalable, serverless application stack using the CDK to intelligently process documents on demand.
Note that for as long as you have the stack deployed, charges may apply to your account. You should delete the resources (using cdk destroy) when you are done with the sample. You will need to empty the buckets prior to deletion, and you will also need to terminate the Comprehend endpoint.