Resume Info Extractor is a serverless application that automatically extracts information from PDF resumes uploaded to an Amazon S3 bucket. It uses AWS Lambda, the OpenAI API, and MongoDB Atlas to process resumes and store the resulting structured data. The project is bootstrapped with the AWS SAM CLI and includes a Lambda layer for the pdf-dist node module, which handles PDF extraction.
Before you begin, ensure you have the following:
- AWS account with appropriate permissions.
- AWS SAM CLI installed locally (see the AWS SAM CLI Installation Guide).
- MongoDB Atlas cluster URL (MONGO_URI) for storing extracted data.
- OpenAI API key for utilizing the OpenAI API (OPENAI_API_KEY).
- Clone the repository:
git clone https://github.com/yaldram/resume-extractor.git
cd resume-extractor
- Install dependencies and build the project using SAM CLI:
sam build
- Deploy the application using SAM CLI:
sam deploy --guided
- Upload PDF resumes to the designated S3 bucket.
- The uploaded resumes trigger the Lambda function, which utilizes the OpenAI API to extract structured data, including candidate name, experience, companies worked for, skillset, and languages.
- Extracted data is stored in MongoDB Atlas.
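As a rough illustration of the structured-data step above, the sketch below turns a JSON string (such as a model response) into a flat document ready to be stored. The helper names `toStringArray` and `toResumeDocument` and the field names `name`, `experience`, `companies`, `skills`, and `languages` are assumptions made for this example; the repository's actual schema and prompt may differ.

```javascript
// Hypothetical helpers: names and fields are illustrative, not the project's schema.

// Accept either an array or a comma-separated string and drop empty entries.
function toStringArray(value) {
  const items = Array.isArray(value) ? value : String(value ?? "").split(",");
  return items
    .map((item) => String(item).trim())
    .filter((item) => item.length > 0);
}

// Parse the raw JSON text and coerce it into a predictable document shape.
function toResumeDocument(rawJson, sourceKey) {
  let parsed;
  try {
    parsed = JSON.parse(rawJson);
  } catch (err) {
    throw new Error(`Response for ${sourceKey} is not valid JSON`);
  }
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new Error(`Response for ${sourceKey} is not a JSON object`);
  }
  return {
    sourceKey, // S3 object key the data came from
    name: String(parsed.name ?? "").trim(),
    experience: String(parsed.experience ?? "").trim(),
    companies: toStringArray(parsed.companies),
    skills: toStringArray(parsed.skills),
    languages: toStringArray(parsed.languages),
  };
}
```

Normalizing before the insert keeps every stored document in the same shape even when the model omits a field or returns a list as a single string.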
Set the following environment variables in your AWS Lambda environment:
- MONGO_URI: MongoDB Atlas cluster URL for storing extracted data.
- OPENAI_API_KEY: OpenAI API key for accessing the OpenAI API.
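A minimal sketch of how a Node.js function might read these two variables and fail fast when one is missing. The helper name `loadConfig` and the returned property names are assumptions for illustration and are not taken from the repository.

```javascript
// Hypothetical helper: reads the required settings from an environment object.
// Defaults to process.env, but accepts any plain object to simplify testing.
function loadConfig(env = process.env) {
  const required = ["MONGO_URI", "OPENAI_API_KEY"];
  // Treat unset and blank values the same way.
  const missing = required.filter(
    (name) => typeof env[name] !== "string" || env[name].trim() === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return { mongoUri: env.MONGO_URI, openAiApiKey: env.OPENAI_API_KEY };
}
```

Calling a check like this once at startup surfaces a misconfigured function immediately, rather than on the first database or API call.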
When deploying the application using SAM CLI (sam deploy --guided), make sure to provide a unique name for your S3 bucket. S3 bucket names must be globally unique across AWS accounts.
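Global uniqueness can only be confirmed by AWS at deploy time, but the basic naming rules can be checked locally before you type a name into the guided prompt. The sketch below covers a subset of the documented S3 rules (length, allowed characters, first and last character, no adjacent periods, not shaped like an IP address); the function name `isPlausibleBucketName` is hypothetical.

```javascript
// Hypothetical pre-check covering a subset of the S3 bucket naming rules.
// It cannot tell whether the name is already taken by another account.
function isPlausibleBucketName(name) {
  // Names must be between 3 and 63 characters long.
  if (typeof name !== "string" || name.length < 3 || name.length > 63) return false;
  // Lowercase letters, digits, dots, and hyphens; must start and end with a letter or digit.
  if (!/^[a-z0-9][a-z0-9.-]*[a-z0-9]$/.test(name)) return false;
  // Two adjacent periods are not allowed.
  if (name.includes("..")) return false;
  // Names formatted like an IPv4 address are not allowed.
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(name)) return false;
  return true;
}
```

Adding a personal or project-specific suffix to the name is a common way to reduce the chance of a collision.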
After deploying the application, navigate to the AWS Lambda Management Console. Locate the deployed Lambda function handling the resume extraction. Inside the function configuration, add the following environment variables:
- MONGO_URI: MongoDB Atlas cluster URL for storing extracted data.
- OPENAI_API_KEY: OpenAI API key for accessing the OpenAI API.
- extractor: Lambda function that handles PDF extraction, calls the OpenAI API, and adds the extracted information to MongoDB.
- pdfdist-layer/: Lambda layer containing the pdf-dist module used for PDF extraction.
The SAM template.yml file configures the AWS resources and permissions required for the application. It creates the following resources:
- S3 Bucket (ResumeBucket):
  - Stores uploaded PDF resumes.
- Lambda Layer (PdfdistLayer):
  - Layer for the pdf-dist node module.
- Lambda Function (ResumeInfoExtracterFunction):
  - Extracts information from uploaded PDF resumes.
  - Triggered by S3 bucket upload events for .pdf files.
  - Uses PdfdistLayer for PDF extraction.
  - Requires read access to the specified S3 bucket (ResumeBucketName parameter).
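To make the S3 trigger more concrete, here is a sketch of how a handler might pull the bucket name and object key out of an S3 event notification. The function name `pdfObjectsFromEvent` is hypothetical and the repository's handler may be organized differently. The extra `.pdf` check is a defensive assumption that duplicates the filter on the trigger itself.

```javascript
// Hypothetical helper: lists the PDF objects referenced by an S3 event.
function pdfObjectsFromEvent(event) {
  const records = Array.isArray(event?.Records) ? event.Records : [];
  return records
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // S3 URL-encodes object keys in event notifications, with spaces sent as "+".
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
    }))
    // Keep only keys ending in .pdf, ignoring case.
    .filter((object) => object.key.toLowerCase().endsWith(".pdf"));
}
```

Decoding the key matters because a file uploaded as `Jane Doe.pdf` arrives in the event as `Jane+Doe.pdf`, and reading the object with the undecoded key would fail.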
If you encounter issues, please check the Issues section of this repository to see if the problem has already been reported. If not, please feel free to create a new issue.
Contributions are welcome! Please fork the repository and create a pull request with your changes. Ensure your code follows the project's coding standards and test your changes thoroughly before submitting.