Resume Info Extractor is a serverless application that automatically extracts information from PDF resumes uploaded to an Amazon S3 bucket. It uses AWS Lambda, the OpenAI API, and MongoDB Atlas to process resumes and store the resulting structured data. The project is bootstrapped with the AWS SAM CLI and includes a Lambda layer for the pdf-dist node module, which handles PDF extraction.
- Upload a PDF: Resumes are uploaded to an S3 bucket, which triggers a Lambda function.
- Automatic Processing: The Lambda function extracts data using the OpenAI API and stores it in MongoDB.
- Retrieve Candidate Info: Extracted details include:
  - Personal Info: Name, Email, Phone Number
  - Experience: Total years, Employment History
  - Education: Degrees, Schools, Timelines
  - Skills & Languages
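The flow above can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions, not the repository's actual code: `buildPrompt`, `parseCandidate`, and `processResume` are hypothetical names, and the record's field names are guesses modeled on the list above. The model call and the database write are passed in as plain functions, so the sketch assumes nothing about the OpenAI or MongoDB client APIs.

```javascript
// Hypothetical field names, modeled on the feature list above.
const REQUIRED_FIELDS = ["personalInfo", "experience", "education", "skills", "languages"];

// Build the instruction sent to the model together with the resume text.
function buildPrompt(resumeText) {
  return [
    "Extract the candidate's details from the resume below.",
    `Reply with a single JSON object with these keys: ${REQUIRED_FIELDS.join(", ")}.`,
    "Resume:",
    resumeText,
  ].join("\n");
}

// Parse the model's reply and check that every expected key is present.
function parseCandidate(replyText) {
  const candidate = JSON.parse(replyText); // throws on malformed JSON
  if (candidate === null || typeof candidate !== "object") {
    throw new Error("Model reply is not a JSON object");
  }
  const missing = REQUIRED_FIELDS.filter((field) => !(field in candidate));
  if (missing.length > 0) {
    throw new Error(`Model reply is missing fields: ${missing.join(", ")}`);
  }
  return candidate;
}

// callModel(prompt) -> Promise<string> and saveRecord(doc) -> Promise<void>
// are supplied by the caller, e.g. thin wrappers around the real clients.
async function processResume({ resumeText, sourceKey, callModel, saveRecord }) {
  const reply = await callModel(buildPrompt(resumeText));
  const record = { ...parseCandidate(reply), sourceKey };
  await saveRecord(record);
  return record;
}
```

Injecting `callModel` and `saveRecord` keeps the parsing and validation logic testable without network access; validating the reply before saving avoids storing a partial record when the model omits a field.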
- AWS account with appropriate permissions.
- SAM CLI installed locally: AWS SAM CLI Installation Guide
- MongoDB Atlas cluster URL (MONGO_URI) for storing extracted data.
- OpenAI API key for utilizing the OpenAI API (OPENAI_API_KEY).
- Clone the repository:
    git clone https://github.com/yaldram/resume-extractor.git
    cd resume-extractor
- Install dependencies and build the project using SAM CLI:
    npm install
    sam build
- Deploy the application using SAM CLI:
    sam deploy --guided
Set the following environment variables in your AWS Lambda environment:
- MONGO_URI: MongoDB Atlas cluster URL for storing extracted data.
- OPENAI_API_KEY: OpenAI API key for accessing the OpenAI API.
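Inside the function, these values would typically be read from `process.env`. The following is a minimal sketch, assuming a hypothetical `loadConfig` helper that is not taken from the repository, which fails fast when a variable is missing:

```javascript
// Names of the environment variables described above.
const REQUIRED_ENV_VARS = ["MONGO_URI", "OPENAI_API_KEY"];

// Hypothetical helper: returns the settings, or throws if any are unset or empty.
function loadConfig(env = process.env) {
  const missing = REQUIRED_ENV_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return { mongoUri: env.MONGO_URI, openAiApiKey: env.OPENAI_API_KEY };
}
```

Calling a helper like this once at module load, outside the handler, would surface a misconfiguration as soon as the function starts instead of midway through processing a resume.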
- extractor: Lambda function for PDF parsing and data extraction.
- pdfdist-layer/: Lambda layer containing the pdf-dist module to handle PDF extraction.
The SAM template.yml file configures the AWS resources and permissions the application requires. It creates the following resources:
- S3 Bucket (ResumeBucket):
  - Stores uploaded PDF resumes.
- Lambda Layer (PdfdistLayer):
  - Layer for the pdf-dist node module.
- Lambda Function (ResumeInfoExtracterFunction):
  - Extracts information from uploaded PDF resumes.
  - Triggered by S3 bucket upload events for .pdf files.
  - Uses PdfdistLayer for PDF extraction.
  - Requires read access to the specified S3 bucket (ResumeBucketName parameter).
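To illustrate the trigger, the sketch below pulls the bucket name and object key out of an S3 event notification and keeps only `.pdf` objects. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields are part of the S3 event format; the helper name `pdfObjectsFromEvent` and the sample bucket and keys are hypothetical, and the repository's handler may do this differently. Object keys arrive URL-encoded with spaces sent as `+`, so they are decoded before use.

```javascript
// Hypothetical helper: list the PDF objects referenced by an S3 event.
function pdfObjectsFromEvent(event) {
  return (event.Records ?? [])
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // Keys are URL-encoded in the event, with spaces sent as "+".
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
    }))
    // Case-insensitive suffix check, as a guard in addition to the trigger's filter.
    .filter(({ key }) => key.toLowerCase().endsWith(".pdf"));
}

// Example event with one PDF upload and one non-PDF upload (made-up names).
const sampleEvent = {
  Records: [
    { s3: { bucket: { name: "resume-bucket" }, object: { key: "inbox/Jane+Doe+CV.pdf" } } },
    { s3: { bucket: { name: "resume-bucket" }, object: { key: "inbox/notes.txt" } } },
  ],
};

console.log(pdfObjectsFromEvent(sampleEvent));
```

For the sample event, the helper returns a single entry, for the key `inbox/Jane Doe CV.pdf` in `resume-bucket`; the text file is skipped.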
If you encounter issues, please check the Issues section of this repository to see if the problem has already been reported. If not, please feel free to create a new issue.
Contributions are welcome! Please fork the repository and create a pull request with your changes. Ensure your code follows the project's coding standards and test your changes thoroughly before submitting.