Reducing Review Overhead with ML-based Application Screening

The Immigration and Nationality Act (INA) of the United States permits foreign workers to work in the country on a temporary or permanent basis. It also protects US workers from adverse impacts in the workplace and enforces strict hiring requirements when employers seek to fill workforce shortages with foreign employees. These immigration programs are managed by the Office of Foreign Labor Certification (OFLC).

Problem Statement

The OFLC processes job certification applications from organizations looking to bring foreign workers to the United States. With an increasing volume of applications, a machine learning model is needed to efficiently shortlist visa applicants.

This project builds a classification model that predicts whether a visa application should be approved or denied. Beyond predicting outcomes, the model also recommends profiles that are more likely to be accepted, thereby streamlining the decision-making process.

Data Collection

The dataset used in this project is provided by the Office of Foreign Labor Certification (OFLC).

Tech Stack

Python: Core programming language
MongoDB: Data storage
Evidently: Data validation and monitoring
Optuna: Hyperparameter tuning
MLflow: Experiment tracking
GitHub Actions: CI/CD pipeline
Docker: Containerization
AWS (EC2, ECR): Deployment platform

Exploratory Data Analysis

Initial Analysis Report

no_of_employees is right-skewed and has many outliers, which can be handled during feature engineering.
yr_of_estab is left-skewed, with some outliers below the lower bound of the box plot.
prevailing_wage is right-skewed, with outliers above the upper bound of the box plot.
There are no missing values in the dataset.
The case_id column can be dropped because it is unique for every row.
The case_status column is the prediction target.
Categorical columns can be converted to binary or numerical features during feature encoding.
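The box-plot bounds mentioned above are conventionally the 1.5 × IQR whiskers. The following is a minimal sketch of that rule using only the standard library. The sample values are made up, and capping is only one possible way to handle outliers; the source does not say which method the project uses.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) box-plot whisker bounds using the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(values, k=1.5):
    """Cap values that fall outside the IQR bounds (an assumed strategy)."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


# Made-up, right-skewed sample standing in for a column such as no_of_employees.
sample = [10, 12, 15, 18, 20, 22, 25, 30, 40, 5000]

lower, upper = iqr_bounds(sample)
outliers = [v for v in sample if v < lower or v > upper]
clipped = clip_outliers(sample)

print("bounds:", lower, upper)
print("outliers:", outliers)
```

Only the extreme value lies above the upper bound here, so capping changes that one entry and leaves the rest untouched.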

Final Analysis

The case_id column can be dropped as it is an ID.
The requires_job_training column can be dropped as it has little impact on the target variable, as shown by the visualizations and the chi-squared test.
The no_of_employees and prevailing_wage columns have outliers that should be handled.
The continent column has a few categories with very low counts, which can be grouped into an "Others" category.
The target column case_status is imbalanced and should be handled with a technique such as SMOTE before model building.
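To illustrate the idea behind SMOTE mentioned above, here is a simplified, standard-library sketch: each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is not the project's code. The function name `smote_like`, the toy feature values, and the class counts are all hypothetical, and in practice a library implementation (for example the `SMOTE` class in imbalanced-learn) would likely be used instead.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy numeric features for the minority class (made-up values).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
majority_count = 10  # hypothetical size of the majority class

new_points = smote_like(minority, n_new=majority_count - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority), "minority samples after oversampling")
```

Because the technique interpolates between feature vectors, it applies to numeric features, so it would normally run after categorical encoding and only on the training split.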

Flow of project

Data Ingestion: Load raw data from MongoDB and store it as artifacts.
Data Validation: Validate schema and detect data drift using Evidently.
Feature Engineering: Handle missing values, encode categorical variables, and scale numerical features.
Feature Selection: Apply multicollinearity analysis and chi-squared tests to select key features.
Model Training: Train various classification models and select the best base model.
Hyperparameter Tuning: Use Optuna to optimize model parameters.
Experiment Tracking: Log results and experiments with MLflow.
CI/CD: Automate evaluation and deployment using GitHub Actions.
Deployment: Use Docker to deploy the application on AWS EC2 via ECR.
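The feature selection step above relies on chi-squared tests. As a rough sketch of what such a test measures, the code below computes the Pearson chi-squared statistic for a contingency table of a categorical feature against the target. The counts are made up for illustration and are not taken from the project's dataset.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the target were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up counts: rows = feature value (Y / N), columns = outcome (approved / denied).
weak_feature = [[200, 100], [400, 200]]    # same approval rate in both rows
strong_feature = [[250, 50], [350, 250]]   # approval rate differs by row

weak_stat, dof = chi_squared_statistic(weak_feature)
strong_stat, _ = chi_squared_statistic(strong_feature)

# Chi-squared critical value at alpha = 0.05 with 1 degree of freedom
CRITICAL_5PCT_DOF1 = 3.841

print("weak feature kept:", weak_stat > CRITICAL_5PCT_DOF1)
print("strong feature kept:", strong_stat > CRITICAL_5PCT_DOF1)
```

A feature whose statistic stays below the critical value shows no significant association with the target, which is the kind of evidence the analysis cites for dropping requires_job_training.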

Pipeline Workflow

Installation

Clone repository:

git clone https://github.com/MNitin-Reddy/US-Visa-Approval-Prediction.git

Create a conda environment and install dependencies:

conda create -n venv python=3.8
conda activate venv
pip install -r requirements.txt

Run the pipeline (make sure the MLflow server is running so that experiments are tracked):

python demo.py

Run the web app:

python app.py

Check experiments using MLflow:

mlflow ui

CI/CD using GitHub Actions and Deployment with AWS

1. Login to AWS console.

2. Create IAM user for deployment

With specific access to:

1. Amazon EC2: AmazonEC2FullAccess

2. Amazon ECR: AmazonEC2ContainerRegistryFullAccess

(Copy the access key ID and secret access key for this user.)

3. Create an ECR repo to store the Docker image

- Save the URI: 315865595366.dkr.ecr.us-east-1.amazonaws.com/visarepo (this is a sample URI)

4. Create EC2 machine (Ubuntu)

5. Open EC2 and install Docker on the EC2 machine:

# optional

sudo apt-get update -y

sudo apt-get upgrade

# required

curl -fsSL https://get.docker.com -o get-docker.sh

sudo sh get-docker.sh

sudo usermod -aG docker ubuntu

newgrp docker

Description: About the deployment

1. Build a Docker image of the source code

2. Push the Docker image to ECR

3. Launch your EC2 instance

4. Pull your image from ECR onto EC2

5. Launch your Docker image on EC2

6. GitHub Actions

1. Configure EC2 as a self-hosted runner
    In your GitHub repo:
        Settings > Actions > Runners > New self-hosted runner > choose the OS > run the listed commands one by one

2. Create the workflow file .github/workflows/aws.yaml in the repo and copy in the CI/CD template for AWS deployment

7. GitHub secrets or environment variables to set up:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
ECR_REPO_URI
MONGODB_URI

Now every time we commit to the repo, GitHub Actions automatically deploys the new code to AWS.
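A missing secret is an easy way for a deployment like this to fail late. As a small, hypothetical pre-flight check (not part of the repository), the sketch below reports which of the settings listed above are unset or empty. The helper name `missing_settings` and the example values are assumptions made for illustration.

```python
import os

# Names taken from the list of secrets / environment variables above.
REQUIRED_SETTINGS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "ECR_REPO_URI",
    "MONGODB_URI",
]


def missing_settings(env=None):
    """Return the required names that are unset or empty in `env`
    (defaults to the current process environment)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with a partial, made-up environment: one value set, one left empty.
example_env = {"AWS_ACCESS_KEY_ID": "placeholder", "MONGODB_URI": ""}
missing = missing_settings(example_env)
print("Missing settings:", ", ".join(missing))
```

Calling `missing_settings()` with no argument checks the real environment, so the same function could run at application start-up or as an early workflow step.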

Conclusion

With an accuracy of 93%, the best-performing model in this project is K-Nearest Neighbors (KNN). After hyperparameter tuning using Optuna and handling target column imbalance using SMOTE, the optimal parameters were:

algorithm: "brute"
weights: "distance"
no_of_neighbors: 5

This model effectively predicted visa approvals by leveraging the most relevant features identified through multicollinearity checks and chi-squared tests.
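To make the reported parameters concrete, here is a plain-Python sketch of what they mean: "brute" compares the query against every training point, and "distance" weights each of the 5 nearest neighbours by the inverse of its distance. The training points and labels are made up. If the model was built with scikit-learn, which the source does not state explicitly, the equivalent configuration would be `KNeighborsClassifier(n_neighbors=5, weights="distance", algorithm="brute")`.

```python
import math
from collections import Counter, defaultdict


def knn_predict(train_X, train_y, query, k=5):
    """Brute-force k-NN with inverse-distance weighted voting."""
    # Brute force: measure the distance from the query to every training point.
    scored = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    nearest = scored[:k]
    # Exact matches (distance 0) decide the label among themselves.
    exact = [label for d, label in nearest if d == 0]
    if exact:
        return Counter(exact).most_common(1)[0][0]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / d  # closer neighbours get larger weights
    return max(votes, key=votes.get)


# Made-up 2-D points: two close "Approved" neighbours, three farther "Denied" ones.
train_X = [(1, 0), (0, 1), (3, 0), (0, 3), (3, 3)]
train_y = ["Approved", "Approved", "Denied", "Denied", "Denied"]

prediction = knn_predict(train_X, train_y, query=(0, 0), k=5)
print(prediction)
```

With uniform weights the three "Denied" neighbours would win the vote. With distance weighting the two much closer "Approved" points outweigh them, which is the behaviour the `weights: "distance"` setting selects.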

MLflow ensured comprehensive tracking of experiments, while the integration of CI/CD pipelines with GitHub Actions automated testing and deployment. The final solution is containerized using Docker and deployed seamlessly on AWS. MongoDB supports persistent data storage, ensuring reliability across deployments.
