The Immigration and Nationality Act (INA) of the United States permits foreign workers to work in the country temporarily or permanently. It also protects US workers from adverse impacts in the workplace and enforces strict hiring requirements when employers seek to fill workforce shortages with foreign employees. These immigration programs are managed by the Office of Foreign Labor Certification (OFLC).
The OFLC processes job certification applications from organizations looking to bring foreign workers to the United States. With an increasing volume of applications, a machine learning model is needed to efficiently shortlist visa applicants.
- This project builds a classification model that determines whether a visa application should be approved or denied. The model not only forecasts outcomes but also highlights the profiles that are more likely to be accepted, thereby streamlining the decision-making process.
The dataset used in this project is provided by the Office of Foreign Labor Certification (OFLC).
- Python: Core programming language
- MongoDB: Data storage
- Evidently: Data validation and monitoring
- Optuna: Hyperparameter tuning
- MLflow: Experiment tracking
- GitHub Actions: CI/CD pipeline
- Docker: Containerization
- AWS (EC2, ECR): Deployment platform
- no_of_employees is right-skewed and has many outliers, which can be handled during feature engineering.
- yr_of_estab is left-skewed, with some outliers below the lower bound of the box plot.
- prevailing_wage is right-skewed, with outliers above the upper bound of the box plot.
- There are no missing values in the dataset.
- The case_id column can be deleted because every row has a unique value.
- The case_status column is the target variable.
- Categorical features can be converted to binary/numerical values during feature encoding.
- The case_id column can be dropped as it is only an ID.
- The requires_job_training column can be dropped as it has little impact on the target variable, as demonstrated in the visualizations and the chi-squared test.
- The no_of_employees and prevailing_wage columns have outliers that should be handled.
- The continent column has a few categories with minimal counts, which can be grouped as "Others".
- The target column case_status is imbalanced and should be handled with techniques like SMOTE before model building (see the sketch below).
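Since case_status is imbalanced, here is a minimal sketch of how SMOTE from imbalanced-learn could rebalance the training split. The CSV path, label mapping, and quick one-hot encoding are illustrative assumptions rather than the project's exact pipeline:

```python
# Sketch: rebalance the training data with SMOTE after a stratified split.
# The file path, label values, and quick get_dummies encoding are illustrative.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("visa_data.csv")                                # illustrative path
X = pd.get_dummies(df.drop(columns=["case_id", "case_status"]))  # quick numeric encoding
y = (df["case_status"] == "Certified").astype(int)               # 1 = Certified, 0 = Denied

# Split first so SMOTE only sees the training data and never the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(y_train.value_counts(), y_train_res.value_counts(), sep="\n")
```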
- Data Ingestion: Load and store raw data from MongoDB to artifacts (see the ingestion sketch after this list).
- Data Validation: Validate the schema and detect data drift using Evidently.
- Feature Engineering: Handle missing values, encode categorical variables, and scale numerical features.
- Feature Selection: Apply multicollinearity analysis and chi-squared tests to select key features.
- Model Training: Train various classification models and find the best base model.
- Hyperparameter Tuning: Use Optuna to optimize model parameters (see the tuning sketch after this list).
- Experiment Tracking: Log results and experiments with MLflow.
- CI/CD: Automate evaluation and deployment using GitHub Actions.
- Deployment: Use Docker to deploy the application on AWS EC2 via ECR.
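A minimal sketch of the data-ingestion step, assuming pymongo reads the raw collection and the connection string is supplied via the MONGODB_URI environment variable (as in the GitHub secrets below); the database, collection, and artifact paths are illustrative assumptions:

```python
# Data ingestion sketch: load the raw records from MongoDB and store them
# as a CSV artifact. Database, collection, and paths below are illustrative.
import os

import pandas as pd
from pymongo import MongoClient

client = MongoClient(os.environ["MONGODB_URI"])
collection = client["us_visa"]["visa_data"]       # hypothetical database/collection names
records = list(collection.find({}, {"_id": 0}))   # drop Mongo's internal _id field

df = pd.DataFrame(records)
os.makedirs("artifacts/data_ingestion", exist_ok=True)
df.to_csv("artifacts/data_ingestion/raw_data.csv", index=False)
print(f"Ingested {len(df)} records")
```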
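And a minimal sketch of the hyperparameter-tuning and experiment-tracking steps, combining Optuna and MLflow. The search space, metric, and experiment name are illustrative, and X_train_res, y_train_res, X_test, y_test are assumed to come from the earlier split-and-SMOTE sketch:

```python
# Tuning + tracking sketch: Optuna searches KNN hyperparameters and every
# trial is logged to MLflow. Search space and metric are illustrative.
import mlflow
import optuna
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

mlflow.set_experiment("us-visa-approval")        # hypothetical experiment name

def objective(trial):
    params = {
        "n_neighbors": trial.suggest_int("n_neighbors", 3, 15),
        "weights": trial.suggest_categorical("weights", ["uniform", "distance"]),
        "algorithm": trial.suggest_categorical("algorithm", ["auto", "brute"]),
    }
    model = KNeighborsClassifier(**params)
    model.fit(X_train_res, y_train_res)          # SMOTE-resampled training data
    score = f1_score(y_test, model.predict(X_test))

    with mlflow.start_run():                     # one MLflow run per trial
        mlflow.log_params(params)
        mlflow.log_metric("f1_score", score)
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best parameters:", study.best_params)
```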
Clone repository:
git clone https://github.com/MNitin-Reddy/US-Visa-Approval-Prediction.git
Create a Python virtual environment and install dependencies:
conda create -n venv python=3.8
conda activate venv
pip install -r requirements.txt
Run the pipeline (make sure the MLflow server is running so experiments are tracked):
python demo.py
Run web app:
python app.py
Check experiments using MLflow:
mlflow ui
# With specific access to:
1. Amazon EC2: AmazonEC2FullAccess
2. Amazon ECR: AmazonEC2ContainerRegistryFullAccess
(Copy the Access key and Secret Access Key for the user)
- Save the ECR repository URI, e.g. 315865595366.dkr.ecr.us-east-1.amazonaws.com/visarepo (this is a sample URI).
# optional
sudo apt-get update -y
sudo apt-get upgrade
# required
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker ubuntu
newgrp docker
1. Build a Docker image of the source code
2. Push the Docker image to ECR
3. Launch your EC2 instance
4. Pull the image from ECR onto the EC2 instance
5. Run the Docker image on EC2
1. Configure EC2 as a self-hosted runner.
   In your GitHub repo: Settings > Actions > Runners > New self-hosted runner > choose the OS > then run the commands one by one.
2. In the repo, create .github/workflows/aws.yaml and copy the CI/CD template for AWS deployment.
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_DEFAULT_REGION
- ECR_REPO_URI
- MONGODB_URI
Now, every time we commit to the repo, GitHub Actions automatically deploys the new code to AWS.
With an accuracy of 93%, the best-performing model in this project is K-Nearest Neighbors (KNN). After hyperparameter tuning using Optuna and handling target column imbalance using SMOTE, the optimal parameters were:
- algorithm: "brute"
- weights: "distance"
- no_of_neighbors: 5

This model effectively predicted visa approvals by leveraging the most relevant features identified through multicollinearity checks and chi-squared tests; a short sketch of this configuration is shown below.
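For reference, a minimal sketch of how these parameters map onto scikit-learn's KNeighborsClassifier (no_of_neighbors corresponds to n_neighbors); the training and test variables are assumed to come from the earlier split-and-SMOTE sketch:

```python
# Final-model sketch: the reported best KNN parameters in scikit-learn terms.
from sklearn.neighbors import KNeighborsClassifier

best_knn = KNeighborsClassifier(
    n_neighbors=5,        # reported as no_of_neighbors above
    weights="distance",
    algorithm="brute",
)
best_knn.fit(X_train_res, y_train_res)   # SMOTE-resampled training data from the earlier sketch
print(f"Test accuracy: {best_knn.score(X_test, y_test):.2%}")
```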
MLflow ensured comprehensive tracking of experiments, while the integration of CI/CD pipelines with GitHub Actions automated testing and deployment. The final solution is containerized using Docker and deployed seamlessly on AWS. MongoDB supports persistent data storage, ensuring reliability across deployments.