Two methods are being tried.

Train the documents with tags.
For each test document, predict both its tag and its embedding vector, so that it can be classified and matched to the nearest document in the training set.

Use the Doc2Vec algorithm on the text extracted from each document with an OCR API.

The two documents whose Doc2Vec embeddings are most similar are treated as similar documents.

Crop the heading part of each image.
Find a pretrained feature-vector model online, on tfhub.dev or another source.
Run this pretrained model on all the templates (the training data) and store the resulting feature vectors.
Take any input from the input folder (the test set) and compute its feature vector.
Use a distance metric such as Euclidean or Manhattan distance to find which template image is nearest to the input.
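The final matching step above can be sketched in plain Python. This assumes the feature vectors have already been computed and stored as lists of floats keyed by template name; the names and values below are invented for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_template(query, template_vectors, metric=euclidean):
    """Return the name of the stored template whose vector is closest to the query."""
    return min(template_vectors, key=lambda name: metric(query, template_vectors[name]))


# Hypothetical stored template vectors and one input vector.
template_vectors = {
    "template_a": [0.0, 0.0, 1.0],
    "template_b": [1.0, 1.0, 0.0],
}
query = [0.9, 0.8, 0.1]

best_euclidean = nearest_template(query, template_vectors, euclidean)
best_manhattan = nearest_template(query, template_vectors, manhattan)
```

The two metrics agree here, but they can rank templates differently on high-dimensional feature vectors, so it is worth comparing both on real data.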

Add tags while training, return tags during prediction # Completed
Predict multiple files at the same time, return a dictionary of outputs
Find out whether gensim has a function for prediction, instead of manually calculating distances
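One possible shape for the multi-file prediction item is sketched below. `predict_many` and `fake_predict` are hypothetical names, not functions from this repository; the single-file predictor is passed in as a callable so that the sketch does not assume how OCR or the model is invoked.

```python
from pathlib import Path


def predict_many(paths, predict_one):
    """Run predict_one on every path and collect the outputs in a dictionary.

    Keys are file names; a failure on one file is recorded instead of
    aborting the whole batch.
    """
    results = {}
    for path in paths:
        name = Path(path).name
        try:
            results[name] = predict_one(path)
        except Exception as exc:
            results[name] = {"error": str(exc)}
    return results


def fake_predict(path):
    """Stand-in for the real single-file predictor (assumption for this demo)."""
    if not str(path).endswith(".png"):
        raise ValueError("unsupported file type")
    return {"tag": "invoice"}


outputs = predict_many(["input/a.png", "input/b.txt"], fake_predict)
```

Keying by file name assumes names are unique within a batch; if inputs can come from several folders, the full path would be a safer key.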
