This repository contains programs related to NLP.

It includes implementations of some research papers and some extensions of Hugging Face transformers for text similarity.
I tried a few approaches on a simple dataset, classifying each text as spam or ham.
I tried multiple strategies to produce the embeddings:
- TF_IDF
- Word2Vec
- Doc2Vec

I then tried a random forest on these embeddings, and an RNN with LSTM on tokenized sequences (text_to_sequence).
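
To make the first embedding strategy concrete, here is a minimal from-scratch sketch of TF-IDF weighting. It assumes whitespace tokenization, `tf = count / document length` and `idf = log(N / df)`. The notebooks in this repository may use a different variant, and libraries such as scikit-learn add smoothing and normalization, so their numbers will differ. The function name `tf_idf` and the toy messages are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes tf = count / doc_length and idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy messages, not taken from the dataset used here.
docs = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting now",
]
vectors = tf_idf(docs)
```

Terms that appear in only one message (such as `prize`) get a higher weight than terms shared across messages (such as `free`). A classifier like a random forest can use that signal once the dicts are turned into fixed-length vectors.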

The scores I got:

| Model | Precision | Recall | Accuracy |
| --- | --- | --- | --- |
| TF_IDF + RF | 0.99 | 0.78 | 0.97 |
| Word2Vec + RF | 0.46 | 0.24 | 0.87 |
| Doc2Vec + RF | 0.81 | 0.35 | 0.91 |
| RNN + text_to_sequence | 0.99 | 0.96 | 0.99 |
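
Accuracy can stay high while recall is low when one class is much rarer than the other. The sketch below shows how the three metrics are computed from predictions, assuming spam is the positive class. The labels are hypothetical and only illustrate the effect; they are not the data behind the scores above.

```python
def precision_recall_accuracy(y_true, y_pred, positive="spam"):
    """Compute precision, recall and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Hypothetical labels: 2 spam and 8 ham messages; one spam is missed.
y_true = ["spam", "spam"] + ["ham"] * 8
y_pred = ["spam", "ham"] + ["ham"] * 8

precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
```

Here half of the spam is missed (recall 0.5), yet accuracy is still 0.9 because most messages are ham. That is why recall is worth reading alongside accuracy.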

I also tuned the hyperparameters using different methods and libraries:

| Model | Time (min) | Accuracy |
| --- | --- | --- |
| Random forest (RF) | 2.4 | 0.97 |
| Grid Search CV | 25.6 | 0.97 |
| Pipeline | 10.9 | 0.95 |
| Skopt | 19.3 | 0.97 |
| Hyperopt | 28:12 | 0.95 |
| Optuna | 40 | 0.97 |

Optuna took the longest (40 min) and reached 0.97 accuracy, which matches the plain random forest, Grid Search CV and Skopt, and is better than Pipeline and Hyperopt (0.95).
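
To show why tuning takes longer than a single fit, here is a minimal sketch of exhaustive grid search with timing. It is not the code from the notebook. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up formula standing in for "train a random forest and return validation accuracy", so that the example runs without a dataset.

```python
import itertools
import time


def grid_search(score_fn, param_grid):
    """Try every combination in param_grid.

    Returns (best_params, best_score, elapsed_seconds).
    """
    start = time.perf_counter()
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, time.perf_counter() - start


# Made-up stand-in for model training; peaks at n_estimators=100, max_depth=20.
def toy_score(n_estimators, max_depth):
    return 1.0 - abs(n_estimators - 100) / 1000 - abs(max_depth - 20) / 100


param_grid = {"n_estimators": [50, 100, 200], "max_depth": [10, 20, 30]}
best_params, best_score, seconds = grid_search(toy_score, param_grid)
```

With 3 values for each of 2 parameters the model is evaluated 9 times, so the cost grows with the size of the grid. Libraries such as Skopt, Hyperopt and Optuna choose which combinations to try instead of enumerating all of them.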

