search-keras-gensim-elasticsearch

Search Engine using Word Embeddings, GloVe, Neural Networks, BART, and Elasticsearch

2 Architecture Overview

[Figure: architecture.png, architecture overview diagram]

2.1 Application

File              Description
app/embedding.py  Vectorization utilities.
app/summary.py    Summarization utilities.
app/score.py      Scoring model.
app/search.py     Elasticsearch search utilities.

2.2 Datasets

File             Description
data/songs.csv   Supervised scoring sample.
data/docs.txt    New documents to train the embedding function.
data/search.csv  Documents to be loaded into Elasticsearch.

2.3 Infrastructure

File                      Description
config/elasticsearch.yml  Elasticsearch configuration.
Dockerfile                Elasticsearch Docker file.

3 Instructions

3.1 Installing a Python virtual environment.

virtualenv .env
source .env/bin/activate
pip install -r requirements.txt

3.2 Downloading the NLTK resources.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')
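
To verify the downloads succeeded, a quick sanity check (this snippet is not part of the repository; it just exercises the punkt and tagger resources):

import nltk

# 'punkt' drives word_tokenize; 'averaged_perceptron_tagger' drives pos_tag.
tokens = nltk.word_tokenize("The grass was greener")
tags = nltk.pos_tag(tokens)
print(tokens)  # ['The', 'grass', 'was', 'greener']
print(tags)    # [('The', 'DT'), ('grass', 'NN'), ...]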

3.3 Compiling TensorFlow from source.

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
brew install bazel || sudo yum install bazel || sudo apt-get install bazel
python configure.py

3.4 Downloading GloVe.

wget https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip
unzip glove.6B.zip 
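
The repository's Embedding class presumably wraps this file; for reference, a minimal sketch of loading the raw GloVe vectors directly with gensim (assuming gensim >= 4.0, whose load_word2vec_format accepts no_header=True for the header-less GloVe text format):

from gensim.models import KeyedVectors

# GloVe files have no word2vec header line, hence no_header=True.
vectors = KeyedVectors.load_word2vec_format(
    'glove.6B.300d.txt', binary=False, no_header=True,
)
print(vectors['grass'].shape)        # (300,)
print(vectors.most_similar('grass'))  # nearest neighbors in embedding space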

3.5 Sampling new documents for the embeddings (data/docs.txt).

All you touch and all you see is all your life will ever be.
We're just two lost souls swimming in a fishbowl, year after year.
Tear down the wall.
The lunatic is on the grass.
Is there anybody out there?
We don't need no education.
Shine on you crazy diamond.
[...]

3.6 Training the GloVe Word2Vec model with more documents.

from app.embedding import Embedding
embedding: Embedding = Embedding()
embedding.load(path="glove.6B.300d.txt")
embedding.train("data/docs.txt")
embedding.get_embeddings('The grass was greener')
embedding.save(path='embedding.h5')
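
As an illustration of what these sentence embeddings enable, a sketch of comparing two documents by cosine similarity, continuing from the snippet above (that get_embeddings returns a fixed-length numpy vector is an assumption, not documented by the repository):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embedding.get_embeddings('The grass was greener')
v2 = embedding.get_embeddings('The lawn looked green')
print(cosine_similarity(v1, v2))  # closer to 1.0 means more similar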

3.7 Sampling a dataset of documents and queries (data/songs.csv).

doc,query,relevance
"Hanging on in quiet desperation is the English way",hanging,1
"Hanging on in quiet desperation is the English way",spanish,0
"All that is now, all that is gone",go,1
"All that is now, all that is gone",cry,0
"All that is now, all that is gone",wall,0
"Waiting for someone or something to show you the way",waiting,1
"Waiting for someone or something to show you the way",wait,1
"Waiting for someone or something to show you the way",run,0
"Can't keep my eyes from the circling skies","red ribbon",1
"Can't keep my eyes from the circling skies","keep eyes",1
"Can't keep my eyes from the circling skies","eyes",1
"Can't keep my eyes from the circling skies","mouth",0
[...]
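
A quick way to inspect the label balance before training (pandas is an assumption here; any CSV reader works):

import pandas as pd

df = pd.read_csv('data/songs.csv')
print(df.head())
print(df['relevance'].value_counts())  # positives vs. negatives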

3.8 Training a relevance model to predict scores for (document, query) pairs.

from app.embedding import Embedding
from app.score import Score
embedding: Embedding = Embedding()
embedding.load(path="glove.6B.300d.txt")
model: Score = Score()
model.train('data/songs.csv', embedding=embedding)
print(model.predict(query='green', doc='The grass was greener', embedding=embedding))
model.save('relevance.h5')
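
app/score.py is not shown here; as a rough sketch of what such a relevance model could look like in Keras, a network that concatenates the query and document vectors and predicts a binary relevance label (layer sizes and the 300-dimension input are illustrative assumptions, not the repository's actual architecture):

import tensorflow as tf

DIM = 300  # GloVe 6B.300d vector size

# Two inputs: one embedded query, one embedded document.
query_in = tf.keras.Input(shape=(DIM,), name='query')
doc_in = tf.keras.Input(shape=(DIM,), name='doc')

# Concatenate and push through a small feed-forward head.
x = tf.keras.layers.Concatenate()([query_in, doc_in])
x = tf.keras.layers.Dense(128, activation='relu')(x)
x = tf.keras.layers.Dense(32, activation='relu')(x)
out = tf.keras.layers.Dense(1, activation='sigmoid', name='relevance')(x)

model = tf.keras.Model(inputs=[query_in, doc_in], outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])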

3.9 Testing the summarization module

from app.summary import Summary
summary: Summary = Summary()
summary.load('facebook/bart-base')
print(summary.get_summary("Scientists have discovered a new species of dinosaur in the Amazon rainforest."))
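
The Summary class likely wraps Hugging Face transformers; a minimal sketch of the equivalent call with the library's summarization pipeline (that the wrapper uses the pipeline API is an assumption):

from transformers import pipeline

# facebook/bart-base is the same checkpoint loaded above.
summarizer = pipeline('summarization', model='facebook/bart-base')
result = summarizer(
    "Scientists have discovered a new species of dinosaur in the Amazon rainforest.",
    max_length=20,
    min_length=5,
)
print(result[0]['summary_text'])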

3.10 Building & running an Elasticsearch cluster.

docker build -t "my_es_image" .
docker run -p 127.0.0.1:9300:9200 -t "my_es_image"
curl -X GET "http://localhost:9300/"  
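
Note the port mapping: container port 9200 is published on host port 9300, so every client below talks to localhost:9300. A quick health check from Python (requests is an assumption; any HTTP client works):

import requests

# The docker run above maps container port 9200 to host port 9300.
resp = requests.get('http://localhost:9300/_cluster/health')
resp.raise_for_status()
print(resp.json()['status'])  # 'green' or 'yellow' on a single-node cluster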

3.11 Sampling documents to be loaded into Elasticsearch (data/search.csv).

id,title,text
1,Grass,The grass was greener
2,Light,The light was brighter
3,Sweet,The taste was sweeter
4,Friends,When friends surround
[...]

3.12 Indexing some documents.

from app.embedding import Embedding
from app.search import Search
from app.summary import Summary
summary: Summary = Summary()
summary.load('facebook/bart-base')
embedding: Embedding = Embedding()
embedding.load(path="glove.6B.300d.txt")
search: Search = Search(index='my_es_index2', protocol='http', host='localhost', port=9300)
search.init()
search.load('data/search.csv', embedding=embedding, summary=summary)
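
search.init() presumably creates the index; storing embeddings in Elasticsearch is typically done with a dense_vector field. A sketch of such a mapping using the official elasticsearch Python client (client version 8.x and the field names are assumptions; the repository's actual mapping lives in app/search.py):

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9300')
es.indices.create(
    index='my_es_index2',
    mappings={
        'properties': {
            'title': {'type': 'text'},
            'text': {'type': 'text'},
            'summary': {'type': 'text'},
            'embedding': {'type': 'dense_vector', 'dims': 300},
        }
    },
)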

3.13 Combining all the prediction models to get relevant results.

from app.embedding import Embedding
from app.search import Search
from app.score import Score
embedding: Embedding = Embedding()
embedding.load(path="glove.6B.300d.txt")
search: Search = Search(index='my_es_index2', protocol='http', host='localhost', port=9300)
score: Score = Score()
score.load('relevance.h5')
print(search.search(query='green', embedding=embedding, score=score, size=5))
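
Conceptually this is two-stage retrieval: Elasticsearch returns candidate documents, and the Keras relevance model re-ranks them. Since search.search already receives the score model, it may re-rank internally; if not, a manual re-ranking loop could look like this (assuming it returns an iterable of dicts with a 'text' field, which is a guess at the return shape):

query = 'green'
candidates = search.search(query=query, embedding=embedding, score=score, size=5)

# Re-rank the Elasticsearch candidates with the neural relevance model.
reranked = sorted(
    candidates,
    key=lambda hit: score.predict(query=query, doc=hit['text'], embedding=embedding),
    reverse=True,
)
for hit in reranked:
    print(hit)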
