The Similarity Vector Embedding project uses Natural Language Processing (NLP) and a vector database to find and recommend similar movies based on their descriptions and metadata. It stores embeddings from models such as BERT and Sentence Transformers in PostgreSQL with the pgvector extension, so similarity searches over large movie datasets run directly in SQL. Typical uses include recommendation engines, content discovery, and organizing large media collections.
Figure 1: Gradio App example.
Figure 2: Embedding generation process using Sentence Transformers.
Figure 3: Similarity search and recommendation pipeline with Qdrant.
- Python 3.8
- PostgreSQL
- pgvector Extension
- Jupyter Notebook
- Clone the Repository:
    git clone https://github.com/AlgoETS/SimilityVectorEmbedding.git
    cd SimilityVectorEmbedding
- Install Dependencies:
    pip install -r requirements.txt
- Set Up PostgreSQL with pgvector:
  - Install PostgreSQL: Download here
  - Install pgvector Extension:
      sudo apt install postgresql-14-pgvector
    Or build from source:
      git clone https://github.com/pgvector/pgvector.git
      cd pgvector
      make
      sudo make install
- Create Database and Enable pgvector:
    CREATE DATABASE movies_db;
    \c movies_db
    CREATE EXTENSION vector;
- Run the Jupyter Notebook:
    jupyter notebook
  Open Similarity_Vector_Embedding.ipynb and follow the instructions to generate embeddings, insert data, and perform similarity queries.
Figure 4: System architecture integrating PostgreSQL, pgvector, and NLP models.
Figure 5: Example of cosine similarity results for the movie "Inception".
Pgvector supports several distance metrics, including cosine distance (the <=> operator in SQL), which equals 1 minus cosine similarity. With this operator, cosine distances are computed directly within SQL queries, so a similarity search needs no vector math on the application side. Here’s how you can find similar movies based on cosine distance:
SELECT title, embedding
FROM movies
ORDER BY embedding <=> (SELECT embedding FROM movies WHERE title = %s) ASC
LIMIT 10;
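This query returns the ten rows whose embeddings have the smallest cosine distance to the given movie's embedding. Because it does not filter out the movie itself, that movie appears first with a distance of 0. The %s placeholder is filled in by the database driver (psycopg2 in this project).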
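To make the ordering concrete, here is a minimal pure-Python sketch of what the query computes. The helper names (cosine_distance, rank_by_cosine_distance) and the three-dimensional toy vectors are invented for illustration; they are not part of the project or of pgvector.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_by_cosine_distance(query_title, embeddings, limit=10):
    """Mimic ORDER BY embedding <=> (query embedding) ASC LIMIT n."""
    query_vec = embeddings[query_title]
    scored = [(title, cosine_distance(vec, query_vec))
              for title, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:limit]

# Toy vectors, made up for illustration; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "Inception": [0.9, 0.1, 0.3],
    "Interstellar": [0.8, 0.2, 0.4],
    "Toy Story": [0.1, 0.9, 0.2],
}

ranking = rank_by_cosine_distance("Inception", toy_embeddings, limit=3)
for title, distance in ranking:
    print(f"{title}: {distance:.4f}")
```

The query movie ranks first at a distance of (approximately) zero because nothing excludes it, followed by the vector pointing in the most similar direction.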
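Pgvector also supports other distance operators: L2 (Euclidean) distance with <->, L1 (Manhattan) distance with <+> (pgvector 0.7.0 and later), and negative inner product with <#>. Choose the operator that fits your query and the characteristics of your data:
- L2 Distance (Euclidean): Straight-line distance between vectors; sensitive to vector magnitude as well as direction.
- L1 Distance (Manhattan): Sum of the absolute coordinate differences; can be useful in high-dimensional data spaces.
- Dot Product: <#> returns the inner product multiplied by -1, so ascending order puts the largest dot products first. For embeddings normalized to unit length, this gives the same ranking as cosine distance.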
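As a rough illustration of how these metrics differ, the sketch below computes each one in plain Python on two small vectors. The function names and sample vectors are illustrative assumptions; in the database the same quantities come from pgvector's operators.

```python
import math

def l2_distance(a, b):
    """Euclidean distance (pgvector's <-> operator)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    """Manhattan distance (pgvector's <+> operator, 0.7.0 and later)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def negative_inner_product(a, b):
    """Inner product multiplied by -1 (pgvector's <#> operator)."""
    return -sum(x * y for x, y in zip(a, b))

# Sample vectors chosen so the results are easy to check by hand.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

print(l2_distance(a, b))             # 5.0
print(l1_distance(a, b))             # 7.0
print(negative_inner_product(a, b))  # -25.0
```

All three are "smaller is closer" quantities, which is why the negative sign on the inner product lets every operator be used with ORDER BY ... ASC.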
CREATE TABLE movies (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    year INT,
    country VARCHAR(100),
    language VARCHAR(100),
    duration INT,
    summary TEXT,
    genres TEXT[],
    director JSONB,
    screenwriters TEXT[],
    roles JSONB,
    poster_url TEXT,
    embedding VECTOR(384) -- Must match the model's output size: 384 for all-MiniLM-L6-v2, 768 for BERT-base models
);
{
    "title": "Inception",
    "year": "2010",
    "country": "USA",
    "language": "English",
    "duration": "148",
    "summary": "A skilled thief is given a chance at redemption if he can successfully perform an inception.",
    "genres": ["Action", "Sci-Fi", "Thriller"],
    "director": {"_id": "123456", "__text": "Christopher Nolan"},
    "screenwriters": ["Christopher Nolan"],
    "roles": [
        {"actor": {"_id": "78910", "__text": "Leonardo DiCaprio"}, "character": "Cobb"},
        {"actor": {"_id": "111213", "__text": "Joseph Gordon-Levitt"}, "character": "Arthur"}
    ],
    "poster_url": "https://m.media-amazon.com/images/I/51G8J1XnFQL._AC_SY445_.jpg",
    "id": "54321"
}
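In the sample record, year and duration are strings while the table declares them as INT, and names are nested under "__text". The following sketch shows one possible way to normalize a record before inserting it. The normalize_movie helper and its output keys (director_name, actors) are hypothetical and not part of the project.

```python
import json

# A shortened copy of the sample record, kept as raw JSON text.
raw = """
{
    "title": "Inception",
    "year": "2010",
    "duration": "148",
    "genres": ["Action", "Sci-Fi", "Thriller"],
    "director": {"_id": "123456", "__text": "Christopher Nolan"},
    "roles": [
        {"actor": {"_id": "78910", "__text": "Leonardo DiCaprio"}, "character": "Cobb"},
        {"actor": {"_id": "111213", "__text": "Joseph Gordon-Levitt"}, "character": "Arthur"}
    ]
}
"""

def normalize_movie(record):
    """Convert numeric strings to int and pull display names out of nested objects."""
    return {
        "title": record["title"],
        "year": int(record["year"]),
        "duration": int(record["duration"]),
        "genres": list(record.get("genres", [])),
        "director_name": record["director"]["__text"],
        "actors": [role["actor"]["__text"] for role in record.get("roles", [])],
    }

movie = normalize_movie(json.loads(raw))
print(movie["year"], movie["duration"], movie["director_name"], movie["actors"])
```

PostgreSQL will usually coerce a quoted number into an INT column on its own, so this step is optional; converting early mainly surfaces malformed values before they reach the database.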
Use Sentence Transformers to generate embeddings for movie descriptions:
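from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text):
    """Return the embedding of text as a plain list of floats."""
    return model.encode(text).tolist()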
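A vector column has a fixed dimension, and pgvector rejects values of any other length, so it can help to check the embedding size before it reaches the database. The sketch below is one way to do that while also formatting the list in pgvector's text form. The to_vector_literal helper and its expected_dim parameter are hypothetical names introduced here.

```python
def to_vector_literal(embedding, expected_dim):
    """Validate the length, then format floats as pgvector text, e.g. '[0.1,0.2,0.3]'."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

# A 3-dimensional toy vector; a real call would pass the model's output
# and the dimension declared on the embedding column.
literal = to_vector_literal([0.25, -0.5, 1.0], expected_dim=3)
print(literal)  # [0.25,-0.5,1.0]
```

A string in this form can be passed as a query parameter and cast in SQL with %s::vector.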
Populate the movies table with movie data and their embeddings:
import json
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="movies_db",
    user="your_username",
    password="your_password",
    host="localhost"
)
cursor = conn.cursor()

# Load movie data (a JSON array of records shaped like the sample above)
with open('movies.json', 'r', encoding='utf-8') as file:
    movies = json.load(file)

# Insert movies into the database; keys follow the sample record
for movie in movies:
    cursor.execute("""
        INSERT INTO movies (title, year, country, language, duration, summary, genres, director, screenwriters, roles, poster_url, embedding)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """, (
        movie['title'],
        movie['year'],
        movie['country'],
        movie['language'],
        movie['duration'],
        movie['summary'],
        movie['genres'],                 # Python list -> TEXT[]
        json.dumps(movie['director']),   # dict -> JSONB
        movie['screenwriters'],          # Python list -> TEXT[] (not a JSON string)
        json.dumps(movie['roles']),      # list of dicts -> JSONB
        movie['poster_url'],
        generate_embedding(movie['summary'])
    ))

conn.commit()
cursor.close()
conn.close()
Retrieve movies similar to a given title using cosine similarity:
import psycopg2

def find_similar_movies(movie_title, top_k=10):
    """Return the titles of the top_k movies closest to movie_title by cosine distance."""
    conn = psycopg2.connect(
        dbname="movies_db",
        user="your_username",
        password="your_password",
        host="localhost"
    )
    cursor = conn.cursor()
    # <=> is pgvector's cosine distance operator; smaller means more similar
    query = """
        SELECT title
        FROM movies
        WHERE title != %s
        ORDER BY embedding <=> (
            SELECT embedding FROM movies WHERE title = %s
        ) ASC
        LIMIT %s;
    """
    cursor.execute(query, (movie_title, movie_title, top_k))
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return [movie[0] for movie in results]

# Example usage
similar_movies = find_similar_movies("Inception")
print(similar_movies)
Figure 6: IMDb non-commercial datasets (https://developer.imdb.com/non-commercial-datasets/).
Model Name | Description | Source |
---|---|---|
BERT | Bidirectional Encoder Representations from Transformers. | BERT on Hugging Face |
Sentence Transformers | Models optimized for generating sentence-level embeddings. | Sentence Transformers |
all-MiniLM-L6-v2 | A lightweight and efficient Sentence Transformer model. | all-MiniLM-L6-v2 |
RoBERTa | A robustly optimized BERT pretraining approach. | RoBERTa on Hugging Face |
DistilBERT | A distilled version of BERT, smaller and faster while retaining performance. | DistilBERT on Hugging Face |
XLNet | Generalized autoregressive pretraining for language understanding. | XLNet on Hugging Face |
T5 | Text-to-Text Transfer Transformer for various NLP tasks. | T5 on Hugging Face |
Electra | Efficient pretraining that trains a discriminator to detect tokens replaced by a small generator. | Electra on Hugging Face |
Longformer | Transformer model optimized for long documents. | Longformer on Hugging Face |
MiniLM-L12-v2 | A compact and efficient model for sentence embeddings. | MiniLM-L12-v2 on Hugging Face |
SBERT DistilRoBERTa | A distilled version of RoBERTa for efficient sentence embeddings. | SBERT DistilRoBERTa on Hugging Face |
MPNet | Masked and Permuted Pre-training for Language Understanding. | MPNet on Hugging Face |
ERNIE | Enhanced Representation through Knowledge Integration. | ERNIE on Hugging Face |
DeBERTa | Decoding-enhanced BERT with disentangled attention. | DeBERTa on Hugging Face |
SBERT paraphrase-MiniLM-L6-v2 | A Sentence Transformer model fine-tuned for paraphrase identification. | paraphrase-MiniLM-L6-v2 on Hugging Face |
Personal Preference:
I prefer T5-small and the MiniLM series models because they balance embedding quality against computational cost.
- Cosine Similarity in NLP
- Visualizing Embeddings in 2D
- Qdrant Text Data Example
- Recommendation System with Qdrant
- Understanding Cosine Similarity
- Implementing Vector Databases
- Embedding Models Explained
- Advanced Embedding Techniques
- MTEB Leaderboard
- Open Source Embeddings Collection
- Qdrant Audio Data Example
- Research Paper on Embeddings
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or support, please open an issue on the GitHub repository or contact antoine@antoineboucher.info