Similarity Vector Embedding

Overview

The Similarity Vector Embedding project uses Natural Language Processing (NLP) and a vector database to find and recommend similar movies based on their descriptions and metadata. It stores embeddings produced by models such as BERT and Sentence Transformers in PostgreSQL with the pgvector extension, so similarity searches over large movie datasets run directly in SQL. Typical uses include recommendation engines, content discovery, and organizing large media collections.

Workflow Animation 1

Figure 1: Gradio App example.

Workflow Animation 2

Figure 2: Embedding generation process using Sentence Transformers.

Workflow Animation 3

Figure 3: Similarity search and recommendation pipeline with Qdrant.

Getting Started

Prerequisites

  • Python 3.8
  • PostgreSQL
  • pgvector Extension
  • Jupyter Notebook

Installation

  1. Clone the Repository:

    git clone https://github.com/AlgoETS/SimilityVectorEmbedding.git
    cd SimilityVectorEmbedding
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Set Up PostgreSQL with pgvector:

    • Install PostgreSQL: Download here
    • Install pgvector Extension:
      sudo apt install postgresql-14-pgvector
      Or build from source:
      git clone https://github.com/pgvector/pgvector.git
      cd pgvector
      make
      sudo make install
  4. Create Database and Enable pgvector:

    CREATE DATABASE movies_db;
    \c movies_db
    CREATE EXTENSION vector;
  5. Run the Jupyter Notebook:

    jupyter notebook

    Open Similarity_Vector_Embedding.ipynb and follow the instructions to generate embeddings, insert data, and perform similarity queries.

Usage

Architecture Diagram

Figure 4: System architecture integrating PostgreSQL, pgvector, and NLP models.

Similarity Search Example

Figure 5: Example of cosine similarity results for the movie "Inception".

Implementing Cosine Similarity in PostgreSQL with pgvector

pgvector supports several distance metrics, including cosine distance (the <=> operator in SQL). Cosine distance is 1 minus cosine similarity, so smaller values mean more similar vectors. Because the operator runs inside SQL queries, similarity searches can be done without moving embeddings out of the database. The following query finds the movies closest to a given one (%s is a query parameter holding the movie title):

SELECT title, embedding
FROM movies
ORDER BY embedding <=> (SELECT embedding FROM movies WHERE title = %s) ASC
LIMIT 10;

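This query returns the ten movies whose embeddings have the smallest cosine distance to the given movie's embedding. Because it does not filter on the title, the given movie itself is included in the results (normally first, at distance 0).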
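
To make the ordering concrete, here is a minimal pure-Python sketch of the quantity the <=> operator computes (1 minus cosine similarity). The function name and the toy 3-dimensional vectors are illustrative assumptions, not part of this project's code.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # This sketch refuses zero vectors because the angle is undefined.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (illustrative values only).
reference = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_distance(reference, same_direction))  # 0.0
print(cosine_distance(reference, orthogonal))      # 1.0
```

Vectors pointing the same way have distance 0 regardless of length, which is why ORDER BY ... ASC puts the most similar movies first.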

Other Distance Functions Supported by pgvector

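pgvector also supports other distance metrics: L2 (Euclidean), L1 (Manhattan), and dot product. Choose one based on the needs of your query and the characteristics of your data:

  • L2 Distance (Euclidean, <-> operator): Measures the straight-line distance between vectors, so it is sensitive to vector magnitude.
  • L1 Distance (Manhattan, <+> operator): Sums the absolute per-dimension differences; sometimes preferred for high-dimensional data.
  • Dot Product (<#> operator): Returns the negative inner product, so ascending order ranks the largest inner product first. For embeddings normalized to unit length, this gives the same ranking as cosine distance.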
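
As a rough illustration of how these metrics differ, the sketch below computes each one in plain Python and maps it to the pgvector operator it corresponds to (<-> for L2, <+> for L1, <#> for negative inner product). The function names and sample vectors are assumptions made for this example, and <+> is only available in recent pgvector releases.

```python
import math

def l2_distance(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    """Manhattan distance: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def negative_inner_product(a, b):
    """Negated dot product, so smaller values mean a larger inner product."""
    return -sum(x * y for x, y in zip(a, b))

# Mapping from pgvector operator to the equivalent Python function.
PGVECTOR_OPERATORS = {
    "<->": l2_distance,
    "<+>": l1_distance,
    "<#>": negative_inner_product,
}

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
for operator, metric in PGVECTOR_OPERATORS.items():
    print(operator, metric(a, b))
# <-> 5.0
# <+> 7.0
# <#> -25.0
```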

Database Schema

CREATE TABLE movies (
    id SERIAL PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    year INT,
    country VARCHAR(100),
    language VARCHAR(100),
    duration INT,
    summary TEXT,
    genres TEXT[],
    director JSONB,
    screenwriters TEXT[],
    roles JSONB,
    poster_url TEXT,
    embedding VECTOR(768) -- Must match the model's output dimension (e.g. 768 for BERT-base, 384 for all-MiniLM-L6-v2)
);

Data Example

{
  "title": "Inception",
  "year": "2010",
  "country": "USA",
  "language": "English",
  "duration": "148",
  "summary": "A skilled thief is given a chance at redemption if he can successfully perform an inception.",
  "genres": ["Action", "Sci-Fi", "Thriller"],
  "director": {"_id": "123456", "__text": "Christopher Nolan"},
  "screenwriters": ["Christopher Nolan"],
  "roles": [
    {"actor": {"_id": "78910", "__text": "Leonardo DiCaprio"}, "character": "Cobb"},
    {"actor": {"_id": "111213", "__text": "Joseph Gordon-Levitt"}, "character": "Arthur"}
  ],
  "poster_url": "https://m.media-amazon.com/images/I/51G8J1XnFQL._AC_SY445_.jpg",
  "id": "54321"
}
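
The record above stores year and duration as strings and nests director and roles as objects, while the schema expects INT and JSONB columns. The sketch below shows one way to convert a record into a row tuple in column order; movie_to_row is a hypothetical helper, not a function from this repository, and it assumes every key in the example is present.

```python
import json

def movie_to_row(movie, embedding):
    """Convert a movie record into values ordered like the INSERT column list."""
    return (
        movie["title"],
        int(movie["year"]),            # "2010" -> 2010 for the INT column
        movie["country"],
        movie["language"],
        int(movie["duration"]),        # "148" -> 148 for the INT column
        movie["summary"],
        list(movie["genres"]),         # Python list for the TEXT[] column
        json.dumps(movie["director"]), # JSON text for the JSONB column
        list(movie["screenwriters"]),  # Python list for the TEXT[] column
        json.dumps(movie["roles"]),    # JSON text for the JSONB column
        movie["poster_url"],
        embedding,
    )

sample = {
    "title": "Inception",
    "year": "2010",
    "country": "USA",
    "language": "English",
    "duration": "148",
    "summary": "A skilled thief is given a chance at redemption.",
    "genres": ["Action", "Sci-Fi", "Thriller"],
    "director": {"_id": "123456", "__text": "Christopher Nolan"},
    "screenwriters": ["Christopher Nolan"],
    "roles": [{"actor": {"_id": "78910", "__text": "Leonardo DiCaprio"}, "character": "Cobb"}],
    "poster_url": "https://m.media-amazon.com/images/I/51G8J1XnFQL._AC_SY445_.jpg",
}

row = movie_to_row(sample, embedding=[0.0, 0.0, 0.0])  # placeholder embedding
print(row[0], row[1], row[4])  # Inception 2010 148
```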

Generating Embeddings

Use Sentence Transformers to generate embeddings for movie descriptions:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def generate_embedding(text):
    return model.encode(text).tolist()
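
Note that all-MiniLM-L6-v2 produces 384-dimensional vectors, while the schema above declares VECTOR(768), so the column dimension and the model need to agree before inserting. A small guard like the following can catch a mismatch early; check_dimension is a hypothetical helper introduced for illustration.

```python
def check_dimension(embedding, expected_dim):
    """Return the embedding unchanged, or raise if its length is unexpected."""
    actual_dim = len(embedding)
    if actual_dim != expected_dim:
        raise ValueError(
            f"embedding has {actual_dim} dimensions, "
            f"but the column expects {expected_dim}"
        )
    return embedding

# A 384-dimensional placeholder stands in for a real model output here.
fake_embedding = [0.0] * 384
check_dimension(fake_embedding, 384)       # passes silently

try:
    check_dimension(fake_embedding, 768)   # mismatch with VECTOR(768)
except ValueError as error:
    print(error)
```

With a loaded model, the expected dimension can be read from model.get_sentence_embedding_dimension() rather than hard-coded.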

Inserting Data into the Database

Populate the movies table with movie data and their embeddings:

import json
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    dbname="movies_db",
    user="your_username",
    password="your_password",
    host="localhost"
)
cursor = conn.cursor()

# Load movie data
with open('movies.json', 'r') as file:
    movies = json.load(file)

# Insert movies into the database.
# Assumes movies.json uses the keys shown in the Data Example section.
for movie in movies:
    cursor.execute("""
        INSERT INTO movies (title, year, country, language, duration, summary, genres, director, screenwriters, roles, poster_url, embedding)
        VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    """, (
        movie['title'],
        movie['year'],
        movie['country'],
        movie['language'],
        movie['duration'],
        movie['summary'],
        movie['genres'],                 # Python list -> TEXT[]
        json.dumps(movie['director']),   # dict -> JSONB
        movie['screenwriters'],          # Python list -> TEXT[] (not JSON text)
        json.dumps(movie['roles']),      # list of dicts -> JSONB
        movie['poster_url'],
        generate_embedding(movie['summary'])
    ))

conn.commit()
cursor.close()
conn.close()

Finding Similar Movies

Retrieve movies similar to a given title using cosine similarity:

import psycopg2

def find_similar_movies(movie_title, top_k=10):
    conn = psycopg2.connect(
        dbname="movies_db",
        user="your_username",
        password="your_password",
        host="localhost"
    )
    cursor = conn.cursor()
    query = """
    SELECT title
    FROM movies
    WHERE title != %s
    ORDER BY embedding <=> (
        SELECT embedding FROM movies WHERE title = %s
    ) ASC
    LIMIT %s;
    """
    cursor.execute(query, (movie_title, movie_title, top_k))
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return [movie[0] for movie in results]

# Example usage
similar_movies = find_similar_movies("Inception")
print(similar_movies)
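
The same query pattern can also rank movies against free text instead of an existing title. The sketch below is one possible approach, not part of this repository: search_by_text and to_vector_literal are hypothetical names, the cursor is any DB-API cursor such as the psycopg2 one above, and embed is any function that returns a list of floats (for example generate_embedding). The embedding is passed in pgvector's text form and cast with ::vector.

```python
def to_vector_literal(embedding):
    """Format floats as pgvector's text representation, e.g. '[0.1,0.2,3.0]'."""
    return "[" + ",".join(str(float(value)) for value in embedding) + "]"

def search_by_text(cursor, text, embed, top_k=10):
    """Return (title, distance) rows for the movies closest to the text."""
    query = """
        SELECT title, embedding <=> %s::vector AS distance
        FROM movies
        ORDER BY distance ASC
        LIMIT %s;
    """
    cursor.execute(query, (to_vector_literal(embed(text)), top_k))
    return cursor.fetchall()

print(to_vector_literal([0.1, 0.2, 3]))  # [0.1,0.2,3.0]
```

A call might look like search_by_text(cursor, "a heist inside dreams", generate_embedding, top_k=5), assuming an open connection and the movies table described earlier.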

IMDb Dataset

Data source: https://developer.imdb.com/non-commercial-datasets/

System Architecture

Figure 6: IMDb.

Language Models Used

| Model Name | Description | Source |
|---|---|---|
| BERT | Bidirectional Encoder Representations from Transformers. | BERT on Hugging Face |
| Sentence Transformers | Models optimized for generating sentence-level embeddings. | Sentence Transformers |
| all-MiniLM-L6-v2 | A lightweight and efficient Sentence Transformer model. | all-MiniLM-L6-v2 |
| RoBERTa | A robustly optimized BERT pretraining approach. | RoBERTa on Hugging Face |
| DistilBERT | A distilled version of BERT, smaller and faster while retaining most of its performance. | DistilBERT on Hugging Face |
| XLNet | Generalized autoregressive pretraining for language understanding. | XLNet on Hugging Face |
| T5 | Text-to-Text Transfer Transformer for various NLP tasks. | T5 on Hugging Face |
| Electra | Efficient pretraining in which a discriminator learns to detect tokens replaced by a small generator. | Electra on Hugging Face |
| Longformer | Transformer model optimized for long documents. | Longformer on Hugging Face |
| MiniLM-L12-v2 | A compact and efficient model for sentence embeddings. | MiniLM-L12-v2 on Hugging Face |
| SBERT DistilRoBERTa | A distilled version of RoBERTa for efficient sentence embeddings. | SBERT DistilRoBERTa on Hugging Face |
| MPNet | Masked and Permuted Pre-training for Language Understanding. | MPNet on Hugging Face |
| ERNIE | Enhanced Representation through Knowledge Integration. | ERNIE on Hugging Face |
| DeBERTa | Decoding-enhanced BERT with disentangled attention. | DeBERTa on Hugging Face |
| SBERT paraphrase-MiniLM-L6-v2 | A Sentence Transformer model fine-tuned for paraphrase identification. | paraphrase-MiniLM-L6-v2 on Hugging Face |

Personal Preference:

I prefer T5-small and the MiniLM series models for their balance between performance and computational efficiency.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For any questions or support, please open an issue on the GitHub repository or contact antoine@antoineboucher.info