
Indexify

A powerful semantic search engine that combines traditional text search with vector similarity using Elasticsearch and machine learning embeddings.


Overview

Indexify is a comprehensive search solution that leverages Google Custom Search API, Elasticsearch, and transformer-based embeddings to provide intelligent search capabilities. It combines traditional keyword search with semantic understanding to deliver more relevant results.

✨ Features

  • 🔍 Hybrid Search: Combining text and vector similarity
  • 🤖 ML-Powered: Text embeddings using sentence-transformers
  • 📊 Analytics: Search statistics and trending queries tracking
  • 💡 Smart Suggestions: Based on user behavior
  • 🔄 Auto-Fetch: Content from Google Custom Search when needed
  • 🎯 Advanced Search: Multiple criteria (title, author, date, keywords)


πŸ—οΈ Architecture

graph TD
    subgraph Frontend
        UI[User Interface]
        SearchBar[Search Bar]
        Results[Results Display]
        Suggestions[Search Suggestions]
    end

    subgraph Backend API
        API[FastAPI Backend]
        SearchHandler[Search Handler]
        SuggestionEngine[Suggestion Engine]
        IndexManager[Index Manager]
    end

    subgraph External Services
        Google[Google Custom Search API]
        Transform[Sentence Transformer]
    end

    subgraph Elasticsearch
        ESIndex[Search Index]
        Stats[Search Statistics]
        Vectors[Vector Storage]
    end

    UI --> |Search Query| SearchBar
    SearchBar --> |API Request| API
    SearchBar --> |Get Suggestions| SuggestionEngine

    API --> |Process Query| SearchHandler
    SearchHandler --> |Vector Search| ESIndex
    SearchHandler --> |Update Stats| Stats

    API --> |New Content Request| Google
    Google --> |Raw Results| IndexManager
    IndexManager --> |Generate Embeddings| Transform
    Transform --> |Store Vectors| Vectors

    SuggestionEngine --> |Fetch Trends| Stats
    ESIndex --> |Return Results| Results
    Stats --> |Popular Searches| Suggestions

    style UI fill:#f9f,stroke:#333,color:#000
    style ESIndex fill:#69b,stroke:#333,color:#000
    style Transform fill:#9cf,stroke:#333,color:#000
    style Google fill:#4a8,stroke:#333,color:#000

Backend Structure

backend/
├── app/
│   ├── api/routes/
│   │   └── routes.py
│   ├── core/
│   │   ├── models.py
│   │   ├── elasticsearch.py
│   │   ├── suggestion.py
│   │   ├── custom_search.py
│   │   ├── utils.py
│   │   ├── client.py
│   │   ├── index.py
│   │   └── documents.py
│   └── main.py
├── .env
├── requirements.txt
└── README.md

Frontend Structure

frontend/indexify/
├── .next/
├── node_modules/
├── public/
├── src/
│   └── app/
│       ├── globals.css
│       ├── layout.tsx
│       └── page.tsx
├── components/
│   └── SearchBar.tsx
├── config/
│   └── constants.ts
├── hooks/
│   ├── useSearch.ts
│   └── useSuggestions.ts
├── types/
│   └── index.ts
├── .env
├── .gitignore
├── eslint.config.mjs
├── next-env.d.ts
├── next.config.ts
├── package-lock.json
├── package.json
├── postcss.config.mjs
├── README.md
├── tailwind.config.ts
└── tsconfig.json

Technical Details

Elasticsearch Index Mapping

The system defines an index mapping with the following fields (`vector_dims` is supplied when the index is created):

"mappings": {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "custom_text_analyzer",
            "fields": {
                "keyword": {"type": "keyword"},
                "completion": {
                    "type": "completion",
                    "analyzer": "custom_text_analyzer"
                }
            }
        },
        "author": {"type": "keyword"},
        "publication_date": {"type": "date"},
        "abstract": {"type": "text", "analyzer": "custom_text_analyzer"},
        "keywords": {
            "type": "keyword",
            "fields": {
                "text": {
                    "type": "text",
                    "analyzer": "custom_text_analyzer"
                }
            }
        },
        "content": {"type": "text", "analyzer": "custom_text_analyzer"},
        "vector": {"type": "dense_vector", "dims": vector_dims},
        "search_count": {"type": "long"}
    }
}
  • title: Text field with keyword and completion sub-fields
  • author: Keyword field for exact matching
  • publication_date: Date field
  • abstract: Text field with custom analyzer
  • keywords: Keyword field with text sub-field
  • content: Text field with custom analyzer
  • vector: Dense vector field for semantic search
  • search_count: Long field for tracking popularity
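The mapping could be applied to a new index roughly as follows; `build_index_mapping` is a hypothetical helper (not from the repo), and the default of 384 assumes all-MiniLM-L6-v2's embedding size:

```python
def build_index_mapping(vector_dims: int = 384) -> dict:
    """Return the index mapping shown above; 384 is all-MiniLM-L6-v2's output size."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "custom_text_analyzer"},
                "author": {"type": "keyword"},
                "publication_date": {"type": "date"},
                "vector": {"type": "dense_vector", "dims": vector_dims},
                "search_count": {"type": "long"},
                # ... remaining fields as in the full mapping above
            }
        }
    }

# With an elasticsearch-py client, the index would be created roughly as:
# client.indices.create(index="indexify", body=build_index_mapping())
```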

Embedding Process

Indexify uses the sentence-transformers/all-MiniLM-L6-v2 model to generate semantic text embeddings that capture the meaning of content. Here's how the process works:

1. Model Initialization

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

2. Content Processing Flow

graph TD
    A[Input Text] -->|Tokenization| B[Tokens]
    B -->|Model Processing| C[Raw Embeddings]
    C -->|Extract CLS Token| D[Final Vector]
    D -->|Store| E[Elasticsearch]

3. Technical Implementation

Text Preprocessing & Embedding Generation

def generate_embedding(text: str) -> list[float]:
    # Tokenize with truncation for long texts
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )

    # Generate embeddings without gradient computation
    with torch.no_grad():
        outputs = model(**inputs)

    # Extract CLS token embedding
    embedding = outputs.last_hidden_state[:, 0, :].squeeze().tolist()
    return embedding

Search Process with Embeddings

def vector_text_search(client, index_name, query_text, query_vector):
    query = {
        "query": {
            "script_score": {
                "query": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title^3", "abstract^2", "content"]
                    }
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }
    return client.search(index=index_name, body=query)
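The `+ 1.0` in the script shifts cosine similarity from [-1, 1] into [0, 2], since Elasticsearch requires script scores to be non-negative. A stdlib-only sketch of the scoring formula (the vectors here are illustrative, not real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def script_score(query_vector: list[float], doc_vector: list[float]) -> float:
    # Mirrors "cosineSimilarity(params.query_vector, 'vector') + 1.0"
    return cosine_similarity(query_vector, doc_vector) + 1.0

print(script_score([1.0, 0.0], [1.0, 0.0]))   # same direction -> 2.0
print(script_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> 0.0
```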

4. Processing Pipeline

  1. Input Processing:

    • Combines title and snippet for search results
    • Truncates to 512 tokens maximum
    • Handles special tokens automatically
  2. Vector Generation:

    • Converts tokens to model inputs
    • Processes through transformer model
    • Extracts CLS token representation
    • Converts to float list format
  3. Search Integration:

    • Stores vectors in Elasticsearch
    • Uses cosine similarity for matching
    • Combines with text-based relevance
    • Boosts results based on field importance
  4. Result Scoring:

    • Base text similarity score
    • Vector similarity contribution
    • Optional keyword presence boost
    • Field-specific weight multipliers
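The input-processing step above could be sketched as follows; `prepare_text` is a hypothetical helper, and the real pipeline lets the tokenizer handle 512-token truncation (approximated here with whitespace tokens):

```python
def prepare_text(title: str, snippet: str, max_tokens: int = 512) -> str:
    """Combine a result's title and snippet, then cap the length.

    The actual system relies on the tokenizer's truncation to 512 model
    tokens; splitting on whitespace is only a rough stand-in.
    """
    combined = f"{title}. {snippet}".strip()
    tokens = combined.split()
    return " ".join(tokens[:max_tokens])
```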

The embedding system enhances search accuracy by:

  • Capturing semantic relationships
  • Understanding context beyond keywords
  • Enabling similarity-based matching
  • Supporting hybrid ranking strategies

Search Features

  1. Vector Text Search
  • Combines traditional text matching with vector similarity
  • Uses script scoring for hybrid ranking
  • Supports fuzzy matching and field boosting
  2. Advanced Search
  • Multi-criteria search (title, author, date range, keywords)
  • Customizable result size
  • Sort by relevance and date
  3. Search Suggestions
  • Based on previous searches and trending queries
  • Tracks and updates search statistics
  • Provides real-time completion suggestions
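An advanced-search request might translate into an Elasticsearch bool query along these lines; `build_advanced_query` and its parameter names are illustrative assumptions, with field names taken from the mapping above:

```python
def build_advanced_query(title=None, author=None, date_from=None, date_to=None,
                         keywords=None, size=10) -> dict:
    """Assemble a bool query: scored matches in `must`, exact criteria in `filter`."""
    must, filters = [], []
    if title:
        must.append({"match": {"title": title}})
    if author:
        filters.append({"term": {"author": author}})
    if keywords:
        filters.append({"terms": {"keywords": keywords}})
    if date_from or date_to:
        date_range = {}
        if date_from:
            date_range["gte"] = date_from
        if date_to:
            date_range["lte"] = date_to
        filters.append({"range": {"publication_date": date_range}})
    return {
        "size": size,
        "query": {"bool": {"must": must, "filter": filters}},
        # Relevance first, then recency as a tie-breaker
        "sort": ["_score", {"publication_date": {"order": "desc"}}],
    }
```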

API Routes

Core Endpoints

POST /api/search
POST /api/advanced-search
GET /api/suggestions
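Example request payloads for these endpoints; the field names below are assumptions for illustration, not taken from the actual route definitions:

```python
import json

# Illustrative request bodies for the core endpoints
search_body = json.dumps({"query": "vector databases"})
advanced_body = json.dumps({
    "title": "elasticsearch",
    "author": "jane-doe",
    "date_from": "2023-01-01",
    "size": 5,
})
suggestions_url = "/api/suggestions?q=vec"

# With the backend running locally, these could be sent with any HTTP client,
# e.g. requests.post("http://localhost:8000/api/search", data=search_body,
#                    headers={"Content-Type": "application/json"})
```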

Interested in Contributing?

If you're interested, please see Backend and Frontend Guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.
