A powerful semantic search engine that combines traditional text search with vector similarity using Elasticsearch and machine learning embeddings.
Indexify is a comprehensive search solution that leverages Google Custom Search API, Elasticsearch, and transformer-based embeddings to provide intelligent search capabilities. It combines traditional keyword search with semantic understanding to deliver more relevant results.
- Hybrid Search: Combining text and vector similarity
- ML-Powered: Text embeddings using sentence-transformers
- Analytics: Search statistics and trending queries tracking
- Smart Suggestions: Based on user behavior
- Auto-Fetch: Content from Google Custom Search when needed
- Advanced Search: Multiple criteria (title, author, date, keywords)
graph TD
subgraph Frontend
UI[User Interface]
SearchBar[Search Bar]
Results[Results Display]
Suggestions[Search Suggestions]
end
subgraph Backend API
API[FastAPI Backend]
SearchHandler[Search Handler]
SuggestionEngine[Suggestion Engine]
IndexManager[Index Manager]
end
subgraph External Services
Google[Google Custom Search API]
Transform[Sentence Transformer]
end
subgraph Elasticsearch
ESIndex[Search Index]
Stats[Search Statistics]
Vectors[Vector Storage]
end
UI --> |Search Query| SearchBar
SearchBar --> |API Request| API
SearchBar --> |Get Suggestions| SuggestionEngine
API --> |Process Query| SearchHandler
SearchHandler --> |Vector Search| ESIndex
SearchHandler --> |Update Stats| Stats
API --> |New Content Request| Google
Google --> |Raw Results| IndexManager
IndexManager --> |Generate Embeddings| Transform
Transform --> |Store Vectors| Vectors
SuggestionEngine --> |Fetch Trends| Stats
ESIndex --> |Return Results| Results
Stats --> |Popular Searches| Suggestions
style UI fill:#f9f,stroke:#333,color:#000
style ESIndex fill:#69b,stroke:#333,color:#000
style Transform fill:#9cf,stroke:#333,color:#000
style Google fill:#4a8,stroke:#333,color:#000
backend/
├── app/
│   ├── api/routes/
│   │   └── routes.py
│   ├── core/
│   │   ├── models.py
│   │   ├── elasticsearch.py
│   │   ├── suggestion.py
│   │   ├── custom_search.py
│   │   ├── utils.py
│   │   ├── client.py
│   │   ├── index.py
│   │   └── documents.py
│   └── main.py
├── .env
├── requirements.txt
└── README.md
frontend/indexify/
├── .next/
├── node_modules/
├── public/
├── src/
│   └── app/
│       ├── globals.css
│       ├── layout.tsx
│       └── page.tsx
├── components/
│   └── SearchBar.tsx
├── config/
│   └── constants.ts
├── hooks/
│   ├── useSearch.ts
│   └── useSuggestions.ts
├── types/
│   └── index.ts
├── .env
├── .gitignore
├── eslint.config.mjs
├── next-env.d.ts
├── next.config.ts
├── package-lock.json
├── package.json
├── postcss.config.mjs
├── README.md
├── tailwind.config.ts
└── tsconfig.json
The Elasticsearch index is created with a mapping that defines the following fields:
"mappings": {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "custom_text_analyzer",
            "fields": {
                "keyword": {"type": "keyword"},
                "completion": {
                    "type": "completion",
                    "analyzer": "custom_text_analyzer"
                }
            }
        },
        "author": {"type": "keyword"},
        "publication_date": {"type": "date"},
        "abstract": {"type": "text", "analyzer": "custom_text_analyzer"},
        "keywords": {
            "type": "keyword",
            "fields": {
                "text": {
                    "type": "text",
                    "analyzer": "custom_text_analyzer"
                }
            }
        },
        "content": {"type": "text", "analyzer": "custom_text_analyzer"},
        "vector": {"type": "dense_vector", "dims": vector_dims},
        "search_count": {"type": "long"}
    }
}
- title: Text field with keyword and completion sub-fields
- author: Keyword field for exact matching
- publication_date: Date field
- abstract: Text field with custom analyzer
- keywords: Keyword field with text sub-field
- content: Text field with custom analyzer
- vector: Dense vector field for semantic search
- search_count: Long field for tracking popularity
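The `dims` value of the `vector` field has to match the length of the embeddings that get indexed; all-MiniLM-L6-v2 produces 384-dimensional vectors. The sketch below shows one way the index body might be assembled around that constraint. The function name `build_index_body` and the definition of `custom_text_analyzer` are assumptions made for illustration, since the project's real analyzer settings are not shown here.

```python
from typing import Any


def build_index_body(vector_dims: int = 384) -> dict[str, Any]:
    """Return an index-creation body whose vector size matches the embedder."""
    text = {"type": "text", "analyzer": "custom_text_analyzer"}
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer definition, for illustration only.
                    "custom_text_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    **text,
                    "fields": {
                        "keyword": {"type": "keyword"},
                        "completion": {
                            "type": "completion",
                            "analyzer": "custom_text_analyzer",
                        },
                    },
                },
                "author": {"type": "keyword"},
                "publication_date": {"type": "date"},
                "abstract": dict(text),
                "keywords": {"type": "keyword", "fields": {"text": dict(text)}},
                "content": dict(text),
                # Must equal the embedding length (384 for all-MiniLM-L6-v2).
                "vector": {"type": "dense_vector", "dims": vector_dims},
                "search_count": {"type": "long"},
            }
        },
    }
```

The returned dictionary could then be handed to the Elasticsearch client when the index is created.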
Indexify uses the sentence-transformers/all-MiniLM-L6-v2 model to generate semantic text embeddings that capture the meaning of content. The process works as follows:
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
graph TD
A[Input Text] -->|Tokenization| B[Tokens]
B -->|Model Processing| C[Raw Embeddings]
C -->|Extract CLS Token| D[Final Vector]
D -->|Store| E[Elasticsearch]
def generate_embedding(text: str) -> list[float]:
    # Tokenize with truncation for long texts
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )
    # Generate embeddings without gradient computation
    with torch.no_grad():
        outputs = model(**inputs)
    # Extract CLS token embedding
    embedding = outputs.last_hidden_state[:, 0, :].squeeze().tolist()
    return embedding
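As a rough illustration of where `generate_embedding` fits during indexing, the sketch below turns one Google Custom Search result into a document for the index. It assumes result items carry `title` and `snippet` keys, as Custom Search JSON results do; the function name `result_to_document`, the `embed` parameter, and the choice to store the snippet in `content` are assumptions, not the project's confirmed behavior.

```python
from typing import Callable


def result_to_document(item: dict, embed: Callable[[str], list[float]]) -> dict:
    """Turn one search result into an indexable document."""
    title = item.get("title", "")
    snippet = item.get("snippet", "")
    # Title and snippet are embedded together as a single piece of text.
    text = f"{title} {snippet}".strip()
    return {
        "title": title,
        "content": snippet,  # assumption: the snippet is stored as content
        "vector": embed(text),
        "search_count": 0,
    }
```

Passing the embedder in as a callable keeps the function testable without loading the model; in the application it would be called as `result_to_document(item, generate_embedding)`.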
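def vector_text_search(client, index_name, query_text, query_vector):
    query = {
        "query": {
            "script_score": {
                # Text query selects the candidate documents
                "query": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title^3", "abstract^2", "content"]
                    }
                },
                # Shift cosine similarity by 1.0 so the score is never negative
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }
    # Run the query and return the raw Elasticsearch response
    return client.search(index=index_name, body=query)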
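In a `script_score` query, the value returned by the script becomes the document's score, and the inner query's relevance is exposed to the script as `_score`. A script that returns only the cosine similarity therefore ranks the text matches by vector similarity alone. The sketch below shows one possible variant that adds the text relevance to the vector score; the function name `hybrid_query` and the `text_weight` parameter are hypothetical and not part of the project.

```python
def hybrid_query(
    query_text: str,
    query_vector: list[float],
    text_weight: float = 1.0,
) -> dict:
    """Build a script_score query that blends text relevance and cosine similarity."""
    return {
        "query": {
            "script_score": {
                "query": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title^3", "abstract^2", "content"],
                    }
                },
                "script": {
                    # _score is the multi_match relevance, so the field boosts
                    # now influence the final ranking as well.
                    "source": (
                        "params.text_weight * _score"
                        " + cosineSimilarity(params.query_vector, 'vector') + 1.0"
                    ),
                    "params": {
                        "query_vector": query_vector,
                        "text_weight": text_weight,
                    },
                },
            }
        }
    }
```

Because the two scores are on different scales, a weight like `text_weight` would likely need tuning against real queries.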
- Input Processing:
  - Combines title and snippet for search results
  - Truncates to 512 tokens maximum
  - Handles special tokens automatically
- Vector Generation:
  - Converts tokens to model inputs
  - Processes through transformer model
  - Extracts CLS token representation
  - Converts to float list format
- Search Integration:
  - Stores vectors in Elasticsearch
  - Uses cosine similarity for matching
  - Combines with text-based relevance
  - Boosts results based on field importance
- Result Scoring:
  - Base text similarity score
  - Vector similarity contribution
  - Optional keyword presence boost
  - Field-specific weight multipliers
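To make the vector similarity contribution concrete, the sketch below reproduces the `cosineSimilarity(...) + 1.0` arithmetic in plain Python. The function names are illustrative only. Cosine similarity lies in [-1, 1], so adding 1.0 moves the contribution into [0, 2], which keeps it non-negative as Elasticsearch requires of scores.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def vector_score(query_vector: list[float], doc_vector: list[float]) -> float:
    """Shifted similarity: 0 for opposite vectors, 2 for identical direction."""
    return cosine_similarity(query_vector, doc_vector) + 1.0
```

A document pointing in the same direction as the query scores 2.0, an unrelated (orthogonal) one scores 1.0, and an opposite one scores 0.0.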
The embedding system enhances search accuracy by:
- Capturing semantic relationships
- Understanding context beyond keywords
- Enabling similarity-based matching
- Supporting hybrid ranking strategies
- Vector Text Search
  - Combines traditional text matching with vector similarity
  - Uses script scoring for hybrid ranking
  - Supports fuzzy matching and field boosting
- Advanced Search
  - Multi-criteria search (title, author, date range, keywords)
  - Customizable result size
  - Sort by relevance and date
- Search Suggestions
  - Based on previous searches and trending queries
  - Tracks and updates search statistics
  - Provides real-time completion suggestions
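A multi-criteria search of this kind is commonly expressed as an Elasticsearch `bool` query. The sketch below is one possible shape, using the field names from the index mapping; the function name `advanced_search_query`, its parameters, and the choice of which criteria go in `must` versus `filter` are assumptions rather than the project's actual implementation.

```python
from typing import Optional


def advanced_search_query(
    title: Optional[str] = None,
    author: Optional[str] = None,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    keywords: Optional[list[str]] = None,
    size: int = 10,
) -> dict:
    """Build a bool query from whichever criteria were supplied."""
    must: list[dict] = []
    filters: list[dict] = []
    if title:
        # Full-text match contributes to the relevance score.
        must.append({"match": {"title": title}})
    if author:
        # author is a keyword field, so an exact term filter applies.
        filters.append({"term": {"author": author}})
    if date_from or date_to:
        bounds = {}
        if date_from:
            bounds["gte"] = date_from
        if date_to:
            bounds["lte"] = date_to
        filters.append({"range": {"publication_date": bounds}})
    if keywords:
        filters.append({"terms": {"keywords": keywords}})
    return {
        "size": size,
        "query": {"bool": {"must": must or [{"match_all": {}}], "filter": filters}},
        # Relevance first, newest publication as the tie-breaker.
        "sort": ["_score", {"publication_date": {"order": "desc"}}],
    }
```

Criteria that should not affect ranking sit in `filter`, which Elasticsearch evaluates without scoring.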
Core Endpoints
POST /api/search
POST /api/advanced-search
GET /api/suggestions
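Since the mapping defines a `completion` sub-field on `title`, the suggestions endpoint could plausibly be backed by Elasticsearch's completion suggester. The sketch below builds such a request and reads the suggestion texts from a response. The function names and the suggester name `title_suggest` are hypothetical, and the project may instead derive suggestions from its search statistics.

```python
def completion_suggest_body(prefix: str, size: int = 5) -> dict:
    """Build a completion-suggester request for the title.completion field."""
    return {
        "suggest": {
            "title_suggest": {  # hypothetical suggester name
                "prefix": prefix,
                "completion": {
                    "field": "title.completion",
                    "size": size,
                    "skip_duplicates": True,
                },
            }
        }
    }


def extract_suggestions(response: dict) -> list[str]:
    """Collect the suggested texts from a completion-suggester response."""
    entries = response.get("suggest", {}).get("title_suggest", [])
    return [option["text"] for entry in entries for option in entry.get("options", [])]
```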
If you're interested, please see Backend and Frontend Guidelines.
This project is licensed under the MIT License - see the LICENCE file for details.