Add new OpenSearch Vector Store with embeddings support #762
Merged: zainhoda merged 3 commits into vanna-ai:main from asaintsever:asaintsever/feat/opensearch-semantic-vector-store on Feb 8, 2025
Conversation
Auto Pull Request Review from LlamaPReview
1. Overview
1.1 PR Summary
- Business value and requirements alignment: This PR introduces a new OpenSearch vector store implementation that supports semantic/vector search, enhancing the relevance and accuracy of search results compared to the existing keyword-based approach. This aligns with the business need to improve the quality of search results, particularly for SQL generation and documentation retrieval.
- Key components modified:
  - pyproject.toml: Updated dependencies to include langchain-community and langchain-huggingface.
  - src/vanna/opensearch/__init__.py: Added import for the new OpenSearch_Semantic_VectorStore.
  - src/vanna/opensearch/opensearch_vector_semantic.py: Implemented the new OpenSearch_Semantic_VectorStore class with methods for adding data, retrieving similar items, and handling training data.
  - tests/test_imports.py: Updated imports to include the new vector store class.
- Impact assessment: The introduction of semantic search capabilities represents a significant enhancement in search functionality, potentially improving user satisfaction and the accuracy of SQL generation. However, it also introduces new dependencies and potential performance considerations.
- System dependencies and integration impacts: The new feature relies on external libraries (langchain-community and langchain-huggingface) and requires embedding generation during data ingestion and query processing, which may impact system performance and resource utilization.
1.2 Architecture Changes
- System design modifications: The addition of OpenSearch_Semantic_VectorStore extends the existing vector store architecture to support semantic search, leveraging vector embeddings for improved search relevance.
- Component interactions: The new vector store interacts with OpenSearch for storing and retrieving embeddings, and with a HuggingFace model for generating embeddings. It also integrates with the existing Vanna training functions for data ingestion.
- Integration points: The new class is integrated into the Vanna system through the __init__.py file and is used in the training and query processes.
2. Detailed Technical Analysis
2.1 Code Logic Deep-Dive
Core Logic Changes
- src/vanna/opensearch/opensearch_vector_semantic.py - OpenSearch_Semantic_VectorStore.get_training_data
- Submitted PR Code:
```python
def get_training_data(self, **kwargs) -> pd.DataFrame:
    data = []
    query = {
        "query": {
            "match_all": {}
        }
    }

    response = self.documentation_store.client.search(
        index=self.document_index,
        ignore_unavailable=True,
        body=query,
        size=1000  # TODO: paginate instead of getting max 1000 records
    )
    for hit in response['hits']['hits']:
        data.append(
            {
                "id": hit["_id"],
                "training_data_type": "documentation",
                "question": None,
                "content": hit["_source"]['text'],
            }
        )

    response = self.sql_store.client.search(
        index=self.question_sql_index,
        ignore_unavailable=True,
        body=query,
        size=1000  # TODO: paginate instead of getting max 1000 records
    )
    for hit in response['hits']['hits']:
        try:
            doc_dict = ast.literal_eval(hit["_source"]['text'])
            data.append(
                {
                    "id": hit["_id"],
                    "training_data_type": "sql",
                    "question": doc_dict.get("question"),
                    "content": doc_dict.get("sql"),
                }
            )
        except (ValueError, SyntaxError):
            self.log(f"Skipping row with custom_id {hit['_id']} due to parsing error.", "Error")
            continue

    response = self.ddl_store.client.search(
        index=self.ddl_index,
        ignore_unavailable=True,
        body=query,
        size=1000  # TODO: paginate instead of getting max 1000 records
    )
    for hit in response['hits']['hits']:
        data.append(
            {
                "id": hit["_id"],
                "training_data_type": "ddl",
                "question": None,
                "content": hit["_source"]['text'],
            }
        )

    return pd.DataFrame(data)
```
- Analysis:
  - The get_training_data method retrieves training data from three OpenSearch indices: documentation_store, sql_store, and ddl_store.
  - It uses a match_all query to fetch all documents from each index.
  - Pagination Issue: The current implementation uses size=1000 and includes a TODO comment about pagination. This is a significant limitation: only the first 1000 hits are retrieved from each index, so training data is silently dropped whenever an index holds more than 1000 documents. For large datasets this produces incomplete training data, degrading the model's training and performance.
  - No Scroll API: The code uses the standard search API with size, which is not efficient for retrieving large result sets. OpenSearch provides the Scroll API, designed for efficient deep pagination; not using it can lead to performance issues and timeouts as indices grow.
- LlamaPReview Suggested Improvements:
```python
def get_training_data(self, **kwargs) -> pd.DataFrame:
    data = []
    query = {
        "query": {
            "match_all": {}
        }
    }
    indices = [
        {"index": self.document_index, "type": "documentation"},
        {"index": self.question_sql_index, "type": "sql"},
        {"index": self.ddl_index, "type": "ddl"},
    ]
    for index_info in indices:
        index_name = index_info["index"]
        training_data_type = index_info["type"]
        scroll = '1m'  # keep scroll context alive for 1 minute
        response = self.documentation_store.client.search(  # use documentation_store.client consistently for search
            index=index_name,
            query=query["query"],  # directly pass query['query'] for clarity
            scroll=scroll,
            size=1000
        )
        scroll_id = response.get('_scroll_id')
        while scroll_id:
            hits = response['hits']['hits']
            if not hits:
                break  # no more hits, exit loop
            for hit in hits:
                source = hit['_source']
                if training_data_type == "sql":
                    try:
                        doc_dict = json.loads(source['text'])
                        content = doc_dict.get("sql")
                        question = doc_dict.get("question")
                    except json.JSONDecodeError as e:  # catch JSONDecodeError specifically
                        self.log(f"Skipping row with custom_id {hit['_id']} due to JSON parsing error: {e}", "Error")
                        continue
                else:  # documentation or ddl
                    content = source['text']
                    question = None
                data.append({
                    "id": hit["_id"],
                    "training_data_type": training_data_type,
                    "question": question,
                    "content": content,
                })
            # fetch the next batch of results
            response = self.documentation_store.client.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = response.get('_scroll_id')
    return pd.DataFrame(data)
```
- Improvement rationale:
  - Technical benefits:
    - Addresses Pagination: Implements pagination using OpenSearch's Scroll API, allowing retrieval of all documents from the indices regardless of their count. This resolves the data loss issue and ensures comprehensive training data retrieval.
    - Performance Improvement: The Scroll API is significantly more efficient for deep pagination than repeated search requests with increasing from and size values, improving the performance of get_training_data, especially for large OpenSearch indices.
    - Scalability: Enables the system to handle larger datasets in OpenSearch without being capped at the 1000 records returned by a single basic search request.
  - Business value:
    - Improved Model Accuracy: By retrieving all available training data, the model can be trained on a complete dataset, leading to potentially higher accuracy and better performance in question answering and SQL generation.
    - Handles Large Datasets: The system can now effectively handle training data stored in large OpenSearch indices, making it more robust and scalable for real-world applications with growing data volumes.
  - Risk assessment:
    - Increased Complexity: The Scroll API adds some complexity compared to a simple search with size, but this is justified by the significant improvements in data completeness and performance.
    - Resource Usage: The Scroll API maintains a scroll context on the OpenSearch server, which consumes resources. The scroll='1m' parameter keeps the context alive for 1 minute, which should be sufficient for most cases; this should be monitored and adjusted based on index size and network latency.
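The scroll loop can be exercised without a live cluster. In the sketch below, the hypothetical FakeClient stands in for the opensearch-py client, returning the same dict shape the suggested code consumes; it shows how the pattern drains an index of 2,500 documents in 1,000-document batches, where a single size=1000 search would stop at the first batch:

```python
class FakeClient:
    """Hypothetical stand-in for the opensearch-py client: serves 2,500 docs in batches of 1,000."""
    def __init__(self, total=2500, batch=1000):
        self.docs = [{"_id": str(i), "_source": {"text": f"doc {i}"}} for i in range(total)]
        self.batch = batch
        self.cursor = 0

    def search(self, index=None, query=None, scroll=None, size=1000):
        self.cursor = 0
        return self._page()

    def scroll(self, scroll_id=None, scroll=None):
        return self._page()

    def _page(self):
        # return the next batch, mimicking a scroll response
        hits = self.docs[self.cursor:self.cursor + self.batch]
        self.cursor += self.batch
        return {"_scroll_id": "ctx-1", "hits": {"hits": hits}}

client = FakeClient()
data = []
response = client.search(index="vanna-doc", query={"match_all": {}}, scroll="1m", size=1000)
scroll_id = response.get("_scroll_id")
while scroll_id:
    hits = response["hits"]["hits"]
    if not hits:
        break  # scroll context exhausted
    for hit in hits:
        data.append({"id": hit["_id"], "content": hit["_source"]["text"]})
    response = client.scroll(scroll_id=scroll_id, scroll="1m")
    scroll_id = response.get("_scroll_id")

print(len(data))  # 2500 -- all docs retrieved, vs. 1000 with a single search call
```

Against a real cluster the loop is identical except that the client is a genuine OpenSearch connection and the scroll context should be cleared when done.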
Core Logic Changes
- src/vanna/opensearch/opensearch_vector_semantic.py - OpenSearch_Semantic_VectorStore.get_similar_question_sql & OpenSearch_Semantic_VectorStore.get_training_data
- Submitted PR Code:
```python
def get_similar_question_sql(self, question: str, **kwargs) -> list:
    documents = self.sql_store.similarity_search(query=question, k=self.n_results_sql)
    return [ast.literal_eval(document.page_content) for document in documents]

def get_training_data(self, **kwargs) -> pd.DataFrame:
    # ...
    for hit in response['hits']['hits']:
        try:
            doc_dict = ast.literal_eval(hit["_source"]['text'])
            # ...
        except (ValueError, SyntaxError):
            self.log(f"Skipping row with custom_id {hit['_id']} due to parsing error.", "Error")
            continue
    # ...
```
- Analysis:
  - Both get_similar_question_sql and get_training_data use ast.literal_eval(document.page_content) to parse the page_content retrieved from OpenSearch. In get_similar_question_sql, the page_content is expected to be a JSON string representing a question-SQL pair; in get_training_data, the 'sql' type data likewise expects JSON.
  - Security Risk of ast.literal_eval: ast.literal_eval is safer than eval because it only evaluates literal Python structures (strings, numbers, tuples, lists, dicts, booleans, None), but it is still not completely safe when handling data from external sources such as OpenSearch, which might be compromised or contain maliciously crafted strings. The risk may be low if OpenSearch is a trusted internal component, but if an attacker could inject malicious strings into the index, they might cause denial of service through crafted deeply nested literals or exploit weaknesses in the parsing logic.
  - Robustness and Error Handling: The try-except block in get_training_data only catches ValueError and SyntaxError; other exceptions during parsing or data access are unhandled. Relying on Python-literal parsing for structured data is also inherently less robust than using a dedicated JSON library: if the data in OpenSearch is not a valid Python literal, ast.literal_eval will fail, leading to data loss or errors.
  - Performance Overhead: ast.literal_eval is less efficient than direct JSON parsing with json.loads, especially for complex structures. While performance may not be critical here, json.loads is generally preferred for parsing JSON in Python.
- LlamaPReview Suggested Improvements:
```python
import json  # ensure json is imported at the top of the file

def get_similar_question_sql(self, question: str, **kwargs) -> list:
    documents = self.sql_store.similarity_search(query=question, k=self.n_results_sql)
    return [json.loads(document.page_content) for document in documents]  # use json.loads

def get_training_data(self, **kwargs) -> pd.DataFrame:
    # ...
    for hit in response['hits']['hits']:
        source = hit['_source']
        if training_data_type == "sql":
            try:
                doc_dict = json.loads(source['text'])  # use json.loads
                content = doc_dict.get("sql")
                question = doc_dict.get("question")
            except json.JSONDecodeError as e:  # catch JSONDecodeError specifically
                self.log(f"Skipping row with custom_id {hit['_id']} due to JSON parsing error: {e}", "Error")
                continue
        else:  # documentation or ddl
            content = source['text']
            question = None
    # ...
```
- Improvement rationale:
  - Technical benefits:
    - Security Improvement: Replaces ast.literal_eval with json.loads for parsing JSON data. json.loads is designed specifically for JSON and is safer and more appropriate for this task, reducing the risks associated with ast.literal_eval.
    - Increased Robustness: json.loads is the standard, recommended way to parse JSON in Python. It provides clear error handling for invalid JSON via json.JSONDecodeError, making the code more robust against malformed data in OpenSearch.
    - Performance: json.loads is generally faster at parsing JSON than ast.literal_eval.
    - Clarity and Maintainability: Using json.loads explicitly signals the intent to parse JSON data, improving readability and maintainability.
  - Business value:
    - Improved System Reliability: A safer, more robust JSON parsing method makes the system less prone to errors and potential security issues, improving reliability and stability.
    - Data Integrity: More robust parsing ensures data is correctly processed and reduces the risk of corruption or misinterpretation due to parsing errors.
  - Risk assessment:
    - Low Risk Change: Replacing ast.literal_eval with json.loads is a low-risk change with clear security and robustness benefits, aligned with Python best practices for JSON handling. The change adds specific handling of json.JSONDecodeError to preserve the existing error logging and row-skipping behavior, so there is no regression in error handling.
2.2 Implementation Quality
- Code organization and structure: The new OpenSearch_Semantic_VectorStore class is well-organized and follows a clear structure. The methods are logically grouped and follow a consistent naming convention.
- Design patterns usage: The class leverages the VannaBase base class and uses dependency injection for configuration, adhering to good design principles.
- Error handling approach: The class includes error handling for various operations, such as data parsing and deletion. However, there is room for improvement in handling specific exceptions and providing more informative error messages.
- Resource management: The class manages OpenSearch client resources effectively, ensuring that connections are properly configured and used.
3. Critical Findings
3.1 Potential Issues
- 🔴 Critical Issues
  - Pagination Issue in get_training_data: The current implementation uses size=1000 without proper pagination, leading to potential data loss for large datasets and significantly impacting the model's training and performance.
    - Impact: Incomplete training data retrieval, leading to poor model performance.
    - Recommendation: Implement pagination using OpenSearch's Scroll API to retrieve all documents from the indices.
- 🟡 Warnings
  - Security Risk of ast.literal_eval: Using ast.literal_eval to parse JSON data poses a security risk, especially when handling data from external sources.
    - Potential risks: Resource-exhaustion or denial-of-service attacks if malicious strings are injected into the OpenSearch index.
    - Suggested improvements: Replace ast.literal_eval with json.loads, which is safer and more appropriate for parsing JSON.
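The difference in failure behavior is easy to demonstrate with the standard library alone (the payload strings below are illustrative, not taken from the PR):

```python
import ast
import json

good = '{"question": "How many users are there?", "sql": "SELECT COUNT(*) FROM users"}'
bad = '{"question": "truncated record...'  # malformed payload, as might appear in a corrupted index

# json.loads parses valid JSON and raises a specific, catchable error on garbage.
record = json.loads(good)
print(record["sql"])  # SELECT COUNT(*) FROM users

try:
    json.loads(bad)
except json.JSONDecodeError as e:
    print(f"rejected: {e.msg}")

# ast.literal_eval happens to accept this JSON too (it is also a valid Python
# dict literal), but it accepts arbitrary Python literals -- including deeply
# nested ones crafted to exhaust the parser -- and signals failure only with
# the generic ValueError/SyntaxError pair.
try:
    ast.literal_eval(bad)
except (ValueError, SyntaxError):
    print("rejected, but with a less specific exception")
```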
3.2 Code Quality Concerns
- Maintainability aspects: The code is generally maintainable, but there are opportunities to improve error handling and logging for better debugging and monitoring.
- Readability issues: The code is readable, but improvements in variable naming and comments could enhance understanding.
- Performance bottlenecks: The current implementation of get_training_data may introduce performance bottlenecks due to the lack of proper pagination; this can be addressed by using the Scroll API for efficient data retrieval.
4. Security Assessment
- Authentication/Authorization impacts: The new vector store relies on OpenSearch client authentication, which must be securely configured to prevent unauthorized access.
- Data handling concerns: Ensure that embedding models and data handled by the vector store are securely managed to prevent data leakage or unauthorized access.
- Input validation: Validate all inputs to the vector store methods to prevent injection attacks or malformed data from causing errors.
- Security best practices: Follow best practices for secure configuration and management of OpenSearch and embedding models.
- Potential security risks: The use of ast.literal_eval poses a security risk, which can be mitigated by using json.loads.
- Mitigation strategies: Implement robust input validation, secure configuration, and use safer JSON parsing methods.
- Security testing requirements: Conduct thorough security testing, including penetration testing and code reviews, to identify and mitigate potential vulnerabilities.
5. Testing Strategy
5.1 Test Coverage
- Unit test analysis: Ensure that all methods in the OpenSearch_Semantic_VectorStore class are covered by unit tests, including edge cases and error handling.
- Integration test requirements: Perform integration tests to validate the interaction between the vector store and OpenSearch, as well as the embedding generation process.
- Edge cases coverage: Test the behavior of the vector store with various data inputs, including large datasets and malformed data.
5.2 Test Recommendations
Suggested Test Cases
```python
# Example unit test for get_similar_question_sql
def test_get_similar_question_sql(self):
    store = OpenSearch_Semantic_VectorStore(config=test_config)
    question = "How many users signed up last month?"  # a natural-language question, not SQL
    results = store.get_similar_question_sql(question)
    assert len(results) > 0
    assert all(isinstance(result, dict) for result in results)

# Example integration test for get_training_data
def test_get_training_data(self):
    store = OpenSearch_Semantic_VectorStore(config=test_config)
    data = store.get_training_data()
    assert not data.empty
    assert set(data.columns) == {"id", "training_data_type", "question", "content"}
```
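The JSON-parsing path can also be unit-tested in isolation, without a running OpenSearch cluster. The helper below is hypothetical (the PR keeps this logic inline in get_training_data), but extracting it makes the skip-on-malformed-row behavior directly testable:

```python
import json

def parse_sql_record(text):
    """Hypothetical helper mirroring the suggested parsing:
    return (question, sql) on valid JSON, or None on a malformed row."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return None  # caller logs the error and skips the row
    return doc.get("question"), doc.get("sql")

def test_parse_sql_record():
    ok = parse_sql_record('{"question": "How many users?", "sql": "SELECT COUNT(*) FROM users"}')
    assert ok == ("How many users?", "SELECT COUNT(*) FROM users")
    # malformed rows are skipped, not fatal
    assert parse_sql_record("not json at all") is None

test_parse_sql_record()
print("parse tests passed")
```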
- Coverage improvements: Ensure that all code paths, including error handling, are covered by tests.
- Performance testing needs: Perform performance testing to measure the latency and throughput of embedding generation and vector search operations under different data volumes and query loads.
6. Documentation & Maintenance
- Documentation updates needed: Update the API documentation to include the new OpenSearch_Semantic_VectorStore class and its methods, and provide examples of how to use the new vector store for semantic search.
- Long-term maintenance considerations: Monitor the performance and resource utilization of the vector store, especially as data volumes grow. Regularly review and update the embedding models and OpenSearch configurations to ensure optimal performance and security.
- Technical debt and monitoring requirements: Address technical debt introduced by the new feature, such as the need for proper pagination in get_training_data, and implement monitoring to track the vector store's performance and resource usage.
7. Deployment & Operations
- Deployment impact and strategy: The new vector store requires deployment of the updated code and configuration of the OpenSearch client and embedding models. Ensure that the deployment process includes thorough testing to validate the new functionality.
- Key operational considerations: Monitor the performance and resource utilization of the vector store in the production environment. Implement logging and alerting to detect and respond to any issues or anomalies.
8. Summary & Recommendations
8.1 Key Action Items
- Implement pagination using OpenSearch's Scroll API in get_training_data to retrieve all documents from the indices.
- Replace ast.literal_eval with json.loads for parsing JSON data to improve security and robustness.
- Ensure comprehensive unit and integration tests for the OpenSearch_Semantic_VectorStore class.
- Update documentation to include the new vector store and its usage.
8.2 Future Considerations
- Technical evolution path: Continuously improve the vector store implementation based on performance monitoring and user feedback. Explore opportunities to optimize embedding generation and vector search algorithms.
- Business capability evolution: As the system evolves, consider extending the semantic search capabilities to other data types and use cases beyond SQL generation and documentation retrieval.
- System integration impacts: Ensure that the new vector store integrates seamlessly with other components of the Vanna system and external systems, such as data ingestion pipelines and user interfaces.
Took into account improvement proposals from bot.
zainhoda approved these changes on Feb 8, 2025.
The existing OpenSearch Vector Store does not support semantic (aka vector) search, since no embeddings are generated or stored in the indices (the mappings do not leverage the knn_vector type either). Queries against this implementation are keyword-based, which often yields poor or non-relevant answers compared to semantic/vector search.

This PR adds a new OpenSearch vector store implementation supporting semantic/vector search. Embeddings are generated when documentation/DDL/SQL are added (via the Vanna train function) as well as when questions are submitted to retrieve data.

This PR solves GitHub issue #441.