Add new OpenSearch Vector Store with embeddings support #762

Conversation

asaintsever
Contributor

The existing OpenSearch vector store does not support semantic (a.k.a. vector) search: no embeddings are generated or stored in the indices, and the mappings do not use the knn_vector type. Queries against this implementation are keyword-based, which often yields poor or irrelevant answers compared to semantic/vector search.

This PR adds a new OpenSearch vector store implementation that supports semantic/vector search. Embeddings are generated when documentation/DDL/SQL are added (via the Vanna train function), as well as when questions are submitted to retrieve data.

This PR solves GitHub issue #441.
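
For reviewers who want to try the new store, here is a minimal usage sketch following Vanna's usual mixin pattern. The config keys shown (such as opensearch_url) and the choice of OpenAI_Chat for the LLM side are illustrative assumptions, not taken from this PR.

  # Illustrative sketch only: config keys below are hypothetical, not from this PR.
  from vanna.opensearch import OpenSearch_Semantic_VectorStore
  from vanna.openai import OpenAI_Chat  # any Vanna LLM mixin could be used instead

  class MyVanna(OpenSearch_Semantic_VectorStore, OpenAI_Chat):
      def __init__(self, config=None):
          OpenSearch_Semantic_VectorStore.__init__(self, config=config)
          OpenAI_Chat.__init__(self, config=config)

  vn = MyVanna(config={
      "opensearch_url": "https://localhost:9200",  # hypothetical key name
      "api_key": "sk-...",                         # for the LLM mixin
  })

  # Training stores embeddings alongside the documents ...
  vn.train(ddl="CREATE TABLE users (id INT, name TEXT)")
  vn.train(documentation="The users table holds one row per registered account.")

  # ... and asking a question embeds it and retrieves semantically similar context.
  vn.ask("How many users are registered?")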

@llamapreview llamapreview bot left a comment

Auto Pull Request Review from LlamaPReview

1. Overview

1.1 PR Summary

  • Business value and requirements alignment: This PR introduces a new OpenSearch vector store implementation that supports semantic/vector search, enhancing the relevance and accuracy of search results compared to the existing keyword-based approach. This aligns with the business need to improve the quality of search results, particularly for SQL generation and documentation retrieval.
  • Key components modified:
    • pyproject.toml: Updated dependencies to include langchain-community and langchain-huggingface.
    • src/vanna/opensearch/__init__.py: Added import for the new OpenSearch_Semantic_VectorStore.
    • src/vanna/opensearch/opensearch_vector_semantic.py: Implemented the new OpenSearch_Semantic_VectorStore class with methods for adding data, retrieving similar items, and handling training data.
    • tests/test_imports.py: Updated imports to include the new vector store class.
  • Impact assessment: The introduction of semantic search capabilities represents a significant enhancement in search functionality, potentially improving user satisfaction and the accuracy of SQL generation. However, it also introduces new dependencies and potential performance considerations.
  • System dependencies and integration impacts: The new feature relies on external libraries (langchain-community and langchain-huggingface) and requires embedding generation during data ingestion and query processing, which may impact system performance and resource utilization.

1.2 Architecture Changes

  • System design modifications: The addition of OpenSearch_Semantic_VectorStore extends the existing vector store architecture to support semantic search, leveraging vector embeddings for improved search relevance.
  • Component interactions: The new vector store interacts with OpenSearch for storing and retrieving embeddings, and with the HuggingFace model for generating embeddings. It also integrates with the existing Vanna training functions for data ingestion.
  • Integration points: The new class is integrated into the Vanna system through the __init__.py file and is used in the training and query processes.
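
To make the described component interaction concrete, the sketch below shows how these pieces typically fit together using langchain-huggingface and langchain-community; the model name, index name, and URL are placeholders, and the actual wiring inside OpenSearch_Semantic_VectorStore may differ.

  # Sketch of the interaction the review describes; names are placeholders, not the PR's actual code.
  from langchain_huggingface import HuggingFaceEmbeddings
  from langchain_community.vectorstores import OpenSearchVectorSearch

  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

  doc_store = OpenSearchVectorSearch(
      opensearch_url="https://localhost:9200",
      index_name="vanna-documentation",   # placeholder index name
      embedding_function=embeddings,
  )

  # Ingestion: the text is embedded and stored as a vector in the index.
  doc_store.add_texts(["The users table holds one row per registered account."])

  # Query: the question is embedded and the k nearest documents are returned.
  hits = doc_store.similarity_search("Where are user accounts stored?", k=3)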

2. Detailed Technical Analysis

2.1 Code Logic Deep-Dive

Core Logic Changes

  • src/vanna/opensearch/opensearch_vector_semantic.py - OpenSearch_Semantic_VectorStore.get_training_data
    • Submitted PR Code:
    def get_training_data(self, **kwargs) -> pd.DataFrame:
      data = []
      query = {
        "query": {
          "match_all": {}
        }
      }

      response = self.documentation_store.client.search(
        index=self.document_index,
        ignore_unavailable=True,
        body=query,
        size=1000 # TODO: paginate instead of getting max 1000 records
      )

      # records = [hit['_source'] for hit in response['hits']['hits']]
      for hit in response['hits']['hits']:
        data.append(
          {
            "id": hit["_id"],
            "training_data_type": "documentation",
            "question": None,
            "content": hit["_source"]['text'],
          }
        )

      response = self.sql_store.client.search(
        index=self.question_sql_index,
        ignore_unavailable=True,
        body=query,
        size=1000 # TODO: paginate instead of getting max 1000 records
      )
      # records = [hit['_source'] for hit in response['hits']['hits']]
      for hit in response['hits']['hits']:
        try:
          doc_dict = ast.literal_eval(hit["_source"]['text'])

          data.append(
            {
              "id": hit["_id"],
              "training_data_type": "sql",
              "question": doc_dict.get("question"),
              "content": doc_dict.get("sql"),
            }
          )
        except (ValueError, SyntaxError):
          self.log(f"Skipping row with custom_id {hit['_id']} due to parsing error.", "Error")
          continue

      response = self.ddl_store.client.search(
        index=self.ddl_index,
        ignore_unavailable=True,
        body=query,
        size=1000 # TODO: paginate instead of getting max 1000 records
      )
      # records = [hit['_source'] for hit in response['hits']['hits']]
      for hit in response['hits']['hits']:
        data.append(
          {
            "id": hit["_id"],
            "training_data_type": "ddl",
            "question": None,
            "content": hit["_source"]['text'],
          }
        )

      return pd.DataFrame(data)
  • Analysis:
    • The get_training_data method retrieves training data from three OpenSearch indices: documentation_store, sql_store, and ddl_store.
    • It uses a match_all query to fetch all documents from each index.
    • Pagination Issue: The current implementation uses size=1000 and includes a TODO comment about pagination. This is a significant limitation as it will only retrieve the first 1000 hits from each index, potentially missing training data if there are more than 1000 documents in any of these indices. For large datasets, this will lead to incomplete training data being retrieved, impacting the model's training and performance. The initial review correctly identifies this as a point for deep-dive analysis, but doesn't explicitly highlight the severity of the data loss risk for larger datasets.
    • No Scroll API: The code uses the standard search API with size, which is not efficient for retrieving large datasets from OpenSearch. OpenSearch provides the Scroll API, which is designed for efficient deep pagination and retrieval of large result sets. Not using the Scroll API can lead to performance issues and potentially timeouts when indices grow large.
  • LlamaPReview Suggested Improvements:
    def get_training_data(self, **kwargs) -> pd.DataFrame:
      data = []
      query = {
        "query": {
          "match_all": {}
        }
      }
      indices = [
          {"index": self.document_index, "type": "documentation"},
          {"index": self.question_sql_index, "type": "sql"},
          {"index": self.ddl_index, "type": "ddl"},
      ]

      for index_info in indices:
          index_name = index_info["index"]
          training_data_type = index_info["type"]
          scroll = '1m'  # keep scroll context for 1 minute
          response = self.documentation_store.client.search(  # all three stores share the same OpenSearch client
              index=index_name,
              body=query,  # opensearch-py takes the request body via `body`
              scroll=scroll,
              size=1000
          )

          scroll_id = response.get('_scroll_id')

          while scroll_id:
              hits = response['hits']['hits']
              if not hits:
                  break # No more hits, exit loop

              for hit in hits:
                  source = hit['_source']
                  if training_data_type == "sql":
                      try:
                          doc_dict = json.loads(source['text'])
                          content = doc_dict.get("sql")
                          question = doc_dict.get("question")
                      except json.JSONDecodeError as e: # Catch JSONDecodeError specifically
                          self.log(f"Skipping row with custom_id {hit['_id']} due to JSON parsing error: {e}", "Error") # More specific error message
                          continue
                  else: # documentation or ddl
                      content = source['text']
                      question = None

                  data.append({
                      "id": hit["_id"],
                      "training_data_type": training_data_type,
                      "question": question,
                      "content": content,
                  })
              # Get next batch of results
              response = self.documentation_store.client.scroll(scroll_id=scroll_id, scroll=scroll) # Use documentation_store.client.scroll
              scroll_id = response.get('_scroll_id')

      return pd.DataFrame(data)
  • Improvement rationale:
    • Technical benefits:
      • Addresses Pagination: Implements pagination using OpenSearch's Scroll API, allowing retrieval of all documents from the indices, regardless of the number of documents. This resolves the data loss issue and ensures comprehensive training data retrieval.
      • Performance Improvement: Using the Scroll API is significantly more efficient for deep pagination than repeated search requests with increasing from and size values. This improves the performance of get_training_data, especially for large OpenSearch indices.
      • Scalability: Enables the system to handle larger datasets in OpenSearch without being capped by the fixed size=1000 request used in the submitted code.
    • Business value:
      • Improved Model Accuracy: By retrieving all available training data, the model can be trained on a more complete dataset, leading to potentially higher accuracy and better performance in question answering and SQL generation.
      • Handles Large Datasets: The system can now effectively handle training data stored in large OpenSearch indices, making it more robust and scalable for real-world applications with growing data volumes.
    • Risk assessment:
      • Increased Complexity: Using the Scroll API adds some complexity compared to a simple search with size, but the added complexity is justified by the gains in data-retrieval completeness and performance.
      • Resource Usage: While the Scroll API is efficient for data retrieval, it maintains a scroll context on the OpenSearch server, which consumes resources. The scroll='1m' parameter in the suggested code keeps the context alive for one minute, which should be sufficient for most cases, but it should be monitored and adjusted based on index size and network latency; the context can also be released explicitly, as in the sketch below.
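
A minimal sketch of releasing the scroll context explicitly, assuming a plain opensearch-py client like the one the stores expose (the helper name scroll_all_hits is ours, not part of the PR):

  # Sketch: scroll through an index and always release the scroll context afterwards.
  # `client` is an opensearch-py OpenSearch client, as used by the stores above.
  def scroll_all_hits(client, index_name, page_size=1000, keep_alive="1m"):
      response = client.search(
          index=index_name,
          body={"query": {"match_all": {}}},
          scroll=keep_alive,
          size=page_size,
      )
      scroll_id = response.get("_scroll_id")
      try:
          while scroll_id and response["hits"]["hits"]:
              yield from response["hits"]["hits"]
              response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
              scroll_id = response.get("_scroll_id")
      finally:
          if scroll_id:
              # Free server-side resources as soon as iteration is done.
              client.clear_scroll(scroll_id=scroll_id)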

Core Logic Changes

  • src/vanna/opensearch/opensearch_vector_semantic.py - OpenSearch_Semantic_VectorStore.get_similar_question_sql & OpenSearch_Semantic_VectorStore.get_training_data
    • Submitted PR Code:
    def get_similar_question_sql(self, question: str, **kwargs) -> list:
      documents = self.sql_store.similarity_search(query=question, k=self.n_results_sql)
      return [ast.literal_eval(document.page_content) for document in documents]
    def get_training_data(self, **kwargs) -> pd.DataFrame:
      # ...
      for hit in response['hits']['hits']:
        try:
          doc_dict = ast.literal_eval(hit["_source"]['text'])
          # ...
        except (ValueError, SyntaxError):
          self.log(f"Skipping row with custom_id {hit['_id']} due to parsing error.", "Error")
          continue
      # ...
  • Analysis:
    • Both get_similar_question_sql and get_training_data methods use ast.literal_eval(document.page_content) to parse the page_content retrieved from OpenSearch. In the context of get_similar_question_sql, the page_content is expected to be a JSON string representing a question-SQL pair. In get_training_data, specifically for 'sql' type data, it also expects JSON.
    • Security Risk of ast.literal_eval: ast.literal_eval is far safer than eval because it only evaluates literal Python structures (strings, numbers, tuples, lists, dicts, booleans, None) and cannot execute arbitrary code. It is still not the right tool for untrusted input, however: maliciously crafted or deeply nested strings stored in OpenSearch could crash the parser (for example via recursion limits), opening a denial-of-service path. The risk may be low if OpenSearch is treated as a trusted internal component, but it remains an unnecessary exposure when the stored data is supposed to be JSON.
    • Robustness and Error Handling: While there is a try-except block in get_training_data, it only catches ValueError and SyntaxError. Other potential exceptions during parsing or data access are not handled. Furthermore, relying on string parsing for structured data is inherently less robust than using a dedicated JSON library. If the data in OpenSearch is not consistently valid JSON, ast.literal_eval will fail, leading to data loss or errors.
    • Performance Overhead: ast.literal_eval is less efficient than direct JSON parsing using json.loads, especially for complex JSON structures. While performance might not be critical in this context, using json.loads is generally preferred for parsing JSON data in Python.
  • LlamaPReview Suggested Improvements:
    import json # Ensure json is imported at the top of the file

    def get_similar_question_sql(self, question: str, **kwargs) -> list:
      documents = self.sql_store.similarity_search(query=question, k=self.n_results_sql)
      return [json.loads(document.page_content) for document in documents] # Use json.loads

    def get_training_data(self, **kwargs) -> pd.DataFrame:
      # ...
      for hit in response['hits']['hits']:
        source = hit['_source']
        if training_data_type == "sql":
          try:
            doc_dict = json.loads(source['text']) # Use json.loads
            content = doc_dict.get("sql")
            question = doc_dict.get("question")
          except json.JSONDecodeError as e: # Catch JSONDecodeError specifically
            self.log(f"Skipping row with custom_id {hit['_id']} due to JSON parsing error: {e}", "Error") # More specific error message
            continue
        else: # documentation or ddl
          content = source['text']
          question = None
        # ...
  • Improvement rationale:
    • Technical benefits:
      • Security Improvement: Replaces ast.literal_eval with json.loads for parsing JSON data. json.loads is specifically designed for parsing JSON and is generally considered safer and more appropriate for this task, reducing the risk of potential vulnerabilities associated with ast.literal_eval.
      • Increased Robustness: Uses json.loads which is the standard and recommended way to parse JSON in Python. It provides better error handling for invalid JSON formats through json.JSONDecodeError, making the code more robust against malformed data in OpenSearch.
      • Performance: json.loads is generally more performant for parsing JSON compared to ast.literal_eval.
      • Clarity and Maintainability: Using json.loads explicitly signals the intent to parse JSON data, improving code readability and maintainability.
    • Business value:
      • Improved System Reliability: By using a safer and more robust JSON parsing method, the system becomes less prone to errors and potential security issues, leading to improved reliability and stability.
      • Data Integrity: More robust parsing ensures that data is correctly processed and reduces the risk of data corruption or misinterpretation due to parsing errors.
    • Risk assessment:
      • Low Risk Change: Replacing ast.literal_eval with json.loads for JSON parsing is a low-risk change with significant benefits in terms of security and robustness. It aligns with best practices for JSON handling in Python. The change includes specific exception handling for json.JSONDecodeError to maintain error logging and data skipping behavior, ensuring no regression in error handling.
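
One caveat worth checking before merging the swap: json.loads and ast.literal_eval accept different inputs, so any documents already written to the index as Python reprs (single quotes, None) would stop parsing once json.loads is used. A small illustration:

  import ast
  import json

  python_repr = "{'question': 'How many users?', 'sql': 'SELECT COUNT(*) FROM users'}"
  json_text   = '{"question": "How many users?", "sql": "SELECT COUNT(*) FROM users"}'

  ast.literal_eval(python_repr)   # parses fine (Python literal syntax)
  json.loads(json_text)           # parses fine (strict JSON)

  try:
      json.loads(python_repr)     # single quotes are not valid JSON
  except json.JSONDecodeError as e:
      print(f"Existing Python-repr rows would now be skipped: {e}")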

2.2 Implementation Quality

  • Code organization and structure: The new OpenSearch_Semantic_VectorStore class is well-organized and follows a clear structure. The methods are logically grouped and follow a consistent naming convention.
  • Design patterns usage: The class leverages the VannaBase base class and uses dependency injection for configuration, adhering to good design principles.
  • Error handling approach: The class includes error handling for various operations, such as data parsing and deletion. However, there is room for improvement in handling specific exceptions and providing more informative error messages.
  • Resource management: The class manages OpenSearch client resources effectively, ensuring that connections are properly configured and used.

3. Critical Findings

3.1 Potential Issues

  • 🔴 Critical Issues

    • Pagination Issue in get_training_data: The current implementation of get_training_data uses size=1000 without proper pagination, leading to potential data loss for large datasets. This can significantly impact the model's training and performance.
      • Impact: Incomplete training data retrieval, leading to poor model performance.
      • Recommendation: Implement pagination using OpenSearch's Scroll API to retrieve all documents from the indices.
  • 🟡 Warnings

    • Security Risk of ast.literal_eval: The use of ast.literal_eval for parsing JSON data poses a security risk, especially when dealing with data from external sources.
      • Potential risks: Potential code injection or denial-of-service attacks if malicious strings are injected into the OpenSearch index.
      • Suggested improvements: Replace ast.literal_eval with json.loads for parsing JSON data, which is safer and more appropriate for this task.

3.2 Code Quality Concerns

  • Maintainability aspects: The code is generally maintainable, but there are opportunities to improve error handling and logging for better debugging and monitoring.
  • Readability issues: The code is readable, but some improvements in variable naming and comments can enhance understanding.
  • Performance bottlenecks: The current implementation of get_training_data may introduce performance bottlenecks due to the lack of proper pagination. This can be addressed by using the Scroll API for efficient data retrieval.

4. Security Assessment

  • Authentication/Authorization impacts: The new vector store relies on OpenSearch client authentication, which must be securely configured to prevent unauthorized access.
  • Data handling concerns: Ensure that embedding models and data handled by the vector store are securely managed to prevent data leakage or unauthorized access.
  • Input validation: Validate all inputs to the vector store methods to prevent injection attacks or malformed data from causing errors (a minimal sketch appears after this list).
  • Security best practices: Follow best practices for secure configuration and management of OpenSearch and embedding models.
  • Potential security risks: The use of ast.literal_eval poses a security risk, which can be mitigated by using json.loads.
  • Mitigation strategies: Implement robust input validation, secure configuration, and use safer JSON parsing methods.
  • Security testing requirements: Conduct thorough security testing, including penetration testing and code reviews, to identify and mitigate potential vulnerabilities.
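
As a concrete, if minimal, example of the input validation called for above, a guard like the following could sit at the top of the query-facing methods. This is a sketch, not code from the PR; the helper name and length cap are arbitrary.

  # Hypothetical guard clause for query-facing methods; not part of the PR.
  def _validate_question(question) -> str:
      if not isinstance(question, str) or not question.strip():
          raise ValueError("question must be a non-empty string")
      if len(question) > 10_000:  # arbitrary cap to avoid oversized embedding requests
          raise ValueError("question is too long to embed")
      return question.strip()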

5. Testing Strategy

5.1 Test Coverage

  • Unit test analysis: Ensure that all methods in the OpenSearch_Semantic_VectorStore class are covered by unit tests, including edge cases and error handling.
  • Integration test requirements: Perform integration tests to validate the interaction between the vector store and OpenSearch, as well as the embedding generation process.
  • Edge cases coverage: Test the behavior of the vector store with various data inputs, including large datasets and malformed data.

5.2 Test Recommendations

Suggested Test Cases

  # Example unit test for get_similar_question_sql
  def test_get_similar_question_sql(self):
      store = OpenSearch_Semantic_VectorStore(config=test_config)
      question = "SELECT * FROM users;"
      results = store.get_similar_question_sql(question)
      assert len(results) > 0
      assert all(isinstance(result, dict) for result in results)

  # Example integration test for get_training_data
  def test_get_training_data(self):
      store = OpenSearch_Semantic_VectorStore(config=test_config)
      data = store.get_training_data()
      assert not data.empty
      assert set(data.columns) == {"id", "training_data_type", "question", "content"}
  • Coverage improvements: Ensure that all code paths, including error handling, are covered by tests (see the example after this list).
  • Performance testing needs: Perform performance testing to measure the latency and throughput of embedding generation and vector search operations under different data volumes and query loads.
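
For the coverage-improvement and edge-case points above, a test along these lines would pin down the malformed-row handling in get_training_data. It is a sketch only: like the examples above, it assumes a reachable test OpenSearch instance and the test_config fixture, and it relies on the sql_store/question_sql_index attributes shown in the quoted code.

  # Sketch: malformed rows in the SQL index should be skipped, not crash get_training_data.
  def test_get_training_data_skips_malformed_sql_rows(self):
      store = OpenSearch_Semantic_VectorStore(config=test_config)
      store.sql_store.client.index(
          index=store.question_sql_index,
          body={"text": "not valid json at all"},
          refresh=True,
      )
      data = store.get_training_data()
      # The malformed row is logged and skipped; the call still returns a DataFrame.
      contents = set(data["content"]) if "content" in data.columns else set()
      assert "not valid json at all" not in contents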

6. Documentation & Maintenance

  • Documentation updates needed: Update the API documentation to include the new OpenSearch_Semantic_VectorStore class and its methods. Provide examples of how to use the new vector store for semantic search.
  • Long-term maintenance considerations: Monitor the performance and resource utilization of the vector store, especially as data volumes grow. Regularly review and update the embedding models and OpenSearch configurations to ensure optimal performance and security.
  • Technical debt and monitoring requirements: Address any technical debt introduced by the new feature, such as the need for proper pagination in get_training_data. Implement monitoring to track the performance and resource usage of the vector store.

7. Deployment & Operations

  • Deployment impact and strategy: The new vector store requires deployment of the updated code and configuration of the OpenSearch client and embedding models. Ensure that the deployment process includes thorough testing to validate the new functionality.
  • Key operational considerations: Monitor the performance and resource utilization of the vector store in the production environment. Implement logging and alerting to detect and respond to any issues or anomalies.

8. Summary & Recommendations

8.1 Key Action Items

  1. Implement pagination using OpenSearch's Scroll API in get_training_data to retrieve all documents from the indices.
  2. Replace ast.literal_eval with json.loads for parsing JSON data to improve security and robustness.
  3. Ensure comprehensive unit and integration tests for the OpenSearch_Semantic_VectorStore class.
  4. Update documentation to include the new vector store and its usage.

8.2 Future Considerations

  • Technical evolution path: Continuously improve the vector store implementation based on performance monitoring and user feedback. Explore opportunities to optimize embedding generation and vector search algorithms.
  • Business capability evolution: As the system evolves, consider extending the semantic search capabilities to other data types and use cases beyond SQL generation and documentation retrieval.
  • System integration impacts: Ensure that the new vector store integrates seamlessly with other components of the Vanna system and external systems, such as data ingestion pipelines and user interfaces.

@asaintsever
Contributor Author

Took the bot's improvement proposals into account.

@zainhoda merged commit b507f4d into vanna-ai:main on Feb 8, 2025