Replies: 2 comments 1 reply
-
To resolve the performance inconsistencies in your web service using LangChain and LangServe, you can try the following steps:
Here is an example of ensuring the gRPC channel is ready:

```python
def _get_channel(self) -> Any:
    try:
        import grpc
    except ImportError:
        raise ImportError(
            "Could not import grpcio python package. "
            "Please install it with `pip install grpcio`."
        )
    channel = (
        grpc.secure_channel(
            self.target, self.grpc_credentials, options=self.grpc_options
        )
        if self.grpc_use_secure
        else grpc.insecure_channel(self.target, options=self.grpc_options)
    )
    # Ensure the channel is ready (blocks until connected or the timeout expires)
    grpc.channel_ready_future(channel).result(timeout=10)
    return channel
```

To implement caching, you can use the `InMemoryCache` from `langchain_core.caches`:

```python
from langchain_core.caches import InMemoryCache

cache = InMemoryCache()
# Example usage
prompt = "example prompt"
llm_string = "example llm configuration"
# Check cache before making a request
cached_result = cache.lookup(prompt, llm_string)
if cached_result is None:
    # Make the API call if not in cache
    result = make_api_call(prompt, llm_string)
    # Update the cache with the result
    cache.update(prompt, llm_string, result)
else:
    result = cached_result
```
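In practice you usually don't call `lookup`/`update` by hand; LangChain can apply a cache to every LLM call globally. A minimal sketch, assuming a recent `langchain_core` where `set_llm_cache` is available:

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# After this call, supported LLM / chat model invocations are cached automatically,
# so repeated identical prompts skip the API round trip.
set_llm_cache(InMemoryCache())
```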
To use OpenSearchVectorSearch for similarity searches, you can follow this example:

```python
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings

# Create an OpenSearch client
opensearch_url = "http://localhost:9200"
index_name = "embeddings"
embedding_function = OpenAIEmbeddings()

vectorstore = OpenSearchVectorSearch(
    opensearch_url=opensearch_url,
    index_name=index_name,
    embedding_function=embedding_function,
)

# Perform a similarity search
query = "example query"
results = vectorstore.similarity_search(query, k=4)
for doc in results:
    print(doc.page_content)
```

By ensuring the gRPC channel is ready, implementing caching, and using efficient vector stores, you can mitigate some of the performance inconsistencies caused by network issues and repeated API calls [1][2][3][4][5].
-
LangServe and LangChain work well for IO-bound tasks (e.g., when using an API). LangServe was not designed to help manage hardware resources. The code that you shared will likely not work without major rework. Instead, I'd try to figure out how to get Ollama working in parallel, and see whether it's possible to use Ollama concurrently from LangChain using the community integration.
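For illustration only (this is not from the reply above), here is a minimal sketch of what using Ollama concurrently from LangChain could look like with the community integration and the async batch API; it assumes an Ollama server is already running locally, and the model name is just a placeholder:

```python
import asyncio

from langchain_community.chat_models import ChatOllama

# Placeholder model name; any model already pulled into the local Ollama server works
llm = ChatOllama(model="mistral")

async def main() -> None:
    prompts = ["First question", "Second question", "Third question"]
    # abatch sends the requests concurrently from the client side; whether they
    # actually run in parallel depends on how the Ollama server is configured
    results = await llm.abatch(prompts)
    for message in results:
        print(message.content)

asyncio.run(main())
```

Server-side parallelism is a separate knob: recent Ollama versions expose settings such as the OLLAMA_NUM_PARALLEL environment variable, which would need to be checked independently of the LangChain code.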
-
Checked other resources
Commit to Help
Example Code
Description
Problem Overview:
I have developed a web service that leverages LangChain and LangServe for performing similarity-based searches. However, I've encountered an issue where the service delivers responses at inconsistent speeds: sometimes responses are quick, while at other times they are noticeably slower. This inconsistency in response times is hurting the overall user experience and the reliability of the service.
Result when I invoke the model:
Technologies Used:
LangChain: Integrated with LlamaCpp to manage language models and processing.
LangServe: Used for handling API requests and responses effectively.
OpenAI Embeddings: Employed for generating embeddings and supporting similarity searches.
FAISS: Utilized for efficient storage and retrieval of embeddings.
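To make the setup concrete, here is a rough sketch (assumed, not the actual service code) of how these pieces typically fit together: a FAISS retriever built from OpenAI embeddings, a LlamaCpp model, and the resulting chain exposed through LangServe on a FastAPI app. The index path, prompt, and route are placeholders:

```python
from fastapi import FastAPI
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langserve import add_routes

# Load a previously built FAISS index (placeholder path)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
retriever = vectorstore.as_retriever()

# Local GGUF model served via llama-cpp-python
llm = LlamaCpp(model_path="mistral-7b-openorca.Q4_0.gguf")

prompt = PromptTemplate.from_template(
    "Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)

# Retrieval-augmented chain: fetch context, fill the prompt, generate with LlamaCpp
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

# Expose the chain through LangServe
app = FastAPI()
add_routes(app, chain, path="/search")
```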
System Info
Problem Details:
Environment: llama-cpp-python 0.2.76, langchain 0.2.1
Model: mistral-7b-openorca.Q4_0.gguf (Hugging Face)
Current Configuration:
AWS Instance: g5.4xlarge (24 vCPU, 64 GB RAM)
llama-cpp-python Version: 0.2.76
Web Server Framework: FastAPI
Operating System: Linux