Replies: 2 comments 1 reply
-
To resolve the performance inconsistencies in your web service using LangChain and LangServe, you can try the following steps:
Here is an example of ensuring the gRPC channel is ready:

```python
def _get_channel(self) -> Any:
    try:
        import grpc
    except ImportError:
        raise ImportError(
            "Could not import grpcio python package. "
            "Please install it with `pip install grpcio`."
        )
    channel = (
        grpc.secure_channel(
            self.target, self.grpc_credentials, options=self.grpc_options
        )
        if self.grpc_use_secure
        else grpc.insecure_channel(self.target, options=self.grpc_options)
    )
    # Ensure the channel is ready (blocks until connected or the timeout expires)
    grpc.channel_ready_future(channel).result(timeout=10)
    return channel
```

To implement caching, you can use the `InMemoryCache` from `langchain_core.caches`:

```python
from langchain_core.caches import InMemoryCache

cache = InMemoryCache()
# Example usage
prompt = "example prompt"
llm_string = "example llm configuration"
# Check cache before making a request
cached_result = cache.lookup(prompt, llm_string)
if cached_result is None:
    # Make the API call if not in cache
    result = make_api_call(prompt, llm_string)
    # Update the cache with the result
    cache.update(prompt, llm_string, result)
else:
    result = cached_result
```
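In practice you usually don't call `lookup`/`update` by hand; LangChain can apply a cache to every LLM call globally. A minimal sketch, assuming a recent `langchain_core` where `set_llm_cache` is available:

```python
from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

# After this call, supported LLM / chat model invocations are cached automatically,
# so repeated identical prompts skip the API round trip.
set_llm_cache(InMemoryCache())
```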
To use OpenSearchVectorSearch for similarity searches, you can follow this example:

```python
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings

# Create an OpenSearch client
opensearch_url = "http://localhost:9200"
index_name = "embeddings"
embedding_function = OpenAIEmbeddings()

vectorstore = OpenSearchVectorSearch(
    opensearch_url=opensearch_url,
    index_name=index_name,
    embedding_function=embedding_function,
)

# Perform a similarity search
query = "example query"
results = vectorstore.similarity_search(query, k=4)
for doc in results:
    print(doc.page_content)
```

By ensuring the gRPC channel is ready, implementing caching, and using efficient vector stores, you can mitigate some of the performance inconsistencies caused by network issues and repeated API calls [1][2][3][4][5].
-
LangServe and LangChain work well for IO-bound tasks (e.g., when using an API). LangServe was not designed to help manage hardware resources. The code that you shared will likely not work without major rework. Instead, I'd try to figure out how to get Ollama working in parallel, and see whether it's possible to use Ollama concurrently from LangChain using the community integration.
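For illustration only (this is not from the reply above), here is a minimal sketch of what using Ollama concurrently from LangChain could look like with the community integration and the async batch API; it assumes an Ollama server is already running locally, and the model name is just a placeholder:

```python
import asyncio

from langchain_community.chat_models import ChatOllama

# Placeholder model name; any model already pulled into the local Ollama server works
llm = ChatOllama(model="mistral")

async def main() -> None:
    prompts = ["First question", "Second question", "Third question"]
    # abatch sends the requests concurrently from the client side; whether they
    # actually run in parallel depends on how the Ollama server is configured
    results = await llm.abatch(prompts)
    for message in results:
        print(message.content)

asyncio.run(main())
```

Server-side parallelism is a separate knob: recent Ollama versions expose settings such as the OLLAMA_NUM_PARALLEL environment variable, which would need to be checked independently of the LangChain code.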
-
Checked other resources
Commit to Help
Example Code
Description
Problem Overview:
I have developed a web service that leverages LangChain and LangServe for performing similarity-based searches. However, I've encountered an issue where the service delivers responses at inconsistent speeds: sometimes responses are quick, while at other times they are noticeably slower. This inconsistency in response times is hurting the overall user experience and the reliability of the service.
Result when I invoke the model:
Technologies Used:
LangChain: Integrated with LlamaCpp to manage language models and processing.
LangServe: Used for handling API requests and responses effectively.
OpenAI Embeddings: Employed for generating embeddings and supporting similarity searches.
FAISS: Utilized for efficient storage and retrieval of embeddings.
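To make the setup concrete, here is a rough sketch (assumed, not the actual service code) of how these pieces typically fit together: a FAISS retriever built from OpenAI embeddings, a LlamaCpp model, and the resulting chain exposed through LangServe on a FastAPI app. The index path, prompt, and route are placeholders:

```python
from fastapi import FastAPI
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langserve import add_routes

# Load a previously built FAISS index (placeholder path)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.load_local(
    "faiss_index", embeddings, allow_dangerous_deserialization=True
)
retriever = vectorstore.as_retriever()

# Local GGUF model served via llama-cpp-python
llm = LlamaCpp(model_path="mistral-7b-openorca.Q4_0.gguf")

prompt = PromptTemplate.from_template(
    "Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)

# Retrieval-augmented chain: fetch context, fill the prompt, generate with LlamaCpp
chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

# Expose the chain through LangServe
app = FastAPI()
add_routes(app, chain, path="/search")
```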
System Info
Problem Details:
Environment: llama-cpp-python 0.2.76, langchain 0.2.1
Model: mistral-7b-openorca.Q4_0.gguf (Hugging Face)
Current Configuration:
AWS Instance: g5.4xlarge (24 vCPU, 64 GB RAM)
llama-cpp-python Version: 0.2.76
Web Server Framework: FastAPI
Operating System: Linux