Context in the README. Show how to score chat responses based on a follow-up from the user and then log that as feedback in LangSmith.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,116 @@
# Chat Bot Feedback Template

This template shows how to evaluate your chat bot without explicit user feedback. It defines a simple chat bot in [chain.py](./chat_bot_feedback/chain.py) and a custom evaluator that scores bot response effectiveness based on the subsequent user response. You can apply this run evaluator to your own chat bot by calling `with_config` on the chat bot before serving. You can also directly deploy your chat app using this template.

[Chat bots](https://python.langchain.com/docs/use_cases/chatbots) are one of the most common interfaces for deploying LLMs. The quality of chat bots varies, making continuous development important. But users rarely leave explicit feedback through mechanisms like thumbs-up or thumbs-down buttons. Furthermore, traditional analytics such as "session length" or "conversation length" often lack clarity. However, multi-turn conversations with a chat bot can provide a wealth of information, which we can transform into metrics for fine-tuning, evaluation, and product analytics.

Taking [Chat Langchain](https://chat.langchain.com/) as a case study, only about 0.04% of all queries receive explicit feedback. Yet, approximately 70% of the queries are follow-ups to previous questions. A significant portion of these follow-up queries contain useful information we can use to infer the quality of the previous AI response.

This template helps solve this "feedback scarcity" problem. Below is an example invocation of this chat bot:

[![Chat Interaction](./static/chat_interaction.png)](https://smith.langchain.com/public/3378daea-133c-4fe8-b4da-0a3044c5dbe8/r?runtab=1)

When the user responds to this ([link](https://smith.langchain.com/public/a7e2df54-4194-455d-9978-cecd8be0df1e/r)), the response evaluator is invoked, resulting in the following evaluation run:

[![Evaluator Run](./static/evaluator.png)](https://smith.langchain.com/public/534184ee-db8f-4831-a386-3f578145114c/r)

As shown, the evaluator sees that the user is increasingly frustrated, indicating that the prior response was not effective.

## LangSmith Feedback

[LangSmith](https://smith.langchain.com/) is a platform for building production-grade LLM applications. Beyond its debugging and offline evaluation features, LangSmith helps you capture both user and model-assisted feedback to refine your LLM application. This template uses an LLM to generate feedback for your application, which you can use to continuously improve your service. For more examples on collecting feedback using LangSmith, consult the [documentation](https://docs.smith.langchain.com/cookbook/feedback-examples).

## Evaluator Implementation

The user feedback is inferred by a custom `RunEvaluator`. This evaluator is called using the `EvaluatorCallbackHandler`, which runs it in a separate thread to avoid interfering with the chat bot's runtime. You can use this custom evaluator on any compatible chat bot by calling the following method on your LangChain object:

```python
from langchain.callbacks.tracers.evaluation import EvaluatorCallbackHandler

# ResponseEffectivenessEvaluator and evaluate_response_effectiveness are
# defined in this template's chain.py.
my_chain.with_config(
    callbacks=[
        EvaluatorCallbackHandler(
            evaluators=[
                ResponseEffectivenessEvaluator(evaluate_response_effectiveness)
            ]
        )
    ],
)
```

The evaluator instructs an LLM, specifically `gpt-3.5-turbo`, to evaluate the AI's most recent chat message based on the user's follow-up response. It generates a score and accompanying reasoning, which are converted to feedback in LangSmith and applied to the run whose ID was provided as `last_run_id`.
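
For reference, the feedback written this way is equivalent to what you could log yourself with the LangSmith client, for example from your own evaluation script. Below is a minimal sketch of that manual path; the run ID, score, and comment are purely illustrative, since in the template they come from the evaluator and the client-supplied `last_run_id`:

```python
from langsmith import Client

client = Client()

# Hypothetical values for illustration only.
client.create_feedback(
    run_id="replace-with-the-run-id-of-the-ai-response",
    key="response_effectiveness",
    score=0.2,  # the template normalizes the 0-5 grade to [0, 1]
    comment="The user repeated the question with increasing frustration.",
)
```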

The prompt used within the LLM [is available on the hub](https://smith.langchain.com/hub/wfh/response-effectiveness). Feel free to customize it with things like additional app context (such as the goal of the app or the types of questions it should respond to) or "symptoms" you'd like the LLM to focus on. This evaluator also utilizes OpenAI's function-calling API to ensure a more consistent, structured output for the grade.
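
If you prefer not to edit the hub prompt, you can also swap in a local prompt. Below is a minimal sketch, assuming you keep the `dialog` input variable that `format_dialog` in `chain.py` produces; the system message is illustrative app context, not part of the template:

```python
from langchain.prompts import ChatPromptTemplate

# Illustrative stand-in for hub.pull("wfh/response-effectiveness").
# It must accept the same "dialog" variable produced by format_dialog.
custom_evaluation_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You grade the final AI reply in a chat for a documentation Q&A app. "
            "Penalize replies the user had to rephrase or repeat.",
        ),
        ("user", "Here is the dialog so far:\n\n{dialog}\n\nGrade the last AI reply."),
    ]
)
```

You could then compose `evaluate_response_effectiveness` with `custom_evaluation_prompt` in place of the prompt pulled from the hub.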

## Environment Variables

Ensure that `OPENAI_API_KEY` is set to use OpenAI models. Also, configure LangSmith by setting your `LANGSMITH_API_KEY`.

```bash
export OPENAI_API_KEY=sk-...
export LANGSMITH_API_KEY=...
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_PROJECT=my-project  # Set to the project you want to save to
```

## Usage

If deploying via `LangServe`, we recommend configuring the server to return callback events as well. This will ensure the backend traces are included in whatever traces you generate using the `RemoteRunnable`.

```python
from chat_bot_feedback.chain import chain

add_routes(app, chain, path="/chat-bot-feedback", include_callback_events=True)
```
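
For reference, a complete minimal server might look like the sketch below. The file name, app title, and host/port are illustrative (the port matches the client snippet that follows), not something the template prescribes:

```python
# serve.py -- minimal LangServe app for this template (illustrative)
from fastapi import FastAPI
from langserve import add_routes

from chat_bot_feedback.chain import chain

app = FastAPI(title="Chat Bot Feedback")

# include_callback_events=True returns callback metadata (including run IDs)
# to the RemoteRunnable client so its traces line up with the server's.
add_routes(app, chain, path="/chat-bot-feedback", include_callback_events=True)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="127.0.0.1", port=8031)
```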

With the server running, you can use the following code snippet to stream the chat bot responses for a two-turn conversation.

```python
from functools import partial
from typing import Callable, List, Optional

from langchain.callbacks.manager import tracing_v2_enabled
from langchain.schema import AIMessage, BaseMessage, HumanMessage
from langserve import RemoteRunnable

# Update with the URL provided by your LangServe server
chain = RemoteRunnable("http://127.0.0.1:8031/chat-bot-feedback")


def stream_content(
    text: str,
    chat_history: Optional[List[BaseMessage]] = None,
    last_run_id: Optional[str] = None,
    on_chunk: Optional[Callable] = None,
):
    results = []
    with tracing_v2_enabled() as cb:
        for chunk in chain.stream(
            {"text": text, "chat_history": chat_history, "last_run_id": last_run_id},
        ):
            if on_chunk:
                on_chunk(chunk)
            results.append(chunk)
    last_run_id = cb.latest_run.id if cb.latest_run else None
    return last_run_id, "".join(results)


chat_history = []
text = "Where are my keys?"
last_run_id, response_message = stream_content(text, on_chunk=partial(print, end=""))
print()
chat_history.extend([HumanMessage(content=text), AIMessage(content=response_message)])
# The previous response will likely receive a low score,
# as the user's frustration appears to be escalating.
text = "I CAN'T FIND THEM ANYWHERE"
last_run_id, response_message = stream_content(
    text,
    chat_history=chat_history,
    last_run_id=str(last_run_id),
    on_chunk=partial(print, end=""),
)
print()
chat_history.extend([HumanMessage(content=text), AIMessage(content=response_message)])
```

This uses the `tracing_v2_enabled` callback manager to get the run ID of the call, which we provide in subsequent calls in the same chat thread, so the evaluator can assign feedback to the appropriate trace.

## Conclusion

This template provides a simple chat bot definition you can directly deploy using LangServe. It defines a custom evaluator to log evaluation feedback for the bot without any explicit user ratings. This is an effective way to augment your analytics and to better select data points for fine-tuning and evaluation.
@@ -0,0 +1,3 @@
from chat_bot_feedback.chain import chain

__all__ = ["chain"]
@@ -0,0 +1,182 @@
from __future__ import annotations

from typing import List, Optional

from langchain import hub
from langchain.callbacks.tracers.evaluation import EvaluatorCallbackHandler
from langchain.callbacks.tracers.schemas import Run
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    StrOutputParser,
    get_buffer_string,
)
from langchain.schema.runnable import Runnable
from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example
from pydantic import BaseModel, Field

###############################################################################
# | Chat Bot Evaluator Definition
# | This section defines an evaluator that evaluates any chat bot
# | without explicit user feedback. It formats the dialog up to
# | the current message and then instructs an LLM to grade the last AI response
# | based on the subsequent user response. If no chat history is present,
# V the evaluator is not called.
###############################################################################


class ResponseEffectiveness(BaseModel):
    """Score the effectiveness of the AI chat bot response."""

    reasoning: str = Field(
        ...,
        description="Explanation for the score.",
    )
    score: int = Field(
        ...,
        min=0,
        max=5,
        description="Effectiveness of AI's final response.",
    )


def format_messages(input: dict) -> List[BaseMessage]:
    """Format the messages for the evaluator."""
    chat_history = input.get("chat_history") or []
    results = []
    for message in chat_history:
        if message["type"] == "human":
            results.append(HumanMessage.parse_obj(message))
        else:
            results.append(AIMessage.parse_obj(message))
    return results


def format_dialog(input: dict) -> dict:
    """Format messages and convert to a single string."""
    chat_history = format_messages(input)
    formatted_dialog = get_buffer_string(chat_history) + f"\nhuman: {input['text']}"
    return {"dialog": formatted_dialog}


def normalize_score(response: dict) -> dict:
    """Normalize the score to be between 0 and 1."""
    response["score"] = int(response["score"]) / 5
    return response


# To view the prompt in the playground: https://smith.langchain.com/hub/wfh/response-effectiveness
evaluation_prompt = hub.pull("wfh/response-effectiveness")
evaluate_response_effectiveness = (
    format_dialog
    | evaluation_prompt
    # bind_functions formats the schema for the OpenAI function
    # calling endpoint, which returns more reliable structured data.
    | ChatOpenAI(model="gpt-3.5-turbo").bind_functions(
        functions=[ResponseEffectiveness],
        function_call="ResponseEffectiveness",
    )
    # Convert the model's output to a dict
    | JsonOutputFunctionsParser(args_only=True)
    | normalize_score
)


class ResponseEffectivenessEvaluator(RunEvaluator):
    """Evaluate the chat bot based on the subsequent user responses."""

    def __init__(self, evaluator_runnable: Runnable) -> None:
        super().__init__()
        self.runnable = evaluator_runnable

    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        # This evaluator grades the AI's PREVIOUS response.
        # If no chat history is present, there isn't anything to evaluate
        # (it's the user's first message)
        if not run.inputs.get("chat_history"):
            return EvaluationResult(
                key="response_effectiveness", comment="No chat history present."
            )
        # This only occurs if the client isn't correctly sending the run IDs
        # of the previous calls.
        elif "last_run_id" not in run.inputs:
            return EvaluationResult(
                key="response_effectiveness", comment="No last run ID present."
            )
        # Call the LLM to evaluate the response
        eval_grade: Optional[dict] = self.runnable.invoke(run.inputs)
        target_run_id = run.inputs["last_run_id"]
        return EvaluationResult(
            **eval_grade,
            key="response_effectiveness",
            target_run_id=target_run_id,  # Requires langsmith >= 0.0.54
        )


###############################################################################
# | The chat bot definition
# | This is what is actually exposed by LangServe in the API
# | It can be any chain that accepts the ChainInput schema and returns a str
# | all that is required is the with_config() call at the end to add the
# V evaluators as "listeners" to the chain.
###############################################################################


class ChainInput(BaseModel):
    """Input for the chat bot."""

    chat_history: Optional[List[BaseMessage]] = Field(
        description="Previous chat messages."
    )
    text: str = Field(..., description="User's latest query.")
    last_run_id: Optional[str] = Field("", description="Run ID of the last run.")


_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant who speaks like a pirate",
        ),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{text}"),
    ]
)
_model = ChatOpenAI()


def format_chat_history(chain_input: dict) -> dict:
    messages = format_messages(chain_input)

    return {
        "chat_history": messages,
        "text": chain_input.get("text"),
    }


# if you update the name of this, you MUST also update ../pyproject.toml
# with the new `tool.langserve.export_attr`
chain = (
    (format_chat_history | _prompt | _model | StrOutputParser())
    .with_types(input_type=ChainInput)
    # This is to add the evaluators as "listeners"
    # and to customize the name of the chain.
    # Any chain that accepts a compatible input type works here.
    .with_config(
        run_name="ChatBot",
        callbacks=[
            EvaluatorCallbackHandler(
                evaluators=[
                    ResponseEffectivenessEvaluator(evaluate_response_effectiveness)
                ]
            )
        ],
    )
)