Conversational Feedback #12590
Merged

Changes from 5 commits (10 commits in total)

Commits:
7deb4a7 Add conversational / chat feedback example (hinthornw)
f4397e2 update readme (hinthornw)
8400eb1 update (hinthornw)
b31cf68 Update readme (hinthornw)
7e918cb Merge branch 'master' into wfh/conversational_feedback (hinthornw)
c9066d0 Merge branch 'master' into wfh/conversational_feedback (hinthornw)
83d2b71 doc (hinthornw)
b099d53 Merge branch 'master' into wfh/conversational_feedback (hinthornw)
2ba9e5f imgs (hinthornw)
04f3134 rename (hinthornw)

templates/conversational-feedback/LICENSE

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 LangChain, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

templates/conversational-feedback/README.md

@@ -0,0 +1,86 @@
# Chat Feedback Template

This template captures implicit feedback from human behavior in a simple chat bot. It instructs an LLM to reference a user's responses within a conversation to evaluate the chat bot's previous replies.

[Chat bots](https://python.langchain.com/docs/use_cases/chatbots) are one of the most common interfaces for deploying LLMs. Their quality varies, making continuous development important, but users are loath to leave explicit feedback through mechanisms like thumbs-up or thumbs-down buttons, and traditional analytics such as "session length" or "conversation length" are ambiguous on their own. Multi-turn conversations, however, carry a wealth of information that we can transform into metrics for fine-tuning, evaluation, and product analytics.

Taking [Chat Langchain](https://chat.langchain.com/) as a case study, only about 0.04% of all queries receive explicit feedback, yet approximately 70% of queries are follow-ups to previous questions. A significant portion of these follow-ups contain useful information we can use to infer the quality of the previous AI response.

## LangSmith Feedback

[LangSmith](https://smith.langchain.com/) is a platform for building production-grade LLM applications. Beyond its debugging and offline evaluation features, LangSmith helps you capture both user and model-assisted feedback to refine your LLM application. For more examples of collecting feedback using LangSmith, consult the [documentation](https://docs.smith.langchain.com/cookbook/feedback-examples).

## Implementation

Feedback collection occurs within a custom `RunEvaluator`. This evaluator is invoked by the `EvaluatorCallbackHandler`, which runs it in a separate thread so it does not interfere with the chat bot's runtime.

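In outline, the wiring looks like the following minimal sketch. Here `my_evaluator` is a stand-in for the `ResponseEffectivenessEvaluator` defined in this template's `chain.py`:

```python
from langchain.callbacks.tracers.evaluation import EvaluatorCallbackHandler

# Attach the evaluator as a listener: every traced run of the chain is
# handed to `my_evaluator` on a background thread.
chain = chain.with_config(
    callbacks=[EvaluatorCallbackHandler(evaluators=[my_evaluator])],
)
```
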
The evaluator instructs an LLM, specifically `gpt-3.5-turbo`, to evaluate the AI's most recent chat message in light of the user's follow-up response. It generates a score and accompanying reasoning, which are converted to feedback in LangSmith and applied to the run whose ID is provided as `last_run_id`.

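For reference, the equivalent manual call with the LangSmith client looks roughly like the sketch below; the template performs this step for you through the evaluator callback, and `previous_run_id` is a hypothetical name for the prior turn's run ID:

```python
from langsmith import Client

client = Client()
# `previous_run_id` would be the `last_run_id` threaded through
# from the prior chat turn.
client.create_feedback(
    run_id=previous_run_id,
    key="response_effectiveness",
    score=0.8,  # the evaluator normalizes scores to [0, 1]
    comment="The user's follow-up suggests the previous answer was helpful.",
)
```
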
The prompt used within the LLM [is available on the hub](https://smith.langchain.com/hub/wfh/response-effectiveness). Feel free to customize it with things like additional app context (such as the goal of the app or the types of questions it should respond to) or "symptoms" you'd like the LLM to focus on. This evaluator also utilizes OpenAI's function-calling API to ensure a more consistent, structured output for the grade.

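For example, a customized grading prompt might look like the sketch below. The system message content is hypothetical; the one fixed constraint is that the prompt must accept the `dialog` variable produced by `format_dialog` in `chain.py`:

```python
from langchain.prompts import ChatPromptTemplate

custom_evaluation_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You grade the effectiveness of an AI assistant for a documentation "
            "Q&A product. Watch for symptoms like the user rephrasing their "
            "question or expressing frustration.",
        ),
        ("human", "Here is the conversation so far:\n\n{dialog}\n\nGrade the AI's last reply."),
    ]
)
```
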
## Environment Variables

Ensure that `OPENAI_API_KEY` is set to use OpenAI models. Also, configure LangSmith by setting your `LANGSMITH_API_KEY`.

```bash
export OPENAI_API_KEY=sk-...
export LANGSMITH_API_KEY=...
export LANGCHAIN_TRACING_V2=true
```

## Usage

If deploying via `LangServe`, we recommend configuring the server to return callback events as well. This ensures the backend traces are included in whatever traces you generate using the `RemoteRunnable`.

```python
from fastapi import FastAPI
from langserve import add_routes

from conversational_feedback.chain import chain

app = FastAPI()
add_routes(app, chain, path="/conversational-feedback", include_callback_events=True)
```

With the server running, you can use the following code snippet to stream the chat bot's responses over a two-turn conversation.

```python
from functools import partial
from typing import Callable, List, Optional

from langchain.callbacks.manager import tracing_v2_enabled
from langchain.schema import AIMessage, BaseMessage, HumanMessage
from langserve import RemoteRunnable

# Update with the URL provided by your LangServe server
chain = RemoteRunnable("http://127.0.0.1:8031/conversational-feedback")


def stream_content(
    text: str,
    chat_history: Optional[List[BaseMessage]] = None,
    last_run_id: Optional[str] = None,
    on_chunk: Optional[Callable] = None,
):
    results = []
    with tracing_v2_enabled() as cb:
        for chunk in chain.stream(
            {"text": text, "chat_history": chat_history, "last_run_id": last_run_id},
        ):
            if on_chunk:
                on_chunk(chunk)
            results.append(chunk)
        last_run_id = cb.latest_run.id if cb.latest_run else None
    return last_run_id, "".join(results)


chat_history = []
text = "Where are my keys?"
last_run_id, response_message = stream_content(text, on_chunk=partial(print, end=""))
print()
chat_history.extend([HumanMessage(content=text), AIMessage(content=response_message)])
# The previous response will likely receive a low score,
# as the user's frustration appears to be escalating.
text = "I CAN'T FIND THEM ANYWHERE"
last_run_id, response_message = stream_content(
    text,
    chat_history=chat_history,
    last_run_id=str(last_run_id),
    on_chunk=partial(print, end=""),
)
print()
chat_history.extend([HumanMessage(content=text), AIMessage(content=response_message)])
```

This uses the `tracing_v2_enabled` callback manager to get the run ID of each call, which we provide in subsequent calls within the same chat thread so that the evaluator can assign feedback to the appropriate trace.
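
To confirm the evaluator is actually writing feedback, you can query LangSmith directly with the `langsmith` client. A minimal sketch, assuming the `last_run_id` collected above:

```python
from langsmith import Client

client = Client()
# Fetch any feedback the evaluator attached to the traced chat run.
# Feedback is written on a background thread, so it may take a moment to appear.
for fb in client.list_feedback(run_ids=[last_run_id]):
    print(fb.key, fb.score, fb.comment)
```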
3 changes: 3 additions & 0 deletions
templates/conversational-feedback/conversational_feedback/__init__.py

@@ -0,0 +1,3 @@
from conversational_feedback.chain import chain

__all__ = ["chain"]

166 changes: 166 additions & 0 deletions
templates/conversational-feedback/conversational_feedback/chain.py

@@ -0,0 +1,166 @@
from __future__ import annotations

from typing import List, Optional

from langchain import hub
from langchain.callbacks.tracers.evaluation import EvaluatorCallbackHandler
from langchain.callbacks.tracers.schemas import Run
from langchain.chains.openai_functions.base import convert_to_openai_function
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.schema import (
    AIMessage,
    BaseMessage,
    HumanMessage,
    StrOutputParser,
    get_buffer_string,
)
from langchain.schema.runnable import Runnable
from langsmith.evaluation import EvaluationResult, RunEvaluator
from langsmith.schemas import Example
from pydantic import BaseModel, Field

### The feedback model used for the "function definition" provided to OpenAI
# For use with open source models, you can add the schema directly,
# but some modifications to the prompt and parser will be needed


class ResponseEffectiveness(BaseModel):
    """Score the effectiveness of the AI chat bot response."""

    reasoning: str = Field(
        ...,
        description="Explanation for the score.",
    )
    score: int = Field(
        ...,
        min=0,
        max=5,
        description="Effectiveness of AI's final response.",
    )


def format_messages(input: dict) -> List[BaseMessage]:
    """Format the messages for the evaluator."""
    chat_history = input.get("chat_history") or []
    results = []
    for message in chat_history:
        if message["type"] == "human":
            results.append(HumanMessage.parse_obj(message))
        else:
            results.append(AIMessage.parse_obj(message))
    return results


def format_dialog(input: dict) -> dict:
    """Format the dialog for the evaluator."""
    chat_history = format_messages(input)
    formatted_dialog = get_buffer_string(chat_history)  # + f"\nhuman: {input['text']}"
    return {"dialog": formatted_dialog}


def normalize_score(response: dict) -> dict:
    """Normalize the score to be between 0 and 1."""
    response["score"] = int(response["score"]) / 5
    return response


evaluation_prompt = hub.pull("wfh/response-effectiveness")
evaluate_response_effectiveness = (
    # format_dialog is a plain function that takes a dict and returns a dict;
    # piping it into the prompt coerces it into a RunnableLambda
    format_dialog
    | evaluation_prompt
    # bind() provides the requested schemas to the model for structured prediction
    | ChatOpenAI(model="gpt-3.5-turbo").bind(
        functions=[convert_to_openai_function(ResponseEffectiveness)],
        function_call={"name": "ResponseEffectiveness"},
    )
    # Convert the model's function-call output to a dict
    | JsonOutputFunctionsParser(args_only=True)
    | normalize_score
)


class ResponseEffectivenessEvaluator(RunEvaluator):
    def __init__(self, evaluator_runnable: Runnable) -> None:
        super().__init__()
        self.runnable = evaluator_runnable

    def evaluate_run(
        self, run: Run, example: Optional[Example] = None
    ) -> EvaluationResult:
        # This particular evaluator is configured to evaluate the previous
        # AI response. It uses the user's followup question or comment as
        # additional grounding for its grade.
        if not run.inputs.get("chat_history"):
            return EvaluationResult(
                key="response_effectiveness", comment="No chat history present."
            )
        elif "last_run_id" not in run.inputs:
            return EvaluationResult(
                key="response_effectiveness", comment="No last run ID present."
            )
        eval_grade: Optional[dict] = self.runnable.invoke(run.inputs)
        target_run_id = run.inputs["last_run_id"]
        return EvaluationResult(
            **eval_grade,
            key="response_effectiveness",
            target_run_id=target_run_id,
        )


### The actual deployed chain (we are keeping it simple for this example)
# The main focus of this template is the evaluator above, not the chain itself.


class ChainInput(BaseModel):
    chat_history: Optional[List[BaseMessage]] = Field(
        description="Previous chat messages."
    )
    text: str = Field(..., description="User's latest query.")
    last_run_id: Optional[str] = Field("", description="ID of the last run.")


_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant who speaks like a pirate",
        ),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{text}"),
    ]
)
_model = ChatOpenAI()


def format_chat_history(chain_input: dict) -> dict:
    # This is a hack to get the chat history into the prompt.
    # Note: at runtime the input arrives as a plain dict (the ChainInput
    # model above is only used for the OpenAPI schema), so dict access works.
    messages = format_messages(chain_input)
    return {
        "chat_history": messages,
        "text": chain_input.get("text"),
    }


# if you update the name of this, you MUST also update ../pyproject.toml
# with the new `tool.langserve.export_attr`
chain = (
    (format_chat_history | _prompt | _model | StrOutputParser())
    # This is to populate the openapi spec for LangServe
    .with_types(input_type=ChainInput)
    # This is to add the evaluators as "listeners"
    # and to customize the name of the chain
    .with_config(
        run_name="ChatBot",
        callbacks=[
            EvaluatorCallbackHandler(
                evaluators=[
                    ResponseEffectivenessEvaluator(evaluate_response_effectiveness)
                ]
            )
        ],
    )
)
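
For local experimentation, the grading runnable above can also be invoked directly, outside the callback machinery. A minimal sketch (the messages and run ID below are made up, and `OPENAI_API_KEY` must be set):

```python
# The message dicts mirror the serialized form that `format_messages` expects.
sample_inputs = {
    "chat_history": [
        {"type": "human", "content": "Where are my keys?"},
        {"type": "ai", "content": "Arr, check under yer couch cushions, matey."},
    ],
    "text": "I CAN'T FIND THEM ANYWHERE",
    "last_run_id": "00000000-0000-0000-0000-000000000000",  # hypothetical
}
graded = evaluate_response_effectiveness.invoke(sample_inputs)
print(graded)  # e.g. {"reasoning": "...", "score": 0.2}
```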

templates/conversational-feedback/pyproject.toml

@@ -0,0 +1,26 @@
[tool.poetry]
name = "chat_feedback"
version = "0.0.1"
description = ""
authors = []
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.8.1,<4.0"
langchain = ">=0.0.325, <0.1"
openai = "^0.28.1"
langsmith = ">=0.0.54"
langchainhub = ">=0.1.13"

[tool.poetry.group.dev.dependencies]
langchain-cli = ">=0.0.4"
fastapi = "^0.104.0"
sse-starlette = "^1.6.5"

[tool.langserve]
export_module = "chat_feedback.chain"
export_attr = "chain"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Empty file.
Nice.

1/ IIUC, this is performing chat evaluations w/o explicit user feedback, which is very useful. We might create a top-level shortened summary that just states this clearly. It's the first eval template, so very cool to have.

2/ We might explicitly mention that your chat app should be implemented (or called) in `chain.py`, and call out specifically where, as a placeholder. AFAICT, any chat runnable can simply append:

3/ Where to go to fetch the evals in LangSmith? Might be nice to show a screenshot.

Cool cool. For (1), are you saying on top of the README that's here? (Is it too roundabout?)

For (2), yes, although it needs a `last_run_id` to be passed around so that the feedback can be assigned to the previous response's trace. If we didn't care about exact credit assignment, or if we had a better way of tracking conversations, this would be easier/better.

For (3), def. I'll do that.