Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* walkthrough example branch
* Add the Evaluation Walkthrough (#333)
* Add eval guide
* Add images to evaluation guide
* Remove download button
* Fix image paths
* change image branch
* Add harness info
* Change the tutorial thumbnail image
* Change the image path
---------
Co-authored-by: Bilge Yücel <bilgeyucel96@gmail.com>
1 parent
f915427
commit c98babd
Showing 5 changed files with 156 additions and 5 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,138 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Evaluation is crucial for developing and deploying LLM-based systems such as RAG applications and Agents, ensuring they are accurate, reliable, and effective. It ensures the information retrieved and generated is accurate, reducing the risk of irrelevant answers. \n", | ||
"\n", | ||
"Evaluation measures performance using metrics like precision, recall, and relevancy, providing a clear picture of your pipeline's strengths and weaknesses using LLMs or ground-truth labels.\n", | ||
"\n", | ||
"Evaluating RAG systems can help understand performance bottlenecks and optimize one component at a time, for example, a Retriever or a prompt used with a Generator.\n", | ||
"\n", | ||
"Here's a step-by-step guide explaining what you need to evaluate, how you evaluate, and how you can improve your application after evaluation using Haystack!\n", | ||
"\n", | ||
"## 1. Building your pipeline\n", | ||
"\n", | ||
"Choose the required components based on your use case and create your Haystack pipeline. If you’re a beginner, start with [📚 Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline). If you’d like to explore different model providers, vector databases, retrieval techniques, and more with Haystack, pick an example from🧑🍳 [Haystack Cookbooks](https://github.com/deepset-ai/haystack-cookbook).\n", | ||
"\n", | ||
"## 2. Human Evaluation\n", | ||
"\n", | ||
"As the first step, perform **manual evaluation**. Test a few queries (5-10 queries) and manually assess the accuracy, relevance, coherence, format, and overall quality of your pipeline’s output. This will provide an initial understanding of how well your system performs and highlight any obvious issues.\n", | ||
"\n", | ||
"To trace the data through each pipeline step, debug the intermediate components using the [include_outputs_from](https://docs.haystack.deepset.ai/reference/pipeline-api#pipelinerun) parameter. This feature is particularly useful for observing the retrieved documents or verifying the rendered prompt. By examining these intermediate outputs, you can pinpoint where issues may arise and identify specific areas for improvement, such as tweaking the prompt or trying out different models.\n", | ||
"\n", | ||
"## 3. Deciding on Metrics\n", | ||
"\n", | ||
"Evaluation metrics are crucial for measuring the effectiveness of your pipeline. Common metrics are:\n", | ||
"\n", | ||
"- **Semantic Answer Similarity**: Evaluates the semantic similarity of the generated answer and the ground truth rather than their lexical overlap.\n", | ||
"- **Context Relevancy**: Assesses the relevance of the retrieved documents to the query.\n", | ||
"- **Faithfulness:** Evaluates to what extent a generated answer is based on retrieved documents\n", | ||
"- **Context Precision**: Measures the accuracy of the retrieved documents.\n", | ||
"- **Context Recall**: Measures the ability to retrieve all relevant documents.\n", | ||
"\n", | ||
"Some metrics might require labeled data, while others can be evaluated using LLMs without needing labeled data. As you evaluate your pipeline, explore various types of metrics, such as [statistical](https://docs.haystack.deepset.ai/docs/statistical-evaluation) and [model-based](https://docs.haystack.deepset.ai/docs/model-based-evaluation) metrics, or incorporate custom metrics using LLMs with the [LLMEvaluator](https://docs.haystack.deepset.ai/docs/llmevaluator). \n", | ||
"\n", | ||
"| | Retrieval Evaluation | Generation Evaluation | End-to-end Evaluation |\n", | ||
"| --- | :---: | --- | :---: |\n", | ||
"| Labeled data | [DocumentMAPEvaluator](https://docs.haystack.deepset.ai/docs/documentmapevaluator), [DocumentMRREvaluator](https://docs.haystack.deepset.ai/docs/documentmrrevaluator), [DocumentRecallEvaluator](https://docs.haystack.deepset.ai/docs/documentrecallevaluator) | - | [AnswerExactMatchEvaluator](https://docs.haystack.deepset.ai/docs/answerexactmatchevaluator), [SASEvaluator](https://docs.haystack.deepset.ai/docs/sasevaluator) |\n", | ||
"| Unlabeled data (LLM-based) | [ContextRelevanceEvaluator](https://docs.haystack.deepset.ai/docs/contextrelevanceevaluator) | [FaithfulnessEvaluator](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator)| [LLMEvaluator](https://docs.haystack.deepset.ai/docs/llmevaluator)** |\n", | ||
"\n", | ||
"<p style=\"font-size: 1rem; font-style: italic; margin-top: -1rem;\">** You need to provide the instruction and the examples to the LLM to evaluate your system.</p>\n", | ||
"<figcaption>List of evaluation metrics that Haystack has built-in support</figcaption>\n", | ||
"\n", | ||
"In addition to Haystack’s built-in evaluators, you can use metrics from other evaluation frameworks like [ragas](https://haystack.deepset.ai/integrations/ragas) and [DeepEval](https://haystack.deepset.ai/integrations/deepeval). For more detailed information on evaluation metrics, refer to 📖 [Docs: Evaluation](https://docs.haystack.deepset.ai/docs/evaluation). \n", | ||
"\n", | ||
"## 4. Building an Evaluation Pipeline\n", | ||
"\n", | ||
"Build a pipeline with your evaluators. To learn about evaluating with Haystack’s own metrics, you can follow 📚 [Tutorial: Evaluating RAG Pipelines](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines). \n", | ||
"\n", | ||
 🧑🍳 As well as">
"> 🧑🍳 In addition to Haystack’s own evaluation metrics, you can integrate with a number of evaluation frameworks. See the integrations and examples below 👇\n", | ||
"> \n", | ||
"> - [Evaluate with DeepEval](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_deep_eval.ipynb)\n", | ||
"> - [Evaluate with ragas](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_ragas.ipynb)\n", | ||
"> - [Evaluate with UpTrain](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_uptrain.ipynb)\n", | ||
"\n", | ||
"For step-by-step instructions, watch [our video walkthrough](https://youtu.be/5PrzXaZ0-qk?feature=shared) 🎥 👇\n", | ||
"\n", | ||
"\n", | ||
"<iframe\n", | ||
" width=\"640\"\n", | ||
" height=\"480\"\n", | ||
" src=\"https://www.youtube.com/embed/5PrzXaZ0-qk\"\n", | ||
" frameborder=\"0\"\n", | ||
" allow=\"autoplay; encrypted-media\"\n", | ||
" allowfullscreen\n", | ||
">\n", | ||
"</iframe>\n", | ||
"\n", | ||
"For a comprehensive evaluation, make sure to evaluate specific steps in the pipeline (e.g., retrieval or generation) and the performance of the entire pipeline. To get inspiration on evaluating your pipeline, have a look at 🧑🏼🍳 [Cookbook: Prompt Optimization with DSPy](https://github.com/deepset-ai/haystack-cookbook/blob/main/notebooks/prompt_optimization_with_dspy.ipynb), which explains the details of prompt optimization and evaluation, or read 📚 [Article: RAG Evaluation with Prometheus 2](https://haystack.deepset.ai/blog/rag-evaluation-with-prometheus-2), which explores using open LMs to evaluate with custom metrics.\n", | ||
"\n", | ||
"If you're looking for a straightforward and efficient solution for RAG, consider using `EvaluationHarness`, introduced with Haystack 2.3 through [`haystack-experimental`](https://github.com/deepset-ai/haystack-experimental/tree/main). You can learn more by running the example 💻 [Notebook: Evaluating RAG Pipelines with EvaluationHarness](https://github.com/deepset-ai/haystack-experimental/blob/main/examples/rag_eval_harness.ipynb).\n", | ||
"\n", | ||
"## 5. Running Evaluation\n", | ||
"\n", | ||
"Evaluate your pipeline with different parameters, change the `top_k` value, and try a different embedding model, play with the `temperature` to find what works best for your use case. If you need labeled data for evaluation, you can use some datasets that come with ground-truth documents and ground-truth answers. You can find some datasets on [Hugging Face datasets](https://huggingface.co/datasets) or in the [haystack-evaluation](https://github.com/deepset-ai/haystack-evaluation/tree/main/datasets) repository. \n", | ||
"\n", | ||
"Make sure to set up your evaluation environment so that it’s easy to evaluate using different parameters without much hassle. The [haystack-evaluation](https://github.com/deepset-ai/haystack-evaluation) repository provides examples with different architectures against various datasets. \n", | ||
"\n", | ||
"Read more about how you can optimize your pipeline by trying different parameter combinations in 📚 [Article: Benchmarking Haystack Pipelines for Optimal Performance](https://haystack.deepset.ai/blog/benchmarking-haystack-pipelines)\n", | ||
"\n", | ||
"## 6. Analyzing Results\n", | ||
"\n", | ||
"Visualize your data and your results to have a general understanding of your pipeline’s performance.\n", | ||
"\n", | ||
"- Create a report using [EvaluationRunResult.score_report()](https://docs.haystack.deepset.ai/reference/evaluation-api#evaluationrunresult) and transform the evaluation results into a Pandas DataFrame with the aggregated scores for each metric:\n", | ||
"\n", | ||
"![Score report for Document MRR, SAS and Faithfulness](https://raw.githubusercontent.com/deepset-ai/haystack-tutorials/main/tutorials/img/guide_score_report.png#small)\n", | ||
"\n", | ||
"- Use Pandas to analyze the results for different parameters (`top_k`, `batch_size`, `embedding_model`) in a comprehensive view\n", | ||
"- Use libraries like Matplotlib or Seaborn to visually represent your evaluation results.\n", | ||
" \n", | ||
"![Using box-plots makes sense when comparing different models](https://raw.githubusercontent.com/deepset-ai/haystack-tutorials/main/tutorials/img/guide_box_plot.png#medium \"Using box-plots makes sense when comparing different models\")\n", | ||
" \n", | ||
"\n", | ||
"> Refer to 📚 [Benchmarking Haystack Pipelines for Optimal Performance: Results Analysis](https://haystack.deepset.ai/blog/benchmarking-haystack-pipelines#results-analysis) or 💻 [Notebook: Analyze ARAGOG Parameter Search](https://github.com/deepset-ai/haystack-evaluation/blob/main/evaluations/analyze_aragog_parameter_search.ipynb) to visualize evaluation results.\n", | ||
"> \n", | ||
"\n", | ||
"## 7. Improving Your Pipeline\n", | ||
"\n", | ||
"After evaluation, analyze the results to identify areas of improvement. Here are some methods:\n", | ||
"\n", | ||
"### Methods to Improve Retrieval:\n", | ||
"\n", | ||
"- **Data Cleaning**: Ensure your data is clean and well-structured before indexing using [DocumentCleaner](https://docs.haystack.deepset.ai/docs/documentcleaner) and [DocumentSplitter](https://docs.haystack.deepset.ai/docs/documentsplitter).\n", | ||
"- **Data Quality:** Enrich the semantics of your documents by [embedding meaningful metadata](https://haystack.deepset.ai/tutorials/39_embedding_metadata_for_improved_retrieval) alongside the document's contents.\n", | ||
"- **Metadata Filtering**: Limit the search space by using [metadata filters](https://docs.haystack.deepset.ai/docs/metadata-filtering) or extracting metadata from queries to use as filters. For more details, read 📚 [Extract Metadata from Queries to Improve Retrieval](https://haystack.deepset.ai/blog/extracting-metadata-filter).\n", | ||
"- **Different Embedding Models:** Compare different embedding models from different model providers. See the full list of supported embedding providers in [Embedders](https://docs.haystack.deepset.ai/docs/embedders).\n", | ||
"- **Advanced Retrieval Techniques**: Leverage techniques like [hybrid retrieval](https://haystack.deepset.ai/tutorials/33_hybrid_retrieval), [sparse embeddings](https://docs.haystack.deepset.ai/docs/retrievers#sparse-embedding-based-retrievers), or [Hypothetical Document Embeddings (HYDE)](https://docs.haystack.deepset.ai/docs/hypothetical-document-embeddings-hyde).\n", | ||
"\n", | ||
"### Methods to Improve Generation:\n", | ||
"\n", | ||
"- **Ranking**: Incorporate a ranking mechanism into your retrieved documents before providing the context to your prompt\n", | ||
" - **Order by similarity**: Reorder your retrieved documents by similarity using cross-encoder models from Hugging Face with [TransformersSimilarityRanker](https://docs.haystack.deepset.ai/docs/transformerssimilarityranker), Rerank models from Cohere with [CohereRanker](https://docs.haystack.deepset.ai/docs/cohereranker), or Rerankers from Jina with [JinaRanker](https://docs.haystack.deepset.ai/docs/jinaranker)\n", | ||
" - **Increase diversity by ranking**: Maximize the overall diversity among your context using sentence-transformers models with [SentenceTransformersDiversityRanker](https://docs.haystack.deepset.ai/docs/sentencetransformersdiversityranker) to help increase the semantic answer similarity (SAS) in LFQA applications.\n", | ||
" - **Address the \"Lost in the Middle\" problem by reordering**: Position the most relevant documents at the beginning and end of the context using [LostInTheMiddleRanker](https://docs.haystack.deepset.ai/docs/lostinthemiddleranker) to increase faithfulness.\n", | ||
"- **Different Generators**: Try different large language models and benchmark the results. The full list of model providers is in [Generators](https://docs.haystack.deepset.ai/docs/generators).\n", | ||
"- **Prompt Engineering**: Use few-shot prompts or provide more instructions to enable the exact match.\n", | ||
"\n", | ||
"## 8. Monitoring\n", | ||
"\n", | ||
"Implement strategies for [tracing](https://docs.haystack.deepset.ai/docs/tracing) the application post-deployment. By integrating [LangfuseConnector](https://docs.haystack.deepset.ai/docs/langfuseconnector) into your pipeline, you can collect the queries, documents, and answers and use them to continuously evaluate your application. Learn more about pipeline monitoring in 📚 [Article: Monitor and trace your Haystack pipelines with Langfuse](https://haystack.deepset.ai/blog/langfuse-integration)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
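
The labeled-data retrieval metrics named in section 3 of the guide (Context Recall, and the mean reciprocal rank reported by `DocumentMRREvaluator`) can be illustrated in plain Python. The following is a minimal sketch of the textbook definitions only, not Haystack's implementation: the function names, the toy document IDs, and the choice of `k` are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Arithmetic mean that returns 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: ranked retrieval results and ground-truth labels
# for three queries. The document IDs are made up.
queries = [
    {"retrieved": ["d3", "d1", "d7"], "relevant": {"d1", "d2"}},
    {"retrieved": ["d4", "d5", "d6"], "relevant": {"d4"}},
    {"retrieved": ["d9", "d8", "d2"], "relevant": {"d5"}},
]

recall_scores = [recall_at_k(q["retrieved"], q["relevant"], k=3) for q in queries]
rr_scores = [reciprocal_rank(q["retrieved"], q["relevant"]) for q in queries]

print("recall@3 per query:", recall_scores)
print("mean recall@3:", mean(recall_scores))
print("MRR:", mean(rr_scores))
```

Per-query scores like these are useful alongside the aggregate: a query with a score of zero points to a retrieval failure worth inspecting by hand, as described in section 2 of the guide. In a real pipeline, the built-in evaluators listed in the guide's table would take the place of these helper functions.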