LLMQuoter is a lightweight, distillation-based model designed to enhance Retrieval-Augmented Generation (RAG) workflows. The model adopts a "quote-first-then-answer" strategy, extracting relevant textual evidence (quotes) from large contexts before performing reasoning tasks.
The Retrieval-Augmented Fine-Tuning (RAFT) approach involves reasoning and answering directly over the full context: RAFT trains models to “think while quoting,” combining reasoning with the ability to extract relevant portions of the text.
In RAFT, the model performs the following steps in a single process:
- Reads the entire context.
- Extracts quotes inline.
- Generates a reasoning chain to derive the answer.
While effective, RAFT’s holistic approach can cause cognitive overload, especially for smaller models.
LLMQuoter, by contrast, separates the quote extraction and reasoning steps:
- Step 1: A smaller model extracts quotes relevant to the question.
- Step 2: The reasoning model focuses solely on the extracted quotes, avoiding the overhead of processing the entire context.
This modular approach simplifies the reasoning task and reduces errors arising from large or noisy contexts.
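The two-step pipeline above can be sketched as a pair of prompts, where only the quoter ever sees the full context. The prompt wording and the `quoter_llm`/`reasoner_llm` callables are illustrative stand-ins, not the repository's actual interfaces:

```python
# Minimal sketch of the quote-first-then-answer pipeline.
# `quoter_llm` and `reasoner_llm` are hypothetical callables that take a
# prompt string and return the model's completion.

QUOTE_PROMPT = (
    "Extract the sentences from the context that directly support an answer "
    "to the question.\nContext: {context}\nQuestion: {question}"
)
ANSWER_PROMPT = (
    "Answer the question using only these quotes.\n"
    "Quotes: {quotes}\nQuestion: {question}"
)

def quote_then_answer(context, question, quoter_llm, reasoner_llm):
    # Step 1: the small quoter model reads the full context once.
    quotes = quoter_llm(QUOTE_PROMPT.format(context=context, question=question))
    # Step 2: the reasoner sees only the extracted quotes, never the context.
    return reasoner_llm(ANSWER_PROMPT.format(quotes=quotes, question=question))
```

Because the reasoning prompt contains only the quotes, the second model's input stays short regardless of how large the original context was.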
The repository includes:
- Methodology: Explanation of the quote extraction process and fine-tuning.
- Experiments: Comparisons between using quotes and full contexts.
- Results: Performance metrics highlighting the benefits of LLMQuoter.
The process begins by creating a distilled dataset using a high-performing teacher model (e.g., Gemini Pro 1.5). The model extracts golden quotes from the context that directly support the answer to a given question. These quotes serve as a foundation for fine-tuning smaller, resource-efficient models.
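Distillation can be sketched as prompting the teacher to wrap supporting spans in the quote markers and storing its output alongside the original sample. The prompt text and helper below are illustrative, not the exact pipeline code:

```python
# Hypothetical sketch of building one distillation record: the teacher
# model (e.g. Gemini Pro 1.5) is asked to wrap each supporting span in
# ##begin_quote## / ##end_quote## markers.

TEACHER_PROMPT = (
    "Copy the minimal spans from the context that support the answer, "
    "wrapping each in ##begin_quote## ... ##end_quote##.\n"
    "Question: {question}\nContext: {context}\nAnswer: {answer}"
)

def build_record(sample, teacher_llm):
    prompt = TEACHER_PROMPT.format(**sample)
    return {
        "question": sample["question"],
        "context": sample["context"],
        "golden_quotes": teacher_llm(prompt),  # teacher's quoted spans
    }
```

Each resulting record pairs a (question, context) input with the teacher's golden quotes as the training target for the student model.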
Using the distilled dataset, a smaller model (LLaMA 3.2 3B) is fine-tuned with Low-Rank Adaptation (LoRA). Because LoRA trains only small low-rank adapter matrices rather than the full weight set, it enables task-specific training while keeping compute and memory costs low.
You can find the training step in the following Colab notebook:
Fine-Tuning with LoRA on Colab
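A hedged sketch of the LoRA configuration using the `peft` library; the hyperparameters below are illustrative defaults, not the exact values from the notebook:

```python
from peft import LoraConfig, TaskType

# Illustrative adapter settings; tune r / alpha / dropout for your budget.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA blocks
)
# Attach to a loaded base model with:
#   model = get_peft_model(base_model, lora_config)
```

Only the adapter matrices are trained, which is what keeps fine-tuning the 3B model tractable on modest hardware such as a Colab GPU.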
The experiments are conducted on a subset of the HotpotQA dataset. Each sample contains:
- Question: A specific query.
- Context: A large body of text containing the answer.
- Answer: The expected response to the question.
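An illustrative record under this schema (the text values are invented; only the field names mirror the dataset):

```python
# One HotpotQA-style sample; values are invented for illustration only.
sample = {
    "question": "Which film was released first, A or B?",
    "context": "A long multi-paragraph passage mentioning both films...",
    "answer": "A",
}
```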
Statistics:
| Attribute | Value |
|---|---|
| Dataset Name | HotpotQA |
| Total Samples | 15,000 |
| Training Set | 14,400 |
| Test Set | 600 |
| Source | Wikipedia |
| Topics | General Knowledge |
Using the DSPy framework, the performance of quote extraction is evaluated using:
- Precision: Fraction of predicted quotes matching the ground truth.
- Recall: Fraction of ground truth quotes captured by the predictions.
- F1-Score: Harmonic mean of precision and recall.
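Treating predicted and ground-truth quotes as sets of exact strings, these three metrics can be sketched as follows (the actual matching rule in the evaluation may normalize text differently):

```python
def quote_prf(predicted, gold):
    """Precision, recall, and F1 over sets of quote strings."""
    pred, ref = set(predicted), set(gold)
    hits = len(pred & ref)                       # quotes present in both sets
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)        # harmonic mean
    return precision, recall, f1
```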
Semantic accuracy — whether a generated answer matches the ground truth in meaning rather than verbatim — is used to compare the answers produced by models given either the golden quotes or the full context.
The following table shows the improvement in precision, recall, and F1-score after fine-tuning the smaller model with LoRA:
| Metric | Before Fine-Tuning | After Fine-Tuning | Improvement |
|---|---|---|---|
| Recall | 48.3% | 68.0% | +19.7% |
| Precision | 43.6% | 71.0% | +27.4% |
| F1-Score | 41.3% | 69.1% | +27.8% |
The performance of several models was evaluated in two configurations:
- Using the full context.
- Using golden quotes.
| Model | Full Context Accuracy | Golden Quotes Accuracy | Improvement |
|---|---|---|---|
| LLaMA 1B | 24.4% | 62.2% | +37.8% |
| LLaMA 3B | 57.7% | 83.0% | +25.3% |
| GPT-3.5 Turbo | 75.8% | 88.5% | +12.7% |
Takeaway: Providing models with extracted quotes significantly boosts their accuracy by reducing cognitive overhead and enabling more focused reasoning.
- Question: Which Walt Disney Pictures film was created first, Finding Dory or The Wild Country?
- Context: A lengthy body of text (5,086 characters) covering Disney and Pixar films.
- Extracted quotes: ##begin_quote## The Wild Country is a 1970 American adventure film produced by Walt Disney Pictures and directed by Robert Totten. ##end_quote## ##begin_quote## Finding Nemo is a 2003 American computer-animated family film produced by Pixar Animation Studios and released by Walt Disney Pictures. ##end_quote##
| Model | Full Context Answer | Golden Quotes Answer |
|---|---|---|
| GPT-3.5 Turbo | Finding Nemo was created first | The Wild Country |
| LLaMA 1B | Finding Dory is created first | The Wild Country |
| LLaMA 3B | Finding Dory is created first | The Wild Country |
Conclusion: Conditioning on the extracted quotes aligns model outputs more closely with the ground-truth answer.
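The ##begin_quote## / ##end_quote## markers in the example above can be recovered from a model's raw output with a small parser, for example:

```python
import re

# Each extracted span is delimited by ##begin_quote## ... ##end_quote##;
# DOTALL lets a quote span multiple lines, and the lazy (.*?) stops at
# the nearest closing marker.
QUOTE_RE = re.compile(r"##begin_quote##\s*(.*?)\s*##end_quote##", re.DOTALL)

def parse_quotes(text):
    """Return the list of quoted spans, with surrounding whitespace stripped."""
    return QUOTE_RE.findall(text)
```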
This study demonstrates the effectiveness of quote-first-then-answer strategies in RAG workflows:
- Efficiency: Fine-tuning small models with LoRA achieves significant improvements with minimal computational resources.
- Performance: Quote extraction reduces cognitive load, enabling both small and large models to perform better.

Future directions include:
- Dataset Expansion: Apply the methodology to datasets from different domains.
- Reinforcement Learning: Incorporate techniques like PPO for further refinement.
- Scalability: Train larger models (e.g., 8B parameters) to explore scalability.
- Enhanced Prompts: Develop advanced prompt engineering for improved quote extraction.