Language models can be used to provide interactive, personalized student feedback in educational settings. However, real-world deployment faces three key challenges: privacy concerns, limited computational resources, and the need for pedagogically valid responses. These constraints require small, open-source models that can run locally and reliably ground their outputs in correct information. We introduce SCRIBE, a framework for multi-hop, tool-augmented reasoning designed to generate valid responses to student questions about feedback reports. SCRIBE combines domain-specific tools with a self-reflective inference pipeline that supports iterative reasoning, tool use, and error recovery. We distil these capabilities into 3B and 8B models via two-stage LoRA fine-tuning on synthetic GPT-4o-generated data. Evaluation with a human-aligned GPT-Judge and a user study with 108 students shows that SCRIBE models achieve comparable or superior quality to much larger models in key dimensions such as relevance and actionability, while being perceived on par with GPT-4o and Llama-3.3 70B by students. These findings demonstrate the viability of SCRIBE for low-resource, privacy-sensitive educational applications.
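At inference time, SCRIBE models interleave reasoning steps with calls to domain-specific tools and can revise course when a call fails. The sketch below illustrates one way such a self-reflective, tool-augmented loop can be structured; the function names, the `{"tool": ..., "args": ...}` calling convention, and the example tool are illustrative assumptions, not the actual SCRIBE API.

```python
import json
from typing import Callable

# Illustrative sketch only: in SCRIBE, `generate` would be the fine-tuned
# model and `tools` the domain-specific feedback-report tools. All names and
# the JSON tool-call convention below are assumptions, not the actual API.
Tools = dict[str, Callable[[dict], str]]


def tool_augmented_answer(generate: Callable[[str], str],
                          tools: Tools,
                          question: str,
                          max_hops: int = 4) -> str:
    """Iteratively query the model, execute any tool it requests, and feed
    the tool result (or error) back so the model can reflect and recover."""
    prompt = question
    for _ in range(max_hops):
        output = generate(prompt)

        # Assumed convention: the model emits a JSON tool call such as
        # {"tool": "lookup_feedback", "args": {...}} when it needs grounding,
        # and plain text once it is ready to answer.
        try:
            call = json.loads(output)
        except json.JSONDecodeError:
            call = None
        if not isinstance(call, dict) or "tool" not in call:
            return output  # plain text: treat as the final answer

        tool = tools.get(call["tool"])
        if tool is None:
            # Error recovery: tell the model the requested tool does not exist.
            prompt += f"\n[error] unknown tool: {call['tool']}"
            continue
        try:
            prompt += f"\n[tool result] {tool(call.get('args', {}))}"
        except Exception as exc:  # self-reflection on a failed tool call
            prompt += f"\n[tool error] {exc}"

    # Hop budget exhausted: ask for a best-effort answer from what was gathered.
    return generate(prompt + "\n[note] answer with the information gathered so far")
```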
To fine-tune a model with this framework, run a command like the following (replace the paths with your actual model, adapter, and checkpoint locations):
```bash
accelerate launch --config_file /path/to/your/accelerate_config.yaml /path/to/finetune_lama.py \
  --step multi_step \
  --adapter_checkpoint /path/to/your/adapter_checkpoint/initial_adapter \
  --model_name /path/to/your/model_directory \
  --epochs 3
```
Note: In your `accelerate_default_config.yaml`, make sure to set `num_processes` to the number of GPUs you want to use for training.
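For reference, a minimal multi-GPU configuration might look like the sketch below; the exact fields depend on how you answered the `accelerate config` prompts, and `num_processes: 4` assumes a machine with four GPUs.

```yaml
# accelerate_default_config.yaml (sketch only; adjust to your hardware)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: bf16
num_machines: 1
num_processes: 4   # number of GPUs to use for training
gpu_ids: all
```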
We have made the synthetic training data used for SCRIBE open source. You can access and download it here:
This dataset contains the multi-hop, tool-augmented reasoning examples used for model finetuning and evaluation.
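Once downloaded, the examples can be loaded with standard tooling; the sketch below uses the Hugging Face `datasets` library, and the file name is a placeholder for wherever you saved the download.

```python
from datasets import load_dataset

# "scribe_synthetic.jsonl" is a placeholder for the downloaded data file.
train_data = load_dataset("json", data_files="scribe_synthetic.jsonl", split="train")
print(train_data[0])  # inspect one multi-hop, tool-augmented example
```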
The following SCRIBE models are available as open source on Hugging Face: