EvalYaks

EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts

Relying on human experts to evaluate CEFR speaking assessments in an e-learning environment creates scalability challenges, as it limits how quickly and widely assessments can be conducted. We aim to automate the evaluation of CEFR B2 English speaking assessments in e-learning environments from conversation transcripts. First, we evaluate the capability of leading open-source and commercial Large Language Models (LLMs) to score a candidate's performance across various criteria in the CEFR B2 speaking exam in both global and India-specific contexts. Next, we create a new expert-validated, CEFR-aligned synthetic conversational dataset with transcripts that are rated at different assessment scores. In addition, new instruction-tuning datasets are developed from the English Vocabulary Profile (up to CEFR B2 level) and the CEFR-SP WikiAuto datasets. Finally, using these new datasets, we perform parameter-efficient instruction tuning of Mistral Instruct 7B v0.2 to develop a family of models called EvalYaks. Four models in this family assess the four sections of the CEFR B2 speaking exam, one identifies the CEFR level of vocabulary and generates level-specific vocabulary, and another detects the CEFR level of text and generates level-specific text. EvalYaks achieved an average acceptable accuracy of 96%, a degree of variation of 0.35 levels, and performed 3 times better than the next best model. This demonstrates that a 7B-parameter LLM instruction-tuned with high-quality CEFR-aligned assessment data can effectively evaluate and score CEFR B2 English speaking assessments, offering a promising solution for scalable, automated language proficiency evaluation.

EvalYaks/
├── Files/
│   ├── AverageAcceptableAccuracy_PerformanceDescriptors.png
│   ├── AverageAcceptableAccuracy_WithoutPerformanceDescriptors.png
│   └── DatasetExample.png
├── InstructionDatasets/                        # Datasets used for instruction tuning
│   ├── Cambridge_VocabProfile.csv              # Cambridge Vocabulary Profile
│   ├── CEFR_WikiAuto.csv                       # CEFR WikiAuto Dataset
│   ├── Part1_Introduction.csv                  # Part 1 of CEFR B2 English speaking assessment
│   ├── Part2_LongTurn.csv                      # Part 2 of CEFR B2 English speaking assessment
│   ├── Part3_Discussion.csv                    # Part 3 of CEFR B2 English speaking assessment
│   └── Part4_ExtendedDiscussion.csv            # Part 4 of CEFR B2 English speaking assessment
├── LICENSE
└── README.md
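
The instruction datasets listed above are plain CSV files, so they can be inspected with the Python standard library. The sketch below is a minimal, hypothetical helper (`peek_csv` is not part of this repository) that reads the header row and the first few records. It deliberately makes no assumption about column names, since this README does not document them. The UTF-8 encoding and the relative path in the usage section are assumptions taken from the directory tree.

```python
import csv
from pathlib import Path


def peek_csv(path, n_rows=3):
    """Return (column_names, first n_rows records as dicts) for a CSV file.

    Hypothetical helper for illustration; the column names are read from
    the file itself because the README does not list them.
    """
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading BOM.
    with Path(path).open(newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        rows = []
        for row in reader:
            if len(rows) >= n_rows:
                break
            rows.append(row)
        return list(reader.fieldnames or []), rows


if __name__ == "__main__":
    # Assumed location, taken from the directory tree in this README.
    dataset = Path("InstructionDatasets/Part1_Introduction.csv")
    if dataset.exists():
        columns, sample = peek_csv(dataset)
        print("Columns:", columns)
        for record in sample:
            print(record)
    else:
        print(f"{dataset} not found; run this from the repository root.")
```

Checking the header first makes it easier to map each file onto whatever prompt and response fields a fine-tuning script expects.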

Datapoint examples for different parts of the speaking assessment for instruction tuning.

Results

The performance of EvalYaks compared with other LLMs is shown below:

The distribution of acceptable accuracy of leading LLMs without LoRA in comparison with EvalYaks part 1-4 models using prompts without performance descriptors.

The distribution of acceptable accuracy of leading LLMs without LoRA in comparison with EvalYaks part 1-4 models using prompts with performance descriptors.
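
This README does not define "acceptable accuracy" or "degree of variation", so the sketch below only shows one plausible reading and should not be taken as the authors' method. It assumes acceptable accuracy is the share of predicted scores that fall within a fixed tolerance of the expert score, and that degree of variation is the mean absolute difference between predicted and expert scores. The function names, the default tolerance of 1.0, and the example scores are all hypothetical.

```python
def acceptable_accuracy(predicted, expert, tolerance=1.0):
    """Fraction of predictions within `tolerance` of the expert score.

    Assumed definition; the tolerance value is illustrative only.
    """
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not predicted:
        raise ValueError("at least one score pair is required")
    hits = sum(1 for p, e in zip(predicted, expert) if abs(p - e) <= tolerance)
    return hits / len(predicted)


def mean_absolute_difference(predicted, expert):
    """Mean absolute gap between predicted and expert scores.

    Assumed stand-in for the "degree of variation" reported above.
    """
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not predicted:
        raise ValueError("at least one score pair is required")
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(predicted)


if __name__ == "__main__":
    # Made-up scores for illustration; these are not results from the paper.
    model_scores = [3.0, 4.5, 2.0, 5.0]
    expert_scores = [3.5, 3.0, 2.0, 4.0]
    print(acceptable_accuracy(model_scores, expert_scores))
    print(mean_absolute_difference(model_scores, expert_scores))
```

Under these assumptions, a higher acceptable accuracy and a lower mean absolute difference both indicate closer agreement with the expert ratings.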

Citation

If you find our dataset and work beneficial, please cite our work:

@misc{scaria2024evalyaksinstructiontuningdatasets,
      title={EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts}, 
      author={Nicy Scaria and Silvester John Joseph Kennedy and Thomas Latinovich and Deepak Subramani},
      year={2024},
      eprint={2408.12226},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.12226}, 
}


