This Python library offers a streamlined solution for rapidly developing an Adaptive Testing system specifically for AI models, particularly large language models (LLMs). Based on psychometrics, it encompasses a comprehensive suite of tools that integrate both traditional statistical methods and recent machine learning and deep learning techniques.
Computerized Adaptive Testing stands as one of the earliest and most successful integrations of educational practices and computing technology.
In evaluating human abilities, psychometrics gradually replaced traditional paper-and-pencil testing with a more advanced approach: adaptive testing. This approach employs an understanding of cognitive functions and processes to guide the design of assessments, including the measurement of human knowledge, abilities, attitudes, and personality traits. By capturing the characteristics and utility (e.g., difficulty, discrimination) of different items and adjusting the test items in real time based on the test-taker's performance, adaptive testing avoids overwhelming the test-taker with numerous items all at once. Adaptive testing has been widely applied in high-stakes exams such as the Graduate Management Admission Test (GMAT), Graduate Record Examinations (GRE), and the Scholastic Assessment Test (SAT).
The adaptive testing system alternates between two main components. At each test step, the psychometric model, acting as the user model, first uses the model's previous responses to estimate its current ability. The selection algorithm then picks the next item from the benchmark according to certain criteria. This two-step process repeats until a predefined stopping rule is met, and the final ability estimate of each model is fed back as the outcome of the assessment or used to guide future training.
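Conceptually, the loop looks like the following sketch. The names (`user_model.estimate`, `strategy.select`, `llm_respond`) are illustrative placeholders, not this library's actual API.

```python
# A minimal sketch of the adaptive testing loop described above.
# All names here are illustrative placeholders, not this library's API.

def adaptive_test(user_model, strategy, item_pool, llm_respond, max_steps=20):
    """Alternate between ability estimation and item selection."""
    administered, responses = [], []
    for _ in range(max_steps):                                  # stopping rule: fixed test length
        theta = user_model.estimate(administered, responses)    # step 1: update the ability estimate
        next_item = strategy.select(theta, item_pool, administered)  # step 2: pick the next item
        responses.append(llm_respond(next_item))                # query the LLM on the chosen item
        administered.append(next_item)
    return user_model.estimate(administered, responses)         # final ability estimate
```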
This repository implements the basic functionality of adaptive testing. It includes three types of psychometric models: Item Response Theory (IRT), Multidimensional Item Response Theory (MIRT), and Neural Cognitive Diagnosis (NCD), each with its corresponding selection algorithms. The library includes the following models and algorithms (see the sketch after this list for a minimal IRT/MFI example):
- Item Response Theory (IRT)
  - Maximum Fisher Information (MFI) strategy
  - Kullback-Leibler Information (KLI) strategy
  - Model-Agnostic Adaptive Testing (MAAT) strategy
- Multidimensional Item Response Theory (MIRT)
  - D-Optimality (D-opt) strategy
  - Multivariate Kullback-Leibler Information (MKLI) strategy
  - Model-Agnostic Adaptive Testing (MAAT) strategy
- Neural Cognitive Diagnosis (NCD)
  - Model-Agnostic Adaptive Testing (MAAT) strategy
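As a reference point, the sketch below shows Maximum Fisher Information (MFI) selection under a two-parameter logistic (2PL) IRT model. The 2PL form and the information formula are standard psychometrics; the function names and the toy item pool are illustrative, not this library's internal code.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Item Fisher information under 2PL: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def select_mfi(theta, a, b, administered):
    """Pick the unadministered item with the largest information at the current ability."""
    info = fisher_information(theta, a, b)   # shape: (num_items,)
    info[list(administered)] = -np.inf       # mask already-administered items
    return int(np.argmax(info))

# Toy item pool: discrimination a and difficulty b per item.
a = np.array([1.2, 0.8, 1.5, 1.0])
b = np.array([-0.5, 0.0, 0.3, 1.2])
print(select_mfi(theta=0.2, a=a, b=b, administered={0}))  # -> 2, the most informative remaining item
```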
Clone the repository with git and install it via pip:

```bash
pip install -e .
```
See the examples in the `scripts` directory.
For instance, to use the GSM8K benchmark from HELM, download the response data ("Full JSON") for each LLM and place it in the `data/raw_data` directory.
To process the data, run the notebook `scripts/dataset/gsm8k.ipynb`.
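The end result of this step is, in essence, a binary response matrix over models and items. The sketch below only illustrates that idea; the file layout and field names (`item_id`, `correct`) are hypothetical and do not reflect HELM's actual JSON schema or the notebook's implementation.

```python
import json
import numpy as np

# Illustrative preprocessing sketch: build a binary response matrix R where
# R[m, i] = 1 if model m answered item i correctly. Field names are hypothetical.

def build_response_matrix(paths_by_model):
    """paths_by_model maps an LLM name to a JSON file of per-item records."""
    model_names = sorted(paths_by_model)
    records = {m: json.load(open(paths_by_model[m])) for m in model_names}
    item_ids = sorted({r["item_id"] for recs in records.values() for r in recs})
    item_index = {iid: i for i, iid in enumerate(item_ids)}

    R = np.zeros((len(model_names), len(item_ids)), dtype=np.int8)
    for m, name in enumerate(model_names):
        for r in records[name]:
            R[m, item_index[r["item_id"]]] = int(r["correct"])
    return model_names, item_ids, R
```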
Train the psychometric model by running `scripts/dataset/train.ipynb`.
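For intuition, a 2PL IRT model can be fit to such a response matrix by maximum likelihood with gradient descent, as in the minimal sketch below. This is not the notebook's actual implementation, only the underlying idea.

```python
import torch

# Fit a 2PL IRT model to a binary response matrix R (models x items)
# by minimizing the Bernoulli negative log-likelihood.

def fit_2pl(R, epochs=500, lr=0.05):
    R = torch.as_tensor(R, dtype=torch.float32)
    n_models, n_items = R.shape
    theta = torch.zeros(n_models, requires_grad=True)   # abilities, one per model
    a = torch.ones(n_items, requires_grad=True)          # item discriminations
    b = torch.zeros(n_items, requires_grad=True)         # item difficulties

    opt = torch.optim.Adam([theta, a, b], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        logits = a.unsqueeze(0) * (theta.unsqueeze(1) - b.unsqueeze(0))
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, R)
        loss.backward()
        opt.step()
    return theta.detach(), a.detach(), b.detach()
```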
For adaptive testing, execute `scripts/dataset/test.ipynb`.
By default, we use TensorBoard to visualize the reward of each iteration (see the demos in the `scripts` directory). Run

```bash
tensorboard --logdir /path/to/logs
```

to view the results.
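For reference, logging a per-iteration scalar for TensorBoard typically looks like the sketch below, using PyTorch's `SummaryWriter`. The log directory and the `reward` tag are placeholders, not necessarily what the demo scripts write.

```python
from torch.utils.tensorboard import SummaryWriter

# Illustrative only: log directory and scalar tag are placeholders.
writer = SummaryWriter(log_dir="logs/demo_run")
for step, reward in enumerate([0.1, 0.3, 0.45, 0.6]):   # e.g. reward at each test step
    writer.add_scalar("reward", reward, step)            # viewable via `tensorboard --logdir logs`
writer.close()
```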
All raw data mentioned in the paper can be found in the `data/experiments` directory, which includes:
- `GSM8K_irt_feature.json`, `MedQA_irt_feature.json`: Estimated IRT features of each item in the GSM8K and MedQA datasets (Figures 2, 9, 10).
- `guess/`: Guessing-factor data for contaminated and uncontaminated items across three benchmarks (MATH, NarrativeQA, RAFT) (Figure 3). Here, 0 indicates no contamination and 1 indicates contamination.
- `consistency.json`: Ranking consistency data with the full benchmark.
- `similarity/`: The average Jaccard similarity coefficient of the items selected for each LLM on the MATH benchmark, as the number of selected items increases from 10% to 80% of the entire benchmark (see the sketch below).
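For reference, the Jaccard similarity between two selected item sets is the size of their intersection divided by the size of their union; a minimal sketch with made-up item IDs is below.

```python
# Jaccard similarity of the item sets selected for two LLMs: |A ∩ B| / |A ∪ B|.
# The item IDs here are made up for illustration.

def jaccard(selected_a, selected_b):
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(a | b)

llm1_items = {3, 7, 11, 19}
llm2_items = {3, 7, 13, 19}
print(jaccard(llm1_items, llm2_items))   # 3 shared / 5 total = 0.6
```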