We evaluated the performance of commercially available Vision-Language Models (VLMs) on visual mathematical reasoning problems using the MathVista dataset. We highlight the shortcomings of traditional evaluation metrics and propose a novel scoring system suited for mathematical reasoning tasks.
A detailed explanation of our work and results can be found in Report.pdf.
- 📂 Dataset
- 🤖 VLM Response Generation
- 📝 Gold Standard Reasoning Creation
- ⚠️ Limitations of Traditional Evaluation Metrics
- 🧮 Proposed Scoring Method
- 📊 Results
- 🏁 Conclusion
- 🚀 Future Directions
The MathVista dataset can be accessed via the following link:
We performed our analysis of answer correctness and problem category recognition on the testmini set – a subset of 1,000 labelled samples. This is present in the directory Dataset-1000.
We worked with a smaller set consisting of the first 100 math problems from the testmini set for human annotations and reasoning evaluation. This is present in the directory Dataset-100. These samples correspond to the PIDs in the file pid_100.csv.
We conducted our analysis on the following VLMs:
- GPT-4
- Claude
- Gemini
- LLaVA
For LLaVA, we wrote the following notebooks to run the VLM on a T4 GPU using HuggingFace for various tasks (a minimal loading sketch follows the list):
- Problem category recognition – category_recognition.ipynb
- Reasoning generation – reasoning_generation.ipynb
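The notebooks are the source of truth; the sketch below only illustrates one way to run LLaVA 1.5 (7B) with HuggingFace Transformers on a T4, assuming the llava-hf/llava-1.5-7b-hf checkpoint and half-precision weights:

```python
# Minimal sketch (not the exact notebook code): run LLaVA 1.5 7B with HuggingFace Transformers.
# Assumes the llava-hf/llava-1.5-7b-hf checkpoint; float16 keeps the model within a T4's 16 GB.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ask_llava(image_path: str, question: str) -> str:
    """Return LLaVA's answer to a visual question."""
    image = Image.open(image_path)
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return processor.decode(output[0], skip_special_tokens=True)
```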
For GPT-4, we used API calls to directly retrieve the responses.
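For illustration only (the model name and client usage below are assumptions, not our exact script), a GPT-4 call with the problem image attached might look like:

```python
# Minimal sketch: send a MathVista question plus its image to GPT-4 via the OpenAI API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumption: any GPT-4 model with vision support
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content
```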
We evaluated the problem category recognition and answer correctness scores on Dataset-1000. We limited this study to GPT-4 and LLaVA due to resource constraints.
We assessed the mathematical reasoning capabilities of the four VLMs on Dataset-100. We generated the responses for Claude and Gemini using their respective online user interfaces.
The table below lists the prompts used for generating results on the three tasks, with <Question> replaced by the query field of the sample:
The MathVista dataset does not provide reasoning for the answers. There can be multiple correct ways to reason through a mathematical problem. To account for these variations, all five members of our team solved and documented the reasoning for the problems in Dataset-100, resulting in 500 gold standard reasoning annotations. We assessed the reasoning generated by a VLM for a given problem by comparing it to the five gold standard annotations and selecting the highest score.
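Conceptually, this best-of-five comparison reduces to taking the maximum metric value over the gold annotations; a minimal sketch, with `similarity` standing in for whichever metric is used:

```python
# Score a VLM reasoning against each of the five gold annotations and keep the best score.
from typing import Callable, Sequence

def best_of_gold(vlm_reasoning: str,
                 gold_annotations: Sequence[str],
                 similarity: Callable[[str, str], float]) -> float:
    """Return the highest similarity between the VLM reasoning and any gold annotation."""
    return max(similarity(vlm_reasoning, gold) for gold in gold_annotations)
```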
We created the notebook postprocess.ipynb to apply heuristic-based post-processing that ensures a consistent representation of mathematical expressions across human annotations and VLM-generated reasonings.
The human annotations after post-processing are present in the file postprocessed_annotations_100.csv.
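The exact heuristics live in postprocess.ipynb; the sketch below only illustrates the kind of normalization involved (the specific rules shown are hypothetical):

```python
# Illustrative sketch only: example normalizations that make mathematical expressions
# comparable across human annotations and VLM outputs. The rules below are hypothetical.
import re

def normalize_math(text: str) -> str:
    text = text.lower().strip()
    text = text.replace("×", "*").replace("÷", "/")      # unify operator symbols
    text = re.sub(r"\s*([*+/=^-])\s*", r" \1 ", text)    # uniform spacing around operators
    text = re.sub(r"(\d),(\d{3})\b", r"\1\2", text)      # 1,000 -> 1000
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace
    return text
```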
We explored using n-gram and seq2seq evaluation metrics such as BLEU, BLEURT, and BERTScore to compare the reasoning provided by a VLM with the gold standard annotations to determine logical correctness. These metrics prioritize response style similarity over mathematical correctness, as illustrated in the table below:
We also observe that these scores lack interpretability.
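For reference, a minimal sketch of how such metric comparisons can be computed (assuming the nltk and bert-score packages; not necessarily our exact setup):

```python
# Compare a candidate reasoning against a gold annotation with BLEU and BERTScore.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

def bleu(candidate: str, reference: str) -> float:
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(), smoothing_function=smooth)

def bertscore_f1(candidate: str, reference: str) -> float:
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return f1.item()
```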
Unlike traditional approaches that focus on stylistic similarity, our work introduces an interpretable scoring method tailored to mathematical reasoning tasks. We identified three main aspects of evaluating mathematical reasoning statements:
- Logical correctness (LC)
- Mathematical correctness (MC)
- Readability (R)
LC and R are categorized into three levels: Low (0), Medium (0.5), and High (1). MC is classified as either Incorrect (0) or Correct (1).
Our proposed score is a weighted sum of these three aspects:
Proposed Score = 0.5 × LC + 0.3 × MC + 0.2 × R
These weights emphasize logical correctness first, then mathematical correctness, with readability as a lesser priority, keeping the final score interpretable.
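In code, the formula above is simply:

```python
# Direct implementation of the proposed score.
def proposed_score(lc: float, mc: float, r: float) -> float:
    """lc and r take values in {0, 0.5, 1}; mc takes values in {0, 1}."""
    assert lc in (0, 0.5, 1) and r in (0, 0.5, 1) and mc in (0, 1)
    return 0.5 * lc + 0.3 * mc + 0.2 * r

# Example: high logical correctness, correct math, medium readability.
print(proposed_score(lc=1, mc=1, r=0.5))  # 0.9
```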
The following subsections illustrate the results we obtained on various tasks.
We assessed the VLMs' ability to identify the type of problem on Dataset-1000, based on the four categories presented to them. We hypothesized that a model's inability to understand the problem type contributed to incorrect reasoning.
There are four problem categories (present as the label "task" in the metadata column of the dataset), with each problem corresponding to a single category:
- TQA – Textbook Question Answering
- FQA – Figure Question Answering
- MWP – Math Word Problem
- GPS – Geometry Problem Solving
Model | Accuracy Score |
---|---|
GPT-4 | 0.543 |
LLaVA | 0.103 |
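For reference, category-recognition accuracy can be computed by comparing each prediction against the task label in the dataset metadata; the file names and column names below are hypothetical, not the repository's actual layout:

```python
# Sketch: compare predicted categories against the "task" label in the metadata column.
import json
import pandas as pd

dataset = pd.read_csv("Dataset-1000/testmini.csv")  # assumption: metadata stored as JSON strings
dataset["task"] = dataset["metadata"].apply(lambda m: json.loads(m)["task"])

preds = pd.read_csv("predictions_gpt4.csv")         # hypothetical: columns pid, predicted_task
merged = dataset.merge(preds, on="pid")
accuracy = (merged["task"] == merged["predicted_task"]).mean()
print(f"Category recognition accuracy: {accuracy:.3f}")
```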
We evaluated the VLMs' ability to arrive at the correct answer when given visual math problems from Dataset-1000.
The following table shows the overall accuracy along with a breakdown of the scores obtained on various problem categories:
Model | Accuracy | TQA | FQA | MWP | GPS |
---|---|---|---|---|---|
GPT-4 | 0.397 | 0.377 | 0.250 | 0.521 | 0.514 |
LLaVA | 0.070 | 0.155 | 0.107 | 0.091 | 0.014 |
We applied our proposed scoring via human evaluation to the responses generated by the four VLMs on Dataset-100. The LC, MC, and R categories assigned to each model's response for each problem are given in the file human_evaluation_<model>.csv.
The notebook scoring.ipynb contains the code for reading the human_evaluation_<model>.csv file and generating the final proposed score for the corresponding model.
The following table shows the average proposed score for each of the VLMs:
Model | Proposed Score |
---|---|
GPT-4 | 0.697 |
Claude | 0.647 |
Gemini | 0.502 |
LLaVA | 0.322 |
- GPT-4 consistently outperforms other VLMs across visual mathematical reasoning tasks.
- LLaVA performs significantly worse, likely due to its smaller 7B-parameter configuration.
- Traditional evaluation metrics such as BLEU, BLEURT, and BERTScore are unsuitable for logical correctness due to their focus on stylistic similarity rather than mathematical accuracy.
- Our proposed scoring method is tailored toward mathematical reasoning with a focus on interpretability.
- To enhance VLMs’ understanding of visual question answering (VQA), external tools such as solvers and object recognition systems can be integrated.
- To improve the reasoning generated by the models, approaches like zero-shot or few-shot learning, combined with chain-of-thought reasoning, could be explored.
- Automating the scoring process to eliminate reliance on human evaluation of logical correctness and readability could facilitate more extensive analysis.
As an exploratory step, we experimented with various readability scores, which are present in the notebook readability_experiments.ipynb.
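As an example of such readability measures (the notebook is authoritative; this sketch assumes the textstat package):

```python
# Standard readability measures applied to a sample reasoning string.
import textstat

reasoning = "The triangle has legs 3 and 4, so by the Pythagorean theorem the hypotenuse is 5."
print(textstat.flesch_reading_ease(reasoning))   # higher = easier to read
print(textstat.flesch_kincaid_grade(reasoning))  # approximate US grade level
```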