We evaluated the performance of commercially available Vision-Language Models (VLMs) on visual mathematical reasoning problems using the MathVista dataset. We highlight the shortcomings of traditional evaluation metrics and propose a novel scoring system suited for mathematical reasoning tasks.
A detailed explanation of our work and results can be found in Report.pdf.
- 📂 Dataset
- 🤖 VLM Response Generation
- 📝 Gold Standard Reasoning Creation
- ⚠️ Limitations of Traditional Evaluation Metrics
- 🧮 Proposed Scoring Method
- 📊 Results
- 🏁 Conclusion
- 🚀 Future Directions
The MathVista dataset can be accessed via the following link:
We performed our analysis of answer correctness and problem category recognition on the testmini set – a subset of 1,000 labelled samples. This is present in the directory Dataset-1000.
We worked with a smaller set consisting of the first 100 math problems from the testmini set for human annotations and reasoning evaluation. This is present in the directory Dataset-100. These samples correspond to the PIDs in the file pid_100.csv.
We conducted our analysis on the following VLMs:
- GPT-4
- Claude
- Gemini
- LLaVA
For LLaVA, we wrote the following notebooks to run the VLM on a T4 GPU using HuggingFace for various tasks (a minimal loading sketch follows the list):
- Problem category recognition – category_recognition.ipynb
- Reasoning generation – reasoning_generation.ipynb
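The notebooks are the source of truth; the sketch below only illustrates one way to run LLaVA 1.5 (7B) with HuggingFace Transformers on a T4, assuming the llava-hf/llava-1.5-7b-hf checkpoint and half-precision weights:

```python
# Minimal sketch (not the exact notebook code): run LLaVA 1.5 7B with HuggingFace Transformers.
# Assumes the llava-hf/llava-1.5-7b-hf checkpoint; float16 keeps the model within a T4's 16 GB.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def ask_llava(image_path: str, question: str) -> str:
    """Return LLaVA's answer to a visual question."""
    image = Image.open(image_path)
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return processor.decode(output[0], skip_special_tokens=True)
```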
For GPT-4, we used API calls to directly retrieve the responses.
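For illustration only (the model name and client usage below are assumptions, not our exact script), a GPT-4 call with the problem image attached might look like:

```python
# Minimal sketch: send a MathVista question plus its image to GPT-4 via the OpenAI API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumption: any GPT-4 model with vision support
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content
```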
We evaluated the problem category recognition and answer correctness scores on Dataset-1000. We limited this study to GPT-4 and LLaVA due to resource constraints.
We assessed the mathematical reasoning capabilities of the four VLMs on Dataset-100. We generated the responses for Claude and Gemini using their respective online user interfaces.
The table below lists the prompts used for generating results on the three tasks, with <Question> replaced by the query field of the sample:
The MathVista dataset does not provide reasoning for the answers. There can be multiple correct ways to reason through a mathematical problem. To account for these variations, all five members of our team solved and documented the reasoning for the problems in Dataset-100, resulting in 500 gold standard reasoning annotations. We assessed the reasoning generated by a VLM for a given problem by comparing it to the five gold standard annotations and selecting the highest score.
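Conceptually, this best-of-five comparison reduces to taking the maximum metric value over the gold annotations; a minimal sketch, with `similarity` standing in for whichever metric is used:

```python
# Score a VLM reasoning against each of the five gold annotations and keep the best score.
from typing import Callable, Sequence

def best_of_gold(vlm_reasoning: str,
                 gold_annotations: Sequence[str],
                 similarity: Callable[[str, str], float]) -> float:
    """Return the highest similarity between the VLM reasoning and any gold annotation."""
    return max(similarity(vlm_reasoning, gold) for gold in gold_annotations)
```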
We created the notebook postprocess.ipynb to apply heuristic-based post-processing that ensures a consistent representation of mathematical expressions across human annotations and VLM-generated reasonings.
The human annotations after post-processing are present in the file postprocessed_annotations_100.csv.
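The exact heuristics live in postprocess.ipynb; the sketch below only illustrates the kind of normalization involved (the specific rules shown are hypothetical):

```python
# Illustrative sketch only: example normalizations that make mathematical expressions
# comparable across human annotations and VLM outputs. The rules below are hypothetical.
import re

def normalize_math(text: str) -> str:
    text = text.lower().strip()
    text = text.replace("×", "*").replace("÷", "/")      # unify operator symbols
    text = re.sub(r"\s*([*+/=^-])\s*", r" \1 ", text)    # uniform spacing around operators
    text = re.sub(r"(\d),(\d{3})\b", r"\1\2", text)      # 1,000 -> 1000
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace
    return text
```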
We explored using n-gram and seq2seq evaluation metrics such as BLEU, BLEURT, and BERTScore to compare the reasoning provided by a VLM with the gold standard annotations to determine logical correctness. These metrics prioritize response style similarity over mathematical correctness, as illustrated in the table below:
We also observe that these scores lack interpretability.
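For reference, a minimal sketch of how such metric comparisons can be computed (assuming the nltk and bert-score packages; not necessarily our exact setup):

```python
# Compare a candidate reasoning against a gold annotation with BLEU and BERTScore.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bert_score

def bleu(candidate: str, reference: str) -> float:
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(), smoothing_function=smooth)

def bertscore_f1(candidate: str, reference: str) -> float:
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return f1.item()
```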
Unlike traditional approaches that focus on stylistic similarity, our work introduces an interpretable scoring method tailored to mathematical reasoning tasks. We identified three main aspects of evaluating mathematical reasoning statements:
- Logical correctness (LC)
- Mathematical correctness (MC)
- Readability (R)
LC and R are categorized into three levels: Low (0), Medium (0.5), and High (1). MC is classified as either Incorrect (0) or Correct (1).
Our proposed score is a weighted sum of these three aspects:
Proposed Score = 0.5 × LC + 0.3 × MC + 0.2 × R
These weights emphasize logical correctness first, then mathematical correctness, with readability as a lesser priority, keeping the final score interpretable.
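In code, the formula above is simply:

```python
# Direct implementation of the proposed score.
def proposed_score(lc: float, mc: float, r: float) -> float:
    """lc and r take values in {0, 0.5, 1}; mc takes values in {0, 1}."""
    assert lc in (0, 0.5, 1) and r in (0, 0.5, 1) and mc in (0, 1)
    return 0.5 * lc + 0.3 * mc + 0.2 * r

# Example: high logical correctness, correct math, medium readability.
print(proposed_score(lc=1, mc=1, r=0.5))  # 0.9
```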
The following subsections illustrate the results we obtained on various tasks.
We assessed the VLMs' ability to identify the type of problem on Dataset-1000, based on the four categories presented to them. We hypothesized that a model's inability to understand the problem type contributed to incorrect reasoning.
There are four problem categories (present as the label "task" in the metadata column of the dataset), with each problem corresponding to a single category:
- TQA – Textbook Question Answering
- FQA – Figure Question Answering
- MWP – Math Word Problem
- GPS – Geometry Problem Solving
Model | Accuracy Score |
---|---|
GPT-4 | 0.543 |
LLaVA | 0.103 |
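For reference, category-recognition accuracy can be computed by comparing each prediction against the task label in the dataset metadata; the file names and column names below are hypothetical, not the repository's actual layout:

```python
# Sketch: compare predicted categories against the "task" label in the metadata column.
import json
import pandas as pd

dataset = pd.read_csv("Dataset-1000/testmini.csv")  # assumption: metadata stored as JSON strings
dataset["task"] = dataset["metadata"].apply(lambda m: json.loads(m)["task"])

preds = pd.read_csv("predictions_gpt4.csv")         # hypothetical: columns pid, predicted_task
merged = dataset.merge(preds, on="pid")
accuracy = (merged["task"] == merged["predicted_task"]).mean()
print(f"Category recognition accuracy: {accuracy:.3f}")
```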
We evaluated the VLMs' ability to arrive at the correct answer when given visual math problems from Dataset-1000.
The following table shows the overall accuracy along with a breakdown of the scores obtained on various problem categories:
Model | Accuracy | TQA | FQA | MWP | GPS |
---|---|---|---|---|---|
GPT-4 | 0.397 | 0.377 | 0.250 | 0.521 | 0.514 |
LLaVA | 0.070 | 0.155 | 0.107 | 0.091 | 0.014 |
We applied our proposed scoring via human evaluation to the responses generated by the four VLMs on Dataset-100. The LC, MC, and R categories assigned to each model's response for each problem are given in the file human_evaluation_<model>.csv.
The notebook scoring.ipynb contains the code for reading the human_evaluation_<model>.csv file and generating the final proposed score for the corresponding model.
The following table shows the average proposed score for each of the VLMs:
Model | Proposed Score |
---|---|
GPT-4 | 0.697 |
Claude | 0.647 |
Gemini | 0.502 |
LLaVA | 0.322 |
- GPT-4 consistently outperforms other VLMs across visual mathematical reasoning tasks.
- LLaVA performs significantly worse, likely due to its smaller 7B-parameter configuration.
- Traditional evaluation metrics such as BLEU, BLEURT, and BERTScore are unsuitable for logical correctness due to their focus on stylistic similarity rather than mathematical accuracy.
- Our proposed scoring method is tailored toward mathematical reasoning with a focus on interpretability.
- To enhance VLMs’ understanding of visual question answering (VQA), external tools such as solvers and object recognition systems can be integrated.
- To improve the reasoning generated by the models, approaches like zero-shot or few-shot learning, combined with chain-of-thought reasoning, could be explored.
- Automating the scoring process to eliminate reliance on human evaluation of logical correctness and readability could facilitate more extensive analysis.
As an exploratory step, we experimented with various readability scores, which are present in the notebook readability_experiments.ipynb.
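As an example of such readability measures (the notebook is authoritative; this sketch assumes the textstat package):

```python
# Standard readability measures applied to a sample reasoning string.
import textstat

reasoning = "The triangle has legs 3 and 4, so by the Pythagorean theorem the hypotenuse is 5."
print(textstat.flesch_reading_ease(reasoning))   # higher = easier to read
print(textstat.flesch_kincaid_grade(reasoning))  # approximate US grade level
```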