Visual Mathematical Reasoning using Vision-Language Models

We evaluated the performance of commercially available Vision-Language Models (VLMs) on visual mathematical reasoning problems using the MathVista dataset. We highlight the shortcomings of traditional evaluation metrics and propose a novel scoring system suited for mathematical reasoning tasks.

A detailed explanation of our work and results can be found in Report.pdf.

Table of Contents

  1. 📂 Dataset
  2. 🤖 VLM Response Generation
  3. 📝 Gold Standard Reasoning Creation
  4. ⚠️ Limitations of Traditional Evaluation Metrics
  5. 🧮 Proposed Scoring Method
  6. 📊 Results
  7. 🏁 Conclusion
  8. 🚀 Future Directions

📂 Dataset

The MathVista dataset can be accessed via the following link:

https://huggingface.co/datasets/AI4Math/MathVista

We performed our analysis of answer correctness and problem category recognition on the testmini set – a subset of 1000 labelled samples. This is present in the directory Dataset-1000.

We worked with a smaller set consisting of the first 100 math problems from the testmini set for human annotations and reasoning evaluation. This is present in the directory Dataset-100. These samples correspond to the PIDs in the file pid_100.csv.
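
A minimal sketch of how the testmini split and the 100-problem subset can be loaded with the Hugging Face datasets library (the column name in pid_100.csv is assumed here):

```python
import pandas as pd
from datasets import load_dataset

# Load the 1000-sample labelled testmini split of MathVista.
testmini = load_dataset("AI4Math/MathVista", split="testmini")

# Keep the 100 problems used for the reasoning study, identified by the
# PIDs in pid_100.csv (a "pid" column is assumed).
pids = set(pd.read_csv("pid_100.csv")["pid"].astype(str))
subset = testmini.filter(lambda ex: str(ex["pid"]) in pids)
```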

🤖 VLM Response Generation

We conducted our analysis on the following VLMs:

  • GPT-4
  • Claude
  • Gemini
  • LLaVA

For LLaVA, we wrote notebooks (included in this repository) to run the VLM on a T4 GPU via Hugging Face for the various tasks.
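
As a rough illustration (not the exact notebook code), a LLaVA response can be generated with the transformers library along the following lines; the checkpoint name and prompt template are assumptions:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint; the 7B model fits on a T4 in fp16
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("problem_image.png")
prompt = "USER: <image>\nSolve the problem and explain your reasoning. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```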

For GPT-4, we used API calls to directly retrieve the responses.
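
For illustration, such a call can be made with the OpenAI Python client roughly as follows; the model name, prompt wording, and image handling are illustrative rather than our exact setup:

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("problem_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# The actual prompts we used are listed in the table below.
prompt = "Solve the problem shown in the image and explain your reasoning."

response = client.chat.completions.create(
    model="gpt-4-turbo",  # illustrative; any GPT-4 model with vision input
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```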

We evaluated problem category recognition and answer correctness on Dataset-1000. Due to resource constraints, we limited this part of the study to GPT-4 and LLaVA.

We assessed the mathematical reasoning capabilities of the four VLMs on Dataset-100. We generated the responses for Claude and Gemini using their respective online user interfaces.

The table below lists the prompts used for generating results on the three tasks, with <Question> replaced by the query field of the sample:

[Image: Prompts for various tasks]

📝 Gold Standard Reasoning Creation

The MathVista dataset does not provide reasoning for the answers. There can be multiple correct ways to reason through a mathematical problem. To account for these variations, all five members of our team solved and documented the reasoning for the problems in Dataset-100, resulting in 500 gold standard reasoning annotations. We assessed the reasoning generated by a VLM for a given problem by comparing it to the five gold standard annotations and selecting the highest score.

We created the notebook postprocess.ipynb to apply a heuristic-based approach for post-processing to ensure consistent representation of mathematical expressions across human annotations and VLM-generated reasonings.
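
The exact heuristics are in postprocess.ipynb; the snippet below only illustrates the kind of normalization involved (these specific rules are assumptions, not the ones we applied):

```python
import re

def normalize_math(text: str) -> str:
    """Illustrative normalization of mathematical expressions."""
    # Map unicode operators to ASCII equivalents.
    text = text.replace("×", "*").replace("÷", "/").replace("−", "-")
    # Enforce uniform spacing around operators.
    text = re.sub(r"\s*([+*/=])\s*", r" \1 ", text)
    # Remove thousands separators: 1,000 -> 1000.
    text = re.sub(r"(\d),(\d{3})\b", r"\1\2", text)
    return text.strip()

print(normalize_math("Area=12×3"))  # "Area = 12 * 3"
```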

The human annotations after postprocessing are present in the file postprocessed_annotations_100.csv.

⚠️ Limitations of Traditional Evaluation Metrics

We explored n-gram-based and model-based evaluation metrics such as BLEU, BLEURT, and BERTScore for comparing the reasoning provided by a VLM with the gold standard annotations to determine logical correctness. These metrics prioritize stylistic similarity over mathematical correctness, as illustrated in the table below:

[Image: Limitations of traditional metrics]

We also observe that these scores lack interpretability.
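
For reference, the sketch below shows how such scores can be computed with the Hugging Face evaluate library, keeping the best match over the gold annotations; the example texts are made up, and BLEURT can be loaded analogously:

```python
import evaluate

bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")  # "bleurt" can be loaded the same way

vlm_reasoning = "The bar for 2020 has height 8, so the answer is 8."
gold_annotations = [  # five human annotations per problem in practice
    "Reading the 2020 bar gives a height of 8, hence the answer is 8.",
    "The chart shows a value of 8 for 2020, so the answer is 8.",
]

# BLEU accepts multiple references per prediction directly.
bleu_score = bleu.compute(predictions=[vlm_reasoning], references=[gold_annotations])["bleu"]

# For BERTScore we keep the highest F1 across the annotations.
best_f1 = max(
    bertscore.compute(predictions=[vlm_reasoning], references=[g], lang="en")["f1"][0]
    for g in gold_annotations
)
print(bleu_score, best_f1)
```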

🧮 Proposed Scoring Method

Unlike traditional approaches that focus on stylistic similarity, our work introduces an interpretable scoring method tailored to mathematical reasoning tasks. We identified three main aspects pertaining to evaluating mathematical reasoning statements:

  • Logical correctness (LC)
  • Mathematical correctness (MC)
  • Readability (R)

LC and R are categorized into three levels: Low (0), Medium (0.5), and High (1). MC is classified as either Incorrect (0) or Correct (1).

Our proposed score is a weighted sum of these three aspects:

Proposed Score = 0.5 × LC + 0.3 × MC + 0.2 × R

These weights are designed to emphasize logical correctness first, followed by mathematical correctness, with readability as the lowest priority.
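
A minimal sketch of the computation for a single response:

```python
def proposed_score(lc: float, mc: float, r: float) -> float:
    """Weighted combination of logical correctness, mathematical correctness,
    and readability. lc, r take values in {0, 0.5, 1}; mc in {0, 1}."""
    return 0.5 * lc + 0.3 * mc + 0.2 * r

# Logically sound (1), mathematically correct (1), medium readability (0.5):
print(proposed_score(1, 1, 0.5))  # 0.9
```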

📊 Results

The following subsections illustrate the results we obtained on various tasks.

🗂️ Problem Category Recognition

We assessed the VLMs' ability to identify the category of each problem in Dataset-1000, choosing from the four categories presented to them. We hypothesized that a model's inability to understand the problem type contributed to incorrect reasoning.

There are four kinds of problem categories (present as the label "task" in the metadata column of the dataset), with each problem corresponding to a single category:

  • TQA – Textbook Question Answering
  • FQA – Figure Question Answering
  • MWP – Math Word Problem
  • GPS – Geometry Problem Solving

| Model | Accuracy Score |
| --- | --- |
| GPT-4 | 0.543 |
| LLaVA | 0.103 |

✅ Answer Correctness

We evaluated the VLMs' ability to arrive at the correct answer when given visual math problems from Dataset-1000.

The following table shows the overall accuracy along with a breakdown of the scores obtained on various problem categories:

| Model | Accuracy | TQA | FQA | MWP | GPS |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | 0.397 | 0.377 | 0.250 | 0.521 | 0.514 |
| LLaVA | 0.070 | 0.155 | 0.107 | 0.091 | 0.014 |

🔢 Mathematical Reasoning

We employed human evaluation along with our proposed scoring on the responses generated by the four VLMs on Dataset-100. The LC, MC, and R categories assigned to each model's response for each problem are given in the file human_evaluation_<model>.csv.

The notebook scoring.ipynb contains the code for reading the human_evaluation_<model>.csv file and generating the final proposed score for the corresponding model.
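
The notebook itself is not reproduced here, but the aggregation amounts to something like the following pandas sketch (the file name and the LC/MC/R column headers are assumed):

```python
import pandas as pd

df = pd.read_csv("human_evaluation_gpt4.csv")  # illustrative file name

df["proposed_score"] = 0.5 * df["LC"] + 0.3 * df["MC"] + 0.2 * df["R"]
print(f"Average proposed score: {df['proposed_score'].mean():.3f}")
```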

The following table shows the average proposed score for each of the VLMs:

| Model | Proposed Score |
| --- | --- |
| GPT-4 | 0.697 |
| Claude | 0.647 |
| Gemini | 0.502 |
| LLaVA | 0.322 |

🏁 Conclusion

  • GPT-4 consistently outperforms other VLMs across visual mathematical reasoning tasks.
  • LLaVA performs significantly worse, likely due to its smaller 7B configuration.
  • Traditional evaluation metrics such as BLEU, BLEURT, and BERTScore are unsuitable for logical correctness due to their focus on stylistic similarity rather than mathematical accuracy.
  • Our proposed scoring method is tailored toward mathematical reasoning with a focus on interpretability.

🚀 Future Directions

  • To enhance VLMs’ performance on visual question answering (VQA), external tools such as solvers and object recognition systems could be integrated.
  • To improve the reasoning generated by the models, approaches like zero-shot or few-shot learning, combined with chain-of-thought reasoning, could be explored.
  • Automating the scoring process to eliminate reliance on human evaluation of logical correctness and readability could facilitate more extensive analysis.

As an exploratory step, we experimented with various readability scores, which are present in the notebook readability_experiments.ipynb.
