
Large memory usage on MATH #80

Closed · lewtun opened this issue Mar 2, 2024 · 3 comments · Fixed by #83
Labels: bug (Something isn't working)

@lewtun (Member) commented Mar 2, 2024

Is the MATH benchmark expected to run for anything beyond batch_size=1?

Running the following command for a small model gives OOM on a single node of H100s, which is a bit surprising to me:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=Qwen/Qwen1.5-0.5B" \
    --override_batch_size 2

Strangely enough, bumping up the batch size for Mistral 7B is fine:

accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --tasks="lighteval|math:algebra|5|0" \
    --output_dir "./scratch/evals" \
    --model_args "pretrained=mistralai/Mistral-7B-v0.1" \
    --override_batch_size 2

Perhaps there's some sort of unbounded generation occurring which is causing the memory to explode for certain models like Qwen?

@clefourrier (Member) commented:

Hi,
Thanks for the issue!

I can confirm that the generation size is unbounded, which you can see in the task description

{"name":"math:algebra","suite":["lighteval","math"],"prompt_function":"math","hf_repo":"lighteval\/MATH","hf_subset":"algebra","hf_avail_splits":["train","test","validation"],"evaluation_splits":["test"],"few_shots_split":null,"few_shots_select":null,"generation_size":null,"metric":["quasi_exact_match_math"],"stop_sequence":["\n"],"output_regex":null,"frozen":false}

When generation_size is null, there is no bound except the model's max context length (which should be around 8K for both of these models, though).

I'll check whether the paper defines a maximum expected generation size; otherwise I'll fix the bound to the maximum answer size + 10%, maybe?
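
(For reference, a minimal sketch, not part of lighteval or this thread, of how to check the context length that bounds generation when generation_size is null. It assumes the standard transformers AutoConfig API and that both architectures expose max_position_embeddings.)

from transformers import AutoConfig

# Print the context length that bounds generation when generation_size is null.
for model_id in ("Qwen/Qwen1.5-0.5B", "mistralai/Mistral-7B-v0.1"):
    config = AutoConfig.from_pretrained(model_id)
    print(model_id, getattr(config, "max_position_embeddings", None))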

@lewtun (Member, Author) commented Mar 2, 2024

> I'll check whether the paper defines a maximum expected generation size; otherwise I'll fix the bound to the maximum answer size + 10%, maybe?

Yes, alternatively we could set the max gen size to something like 1024 or 2048 tokens, since if a model cannot answer in that span it is likely incorrect. You can see here that the authors chose 1024 tokens for models other than gpt2-xl, so 2048 seems like a safe bet.

@clefourrier (Member) commented:

Sounds perfect, will use this rn!

clefourrier added a commit that referenced this issue Mar 4, 2024
clefourrier self-assigned this Mar 4, 2024
NathanHB added the bug label Mar 4, 2024
clefourrier added a commit that referenced this issue Mar 4, 2024:
    Caps it at 2048 even for models with a much longer context size. Should fix #80
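
(A minimal sketch of the capping logic described in the commit above, not the actual patch in #83; the function name and constant below are hypothetical.)

# Cap agreed on in this issue: bound tasks with no generation_size at 2048 new tokens.
DEFAULT_MAX_GENERATION_SIZE = 2048

def resolve_max_new_tokens(task_generation_size, model_max_length):
    """Pick the generation bound for a request."""
    if task_generation_size is not None:
        # A task-level generation_size takes precedence when set.
        return task_generation_size
    # Otherwise cap at 2048, even for models with a much longer context size.
    return min(model_max_length, DEFAULT_MAX_GENERATION_SIZE)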