
Ar support for MBZUAI-arabic-mmlu #209

Closed
bakrianoo wants to merge 11 commits

Conversation

@bakrianoo (Author)

Add Support for ArabicMMLU Evaluation Task

Overview

This PR introduces a new evaluation task for Arabic large language models (LLMs) using the ArabicMMLU dataset, as detailed in the paper "ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic". The dataset provides a comprehensive benchmark for evaluating LLM performance across a wide range of tasks in Arabic.

Related Work

Notes

This contribution aligns with ongoing efforts to expand lighteval's support for diverse languages and tasks. Feedback and suggestions are welcome!

@clefourrier (Member)

Hi, thanks for your PR!
FYI, we have a small backlog of PRs to go through, so it might take about a week for us to address this one.

@clefourrier (Member)

In the meantime, please make sure that the styling is correct :)

@clefourrier (Member)

@NathanHB It could be worth waiting for #214 (the last PR of the series above) before editing this one to fit the new format.

@clefourrier (Member)

Hi! I think once you update the PR to the new format for metrics, prompts, and functions, we'll be good to go!
Also tagging @alielfilali01, since he was the author of the original arabic_evals file (these tasks are behind the Arabic LLM leaderboard), to get his opinion too.

prompt_function=mbzuai_arabic_mmlu,
suite=["community"],
hf_repo="MBZUAI/ArabicMMLU",
hf_subset="default",
@alielfilali01 (Contributor) commented on the diff:

default is the same as the test subset
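For context, the prompt_function referenced in the excerpt above maps one dataset row to a lighteval Doc. Below is a minimal sketch of what mbzuai_arabic_mmlu could look like; the column names ("Question", "Option 1"…"Option 5", "Answer Key") are assumptions about the MBZUAI/ArabicMMLU schema, not verified here:

from lighteval.tasks.requests import Doc

# Letters used as answer anchors; ArabicMMLU questions have up to five options.
LETTERS = ["A", "B", "C", "D", "E"]

def mbzuai_arabic_mmlu(line, task_name: str = None) -> Doc:
    # Keep only the options actually present in this row (assumed column names).
    choices = [line[f"Option {i}"] for i in range(1, 6) if line.get(f"Option {i}")]
    query = line["Question"] + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)
    )
    return Doc(
        task_name=task_name,
        query=query,
        choices=LETTERS[: len(choices)],  # the model answers with a letter
        gold_index=LETTERS.index(line["Answer Key"]),
    )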

@alielfilali01 (Contributor)

Thanks @clefourrier for the tag, and thank you dear @bakrianoo for your valuable contribution 🤗

I have one remark related to the comment I left above:
hf_subset="default" will load the default subset, which is also the test subset used for both the eval and the few shots!
The solution for me would be to drop the eval subset and never use few shots in this benchmark, OR to make a custom version of this dataset with separate test and val subsets!?

@clefourrier (Member)

I agree with the comment - if you can set up your dataset to have a different split for few-shot examples, it will avoid context contamination. You also need to run the linters to pass the code quality checks.
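For illustration, a sketch of how the config could draw few shots from a separate split, assuming lighteval's LightevalTaskConfig fields (recent versions) and a hypothetical dataset layout with distinct test and val splits:

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig

task = LightevalTaskConfig(
    name="mbzuai_arabic_mmlu",
    prompt_function=mbzuai_arabic_mmlu,
    suite=["community"],
    hf_repo="MBZUAI/ArabicMMLU",
    hf_subset="default",
    # Assumed split layout: the point is that few shots come from a
    # split other than the one being evaluated.
    hf_avail_splits=["test", "val"],
    evaluation_splits=["test"],
    few_shots_split="val",
    few_shots_select="sequential",
    metric=[Metrics.loglikelihood_acc_norm],
)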

@bakrianoo (Author)

Thank you @alielfilali01 for your comment. I am wondering whether creating a new dataset would violate the license of the original dataset!

I need to confirm with the dataset authors, or we can follow the zero-shot suggestion.

What do you think?

@alielfilali01 (Contributor)

@bakrianoo, if it's not Apache 2.0, for example, then let's open a discussion in the dataset repo and see whether the authors would help by creating the dataset themselves. Please feel free to do it, and if there's no response then I can reach out directly to one or two of the authors... What do you think?

@bakrianoo (Author)

Sure. I will start the discussion there. Thank you @alielfilali01 for your interest.

@clefourrier (Member)

Hi! Feel free to tell us when this is updated!

@hynky1999 (Collaborator) commented Oct 2, 2024

I think this is solved by PR #338. See mmlu_ara_mcf.

If you want to use native Arabic letters as the option anchors, use:

formulation=MCFFormulation("NativeLetters")

as the formulation for the task.
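For example, roughly like this (the adapter keys and import paths follow the multilingual templates; treat the exact names as approximate):

from lighteval.tasks.templates.multichoice import get_mcq_prompt_function
from lighteval.tasks.templates.utils.formulation import MCFFormulation
from lighteval.utils.language import Language

prompt_fn = get_mcq_prompt_function(
    Language.ARABIC,
    # Adapter mapping a dataset row to the keys the template expects;
    # the row's column names here are assumptions.
    lambda line: {
        "question": line["question"],
        "choices": line["choices"],
        "gold_idx": line["answer"],
    },
    formulation=MCFFormulation("NativeLetters"),  # Arabic letters as anchors
)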

cc @clefourrier

@alielfilali01 (Contributor)

> I think this is solved by PR #338. See mmlu_ara_mcf.
> cc @clefourrier

@hynky1999 can you please mention the exact task name from the PR you just mentioned? The MMLU here is different from the other MMLUs, and I couldn't find it in the list from PR #338. Also, this PR is stale now, so feel free to close it: I'm planning to merge this version of MMLU in an upcoming PR, maybe next week, alongside other new Arabic tasks.

@alielfilali01 (Contributor)


Sorry, I just saw you mentioned "mmlu_ara_mcf", my bad.
I see so many commits in the PR 😅 I will need to open it from my laptop. I'll get back to you on it by tomorrow at the latest, so I can plan to remove it from my PR if all is good.
Thanks man for taking care of it.

@hynky1999 (Collaborator)

The many commits are because of how we were merging the PRs hhhhh

@hynky1999 (Collaborator)

What tasks are you planning to add, btw?

@alielfilali01 (Contributor)

> What tasks are you planning to add, btw?

Some new benchmarks we got from colleagues and partners, whom we convinced to make them public 😁

@alielfilali01 (Contributor)

Hey @hynky1999, sorry I couldn't get back to you yesterday. Well, I saw the Arabic MMLU task and it seems good. I'm just not sure about the instruction; could you provide more details on that? Also, I saw the hf_repo is yazeed7/ArabicMMLU, whereas the official release is MBZUAI/ArabicMMLU; please change that if you can.

In my upcoming PR I will be adding 3 Arabic MMLU datasets, including this one, as part of the community suite. Then we can run both and see whether the implementation actually affects the evals (it shouldn't, but I want to try it anyway 😅).

For clarity, here are the upcoming MMLU datasets:

  • arabic_mmlu_mt: machine translated
  • arabic_mmlu_ht: our in-house human translation
  • arabic_mmmlu: OpenAI's human translation
  • arabic_mmlu: the one we discuss here (MBZUAI/ArabicMMLU)

PS: "mmlu_okapi_ar" is already part of the community suite.

@alielfilali01 (Contributor)

Hey @clefourrier, I've spoken with @bakrianoo, so please feel free to close this PR. @bakrianoo, maybe you can confirm here.

@bakrianoo (Author)

Since @alielfilali01 is working on including this in another PR, I am going to close it.
Thank you all for your support.

bakrianoo closed this on Oct 4, 2024
@hynky1999 (Collaborator) commented Oct 4, 2024

@alielfilali01
Sure, I will switch it. The reason I used the other repo is that, previously, you were either missing a dedicated few-shot split or it was annoying to access the subsets separately (I don't remember exactly which). Now it looks good 👍

Re instructions:
The design of the templates (which is what all the multilingual evals in that file use) is heavily based on the OLMES paper. Secondly, since it's a bit hard to create a global instruction for all multilingual tasks, I decided not to use instructions for any task. We ran several ablations with this setting and did not find it to be a reason why models fail to solve the tasks. As the OLMES paper notes, the question/answer pair is sufficient to guide the model on what to do.
In theory we could have something like an instruction registry and create a generic instruction for each task.
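A hypothetical sketch of that registry idea, purely illustrative (nothing like this exists in lighteval today):

# Map (task family, language) to a generic instruction, falling back to
# no instruction, which matches the current behaviour.
INSTRUCTION_REGISTRY: dict[tuple[str, str], str] = {
    ("mmlu", "en"): "Answer the following multiple-choice question.",
    # ("mmlu", "ar"): a per-language instruction would go here
}

def get_instruction(task_family: str, language: str) -> str:
    return INSTRUCTION_REGISTRY.get((task_family, language), "")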

Re other MMLUs:

  • This PR should be adding the OpenAI MMLUs: Misc-multilingual tasks #339
  • arabic_mmlu_mt (is this the Okapi one? or different?)
  • arabic_mmlu_ht (I don't see the point of adding the above if you have an in-house translation)

Last note: do you think you could use the multilingual templates?

@alielfilali01 (Contributor)

@hynky1999 Actually, I added the in-house translated MMLU about a week before OpenAI released MMMLU, and you can imagine how much effort it took to convince the team internally to release it 😅. Also, I thought it would be helpful to test how translation quality impacts model performance, and to just leave it to the community to decide which one they want to use...

Note: mmlu_mt is different from mmlu_okapi_ar. The first was translated using a translation engine, while Okapi used GPT-3.5 (I guess).
