
Questeval #40

Merged (28 commits) on Apr 3, 2021

Conversation

padipadou
Contributor

  • Adding a new metric, QuestEval, which uses the source and/or references to evaluate models.

  • This metric needs to know which task the model is evaluated on; if the task is not specified in the data, QuestEval will fall back to a general model for evaluation, which might not fit well (e.g. textual QA models don't work well on tables). See the sketch after this list.

  • Fixed minor typos in the README.
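A minimal sketch of what a predictions set carrying a task field could look like, and how the metric could fall back to a general model when the field is missing. The field names (`values`, `language`, `task`) and the model identifiers are illustrative assumptions, not the actual GEM schema or QuestEval internals.

```python
import json

# Hypothetical predictions file -- the "task" field is the attribute under
# discussion; the other field names are assumptions for illustration.
predictions = json.loads("""
{
  "language": "en",
  "task": "data_to_text",
  "values": ["The Eagle is a cheap coffee shop near the river."]
}
""")

# Fallback behaviour described above: a task-specific QA model when the
# task is known, otherwise a general textual QA model that may fit poorly
# (e.g. on tables).
QA_MODELS = {
    "data_to_text": "table-qa-model",   # placeholder identifier
    "summarization": "text-qa-model",   # placeholder identifier
}
qa_model = QA_MODELS.get(predictions.get("task"), "general-qa-model")
print(qa_model)  # -> table-qa-model
```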

@tuetschek
Collaborator

@padipadou : Thanks, looks mostly good to me! I'd have two points:

  • It would be good to add tests, now that we have them for the other metrics – would that be possible?
  • I'm not completely sure about adding the "task" attribute to the predictions set. Maybe we want the different versions of QuestEval exposed as different metrics (would there be a situation where you'd want to measure more than one of them?). What does @sebastianGehrmann think?

One small note: I made some changes to run_metrics.py recently and moved most of its code to gem_metrics/__init__.py so that all of GEM metrics can be installed by pip (and you get a script called gem_metrics in your $PATH). This is the probable cause of the merge conflict – I guess I can look into that for the merge.
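For readers following along, a console script like `gem_metrics` is typically exposed through a setuptools entry point; the sketch below shows the general pattern. The `gem_metrics:main` target is an assumption about where the moved code lives, not a quote from the actual setup.py.

```python
# setup.py (sketch): how `pip install` can place a `gem_metrics` script on
# $PATH. The module:function target below is an assumption for illustration.
from setuptools import setup, find_packages

setup(
    name="gem_metrics",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "gem_metrics = gem_metrics:main",
        ]
    },
)
```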

@padipadou
Contributor Author

padipadou commented Mar 29, 2021

Hello @tuetschek ,

About unit tests: great, I will add them this week (before or after the merge request is accepted, up to you).
About adding the "task": BERTScore, for instance, requires the language and, depending on it, uses a multilingual model or not, yet it is still the same task. In the same way, QuestEval uses different QA models depending on whether the input is a table or a text, but it is still the same metric. So I think this information will be required at some point anyway.

Tell me what you think! :)
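To make the analogy concrete, a minimal sketch with placeholder model identifiers (not the real models used by BERTScore or QuestEval):

```python
def pick_bertscore_model(language: str) -> str:
    # Same task; the language only decides which underlying model is used.
    return "english-bert" if language == "en" else "multilingual-bert"  # placeholders

def pick_questeval_models(task: str) -> str:
    # Same metric; the task only decides which QA/QG models are used.
    return "table-qa-model" if task == "data_to_text" else "text-qa-model"  # placeholders
```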

@padipadou
Contributor Author

For now:

  • I rebased my branch onto the current main.
  • I modified the setup.py file so it is generic enough to tolerate the package name being different from the repo name (see the sketch after this list).
  • Tests are not added yet.
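A sketch of the idea (not the actual diff), assuming setuptools: discover the package instead of hard-coding a name tied to the repository directory.

```python
# setup.py (sketch): find_packages() locates the `gem_metrics` package (and
# its sub-packages) no matter what the enclosing repository directory is
# called, so the distribution name does not have to match the repo name.
from setuptools import setup, find_packages

setup(
    name="gem_metrics",  # distribution name, independent of the repo dir name
    packages=find_packages(exclude=["tests", "tests.*"]),
)
```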

@sebastianGehrmann
Contributor

sebastianGehrmann commented Mar 30, 2021

Thanks for all this work! I wanted to chime in wrt task in prediction set.
We already have that information from the name of the predictions, since every prediction file is prefixed with the name of the dataset. In fact, our website already has this file: https://github.com/GEM-benchmark/GEM-benchmark.github.io/blob/main/web/results/eval_config.json which links summarization to xsum etc.

Could we somehow use this file to determine the task? (you will have to add QuestEval here as well btw to get it rendered on the website)
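A sketch of how that lookup could work, using the dataset prefix of a prediction name; the dictionary below is an illustrative shape, not the actual contents of eval_config.json.

```python
# Illustrative mapping only -- the real data lives in eval_config.json
# (linked above) and its exact structure may differ.
DATASET_TO_TASK = {
    "xsum": {"task": "summarization", "language": "en"},
    "totto": {"task": "data_to_text", "language": "en"},
}

def infer_task(prediction_name: str) -> dict:
    """Guess task/language from the dataset prefix of a predictions name."""
    dataset = prediction_name.split("_")[0]  # assumed naming convention
    return DATASET_TO_TASK.get(dataset, {"task": "unknown", "language": "en"})

print(infer_task("xsum_test_baseline"))  # {'task': 'summarization', 'language': 'en'}
```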

In the same vein, @tuetschek, we could get rid of the language tag and move that to the eval_config.json.

I also just noticed that our test outputs don't have a gem_id along the predictions. Since TFDS shuffles files by default, we need to retain those IDs to get the predictions in order and to allow subset evals (see #38)
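A minimal sketch of the kind of gem_id-based alignment this enables; the record fields shown here are assumptions for illustration only.

```python
# Sketch: realign shuffled predictions with references by gem_id, so that
# TFDS shuffling does not matter and subset evaluation stays possible.
references = [
    {"gem_id": "xsum-test-0", "target": "ref 0"},
    {"gem_id": "xsum-test-1", "target": "ref 1"},
]
predictions = [
    {"gem_id": "xsum-test-1", "value": "pred 1"},  # shuffled order
    {"gem_id": "xsum-test-0", "value": "pred 0"},
]

by_id = {p["gem_id"]: p["value"] for p in predictions}
aligned = [(r["target"], by_id[r["gem_id"]]) for r in references]

# Subset eval: restrict scoring to a chosen set of IDs.
subset = {"xsum-test-0"}
subset_pairs = [(r["target"], by_id[r["gem_id"]])
                for r in references if r["gem_id"] in subset]
```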

@padipadou
Contributor Author

  • Unit tests are added.
  • I will also look at the website's repo to see what I should do.
  • From the beginning, I tested my code the way explained in the README, e.g. by passing a particular file (which is therefore unknown from the eval_config point of view) to the metric. In that case, it would be impossible to guess the language / task of the given predictions.
  • OK for the gem_id, but I guess the QuestEval metric is not particularly concerned by this?

Thanks for the quick and clear answers!

…nto questeval

Conflicts:
	README.md
	gem_metrics/__init__.py
@sebastianGehrmann
Contributor

Hi @padipadou, thank you for all the updates! My comment wrt gem_id was more targeted at @tuetschek, since this is functionality we need independently of the metrics.

I think it definitely makes sense to refactor the current setup of the framework to just look up a task in the eval_config, where it can have a specified language + high-level task.

@tuetschek
Collaborator

@padipadou : Thanks, looking good!
@sebastianGehrmann : So we'll have both task and language, but they'd default to whatever is said in the config file? Do you have the config file somewhere? Because that's not used by the metrics code at the moment.

Re. gem_id, that's completely unrelated to this, right?

@sebastianGehrmann
Contributor

Yes, the gem_id argument is unrelated; I just noticed that our test examples do not include it. Hence my initial comment.

Wrt config file - you can find a version at https://github.com/GEM-benchmark/GEM-benchmark.github.io/blob/main/web/results/eval_config.json
It needs to be updated with the newer metrics (Emiel's, NUBIA, and QuestEval), but in the best case we'd use the same file for both the website and the eval framework.

tuetschek merged commit 2693f34 into GEM-benchmark:main on Apr 3, 2021
@tuetschek
Collaborator

Thanks! I'll open a new issue regarding the config file.
