
Questeval #40

Merged (28 commits) on Apr 3, 2021

Conversation

padipadou
Contributor

  • Adding a new metric, QuestEval, which uses the source and/or references to evaluate models.

  • This metric needs to know which task the model is evaluated on; if the task is not specified in the data, QuestEval will fall back to a general model for evaluation, which might not fit well (e.g. textual QA models don't work well on tables). See the sketch after this list.

  • Fixed minor typos in the README.
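A minimal sketch of what a predictions set carrying a task field could look like, and how the metric could fall back to a general model when the field is missing. The field names (`values`, `language`, `task`) and the model identifiers are illustrative assumptions, not the actual GEM schema or QuestEval internals.

```python
import json

# Hypothetical predictions file -- the "task" field is the attribute under
# discussion; the other field names are assumptions for illustration.
predictions = json.loads("""
{
  "language": "en",
  "task": "data_to_text",
  "values": ["The Eagle is a cheap coffee shop near the river."]
}
""")

# Fallback behaviour described above: a task-specific QA model when the
# task is known, otherwise a general textual QA model that may fit poorly
# (e.g. on tables).
QA_MODELS = {
    "data_to_text": "table-qa-model",   # placeholder identifier
    "summarization": "text-qa-model",   # placeholder identifier
}
qa_model = QA_MODELS.get(predictions.get("task"), "general-qa-model")
print(qa_model)  # -> table-qa-model
```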

@tuetschek
Collaborator

@padipadou : Thanks, looks mostly good to me! I'd have two points:

  • It would be good to add tests, now that we have them for the other metrics – would that be possible?
  • I'm not completely sure about adding the "task" attribute to the predictions set. Maybe we want the different versions of QuestEval exposed as different metrics (would there be a situation where you'd want to measure more than one of them?). What does @sebastianGehrmann think?

One small note: I made some changes to run_metrics.py recently and moved most of its code to gem_metrics/__init__.py so that all of GEM metrics can be installed by pip (and you get a script called gem_metrics in your $PATH). This is the probable cause of the merge conflict – I guess I can look into that for the merge.
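For readers following along, a console script like `gem_metrics` is typically exposed through a setuptools entry point; the sketch below shows the general pattern. The `gem_metrics:main` target is an assumption about where the moved code lives, not a quote from the actual setup.py.

```python
# setup.py (sketch): how `pip install` can place a `gem_metrics` script on
# $PATH. The module:function target below is an assumption for illustration.
from setuptools import setup, find_packages

setup(
    name="gem_metrics",
    packages=find_packages(),
    entry_points={
        "console_scripts": [
            "gem_metrics = gem_metrics:main",
        ]
    },
)
```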

@padipadou
Contributor Author

padipadou commented Mar 29, 2021

Hello @tuetschek ,

About unit tests: great, I will add them this week (before or after the merge request is accepted, up to you).
About adding the "task": BERTScore, for instance, requires the language and, depending on it, uses a multilingual model or not, yet it is still the same task. In the same way, QuestEval uses different QA models depending on whether the input is a table or a text, but it is still the same metric. So I think this information will be required at some point anyway.

Tell me what you think! :)
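To make the analogy concrete, a minimal sketch with placeholder model identifiers (not the real models used by BERTScore or QuestEval):

```python
def pick_bertscore_model(language: str) -> str:
    # Same task; the language only decides which underlying model is used.
    return "english-bert" if language == "en" else "multilingual-bert"  # placeholders

def pick_questeval_models(task: str) -> str:
    # Same metric; the task only decides which QA/QG models are used.
    return "table-qa-model" if task == "data_to_text" else "text-qa-model"  # placeholders
```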

@padipadou
Contributor Author

For now:

  • I rebased my branch onto the current main.
  • I modified the setup.py file so it is generic enough to tolerate the package name being different from the repo name (see the sketch after this list).
  • Tests are not added yet.
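A sketch of the idea (not the actual diff), assuming setuptools: discover the package instead of hard-coding a name tied to the repository directory.

```python
# setup.py (sketch): find_packages() locates the `gem_metrics` package (and
# its sub-packages) no matter what the enclosing repository directory is
# called, so the distribution name does not have to match the repo name.
from setuptools import setup, find_packages

setup(
    name="gem_metrics",  # distribution name, independent of the repo dir name
    packages=find_packages(exclude=["tests", "tests.*"]),
)
```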

@sebastianGehrmann
Contributor

sebastianGehrmann commented Mar 30, 2021

Thanks for all this work! I wanted to chime in wrt task in prediction set.
We already have that information from the name of the predictions, since every prediction file is prefixed with the name of the dataset. In fact, our website already has this file: https://github.com/GEM-benchmark/GEM-benchmark.github.io/blob/main/web/results/eval_config.json which links summarization to xsum etc.

Could we somehow use this file to determine the task? (you will have to add QuestEval here as well btw to get it rendered on the website)
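A sketch of how that lookup could work, using the dataset prefix of a prediction name; the dictionary below is an illustrative shape, not the actual contents of eval_config.json.

```python
# Illustrative mapping only -- the real data lives in eval_config.json
# (linked above) and its exact structure may differ.
DATASET_TO_TASK = {
    "xsum": {"task": "summarization", "language": "en"},
    "totto": {"task": "data_to_text", "language": "en"},
}

def infer_task(prediction_name: str) -> dict:
    """Guess task/language from the dataset prefix of a predictions name."""
    dataset = prediction_name.split("_")[0]  # assumed naming convention
    return DATASET_TO_TASK.get(dataset, {"task": "unknown", "language": "en"})

print(infer_task("xsum_test_baseline"))  # {'task': 'summarization', 'language': 'en'}
```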

In the same vein, @tuetschek, we could get rid of the language tag and move that to the eval_config.json.

I also just noticed that our test outputs don't have a gem_id along the predictions. Since TFDS shuffles files by default, we need to retain those IDs to get the predictions in order and to allow subset evals (see #38)
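A minimal sketch of the kind of gem_id-based alignment this enables; the record fields shown here are assumptions for illustration only.

```python
# Sketch: realign shuffled predictions with references by gem_id, so that
# TFDS shuffling does not matter and subset evaluation stays possible.
references = [
    {"gem_id": "xsum-test-0", "target": "ref 0"},
    {"gem_id": "xsum-test-1", "target": "ref 1"},
]
predictions = [
    {"gem_id": "xsum-test-1", "value": "pred 1"},  # shuffled order
    {"gem_id": "xsum-test-0", "value": "pred 0"},
]

by_id = {p["gem_id"]: p["value"] for p in predictions}
aligned = [(r["target"], by_id[r["gem_id"]]) for r in references]

# Subset eval: restrict scoring to a chosen set of IDs.
subset = {"xsum-test-0"}
subset_pairs = [(r["target"], by_id[r["gem_id"]])
                for r in references if r["gem_id"] in subset]
```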

@padipadou
Contributor Author

  • Unit tests are added.
  • I will also look at the website's repo to see what I should do.
  • From the beginning, I tested my code the way explained in the README, e.g. by passing a particular file (which is therefore unknown from the eval_config point of view) to the metric. In that case, it would be impossible to guess the language / task of the given predictions.
  • OK for the gem_id, but I guess the QuestEval metric is not particularly concerned by this?

Thanks for the quick and clear answers!

…nto questeval

Conflicts:
	README.md
	gem_metrics/__init__.py
@sebastianGehrmann
Contributor

Hi @padipadou, thank you for all the updates! My comment wrt gem_id was more targeted at @tuetschek, since this is functionality we need independently of the metrics.

I think it definitely makes sense to refactor the current setup of the framework to just look up a task in the eval_config, where it can have a specified language + high-level task.

@tuetschek
Collaborator

@padipadou : Thanks, looking good!
@sebastianGehrmann : So we'll have both task and language, but they'd default to whatever is said in the config file? Do you have the config file somewhere? Because that's not used by the metrics code at the moment.

Re. gem_id, that's completely unrelated to this, right?

@sebastianGehrmann
Contributor

Yes, the gem_id argument is unrelated; I just noticed that our test examples do not include it. Hence my initial comment.

Wrt config file - you can find a version at https://github.com/GEM-benchmark/GEM-benchmark.github.io/blob/main/web/results/eval_config.json
It needs to be updated with the newer metrics (Emiel's, NUBIA, and QuestEval), but in the best case we'd use the same file for both the website and the eval framework.

tuetschek merged commit 2693f34 into GEM-benchmark:main on Apr 3, 2021
@tuetschek
Collaborator

Thanks! I'll open a new issue regarding the config file.
