[BFCL] Multi Turn Dataset Fix (Base Category) #723

Merged (4 commits) on Oct 30, 2024

Conversation

HuanzhiMao (Collaborator)

This PR fixes ambiguous prompts and several incorrect ground-truth entries in the multi_turn_base category. After this PR, the multi_turn_base entries should be bug-free.

Following #719 and #722, this is part of the effort to thoroughly bug-fix the multi-turn categories. More PRs will follow in the next few days.


Co-authored-by: Charlie Cheng-Jie Ji <55744150+CharlieJCJ@users.noreply.github.com>
Co-authored-by: Fanjia-Yan <78303449+Fanjia-Yan@users.noreply.github.com>
Co-authored-by: VishnuSuresh27 <112032533+VishnuSuresh27@users.noreply.github.com>

HuanzhiMao added the BFCL-Dataset (BFCL Dataset-Related Issue) label on Oct 29, 2024

Fanjia-Yan (Collaborator) left a comment:

LGTM

CharlieJCJ (Collaborator) left a comment:

LGTM

ShishirPatil merged commit a79d891 into ShishirPatil:main on Oct 30, 2024
HuanzhiMao added a commit that referenced this pull request Oct 31, 2024
This PR updates the questions and ground truth for the
`multi_turn_miss_func` and `multi_turn_long_context` categories, since
they are augmented from `multi_turn_base` and the fix for the base
entries was finalized in #723.

Following #719, #722, #723 and #725, this is part of the effort to
thoroughly bug-fix the multi-turn categories. One more PR will follow
for the `multi_turn_miss_param` category fix.

---------

Co-authored-by: Charlie Cheng-Jie Ji
<55744150+CharlieJCJ@users.noreply.github.com>
Co-authored-by: Fanjia-Yan
<78303449+Fanjia-Yan@users.noreply.github.com>
Co-authored-by: VishnuSuresh27
<112032533+VishnuSuresh27@users.noreply.github.com>
HuanzhiMao added a commit that referenced this pull request Oct 31, 2024
This PR updates the questions and ground truth for the
`multi_turn_miss_param` category, since its entries are augmented from
`multi_turn_base` and the fix for the base entries was finalized in
#723.

Following #719, #722, #723, #725 and #728, this is part of the
effort to thoroughly bug-fix the multi-turn categories.

---------

Co-authored-by: Charlie Cheng-Jie Ji
<55744150+CharlieJCJ@users.noreply.github.com>
Co-authored-by: Fanjia-Yan
<78303449+Fanjia-Yan@users.noreply.github.com>
Co-authored-by: VishnuSuresh27
<112032533+VishnuSuresh27@users.noreply.github.com>
HuanzhiMao added a commit that referenced this pull request Oct 31, 2024
In the current metric, for the `multi_turn_miss_func` and
`multi_turn_miss_param` categories, the model is expected to output no
function calls when a turn is missing necessary information (either a
relevant function or parameter). This mirrors the standard for
irrelevance detection in single-turn scenarios. However, multi-turn
interactions introduce additional complexity.

For instance, if the user’s request is "go to the ABC folder and display
content of the XYZ file" but the `cd` function isn’t provided, the model
might reasonably attempt exploratory actions (like calling `pwd` or `ls`
to check its current location) before recognizing that it cannot
complete the task as requested. Ultimately, the model should recognize
that the user’s task is unachievable given the context.

To address this, we've updated the metric: a dummy function,
`flag_task_unachievable`, will now be provided for every multi-turn
entry. If the model determines that one or more tasks are unachievable,
it should explicitly invoke this function. During evaluation, any entry
where the model calls this function will be marked as correct for
irrelevance detection, even if other functions were called beforehand.
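
The rule above can be sketched roughly as follows. This is a minimal
illustration, not the actual BFCL checker: the helper names
`function_name` and `irrelevance_turn_is_correct` are hypothetical, the
call strings are assumed to look like `"ls(a=True)"`, and it is an
assumption that a turn with no function calls at all still counts as
correct after the update.

```python
from typing import List

# Function name taken from the PR description above.
FLAG_FUNCTION = "flag_task_unachievable"


def function_name(call: str) -> str:
    """Return the function name from a call string such as "ls(a=True)"."""
    return call.split("(", 1)[0].strip()


def irrelevance_turn_is_correct(model_calls: List[str]) -> bool:
    """Hypothetical check for a turn that lacks a required function or parameter.

    Assumed rule: the turn passes if the model made no calls at all, or if
    it called the flag function at any point, even after exploratory calls.
    """
    if not model_calls:
        return True
    return any(function_name(call) == FLAG_FUNCTION for call in model_calls)


# Exploratory calls followed by the flag are accepted under this sketch.
print(irrelevance_turn_is_correct(["pwd()", "ls()", "flag_task_unachievable()"]))
# Exploratory calls without the flag are not.
print(irrelevance_turn_is_correct(["pwd()", "ls()"]))
```

In the `cd` example, a model that calls `pwd` and `ls` and then invokes
`flag_task_unachievable` would pass under this sketch, while one that
only keeps exploring would not.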

In addition, this PR addresses #664. The execution result for each turn
(for both the model and the ground truth) is also included as part of
the score output files to help with debugging.

Following #719, #722 and #723, this is part of the effort to
thoroughly bug-fix the multi-turn categories. More PRs will follow
in the next few days.
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
…l#728)
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
…irPatil#725)
HuanzhiMao added a commit that referenced this pull request Nov 19, 2024
This PR updates the leaderboard to reflect the score changes caused by
the following merged PRs:

1. #719
2. #722
3. #723
4. #728 
5. #732
6. #725
7. #712
8. #733
9. #720 
10. #760 
11. #761 
12. #767
Labels
BFCL-Dataset BFCL Dataset-Related Issue
4 participants