[BFCL] Multi Turn Dataset Fix (Base Category) #723

HuanzhiMao · 2024-10-29T06:53:28Z

This PR fixes the ambiguous prompts and several incorrect ground truths in the `multi_turn_base` category. After this PR, the `multi_turn_base` entries should be bug-free.

Following #719 and #722, this is part of the effort to thoroughly fix bugs in the multi-turn categories. More PRs will follow in the next few days.

Co-authored-by: Charlie Cheng-Jie Ji <55744150+CharlieJCJ@users.noreply.github.com>
Co-authored-by: Fanjia-Yan <78303449+Fanjia-Yan@users.noreply.github.com>
Co-authored-by: VishnuSuresh27 <112032533+VishnuSuresh27@users.noreply.github.com>

Fanjia-Yan

LGTM

CharlieJCJ

LGTM

This PR updates the questions and ground truths for the `multi_turn_miss_func` and `multi_turn_long_context` categories, since they are augmented from `multi_turn_base` and the fix for the base entries was finalized in #723.

Following #719, #722, #723 and #725, this is part of the effort to thoroughly fix bugs in the multi-turn categories. One more PR will follow for the `multi_turn_miss_param` category fix.

Co-authored-by: Charlie Cheng-Jie Ji <55744150+CharlieJCJ@users.noreply.github.com>
Co-authored-by: Fanjia-Yan <78303449+Fanjia-Yan@users.noreply.github.com>
Co-authored-by: VishnuSuresh27 <112032533+VishnuSuresh27@users.noreply.github.com>

This PR updates the questions and ground truths for the `multi_turn_miss_param` category, since they are augmented from `multi_turn_base` and the fix for the base entries was finalized in #723.

Following #719, #722, #723, #725 and #728, this is part of the effort to thoroughly fix bugs in the multi-turn categories.

Co-authored-by: Charlie Cheng-Jie Ji <55744150+CharlieJCJ@users.noreply.github.com>
Co-authored-by: Fanjia-Yan <78303449+Fanjia-Yan@users.noreply.github.com>
Co-authored-by: VishnuSuresh27 <112032533+VishnuSuresh27@users.noreply.github.com>

In the current metric, for the `multi_turn_miss_func` and `multi_turn_miss_param` categories, the model is expected to output no function calls when a turn is missing necessary information (either a relevant function or a parameter). This mirrors the standard for irrelevance detection in single-turn scenarios.

However, multi-turn interactions introduce additional complexity. For instance, if the user’s request is "go to the ABC folder and display content of the XYZ file" but the `cd` function isn’t provided, the model might reasonably attempt exploratory actions (like calling `pwd` or `ls` to check its current location) before recognizing that it cannot complete the task as requested. Ultimately, the model should recognize that the user’s task is unachievable given the context.

To address this, we've updated the metric: a dummy function, `flag_task_unachievable`, will now be provided for every multi-turn entry. If the model determines that one or more tasks are unachievable, it should explicitly invoke this function. During evaluation, any entry where the model calls this function will be marked as correct for irrelevance detection, even if other functions were called beforehand.

In addition, this PR addresses #664. The execution result for each turn (for both the model and the ground truth) is now included in the score output files to help with debugging.

Following #719, #722 and #723, this is part of the effort to thoroughly fix bugs in the multi-turn categories. More PRs will follow in the next few days.
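The updated rule can be illustrated with a minimal Python sketch. This is not the BFCL checker itself: `FunctionCall` and `irrelevance_turn_is_correct` are hypothetical names introduced for illustration, and only `flag_task_unachievable` comes from the description above. The sketch also assumes that a turn with no function calls at all still counts as correct, as it did under the earlier metric; the description does not state this explicitly.

```python
from dataclasses import dataclass, field

# Name of the dummy function, taken from the PR description.
FLAG_FUNCTION = "flag_task_unachievable"


@dataclass
class FunctionCall:
    """Hypothetical container for one function call emitted by the model."""
    name: str
    args: dict = field(default_factory=dict)


def irrelevance_turn_is_correct(calls: list[FunctionCall]) -> bool:
    """Decide whether a turn passes irrelevance detection.

    Assumption: an empty call list is still accepted (the earlier metric).
    Otherwise the turn passes only if the model flagged the task as
    unachievable, regardless of any exploratory calls made beforehand.
    """
    if not calls:
        return True
    return any(call.name == FLAG_FUNCTION for call in calls)


# Exploratory calls followed by the flag: accepted under the updated rule.
explored_then_flagged = [
    FunctionCall("pwd"),
    FunctionCall("ls"),
    FunctionCall(FLAG_FUNCTION),
]

# Exploratory calls without the flag: rejected.
explored_only = [FunctionCall("pwd"), FunctionCall("ls")]

print(irrelevance_turn_is_correct(explored_then_flagged))
print(irrelevance_turn_is_correct(explored_only))
```

In the `cd` example above, a model that calls `pwd` and `ls` and then invokes `flag_task_unachievable` would pass under this sketch, while one that only explores and never flags the task would not.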


This PR updates the leaderboard to reflect the score changes caused by the following merged PRs:

1. #719
2. #722
3. #723
4. #728
5. #732
6. #725
7. #712
8. #733
9. #720
10. #760
11. #761
12. #767

HuanzhiMao added 3 commits October 28, 2024 23:49

update function source code

1f7a2db

recompile func doc

0b33f8d

update base category question and ground truth

e81adf6

HuanzhiMao added the BFCL-Dataset BFCL Dataset-Related Issue label Oct 29, 2024

update pyproject.toml to include mpmath package

6c90f3f

Fanjia-Yan approved these changes Oct 29, 2024


Fanjia-Yan mentioned this pull request Oct 29, 2024

[BFCL] Multi Turn Pipeline Robustness Patch #724

Merged

CharlieJCJ approved these changes Oct 30, 2024


This was referenced Oct 30, 2024

[BFCL] Update Eval Metric for Multi Turn Irrelevance Scenarios #725

Merged

[BFCL] Multi Turn Dataset Fix (Miss Func & Long Context) #728

Merged

ShishirPatil merged commit a79d891 into ShishirPatil:main Oct 30, 2024

HuanzhiMao mentioned this pull request Oct 31, 2024

[BFCL] Multi Turn Dataset Fix (Miss Param) #732

Merged

HuanzhiMao mentioned this pull request Nov 9, 2024

[BFCL] Leaderboard Update, 11/17/2024 #748

Merged
