Add Cohere Command R7B, replace older Command R+ handler #835

harry-cohere · 2024-12-16T16:49:26Z

Greetings! It's been a while since our last contribution to BFCL, and the new versions and recent improvements are great to see.

This PR adds our latest model, released on Friday. I've also replaced our older models because it simplifies the code within the BFCL framework.

When I run this PR against 3245d9 I get the following results (without REST category sanity checks):

|Name |Overall Acc|Model      |Model Link                         |Cost ($ Per 1k Function Calls)|Latency Mean (s)|Latency Standard Deviation (s)|Latency 95th Percentile (s)|Non-Live AST Acc|Non-Live Simple AST|Non-Live Multiple AST|Non-Live Parallel AST|Non-Live Parallel Multiple AST|Non-Live Exec Acc|Non-Live Simple Exec|Non-Live Multiple Exec|Non-Live Parallel Exec|Non-Live Parallel Multiple Exec|Live Acc|Live Simple AST|Live Multiple AST|Live Parallel AST|Live Parallel Multiple AST|Multi Turn Acc|Multi Turn Base|Multi Turn Miss Func|Multi Turn Miss Param|Multi Turn Long Context|Relevance Detection|Irrelevance Detection|Organization|License     |
|-----|-----------|-----------|-----------------------------------|------------------------------|----------------|------------------------------|---------------------------|----------------|-------------------|---------------------|---------------------|------------------------------|-----------------|--------------------|----------------------|----------------------|-------------------------------|--------|---------------|-----------------|-----------------|--------------------------|--------------|---------------|--------------------|---------------------|-----------------------|-------------------|---------------------|------------|------------|
|Run 1|53.37%     |Command R7B|https://cohere.com/blog/command-r7b|0.1                           |5.22            |9.75                          |12.64                      |81.00%          |68.00%             |91.50%               |84.50%               |80.00%                        |81.21%           |84.86%              |90.00%                |80.00%                |70.00%                         |74.23%  |63.95%         |69.71%           |50.00%           |70.83%                    |5.12%         |7.00%          |1.00%               |6.50%                |6.00%                  |61.11%             |80.68%               |Cohere      |cc-by-nc-4.0|
|Run 2|53.59%     |Command R7B|https://cohere.com/blog/command-r7b|0.1                           |3.19            |10.43                         |6.29                       |81.21%          |68.33%             |91.50%               |84.50%               |80.50%                        |82.46%           |85.36%              |92.00%                |80.00%                |72.50%                         |74.28%  |63.18%         |69.71%           |50.00%           |83.33%                    |5.12%         |7.50%          |1.50%               |6.00%                |5.50%                  |61.11%             |80.47%               |Cohere      |cc-by-nc-4.0|

I notice some variance between runs but haven't concluded where this is coming from - I'm confident, however, that the model regularly scores the higher score.

Thanks in advance for reviewing this PR and I look forward to seeing us on the leaderboard!

Co-authored-by: yxuansu <suyx1201@163.com> Co-authored-by: Jozef Mokry <jozef@cohere.com>

berkeley-function-call-leaderboard/bfcl/model_handler/utils.py

berkeley-function-call-leaderboard/bfcl/eval_checker/model_metadata.py

Fanjia-Yan · 2024-12-16T23:24:20Z

Hi Cohere,

Congrats on the release and thanks for contributing the really nice tool calling interface to interact with command models! We have one question before onboarding r7b on the leaderboard:

We typically don't deprecate a model unless the next generation alternative releases. With that, would v2 client be able to support full tool usage capability of Command R+ and Command R now or in the future? We do like to see models of different sizes on leaderboard for better comparison :) Thank you.

cc: @HuanzhiMao @ShishirPatil

berkeley-function-call-leaderboard/SUPPORTED_MODELS.md

berkeley-function-call-leaderboard/bfcl/eval_checker/model_metadata.py

harry-cohere · 2024-12-17T15:38:15Z

Thanks @Fanjia-Yan ! All of our models support the newer version, so I've added command-r-plus back and it now uses the updated handler. When I run it against 3245d9 I get this score:

|Overall Acc|Model              |Model Link                                           |Cost ($ Per 1k Function Calls)|Latency Mean (s)|Latency Standard Deviation (s)|Latency 95th Percentile (s)|Non-Live AST Acc|Non-Live Simple AST|Non-Live Multiple AST|Non-Live Parallel AST|Non-Live Parallel Multiple AST|Non-Live Exec Acc|Non-Live Simple Exec|Non-Live Multiple Exec|Non-Live Parallel Exec|Non-Live Parallel Multiple Exec|Live Acc|Live Simple AST|Live Multiple AST|Live Parallel AST|Live Parallel Multiple AST|Multi Turn Acc|Multi Turn Base|Multi Turn Miss Func|Multi Turn Miss Param|Multi Turn Long Context|Relevance Detection|Irrelevance Detection|Organization |License     |
|-----------|-------------------|-----------------------------------------------------|------------------------------|----------------|------------------------------|---------------------------|----------------|-------------------|---------------------|---------------------|------------------------------|-----------------|--------------------|----------------------|----------------------|-------------------------------|--------|---------------|-----------------|-----------------|--------------------------|--------------|---------------|--------------------|---------------------|-----------------------|-------------------|---------------------|-------------|------------|
|51.17%     |Command-R-Plus (FC)|https://txt.cohere.com/command-r-plus-microsoft-azure|7.71                          |2.75            |9.26                          |4.49                       |76.56%          |72.25%             |89.50%               |82.00%               |62.50%                        |79.91%           |90.14%              |90.00%                |82.00%                |57.50%                         |63.26%  |73.64%         |71.79%           |43.75%           |58.33%                    |15.00%        |23.00%         |10.00%              |11.50%               |15.50%                 |94.44%             |50.57%               |Cohere For AI|cc-by-nc-4.0|

This is +5.1% from the leaderboard score, and the increase mainly comes from the multihop category.

Looking into the increase I think part of the reason is that there are bugs in the replaced handler which lowered the overall score. For example the base handler adds a holdout function message with role/content, but the replaced handler expects them as role/message, which would explain the 10% increase in the missed function category. There may be other sources of the increase, too.

Another change is the cost per 1K functions - I see an increase of around $2.30 per 1K functions. Is it possible that the price is increased because the multi turn categories now succeed in more cases?

Fanjia-Yan · 2024-12-18T01:22:39Z

Thank you Harry! That addresses all my concerns. And yes, higher cost per 1K functions typically means the model attempts more and have less early stop. @HuanzhiMao will review code change and merge that in.

ShishirPatil

LGTM. Thank you for the PR and congratulations on the new model launch @harry-cohere

This PR updates the leaderboard to reflect the change in score due to the following PR merge: 1. #822 2. #826 3. #829 4. #832 5. #837 6. #840 7. #835 8. #842 9. #843 10. #846 11. #838 12. #847 13. #855 14. #857 Models were evaluated using checkpoint commit 0cea216.

Add Cohere Command R7B

524b8fd

Co-authored-by: yxuansu <suyx1201@163.com> Co-authored-by: Jozef Mokry <jozef@cohere.com>

harry-cohere commented Dec 16, 2024

View reviewed changes

berkeley-function-call-leaderboard/bfcl/model_handler/utils.py Show resolved Hide resolved

harry-cohere commented Dec 16, 2024

View reviewed changes

berkeley-function-call-leaderboard/bfcl/eval_checker/model_metadata.py Show resolved Hide resolved

Re-add command-r-plus to updated handler

d2ec3fd

harry-cohere commented Dec 17, 2024

View reviewed changes

berkeley-function-call-leaderboard/SUPPORTED_MODELS.md Show resolved Hide resolved

harry-cohere commented Dec 17, 2024

View reviewed changes

berkeley-function-call-leaderboard/bfcl/eval_checker/model_metadata.py Show resolved Hide resolved

harry-cohere changed the title ~~Add Cohere Command R7B~~ Add Cohere Command R7B, replace older Command R+ handler Dec 17, 2024

Merge branch 'main' into harry/add-command-r7b

3fc0403

HuanzhiMao added the BFCL-New Model Add New Model to BFCL label Dec 19, 2024

remove USE_COHERE_OPTIMIZATION from .env

6720b3b

ShishirPatil approved these changes Dec 19, 2024

View reviewed changes

ShishirPatil merged commit d1ac6ba into ShishirPatil:main Dec 19, 2024

HuanzhiMao mentioned this pull request Dec 19, 2024

[BFCL] Leaderboard Update - 2024/12/29 (Checkpoint 0cea216) #845

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Cohere Command R7B, replace older Command R+ handler #835

Add Cohere Command R7B, replace older Command R+ handler #835

harry-cohere commented Dec 16, 2024 •

edited

Loading

Fanjia-Yan commented Dec 16, 2024

harry-cohere commented Dec 17, 2024 •

edited

Loading

Fanjia-Yan commented Dec 18, 2024

ShishirPatil left a comment

Add Cohere Command R7B, replace older Command R+ handler #835

Add Cohere Command R7B, replace older Command R+ handler #835

Conversation

harry-cohere commented Dec 16, 2024 • edited Loading

Fanjia-Yan commented Dec 16, 2024

harry-cohere commented Dec 17, 2024 • edited Loading

Fanjia-Yan commented Dec 18, 2024

ShishirPatil left a comment

Choose a reason for hiding this comment

harry-cohere commented Dec 16, 2024 •

edited

Loading

harry-cohere commented Dec 17, 2024 •

edited

Loading