Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Cohere Command R7B, replace older Command R+ handler #835

Merged
merged 4 commits into from
Dec 19, 2024

Conversation

harry-cohere
Copy link
Contributor

@harry-cohere harry-cohere commented Dec 16, 2024

Greetings! It's been a while since our last contribution to BFCL, and the new versions and recent improvements are great to see.

This PR adds our latest model, released on Friday. I've also replaced our older models because it simplifies the code within the BFCL framework.

When I run this PR against 3245d9 I get the following results (without REST category sanity checks):

|Name |Overall Acc|Model      |Model Link                         |Cost ($ Per 1k Function Calls)|Latency Mean (s)|Latency Standard Deviation (s)|Latency 95th Percentile (s)|Non-Live AST Acc|Non-Live Simple AST|Non-Live Multiple AST|Non-Live Parallel AST|Non-Live Parallel Multiple AST|Non-Live Exec Acc|Non-Live Simple Exec|Non-Live Multiple Exec|Non-Live Parallel Exec|Non-Live Parallel Multiple Exec|Live Acc|Live Simple AST|Live Multiple AST|Live Parallel AST|Live Parallel Multiple AST|Multi Turn Acc|Multi Turn Base|Multi Turn Miss Func|Multi Turn Miss Param|Multi Turn Long Context|Relevance Detection|Irrelevance Detection|Organization|License     |
|-----|-----------|-----------|-----------------------------------|------------------------------|----------------|------------------------------|---------------------------|----------------|-------------------|---------------------|---------------------|------------------------------|-----------------|--------------------|----------------------|----------------------|-------------------------------|--------|---------------|-----------------|-----------------|--------------------------|--------------|---------------|--------------------|---------------------|-----------------------|-------------------|---------------------|------------|------------|
|Run 1|53.37%     |Command R7B|https://cohere.com/blog/command-r7b|0.1                           |5.22            |9.75                          |12.64                      |81.00%          |68.00%             |91.50%               |84.50%               |80.00%                        |81.21%           |84.86%              |90.00%                |80.00%                |70.00%                         |74.23%  |63.95%         |69.71%           |50.00%           |70.83%                    |5.12%         |7.00%          |1.00%               |6.50%                |6.00%                  |61.11%             |80.68%               |Cohere      |cc-by-nc-4.0|
|Run 2|53.59%     |Command R7B|https://cohere.com/blog/command-r7b|0.1                           |3.19            |10.43                         |6.29                       |81.21%          |68.33%             |91.50%               |84.50%               |80.50%                        |82.46%           |85.36%              |92.00%                |80.00%                |72.50%                         |74.28%  |63.18%         |69.71%           |50.00%           |83.33%                    |5.12%         |7.50%          |1.50%               |6.00%                |5.50%                  |61.11%             |80.47%               |Cohere      |cc-by-nc-4.0|

I notice some variance between runs but haven't concluded where this is coming from - I'm confident, however, that the model regularly scores the higher score.

Thanks in advance for reviewing this PR and I look forward to seeing us on the leaderboard!

Co-authored-by: yxuansu <suyx1201@163.com>
Co-authored-by: Jozef Mokry <jozef@cohere.com>
@Fanjia-Yan
Copy link
Collaborator

Hi Cohere,

Congrats on the release and thanks for contributing the really nice tool calling interface to interact with command models! We have one question before onboarding r7b on the leaderboard:

We typically don't deprecate a model unless the next generation alternative releases. With that, would v2 client be able to support full tool usage capability of Command R+ and Command R now or in the future? We do like to see models of different sizes on leaderboard for better comparison :) Thank you.

cc: @HuanzhiMao @ShishirPatil

@harry-cohere
Copy link
Contributor Author

harry-cohere commented Dec 17, 2024

Thanks @Fanjia-Yan ! All of our models support the newer version, so I've added command-r-plus back and it now uses the updated handler. When I run it against 3245d9 I get this score:

|Overall Acc|Model              |Model Link                                           |Cost ($ Per 1k Function Calls)|Latency Mean (s)|Latency Standard Deviation (s)|Latency 95th Percentile (s)|Non-Live AST Acc|Non-Live Simple AST|Non-Live Multiple AST|Non-Live Parallel AST|Non-Live Parallel Multiple AST|Non-Live Exec Acc|Non-Live Simple Exec|Non-Live Multiple Exec|Non-Live Parallel Exec|Non-Live Parallel Multiple Exec|Live Acc|Live Simple AST|Live Multiple AST|Live Parallel AST|Live Parallel Multiple AST|Multi Turn Acc|Multi Turn Base|Multi Turn Miss Func|Multi Turn Miss Param|Multi Turn Long Context|Relevance Detection|Irrelevance Detection|Organization |License     |
|-----------|-------------------|-----------------------------------------------------|------------------------------|----------------|------------------------------|---------------------------|----------------|-------------------|---------------------|---------------------|------------------------------|-----------------|--------------------|----------------------|----------------------|-------------------------------|--------|---------------|-----------------|-----------------|--------------------------|--------------|---------------|--------------------|---------------------|-----------------------|-------------------|---------------------|-------------|------------|
|51.17%     |Command-R-Plus (FC)|https://txt.cohere.com/command-r-plus-microsoft-azure|7.71                          |2.75            |9.26                          |4.49                       |76.56%          |72.25%             |89.50%               |82.00%               |62.50%                        |79.91%           |90.14%              |90.00%                |82.00%                |57.50%                         |63.26%  |73.64%         |71.79%           |43.75%           |58.33%                    |15.00%        |23.00%         |10.00%              |11.50%               |15.50%                 |94.44%             |50.57%               |Cohere For AI|cc-by-nc-4.0|

This is +5.1% from the leaderboard score, and the increase mainly comes from the multihop category.

Looking into the increase I think part of the reason is that there are bugs in the replaced handler which lowered the overall score. For example the base handler adds a holdout function message with role/content, but the replaced handler expects them as role/message, which would explain the 10% increase in the missed function category. There may be other sources of the increase, too.

Another change is the cost per 1K functions - I see an increase of around $2.30 per 1K functions. Is it possible that the price is increased because the multi turn categories now succeed in more cases?

@harry-cohere harry-cohere changed the title Add Cohere Command R7B Add Cohere Command R7B, replace older Command R+ handler Dec 17, 2024
@Fanjia-Yan
Copy link
Collaborator

Thank you Harry! That addresses all my concerns. And yes, higher cost per 1K functions typically means the model attempts more and have less early stop. @HuanzhiMao will review code change and merge that in.

@HuanzhiMao HuanzhiMao added the BFCL-New Model Add New Model to BFCL label Dec 19, 2024
Copy link
Owner

@ShishirPatil ShishirPatil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for the PR and congratulations on the new model launch @harry-cohere

@ShishirPatil ShishirPatil merged commit d1ac6ba into ShishirPatil:main Dec 19, 2024
HuanzhiMao added a commit that referenced this pull request Dec 31, 2024
This PR updates the leaderboard to reflect the change in score due to
the following PR merge:

1. #822 
2. #826 
3. #829 
4. #832 
5. #837 
6. #840 
7. #835 
8. #842 
9.  #843 
10. #846 
11. #838 
12. #847 
13. #855 
14. #857 

Models were evaluated using checkpoint commit 0cea216.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFCL-New Model Add New Model to BFCL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants