Announcing function calling and JSON mode #638

Open
1 task
irthomasthomas opened this issue Feb 27, 2024 · 1 comment
Labels
  • AI-Chatbots: Topics related to advanced chatbot platforms integrating multiple AI models
  • data-validation: Validating data structures and formats
  • llm-function-calling: Function Calling with Large Language Models
  • Models: LLM and ML model repos and links
  • New-Label: Choose this option if the existing labels are insufficient to describe the content accurately
  • programming-languages: Topics related to programming languages and their features.

Comments

@irthomasthomas
Owner

Announcing function calling and JSON mode

DESCRIPTION:
Announcing function calling and JSON mode
JANUARY 31, 2024・BY TOGETHER AI
We are excited to introduce JSON mode & function calling on Together Inference! They are designed to give you more flexibility and control over your interactions with LLMs. We currently support these features in Mixtral, Mistral, and CodeLlama, with more coming soon. In this post, we'll introduce both features and walk through how to use them via the Together API!

Introduction to JSON mode and function calling

While both JSON mode and function calling can enhance your interaction with LLMs, it's important to understand that they are not interchangeable — they serve different purposes and offer unique benefits. Specifically:

  • JSON mode allows you to specify a JSON schema that the LLM will use to format its output. This means you can dictate the format and data types of the response, leading to more structured and predictable output that suits your specific needs.

  • Function calling enables LLMs to intelligently output a JSON object containing arguments for external functions that you define. This is particularly useful when there is a need for real-time data access, such as weather updates, product information, or stock market data, or when you want the LLM to be aware of certain functions you've defined. It also lets the LLM work out what information it needs to gather from a user when it decides a function should be called. Our endpoint ensures that these function calls align with the prescribed function schema, incorporating the necessary arguments with the appropriate data types.

JSON Mode

With JSON mode, you can specify a schema for the output of the LLM. While the OpenAI API does not natively let you specify a JSON schema, we augmented the response_format argument with a schema field. When a schema is passed in, we constrain the model to generate output that conforms to it.

Here's an example of how you can use JSON mode with Mixtral:

import os
import json
import openai
from pydantic import BaseModel, Field

# Create client
client = openai.OpenAI(
    base_url = "https://api.together.xyz/v1",
    api_key = os.environ['TOGETHER_API_KEY'],
)

# Define the schema for the output.
class User(BaseModel):
    name: str = Field(description="user name")
    address: str = Field(description="address")
    
# Generate
chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    response_format={
        "type": "json_object", 
        "schema": User.model_json_schema()
    },
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers in JSON."},
        {"role": "user", "content": "Create a user named Alice, who lives in 42, Wonderland Avenue."}
    ],
)

created_user = json.loads(chat_completion.choices[0].message.content)
print(json.dumps(created_user, indent=2))

In this example, we define a schema for a User object that contains their name and address. The LLM then generates a response that matches this schema, providing a structured JSON object that we can use directly in our application in a deterministic way.

The expected output of this example is:

{
  "address": "42, Wonderland Avenue",
  "name": "Alice"
}

More Examples:

  • Array and Optional argument
  • Nested data types
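
The linked examples aren't reproduced in this post, but as a rough illustration of the same ideas (a hypothetical schema, not taken from the Together docs), an array field, an optional field, and a nested model can all be expressed in Pydantic and passed through response_format in exactly the same way:

from typing import List, Optional
import os
import json
import openai
from pydantic import BaseModel, Field

client = openai.OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

# Nested model used inside the top-level schema.
class Address(BaseModel):
    street: str = Field(description="street address")
    city: str = Field(description="city")

class User(BaseModel):
    name: str = Field(description="user name")
    nicknames: List[str] = Field(description="known nicknames")
    address: Optional[Address] = Field(default=None, description="home address, if known")

chat_completion = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    response_format={"type": "json_object", "schema": User.model_json_schema()},
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers in JSON."},
        {"role": "user", "content": "Create a user named Alice, nicknamed Ally, who lives on 42 Wonderland Avenue, Oz."},
    ],
)

print(json.dumps(json.loads(chat_completion.choices[0].message.content), indent=2))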

For more detailed information, check out our documentation on JSON mode.

Function Calling

With function calling, the LLM outputs a JSON object containing arguments for external functions that you define. After the functions are defined, the LLM intelligently determines whether a function needs to be invoked and, if so, suggests the appropriate one with the correct parameters as a JSON object. You can then execute the call in your application and relay the result back to the LLM so it can continue working.

Let's illustrate this process with a simple example: creating a chatbot that has access to weather data. The function is defined in tools:

import os
import json
import openai

# Create client
client = openai.OpenAI(
    base_url = "https://api.together.xyz/v1",
    api_key = os.environ['TOGETHER_API_KEY'],
)

# Define function(s)
tools = [
  {
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA"
          },
          "unit": {
            "type": "string",
            "enum": [
              "celsius",
              "fahrenheit"
            ]
          }
        }
      }
    }
  }
]
    
# Generate
response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the current temperature of New York?"}
    ],
    tools=tools,
    tool_choice="auto",
)

print(json.dumps(response.choices[0].message.dict()['tool_calls'], indent=2))

In this example, we define an external function that gets the current weather in a given location and pass it in the chat completion request. The model then generates a response that includes a call to this function with the appropriate arguments for the requested location. The expected output is:

[
  {
    "id": "...",
    "function": {
      "arguments": "{\"location\":\"New York\",\"unit\":\"fahrenheit\"}",
      "name": "get_current_weather"
    },
    "type": "function"
  }
]
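
The model only suggests the call; it does not execute it. A minimal continuation sketch, assuming get_current_weather is a stand-in you implement yourself (e.g. against a weather API), is to run the function and relay the result back as a tool message so the model can answer in natural language:

# Stand-in implementation; a real version would query a weather service.
def get_current_weather(location, unit="fahrenheit"):
    return json.dumps({"location": location, "temperature": "57", "unit": unit})

tool_call = response.choices[0].message.tool_calls[0]
arguments = json.loads(tool_call.function.arguments)
result = get_current_weather(**arguments)

# Relay the function result back to the model as a tool message.
follow_up = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the current temperature of New York?"},
        response.choices[0].message,  # the assistant turn containing the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "name": tool_call.function.name, "content": result},
    ],
    tools=tools,
)
print(follow_up.choices[0].message.content)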

More Examples:

  • Parallel function calling
  • No function calling
  • Multi-turn example

For more detailed information, check out our documentation on function calling.

Conclusion

We believe that JSON mode and function calling are a significant step forward, bringing a new level of versatility and functionality to AI applications. By enabling more structured interactions with the model and allowing for specific types of outputs and behaviors, we're confident they will be valuable tools for developers.

We can't wait to see what you build on Together AI! For more info, check out our function calling and JSON mode docs.

Suggested labels

{'label-name': 'JSON-structure', 'label-description': 'Describes JSON schema usage and generation for structured data output in AI interactions.', 'gh-repo': 'knowledge-repo', 'confidence': 53.09}

@irthomasthomas added the AI-Chatbots, data-validation, llm-function-calling, Models, New-Label and programming-languages labels on Feb 27, 2024
@irthomasthomas
Owner Author

Related issues

#129: Few-shot and function calling - API - OpenAI Developer Forum

### Details

Similarity score: 0.92

- [ ] [Few-shot and function calling - API - OpenAI Developer Forum](https://community.openai.com/t/few-shot-and-function-calling/265908/10)

The thing to understand here is that function calling introduced a new role for the chat prompt messages ("role": "function"). To use few-shot examples with chat model prompts you provide a series of alternating (possibly 'fake') messages that show how the assistant did / should respond to a given user input. With function calling the principle is the same but rather than providing a series of alternating user-assistant example messages, you provide alternating user-function messages.

e.g.

schema = {
    "type": "object",
    "properties": {
        "object_type": {"type": "string"},
        "geometry": {
            "type": "array",
            "items": {
                "type": "number"
            }
        }
    },
    "required": ["object_type", "geometry"]
}

example_response_1 = "{\"object_type\": \"point\", \"geometry\": [2.3, 1.0]}"
example_response_2 = "{\"object_type\": \"line\", \"geometry\": [[1.0, 2.0], [3.0, 4.0]]}"

few_shot_function_calling_example = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {"role": "system", "content": "You are a system for returning geometric objects in JSON."},
        {"role": "user", "content": "give me a point"},
        {"role": "function", "name": "example_func", "content": example_response_1},
        {"role": "user", "content": "give me a line"},
        {"role": "function", "name": "example_func", "content": example_response_2},
        {"role": "user", "content": "give me a polygon"}
    ],
    functions=[{"name": "example_func", "parameters": schema}],
    function_call={"name": "example_func"},
    temperature=0
)

print(few_shot_function_calling_example.choices[0].message)

{
  "content": null,
  "function_call": {
    "arguments": "{\"object_type\": \"polygon\", \"geometry\": [[0, 0], [0, 5], [5, 5], [5, 0]]}",
    "name": "example_func"
  },
  "role": "assistant"
}

#309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"

### Details

Similarity score: 0.88

- [ ] [openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"](https://github.com/openai/human-eval)

HumanEval: Hand-Written Evaluation Set

This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".

Installation

Make sure to use python 3.7 or later:

$ conda create -n codex python=3.7
$ conda activate codex

Check out and install this repository:

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval

Usage

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so:

{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.

Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.

from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
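
generate_one_completion is left to the reader; one minimal sketch, assuming an OpenAI-compatible chat endpoint (the Together API used earlier in this issue) and a code model name chosen purely for illustration, might look like:

import os
import openai

client = openai.OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

def generate_one_completion(prompt: str) -> str:
    # Ask the model to continue the HumanEval prompt; the model name and
    # prompting strategy are assumptions for illustration, not part of the harness.
    response = client.chat.completions.create(
        model="codellama/CodeLlama-34b-Instruct-hf",
        messages=[
            {"role": "system", "content": "Complete the following Python function. Reply with code only, without repeating the prompt."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=512,
        temperature=0.2,
    )
    return response.choices[0].message.content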
To evaluate the samples, run

$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
This script provides more fine-grained information in a new file ending in <input_path>_results.jsonl. Each row now contains whether the completion passed along with the execution result which is one of "passed", "timed out", or "failed".

As a quick sanity-check, the example samples should yield 0.5 pass@1.

$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00, 1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}
Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass --k=. For other options, see

$ evaluate_functional_correctness --help
However, we recommend that you use the default values for the rest.

Known Issues

While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again.

malloc: can't allocate region
Citation

Please cite using the following bibtex entry:

@Article{chen2021codex,
  title={Evaluating Large Language Models Trained on Code},
  author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
  year={2021},
  eprint={2107.03374},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Suggested labels

{ "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }

#418: openchat/openchat-3.5-1210 · Hugging Face

### Details

Similarity score: 0.87

- [ ] [openchat/openchat-3.5-1210 · Hugging Face](https://huggingface.co/openchat/openchat-3.5-1210#conversation-templates)

Using the OpenChat Model

We highly recommend installing the OpenChat package and using the OpenChat OpenAI-compatible API server for an optimal experience. The server is optimized for high-throughput deployment using vLLM and can run on a consumer GPU with 24GB RAM.

  • Installation Guide: Follow the installation guide in our repository.

  • Serving: Use the OpenChat OpenAI-compatible API server by running the serving command from the table below. To enable tensor parallelism, append --tensor-parallel-size N to the serving command.

    Model: OpenChat 3.5 1210 | Size: 7B | Context: 8192
    Serving: python -m ochat.serving.openai_api_server --model openchat/openchat-3.5-1210 --engine-use-ray --worker-use-ray
  • API Usage: Once started, the server listens at localhost:18888 for requests and is compatible with the OpenAI ChatCompletion API specifications. Here's an example request (a Python-client equivalent is sketched after this list):

    curl http://localhost:18888/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "openchat_3.5",
            "messages": [{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}]
          }'
  • Web UI: Use the OpenChat Web UI for a user-friendly experience.
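
Since the server speaks the OpenAI ChatCompletion protocol, the API Usage request above can also be made from Python with the openai client. A minimal sketch, assuming a locally started server (the api_key value is a placeholder, since no key is required unless --api-keys is set):

import openai

# Point the standard OpenAI client at the local OpenChat server.
client = openai.OpenAI(base_url="http://localhost:18888/v1", api_key="none")

response = client.chat.completions.create(
    model="openchat_3.5",
    messages=[{"role": "user", "content": "You are a large language model named OpenChat. Write a poem to describe yourself"}],
)
print(response.choices[0].message.content)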

Online Deployment

If you want to deploy the server as an online service, use the following options:

  • --api-keys sk-KEY1 sk-KEY2 ... to specify allowed API keys
  • --disable-log-requests --disable-log-stats --log-file openchat.log for logging only to a file.

For security purposes, we recommend using an HTTPS gateway in front of the server.

Mathematical Reasoning Mode

The OpenChat model also supports mathematical reasoning mode. To use this mode, include condition: "Math Correct" in your request.

```bash
curl http://localhost:18888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openchat_3.5",
        "condition": "Math Correct",
        "messages": [{"role": "user", "content": "10.3 − 7988.8133 = "}]
      }'
```
Conversation Templates

We provide several pre-built conversation templates to help you get started.

  • Default Mode (GPT4 Correct):

    GPT4 Correct User: Hello<|end_of_turn|>
    GPT4 Correct Assistant: Hi<|end_of_turn|>
    GPT4 Correct User: How are you today?<|end_of_turn|>
    GPT4 Correct Assistant:
  • Mathematical Reasoning Mode:

    Math Correct User: 10.3 − 7988.8133=<|end_of_turn|>
    Math Correct Assistant:

    NOTE: Remember to set <|end_of_turn|> as end of generation token.

  • Integrated Tokenizer: The default (GPT4 Correct) template is also available as the integrated tokenizer.chat_template, which can be used instead of manually specifying the template.
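
As a rough sketch of that last option, assuming the transformers library and that the published tokenizer ships the GPT4 Correct template as tokenizer.chat_template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat-3.5-1210")

messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi"},
    {"role": "user", "content": "How are you today?"},
]

# Render the conversation with the built-in template and append the
# generation prompt ("GPT4 Correct Assistant:") for the next turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)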

Suggested labels

{ "label": "chat-templates", "description": "Pre-defined conversation structures for specific modes of interaction." }

#396: astra-assistants-api: A backend implementation of the OpenAI beta Assistants API

### Details

Similarity score: 0.87

- [ ] [datastax/astra-assistants-api: A backend implementation of the OpenAI beta Assistants API](https://github.com/datastax/astra-assistants-api)

Astra Assistant API Service

A drop-in compatible service for the OpenAI beta Assistants API with support for persistent threads, files, assistants, messages, retrieval, function calling and more using AstraDB (DataStax's db as a service offering powered by Apache Cassandra and jvector).

Compatible with existing OpenAI apps via the OpenAI SDKs by changing a single line of code.

Getting Started

  1. Create an Astra DB Vector database
  2. Replace the following code:
client = OpenAI(
    api_key=OPENAI_API_KEY,
)

with:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
    }
)

Or, if you have an existing astra db, you can pass your db_id in a second header:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key=OPENAI_API_KEY,
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "astra-db-id": ASTRA_DB_ID
    }
)
  3. Create an assistant
assistant = client.beta.assistants.create(
  instructions="You are a personal math tutor. When asked a math question, write and run code to answer the question.",
  model="gpt-4-1106-preview",
  tools=[{"type": "retrieval"}]
)
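
The snippet stops after creating the assistant. A minimal continuation sketch, assuming the standard OpenAI beta Assistants flow (threads, messages, runs) works unchanged against this endpoint:

import time

# Create a thread and add a user message (hypothetical question).
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="I need to solve the equation 3x + 11 = 14. Can you help me?",
)

# Start a run with the assistant and poll until it finishes.
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# Print the assistant's replies from the thread.
for message in client.beta.threads.messages.list(thread_id=thread.id).data:
    if message.role == "assistant":
        print(message.content[0].text.value)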

By default, the service uses AstraDB as the database/vector store and OpenAI for embeddings and chat completion.

Third party LLM Support

We now support many third party models for both embeddings and completion thanks to litellm. Pass the api key of your service using api-key and embedding-model headers.

For AWS Bedrock, you can pass additional custom headers:

client = OpenAI(
    base_url="https://open-assistant-ai.astra.datastax.com/v1", 
    api_key="NONE",
    default_headers={
        "astra-api-token": ASTRA_DB_APPLICATION_TOKEN,
        "embedding-model": "amazon.titan-embed-text-v1",
        "LLM-PARAM-aws-access-key-id": BEDROCK_AWS_ACCESS_KEY_ID,
        "LLM-PARAM-aws-secret-access-key": BEDROCK_AWS_SECRET_ACCESS_KEY,
        "LLM-PARAM-aws-region-name": BEDROCK_AWS_REGION,
    }
)

and again, specify the custom model for the assistant.

assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a personal math tutor. Answer questions briefly, in a sentence or less.",
    model="meta.llama2-13b-chat-v1",
)

Additional examples including third party LLMs (bedrock, cohere, perplexity, etc.) can be found under examples.

To run the examples using poetry:

  1. Create a .env file in this directory with your secrets.
  2. Run:
poetry install
poetry run python examples/completion/basic.py
poetry run python examples/retreival/basic.py
poetry run python examples/function-calling/basic.py

Coverage

See our coverage report here.

Roadmap

  • Support for other embedding models and LLMs
  • Function calling
  • Pluggable RAG strategies
  • Streaming support

Suggested labels

{ "key": "llm-function-calling", "value": "Integration of function calling with Large Language Models (LLMs)" }

#632: OpenRouter: Assistant Prefill supports asking models to complete a partial response.

### Details

Similarity score: 0.86

- [ ] [Docs | OpenRouter](https://openrouter.ai/docs#responses)

Docs | OpenRouter

DESCRIPTION: "Assistant Prefill: OpenRouter supports asking models to complete a partial response. This can be useful for guiding models to respond in a certain way.
To use this feature, simply include a message with role: "assistant" at the end of your messages array. Example:

fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${OPENROUTER_API_KEY}`,
    "HTTP-Referer": `${YOUR_SITE_URL}`, // Optional, for including your app on openrouter.ai rankings.
    "X-Title": `${YOUR_SITE_NAME}`, // Optional. Shows in rankings on openrouter.ai.
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    "messages": [
      {"role": "user", "content": "Who are you?"},
      {"role": "assistant", "content": "I'm not sure, but my best guess is"},
    ],
  })
});

URL: OpenRouter Documentation

Suggested labels

{'label-name': 'Chatbot-API', 'label-description': 'API documentation for interacting with chatbots on OpenRouter.', 'gh-repo': 'AI-Chatbots', 'confidence': 65.43}

#160: sid321axn/tinyllama-text2sql-finetuned at main

### Details

Similarity score: 0.86

## tiny-llama-text2sql
## safetensors

- [ ] [sid321axn/tinyllama-text2sql-finetuned at main](https://huggingface.co/sid321axn/tinyllama-text2sql-finetuned/tree/main)

adapter

https://huggingface.co/sid321axn/tiny-llama-text2sql

This model is a fine-tuned version of PY007/TinyLlama-1.1B-Chat-v0.3 on the None dataset.

{
 "_name_or_path": "PY007/TinyLlama-1.1B-Chat-v0.3",
 "architectures": [
   "LlamaForCausalLM"
 ],
 "attention_bias": false,
 "attention_dropout": 0.0,
 "bos_token_id": 1,
 "eos_token_id": 2,
 "hidden_act": "silu",
 "hidden_size": 2048,
 "initializer_range": 0.02,
 "intermediate_size": 5632,
 "max_position_embeddings": 2048,
 "model_type": "llama",
 "num_attention_heads": 32,
 "num_hidden_layers": 22,
 "num_key_value_heads": 4,
 "pretraining_tp": 1,
 "rms_norm_eps": 1e-05,
 "rope_scaling": null,
 "rope_theta": 10000.0,
 "tie_word_embeddings": false,
 "torch_dtype": "float16",
 "transformers_version": "4.37.0.dev0",
 "use_cache": false,
 "vocab_size": 32003
}
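
The repository publishes these weights as an adapter over the base model named in the config above. A minimal loading sketch, assuming a standard PEFT adapter layout (an assumption about the repo contents, not confirmed by the listing):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "PY007/TinyLlama-1.1B-Chat-v0.3"
adapter_id = "sid321axn/tinyllama-text2sql-finetuned"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)

# Attach the fine-tuned text2sql adapter weights on top of the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

# Hypothetical prompt; the actual fine-tuning prompt template is not documented here.
prompt = "Translate to SQL: list all users created after 2023."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))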
