
Rearchitecting to take advantage of the latest in mlx_lm #36

Open

wants to merge 37 commits into `main`

37 commits
f41354c
A little down-payment clean-up
uogbuji Jan 11, 2025
edfd44a
Stub out `generate_step_with_schema` method with similar signature to…
chimezie Jan 12, 2025
25fa0f5
Stub out `generate_step_with_schema` method with similar signature to…
chimezie Jan 12, 2025
64f5466
Merge remote-tracking branch 'origin/33-re-integrate-mlxlm' into 33-r…
chimezie Jan 12, 2025
1e75977
Minor updates
chimezie Jan 12, 2025
7bd6ccf
Remove generate_with_preemptive_decoding, which is fiddly and almost …
uogbuji Jan 12, 2025
666d043
Partway through the process of modernizing our use of MLX
uogbuji Jan 13, 2025
62bc792
Some more cleanup, including to local call method signatures
uogbuji Jan 13, 2025
db4eca5
Incomplete work on response_helper.py
uogbuji Jan 14, 2025
06eeee6
Getting closer but make_logit_bias_processor seems ot be breaking things
uogbuji Jan 14, 2025
03ddce7
Update logits bias adjustment per make_logits_processors in mlx_lm.sa…
chimezie Jan 14, 2025
903aba3
Try a different method to apply logits biases. Spoiler alet: doesn't …
uogbuji Jan 15, 2025
9641ee5
I think this gets us out of logit_bias_processor purgatory! Thanks to…
uogbuji Jan 15, 2025
6d1b43b
Add country extractor demo
uogbuji Jan 16, 2025
a7f9b52
Add some debugging logic, and try the token suppression logic from to…
uogbuji Jan 17, 2025
2f1984c
Simplify logit_bias_processor setup
uogbuji Jan 17, 2025
d050e3e
Got the prefill situation sorted, but there is still something wonky …
uogbuji Jan 17, 2025
deef2d1
Many enhancements and fixes towards getting this rearchitecture working
uogbuji Jan 18, 2025
c3c8c89
Handle eos token. Improvements to iter_print.
uogbuji Jan 18, 2025
c2997fb
Code cleanup. Variant demo set up for perf profiling the guts of Toolio
uogbuji Jan 18, 2025
7b8437b
Profiled version of apply_token_mask()
uogbuji Jan 18, 2025
d151201
Fix the schemaless case. Other fixes while updating README.
uogbuji Jan 20, 2025
d163989
Change create_mask
chimezie Jan 20, 2025
5caf8c6
Generalize from example
chimezie Jan 20, 2025
0302dd9
Start working on tool-calling path, though not quite there yet.
uogbuji Jan 21, 2025
7ef419a
QUick checkin before resuming work
uogbuji Feb 1, 2025
984508a
Got tool-calling working again. Lots of modularization & improved sta…
uogbuji Feb 3, 2025
1ebf103
Non-tool-calling HTTP client/server working!
uogbuji Feb 9, 2025
ca5c88a
Add web/LLM researcher demo
uogbuji Feb 9, 2025
5ce1a6e
Add source citation to researcher demo
uogbuji Feb 9, 2025
7a1d69d
Fix tool-calling over HTTP, and in particular handling tools on the c…
uogbuji Feb 15, 2025
04d9584
Correct HTTP client complete_with_tools semantics. Test fix.
uogbuji Feb 16, 2025
e7dd839
More tool call fixes
uogbuji Feb 22, 2025
9c5574b
More and MOAR tool-calling fixes
uogbuji Feb 23, 2025
15b309a
Closer with the test fixes
uogbuji Feb 23, 2025
c2ba270
Needs a fair amount of cleanup, but test_readme_examples.py is all gr…
uogbuji Feb 23, 2025
48c4e6f
Enforce llm_response from complete_with_tools(). Test fixes.
uogbuji Feb 23, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -175,3 +175,4 @@ cython_debug/
#.idea/
demo/docker_compose.yaml
demo/reddit_newsletter/cacert.pem
.aider*
80 changes: 64 additions & 16 deletions README.md
@@ -3,7 +3,7 @@

Toolio is an OpenAI-like HTTP server API implementation which supports structured LLM response generation (e.g. make it conform to a [JSON schema](https://json-schema.org/)). It also implements tool calling by LLMs. Toolio is based on the MLX framework for Apple Silicon (e.g. M1/M2/M3/M4 Macs), so **that's the only supported platform at present**.

Whether the buzzword you're pursuing is tool-calling, function-calling, agentic workflows, compound AI, guaranteed structured output, schema-driven output, guided generation, or steered response, give Toolio a try. You can think of it as your "GPT Private Agent", handling intelligent tasks for you, without spilling your secrets.
Whether the buzzword you're pursuing is tool-calling, function-calling, agentic workflows, compound AI, guaranteed structured output, schema-driven output, guided generation, or steered response, give Toolio a try, in your own private setting.

Builds on: https://github.com/otriscon/llm-structured-output/

@@ -108,11 +108,17 @@ The key here is specification of a JSON schema. The schema is escaped for the co
{"type": "object", "properties": {"guess": {"type": "number"}}}
```

It looks a bit intimidating, at first, if you're not familiar with [JSON schema](https://json-schema.org/), but they're reasonably easy to learn. [You can follow the primer](https://json-schema.org/learn/getting-started-step-by-step).
This describes a response such as:

Ultimately, you can just paste an example of your desired output structure and ask ChatGPT, Claude, Gemini, etc. as simply as: "Please write a JSON schema to represent this data format."
```json
{"guess": 5}
```

The schema may look a bit intimidating at first if you're not familiar with [JSON schema](https://json-schema.org/), but the basics are reasonably easy to learn. [You can follow the primer](https://json-schema.org/learn/getting-started-step-by-step).

Or you can just paste an example of your desired output structure and ask ChatGPT, Claude, Gemini, etc. (or, of course, your favorite local LLM via Toolio): "Please write a JSON schema to represent this data format: [response format example]"

Toolio's JSOn schema support is a subset, so you might need to tweak a schema before using it with Toolio. Most of the unsupported features can be just omitted, or expressed in the prompt or schema descriptions instead.
Toolio's JSON schema support is a subset, so you might need to tweak a schema before using it with Toolio. Most of the unsupported features can be just omitted, or expressed in the prompt or schema descriptions instead.
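
For example (an illustrative sketch, not from the Toolio docs; check the docs for exactly which keywords are supported), rather than relying on a keyword that may fall outside the supported subset, such as `format`, you might hint the expected form in a `description`:

```json
{
  "type": "object",
  "properties": {
    "birth_date": {
      "type": "string",
      "description": "Date of birth as an ISO 8601 date, e.g. 1990-05-17"
    }
  },
  "required": ["birth_date"]
}
```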

## Using the command line client instead

@@ -349,32 +355,32 @@ Toolio uses OpenAI API conventions a lot under the hood. If you run the followin

```py
import asyncio
from toolio.llm_helper import local_model_runner, extract_content
from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

async def say_hello(tmm):
msgs = [{"role": "user", "content": "Hello! How are you?"}]
# FYI, there is a fnction toolio.common.response_text which can help cnsume iter_* methods
async for chunk_struct in tmm.iter_complete(msgs):
# FYI, there are functions in toolio.commonresponse_text & print_response—which can help consume iter_* methods
async for chunk_struct in tmm.iter_complete(msgs, simple=False):
print(chunk_struct)
print(chunk_struct.first_choice_text)
break

asyncio.run(say_hello(toolio_mm))
```

from

You should see something like:

```py
{'choices': [{'index': 0, 'delta': {'role': 'assistant', 'content': 'Hi'}, 'finish_reason': None}], 'object': 'chat.completion.chunk', 'id': 'chatcmpl-17588006160_1721823730', 'created': 1721823730, 'model': 'mlx-community/Hermes-2-Theta-Llama-3-8B-4bit'}
llm_response(response_type=<llm_response_type.MESSAGE: 1>, choices=[{'index': 0, 'delta': {'role': 'assistant', 'content': 'Hello'}, 'finish_reason': None}], usage={'prompt_tokens': 15, 'completion_tokens': 1, 'total_tokens': 16}, object='chat.completion', id='cmpl-1737387910', created=1737387910, model='mlx-community/Hermes-2-Theta-Llama-3-8B-4bit', model_type='llama', _first_choice_text=None)
Hello
```

The LLM response is delivered in such structures ("deltas") as they're generated. `chunk_struct['choices'][0]['delta']['content']` is a bit of the actual text we teased out in the previous snippet. `chunk_struct['choices'][0]['finish_reason']` is `None` because it's not yet finished, etc. This is based on OpenAI API.
The LLM response is delivered in such structures ("deltas") as they're generated. The `simple=False` flag tells Toolio to yield the responses in a data structure which includes useful metadata as well as the actual response text. The `first_choice_text` attribute on that object gives the plain text of the response; supporting multiple choices is for OpenAI API compatibility, though Toolio only ever offers a single choice. Notice that `chunk_struct.choices[0]['finish_reason']` is `None` because it's not yet finished. These are largely OpenAI API conventions, though `first_choice_text` is specific to Toolio (and [OgbujiPT](https://github.com/OoriData/OgbujiPT)).

`extract_content`, used in the previous snippet, is a very simple coroutine that extracts the actual text content from this series of response structures.
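
As a quick illustration, here is a minimal sketch (assuming the `iter_complete` signature and `first_choice_text` attribute shown above) which accumulates the streamed deltas into a single string:

```py
import asyncio
from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

async def collect_text(tmm):
    msgs = [{'role': 'user', 'content': 'Hello! How are you?'}]
    parts = []
    async for chunk_struct in tmm.iter_complete(msgs, simple=False):
        # Each delta carries a fragment of the response text; the final chunk may carry none
        parts.append(chunk_struct.first_choice_text or '')
    print(''.join(parts))

asyncio.run(collect_text(toolio_mm))
```

In practice you'd more likely reach for the `response_text` or `print_response` helpers in `toolio.common` mentioned in the comment above.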

<!-- THE FINISH DELTA NEEDS REPAIR AS OF THE JANUARY REFACTOR
The final chunk would look something like this:

```py
@@ -385,7 +391,7 @@ Notice there is more information, now that it's finished (`'finish_reason': 'sto

```py
import asyncio
from toolio.llm_helper import local_model_runner, extract_content
from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

Expand All @@ -410,7 +416,7 @@ You'll get something like:
Number of tokens generated: 32
```

Tip: don't forget all the various, useful bits to be found in `itertools` and the like.
-->

# Structured LLM responses via direct API

@@ -422,16 +428,29 @@ from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

async def say_hello(tmm):
async def extractor(tmm):
    prompt = ('Which countries are mentioned in the sentence \'Adamma went home to Nigeria for the hols\'?'
        'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#')
    schema = ('{"type": "array", "items":'
        '{"type": "object", "properties": {"name": {"type": "string"}, "continent": {"type": "string"}}, "required": ["name", "continent"]}}')
    print(await tmm.complete([{'role': 'user', 'content': prompt}], json_schema=schema))

asyncio.run(say_hello(toolio_mm))
asyncio.run(extractor(toolio_mm))
```

Printing:

```json
[
  {
    "name": "Nigeria",
    "continent": "Africa"
  }
]
```

Find an expanded version of this code in `demo/country_extract.py`.

## Example of tool use

```py
@@ -507,6 +526,35 @@ In which case you can express a response such as:
> By the tool's decree, the square root of 256, a number most fair,
> Is sixteen, a digit most true, and a figure most rare.

# Making more memory available for Large Models

<!--
Largely copied from the MLX_LM README. Should probably move to docs.
-->

> [!NOTE]
> This requires macOS 15.0 or higher to work.

Models which are large relative to the total RAM available on the machine can
be slow. The underlying `mlx-lm` code will attempt to make them faster by wiring
the memory occupied by the model and cache.

If you see the following warning message:

> [WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in
RAM then it can often be sped up by increasing the system wired memory limit.
To increase the limit, set the following `sysctl`:

```bash
sudo sysctl iogpu.wired_limit_mb=N
```

The value `N` should be larger than the size of the model in megabytes but
smaller than the memory size of the machine.
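
For instance (a hypothetical sizing, not from the MLX docs; adjust for your own hardware), for a roughly 9 GB quantized model on a 32 GB Mac you might try:

```bash
# Hypothetical example: allow up to 12 GB of wired (GPU-accessible) memory
sudo sysctl iogpu.wired_limit_mb=12288
```

As usual with `sysctl`, the value does not persist across reboots.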

# Learn more

* [Documentation](https://OoriData.github.io/Toolio/)
7 changes: 2 additions & 5 deletions demo/algebra_tutor.py
@@ -7,6 +7,7 @@

import asyncio
from toolio.llm_helper import local_model_runner
from toolio.common import print_response

SCHEMA = '''\
{
@@ -43,10 +44,6 @@
async def tutor_main(tmm):
    prompt = ('solve 8x + 31 = 2. Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#')
    msgs = [{'role': 'user', 'content': prompt.format(json_schema=SCHEMA)}]
    rt = await tmm.complete(msgs, json_schema=SCHEMA, max_tokens=512)
    print(rt)

    # async for chunk in extract_content(tmm.complete(msgs, json_schema=SCHEMA, max_tokens=512)):
    #     print(chunk, end='')
    await print_response(tmm.iter_complete(msgs, json_schema=SCHEMA, max_tokens=512))

asyncio.run(tutor_main(toolio_mm))
34 changes: 34 additions & 0 deletions demo/country_extract.py
@@ -0,0 +1,34 @@
'''
Demo using toolio to interact with a model that extracts countries from a sentence
It also shows how you can set a random seed for reproducible results
'''
import asyncio
import mlx.core as mx
from toolio.llm_helper import local_model_runner
from toolio.common import print_response

RANDOM_SEED = 42

toolio_mm = local_model_runner('mlx-community/Mistral-Nemo-Instruct-2407-4bit')

SCHEMA_PY = {
    'type': 'array',
    'items': {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'continent': {'type': 'string'}
        },
        'required': ['name', 'continent']
    }
}

async def main():
    mx.random.seed(RANDOM_SEED)
    sentence = 'Adamma went home to Nigeria for the hols'
    prompt = f'Which countries are mentioned in the sentence \'{sentence}\'?\n'
    prompt += 'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
    # iter_complete() method accepts a JSON schema in string form or as the equivalent Python dictionary
    await print_response(toolio_mm.iter_complete([{'role': 'user', 'content': prompt}], json_schema=SCHEMA_PY))

asyncio.run(main())
48 changes: 48 additions & 0 deletions demo/country_extract_cprofile.py
@@ -0,0 +1,48 @@
'''
Demo using toolio to interact with a model that extracts countries from a sentence,
wrapped in cProfile (pass an output stats filename as the first argument; defaults to profile.prof)
It also shows how you can set a random seed for reproducible results
'''
import sys
import asyncio
import cProfile

import mlx.core as mx
from toolio.llm_helper import local_model_runner
from toolio.common import print_response

try:
    pstats_fname = sys.argv[1]
except IndexError:
    pstats_fname = 'profile.prof'

RANDOM_SEED = 42

toolio_mm = local_model_runner('mlx-community/Mistral-Nemo-Instruct-2407-4bit')

SCHEMA_PY = {
    'type': 'array',
    'items': {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'continent': {'type': 'string'}
        },
        'required': ['name', 'continent']
    }
}

async def main():
    mx.random.seed(RANDOM_SEED)
    sentence = 'Adamma went home to Nigeria for the hols'
    prompt = f'Which countries are mentioned in the sentence \'{sentence}\'?\n'
    prompt += 'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
    # iter_complete() method accepts a JSON schema in string form or as the equivalent Python dictionary

    profiler = cProfile.Profile()
    profiler.enable()
    await print_response(toolio_mm.iter_complete([{'role': 'user', 'content': prompt}], json_schema=SCHEMA_PY))
    profiler.disable()
    profiler.dump_stats(pstats_fname)

if __name__ == "__main__":
    asyncio.run(main())
60 changes: 60 additions & 0 deletions demo/researcher/README.md
@@ -0,0 +1,60 @@

# Setup

## SearXNG

You'll need a running [SearXNG](https://github.com/searxng/searxng) instance. You can just use the Docker container. To run it locally:

```sh
docker pull searxng/searxng
export SEARXNG_PORT=8888
docker run \
-d -p ${SEARXNG_PORT}:8080 \
-v "${PWD}/searxng:/etc/searxng" \
-e "BASE_URL=http://localhost:$SEARXNG_PORT/" \
-e "INSTANCE_NAME=tee-seeker-searxng" \
--name tee-seeker-searxng \
searxng/searxng
```

Note: We want to have some sort of API key, but there doesn't seem to be any built-in approach (`SEARXNG_SECRET` is something different). We might have to use a reverse proxy with HTTP auth.

This gets SearXNG running on port 8888. Feel free to adjust as necessary in the 10minclimate.com config.

You do need to edit `searxng/settings.yml` (relative to where you launched the Docker container), making sure `server.limiter` is set to false and `- json` is included in `search.formats`.

You can then just restart the container (use `docker ps` to get the ID, `docker stop [ID]`, and then repeat the `docker run` command above).
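
Once it's back up, a quick way to check that the JSON format is enabled (a sketch, assuming the port chosen above):

```sh
curl 'http://localhost:8888/search?q=hello&format=json'
```

You should get a JSON payload back rather than an error response.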

<!-- Not needed at present
One trick for generating a secret key:

```sh
python -c "from uuid import uuid1; print(str(uuid1()))"
```
-->

### Clean up

```sh
docker stop tee-seeker-searxng
# And only if you're done with it:
docker rm tee-seeker-searxng
```

# Running

```sh
time python demo/tee_seeker/main.py "What's so valuable about DeepSeek's GRPO technique?" --rigor 0.1
```

# Future work

<!-- Note: See what's already in the Arkestra version -->

* Panel of experts approach from [Stanford Storm](https://github.com/stanford-oval/storm)

# See also

* [Introducing Deeper Seeker - A simpler and OSS version of OpenAI's latest Deep Research feature](https://www.reddit.com/r/LocalLLaMA/comments/1igyy0n/introducing_deeper_seeker_a_simpler_and_oss/) [Feb 2025]
* [Automated-AI-Web-Researcher-Ollama](https://github.com/TheBlewish/Automated-AI-Web-Researcher-Ollama) ([Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1gvlzug/i_created_an_ai_research_assistant_that_actually/))
