
Rearchitecting to take advantage of the latest in mlx_lm #36

Open

wants to merge 37 commits into `main`

37 commits
f41354c
A little down-payment clean-up
uogbuji Jan 11, 2025
edfd44a
Stub out `generate_step_with_schema` method with similar signature to…
chimezie Jan 12, 2025
25fa0f5
Stub out `generate_step_with_schema` method with similar signature to…
chimezie Jan 12, 2025
64f5466
Merge remote-tracking branch 'origin/33-re-integrate-mlxlm' into 33-r…
chimezie Jan 12, 2025
1e75977
Minor updates
chimezie Jan 12, 2025
7bd6ccf
Remove generate_with_preemptive_decoding, which is fiddly and almost …
uogbuji Jan 12, 2025
666d043
Partway through the process of modernizing our use of MLX
uogbuji Jan 13, 2025
62bc792
Some more cleanup, including to local call method signatures
uogbuji Jan 13, 2025
db4eca5
Incomplete work on response_helper.py
uogbuji Jan 14, 2025
06eeee6
Getting closer but make_logit_bias_processor seems ot be breaking things
uogbuji Jan 14, 2025
03ddce7
Update logits bias adjustment per make_logits_processors in mlx_lm.sa…
chimezie Jan 14, 2025
903aba3
Try a different method to apply logits biases. Spoiler alet: doesn't …
uogbuji Jan 15, 2025
9641ee5
I think this gets us out of logit_bias_processor purgatory! Thanks to…
uogbuji Jan 15, 2025
6d1b43b
Add country extractor demo
uogbuji Jan 16, 2025
a7f9b52
Add some debugging logic, and try the token suppression logic from to…
uogbuji Jan 17, 2025
2f1984c
Simplify logit_bias_processor setup
uogbuji Jan 17, 2025
d050e3e
Got the prefill situation sorted, but there is still something wonky …
uogbuji Jan 17, 2025
deef2d1
Many enhancements and fixes towards getting this rearchitecture working
uogbuji Jan 18, 2025
c3c8c89
Handle eos token. Improvements to iter_print.
uogbuji Jan 18, 2025
c2997fb
Code cleanup. Variant demo set up for perf profiling the guts of Toolio
uogbuji Jan 18, 2025
7b8437b
Profiled version of apply_token_mask()
uogbuji Jan 18, 2025
d151201
Fix the schemaless case. Other fixes while updating README.
uogbuji Jan 20, 2025
d163989
Change create_mask
chimezie Jan 20, 2025
5caf8c6
Generalize from example
chimezie Jan 20, 2025
0302dd9
Start working on tool-calling path, though not quite there yet.
uogbuji Jan 21, 2025
7ef419a
QUick checkin before resuming work
uogbuji Feb 1, 2025
984508a
Got tool-calling working again. Lots of modularization & improved sta…
uogbuji Feb 3, 2025
1ebf103
Non-tool-calling HTTP client/server working!
uogbuji Feb 9, 2025
ca5c88a
Add web/LLM researcher demo
uogbuji Feb 9, 2025
5ce1a6e
Add source citation to researcher demo
uogbuji Feb 9, 2025
7a1d69d
Fix tool-calling over HTTP, and in particular handling tools on the c…
uogbuji Feb 15, 2025
04d9584
Correct HTTP client complete_with_tools semantics. Test fix.
uogbuji Feb 16, 2025
e7dd839
More tool call fixes
uogbuji Feb 22, 2025
9c5574b
More and MOAR tool-calling fixes
uogbuji Feb 23, 2025
15b309a
Closer with the test fixes
uogbuji Feb 23, 2025
c2ba270
Needs a fair amount of cleanup, but test_readme_examples.py is all gr…
uogbuji Feb 23, 2025
48c4e6f
Enforce llm_response from complete_with_tools(). Test fixes.
uogbuji Feb 23, 2025
1 change: 1 addition & 0 deletions .gitignore
@@ -175,3 +175,4 @@ cython_debug/
#.idea/
demo/docker_compose.yaml
demo/reddit_newsletter/cacert.pem
.aider*
80 changes: 64 additions & 16 deletions README.md
@@ -3,7 +3,7 @@

Toolio is an OpenAI-like HTTP server API implementation which supports structured LLM response generation (e.g. make it conform to a [JSON schema](https://json-schema.org/)). It also implements tool calling by LLMs. Toolio is based on the MLX framework for Apple Silicon (e.g. M1/M2/M3/M4 Macs), so **that's the only supported platform at present**.

Whether the buzzword you're pursuing is tool-calling, function-calling, agentic workflows, compound AI, guaranteed structured output, schema-driven output, guided generation, or steered response, give Toolio a try. You can think of it as your "GPT Private Agent", handling intelligent tasks for you, without spilling your secrets.
Whether the buzzword you're pursuing is tool-calling, function-calling, agentic workflows, compound AI, guaranteed structured output, schema-driven output, guided generation, or steered response, give Toolio a try, in your own private setting.

Builds on: https://github.com/otriscon/llm-structured-output/

@@ -108,11 +108,17 @@ The key here is specification of a JSON schema. The schema is escaped for the co
{"type": "object", "properties": {"guess": {"type": "number"}}}
```

It looks a bit intimidating, at first, if you're not familiar with [JSON schema](https://json-schema.org/), but they're reasonably easy to learn. [You can follow the primer](https://json-schema.org/learn/getting-started-step-by-step).
This describes a response such as:

Ultimately, you can just paste an example of your desired output structure and ask ChatGPT, Claude, Gemini, etc. as simply as: "Please write a JSON schema to represent this data format."
```json
{"guess": 5}
```

The schema may look a bit intimidating at first if you're not familiar with [JSON schema](https://json-schema.org/), but the basics are reasonably easy to learn. [You can follow the primer](https://json-schema.org/learn/getting-started-step-by-step).

Or you can just paste an example of your desired output structure and ask ChatGPT, Claude, Gemini, etc. (or, of course, your favorite local LLM via Toolio): "Please write a JSON schema to represent this data format: [response format example]"

Toolio's JSOn schema support is a subset, so you might need to tweak a schema before using it with Toolio. Most of the unsupported features can be just omitted, or expressed in the prompt or schema descriptions instead.
Toolio's JSON schema support is a subset, so you might need to tweak a schema before using it with Toolio. Most of the unsupported features can be just omitted, or expressed in the prompt or schema descriptions instead.
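
For example (an illustrative sketch, not from the Toolio docs; check the docs for exactly which keywords are supported), rather than relying on a keyword that may fall outside the supported subset, such as `format`, you might hint the expected form in a `description`:

```json
{
  "type": "object",
  "properties": {
    "birth_date": {
      "type": "string",
      "description": "Date of birth as an ISO 8601 date, e.g. 1990-05-17"
    }
  },
  "required": ["birth_date"]
}
```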

## Using the command line client instead

@@ -349,32 +355,32 @@ Toolio uses OpenAI API conventions a lot under the hood. If you run the followin

```py
import asyncio
from toolio.llm_helper import local_model_runner, extract_content
from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

async def say_hello(tmm):
msgs = [{"role": "user", "content": "Hello! How are you?"}]
# FYI, there is a fnction toolio.common.response_text which can help cnsume iter_* methods
async for chunk_struct in tmm.iter_complete(msgs):
# FYI, there are functions in toolio.commonresponse_text & print_response—which can help consume iter_* methods
async for chunk_struct in tmm.iter_complete(msgs, simple=False):
print(chunk_struct)
print(chunk_struct.first_choice_text)
break

asyncio.run(say_hello(toolio_mm))
```

from

You should see something like:

```py
{'choices': [{'index': 0, 'delta': {'role': 'assistant', 'content': 'Hi'}, 'finish_reason': None}], 'object': 'chat.completion.chunk', 'id': 'chatcmpl-17588006160_1721823730', 'created': 1721823730, 'model': 'mlx-community/Hermes-2-Theta-Llama-3-8B-4bit'}
llm_response(response_type=<llm_response_type.MESSAGE: 1>, choices=[{'index': 0, 'delta': {'role': 'assistant', 'content': 'Hello'}, 'finish_reason': None}], usage={'prompt_tokens': 15, 'completion_tokens': 1, 'total_tokens': 16}, object='chat.completion', id='cmpl-1737387910', created=1737387910, model='mlx-community/Hermes-2-Theta-Llama-3-8B-4bit', model_type='llama', _first_choice_text=None)
Hello
```

The LLM response is delivered in such structures ("deltas") as they're generated. `chunk_struct['choices'][0]['delta']['content']` is a bit of the actual text we teased out in the previous snippet. `chunk_struct['choices'][0]['finish_reason']` is `None` because it's not yet finished, etc. This is based on OpenAI API.
The LLM response is delivered in such structures ("deltas") as they're generated. The `simple=False` flag tells Toolio to yield the responses in a data structure which includes useful metadata as well as the actual response text. The `first_choice_text` attribute on that object gives the plain text of the response; supporting multiple choices is for OpenAI API compatibility, though Toolio only ever offers a single choice. Notice that `chunk_struct.choices[0]['finish_reason']` is `None` because it's not yet finished. These are largely OpenAI API conventions, though `first_choice_text` is specific to Toolio (and [OgbujiPT](https://github.com/OoriData/OgbujiPT)).

`extract_content`, used in the previous snippet, is a very simple coroutine that extracts the actual text content from this series of response structures.
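
As a quick illustration, here is a minimal sketch (assuming the `iter_complete` signature and `first_choice_text` attribute shown above) which accumulates the streamed deltas into a single string:

```py
import asyncio
from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

async def collect_text(tmm):
    msgs = [{'role': 'user', 'content': 'Hello! How are you?'}]
    parts = []
    async for chunk_struct in tmm.iter_complete(msgs, simple=False):
        # Each delta carries a fragment of the response text; the final chunk may carry none
        parts.append(chunk_struct.first_choice_text or '')
    print(''.join(parts))

asyncio.run(collect_text(toolio_mm))
```

In practice you'd more likely reach for the `response_text` or `print_response` helpers in `toolio.common` mentioned in the comment above.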

<!-- THE FINISH DELTA NEEDS REPAIR AS OF THE JANUARY REFACTOR
The final chunk would look something like this:

```py
@@ -385,7 +391,7 @@ Notice there is more information, now that it's finished (`'finish_reason': 'sto

```py
import asyncio
from toolio.llm_helper import local_model_runner, extract_content
from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

Expand All @@ -410,7 +416,7 @@ You'll get something like:
Number of tokens generated: 32
```

Tip: don't forget all the various, useful bits to be found in `itertools` and the like.
-->

# Structured LLM responses via direct API

@@ -422,16 +428,29 @@ from toolio.llm_helper import local_model_runner

toolio_mm = local_model_runner('mlx-community/Hermes-2-Theta-Llama-3-8B-4bit')

async def say_hello(tmm):
async def extractor(tmm):
    prompt = ('Which countries are mentioned in the sentence \'Adamma went home to Nigeria for the hols\'?'
        'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#')
    schema = ('{"type": "array", "items":'
        '{"type": "object", "properties": {"name": {"type": "string"}, "continent": {"type": "string"}}, "required": ["name", "continent"]}}')
    print(await tmm.complete([{'role': 'user', 'content': prompt}], json_schema=schema))

asyncio.run(say_hello(toolio_mm))
asyncio.run(extractor(toolio_mm))
```

Printing:

```json
[
  {
    "name": "Nigeria",
    "continent": "Africa"
  }
]
```

Find an expanded version of this code in `demo/country_extract.py`.

## Example of tool use

```py
@@ -507,6 +526,35 @@ In which case you can express a response such as:
> By the tool's decree, the square root of 256, a number most fair,
> Is sixteen, a digit most true, and a figure most rare.

# Making more memory available for Large Models

<!--
Largely copied from the MLX_LM README. Should probably move to docs.
-->

> [!NOTE]
> This requires macOS 15.0 or higher to work.

Models which are large relative to the total RAM available on the machine can
be slow. The underlying `mlx-lm` code will attempt to make them faster by wiring
the memory occupied by the model and cache.

If you see the following warning message:

> [WARNING] Generating with a model that requires ...

then the model will likely be slow on the given machine. If the model fits in
RAM then it can often be sped up by increasing the system wired memory limit.
To increase the limit, set the following `sysctl`:

```bash
sudo sysctl iogpu.wired_limit_mb=N
```

The value `N` should be larger than the size of the model in megabytes but
smaller than the memory size of the machine.
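
For instance (a hypothetical sizing, not from the MLX docs; adjust for your own hardware), for a roughly 9 GB quantized model on a 32 GB Mac you might try:

```bash
# Hypothetical example: allow up to 12 GB of wired (GPU-accessible) memory
sudo sysctl iogpu.wired_limit_mb=12288
```

As usual with `sysctl`, the value does not persist across reboots.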

# Learn more

* [Documentation](https://OoriData.github.io/Toolio/)
7 changes: 2 additions & 5 deletions demo/algebra_tutor.py
@@ -7,6 +7,7 @@

import asyncio
from toolio.llm_helper import local_model_runner
from toolio.common import print_response

SCHEMA = '''\
{
@@ -43,10 +44,6 @@
async def tutor_main(tmm):
    prompt = ('solve 8x + 31 = 2. Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#')
    msgs = [{'role': 'user', 'content': prompt.format(json_schema=SCHEMA)}]
    rt = await tmm.complete(msgs, json_schema=SCHEMA, max_tokens=512)
    print(rt)

    # async for chunk in extract_content(tmm.complete(msgs, json_schema=SCHEMA, max_tokens=512)):
    #     print(chunk, end='')
    await print_response(tmm.iter_complete(msgs, json_schema=SCHEMA, max_tokens=512))

asyncio.run(tutor_main(toolio_mm))
34 changes: 34 additions & 0 deletions demo/country_extract.py
@@ -0,0 +1,34 @@
'''
Demo using toolio to interact with a model that extracts countries from a sentence
It also shows how you can set a random seed for reproducible results
'''
import asyncio
import mlx.core as mx
from toolio.llm_helper import local_model_runner
from toolio.common import print_response

RANDOM_SEED = 42

toolio_mm = local_model_runner('mlx-community/Mistral-Nemo-Instruct-2407-4bit')

SCHEMA_PY = {
    'type': 'array',
    'items': {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'continent': {'type': 'string'}
        },
        'required': ['name', 'continent']
    }
}

async def main():
    mx.random.seed(RANDOM_SEED)
    sentence = 'Adamma went home to Nigeria for the hols'
    prompt = f'Which countries are mentioned in the sentence \'{sentence}\'?\n'
    prompt += 'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
    # iter_complete() method accepts a JSON schema in string form or as the equivalent Python dictionary
    await print_response(toolio_mm.iter_complete([{'role': 'user', 'content': prompt}], json_schema=SCHEMA_PY))

asyncio.run(main())
48 changes: 48 additions & 0 deletions demo/country_extract_cprofile.py
@@ -0,0 +1,48 @@
'''
Demo using toolio to interact with a model that extracts countries from a sentence,
wrapped in cProfile (pass an output stats filename as the first argument; defaults to profile.prof)
It also shows how you can set a random seed for reproducible results
'''
import sys
import asyncio
import cProfile

import mlx.core as mx
from toolio.llm_helper import local_model_runner
from toolio.common import print_response

try:
    pstats_fname = sys.argv[1]
except IndexError:
    pstats_fname = 'profile.prof'

RANDOM_SEED = 42

toolio_mm = local_model_runner('mlx-community/Mistral-Nemo-Instruct-2407-4bit')

SCHEMA_PY = {
    'type': 'array',
    'items': {
        'type': 'object',
        'properties': {
            'name': {'type': 'string'},
            'continent': {'type': 'string'}
        },
        'required': ['name', 'continent']
    }
}

async def main():
    mx.random.seed(RANDOM_SEED)
    sentence = 'Adamma went home to Nigeria for the hols'
    prompt = f'Which countries are mentioned in the sentence \'{sentence}\'?\n'
    prompt += 'Your answer should be only JSON, according to this schema: #!JSON_SCHEMA!#'
    # iter_complete() method accepts a JSON schema in string form or as the equivalent Python dictionary

    profiler = cProfile.Profile()
    profiler.enable()
    await print_response(toolio_mm.iter_complete([{'role': 'user', 'content': prompt}], json_schema=SCHEMA_PY))
    profiler.disable()
    profiler.dump_stats(pstats_fname)

if __name__ == "__main__":
    asyncio.run(main())
60 changes: 60 additions & 0 deletions demo/researcher/README.md
@@ -0,0 +1,60 @@

# Setup

## SearXNG

You'll need a running [SearXNG](https://github.com/searxng/searxng) instance. You can just use the Docker container. To run it locally:

```sh
docker pull searxng/searxng
export SEARXNG_PORT=8888
docker run \
-d -p ${SEARXNG_PORT}:8080 \
-v "${PWD}/searxng:/etc/searxng" \
-e "BASE_URL=http://localhost:$SEARXNG_PORT/" \
-e "INSTANCE_NAME=tee-seeker-searxng" \
--name tee-seeker-searxng \
searxng/searxng
```

Note: We want to have some sort of API key, but there doesn't seem to be any built-in approach (`SEARXNG_SECRET` is something different). We might have to use a reverse proxy with HTTP auth.

This gets SearXNG running on port 8888. Feel free to adjust as necessary in the 10minclimate.com config.

You do need to edit `searxng/settings.yml` (relative to where you launched the Docker container), making sure `server.limiter` is set to false and `- json` is included in `search.formats`.

You can then just restart the container (use `docker ps` to get the ID, `docker stop [ID]`, and then repeat the `docker run` command above).
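
Once it's back up, a quick way to check that the JSON format is enabled (a sketch, assuming the port chosen above):

```sh
curl 'http://localhost:8888/search?q=hello&format=json'
```

You should get a JSON payload back rather than an error response.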

<!-- Not needed at present
One trick for generating a secret key:

```sh
python -c "from uuid import uuid1; print(str(uuid1()))"
```
-->

### Clean up

```sh
docker stop tee-seeker-searxng
# And only if you're done with it:
docker rm tee-seeker-searxng
```

# Running

```sh
time python demo/tee_seeker/main.py "What's so valuable about DeepSeek's GRPO technique?" --rigor 0.1
```

# Future work

<!-- Note: See what's already in the Arkestra version -->

* Panel of experts approach from [Stanford Storm](https://github.com/stanford-oval/storm)

# See also

* [Introducing Deeper Seeker - A simpler and OSS version of OpenAI's latest Deep Research feature](https://www.reddit.com/r/LocalLLaMA/comments/1igyy0n/introducing_deeper_seeker_a_simpler_and_oss/) [Feb 2025]
* [Automated-AI-Web-Researcher-Ollama](https://github.com/TheBlewish/Automated-AI-Web-Researcher-Ollama) ([Reddit](https://www.reddit.com/r/LocalLLaMA/comments/1gvlzug/i_created_an_ai_research_assistant_that_actually/))
