Solve multi-modal models with a new concept of "attachments" #587
Comments
Here's some research I did against Gemini the other day: https://til.simonwillison.net/llms/prompt-gemini Resulting in a Bash script that could do this: prompt-gemini 'extract text from this image' example-handwriting.jpg In that case it was detecting the file type from the file extension, since that type needs to be passed like so: {
"contents": [
{
"role": "user",
"parts": [
{
"text": "Extract text from this image"
},
{
"inlineData": {
"data": "$(base64 -i image.png)",
"mimeType": "image/png"
}
}
]
}
]
} But in some cases the file extension may not be usable. In those cases I'm going to have a second option: llm prompt "extract text" --at myimage image/png Models that accept attachments should specify what |
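A minimal Python sketch of building that same request body, assuming the type is guessed from the file extension with the standard-library mimetypes module and can be overridden explicitly (the way --at would). The function name gemini_contents and its parameters are hypothetical, not part of LLM.

```python
import base64
import mimetypes


def gemini_contents(prompt, filename, data, mimetype=None):
    """Build the "contents" body shown above for one prompt and one file."""
    if mimetype is None:
        # Guess from the extension; guess_type returns (None, None) if unknown
        mimetype, _ = mimetypes.guess_type(filename)
    if mimetype is None:
        raise ValueError(
            f"Cannot guess type of {filename!r}; pass mimetype explicitly"
        )
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {
                        "inlineData": {
                            "data": base64.b64encode(data).decode("ascii"),
                            "mimeType": mimetype,
                        }
                    },
                ],
            }
        ]
    }
```

The explicit mimetype argument covers the case where the extension is missing or misleading.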
I'm going to use the term "attachments" for binary files returned by models as well. So far I have two examples of those:
|
Incoming attachments to the CLI tool can be specified in one of three ways:
Some models accept URLs directly, in which case the URL will be passed to the model. Other models don't, in which case LLM will detect that and download the image from the URL and send the bytes. |
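The URL fallback could look roughly like this. It is only a sketch: the Attachment class, the model_accepts_urls flag and the injected fetch callable are all hypothetical names, and the download is passed in as a function so the example does not depend on any particular HTTP library.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attachment:
    # Hypothetical container: one of url or content is expected to be set
    url: Optional[str] = None
    content: Optional[bytes] = None


def prepare_for_model(
    attachment: Attachment,
    model_accepts_urls: bool,
    fetch: Callable[[str], bytes],
):
    """Return ("url", str) or ("bytes", bytes) depending on what the model takes."""
    if attachment.url is not None:
        if model_accepts_urls:
            # Pass the URL straight through to the model
            return ("url", attachment.url)
        # Model cannot take URLs: download and send the bytes instead
        return ("bytes", fetch(attachment.url))
    return ("bytes", attachment.content)
```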
Here's an interesting challenge: do we resize the images before we send them or not? Different models have different recommendations around this. I expect there are some models out there that are vastly less expensive if you resize the image before sending it, in which case resizing is an important feature. We could use Pillow for that. Question is, how do we know what dimensions to resize to? Maybe this can be an option that the model classes themselves specify. We could have a CLI option for |
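If a model class did specify a maximum dimension, the target size could be computed like this. A sketch only: fit_within and max_side are hypothetical names, and the policy of never upscaling is an assumption.

```python
def fit_within(width, height, max_side):
    """Return (width, height) scaled down so the longest side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # assumption: never upscale small images
    scale = max_side / longest
    # Preserve aspect ratio; keep each side at least one pixel
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting tuple could then be handed to Pillow's Image.resize() to do the actual resampling.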
https://platform.openai.com/docs/guides/vision/calculating-costs says:
That's pretty complicated! It also exposes the need for a mechanism for sending detail low/high when making the API calls. One option could be to default to low and allow users of that model to do this: llm -m gpt-4o --at bigimage.png image-high/png So we abuse the I'll keep an eye out for any other oddities like that in other models that may need to be supported. |
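To get a feel for how much the detail setting matters, here is the calculation as I understand it from that page at the time of writing: low detail is a flat 85 tokens, while high detail fits the image within 2048x2048, scales the shortest side to 768px, then charges 170 tokens per 512px tile plus 85. The constants and the downscale-only behaviour are assumptions to check against the current documentation; the function name is hypothetical.

```python
import math


def openai_image_tokens(width, height, detail="high"):
    """Estimate image input tokens under the rules summarised above."""
    if detail == "low":
        return 85  # flat cost regardless of size
    # Step 1: fit within a 2048 x 2048 square
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: scale so the shortest side is 768px (assumption: downscale only)
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512px tiles, 170 tokens each, plus an 85 token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

Under these rules a 1024x1024 image costs 765 tokens at high detail versus 85 at low, which is the argument for defaulting to low.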
GPT-4o format support:
Here's what GPT-4o preview audio input looks like: {
"model": "gpt-4o-audio-preview",
"modalities": ["text", "audio"],
"audio": { "voice": "alloy", "format": "wav" },
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "What is in this recording?" },
{
"type": "input_audio",
"input_audio": {
"data": "<base64 bytes here>",
"format": "wav"
}
}
]
}
]
} Where Interestingly you don't need to pass the image type for images, even for base64 data: Since detail is optional I may ignore it for the first implementation of this. |
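A small Python sketch that produces the user message from the JSON above, given raw audio bytes. The helper name audio_message is hypothetical; the dictionary keys are copied from the example.

```python
import base64


def audio_message(prompt, audio_bytes, audio_format="wav"):
    """Build the user message with a text part and an input_audio part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "input_audio",
                "input_audio": {
                    # The API expects base64 text, not raw bytes
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                    "format": audio_format,
                },
            },
        ],
    }
```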
Claude models DO require a content type: {
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "/9j/4AAQSkZJRg..."
}
},
{
"type": "text",
"text": "What is in this image?"
}
]
} https://docs.anthropic.com/en/api/messages says
As far as I can tell Claude doesn't accept URLs to images, only base64 encoded data. |
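Since the media type is mandatory here, it may be worth sniffing it from the first few bytes when nothing else is known. This sketch checks the well-known signatures for four common image formats; sniff_image_type and claude_image_block are hypothetical helper names. (The "/9j/" prefix in the example above is simply the base64 encoding of the JPEG signature bytes.)

```python
import base64


def sniff_image_type(data):
    """Guess an image media type from magic bytes, or return None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def claude_image_block(data, media_type=None):
    """Build the image content block shown above, sniffing the type if needed."""
    media_type = media_type or sniff_image_type(data)
    if media_type is None:
        raise ValueError("Could not determine media_type; pass it explicitly")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(data).decode("ascii"),
        },
    }
```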
Gemini supports these image formats: https://ai.google.dev/gemini-api/docs/vision?lang=rest
Maybe LLM should know how to convert images from unsupported formats to supported formats? Not sure if that's worth the fuss, maybe a plugin thing at a later date? Gemini has a file API and really encourages you to upload images first... but it says that if your files add up to less than 20MB you can use base64 instead. I think I'll stick with base64 at first. Gemini can also do what it calls "document processing" - https://ai.google.dev/gemini-api/docs/document-processing?lang=rest
I definitely want to support these, especially since they can represent a big discount on overall cost because of the weird 258 token flat rate (also the rate for an image). And for audio: https://ai.google.dev/gemini-api/docs/audio?lang=rest
And video too! https://ai.google.dev/gemini-api/docs/vision?lang=rest#technical-details-video
I think |
As far as I can tell there is no way to provide Gemini with a URL to content that has NOT been uploaded first to the Google file service. So out of OpenAI, Anthropic, Google it looks like OpenAI are the only ones that accept an arbitrary URL to an image. |
I don't know if OpenAI accept URLs to both images and audio clips. To be safe, maybe the API design should have the ability to define a |
The Pixtral API accepts URLs: https://docs.mistral.ai/capabilities/vision/ [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": "https://tripfixers.com/wp-content/uploads/2019/11/eiffel-tower-with-snow.jpeg"
}
]
}
] Or base64 images: [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": "data:image/jpeg;base64,{base64_image}"
}
]
}
] Note that you don't have to specify Supported file types:
|
Groq API also supports both base64 and regular URLs, for Llama 3.1 vision models: https://console.groq.com/docs/vision [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,{base64_image}"
}
}
]
}
] [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Whats the weather like in this state?"
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
] |
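The Pixtral and Groq examples differ only in whether image_url is a bare string or an object with a url key, so one helper could cover both. A sketch, with image_url_part and its nested flag as hypothetical names:

```python
import base64


def image_url_part(url=None, data=None, mimetype=None, nested=True):
    """Build an image_url content part from either a URL or raw bytes."""
    if url is None:
        if data is None or mimetype is None:
            raise ValueError("Provide url, or both data and mimetype")
        # Inline the bytes as a data: URL
        encoded = base64.b64encode(data).decode("ascii")
        url = f"data:{mimetype};base64,{encoded}"
    # nested=True gives the Groq shape, nested=False the Pixtral shape above
    value = {"url": url} if nested else url
    return {"type": "image_url", "image_url": value}
```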
This worked against Groq: curl https://api.groq.com/openai/v1/chat/completions -s \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GROQ_API_KEY" \
-d '{
"model": "llama-3.2-11b-vision-preview",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in a great deal of detail, do not describe any people in it"
},
{
"type": "image_url",
"image_url": {
"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
}
}
]
}
]
}' | jq Returned: {
"id": "chatcmpl-33f4a341-1dd6-44fa-8dbd-90ed9d437b80",
"object": "chat.completion",
"created": 1729826570,
"model": "llama-3.2-11b-vision-preview",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The image depicts the iconic Statue of Liberty in New York Harbor, with the Manhattan skyline in the background. \n\nThe Statue of Liberty is prominently featured in the foreground, situated on a small island that juts out into the harbor. The statue's copper sheeting, which is normally a bright green due to oxidation, appears a slightly lighter shade in the image, possibly due to the lighting conditions or exposure of the photo. The statue's broken shackles and chains are visible, symbolizing the abolition of slavery.\n\nIn the background, the Manhattan skyline rises majestically, dominated by the towering skyscrapers of the Financial District. The image showcases several notable landmarks, including One World Trade Center, the former World Trade Center, and the majestic Brooklyn Bridge. The atmosphere of the image is peaceful and serenely beautiful, with the calm waters of the harbor reflecting the soft light of the setting sun. The overall mood is one of tranquility and wonder, inviting the viewer to appreciate the majesty of this iconic symbol of freedom."
},
"logprobs": null,
"finish_reason": "stop"
}
],
"usage": {
"queue_time": 0.258170439,
"prompt_tokens": 28,
"prompt_time": 0.001334961,
"completion_tokens": 207,
"completion_time": 0.421154809,
"total_tokens": 235,
"total_time": 0.42248977
},
"system_fingerprint": "fp_fa3d3d25b0",
"x_groq": {
"id": "req_01jb0v5ganf85s8gfzcxh9n50r"
}
}
|
OK, database design. Reminder: the current schema on https://llm.datasette.io/en/stable/logging.html#sql-schema looks like this: CREATE TABLE [conversations] (
[id] TEXT PRIMARY KEY,
[name] TEXT,
[model] TEXT
);
CREATE TABLE [responses] (
[id] TEXT PRIMARY KEY,
[model] TEXT,
[prompt] TEXT,
[system] TEXT,
[prompt_json] TEXT,
[options_json] TEXT,
[response] TEXT,
[response_json] TEXT,
[conversation_id] TEXT REFERENCES [conversations]([id]),
[duration_ms] INTEGER,
[datetime_utc] TEXT
);
CREATE VIRTUAL TABLE [responses_fts] USING FTS5 (
[prompt],
[response],
content=[responses]
); I'm going to have a new
At least one of Here's where things get tricky: how should these be associated with data in the These things can be used for both input AND output (the OpenAI audio output case). So maybe there are two many-to-many columns:
I guess for outputs I'll be populating just the I'm not crazy about those table names. Other option:
|
Should I still store the full If not I could invent my own clever JSON format, something like this: {
"model": "llama-3.2-11b-vision-preview",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe this image in a great deal of detail, do not describe any people in it"
},
{
"type": "image_url",
"image_url": {
"url": {
"$attachment": {
"id": "v01jax4p0rstqbs3fvbkkszvt",
"column": "url"
}
}
}
}
]
}
]
} So that |
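Expanding those placeholders when reading a logged prompt back could be a short recursive walk. A sketch under the assumption that attachments are available as a dictionary keyed by ID, with one entry per column; resolve_attachments and the lookup shape are hypothetical.

```python
def resolve_attachments(node, lookup):
    """Replace {"$attachment": {"id": ..., "column": ...}} nodes with stored values."""
    if isinstance(node, dict):
        if set(node) == {"$attachment"}:
            ref = node["$attachment"]
            # lookup maps attachment id -> {column name: value}
            return lookup[ref["id"]][ref["column"]]
        return {key: resolve_attachments(value, lookup) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_attachments(item, lookup) for item in node]
    return node  # strings, numbers, None pass through unchanged
```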
My vote would be for not duplicating the images/attachments in the database.
If the attachments are stored as sqlite blobs, you would also reduce the cost of base64-encoding their contents. I would definitely wish for there to be a proper sqlite foreign key and first-class (non-json) references between the two (three) tables. Giving it some more thought, it actually would be better to have the references in the prompt/response tables pointing to the attachments table (rather than the other way around) because I just realized that you might want to deduplicate the attachments. Given that you could, quite easily, imagine that a person would make 10+ queries in a row against the same attachment, which could be rather large (in the case of some of the gemini models, 10s of MiBs+). If you're already ingesting the full attachment, I would definitely say go ahead and calculate the sha256 (often hardware-accelerated and very much standardized) or wyhash (fastest general hash I've found ported to Python, from a quick search) as you ingest each chunk. It'll have fairly low additional overhead (since you're IO-limited) but you'll realize massive savings if you can prevent adding a possibly very large blob to the database each time. |
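Hashing while ingesting can be done with the standard-library hashlib, updating the digest once per chunk. A sketch; the function name ingest and the 64 KiB chunk size are arbitrary choices for illustration.

```python
import hashlib


def ingest(stream, chunk_size=65536):
    """Read a binary stream in chunks, returning (sha256 hex digest, content)."""
    hasher = hashlib.sha256()
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)  # incremental: no second pass over the data
        chunks.append(chunk)
    return hasher.hexdigest(), b"".join(chunks)
```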
Great point about de-duplicating attachments there, given the need to support long conversation threads. ... actually that's handled a bit already: the I'll still think about ways to avoid duplicate storage though - might even calculate a sha256 hash of the BLOB content and store that in a column (or maybe even use that as the ID itself?) A neat thing about using a SHA ID is that it means if you send the same stored image to multiple different LLMs (to compare their responses for example) you only record it once in the database. That's a pretty compelling reason to do this. Note that my current idea is that if you store ... so I may have some kind of option that means "store the images in the database BLOB columns anyway", maybe this: llm -m claude-3.5-sonnet "describe this image" -a image.png --store-attachments Could be |
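Using the hash as the primary key makes de-duplication nearly free in SQLite. A sketch with a hypothetical attachments table and column names, not the schema LLM ended up with:

```python
import hashlib
import sqlite3


def store_attachment(db, content, mimetype):
    """Insert content keyed by its sha256; a repeat insert is a no-op."""
    attachment_id = hashlib.sha256(content).hexdigest()
    db.execute(
        "INSERT OR IGNORE INTO attachments (id, type, content) VALUES (?, ?, ?)",
        (attachment_id, mimetype, content),
    )
    return attachment_id


# Hypothetical table, in memory for demonstration
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE attachments (id TEXT PRIMARY KEY, type TEXT, content BLOB)")
first = store_attachment(db, b"same image bytes", "image/png")
second = store_attachment(db, b"same image bytes", "image/png")
```

Both calls return the same ID and only one row is written, so sending one image to several models records it once.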
Is there anything I can do to help? I made a whole vision analysis cli based on claude. https://github.com/irthomasthomas/claude-vision |
Initial attempt at an diff --git a/llm/cli.py b/llm/cli.py
index a1b1457..1082dc7 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -30,7 +30,10 @@ from llm import (
from .migrations import migrate
from .plugins import pm
import base64
+from dataclasses import dataclass
+import httpx
import pathlib
+import puremagic
import pydantic
import readline
from runpy import run_module
@@ -48,6 +51,44 @@ warnings.simplefilter("ignore", ResourceWarning)
DEFAULT_TEMPLATE = "prompt: "
+@dataclass
+class Attachment:
+    mimetype: str
+    filepath: str
+    url: str
+    content: bytes
+
+
+class AttachmentType(click.ParamType):
+    name = "attachment"
+
+    def convert(self, value, param, ctx):
+        if value == "-":
+            content = sys.stdin.buffer.read()
+            # Try to guess type
+            try:
+                mimetype = puremagic.from_string(content, mime=True)
+            except puremagic.PureError:
+                raise click.BadParameter("Could not determine mimetype of stdin")
+            return Attachment(mimetype, None, None, content)
+        if "://" in value:
+            # Confirm URL exists and try to guess type
+            try:
+                response = httpx.head(value)
+                response.raise_for_status()
+                mimetype = response.headers.get("content-type")
+            except httpx.HTTPError as ex:
+                raise click.BadParameter(str(ex))
+            return Attachment(mimetype, None, value, None)
+        # Check that the file exists
+        path = pathlib.Path(value)
+        if not path.exists():
+            self.fail(f"File {value} does not exist", param, ctx)
+        # Try to guess type
+        mimetype = puremagic.from_file(str(path), mime=True)
+        return Attachment(mimetype, str(path), None, None)
+
+
def _validate_metadata_json(ctx, param, value):
if value is None:
return value
@@ -88,6 +129,22 @@ def cli():
@click.argument("prompt", required=False)
@click.option("-s", "--system", help="System prompt to use")
@click.option("model_id", "-m", "--model", help="Model to use")
+@click.option(
+ "attachments",
+ "-a",
+ "--attachment",
+ type=AttachmentType(),
+ multiple=True,
+ help="Attachment path or URL or -",
+)
+@click.option(
+ "attachment_types",
+ "--at",
+ "--attachment-type",
+ type=(str, str),
+ multiple=True,
+ help="Attachment with explicit mimetype",
+)
@click.option(
"options",
"-o",
@@ -127,6 +184,8 @@ def prompt(
prompt,
system,
model_id,
+ attachments,
+ attachment_types,
options,
template,
param,
@@ -143,6 +202,8 @@ def prompt(
Documentation: https://llm.datasette.io/en/stable/usage.html
"""
+ print(attachments)
+ return
if log and no_log:
raise click.ClickException("--log and --no-log are mutually exclusive")
diff --git a/setup.py b/setup.py
index 1f6adcd..b8b55bf 100644
--- a/setup.py
+++ b/setup.py
@@ -48,6 +48,7 @@ setup(
"setuptools",
"pip",
"pyreadline3; sys_platform == 'win32'",
+ "puremagic",
],
extras_require={
"test": [ |
I'm thinking about how the Python API is going to work. I'm leaning towards this: model = llm.get_model("gpt-4o")
response = model.prompt("Describe these images", open("image.jpg", "rb"), open("image2.jpg", "rb")) I could have that accept file-like objects or string paths or string URLs, or maybe I could tell people to do this instead: response = model.prompt(
"Describe these images",
llm.Attachment(url="https://..."),
llm.Attachment(path="image.jpg")
) I like that second option better, it's more fitting with Python's optional type hints. So the Right now the signature of that method looks like this: Lines 270 to 280 in d654c95
Technically this would be a breaking change, because Lines 42 to 51 in d654c95
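To make the second option concrete, here is a standalone sketch of what the call shape could look like. It is not the real llm API: this Attachment class, its fields and the prompt() stand-in are hypothetical, and making system keyword-only is an assumption about how the positional slots after the prompt would be freed up for attachments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attachment:
    # Hypothetical fields; exactly one of path, url or content should be set
    mimetype: Optional[str] = None
    path: Optional[str] = None
    url: Optional[str] = None
    content: Optional[bytes] = None

    def __post_init__(self):
        provided = [v for v in (self.path, self.url, self.content) if v is not None]
        if len(provided) != 1:
            raise ValueError("Provide exactly one of path, url or content")


def prompt(text, *attachments, system=None):
    """Stand-in for model.prompt(): validates and echoes its arguments."""
    for attachment in attachments:
        if not isinstance(attachment, Attachment):
            raise TypeError(f"Expected Attachment, got {type(attachment).__name__}")
    return {"prompt": text, "attachments": list(attachments), "system": system}


response = prompt(
    "Describe these images",
    Attachment(url="https://example.com/image.jpg"),
    Attachment(path="image.jpg"),
)
```

Requiring explicit Attachment objects means a stray string is rejected rather than guessed at as a path or URL.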
|
... well I got this to work (including some hacking around with
|
OK, I have a working prototype of this for both the default OpenAI plugin and the |
Design question: I keep mistakenly running this: llm -m gpt-4o 'ocr' example.jpg Which currently gives this error:
I could say that all extra options are treated as attachments. But it's been suggested in the past that llm -m gpt-4o capital of france Which I know is a nice pattern because https://github.com/simonw/llm-cmd does it: llm cmd use ffmpeg to convert blah.mp4 to mp3 So I have three options:
I'm torn between all three options at the moment. |
Here's a puremagic annoyance:
Note that the mp3 file was identified as
Can I get |
Tried this: python -c '
import puremagic, sys, pprint
pprint.pprint(
puremagic.magic_stream(open(sys.argv[-1], "rb"))
)' russian-pelican-in-spanish.mp3 Got:
|
Alpha is out!
|
This works too: alias llm="uvx --with 'llm==0.17a0' llm" Then: llm --version
|
* Prototype of attachments support * Support for continued attachment conversations Refs simonw/llm#587
https://github.com/simonw/llm-gemini/releases/tag/0.3a0 I've released this as an alpha and alias llm="uvx --with 'llm==0.17a0' --with 'llm-gemini==0.3a0' llm" And then this (idea to hit the webcam URL from Drew on Discord): llm -m gemini-1.5-flash-latest \
'how foggy is it on a scale of 1-10, also tell me the current time and date and elevation and vibes' \
-a 'https://cameras.alertcalifornia.org/public-camera-data/Axis-Purisma1/latest-frame.jpg'
|
Refs #19 Refs simonw/llm#587
https://github.com/simonw/llm-claude-3/releases/tag/0.6a0 uvx --with 'llm==0.17a0' --with 'llm-claude-3==0.6a0' \
llm -m claude-3.5-sonnet 'describe image' \
-a https://static.simonwillison.net/static/2024/pelicans.jpg
|
OK, this is great! I'm going to ship it. |
Previous work is in:
I'm going in a different direction. Previously I had just been thinking about images, but Gemini accepts PDFs, videos and audio clips, and the latest GPT-4o model supports audio clips too.
The llm prompt command isn't using -a for anything yet, so I'm going to have -a filename be the way an attachment (or multiple attachments) is added to a prompt. -a is short for --attachment - not for --attach because that already means something different for the llm embed-multi command (it attaches extra SQLite databases).
TODO
- llm 'describe image' -a image.jpeg working
- llm 'describe image' -a https://static.simonwillison.net/static/2024/imgcat.jpg
- cat image.jpeg | llm 'describe image' -a -
- Attachment class should not have code for httpx.get() fetching of content, since an asyncio wrapper may want to do that a different way.
- llm logs output for prompts with attachments
- llm logs --json output
Out of scope for this issue:
- llm chat support for attachments via !attachment path-or-url