Multimodal: <img x.jpg> will only detect filename, and ignore HTML #54
Conversation
Hey @BabyCNM... thanks for submitting this, it looks useful. Would you be able to comment with some short examples of how images and audio are currently included, and then updated examples that work with your code? I'm happy to test it out :)
Here is an example where the current implementation fails but the edited version works:
prompt = """Read the screenshot image and the website's source code. Then, answer the user's question.
User Question: is the button below or above the image?
Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>
--- HTML Code ---
<!DOCTYPE html>
<html lang="en">
<img src="website/relative/path/300.jpg" alt="Placeholder Image">
<button onclick="alert('Button clicked!')">Click Me</button>
</body>
</html>
"""
Note that the "<img" tag appears in two places. The first one should be interpreted as an image to be sent to GPT-4o, while the second one (embedded in the HTML code) should be treated as plain code, not as an image.
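To make the intended outcome concrete, here is a minimal sketch of how such a prompt could be split into text and image parts. It is not the code from this PR: `split_prompt`, `STRICT_IMG`, and the part dictionaries are hypothetical names chosen for illustration, and the rule assumed is that only a bare path (no spaces, no quotes) counts as an image tag.

```python
import re
from typing import Dict, List

# Hypothetical strict pattern (an assumption, not autogen's actual regex):
# the tag body must be a bare path with no spaces or quotes.
STRICT_IMG = re.compile(r"<img\s+([^\s\"'<>]+)\s*>")


def split_prompt(prompt: str) -> List[Dict[str, str]]:
    """Split a prompt into text and image parts.

    HTML-style tags such as <img src="..."> do not match the strict
    pattern, so they stay inside the surrounding text part.
    """
    parts: List[Dict[str, str]] = []
    last = 0
    for m in STRICT_IMG.finditer(prompt):
        if m.start() > last:
            parts.append({"type": "text", "text": prompt[last:m.start()]})
        parts.append({"type": "image", "path": m.group(1)})
        last = m.end()
    if last < len(prompt):
        parts.append({"type": "text", "text": prompt[last:]})
    return parts


example = (
    "Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>\n"
    '<img src="website/relative/path/300.jpg" alt="Placeholder Image">\n'
)
parts = split_prompt(example)
```

With this rule, the screenshot path becomes its own image part, and the HTML `<img src=...>` line remains untouched in the trailing text part.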
Why are these changes needed?
The autogen tag parsing system uses HTML-like tags to allow users to input images and audio directly from text. However, this system may mistakenly interpret actual HTML content (such as website source code) as multimodal components for GPT-4o and other VLMs, which is undesirable.
Fortunately, autogen's tag format differs from HTML: in autogen, file paths do not require quotation marks. To improve parsing accuracy, we've introduced a strict_filepath_match parameter for the multimodal utilities. When enabled (True), only simple tag contents (no spaces or quotes) are matched, so bare filenames are detected and more complex HTML syntax is ignored. This parameter is enabled (True) when parsing multimodal agents' messages.
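The difference between loose and strict matching can be sketched roughly as follows. The regexes and the `find_img_tags` helper are assumptions made for illustration and may differ from what the PR actually implements; only the `strict_filepath_match` name comes from the description above.

```python
import re
from typing import List

# Hypothetical patterns for illustration; not autogen's actual regexes.
# Loose: anything between "<img " and ">" is taken as the tag content.
LOOSE_IMG = re.compile(r"<img\s+([^>]+)>")
# Strict: the content must be a single token without spaces or quotes.
STRICT_IMG = re.compile(r"<img\s+([^\s\"'<>]+)\s*>")


def find_img_tags(text: str, strict_filepath_match: bool = False) -> List[str]:
    """Return the contents of every <img ...> tag matched in text."""
    pattern = STRICT_IMG if strict_filepath_match else LOOSE_IMG
    return [m.group(1).strip() for m in pattern.finditer(text)]


sample = (
    "Screenshot: <img C:/User/xyz/Desktop/screenshot_3.jpg>\n"
    '<img src="website/relative/path/300.jpg" alt="Placeholder Image">\n'
)
loose = find_img_tags(sample)  # also picks up the HTML tag
strict = find_img_tags(sample, strict_filepath_match=True)  # bare path only
```

In this sketch the loose pattern returns both tags, including the HTML attributes of the second one, while the strict pattern returns only the screenshot path. One consequence of the strict rule is that a path containing spaces would not be matched.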
Note: This is a custom tagging convention, which could be confusing for some users. Please share any recommendations regarding the current design. Further simplification of the message component is planned for future updates.