Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upload langchain b64 images to S3 #326

Merged
merged 2 commits into from
Jan 19, 2025
Merged

upload langchain b64 images to S3 #326

merged 2 commits into from
Jan 19, 2025

Conversation

dinmukhamedm
Copy link
Member

@dinmukhamedm dinmukhamedm commented Jan 18, 2025

https://python.langchain.com/docs/how_to/multimodal_inputs/

This is a WIP, because

  • we upload images that are registered with the LLM spans themselves (i.e. ChatOpenAI.chat)
  • the immediate parent span (RunnableParallel<raw> for the example I tested with) and the root span (RunnableSequence) also have the base64 payload in the attribute we parse as input, as part of stringified JSON. Ideally, we should replace that with link or at least remove that too, in order to reduce load on our DB

Important

Adds functionality to upload base64 encoded images in ChatMessageContentPart::ImageUrl to S3 and replace the URL with the S3 link.

  • Behavior:
    • ChatMessageContentPart::ImageUrl now checks if the URL is a base64 encoded image using a regex pattern.
    • If base64, uploads image to S3 and replaces URL with S3 link in store_media().
  • Functions:
    • Updates store_media() in chat_message.rs to handle base64 image URLs.
  • Dependencies:
    • Adds regex crate for pattern matching base64 image URLs.

This description was created by Ellipsis for 3637364. It will automatically update as commits are pushed.

@dinmukhamedm
Copy link
Member Author

^ Full example of what I mean

"{\"inputs\": [{\"lc\": 1, \"type\": \"constructor\", \"id\": [\"langchain\", \"schema\", \"messages\", \"SystemMessage\"], \"kwargs\": {\"content\": \"You are a precise browser automation agent that interacts with websites through structured commands. Your role is to:\\n1. Analyze the provided webpage elements and structure\\n2. Plan a sequence of actions to accomplish the given task\\n3. Respond with valid JSON containing your action sequence and state assessment\\n\\nCurrent date and time: 2025-01-19 02:42\\n\\n\\nINPUT STRUCTURE:\\n1. Current URL: The webpage you're currently on\\n2. Available Tabs: List of open browser tabs\\n3. Interactive Elements: List in the format:\\n   index[:]<element_type>element_text</element_type>\\n   - index: Numeric identifier for interaction\\n   - element_type: HTML element type (button, input, etc.)\\n   - element_text: Visible text or element description\\n\\nExample:\\n33[:]<button>Submit Form</button>\\n_[:] Non-interactive text\\n\\n\\nNotes:\\n- Only elements with numeric indexes are interactive\\n- _[:] elements provide context but cannot be interacted with\\n\\n\\n\\n1. RESPONSE FORMAT: You must ALWAYS respond with valid JSON in this exact format:\\n   {\\n     \\\"current_state\\\": {\\n       \\\"evaluation_previous_goal\\\": \\\"Success|Failed|Unknown - Analyze the current elements and the image to check if the previous goals/actions are successful like intended by the task. Ignore the action result. The website is the ground truth. Also mention if something unexpected happened like new suggestions in an input field. Shortly state why/why not\\\",\\n       \\\"memory\\\": \\\"Description of what has been done and what you need to remember until the end of the task\\\",\\n       \\\"next_goal\\\": \\\"What needs to be done with the next actions\\\"\\n     },\\n     \\\"action\\\": [\\n       {\\n         \\\"one_action_name\\\": {\\n           // action-specific parameter\\n         }\\n       },\\n       // ... more actions in sequence\\n     ]\\n   }\\n\\n2. ACTIONS: You can specify multiple actions in the list to be executed in sequence. But always specify only one action name per item. \\n\\n   Common action sequences:\\n   - Form filling: [\\n       {\\\"input_text\\\": {\\\"index\\\": 1, \\\"text\\\": \\\"username\\\"}},\\n       {\\\"input_text\\\": {\\\"index\\\": 2, \\\"text\\\": \\\"password\\\"}},\\n       {\\\"click_element\\\": {\\\"index\\\": 3}}\\n     ]\\n   - Navigation and extraction: [\\n       {\\\"open_new_tab\\\": {}},\\n       {\\\"go_to_url\\\": {\\\"url\\\": \\\"https://example.com\\\"}},\\n       {\\\"extract_page_content\\\": {}}\\n     ]\\n\\n\\n3. ELEMENT INTERACTION:\\n   - Only use indexes that exist in the provided element list\\n   - Each element has a unique index number (e.g., \\\"33[:]<button>\\\")\\n   - Elements marked with \\\"_[:]\\\" are non-interactive (for context only)\\n\\n4. NAVIGATION & ERROR HANDLING:\\n   - If no suitable elements exist, use other functions to complete the task\\n   - If stuck, try alternative approaches\\n   - Handle popups/cookies by accepting or closing them\\n   - Use scroll to find elements you are looking for\\n\\n5. TASK COMPLETION:\\n   - Use the done action as the last action as soon as the task is complete\\n   - Don't hallucinate actions\\n   - If the task requires specific information - make sure to include everything in the done function. This is what the user will see.\\n   - If you are running out of steps (current step), think about speeding it up, and ALWAYS use the done action as the last action.\\n\\n6. VISUAL CONTEXT:\\n   - When an image is provided, use it to understand the page layout\\n   - Bounding boxes with labels correspond to element indexes\\n   - Each bounding box and its label have the same color\\n   - Most often the label is inside the bounding box, on the top right\\n   - Visual context helps verify element locations and relationships\\n   - sometimes labels overlap, so use the context to verify the correct element\\n\\n7. Form filling:\\n   - If you fill an input field and your action sequence is interrupted, most often a list with suggestions popped up under the field and you need to first select the right element from the suggestion list.\\n\\n8. ACTION SEQUENCING:\\n   - Actions are executed in the order they appear in the list \\n   - Each action should logically follow from the previous one\\n   - If the page changes after an action, the sequence is interrupted and you get the new state.\\n   - If content only disappears the sequence continues.\\n   - Only provide the action sequence until you think the page will change.\\n   - Try to be efficient, e.g. fill forms at once, or chain actions where nothing changes on the page like saving, extracting, checkboxes...\\n   - only use multiple actions if it makes sense. \\n\\n\\n   - use maximum 10 actions per sequence\\n\\nFunctions:\\nSearch Google in the current tab: \\n{search_google: {'query': {'type': 'string'}}}\\nNavigate to URL in the current tab: \\n{go_to_url: {'url': {'type': 'string'}}}\\nGo back: \\n{go_back: {}}\\nClick element: \\n{click_element: {'index': {'type': 'integer'}, 'xpath': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None}}}\\nInput text into a input interactive element: \\n{input_text: {'index': {'type': 'integer'}, 'text': {'type': 'string'}, 'xpath': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'default': None}}}\\nSwitch tab: \\n{switch_tab: {'page_id': {'type': 'integer'}}}\\nOpen url in new tab: \\n{open_tab: {'url': {'type': 'string'}}}\\nExtract page content to get the pure text or markdown with links if include_links is set to true: \\n{extract_content: {'include_links': {'type': 'boolean'}}}\\nComplete task: \\n{done: {'text': {'type': 'string'}}}\\nScroll down the page by pixel amount - if no amount is specified, scroll down one page: \\n{scroll_down: {'amount': {'anyOf': [{'type': 'integer'}, {'type': 'null'}], 'default': None}}}\\nScroll up the page by pixel amount - if no amount is specified, scroll up one page: \\n{scroll_up: {'amount': {'anyOf': [{'type': 'integer'}, {'type': 'null'}], 'default': None}}}\\nSend strings of special keys like Backspace, Insert, PageDown, Delete, Enter, Shortcuts such as `Control+o`, `Control+Shift+T` are supported as well. This gets used in keyboard.press. Be aware of different operating systems and their shortcuts: \\n{send_keys: {'keys': {'type': 'string'}}}\\nIf you dont find something which you want to interact with, scroll to it: \\n{scroll_to_text: {'text': {'type': 'string'}}}\\nGet all options from a native dropdown: \\n{get_dropdown_options: {'index': {'type': 'integer'}}}\\nSelect dropdown option for interactive element index by the text of the option you want to select: \\n{select_dropdown_option: {'index': {'type': 'integer'}, 'text': {'type': 'string'}}}\\n\\nRemember: Your responses must be valid JSON matching the specified format. Each action in the sequence must be valid.\", \"type\": \"system\"}}, {\"lc\": 1, \"type\": \"constructor\", \"id\": [\"langchain\", \"schema\", \"messages\", \"AIMessage\"], \"kwargs\": {\"content\": \"[{'name': 'AgentOutput', 'args': {'current_state': {'evaluation_previous_goal': 'Unknown - No previous actions to evaluate.', 'memory': '', 'next_goal': 'Obtain task from user'}, 'action': []}, 'id': '', 'type': 'tool_call'}]\", \"type\": \"ai\", \"tool_calls\": [], \"invalid_tool_calls\": []}}, {\"lc\": 1, \"type\": \"constructor\", \"id\": [\"langchain\", \"schema\", \"messages\", \"HumanMessage\"], \"kwargs\": {\"content\": \"Your ultimate task is: open google and search for cat images. If you achieved your ultimate task, stop everything and use the done action in the next step to complete the task. If not, continue as usual.\", \"type\": \"human\"}}, {\"lc\": 1, \"type\": \"constructor\", \"id\": [\"langchain\", \"schema\", \"messages\", \"HumanMessage\"], \"kwargs\": {\"content\": [{\"type\": \"text\", \"text\": \"\\n\\nCurrent url: about:blank\\nAvailable tabs:\\n[TabInfo(page_id=0, url='about:blank', title='')]\\nInteractive elements from current page view:\\nempty page\\n\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,

BASE^$ OMITTED HERE

\"}}], \"type\": \"human\"}}], \"tags\": [], \"metadata\": {}, \"kwargs\": {\"name\": \"RunnableSequence\"}}"

@dinmukhamedm
Copy link
Member Author

For the second TODO, created a ticket in Linear. Marking ready and merging

@dinmukhamedm dinmukhamedm marked this pull request as ready for review January 19, 2025 08:36
Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 3637364 in 1 minute and 32 seconds

More details
  • Looked at 40 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 drafted comments based on config settings.
1. app-server/src/language_model/chat_message.rs:272
  • Draft comment:
    The regex pattern [a-zA-z] has a typo. It should be [a-zA-Z] to correctly match all alphabetic characters. This issue is present on lines 272 and 8.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    The regex pattern [a-zA-z] is technically incorrect - the z should be uppercase in the range. However, in practice, MIME types are case-insensitive and typically use lowercase letters. The pattern will still match all valid MIME types like "image/jpeg", "image/png" etc. The error is more of a technical nitpick than a functional issue.
    The regex could potentially miss some edge cases with uppercase MIME types. Also, since this is a code quality issue, maybe we should keep it for correctness.
    While technically incorrect, the pattern will work correctly for all real-world use cases. The comment about line 8 is also incorrect, reducing credibility.
    Delete the comment. The regex issue is minor and won't cause any practical problems. The incorrect reference to line 8 suggests the comment wasn't thoroughly validated.

Workflow ID: wflow_PeeRBO1mDKcXxmsx


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@dinmukhamedm dinmukhamedm merged commit f8672b7 into dev Jan 19, 2025
2 checks passed
@dinmukhamedm dinmukhamedm deleted the fix/langchain-b64 branch January 19, 2025 08:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant