Fix ollama issues #329

Merged
merged 6 commits on Dec 17, 2024
Changes from all commits
60 changes: 50 additions & 10 deletions README.md
@@ -36,9 +36,10 @@ You can also run VisionAgent in a local Jupyter Notebook. Here are some example
Check out the [notebooks](https://github.com/landing-ai/vision-agent/blob/main/examples/notebooks) folder for more examples.


### Installation
### Get Started
To get started with the Python library, you can install it using pip:

#### Installation and Setup
```bash
pip install vision-agent
```
@@ -47,11 +48,17 @@ Ensure you have both an Anthropic key and an OpenAI API key set in your environment
variables (if you are using Azure OpenAI, please see the Azure setup section):

```bash
export ANTHROPIC_API_KEY="your-api-key" # needed for VisionAgent and VisionAgentCoder
export OPENAI_API_KEY="your-api-key" # needed for ToolRecommender
export ANTHROPIC_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"
```

### Basic Usage
---
**NOTE**
You must have both Anthropic and OpenAI API keys set in your environment variables to
use VisionAgent. If you don't have an Anthropic key you can use Ollama as a backend.
---

#### Chatting with VisionAgent
To get started you can just import the `VisionAgent` and start chatting with it:
```python
>>> from vision_agent.agent import VisionAgent
@@ -67,6 +74,40 @@ The chat messages are similar to `OpenAI`'s format with `role` and `content` keys;
in addition to those you can add `media`, which is a list of media files that can either
be images or video files.
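
For illustration, a chat turn with an attached image might look like the sketch below (the exact message shapes accepted can vary between versions, so treat this as an approximation rather than the definitive API):

```python
>>> from vision_agent.agent import VisionAgent
>>> agent = VisionAgent()
>>> resp = agent(
...     [
...         {
...             "role": "user",
...             "content": "How many people are in this image?",
...             # "media" is an optional list of image or video file paths
...             "media": ["people.jpg"],
...         }
...     ]
... )
```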

#### Getting Code from VisionAgent
You can also use `VisionAgentCoder` to generate code for you:

```python
>>> from vision_agent.agent import VisionAgentCoder
>>> agent = VisionAgentCoder(verbosity=2)
>>> code = agent("Count the number of people in this image", media="people.jpg")
```

#### Don't have Anthropic/OpenAI API keys?
You can use `OllamaVisionAgentCoder` which uses Ollama as the backend. To get started
pull the models:

```bash
ollama pull llama3.2-vision
ollama pull mxbai-embed-large
```

Then you can use it just like you would use `VisionAgentCoder`:

```python
>>> from vision_agent.agent import OllamaVisionAgentCoder
>>> agent = OllamaVisionAgentCoder(verbosity=2)
>>> code = agent("Count the number of people in this image", media="people.jpg")
```

---
**NOTE**
Smaller open-source models like Llama 3.1 8B will not work well with VisionAgent. You
will encounter many coding errors because they generate incorrect code, or JSON decoding
errors because they generate malformed JSON. We recommend using larger models or
Anthropic/OpenAI models.
---
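
If you want to stay on Ollama but try a bigger model, `OllamaVisionAgentCoder` also accepts custom models, as the `vision_agent_coder.py` diff further down shows. The sketch below assumes the `OllamaLMM` import path and uses an illustrative model name; it is not part of this PR:

```python
>>> from vision_agent.agent import OllamaVisionAgentCoder
>>> from vision_agent.lmm import OllamaLMM  # import path assumed
>>> # any larger model you have pulled with `ollama pull` could be plugged in here
>>> coder = OllamaLMM(model_name="llama3.3", temperature=0.0)  # illustrative model name
>>> agent = OllamaVisionAgentCoder(coder=coder, verbosity=2)
>>> code = agent("Count the number of people in this image", media="people.jpg")
```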

## Documentation

[VisionAgent Library Docs](https://landing-ai.github.io/vision-agent/)
@@ -400,15 +441,14 @@ Usage is the same as `VisionAgentCoder`:
`OllamaVisionAgentCoder` uses Ollama. To get started you must download a few models:

```bash
ollama pull llama3.1
ollama pull llama3.2-vision
ollama pull mxbai-embed-large
```

`llama3.1` is used for the `OllamaLMM` for `OllamaVisionAgentCoder`. Normally we would
use an actual LMM such as `llava` but `llava` cannot handle the long context lengths
required by the agent. Since `llama3.1` cannot handle images you may see some
performance degradation. `mxbai-embed-large` is the embedding model used to look up
tools. You can use it just like you would use `VisionAgentCoder`:
`llama3.2-vision` is the model used by `OllamaLMM` for `OllamaVisionAgentCoder`. Because
`llama3.2-vision` is a smaller model, you **WILL see performance degradation** compared to
using Anthropic or OpenAI models. `mxbai-embed-large` is the embedding model used to
look up tools. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
84 changes: 74 additions & 10 deletions docs/index.md
@@ -1,7 +1,15 @@
<div align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/landing-ai/vision-agent/blob/main/assets/logo_light.svg?raw=true">
<source media="(prefers-color-scheme: light)" srcset="https://github.com/landing-ai/vision-agent/blob/main/assets/logo_dark.svg?raw=true">
<img alt="VisionAgent" height="200px" src="https://github.com/landing-ai/vision-agent/blob/main/assets/logo_light.svg?raw=true">
</picture>

[![](https://dcbadge.vercel.app/api/server/wPdN8RCYew?compact=true&style=flat)](https://discord.gg/wPdN8RCYew)
![ci_status](https://github.com/landing-ai/vision-agent/actions/workflows/ci_cd.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/vision-agent.svg)](https://badge.fury.io/py/vision-agent)
![version](https://img.shields.io/pypi/pyversions/vision-agent)
</div>

VisionAgent is a library that helps you utilize agent frameworks to generate code to
solve your vision task. Check out our discord for updates and roadmaps!
@@ -20,10 +28,18 @@ solve your vision task. Check out our discord for updates and roadmaps!
The fastest way to test out VisionAgent is to use our web application. You can find it
[here](https://va.landing.ai/).

### Local Jupyter Notebook
You can also run VisionAgent in a local Jupyter Notebook. Here are some examples of using VisionAgent:

1. [Counting cans in an image](https://github.com/landing-ai/vision-agent/blob/main/examples/notebooks/counting_cans.ipynb)

Check out the [notebooks](https://github.com/landing-ai/vision-agent/blob/main/examples/notebooks) folder for more examples.


### Installation
### Get Started
To get started with the Python library, you can install it using pip:

#### Installation and Setup
```bash
pip install vision-agent
```
@@ -32,11 +48,17 @@ Ensure you have both an Anthropic key and an OpenAI API key set in your environment
variables (if you are using Azure OpenAI, please see the Azure setup section):

```bash
export ANTHROPIC_API_KEY="your-api-key" # needed for VisionAgent and VisionAgentCoder
export OPENAI_API_KEY="your-api-key" # needed for ToolRecommender
export ANTHROPIC_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"
```

### Basic Usage
---
**NOTE**
You must have both Anthropic and OpenAI API keys set in your environment variables to
use VisionAgent. If you don't have an Anthropic key you can use Ollama as a backend.
---

#### Chatting with VisionAgent
To get started you can just import the `VisionAgent` and start chatting with it:
```python
>>> from vision_agent.agent import VisionAgent
@@ -52,6 +74,40 @@ The chat messages are similar to `OpenAI`'s format with `role` and `content` keys;
in addition to those you can add `media`, which is a list of media files that can either
be images or video files.
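
As a rough sketch of that message shape (field names follow the description above; exact handling may vary by version):

```python
>>> message = {
...     "role": "user",
...     "content": "How many people are in this image?",
...     # "media" holds paths to image or video files
...     "media": ["people.jpg"],
... }
>>> resp = agent([message])  # assuming `agent` is the VisionAgent instance created above
```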

#### Getting Code from VisionAgent
You can also use `VisionAgentCoder` to generate code for you:

```python
>>> from vision_agent.agent import VisionAgentCoder
>>> agent = VisionAgentCoder(verbosity=2)
>>> code = agent("Count the number of people in this image", media="people.jpg")
```

#### Don't have Anthropic/OpenAI API keys?
You can use `OllamaVisionAgentCoder` which uses Ollama as the backend. To get started
pull the models:

```bash
ollama pull llama3.2-vision
ollama pull mxbai-embed-large
```

Then you can use it just like you would use `VisionAgentCoder`:

```python
>>> from vision_agent.agent import OllamaVisionAgentCoder
>>> agent = OllamaVisionAgentCoder(verbosity=2)
>>> code = agent("Count the number of people in this image", media="people.jpg")
```

---
**NOTE**
Smaller open-source models like Llama 3.1 8B will not work well with VisionAgent. You
will encounter many coding errors because they generate incorrect code, or JSON decoding
errors because they generate malformed JSON. We recommend using larger models or
Anthropic/OpenAI models.
---

## Documentation

[VisionAgent Library Docs](https://landing-ai.github.io/vision-agent/)
@@ -385,15 +441,14 @@ Usage is the same as `VisionAgentCoder`:
`OllamaVisionAgentCoder` uses Ollama. To get started you must download a few models:

```bash
ollama pull llama3.1
ollama pull llama3.2-vision
ollama pull mxbai-embed-large
```

`llama3.1` is used for the `OllamaLMM` for `OllamaVisionAgentCoder`. Normally we would
use an actual LMM such as `llava` but `llava` cannot handle the long context lengths
required by the agent. Since `llama3.1` cannot handle images you may see some
performance degradation. `mxbai-embed-large` is the embedding model used to look up
tools. You can use it just like you would use `VisionAgentCoder`:
`llama3.2-vision` is the model used by `OllamaLMM` for `OllamaVisionAgentCoder`. Because
`llama3.2-vision` is a smaller model, you **WILL see performance degradation** compared to
using Anthropic or OpenAI models. `mxbai-embed-large` is the embedding model used to
look up tools. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
@@ -454,3 +509,12 @@ agent = va.agent.AzureVisionAgentCoder()
Failure to have sufficient API credits may result in limited or no functionality for
the features that rely on the OpenAI API. For more details on managing your API usage
and credits, please refer to the OpenAI API documentation.


******************************************************************************************************************************

## Troubleshooting

### 1. Encountering a `ModuleNotFoundError` while VisionAgent is generating code

If you keep seeing a `ModuleNotFoundError` while VisionAgent is generating code, and VisionAgent gets stuck because it cannot install the missing dependencies, you can manually add the missing dependencies to your Python environment with `pip install <missing_package_name>` and then try generating code again.
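
For example, if the generated code fails with an error like the one below, install the package yourself and re-run (the package name here is purely illustrative):

```bash
# Example failure printed while running the generated code:
#   ModuleNotFoundError: No module named 'seaborn'
pip install seaborn
```
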
3 changes: 2 additions & 1 deletion examples/chat/README.md
@@ -1,5 +1,6 @@
# Example App
This is an example application to demonstrate how to run VisionAgent locally.
This is an example application to demonstrate how to run VisionAgentV2 locally. This
only works with the V2 version of VisionAgent.


# Quick Start
11 changes: 4 additions & 7 deletions vision_agent/agent/vision_agent_coder.py
@@ -644,12 +644,9 @@ class OllamaVisionAgentCoder(VisionAgentCoder):
"""VisionAgentCoder that uses Ollama models for planning, coding, testing.

Pre-requisites:
1. Run ollama pull llama3.1 for the LLM
1. Run ollama pull llama3.2-vision for the LMM
2. Run ollama pull mxbai-embed-large for the embedding similarity model

Technically you should use a VLM such as llava but llava is not able to handle the
context length and crashes.

Example
-------
>>> import vision_agent as va
@@ -674,17 +671,17 @@ def __init__(
else planner
),
coder=(
OllamaLMM(model_name="llama3.1", temperature=0.0)
OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
if coder is None
else coder
),
tester=(
OllamaLMM(model_name="llama3.1", temperature=0.0)
OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
if tester is None
else tester
),
debugger=(
OllamaLMM(model_name="llama3.1", temperature=0.0)
OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
if debugger is None
else debugger
),
2 changes: 1 addition & 1 deletion vision_agent/agent/vision_agent_planner.py
@@ -532,7 +532,7 @@ def __init__(
) -> None:
super().__init__(
planner=(
OllamaLMM(model_name="llama3.1", temperature=0.0)
OllamaLMM(model_name="llama3.2-vision", temperature=0.0)
if planner is None
else planner
),
8 changes: 4 additions & 4 deletions vision_agent/agent/vision_agent_planner_prompts.py
@@ -62,20 +62,20 @@
- Count the number of detected objects labeled as 'person'.
plan3:
- Load the image from the provided file path 'image.jpg'.
- Use the 'countgd_counting' tool to count the dominant foreground object, which in this case is people.
- Use the 'countgd_object_detection' tool to count the dominant foreground object, which in this case is people.

```python
from vision_agent.tools import load_image, owl_v2_image, florence2_sam2_image, countgd_counting
from vision_agent.tools import load_image, owl_v2_image, florence2_sam2_image, countgd_object_detection
image = load_image("image.jpg")
owl_v2_out = owl_v2_image("person", image)

f2s2_out = florence2_sam2_image("person", image)
# strip out the masks from the output because they don't provide useful information when printed
f2s2_out = [{{k: v for k, v in o.items() if k != "mask"}} for o in f2s2_out]

cgd_out = countgd_counting(image)
cgd_out = countgd_object_detection("person", image)

final_out = {{"owl_v2_image": owl_v2_out, "florence2_sam2_image": f2s2, "countgd_counting": cgd_out}}
final_out = {{"owl_v2_image": owl_v2_out, "florence2_sam2_image": f2s2_out, "countgd_object_detection": cgd_out}}
print(final_out)
--- END EXAMPLE1 ---

15 changes: 7 additions & 8 deletions vision_agent/utils/sim.py
@@ -58,18 +58,19 @@ def __init__(
"""
self.df = df
self.client = OpenAI(api_key=api_key)
self.emb_call = (
lambda x: self.client.embeddings.create(input=x, model=model)
.data[0]
.embedding
)
self.model = model
if "embs" not in df.columns and sim_key is None:
raise ValueError("key is required if no column 'embs' is present.")

if sim_key is not None:
self.df["embs"] = self.df[sim_key].apply(
lambda x: get_embedding(
lambda text: self.client.embeddings.create(
input=text, model=self.model
)
.data[0]
.embedding,
self.emb_call,
x,
)
)
@@ -126,9 +127,7 @@ def top_k(
"""

embedding = get_embedding(
lambda text: self.client.embeddings.create(input=text, model=self.model)
.data[0]
.embedding,
self.emb_call,
query,
)
self.df["sim"] = self.df.embs.apply(lambda x: 1 - cosine(x, embedding))