Ollama integration #30

Closed
Kydlaw opened this issue Apr 23, 2024 · 2 comments

@Kydlaw
Contributor

Kydlaw commented Apr 23, 2024

Description

Description of the feature: Provide the ability to use Ollama in SemanticIngestionPipeline (currently it only supports proprietary models).

This would make it possible to use semantic parsing without spending money on a proprietary model.
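
For illustration, here is a minimal sketch of what the Ollama side could look like, assuming a local Ollama server on its default port and an embedding model such as nomic-embed-text; the helper name is hypothetical and not part of the current openparse API:

import requests

def ollama_embeddings(texts, model="nomic-embed-text", host="http://localhost:11434"):
    # Hypothetical helper: fetch one embedding per input text from a local
    # Ollama server via its /api/embeddings endpoint.
    embeddings = []
    for text in texts:
        resp = requests.post(f"{host}/api/embeddings", json={"model": model, "prompt": text})
        resp.raise_for_status()
        embeddings.append(resp.json()["embedding"])
    return embeddings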

Why the feature should be added to openparse (as opposed to another library or just implemented in your code):
The interface for a similar feature already exists in this library, and I'm not aware of a straightforward way to work around the existing code to inject this feature into the current openparse.

I can contribute this feature if this interests you.

@Filimoa
Owner

Filimoa commented Apr 23, 2024

This is a duplicate of #8. You can track progress in #23. The main difficulty is that we currently use a hard-coded similarity cutoff that works well for OpenAI's models, but each embedding model will have its own optimal cutoff. There are a couple of approaches to dealing with this:

1. Start using a percentile cutoff.

This is the approach that LangChain and llama-index use (see the sketch at the end of this comment). In my limited testing, finding the optimal percentile is still not trivial, and I found it to perform worse than a hard-coded cutoff.

We could offload choosing this to the user, but the library aims to have opinionated defaults.

2. Figure out cutoff dynamically

We would generate examples of text that should / shouldn't be combined and use this to figure out a similarity threshold.

similar_pairs = [("very similar text", "continuation"), ...]  # pairs of text that should be combined

similarities = []
for text1, text2 in similar_pairs:
    sim = get_similarity(text1, text2)  # similarity between the two texts' embeddings
    similarities.append(sim)

avg_cutoff = sum(similarities) / len(similarities)  # average similarity of positive pairs, used as the threshold

While this is kind of dirty, this is the approach I'm currently leaning toward.
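
For reference, a minimal sketch of the percentile idea from option 1, assuming the cosine similarities between consecutive nodes have already been computed; the 95th-percentile breakpoint is an arbitrary example, not a recommended default:

import numpy as np

def percentile_split_points(adjacent_similarities, breakpoint_percentile=95):
    # Split wherever the similarity between consecutive nodes falls below the
    # (100 - breakpoint_percentile)-th percentile of the observed similarities,
    # instead of relying on a fixed hard-coded threshold.
    sims = np.asarray(adjacent_similarities, dtype=float)
    threshold = np.percentile(sims, 100 - breakpoint_percentile)
    return [i + 1 for i, sim in enumerate(sims) if sim < threshold]

The downside, as noted above, is that the right percentile still has to be chosen somehow.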

@Kydlaw
Contributor Author

Kydlaw commented Apr 24, 2024

I apologize for the duplicate (I didn't see the links to #21 and #23... in #8).

Ok, I see and understand the problem. It is indeed very hard to provide good defaults on that.
I'll have a look at your progress in #23 and see if I can maybe suggest something.

I'm closing this issue as it doesn't provide anything useful.

Kydlaw closed this as completed Apr 24, 2024