Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add html parsing and web scraping #86

Merged
merged 3 commits into from
Sep 29, 2024

Conversation

mattlindsey
Copy link
Contributor

For issue #68.

Still work in progress, but hoping for feedback. Try importing a page from a site. Need to fix tests.

@nickthecook
Copy link
Owner

Thanks for the PR for such a cool feature. I'll take a look this morning!

Copy link
Owner

@nickthecook nickthecook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! This fits really well into the existing parser structure.

I tried it with the Wikipedia article on LLMs, and it worked.

Seems like it will fit into Document well enough too; are you thinking of making filename optional, or maybe renaming it to title and trying to populate it with the title of the page?

app/controllers/documents_controller.rb Outdated Show resolved Hide resolved
@mattlindsey
Copy link
Contributor Author

Thanks @nickthecook. I want to work on this a little more, but unfortunately the EmbedChunkJob on my machine has stopped working with an active record validation error (see below).
I tried dropping and recreating the database and containers, but still get this error when adding any document. Any ideas?

llmjobs | 2024-09-27T21:02:06.096Z pid=20775 tid=avr class=EmbedChunkJob jid=faf83fb854fbd5f560bcca46 INFO: start
llmjobs | 2024-09-27T21:02:06.145Z pid=20775 tid=avr class=EmbedChunkJob jid=faf83fb854fbd5f560bcca46 elapsed=0.049 INFO: fail
llmjobs | 2024-09-27T21:02:06.145Z pid=20775 tid=avr WARN: {"context":"Job raised exception","job":{"retry":3,"queue":"llm","args":[10],"class":"EmbedChunkJob","jid":"faf83fb854fbd5f560bcca46","created_at":1727470879.495826,"enqueued_at":1727470926.0948238,"error_message":"Validation failed: Key has already been taken","error_class":"ActiveRecord::RecordInvalid","failed_at":1727470879.701788,"retry_count":1,"retried_at":1727470897.364526}}
llmjobs | 2024-09-27T21:02:06.145Z pid=20775 tid=avr WARN: ActiveRecord::RecordInvalid: Validation failed: Key has already been taken
llmjobs | 2024-09-27T21:02:06.145Z pid=20775 tid=avr WARN: app/models/setting.rb:16:in `set'
llmjobs | app/models/setting.rb:10:in `get'
llmjobs | app/lib/llm_clients/client.rb:63:in `timeout_retries'
llmjobs | app/lib/llm_clients/client.rb:139:in `with_retries'
llmjobs | app/lib/llm_clients/ollama/client.rb:93:in `request'
llmjobs | app/lib/llm_clients/ollama/client.rb:28:in `embed'
llmjobs | app/services/embedder.rb:7:in `embed'
llmjobs | app/services/embed_chunk.rb:18:in `embed'
llmjobs | app/services/embed_chunk.rb:12:in `execute'
llmjobs | app/sidekiq/embed_chunk_job.rb:15:in `perform'

@nickthecook
Copy link
Owner

Yep, that's my fault. I introduced a bug in Setting that keeps creating duplicate settings. I just ran into it myself.

Go to the admin ui, then Settings, then delete any settings with nil values.

I'll work on a fix.

@nickthecook
Copy link
Owner

I think the root of the problem is that I added validation to Setting so it wouldn't create duplicates, but some duplicates already existed in our databases because of a previous bug.

rails db:seed will deduplicate them for you now.

@mattlindsey
Copy link
Contributor Author

mattlindsey commented Sep 27, 2024

Great. Works now.
You can merge this PR if you want. Or if you think there's more to be done, I can keep working on it.

P.S. There are a couple of tests in the chunker spec that are failing, so I commented them out for now. I don't think that I broke them, but you might want to check.

@mattlindsey
Copy link
Contributor Author

Actually don't merge this yet.
The chunking for plain text and markdown seems to be broken. Looks like chunk_size isn't getting set and may be defaulting to 1000, but I don't know why. Take a look if you can. Thanks.

@mattlindsey
Copy link
Contributor Author

Fixed this. Everthing seems good now.

@nickthecook
Copy link
Owner

Fixed this. Everthing seems good now.

Great! I'll have another look, and comment on code organization/UI stuff this time, since it's out of draft.

Copy link
Owner

@nickthecook nickthecook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Just a few comments.

app/controllers/documents_controller.rb Show resolved Hide resolved
app/controllers/documents_controller.rb Show resolved Hide resolved
app/controllers/documents_controller.rb Show resolved Hide resolved
<div class="flex-col">

<%= f.file_field :file, class: "w-full p-2 rounded-lg my-2 px-5 py-2 border border-secondary-400 dark:border-secondary-600" %>

<%= f.url_field "link", class: "w-full p-2 rounded-lg my-2 px-5 py-2 border border-secondary-400 dark:border-secondary-600" %>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good in light mode, but in dark mode the text is light but the bg is white.

Can you theme it like the search bar on the same page?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it would be good if there was some hint for the user as to what the link field was. It does say "Add document or URL" above the file upload box, but maybe if you added placeholder text it would help?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good in dark mode:

Suggested change
<%= f.url_field "link", class: "w-full p-2 rounded-lg my-2 px-5 py-2 border border-secondary-400 dark:border-secondary-600" %>
<%= f.url_field "link",
placeholder: "Enter a URL",
class: "w-full p-2 rounded-lg my-2 px-5 py-2 border border-secondary-400 dark:bg-secondary-925 dark:border-secondary-600"
%>

Copy link
Owner

@nickthecook nickthecook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @mattlindsey !

@nickthecook nickthecook merged commit 482f29c into nickthecook:main Sep 29, 2024
@nickthecook
Copy link
Owner

nickthecook commented Sep 29, 2024

I think somehow the change you made to set the document title was lost. Moving the fetching logic to a new method is also gone.

I know these were there when I reviewed and before you force-pushed!

Any chance you can find that in a branch and push it again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants