Web Scraping #68

Open
nickthecook opened this issue Sep 6, 2024 · 2 comments
Labels: collections, feature, help wanted

Comments

@nickthecook
Owner

Instead of uploading a document, the user should be able to enter a URL.

Archyve needs to:

  • find and scrape the main document content
  • ideally show the user a preview of the text scraped from the page before actually ingesting the content
  • once the user confirms, chunk, embed, and graph the content just as if it were an uploaded document
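
A rough sketch of the scrape-then-preview steps above, in plain Ruby with no gems. This is not Archyve code: `WebPreview`, `main_text`, and `preview` are hypothetical names, and the regex-based extraction is a deliberately crude stand-in for a real HTML parser or readability library. It assumes the page HTML has already been fetched into a string, and it does not decode HTML entities.

```ruby
# Hypothetical helper: crude main-content extraction from an HTML string.
module WebPreview
  # Elements that rarely hold the main document content (an assumption).
  NOISE = %w[script style nav footer aside].freeze

  def self.main_text(html)
    # Prefer an explicit <article> or <main> container when the page has one;
    # otherwise fall back to the whole page.
    body = html[%r{<(article|main)\b[^>]*>(.*?)</\1>}mi, 2] || html
    # Drop the noisy elements together with their contents.
    NOISE.each { |tag| body = body.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}>}mi, " ") }
    # Strip the remaining tags and collapse runs of whitespace.
    body.gsub(/<[^>]+>/, " ").gsub(/\s+/, " ").strip
  end

  # Short excerpt to show the user before they confirm ingestion.
  def self.preview(text, limit: 200)
    text.length > limit ? "#{text[0, limit]}..." : text
  end
end
```

The idea would be to show `WebPreview.preview(WebPreview.main_text(html))` to the user, and only hand the full text to the existing chunk/embed/graph steps once they confirm.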

Along the way, this will require:

  • updating Document so that it works for a web source instead of just an uploaded attachment (preferred to creating a new model and having Collection#documents be polymorphic)
  • updating the Collection view so it shows web sources along with Documents
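
To illustrate the single-model option, here is a plain-Ruby stand-in. The real Document is presumably an ActiveRecord model, so this is only a sketch of the shape: `DocumentSketch`, `filename`, `source_url`, `scraped_at`, `web?`, and `display_name` are all hypothetical names, not existing Archyve attributes.

```ruby
# Plain-Ruby stand-in for a Document that can come from an upload or a URL.
DocumentSketch = Struct.new(:filename, :source_url, :scraped_at, keyword_init: true) do
  # True when the document was created from a URL rather than an upload.
  def web?
    !source_url.nil?
  end

  # What a Collection view could show for either kind of document.
  def display_name
    web? ? source_url : filename
  end

  # Exactly one of the two sources must be present.
  def valid?
    [filename, source_url].compact.size == 1
  end
end
```

With something like this, the Collection view could keep iterating over one list, e.g. `documents.map(&:display_name)`, without caring which kind of source each entry has.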
@mattlindsey
Contributor

mattlindsey commented Oct 9, 2024

Since the web page has now been scraped, a further enhancement could be for the chat augmentation to include the URL (and maybe the date scraped) that a chunk was derived from. Does that make sense?
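
Something like the following is what I have in mind, as a plain-Ruby sketch only. `ChunkSketch`, `augmentation_for`, and the `[Source: ...]` format are made up for illustration and are not taken from the current code.

```ruby
# Hypothetical chunk carrying the provenance of the page it came from.
ChunkSketch = Struct.new(:text, :source_url, :scraped_at, keyword_init: true)

# Build the text added to the chat prompt for one chunk, appending the
# source URL (and scrape date, when known) for chunks that came from the web.
def augmentation_for(chunk)
  return chunk.text if chunk.source_url.nil?

  note = "Source: #{chunk.source_url}"
  note += " (scraped #{chunk.scraped_at.strftime('%Y-%m-%d')})" if chunk.scraped_at
  "#{chunk.text}\n[#{note}]"
end
```

Chunks from uploaded documents would pass through unchanged, so existing behaviour would not be affected.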

@nickthecook
Owner Author

Yeah, makes sense!
