gaby: exclude stale(?) web document #63

Closed
hyangah opened this issue Nov 26, 2024 · 3 comments
@hyangah

hyangah commented Nov 26, 2024

From golang/go#67901 (comment)

Docs like https://go.dev/doc/go1.17_spec#Package_initialization are kept for historical purposes.
We may come up with a workaround for this specific issue. I am not sure about general solutions.

Some approaches I am thinking of:

  • Label such docs manually in the document source and exclude them

  • Label such docs using an LLM (e.g. "obsolete"?) and exclude them

    (we can do the same for issues that we don't want to appear in the related info, by labelling/classifying them appropriately)

  • Before posting, drop near-duplicates (e.g. by pairwise similarity comparison)
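
A minimal sketch of the third approach, assuming each candidate doc comes with an embedding vector. The Doc type, the dropNearDuplicates helper, the 0.95 threshold, and the example vectors are all hypothetical illustrations, not Gaby's actual types or values:

```go
package main

import (
	"fmt"
	"math"
)

// Doc is a hypothetical search result: an ID plus its embedding vector.
type Doc struct {
	ID  string
	Vec []float64
}

// cosine returns the cosine similarity of a and b,
// or 0 if the lengths differ or either vector is all zeros.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// dropNearDuplicates keeps docs in their original order, skipping any doc
// whose similarity to an already-kept doc is at or above threshold.
func dropNearDuplicates(docs []Doc, threshold float64) []Doc {
	var kept []Doc
	for _, d := range docs {
		dup := false
		for _, k := range kept {
			if cosine(d.Vec, k.Vec) >= threshold {
				dup = true
				break
			}
		}
		if !dup {
			kept = append(kept, d)
		}
	}
	return kept
}

func main() {
	// Made-up vectors: the first two stand in for near-identical spec sections.
	docs := []Doc{
		{"https://go.dev/ref/spec#Package_initialization", []float64{1, 0, 0}},
		{"https://go.dev/doc/go1.17_spec#Package_initialization", []float64{0.99, 0.01, 0}},
		{"https://go.dev/doc/effective_go", []float64{0, 1, 0}},
	}
	for _, d := range dropNearDuplicates(docs, 0.95) {
		fmt.Println(d.ID)
	}
}
```

Because the first doc in each near-duplicate group wins, this only helps if the current page is ranked ahead of the historical copy; otherwise the stale doc would be the one kept.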

@gopherbot

Change https://go.dev/cl/633395 mentions this issue: internal/gaby: exclude go1.17_spec docs from crawling

@gopherbot

Change https://go.dev/cl/635176 mentions this issue: internal/devtools/cmd/rmdoc: delete crawled pages from corpus

gopherbot pushed a commit that referenced this issue Dec 15, 2024
This page was temporarily added to help spec revision.
It will be removed at the start of go1.25.
Until then, ignore this page.
(We have two entries for this page in our DB)

For #63

Change-Id: Ibf369100ca25f47ca487bb87f7327388ef8dcef3
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/633395
Reviewed-by: Tatiana Bradley <tatianabradley@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Dec 15, 2024
Gaby splits each crawled webpage into docs for embedding, computes
their embeddings, and stores them in the vector db. Delete all the docs
and their embeddings.

This is meant to be run after the webpage is excluded from
crawling with Crawler.Deny.

For #63

Change-Id: I095a65b9a834ccf48062facc3654f40b43562e15
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/635176
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Jonathan Amsterdam <jba@google.com>
@hyangah

hyangah commented Jan 3, 2025

Applied deletion:

$ cd internal/devtools/cmd/rmdoc
$ go run . -project oscar-go-1 -firestoredb prod https://go.dev/doc/go1.17_spec
$ go run . -project oscar-go-1 -firestoredb prod https://go.dev/doc/go1.17_spec.html 

Verified they don't appear in the similar doc search page.

rmdoc is a bit cumbersome to use: Gaby stores each section as a separate doc, the spec has many sections, and rmdoc requires approving the deletion of each doc individually.
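
One way to reduce the number of approvals would be to group section docs by page and confirm once per page. The sketch below is not how rmdoc works; it assumes doc IDs are page URLs with a #fragment per section (as in the go1.17_spec#Package_initialization link above), and pageOf and groupByPage are hypothetical helpers:

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
)

// pageOf returns id with any #fragment removed.
func pageOf(id string) (string, error) {
	u, err := url.Parse(id)
	if err != nil {
		return "", err
	}
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

// groupByPage maps each page URL to the doc IDs (sections) that belong to it.
func groupByPage(ids []string) (map[string][]string, error) {
	groups := make(map[string][]string)
	for _, id := range ids {
		page, err := pageOf(id)
		if err != nil {
			return nil, err
		}
		groups[page] = append(groups[page], id)
	}
	return groups, nil
}

func main() {
	ids := []string{
		"https://go.dev/doc/go1.17_spec#Introduction",
		"https://go.dev/doc/go1.17_spec#Notation",
		"https://go.dev/doc/go1.17_spec.html#Introduction",
	}
	groups, err := groupByPage(ids)
	if err != nil {
		panic(err)
	}
	// Sort page URLs so the output order is deterministic.
	pages := make([]string, 0, len(groups))
	for p := range groups {
		pages = append(pages, p)
	}
	sort.Strings(pages)
	for _, p := range pages {
		fmt.Printf("%s: %d docs\n", p, len(groups[p]))
	}
}
```

A tool could then show each page with its doc count and ask for a single confirmation before deleting the whole group.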

Closing. We will delete obsolete or unsuitable docs manually as we find them.
We don't yet have a good way to remove obsolete pages automatically, and we don't plan to work on one until this becomes a frequent issue.

@hyangah hyangah closed this as completed Jan 3, 2025
@hyangah hyangah self-assigned this Jan 3, 2025