gaby: exclude stale(?) web document #63
Comments
Change https://go.dev/cl/633395 mentions this issue.
Change https://go.dev/cl/635176 mentions this issue.
This page was temporarily added to help spec revision. It will be removed at the start of go1.25. Until then, ignore this page. (We have two entries for this page in our DB.)

For #63

Change-Id: Ibf369100ca25f47ca487bb87f7327388ef8dcef3
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/633395
Reviewed-by: Tatiana Bradley <tatianabradley@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Gaby splits each crawled webpage into docs for embedding, computes their embeddings, and stores them in the vector db. Delete all the docs and their embeddings. This is meant to be run after the webpage is excluded from crawling with Crawler.Deny.

For #63

Change-Id: I095a65b9a834ccf48062facc3654f40b43562e15
Reviewed-on: https://go-review.googlesource.com/c/oscar/+/635176
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Jonathan Amsterdam <jba@google.com>
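As a rough illustration of the deletion described above, here is a minimal sketch. It does not use Gaby's real storage or vector db API: the `store` type, `belongsTo`, and `deletePage` are hypothetical names, and the sketch assumes that each section doc is keyed by the page URL followed by `#` and the section anchor.

```go
package main

import (
	"sort"
	"strings"
)

// store is a hypothetical in-memory stand-in for a document store plus a
// vector db, both keyed by doc ID. It is not Gaby's actual storage API.
type store struct {
	docs       map[string]string    // doc ID -> section text
	embeddings map[string][]float32 // doc ID -> embedding vector
}

// belongsTo reports whether id names the page itself or one of its sections,
// assuming section IDs have the form pageURL + "#" + anchor.
func belongsTo(id, pageURL string) bool {
	return id == pageURL || strings.HasPrefix(id, pageURL+"#")
}

// deletePage removes every doc derived from pageURL together with its
// embedding, and returns the deleted IDs in sorted order.
func (s *store) deletePage(pageURL string) []string {
	var deleted []string
	for id := range s.docs {
		if belongsTo(id, pageURL) {
			deleted = append(deleted, id)
		}
	}
	sort.Strings(deleted)
	for _, id := range deleted {
		delete(s.docs, id)
		delete(s.embeddings, id)
	}
	return deleted
}
```

Matching on the URL plus `#` rather than on a bare prefix avoids deleting docs from a different page whose URL merely starts with the same characters.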
Applied deletion:
Verified that they no longer appear in the similar-doc search page. rmdoc is a bit cumbersome to use: Gaby stores each section as a separate doc, the spec doc has many sections, and rmdoc requires approval of each deletion. Closing. We will delete obsolete or unsuitable docs manually as we find them.
From golang/go#67901 (comment)
Docs like https://go.dev/doc/go1.17_spec#Package_initialization are kept for historical purposes.
We may come up with a workaround for this specific issue. I am not sure about general solutions.
Some approaches I am thinking of:
- Label such docs manually in the document source and exclude them
- Label such docs using an LLM (e.g. "obsolete"?) and exclude them (we can also do the same for issues that we don't want to appear in the related info, by labelling/classifying them appropriately)
- Before posting, drop near-duplicates (e.g. by pair-wise similarity comparison)
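The last approach, dropping near-duplicates by pair-wise similarity, could look roughly like the sketch below. It assumes each search result carries its embedding vector and uses cosine similarity with a caller-chosen threshold. The `result` type, `cosine`, and `dropNearDuplicates` are hypothetical names for illustration, not part of Gaby.

```go
package main

import "math"

// cosine returns the cosine similarity of a and b, or 0 if the lengths
// differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// result is a hypothetical search hit: a doc ID plus its embedding.
type result struct {
	ID  string
	Vec []float32
}

// dropNearDuplicates keeps results in their original order, skipping any
// result whose similarity to an already kept result is at or above threshold.
func dropNearDuplicates(results []result, threshold float64) []result {
	var kept []result
	for _, r := range results {
		dup := false
		for _, k := range kept {
			if cosine(r.Vec, k.Vec) >= threshold {
				dup = true
				break
			}
		}
		if !dup {
			kept = append(kept, r)
		}
	}
	return kept
}
```

Because results are usually ranked, keeping the first of each near-duplicate group preserves the best-scoring doc. The comparison is quadratic in the number of results, which should be acceptable for the short lists that get posted.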