Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate spec_urls based on webref ids #23958

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Elchi3
Copy link
Member

@Elchi3 Elchi3 commented Jul 26, 2024

Draft testing PR for @tidoust :)

Based on w3c/webref#1198 (comment), I wrote a quick test to see if webref ids could be used to (deeply) validate BCD's spec_urls. (that is, we want to check if the fragment ids are valid as well, not just the spec hosts).

It spits out a lot of errors and I would be interested to hear if BCD should be using different fragment ids, or if webref is missing these fragment ids, or if something else is going on. Please see the CI failure for the results.

(This is a draft PR that removes our dependency on web-specs and instead fetches raw webref JSON files, we might not want to fetch the data this way, so consider this PR just a test for now)

@github-actions github-actions bot added infra 🏗️ Infrastructure issues (npm, GitHub Actions, releases) of this project linter 🏡 Issues or pull requests regarding the tests / linter of the JSON files. labels Jul 26, 2024
@tidoust
Copy link
Contributor

tidoust commented Jul 26, 2024

It spits out a lot of errors and I would be interested to hear if BCD should be using different fragment ids, or if webref is missing these fragment ids, or if something else is going on. Please see the CI failure for the results.

I'd say that the good news is that, in most cases, it seems that "something else is going on" ;)

Main categories of errors I see:

  1. Fragments in ids extracts are percent-encoded. That does not seem to be the case for URLs used in BCD. If percent-encoding seems wrong, we could perhaps change that (I'm always confused as to when that is needed, or good practice). Otherwise, for comparison purpose, some preprocessing is going to be required, see inline.
  2. The URL used in BCD may have the filename, e.g., index.html in https://webassembly.github.io/spec/js-api/index.html#dom-globaldescriptor-mutable. URLs in the ids extracts don't have the filename (except for multipage specs!). The index.json file in Webref contains a nightly.filename property for each spec that could be used to create the URL variants with the filename if needed. Alternatively, it could perhaps be a good idea to drop that filename in BCD data?
  3. Many URLs in BCD use the series URL, whereas ids extracts in Webref are per specification. For example, you'll have https://drafts.csswg.org/css-logical/#position-properties in BCD, while you'll find https://drafts.csswg.org/css-logical-1/#position-properties in Webref. To find the right level in Webref, you'll need to look at entries in index.json in Webref with the same series.nightlyUrl as the URL (without fragment) used in BCD, then select the entry whose shortname is equal to series.currentSpecification.
  4. The code does not handle the case where a spec does not define any ID. That does happen for some WebGL extensions referenced by BCD, such as WebGL EXT_disjoint_timer_query
  5. For IETF RFCs produced by the HTTP WG, Webref prefers the httpwg.org URL because its rendering is slightly more user-friendly, whereas BCD seems to use www.rfc-editor.org URLs. The latter URL appears as the canonical url in Webref, so it should be relatively easy to find what you want if BCD wants to keep using that origin.

And then there are actual broken links in BCD, such as https://tc39.es/proposal-temporal/#sec-get-temporal.zoneddatetime.prototype.timezone. There are also "outdated" URLs, such as https://tc39.es/ecma262/multipage/additional-ecmascript-features-for-web-browsers.html#sec-object.prototype.__defineGetter__, which redirects to https://tc39.es/ecma262/multipage/fundamental-objects.html#sec-object.prototype.__defineGetter__ that appears in Webref.

There may be a few other error cases to dig into.

@Elchi3
Copy link
Member Author

Elchi3 commented Jul 27, 2024

Fantastique François!! 🎉
Thanks for the very useful review comment! I've updated the script :) Now we're down to just 269 problems found! :)

What I see now:

  • We should change rfc-editor urls to httpwg urls
  • Filenames should be omitted
  • Quite a few legit broken fragment links that need to be fixed in BCD (yay, these are the ones I want to chase with this exercise)

Something I would like for you to take a look:

  • There are about 22 links to HTML multipage fragments and upon spot checking they work. Maybe these are missing in webref or what am I missing?

@tidoust
Copy link
Contributor

tidoust commented Jul 27, 2024

There are about 22 links to HTML multipage fragments and upon spot checking they work. Maybe these are missing in webref or what am I missing?

As far as I can tell, all of them are examples of what I called outdated links: they work, but that's because the HTML spec has logic in place to redirect past fragments to their new page. Each time, the content referenced by the link moved to another page of the HTML spec and would better be targeted using the new fragment to avoid a redirect.

For example, clicking on https://html.spec.whatwg.org/multipage/browsing-the-web.html#dom-beforeunloadevent-returnvalue makes you load the browsing-the-web.html page, which includes some JavaScript that detects the fragment, knows it no longer exists in that page, and redirects you to the nav-history-apis.html page where the content was moved. The final URL is https://html.spec.whatwg.org/multipage/nav-history-apis.html#dom-beforeunloadevent-returnvalue. That final URL appears in Webref. Ideally, BCD would always use such final URLs to avoid redirects that consume a bit of time, bandwidth and energy.

This was referenced Jul 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
infra 🏗️ Infrastructure issues (npm, GitHub Actions, releases) of this project linter 🏡 Issues or pull requests regarding the tests / linter of the JSON files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants