Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Drop cache info when a redirection took place #1777

Merged
merged 1 commit into from
Mar 5, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Drop cache info when a redirection took place
When a redirection takes place, Reffy follows the redirection logic (typically
done through scripting) but the cache info it gets from Puppeteer remains for
the initial URL. Reffy incorrectly assumed that info also applied to the final
page.

There's no easy way to retrieve the cache info of the final URL. Since that
should only affect a spec that moves, and only until we detect and update the
URL of the spec in browser-specs, this update simply drops the cache info to
force Reffy to crawl the spec.

Fix #1774.
tidoust committed Feb 17, 2025
commit 28d88b9080dd4314e607eccf1dbca0f642667eeb
29 changes: 28 additions & 1 deletion src/lib/mock-server.js
Original file line number Diff line number Diff line change
@@ -169,13 +169,40 @@ mockAgent
}
});

mockAgent
.get("https://www.w3.org")
.intercept({ method: "GET", path: "/TR/iredirect/" })
.reply(200,
`<!DOCTYPE html><script>window.location = '/TR/recentlyupdated/';</script>`,
{
headers: {
"Content-Type": "text/html",
"Last-Modified": "Fri, 11 Feb 2022 00:00:42 GMT"
}
}
);

mockAgent
.get("https://www.w3.org")
.intercept({ method: "GET", path: "/TR/recentlyupdated/" })
.reply(200,
`<html><title>Recently updated</title>
<h1>Recently updated</h1>`,
{
headers: {
"Content-Type": "text/html",
"Last-Modified": (new Date()).toString()
}
}
);

mockAgent
.get("https://drafts.csswg.org")
.intercept({ method: "GET", path: "/server-hiccup/" })
.reply(200,
`<html><title>Server hiccup</title>
<h1> Index of Server Hiccup Module Level 42 </h1>`,
{ header: { "Content-Type": "text/html" } })
{ headers: { "Content-Type": "text/html" } })
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated fix. I don't think this had any practical implication in practice, but that was still wrong...

.persist();

/*nock.emitter.on('error', function (err) {
11 changes: 9 additions & 2 deletions src/lib/specs-crawler.js
Original file line number Diff line number Diff line change
@@ -138,8 +138,15 @@ async function crawlSpec(spec, crawlOptions) {
if (result.crawled) {
spec.crawled = result.crawled;
}
if (result.crawlCacheInfo) {
spec.crawlCacheInfo = result.crawlCacheInfo;
if (result.crawlCacheInfo &&
(result.crawled === spec.url ||
result.crawled === spec.nightly?.url)) {
// Note: Some redirection took place. That happens when, e.g., a
// WICG spec gets moved to another group, until we update the URL
// in browser-specs. Redirection is done through scripting. Reffy
// follows the redirect but the cache info it receives from
// Puppeteer is for the initial URL. We cannot rely on it!
spec.crawlCacheInfo = result.crawlCacheInfo;
}
crawlOptions.modules.forEach(mod => {
if (result[mod.property]) {
10 changes: 9 additions & 1 deletion tests/crawl.js
Original file line number Diff line number Diff line change
@@ -103,7 +103,6 @@ if (global.describe && describe instanceof Function) {
assert.equal(results.results[0].title, 'A test spec');
});


it("skips processing and reuse fallback data when spec cache info indicates it has not changed", async () => {
const url = "https://www.w3.org/TR/ididnotchange/";
const fallback = path.resolve(scriptPath, 'crawl-cache.json');
@@ -116,6 +115,15 @@ if (global.describe && describe instanceof Function) {
assert.equal(results[0].title, "Change is the only constant");
assert.ifError(results[0].error);
assert.equal(results[0].refs, "A useful list of refs");
});

it("does not return cache info when a redirection took place", async () => {
const url = "https://www.w3.org/TR/iredirect/";
const results = await crawlSpecs(
[{ url, nightly: { url } }],
{ forceLocalFetch: true });
assert.equal(results[0].title, "Recently updated");
assert.equal(results[0].crawlCacheInfo, undefined);
})

it("reports HTTP error statuses", async () => {