ExtractorHTML: Fix srcset by normalizing elementContext() to lowercase #478

ato · 2022-04-22T07:53:46Z

This ensures that when we later compare the context in processEmbed() we don't need to deal with variants like srcSet or SRCSET. Note that we're already sometimes lowercasing it in HTMLLinkContext.get().
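To illustrate the idea, here is a minimal, self-contained sketch rather than the actual Heritrix code. The class name `ElementContextSketch`, the `element/@attribute` context format, and the `isSrcset()` helper are assumptions made for the example; only the lowercasing step reflects what this PR describes.

```java
import java.util.Locale;

public class ElementContextSketch {
    // Hypothetical stand-in for elementContext(): joins element and attribute
    // names and lowercases the result once, up front. The "/@" separator is an
    // assumption made for this sketch.
    static String elementContext(CharSequence element, CharSequence attribute) {
        String context = attribute == null
                ? element.toString()
                : element + "/@" + attribute;
        // Locale.ROOT avoids locale-specific case mappings (e.g. Turkish dotless i).
        return context.toLowerCase(Locale.ROOT);
    }

    // Hypothetical check of the kind a caller might perform later. Because the
    // context is already lowercase, a plain comparison is enough.
    static boolean isSrcset(String context) {
        return context.endsWith("/@srcset");
    }

    public static void main(String[] args) {
        // All three spellings normalize to the same context string.
        for (String attr : new String[] {"srcset", "srcSet", "SRCSET"}) {
            String context = elementContext("IMG", attr);
            System.out.println(context + " -> " + isSrcset(context));
        }
    }
}
```

Normalizing once where the context is built means later comparisons do not each need their own case-insensitive handling.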

The second commit adds a main() method to ExtractorHTML to run the extractor against a given URL without having to set up a full job configuration. This is something I find myself frequently reaching for when we encounter a crawl problem.

Fixes #477.


ato added 2 commits April 22, 2022 16:44

ExtractorHTML: Add a main() method to run the extractor standalone

914756f

This makes troubleshooting link extraction problems much easier.

ato merged commit 207adec into master Apr 27, 2022

ato deleted the srcset-fix branch April 27, 2022 05:11
