-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HtmlExtractor: allow relative hrefs in the base element #209
Conversation
Looks good to me. I think JerichoExtractorHtml has the same problem. |
Hm, well, this now breaks other tests. Elsewhere, BAD BASE tests are in play that assume Not sure what to do about this. |
Hm. So I think the problem is that HTML4 specified that base URLs must be absolute. Presumably browsers at the time ignored absolute URLs hence the BADBASE tests. However HTML5 allows them to be relative but only the first is obeyed and any subsequent base element is ignored. So I guess the options for Heritrix's behaviour are:
Personally I would go with option 1 since it's simple and matches what browsers do. I would consider option 2 speculative link extraction. |
I'm happy to implement option 1. Option 2 gets quite messy as it would mean storing all potential base URIs and generating all potential URL combinations (i.e. for L links and B base URIs we'd need to emit L x B links!) Option 3 is doable too I guess, with a fallback to option 1 when the DOCTYPE is not known. I'm not really sure what to do with the tests, as they appear to be testing what happens when you get bad URIs, and happen to use the HTML4 base URI convention to make it happen. Having taken that away, I am struggling to make a test case that actually does create a bad URI! |
If we're going with option 1 I think we should replace those bad url tests entirely. They're just plain wrong for modern html. We should instead test relative base hrefs and a page with several base tags. |
The problem is that the failing test is called |
Looks like this is the original issue: https://webarchive.jira.com/browse/HER-25
I'm confused, why would we want them to? Isn't the point of the original issue that Heritrix should NOT choke on bad URIs? (ie the name of the test is describing the problem addressed not the desired behaviour) |
Those tests look all wrong to me. I'm pretty sure modern browsers would apply the base urls in all cases (one.html two.html html). If there's extra slashes or illegal characters ir whatever modern browsers just normalise and escape them. The schemaless url is treated as a relative URL. So I still think the tests are wrong and should be removed. (Or inverted so they enforce that goodone.html etc are actually not found) |
Yes, and the test appears to be designed to check that the page parsing does not stop if the extractor hits a bad URI. If I can't create a URI error, then I can't make the test test anything. |
Okay, so from a Real Life Log File^(TM) I find three types of error:
I think the latter is probably easiest to use. |
Ah. How about this?
|
Right, so, finally got the tests working. One of the other tests assumed BaseURI would be set every time a |
@@ -329,7 +402,6 @@ public void testScriptTagWritingScriptType() throws URIException { | |||
public void testOutLinksWithBaseHref() throws URIException { | |||
CrawlURI puri = new CrawlURI(UURIFactory | |||
.getInstance("http://www.example.com/abc/index.html")); | |||
puri.setBaseURI(puri.getUURI()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the change needed because BaseURI can now only be set once in ExtractorHTML
@anjackson It looks fine, but unfortunately you caught me minutes before I'm out the door for a holiday so I can't properly vet it until the end of the month. |
Hey @nlevitt @kris-sigur or anyone able to take a quick look at this? |
Looks fine to me. |
I believe this ensures Heritrix3 deals with relative base hrefs. (fixes #208)