manifest processing model, what if null base URL? (related to origin issue) #12

danielweck · 2018-12-05T08:59:48Z

Issue originally raised in the "opaque origin" conversation:
w3c/wpub#321 (comment)

iherman · 2018-12-05T13:48:30Z

If the manifest is embedded, the only way this can happen (see w3c/wpub#321 (comment)) is if the value of baseURI in the DOM for the <script> element is null. The question is when that would happen per the HTML or the DOM specs. I do not have a precise answer, but I suspect that it may happen in the case of a file: URL, i.e., when the entry page is read from the file system. If, as we referred to in w3c/wpub#321, we disallow that (or we just say the effect depends on the user agent and users should be prepared), then we are done, aren't we?

iherman · 2018-12-05T13:52:18Z

One step further in https://www.w3.org/TR/DOM-Level-3-Core/core.html#Node3-baseURI:

baseURI of type DOMString, readonly, introduced in DOM Level 3
The absolute base URI of this node or null if the implementation wasn't able to obtain an absolute URI.

iherman · 2018-12-05T13:53:09Z

Related to the original question: I am fine modifying the processing model stating that if this happens, the processing stops.

danielweck · 2018-12-05T15:11:31Z

In my original comment I mentioned data: URLs. I believe this is a problematic edge case.
w3c/wpub#321 (comment)

iherman · 2018-12-05T15:26:42Z

@danielweck I must admit I do not understand your remark with the data URL. Can you give a somewhat more detailed example of what this would be and mean?

danielweck · 2018-12-05T15:58:14Z

In the following edge case example, the data: URL encodes an HTML document which does not specify <base href="..."> in its head. Consequently, the <script>-embedded WebPub manifest has a null base URI (inherited from its parent document context).

Please ignore the lack of character escaping; this is pseudo-code:

https://domain.org/index.html
=>

<html>
<body>
<iframe
    src="data:text/html,<html><head><script type="application/ld+json">{...}</script></head><body>...</body></html>"
/>
</body>
</html>

Let's not try to explain why such convoluted markup would exist in the first place. Let's just handle the edge case regardless of its possible causes. I see two options:

1. Early termination: no point continuing to load the WebPub manifest without a base URI. If I understand correctly, at this point in time the JSON-LD processing model is being discussed / finalized, with respect to handling base URI in embedded contexts. However, the WebPub processing model can isolate itself from this potentially moving target by aborting as soon as the failure criterion is met.
2. Allow the WebPub manifest to load: if/when a base URI is required as part of the JSON-LD processing model in order to resolve an absolute URL from a relative "path", and this base URI is missing, then let the JSON-LD processor raise the appropriate error. This may be a complete abort, or a skip-resource-and-continue kind of algorithm (I am not sure, do you know, Ivan?)

iherman · 2018-12-05T17:05:33Z

(2) of course sounds like a viable and reasonable option, except that I would expect many reading systems to want to parse and interpret the manifest directly for the purposes of publications, without relying on a full-blown JSON-LD processor. I.e., relying on that may be an issue.

On (1): yes, there are discussions in the JSON-LD WG, but on other edge cases of embedding a manifest (e.g., whether it is required to escape certain HTML terms within the script element). I actually do not think this type of edge case has been discussed. Yes, the WebPub model can isolate itself, but I would think it is better to align with the JSON-LD WG.

Bottom line, I think this question should be raised in the JSON-LD WG. I can of course raise the issue, but it may be better if you did it (on https://github.com/w3c/json-ld-syntax/issues).

Do you know what the baseURI value will be on the DOM element for <script>? Will it be null (which is what I expect)?

danielweck · 2018-12-05T18:31:34Z

Quick test:

<html>
<body>
<iframe
    width="100%"
    height="100%"

    src="data:text/html;base64,CjxodG1sPgo8aGVhZD4KPGJhc2UgaHJlZj0iaHR0cHM6Ly9kb21haW4ub3JnL3BhdGgvIiAvPgoKPHNjcmlwdCBpZD0ic2NyaXB0IiB0eXBlPSJ0ZXh0L2phdmFzY3JpcHQiPgogIGRvY3VtZW50LmFkZEV2ZW50TGlzdGVuZXIoIkRPTUNvbnRlbnRMb2FkZWQiLCBmdW5jdGlvbihldmVudCkgewogICAgY29uc29sZS5sb2coIkRPTUNvbnRlbnRMb2FkZWQiKTsKICAgIAogICAgLy8gd2luZG93LmxvY2F0aW9uLm9yaWdpbiB0b28KICAgIGxldCB0MSA9ICJ3aW5kb3cub3JpZ2luOiAiICsgd2luZG93Lm9yaWdpbjsKICAgIGNvbnNvbGUubG9nKHQxKTsKICAgIGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCJfMSIpLmlubmVySFRNTCA9IHQxOwogICAgCiAgICBsZXQgdDIgPSAiZG9jdW1lbnQuYmFzZVVSSTogIiArIGRvY3VtZW50LmJhc2VVUkk7CiAgICBjb25zb2xlLmxvZyh0Mik7CiAgICBkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgiXzIiKS5pbm5lckhUTUwgPSB0MjsKCiAgICBsZXQgdDMgPSAibG9jYXRpb24uaHJlZjogIiArIGxvY2F0aW9uLmhyZWY7CiAgICBjb25zb2xlLmxvZyh0Myk7CiAgICBkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgiXzMiKS5pbm5lckhUTUwgPSB0MzsKCiAgICBsZXQgdDQgPSAic2NyaXB0LmJhc2VVUkk6ICIgKyBkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgic2NyaXB0IikuYmFzZVVSSTsKICAgIGNvbnNvbGUubG9nKHQ0KTsKICAgIGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCJfNCIpLmlubmVySFRNTCA9IHQ0OwogIH0pOwo8L3NjcmlwdD4KPC9oZWFkPgo8Ym9keT4KPGgxIGlkPSJfMSI+MTwvaDE+CjxoMSBpZD0iXzMiPjM8L2gxPgo8aDEgaWQ9Il8yIj4yPC9oMT4KPGgxIGlkPSJfNCI+NDwvaDE+CjwvYm9keT4KPC9odG1sPg=="
/>
</body>
</html>
<!--
<html>
<head>
<base href="https://domain.org/path/" />

<script id="script" type="text/javascript">
  document.addEventListener("DOMContentLoaded", function(event) {
    console.log("DOMContentLoaded");
    
    // window.location.origin too
    let t1 = "window.origin: " + window.origin;
    console.log(t1);
    document.getElementById("_1").innerHTML = t1;
    
    let t2 = "document.baseURI: " + document.baseURI;
    console.log(t2);
    document.getElementById("_2").innerHTML = t2;

    let t3 = "location.href: " + location.href;
    console.log(t3);
    document.getElementById("_3").innerHTML = t3;

    let t4 = "script.baseURI: " + document.getElementById("script").baseURI;
    console.log(t4);
    document.getElementById("_4").innerHTML = t4;
  });
</script>
</head>
<body>
<h1 id="_1">1</h1>
<h1 id="_3">3</h1>
<h1 id="_2">2</h1>
<h1 id="_4">4</h1>
</body>
</html>
-->

Result:

window.origin: null

location.href: data:text/html;base64,LONG_BASE64_STRING

document.baseURI: https://domain.org/path/

script.baseURI: https://domain.org/path/

If the <base href="https://domain.org/path/" /> element is removed, then baseURI for both document and script is in fact not null: it is the same as location.href (i.e. the data: URL), which cannot be used for resolving absolute URLs from relative paths anywhere in the document (such as when processing an embedded WebPub manifest).
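To illustrate why a data: URL is unusable as a base, here is a minimal sketch using the standard `URL` constructor. The helper name `tryResolve` and the sample manifest paths are hypothetical, chosen only for this illustration. It assumes a WHATWG-compliant URL parser (current browsers, Node.js), which rejects a relative reference when the base has an opaque path, as data: URLs do.

```javascript
// Hypothetical helper: returns the resolved absolute URL, or null on failure.
function tryResolve(reference, baseUrl) {
  try {
    return new URL(reference, baseUrl).href;
  } catch (e) {
    // A relative reference cannot be resolved against a base
    // whose path is opaque (data:, mailto:, ...): TypeError.
    return null;
  }
}

// Stand-in for the baseURI reported when <base> is removed.
const dataBase = "data:text/html,hello";

// A relative manifest entry (hypothetical path) cannot be resolved...
const relative = tryResolve("chapters/c1.html", dataBase);
// ...whereas an absolute entry does not need the base at all.
const absolute = tryResolve("https://domain.org/chapters/c1.html", dataBase);

console.log(relative); // null
console.log(absolute); // "https://domain.org/chapters/c1.html"
```

So a non-null baseURI is not, by itself, enough: the base also has to be a hierarchical URL for relative references to resolve.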

Based on this simple experiment, I am starting to wonder whether, just like opaque origin, the WebPub specification should simply remain silent about baseURI edge cases. Once again, I think that the rationale for explicitly null-testing origin/baseURI (e.g. fail => terminate) in the WP manifest acquisition algorithm should be that origin/baseURI is explicitly needed later in the algorithm. For origin, the processing steps rely on the fetch API response status (e.g. bad CORS -> error response). For baseURI, it depends on whether the WebPub specification describes in great detail how to resolve absolute URLs in the manifest, or if this is "automatically" inherited from the JSON-LD processing model (in which case this concern becomes a polyfill / user-agent implementation detail).

Thoughts?

iherman · 2018-12-06T08:11:18Z

Taking this out from @danielweck's long comment for an easier reference:

Based on this simple experiment, I am starting to wonder whether, just like opaque origin, the WebPub specification should simply remain silent about baseURI edge cases. Once again, I think that the rationale for explicitly null-testing origin/baseURI (e.g. fail => terminate) in the WP manifest acquisition algorithm should be that origin/baseURI is explicitly needed later in the algorithm. For origin, the processing steps rely on the fetch API response status (e.g. bad CORS -> error response). For baseURI, it depends on whether the WebPub specification describes in great detail how to resolve absolute URLs in the manifest, or if this is "automatically" inherited from the JSON-LD processing model (in which case this concern becomes a polyfill / user-agent implementation detail).

I came to a similar conclusion, so I wholeheartedly agree. Although weird, the example with the data: URL makes sense. It is also perfectly possible to create a manifest using absolute URLs only, in which case the interpretation of the manifest could be oblivious to the null baseURI value.

I think for both this issue and w3c/wpub#321 we should try to find a blanket formulation in the processing which says that if a processing step runs into an error (or an OWP-related error?), then the processing would stop and there would be no manifest. (We could add a note giving examples of such situations, and we could refer to the origin or the baseURI null problem, but that should only be an informal note.) I am not sure how exactly to formulate that, but maybe @mattgarrish can come up with the best terminology...

iherman · 2018-12-06T08:16:36Z

N.B. I have raised an explicit issue with the JSON-LD WG (w3c/json-ld-syntax#103), a.k.a. passing the buck :-)

danielweck · 2018-12-06T09:02:50Z

Thanks Ivan!

Let me also clarify this statement:

For baseURI, it depends on whether the WebPub specification describes in great detail how to resolve absolute URLs in the manifest, or if this is "automatically" inherited from the JSON-LD processing model (in which case this concern becomes a polyfill / user-agent implementation detail).

If the former (i.e. the WP specification describes "parsing" rules, probably as an extension to the JSON-LD processing model), then the manifest algorithm must be clear about what happens when an absolute URL cannot be resolved:

complete failure (i.e. abort loading the manifest entirely)
or:
skip the unresolved URL (i.e. ignore the resource), and continue loading the rest of the data.
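The two behaviours can be sketched as follows. This is only an illustration under stated assumptions: `resolveResources` and the `"abort"` / `"skip"` mode names are hypothetical and not taken from any specification, and the resource list stands in for whatever URLs a manifest carries.

```javascript
// Hypothetical sketch: resolve every manifest URL against `base`
// (a URL string, or null when no usable base is available).
// onFailure is "abort" (complete failure) or "skip" (ignore the resource).
function resolveResources(urls, base, onFailure) {
  const resolved = [];
  for (const url of urls) {
    let absolute;
    try {
      // Without a base, only absolute URLs can be parsed.
      absolute = base === null ? new URL(url).href : new URL(url, base).href;
    } catch (e) {
      if (onFailure === "abort") {
        return null; // abort loading the manifest entirely
      }
      continue; // skip the unresolved URL, keep loading the rest
    }
    resolved.push(absolute);
  }
  return resolved;
}

// Hypothetical manifest resources: one absolute, one relative.
const resources = ["https://domain.org/cover.jpg", "c1.html"];

const aborted = resolveResources(resources, null, "abort");
const skipped = resolveResources(resources, null, "skip");
const withBase = resolveResources(resources, "https://domain.org/path/", "abort");

console.log(aborted);  // null
console.log(skipped);  // ["https://domain.org/cover.jpg"]
console.log(withBase); // ["https://domain.org/cover.jpg", "https://domain.org/path/c1.html"]
```

Either way, a manifest that uses absolute URLs only is unaffected by the missing base.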

iherman · 2019-08-09T13:43:44Z

All this in a new setting, where we are "only" talking about the strict vocabulary and not the processing models anymore...

Looking at the canonicalization algorithm, the only place where the base is used is in step 11, i.e., when relative URLs are turned into absolute ones. I see two simple options:

1. Remove that step altogether. I.e., how to handle relative URLs should be left fully under the control of the processor using the manifest, and this should be specified in the corresponding extension. For example, one could say that if the manifest is used in a packaged audiobook, then the relative URLs are relative to the top level of the 'file system' within LPF.
2. Alternatively, if the base is null, then all relative URLs are left as they are.

In fact, the consequence of (2) is still (1), in the sense that the processor specification should still define what a relative URI means within the publication. How is that formally defined in EPUB?

I am mildly in favor of (2), i.e., allowing an explicit base setting but falling back on the processor behavior if not used. Note that if we decide for (1), that makes #11 moot as well.
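For illustration only, option (2) might look like the sketch below. The function name `canonicalizeUrl` is hypothetical and is not taken from the canonicalization algorithm itself. It assumes the base, when present, is a hierarchical URL string.

```javascript
// Hypothetical sketch of option (2): with a null base, the URL is left
// exactly as written, and the consuming processor decides what it means.
function canonicalizeUrl(url, base) {
  if (base === null) {
    return url; // left as-is, relative or not
  }
  return new URL(url, base).href; // resolved against the explicit base
}

const untouched = canonicalizeUrl("audio/track1.mp3", null);
const resolvedUrl = canonicalizeUrl("audio/track1.mp3", "https://domain.org/book/");

console.log(untouched);   // "audio/track1.mp3"
console.log(resolvedUrl); // "https://domain.org/book/audio/track1.mp3"
```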

BigBlueHat · 2019-08-09T15:29:25Z

@iherman looks like your "canonicalization algorithm" link is going to the wrong spec.

I'd suggest not doing anything that forks from the JSON-LD processing semantics for @base, and be sure to build up any "base" calculations on the same foundation, RFC 3986.

iherman · 2019-08-09T15:34:42Z

I am sorry, the right link is https://w3c.github.io/pub-manifest/#canonical-manifest

iherman · 2019-08-09T15:35:43Z

I certainly wouldn't want to fork. Both (1) and (2) amount to being silent about the issue in the canonicalization...

iherman · 2019-09-10T07:29:43Z

This issue was discussed in a meeting.

No actions or resolutions

View the transcript

5. Issue #12 Manifest processing model, what if null base URL?
Garth Conboy: Is Daniel on the call to talk about (?)
… issue 12 Manifest processing model, what if null base URL?
Garth Conboy: See Issue #12
Wendy Reid: I need to read this over before I have any opinions… I think we can save this one for discussion. Maybe Ivan has more info?
Ivan Herman: Related to what I said before - at the moment we have the publication manifest, where the base comes from is up to the various profiles…
… it was all about what happens if web content has an iframe, what is the base URL?
… we haven’t solved this issue, but it’s not relevant any more for the manifest…
Garth Conboy: Was that a ‘leave to TPAC’ or ‘close now’?
Ivan Herman: Leave to TPAC…
Garth Conboy: We’ll have Laurent with us at TPAC, so that makes sense.

iherman · 2019-09-25T13:33:52Z

This issue was discussed in a meeting.

RESOLVED: Close Issue #12, the canonicalization algorithm has been changed, origin is no longer a concern for Publication Manifest, but should be considered for specifications concerning discovery

View the transcript

Wendy Reid: #12
Wendy Reid: this is my favorite issue!
… what if there’s a null base URL?
… in light of recent changes to the specification, we have gotten rid of the canonicalization model algorithm
… so maybe this is a non-issue
Benjamin Young: we don’t know where these json files are used
… we don’t have an origin now
… if LPF would be to go to REC, we might have to figure out how the base url is calculated
… but until this JSON file is related to some HTML document that can express a base URL, we don’t need to say anything
… it’s blank/null by default
… there are other concerns, but this issue is not an issue
Proposed resolution: Close Issue #12, the canonicalization algorithm has been removed, origin is no longer a concern for Publication Manifest (Wendy Reid)
Benjamin Young: before we vote
… the canonicalization thing has not been removed but renamed
… maybe leave that bit out
… just say it’s a json data document thingy. might not be at a URL
Ralph Swick: do you want to capture bigbluehat’s thought that this will be a concern in the future when the manifest is included in some future transfer protocol(s)?
Proposed resolution: Close Issue #12, the canonicalization algorithm has been changed, origin is no longer a concern for Publication Manifest, but should be considered for specifications concerning discovery (Wendy Reid)
Benjamin Young: +1
Wendy Reid: +1
Laurent Le Meur: +1
Gregorio Pellegrino: +1
Juan Corona: +1
Dave Cramer: +1 with an error of 1
Brady Duga: +1
Toshiaki Koike: +1
Charles LaPierre: +1
Resolution #3: Close Issue #12, the canonicalization algorithm has been changed, origin is no longer a concern for Publication Manifest, but should be considered for specifications concerning discovery

mattgarrish transferred this issue from w3c/wpub Aug 7, 2019

wareid closed this as completed Sep 16, 2019

danielweck mentioned this issue Jan 30, 2020

JSON base URL for resolving 'text' URL 'fragment' is base URL of HTML document specified as "alternate"? w3c/sync-media-pub#28

Closed
