Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start writing the bundling spec. #98

Merged
merged 10 commits into from
May 29, 2018
Merged

Start writing the bundling spec. #98

merged 10 commits into from
May 29, 2018

Conversation

jyasskin
Copy link
Member

@jyasskin jyasskin commented Dec 21, 2017

Preview

This fills a similar niche as MHTML, but

  • is random access,
  • efficiently encodes 8-bit payloads,
  • can represent multiple top-level resources,
  • can represent content-negotiated resources,
  • and possibly other advantages.

I believe that combined with signed exchanges, bundles fill all of the web packaging use cases except that they'll need a subsequent change to work for "Third-party security review", which will need a signature to cover groups of resources to enforce that their versions are consistent.

Copy link

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You state that "parsing MUST fail" in a bunch of cases, but there's also a lot of scanning and jumping going on. Are there multiple levels where things can fail? That is, if you happened to just parse this once without jumping around, you can't simply return failure for all errors but would instead have to throw away items? That's rather unclear.

* Has exactly two keys starting with a ':' character, ':method' and ':url'.
* *R*\[':method'] is an HTTP method defined as cacheable (Section 4.2.3 of
{{!RFC7231}}) and safe (Section 4.2.1 of {{!RFC7231}}), as required for
PUSH_PROMISEd requests (Section 8.2 of {{?RFC7540}}).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really good enough for browsers I think. You need to have a safelist.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.iana.org/assignments/http-methods/http-methods.xhtml is precise about the safe methods, but you have to click through to get cacheability defined, and the WebDAV methods leave it out. In an IETF draft, I'm pretty sure I can list the current methods that satisfy this (GET and HEAD), and in the browser-side loading living spec I can just list those expecting the list to get updated if any new ones get added.

I also note that Fetch uses "unsafe" without specifying it more precisely.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the cache section, which is somewhat optional in a way.

* *R*\[':method'] is an HTTP method defined as cacheable (Section 4.2.3 of
{{!RFC7231}}) and safe (Section 4.2.1 of {{!RFC7231}}), as required for
PUSH_PROMISEd requests (Section 8.2 of {{?RFC7540}}).
* *R*\[':url'] is an absolute URI (Section 4.3 of {{!RFC3986}}).
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear how to validate this in browsers. There's no such primitive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe one identifies an absolute-URL string by running the URL parser with no base URL and then checking that it doesn't have a fragment.

Are you complaining that I'm not linking to the URL spec, that this isn't phrased as a parsing algorithm, both, or something else?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I go with all three? E.g., http://example.com/% is not an "absolute URI", but will parse per the URL parser.

A valid request item *R* is interpreted as an HTTP request by interpreting
*R*\[':method'] as the request's method (Section 4 of {{!RFC7231}}),
*R*\[':url'] as the request's effective request URI (Section 5.5 of
{{!RFC7230}}), and the remaining key/value pairs as the request's header fields.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those key/value pairs are suitably constrained somehow? What's the parsing for them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're HTTP headers, so we should do whatever https://fetch.spec.whatwg.org/#http-network-fetch does.

Or are you worried about how claimed request headers are matched against browser request headers? That's not written yet. I was hoping to use the matching rules for PUSH_PROMISEs, but those aren't written yet either.

Copy link

@annevk annevk Mar 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well they're not quite the same as ordinary HTTP headers, since they are delimited differently, right? E.g., can you have a 0x0A byte in the value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP/2 headers are also delimited in a way that lets their values include a 0x0A byte. I wonder if browsers handle that correctly...

I suspect the right thing to do here is to explicitly create a header list from this, but since that concept doesn't exist in the IETF, I'll have to give up on the idea of making this an IETF draft.

I'm going to ping some folks to see if I'll hurt any feelings by doing that.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh wow, I didn't know that about H/2. That's another reason to get a web-platform-tests setup.

@annevk
Copy link

annevk commented Mar 26, 2018

(Note that I replied even in the feedback GitHub considers "outdated".)

@annevk
Copy link

annevk commented Mar 26, 2018

Are you going to leave it to Fetch to drop the response body for a HEAD request? Or will that be an error to codify? It seems to me that you'd want to forbid HEAD to be encoded and only support it on the retrieval side and do the appropriate translations to GET minus response body there.

Copy link
Member Author

@jyasskin jyasskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for skipping the "Are there multiple levels where things can fail?" question earlier. My plan is that there are 2 levels where things can fail:

  1. The browser parses the metadata (https://jyasskin.github.io/webpackage/bundles/draft-yasskin-dispatch-bundled-exchanges.html#load-metadata), and if anything fails there, the whole bundle is invalid.
  2. Then for any response the browser actually wants to use, it parses that particular response (https://jyasskin.github.io/webpackage/bundles/draft-yasskin-dispatch-bundled-exchanges.html#load-response). If that fails, only that particular response is invalid, but the bundle can still contain other responses.

Thanks for saying that's not adequately clear; I'll try to fix it.

I hadn't thought about how exactly to constrain HEAD. If you think the right approach is to treat it as an automatic transformation from GET, I'll do that.

None of this is done yet. I'll comment again when it is.

A valid request item *R* is interpreted as an HTTP request by interpreting
*R*\[':method'] as the request's method (Section 4 of {{!RFC7231}}),
*R*\[':url'] as the request's effective request URI (Section 5.5 of
{{!RFC7230}}), and the remaining key/value pairs as the request's header fields.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP/2 headers are also delimited in a way that lets their values include a 0x0A byte. I wonder if browsers handle that correctly...

I suspect the right thing to do here is to explicitly create a header list from this, but since that concept doesn't exist in the IETF, I'll have to give up on the idea of making this an IETF draft.

I'm going to ping some folks to see if I'll hurt any feelings by doing that.

@annevk
Copy link

annevk commented Mar 27, 2018

(Another thought I had was to take inspiration from the Cache API, as sort of the API surface on top of this format. The Cache API only supports GET thus far, but it does have Vary support. Haven't fully explored whether it's a complete fit, but it's interesting to think about as a way to avoid introducing too many new primitives.)

@jyasskin
Copy link
Member Author

I believe this is finally ready for a re-review. It's missing 3 things that are going to be important, but that I'd like to do in a subsequent patch:

  1. The algorithm to take the request a browser would generate, e.g. with multiple values in its Accept header, and find a matching request inside a bundle, where requests will generally only have 1 value in their Accept headers.
  2. Compression: the index can probably just be brotli-compressed, but response headers should be able to take advantage of a shared dictionary.
  3. Signatures covering multiple resources: this both saves on signature verification costs and ensures that resources have matching versions.

Copy link
Collaborator

@kinu kinu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skimmed through, had some nit comments / questions (non blocking).


## Stream attributes and operations {#stream-operations}

* A sequence of **available** bytes. As the stream delivers bytes, these are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should bolding be applied to available bytes ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

the 4-item array initial byte and 8-byte bytestring initial byte, followed by
🌐📦 in UTF-8), return an error.

1. Let `sectionOffsetsLength be the result of getting the length of the CBOR
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: sectionOffsetsLength (missing the closing `)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

+ offset`. That is, offsets in the index are relative to the start of the
"responses" section.
1. If `offset + length` is greater than
`section-offsets\["responses"].length`, return an error.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra \

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.


1. Set `metadata`'s "manifest" item to `url`.

### Parsing the critical section {#critical-section}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't fully unsure what could come here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add more sections in the future, and the publisher wants to make sure parsers don't just ignore one of those sections because they don't understand it, this section points out the ones not to skip.

I'm not certain this is the right way to do this. PNG does it with a particular capitalization in the section names: https://tools.ietf.org/html/rfc2083#page-13

struct to fill in, the parser MUST do the following:

1. Let `critical` be the result of parsing `sectionContents` as a CBOR item
matching the above `critical` rule ({{parse-cbor}}. If `critical` is an
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing ) after {{parse-cbor}}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

TODO: Add the rest of the details of creating a `ReadableStream` object.

1. Let `response` be a new response ({{FETCH}}) whose:
* Url list is `request`'s url list,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Parsing the index section" seems to say request has one url, do you mean it could have a list / redirect chain?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an asymmetry in how requests and responses default their values: https://fetch.spec.whatwg.org/#concept-request-url-list defaults from the request's url, while https://fetch.spec.whatwg.org/#concept-response-url is hard-coded to point to the response's url list. I believe both URL lists should always have 1 value for exchanges in the bundle.

I suspect Fetch will actually wind up copying or extending this response when serving browser-generated requests from bundles, so that a redirect into a bundle can work correctly.

and {{semantics-load-response}} each of which can return an error instead of
their normal result.

## Stream attributes and operations {#stream-operations}
Copy link
Member Author

@jyasskin jyasskin May 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@domenic, here's the use of streams I mentioned on #whatwg. I'm thinking about re-doing it in terms of ReadableStream despite your advice because https://jyasskin.github.io/webpackage/bundles/draft-yasskin-dispatch-bundled-exchanges.html#load-response needs to create a body's stream, which is defined as a ReadableStream. The stream a bundle's parsed from is likely to be a response body anyway, so it'll probably be a ReadableStream even if I pretend it isn't.

What do you think?

Copy link

@annevk annevk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still find it a little hard to digest how the whole thing fits together, but I tried to review regardless.


--- abstract

Bundled exchanges provide a way to bundle up groups of HTTP request+response
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request-response?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to emphasize that it's both. A hyphen doesn't feel quite right because "request" doesn't modify "response". A slash feels like it might be alternation, although https://infra.spec.whatwg.org/#pair uses it the way I intend here.

Bundled exchanges provide a way to bundle up groups of HTTP request+response
pairs to transmit or store them together. The component exchanges can be signed
using {{?I-D.yasskin-http-origin-signed-responses}} to establish their
authenticity.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave this out, since the signing is controversial and not required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't intend to shy away from the controversial bits, but you're right that it's not required, so maybe it doesn't belong in the abstract.

I do expect to add a section to bundles to efficiently encode multi-resource signatures with the same semantics as multiple signed-exchanges, at which point it'll come back to the abstract.


Discussion of this draft takes place on the ART area mailing list
(art@ietf.org), which is archived
at <https://mailarchive.ietf.org/arch/search/?email_list=art>.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems discussion also takes place on GitHub?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's covered by the "issues list" comment in the next paragraph. HTTPWG drafts have similar text, and also discuss on GitHub issues.

## Terminology

Exchange (noun)
: An HTTP request/response pair. This can either be a request from a client and
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd pick one separator and stick with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks.

error or because the stream has a finite length.
* A **seek** operation to change the current offset, relative to either the
beginning of the available bytes or to the old current offset. A seek past the
end of the available bytes is an *error in this specification*.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"error in this specification" doesn't appear to mean anything. Wouldn't it be better to say that seek or read can return an error and that callers of them need to account for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended the same meaning as in https://infra.spec.whatwg.org/#assertions, but looking at the uses of "wait", "seek", and "read", every seek and read except the ones to the start of the stream was immediately preceded by a wait, so I may as well incorporate the wait into the other operations.


requests

: A map ({{INFRA}}) whose keys are the HTTP requests ({{FETCH}}) for the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What Fetch defines as "requests" are not necessarily HTTP requests. They're more broad.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the best way to say that the keys here come from the subset of Fetch "requests" that are HTTP requests? I don't think it makes sense to include the other schemes mentioned in https://fetch.spec.whatwg.org/#url.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even those are more broad as they carry fields HTTP requests don't have. I guess you could say requests whose url list contains a single URL whose scheme is an HTTP(S) scheme (unless this is restricted to secure contexts?).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the attempt to subset the kinds of Fetch requests this operation can return. If it would noticeably simplify the request-matching function I need to write, I might add the subsetting back later, but I don't really expect that.

{{semantics-load-metadata}}, while a client will generally want to load the
response for a request that the client generated. This specification does not
define how a client determines the best available bundled response, if any, for
that client-generated request.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? We define this for the Cache API, I think ideally we use the same algorithm here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly we're going to have to define that algorithm somewhere. I don't have a strong opinion between here and the WICG/Fetch side, but I was leaning toward WICG/Fetch, and would definitely prefer to do it in a separate PR even if you prefer this spec.

Unfortunately, https://w3c.github.io/ServiceWorker/#query-cache-algorithm assumes that varied headers will be identical between the cached request and the queried request, which isn't plausible for a bundle intended to be used by more than one UA.

For example, imagine that Greta's UA sends Accept-Language: de-DE,de,en;q=0.9. In the Cache API, that's exactly the request header that'll be cached since the Cache API uses https://fetch.spec.whatwg.org/#concept-fetch to fill it in. Similarly, if Hilda's UA sends Accept-Language: de-AT,de;q=0.9, that's what her cache will contain. If the server wants to put a de resource in a bundle to serve both users, what Accept-Language header should they write? There isn't one that Query-Cache will find for both people. The same problem shows up for Accept across browsers instead of people. So we need a new algorithm.


The bundle is roughly a CBOR item ({{?I-D.ietf-cbor-7049bis}}) with the
following CDDL ({{?I-D.ietf-cbor-cddl}}) schema, but bundle parsers are required
to successfully parse some byte strings that aren't valid CBOR. For example,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still use in CBOR then? Doesn't the MIME type lie with +cbor if you cannot actually use a CBOR parser?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm primarily using CBOR instead of a totally custom binary format because 1) ASCII/UTF-8 formats like JSON are nice because you can read them without custom parsers, and 2) we need a binary format for this, but there are likely to be generic tools to dump CBOR to a readable format, so that preserves as much of the ASCII benefit as possible.

I would actually like to say that bundles must be well-formed CBOR items, but because most parsers won't actually work that way, I expect non-CBOR bundles to wind up existing in the wild. Do you think I should say that clients MAY reject bundles that aren't a valid CBOR item? Or something else?

`knownSections` says not to process other sections, add those sections' names
to `ignoredSections`.

1. Let `metadata` be a struct ({{INFRA}}) with no items.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea with structs is that they are immutable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean that the set of item names is immutable. It's simple to change this to a map, so I've done that. The set of keys is only not-fixed because I wanted to let other specifications add section names that might define new metadata, but I'm not totally sure that's a good idea, so this might go back to a struct later.

1. Continue.
1. If `name` or `value` doesn't satisfy the requirements for a header in
{{FETCH}}, return an error.
1. If `headers` contains ({{FETCH}}) `name`, return an error.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot have multiple headers of the same name? Might be worth calling out you cannot encode cookies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe multiple Cookie headers would be encoded by joining them with semicolons (not that you'd generally put cookies in bundled requests), but yeah, Set-Cookie has no single-field form. Noted.

(This line is actually redundant with the fact that CBOR maps exclude duplicate keys, so I've replaced it with an Assert.)

@jyasskin jyasskin force-pushed the bundles branch 2 times, most recently from a9f8f25 to 03a5f78 Compare May 23, 2018 23:11
@annevk
Copy link

annevk commented May 24, 2018

Do you think I should say that clients MAY reject bundles that aren't a valid CBOR item? Or something else?

  1. I think we want it to be fully deterministic across all conforming (browser) implementations how they would handle all possible sequences of bytes. I.e., at least as good as the HTML parser.
  2. I don't think we should use +cbor if that isn't always the case or required to be the case.

Re: "best available bundled response" I'd prefer it as an open issue if you plan on addressing it rather than it saying it's implementation-defined.

(Note that we might get rid of pairs still: whatwg/infra#127.)

Since clients parse by following byte offsets, they won't enforce that the
format is CBOR, which means some instances of it probably won't be. Putting
+cbor in the MIME type would be misleading.
@jyasskin
Copy link
Member Author

  1. I haven't added the MAY.
  2. I've removed the +cbor from the MIME type.
  3. I marked the "best available bundled response" as a TODO.

Thanks!

Load Metadata will only return a subset of all possible requests, but it
doesn't seem important to specify that subset precisely here.
@jyasskin
Copy link
Member Author

I'm going to merge this so it's easier to refer to and so I can start sending PRs to modify it, but please feel free to keep commenting and filing issues against it.

@jyasskin jyasskin merged commit f3459fa into WICG:master May 29, 2018
@jyasskin jyasskin deleted the bundles branch December 1, 2018 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants