Decoding concatenated data items #75

Closed
devinivy opened this issue Feb 13, 2023 · 6 comments
@devinivy

It would be convenient to be able to decode concatenated data items, which has applications in streaming settings, as suggested in the streaming section (link) of the spec:

In a streaming application, a data stream may be composed of a sequence of CBOR data items concatenated back-to-back. In such an environment, the decoder immediately begins decoding a new data item if data is found after the end of a previous data item.

If you are interested in catering to the use-case but don't want to add a new API for it, one lightweight way to support this might be to make the default Tokeniser class public, which I expect would make it simpler for folks to add this support themselves as needed. There are also precedents for libraries supporting this through a streaming interface or a separate method specifically for decoding multiple data items.
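To make the use-case concrete, the kind of loop I'm imagining is sketched below; decodeOne() is purely hypothetical here, standing in for whatever primitive the library might expose to decode a single item and report how many bytes it consumed:

```js
// Hypothetical sketch only: decodeOne() is not part of cborg's API.
// Assume it decodes one data item from the front of `bytes` and returns
// the value along with the number of bytes consumed.
function decodeAll (bytes, options) {
  const items = []
  while (bytes.length > 0) {
    const [value, consumed] = decodeOne(bytes, options)
    items.push(value)
    bytes = bytes.subarray(consumed) // advance past the decoded item
  }
  return items
}
```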

@rvagg
Owner

rvagg commented Feb 14, 2023

Yeah, I don't mind doing this as an opt-in. The default should be strictly no trailing bytes, but there are reasonable cases for allowing the decoder to find the boundaries. In fact, just 4 days ago I added support for this exact kind of case in the Go dag-cbor implementation: ipld/go-ipld-prime#490! It was a bit easier there, though, because the decoder is prepared for it.

We have a problem here of the tokeniser taking a fully materialised Uint8Array and operating on it under the assumption that the data is all there. In a streaming situation you're not going to have it all, or at least you don't know whether you have it all. So it might not be as easy as just exposing the Tokeniser. Or can you see a way to deal with it as it is if you just get access to the default Tokeniser? I don't mind making that public; I used it in the json version (https://github.com/rvagg/cborg/blob/master/lib/json/json.js), which could just as easily have been a separate package, but it wasn't, so I didn't need to expose it.

One solution is to make an interface that has just enough functionality to work over iterables or plain Uint8Arrays interchangeably. I did that over in js-car; see https://github.com/ipld/js-car/blob/master/src/coding.ts#L52-L66 for what the interface looks like to make that work. The big downside of that is the async/sync difference. Originally I did it all async, but when we added a sync-only form so that the top-level API could be sync, it required a lot of reworking of the innards, and we ended up with those two BytesReader variants, which is kind of awkward. If we have to sniff for Promise objects everywhere we touch it in order to determine whether to await, it's going to significantly slow down parsing for the common case of "I have all the bytes", which is what I'd like to keep relatively fast.
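For a rough idea of the sync half, a minimal reader over a fully materialised Uint8Array might look something like the sketch below (method names are approximate, see the js-car link above for the real interface; the async variant has the same shape but returns Promises):

```js
// Approximate sketch of a sync BytesReader-style interface; not the
// actual js-car implementation.
function bytesReader (bytes) {
  let pos = 0
  return {
    get pos () { return pos },
    // peek at up to `length` bytes without consuming them
    upTo (length) {
      return bytes.subarray(pos, pos + Math.min(length, bytes.length - pos))
    },
    // peek at exactly `length` bytes, erroring if they aren't available
    exactly (length) {
      if (pos + length > bytes.length) {
        throw new Error('unexpected end of data')
      }
      return bytes.subarray(pos, pos + length)
    },
    // consume `length` bytes
    seek (length) {
      pos += length
    }
  }
}
```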

@devinivy
Author

Ah nice, it sounds like this has been on your mind recently, then! It would be useful to have an interface around async iterables for generic streaming, but at least in my case today a synchronous interface would do the trick. I hold all the bytes for some unknown number of concatenated data items, and I'm looking to decode each of them.

I am not 100% sure about the approach with the Tokeniser class, but I will dig into this a little deeper. It would be nice to land on something that works for the synchronous case but could also be a stepping stone to async, similar to what you've done there with js-car. Maybe those don't both need to land at the same time. I can totally appreciate how having both sync and async interfaces opens a can of worms for the implementation, and for the common path of simply wanting to decode some bytes.

@rvagg
Owner

rvagg commented Feb 16, 2023

Well, if you have all the bytes, then that certainly makes it easier.

The main barrier is the erroring if the tokeniser isn't done() (cborg/lib/decode.js, lines 189 to 191 in 7f8bcee):

```js
if (!tokeniser.done()) {
  throw new Error(`${decodeErrPrefix} too many terminals, data makes no sense`)
}
```

We're adjusting this behaviour with a DontParseBeyondEnd option in Go (https://github.com/ipld/go-ipld-prime/blob/200b4a6b6fb6720911cb385aff05da55c39d56de/codec/dagcbor/unmarshal.go#LL64C2-L64C20), "Parse" meaning "look for more bytes and error if there are any".

Essentially what you probably want to do is parse a chunk, get the object and the byte length (pos), then subarray your Uint8Array by that length and repeat. So you need an instruction to not error if there are more bytes, just halt, and a way to get the length.

Some options:

  • Use the Tokeniser directly and replicate the code in decode() except for that done() check. I think you'd just need to add it to the exports for this to work for you.
  • Add a decode option to not error on done(). Unfortunately, you'd have no way of knowing the byte length of the decoded object (I think), so you couldn't subarray your input bytes.
  • Add a new decodeFirst() that does this behaviour, no need for an option, and have it return the length too. We did this not too long ago for decoding CIDs and that's not an unpleasant API: https://github.com/multiformats/js-multiformats/blob/debcdda652b82b27ffbf0678383145ba094685d2/src/cid.js#L395 - if you scroll up you'll see decode() using this and checking that the return is zero-length; we could even copy that behaviour here instead of using done(). So it could be decodeFirst(data:Uint8Array, options:DecodeOptions):[any, Uint8Array], where the second return value is a subarray. Or it could return [any, number] and tell you the byte length so you can subarray it yourself. I think either would be fine; usage for the tuple form is sketched below.
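For the [any, Uint8Array] form, a consumer loop might look something like this sketch (decodeFirst() here follows the signature proposed above; nothing is implemented yet):

```js
// Sketch only: decodeFirst() as proposed above, not (yet) part of cborg.
// Decodes each of an unknown number of concatenated data items.
function * decodeConcatenated (bytes, options) {
  while (bytes.length > 0) {
    const [value, remainder] = decodeFirst(bytes, options)
    yield value
    bytes = remainder // with [any, number] instead: bytes = bytes.subarray(length)
  }
}
```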

@rvagg rvagg mentioned this issue Sep 10, 2023
rvagg added a commit that referenced this issue Sep 11, 2023
BREAKING CHANGE

Implementations of `Tokenizer` must now implement a pos() method to be
compatible. This should only impact advanced users of cborg.

Ref: #75
@rvagg
Owner

rvagg commented Sep 11, 2023

I'm proposing a feature that should suit this, at least in the non-async case where you think you have enough bytes to find a cbor chunk at the start of it: #93

@rvagg
Owner

rvagg commented Sep 12, 2023

decodeFirst() was included in 4.0.0; I'm going to close this for now, considering it done, but I know it's not quite what you might want if you were properly streaming and encountering an async boundary. If you want to outline that use-case, perhaps we can consider possibilities for moving forward. There might be a way to avoid making everything async internally by buffering up pieces, being able to query whether you're "done" or not, and popping an object out when one is available. That way you can stream data in, push it into cborg, and pop objects out as they are complete.
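Roughly the shape that could take, as a purely hypothetical sketch (none of these names exist in cborg today):

```js
// Hypothetical sketch: push bytes in as they arrive and pop fully decoded
// data items out, with no async innards in cborg itself.
async function consume (stream, handle) {
  const decoder = new IncrementalDecoder() // hypothetical incremental decoder
  for await (const chunk of stream) {
    decoder.push(chunk) // buffer the incoming bytes
    while (decoder.hasCompleteItem()) { // enough buffered for a whole item?
      handle(decoder.pop()) // decode one item and hand it off synchronously
    }
  }
}
```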

Feel free to reopen or open a new issue if you want to outline what else might be needed beyond decodeFirst().

@rvagg rvagg closed this as completed Sep 12, 2023
@devinivy
Author

This is great. Actually for our current use-case I think this will do the trick just fine. If we happen to run into the async use-case, we'll open an issue with some details. Appreciate it!
