Decoding concatenated data items #75

Closed
devinivy opened this issue Feb 13, 2023 · 6 comments
@devinivy

It would be convenient to be able to decode concatenated data items, which has applications in streaming settings, as suggested in the streaming section (link) of the spec:

In a streaming application, a data stream may be composed of a sequence of CBOR data items concatenated back-to-back. In such an environment, the decoder immediately begins decoding a new data item if data is found after the end of a previous data item.

If you are interested in catering to the use-case but don't want to add a new API for it, one lightweight way to support this might be to make the default Tokeniser class public, which I expect would make it simpler for folks to add this support themselves as needed. There are also precedents for libraries supporting this through a streaming interface or a separate method specifically for decoding multiple data items.
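To make the use-case concrete, the kind of loop I'm imagining is sketched below; decodeOne() is purely hypothetical here, standing in for whatever primitive the library might expose to decode a single item and report how many bytes it consumed:

```js
// Hypothetical sketch only: decodeOne() is not part of cborg's API.
// Assume it decodes one data item from the front of `bytes` and returns
// the value along with the number of bytes consumed.
function decodeAll (bytes, options) {
  const items = []
  while (bytes.length > 0) {
    const [value, consumed] = decodeOne(bytes, options)
    items.push(value)
    bytes = bytes.subarray(consumed) // advance past the decoded item
  }
  return items
}
```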

@rvagg
Owner

rvagg commented Feb 14, 2023

Yeah, I don't mind doing this as an opt-in. The default should be strictly no trailing bytes, but there are reasonable cases for allowing the decoder to find the boundaries. In fact, just 4 days ago I added support for this exact kind of case in the Go dag-cbor implementation: ipld/go-ipld-prime#490! It was a bit easier there, though, because the decoder is prepared for it.

We have a problem here of the tokeniser taking a fully materialised Uint8Array and operating on it under the assumption that the data is all there. In a streaming situation you're not going to have it all, or at least you don't know whether you have it all. So it might not be as easy as just exposing the Tokeniser. Or can you see a way to deal with it as it is if you just get access to the default Tokeniser? I don't mind making that public; I used it in the json version (https://github.com/rvagg/cborg/blob/master/lib/json/json.js), which could just as easily have been a separate package, but it wasn't, so I didn't need to expose it.

One solution is to make an interface that has just enough functionality to work over iterables or plain Uint8Arrays interchangeably. I did that over in js-car; see https://github.com/ipld/js-car/blob/master/src/coding.ts#L52-L66 for what the interface looks like to make that work. The big downside of that is the async/sync difference. Originally I did it all async, but when we added a sync-only form so that the top-level API could be sync, it required a lot of reworking of the innards, and we ended up with those two BytesReader variants, which is kind of awkward. If we have to sniff for Promise objects everywhere we touch it in order to determine whether to await, it's going to significantly slow down parsing for the common case of "I have all the bytes", which is what I'd like to keep relatively fast.
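For a rough idea of the sync half, a minimal reader over a fully materialised Uint8Array might look something like the sketch below (method names are approximate, see the js-car link above for the real interface; the async variant has the same shape but returns Promises):

```js
// Approximate sketch of a sync BytesReader-style interface; not the
// actual js-car implementation.
function bytesReader (bytes) {
  let pos = 0
  return {
    get pos () { return pos },
    // peek at up to `length` bytes without consuming them
    upTo (length) {
      return bytes.subarray(pos, pos + Math.min(length, bytes.length - pos))
    },
    // peek at exactly `length` bytes, erroring if they aren't available
    exactly (length) {
      if (pos + length > bytes.length) {
        throw new Error('unexpected end of data')
      }
      return bytes.subarray(pos, pos + length)
    },
    // consume `length` bytes
    seek (length) {
      pos += length
    }
  }
}
```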

@devinivy
Author

Ah nice, it sounds like this has been on your mind recently, then! It would be useful to have an interface around async iterables for generic streaming, but at least in my case today a synchronous interface would do the trick. I hold all the bytes for some unknown number of concatenated data items, and I'm looking to decode each of them.

I am not 100% sure about the approach with the Tokeniser class, but I will dig into this a little deeper. It would be nice to land on something that works for the synchronous case but could also be a stepping stone to async, similar to what you've done there with js-car. Maybe those don't both need to land at the same time. I can totally appreciate how having both sync and async interfaces opens a can of worms for the implementation, and for the common path of simply wanting to decode some bytes.

@rvagg
Owner

rvagg commented Feb 16, 2023

Well, if you have all the bytes, then that certainly makes it easier.

The main barrier is the erroring if the tokeniser isn't done() (cborg/lib/decode.js, lines 189 to 191 in 7f8bcee):

```js
if (!tokeniser.done()) {
  throw new Error(`${decodeErrPrefix} too many terminals, data makes no sense`)
}
```

We're adjusting this behaviour with a DontParseBeyondEnd option in Go (https://github.com/ipld/go-ipld-prime/blob/200b4a6b6fb6720911cb385aff05da55c39d56de/codec/dagcbor/unmarshal.go#LL64C2-L64C20), "Parse" meaning "look for more bytes and error if there are any".

Essentially what you probably want to do is parse a chunk, get the object and the byte length (pos), then subarray your Uint8Array by that length and repeat. So you need an instruction to not error if there are more bytes, just halt, and a way to get the length.

Some options:

  • Use the Tokeniser directly and replicate the code in decode() except for that done() check. I think you'd just need to add it to the exports for this to work for you.
  • Add a decode option to not error on done(). Unfortunately, you'd have no way of knowing the byte length of the decoded object (I think), so you couldn't subarray your input bytes.
  • Add a new decodeFirst() that does this behaviour, no need for an option, and have it return the length too. We did this not too long ago for decoding CIDs and that's not an unpleasant API: https://github.com/multiformats/js-multiformats/blob/debcdda652b82b27ffbf0678383145ba094685d2/src/cid.js#L395 - if you scroll up you'll see decode() using this and checking that the return is zero-length; we could even copy that behaviour here instead of using done(). So it could be decodeFirst(data:Uint8Array, options:DecodeOptions):[any, Uint8Array], where the second return value is a subarray. Or it could return [any, number] and tell you the byte length so you can subarray it yourself. I think either would be fine; usage for the tuple form is sketched below.
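For the [any, Uint8Array] form, a consumer loop might look something like this sketch (decodeFirst() here follows the signature proposed above; nothing is implemented yet):

```js
// Sketch only: decodeFirst() as proposed above, not (yet) part of cborg.
// Decodes each of an unknown number of concatenated data items.
function * decodeConcatenated (bytes, options) {
  while (bytes.length > 0) {
    const [value, remainder] = decodeFirst(bytes, options)
    yield value
    bytes = remainder // with [any, number] instead: bytes = bytes.subarray(length)
  }
}
```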

@rvagg rvagg mentioned this issue Sep 10, 2023
rvagg added a commit that referenced this issue Sep 11, 2023
BREAKING CHANGE

Implementations of `Tokenizer` must now implement a pos() method to be
compatible. This should only impact advanced users of cborg.

Ref: #75
@rvagg
Owner

rvagg commented Sep 11, 2023

I'm proposing a feature that should suit this, at least in the non-async case where you think you have enough bytes to find a cbor chunk at the start of it: #93

@rvagg
Owner

rvagg commented Sep 12, 2023

decodeFirst() was included in 4.0.0; I'm going to close this for now, considering it done, but I know it's not quite what you might want if you were properly streaming and encountering an async boundary. If you want to outline that use-case, perhaps we can consider possibilities for moving forward. There might be a way to avoid making everything async internally by buffering up pieces, being able to query whether you're "done" or not, and popping an object out when one is available. That way you can stream data in, push it into cborg, and pop objects out as they are complete.
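Roughly the shape that could take, as a purely hypothetical sketch (none of these names exist in cborg today):

```js
// Hypothetical sketch: push bytes in as they arrive and pop fully decoded
// data items out, with no async innards in cborg itself.
async function consume (stream, handle) {
  const decoder = new IncrementalDecoder() // hypothetical incremental decoder
  for await (const chunk of stream) {
    decoder.push(chunk) // buffer the incoming bytes
    while (decoder.hasCompleteItem()) { // enough buffered for a whole item?
      handle(decoder.pop()) // decode one item and hand it off synchronously
    }
  }
}
```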

Feel free to reopen or open a new issue if you want to outline what else might be needed beyond decodeFirst().

@rvagg rvagg closed this as completed Sep 12, 2023
@devinivy
Author

This is great. Actually for our current use-case I think this will do the trick just fine. If we happen to run into the async use-case, we'll open an issue with some details. Appreciate it!
