Extending the built-in type set with (a) tagged union(s) isn't supported? #140

Closed
goodboy opened this issue Jul 6, 2022 · 12 comments

@goodboy

goodboy commented Jul 6, 2022

My use case: handling an IPC stream of arbitrary object messages, specifically with msgpack. I desire to use Structs for custom object serializations that can be passed between memory boundaries.


My presumptions were originally:

  • a top level decoder is used to process msgpack data over an IPC channel
  • by default, i'd expect that decoder will decode using the python type set to be able to accept arbitrary msgpack bytes and tagged msgspec.Structs
  • if a custom tagged struct was placed inside some std python type (aka embedded), i'd expect this decoder (if enabled as such) to be able to detect the tagged object field (say {"type": "CustomStruct", "field0": "blah"}) and automatically know that the embedded msgpack object is one of our custom tagged structs and should be decoded as a CustomStruct.

Conclusions

Based on below thread:

  • you can't easily define the std type set and a custom tagged struct using Union
  • Decoder(Any | Struct) won't work even for top level Structs in the msgpack frame

This took me a (little) while to figure out because the docs didn't have an example for this use case, but if you want to create a Decoder that will handle a Union of tagged structs and it will still also process the standard built-in type set, you need to specify the subset of the std types that don't conflict with Struct as per @jcrist's comment in the section starting with:

This is not possible, for the same reason as presented above. msgspec forbids ambiguity.

So Decoder(Any | MyStructType) will not work.

I had to dig into the source code to figure this out and it's probably worth documenting this case for users?


Alternative solutions:

It seems there is no built-in way to handle an arbitrary serialization encode-stream that you wish to decode into the default set as well as be able to decode embedded tagged Struct types.

But, you can do a couple other things inside custom codec routines to try and accomplish this:

  • create a custom boxed Any struct type, as per @jcrist's comment under the section starting with:

    Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.

  • consider creating a top-level boxing Msg type and then using msgspec.Raw and a custom decoder table to decode payload msgpack data as in my example below

@jcrist
Owner

jcrist commented Jul 7, 2022

I'm sorry, I'm not sure I understand this issue? What are you trying to do here?

but if you want to create a Decoder that will handle a Union of tagged structs and it will still also process the standard built-in type set, you need to do something like

Note that Decoder(Any | anything) is equal to Decoder(Any) (which is the same as Decoder() or the default msgspec.json.decode(...)) - only the default types will be decoded.

In [1]: import msgspec

In [2]: class Point(msgspec.Struct):
   ...:     x: int
   ...:     y: int
   ...: 

In [3]: msg = b'{"x": 1, "y": 2}'

In [4]: msgspec.json.decode(msg, type=Point)  # returns a Point object
Out[4]: Point(x=1, y=2)

In [5]: from typing import Union, Any

In [6]: msgspec.json.decode(msg, type=Union[Point, Any])  # returns a raw dict, same as the default below
Out[6]: {'x': 1, 'y': 2}

In [7]: msgspec.json.decode(msg)
Out[7]: {'x': 1, 'y': 2}

@goodboy
Author

goodboy commented Jul 7, 2022

@jcrist sorry if i'm not being clear.

It's your first case if I'm not mistaken:

msgspec.json.decode(msg, type=Point)

This will not decode, for example, tuple (or any other default built-in python type) like the default version:

[ins] In [7]:  msgspec.json.Decoder(Point).decode(msgspec.json.encode((1, 2)))
---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 msgspec.json.Decoder(Point).decode(msgspec.json.encode((1, 2)))

DecodeError: Expected `object`, got `array`

However, if you do the union including list, then it works:

[ins] In [9]: from typing import Any

[nav] In [10]:  msgspec.json.Decoder(Point | list).decode(msgspec.json.encode((1, 2)))
Out[10]: [1, 2]

hopefully that's clearer in terms of what i was trying to describe 😂


UPDATE:
The more explicit desire i have is detailed in responses below.

@goodboy
Author

goodboy commented Jul 7, 2022

Ahh I also see what you mean now, which I didn't anticipate:

[nav] In [13]: class Point(msgspec.Struct, tag=True):
          ...:     x: int
          ...:     y: int

[nav] In [14]:  msgspec.json.Decoder(Point | Any).decode(msgspec.json.encode(Point(1, 2)))
Out[14]: {'type': 'Point', 'x': 1, 'y': 2}

[nav] In [15]:  msgspec.json.Decoder(Point).decode(msgspec.json.encode(Point(1, 2)))
Out[15]: Point(x=1, y=2)

That actually is super non-useful to me; i would expect the Point decode to still work no?
Is there no way to create a decoder that will decode built-ins as well as custom tagged (union) structs?

Like, does a struct always have to be the outer most decoding type?

I was actually going to create another issue saying something like embedded structs don't decode (like say a Struct inside a dict) but I'm seeing now that it's actually this limitation that's the real issue?

@goodboy
Author

goodboy commented Jul 7, 2022

I'm sorry, I'm not sure I understand this issue? What are you trying to do here?

More details of what I'm doing:

  • general IPC messages where you may want to pass (embedded) Structs along side other built-in types over a stream
  • presumably a secondary benefit of using structs is the type validation on decode and not just as some kind of lone container type?
    • i don't really understand if the main thing that needs to be done is built-ins defined on a top level struct, and if so how exactly is one supposed to know which decoder to apply to a multi-typed stream of messages? Or is that just not allowed?

Maybe I'm having a dreadful misconception about all this 😂

@goodboy
Author

goodboy commented Jul 7, 2022

Ahh so I see why this feels odd, it seems the limitation is really due to the use of typing.Union?

from typing import Union

import msgspec
from msgspec import Struct


class Point(Struct, tag=True):
    x: float
    arr: list[int]


msgspec.json.Decoder(
    Union[Point] | list
).decode(msgspec.json.encode(Point(1, [2])))

This works just fine, but Union[Point] | list | set or Union[Point] | dict is where you run into problems: a TypeError raised by the Union, union 😂

What's more odd to me is that you can support Structs that contain dicts but not the other way around with tagged structs? Seems to me it should be possible to support a dict[str, Struct] where Struct is tagged?

@goodboy goodboy changed the title from "Extending the built-in type set with (a) tagged union(s) requires an Any" to "Extending the built-in type set with (a) tagged union(s) isn't supported?" Jul 7, 2022
@goodboy
Author

goodboy commented Jul 7, 2022

Ahh so one way to maybe do what I'd like is to use Raw inside of some top level "god" message type?

This python code i think replicates what I thought was going to be the default behavior with tagged union structs:

from contextlib import contextmanager as cm
from typing import Union, Any, Optional

from msgspec import Struct, Raw
from msgspec.msgpack import Decoder, Encoder


class Header(Struct, tag=True):
    uid: str
    msgtype: Optional[str] = None


class Msg(Struct, tag=True):
    header: Header
    payload: Raw


class Point(Struct, tag=True):
    x: float
    y: float


_root_dec = Decoder(Msg)
_root_enc = Encoder()

# sub-decoders for retrieving embedded
# payload data and decoding to a sender
# side defined (struct) type.
_decs:  dict[Optional[str], Decoder] = {
    None: Decoder(Any),
}


@cm
def init(msg_subtypes: list[list[type[Struct]]]):
    for types in msg_subtypes:
        first = types[0]
        
        # register using the default tag_field of "type"
        # which seems to map to the class "name".
        tags = [first.__name__]

        # create a tagged union decoder for this type set
        type_union = Union[first]
        for typ in types[1:]:
            type_union |= typ
            tags.append(typ.__name__)

        dec = Decoder(type_union)
        
        # register all tags for this union sub-decoder
        for tag in tags:
            _decs[tag] = dec
        try:
            yield dec
        finally:
            for tag in tags:
                _decs.pop(tag)


def decmsg(msg: bytes) -> Any:
    msg = _root_dec.decode(msg)
    tag_field = msg.header.msgtype
    dec = _decs[tag_field]
    return dec.decode(msg.payload)


def encmsg(payload: Any) -> bytes:
    
    tag_field = None

    plbytes = _root_enc.encode(payload)
    if b'type' in plbytes:
        assert isinstance(payload, Struct)
        tag_field = type(payload).__name__
        payload = Raw(plbytes)

    msg = Msg(Header('mymsg', tag_field), payload)
    return _root_enc.encode(msg)


if __name__ == '__main__':
    with init([[Point]]):

        # arbitrary struct payload case
        send = Point(0, 1)
        rx = decmsg(encmsg(send))
        assert send == rx

        # arbitrary dict payload case
        send = {'x': 0, 'y': 1}
        rx = decmsg(encmsg(send))
        assert send == rx

I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set (which I've probably emphasized ad nauseam at this point 😂). So for example I could still do my Msg.header business to explicitly limit which message types are allowed in my IPC protocol, but also be able to create a decoder that can (recursively) unwrap embedded structs when needed, instead of trying to do it myself in python.

But, as a (short term) solution I guess the above could be a way to get what I want?

The even more ideal case for me would be that you could embed tagged structs inside other std container data types (dict, list, etc.) and then as an option, a (default) tagged struct + built-ins decoder would be available to just take care of decoding everything automatically in some arbitrary serialization object-frame when needed.

@goodboy
Author

goodboy commented Jul 7, 2022

Heh, actually the more I think about this context-oriented msg type decoding policy, the more i like it. This kind of thing would play super well with structured concurrency.

msg: Msg

with open_msg_context(
    types=[IOTStatustMsg, CmdControlMsg],
    capability_uuid='sd0-a98sdf-9a0ssdf'
) as decoder:

    # this will simply log an error on non-enabled payload msg types
    payload = decoder.decode(msg)

@jcrist
Owner

jcrist commented Jul 7, 2022

Sorry, there's a lot above, I'll try to respond to what I think are your current issues.

Like, does a struct always have to be the outer most decoding type?

What's more odd to me is that you can support Structs that contain dicts but not the other way around with tagged structs? Seems to me it should be possible to support a dict[str, Struct] where Struct is tagged?

This does work. All types are fully composable, there is no limitation in msgspec requiring structs be at the top level, or that structs can't be subtypes in containers. dict[str, SomeStructType] or dict[str, Union[Struct1, Struct2, ...]] fully work fine. If you have a reproducible example showing otherwise I'd be happy to take a look.

Works just fine, but if you try Union[Point] | list | set or Union[Point] | dict is where you run into problems.. TypeError raised by the Union

Side note - when posting comments referring to errors, it's helpful to include the full traceback so we're all on the same page. Right now I'm left guessing what you're seeing raising the type error.

First, there's no difference in encoding/decoding support between Unions of tagged structs and structs in general. Also, Union[SomeType] is always the same as just SomeType, no need for the extra union. So your simplified examples are:

import msgspec
from typing import Union


class Point(msgspec.Struct):
    x: int
    y: int


for typ in [Union[Point, list, set], Union[Point, dict], Union[int, list, dict]]:
    print(f"Trying a decoder for {typ}...")
    try:
        msgspec.json.Decoder(typ)
    except TypeError as exc:
        print(f"  Failed: {exc}")
    else:
        print("  Succeeded")

This outputs:

Trying a decoder for typing.Union[__main__.Point, list, set]...
  Failed: Type unions may not contain more than one array-like (list, set, tuple) type - type `typing.Union[__main__.Point, list, set]` is not supported
Trying a decoder for typing.Union[__main__.Point, dict]...
  Failed: Type unions may not contain both a Struct type and a dict type - type `typing.Union[__main__.Point, dict]` is not supported
Trying a decoder for typing.Union[int, list, dict]...
  Succeeded

Note that the error is coming from creating the msgspec.json.Decoder, not from creating the Union itself. The error messages are echoing the restrictions for unions that are described in the docs (https://jcristharif.com/msgspec/usage.html#union-optional).

In both cases the issue is that the union contains multiple Python types that all map to the same JSON type with no way to determine which one to decode into. Both list and set map to JSON arrays - we can't support both in a union since this would lead to ambiguity when decoding and a JSON array is encountered. Same for Point | dict - both python objects encode as JSON objects, there's no efficient way to determine which type to decode into. int | list | dict is fine, since each of these python types maps to a different JSON type. Tagged Unions provide an efficient way to determine the type to decode at runtime, which is why only tagged structs can coexist within the same union.

I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set

This is not possible, for the same reason as presented above. msgspec forbids ambiguity. Say we try to support what you're asking, given the following schema:

import msgspec

from typing import Any

class Point(msgspec.Struct):
    x: int
    y: int

dec = msgspec.json.Decoder(Point | Any)  # right now this works, but ignores the struct completely since `Any` is present

Given a message like {"x": 1, "y": 2}, you might expect this to return a Point, since it matches the Point schema. But what if we get a message like {"x": 1, "y": 2.0}? Or {"x": 1}? These messages don't match Point, do we error? Or do we return a dict? What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender, not a distinct new message that should be decoded separately. All of this is ambiguous, and can't be done efficiently, which is why msgspec forbids it.

I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set

Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.

import msgspec
from typing import Any, Union


class Msg(msgspec.Struct, tag=True):
    pass


class Msg1(Msg):
    x: int
    y: int


class Msg2(Msg):
    a: int
    b: int


class Custom(Msg):
    obj: Any


enc = msgspec.json.Encoder()
dec = msgspec.json.Decoder(Union[Msg1, Msg2, Custom])


def encode(obj: Any) -> bytes:
    if not isinstance(obj, Msg):
        obj = Custom(obj)
    return enc.encode(obj)


def decode(buf: bytes) -> Any:
    msg = dec.decode(buf)
    if isinstance(msg, Custom):
        return msg.obj
    return msg


buf_msg1 = encode(Msg1(1, 2))
print(buf_msg1)
print(decode(buf_msg1))

buf_custom = encode(["my", "custom", "message"])
print(buf_custom)
print(decode(buf_custom))

Output:

b'{"type":"Msg1","x":1,"y":2}'
Msg1(x=1, y=2)
b'{"type":"Custom","obj":["my","custom","message"]}'
['my', 'custom', 'message']

Note that the builtin message types (Msg1, Msg2) will only be decoded properly if they are top-level objects, since that's what the provided schema expects. Custom messages thus should only be composed of builtin types (dict/list/...) but can then be unambiguously handled. If you also want to handle e.g. lists of the above at the top level (or whatever) you could add that to the union provided to the decoder as well:

MsgTypes = Union[Msg1, Msg2, Custom]
# Decoder expects either one of the above msg types, or a list of the above msg types
decoder = msgspec.json.Decoder(Union[MsgTypes, list[MsgTypes]]) 

@jcrist
Owner

jcrist commented Jul 7, 2022

In the future, large issue dumps like this that are rapidly updated are hard to follow as a maintainer. If you expect a concise and understanding response from me, please put in the effort to organize and present your thoughts in a cohesive manner. While the examples presented in this blogpost aren't 100% relevant for this repository, the general sentiment of "users should provide clear, concise, and reproducible examples of what their issue is" is relevant.

@goodboy
Author

goodboy commented Jul 7, 2022

If you expect a concise and understanding response from me, please put in the effort to organize and present your thoughts in a cohesive manner.

My apologies, I didn't know the root issue I was seeing at the outset, which is why I've tried to update things as I've discovered more, both from using the lib and from seeing what's possible through tinkering.

Also a lot of this is just thinking out loud as a new user, my apologies if that's noisy, hopefully someone else will find it useful if they run into a similar issue.

The main issue I was confused by was mostly this (and i can move this to the top for reference if you want):

  • a top level decoder is used to process msgpack data over an IPC channel
  • by default, i'd expect that decoder will decode using the python type set to be able to accept arbitrary msgpack bytes and tagged msgspec.Structs
  • if a custom tagged struct was placed inside some std python type, i'd expect this decoder (if enabled as such) to be able to detect the tagged object field and automatically know that the embedded msgpack object is one of our custom tagged structs

I do think making some examples of the case I'm describing would be super handy to have in the docs as maybe more of an advanced use case?

The general sentiment of "users should provide clear, concise, and reproducible examples of what their issue is" is relevant.

Totally, originally I thought this was a simple question and now I realize it's a lot more involved; I entirely mis-attributed the real problem to something entirely different, hence my original issue title being ill-informed 😂


In summary, my main issue was more or less addressed in your answer here, which is what I also concluded:

Note that the builtin message types (Msg1, Msg2) will only be decoded properly if they are top-level objects, since that's what the provided schema expects. Custom messages thus should only be composed of builtin types (dict/list/...) but can then be unambiguously handled. If you also want to handle e.g. lists of the above at the top level (or whatever) you could add that to the union provided to the decoder as well:

In other words you can't pass in an arbitrary dict[dict[dict, Struct]] and expect the embedded Struct to be decoded without defining the exact schema hierarchy (at least leading to the Struct field) ahead of time. So in some sense this also is similar to my question in #25.

Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.

So I think this is pretty similar to what i presented in the embedded Raw-payload example i put above; I was just originally looking for any tagged union struct, anywhere in the encoded data, to be automatically decoded no matter where it was situated in the composed data structure hierarchy.

So really, I guess what I am after now is some way to dynamically describe such schemas, maybe even during a struct-msg flow.

Again my apologies about this being noisy, not well formed, ill-attributed; I really just didn't know what the real problem was.

@goodboy
Author

goodboy commented Jul 7, 2022

@jcrist I updated the description to include the summary of everything, hopefully that makes all the noise, back and forth, more useful to onlookers 😎


To just finally summarize and answer all questions you left open for me:

  • This does work. All types are fully composable, there is no limitation in msgspec requiring structs be at the top level, or that structs can't be subtypes in containers. dict[str, SomeStructType] or dict[str, Union[Struct1, Struct2, ...]] fully work fine. If you have a reproducible example showing otherwise I'd be happy to take a look.

Yes, this does work as long as you specify the schema ahead of time, but even still it's not clear to me how you would use some "top level" decoder to decode non-static-schema embedded Struct types. So you have to either know the schema or you have to create some dynamic decoding system as I showed in my longer example.


  • Note that the error is coming from creating the msgspec.json.Decoder, not from creating the Union itself.

Agreed, I mis-attributed the error: msgspec.msgpack.Decoder(Struct | Any) works fine.


In both cases the issue is that the union contains multiple Python types that all map to the same JSON type with no way to determine which one to decode into. Both list and set map to JSON arrays - we can't support both in a union since this would lead to ambiguity when decoding and a JSON array is encountered. Same for Point | dict - both python objects encode as JSON objects, there's no efficient way to determine which type to decode into.

Agreed, but with the case of tagged Struct this isn't true any more right because you can check for the tag_field and decide if it matches one in your struct registry no?

Tagged Unions provides an efficient way to determine the type to decode at runtime, which is why only tagged structs can coexist within the same union.

Ok so this sounds like what I'm asking for is supposed to work right?

This is not possible, for the same reason as presented above. msgspec forbids ambiguity. Say we try to support what you're asking, given the following schema:

But then you say it isn't and give an example with a non-tagged-Struct?

Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any

Yes, this is more or less what I concluded except using Raw is more general and allows the default to be Any and using a header to specify custom/filtered struct types.

What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender,

So i guess the problem here would be decode aliasing due to a tag field collision?
I can see why you might want to just sidestep this entirely.

goodboy added a commit to goodboy/tractor that referenced this issue Jul 7, 2022
The greasy details are strewn throughout a `msgspec` issue:
jcrist/msgspec#140

and specifically this code was mostly written as part of POC example in
this comment:
jcrist/msgspec#140 (comment)

This work obviously pertains to our desire and prep for typed messaging
and capabilities aware msg-oriented-protocols in #196, caps sec nods in

I added a "wants to have" method to `Context` showing how I think we
could offer a pretty neat msg-type-set-as-capability-for-protocol
system.
@goodboy
Author

goodboy commented Jul 7, 2022

What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender,

So i guess the problem here would be decode aliasing due to a tag field collision?
I can see why you might want to just sidestep this entirely.

Just as one final proposal, couldn't we just presume if you find a {"type": "CustomStruct"} that it should be decoded to CustomStruct or error and if it turns out that was an error by the sender having a "type" (or wtv tag_field is) key, then you just throw an equivalent error?

DecodeError(f"Can't decode {msgpack_obj} to type 'CustomStruct' did you send an object with '{tag_field}' set?")

And then the user will know either the serialized object is malformed or there is a collision they have to work around by changing the tag_field setting?
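
A library-independent sketch of that proposal, using only `json` and `dataclasses` from the standard library. Everything here is an assumption made for illustration: `revive`, `REGISTRY` and `TAG_FIELD` are hypothetical names, a dataclass stands in for a tagged Struct, and no field type validation is performed. It is not how msgspec decodes.

```python
import json
from dataclasses import dataclass
from typing import Any

TAG_FIELD = "type"  # assumed tag key, mirroring the default mentioned above


@dataclass
class Point:
    x: int
    y: int


# Hypothetical registry mapping tag values to classes.
REGISTRY: dict[str, type] = {"Point": Point}


def revive(obj: Any) -> Any:
    """Recursively replace tagged dicts with instances of registered classes."""
    if isinstance(obj, list):
        return [revive(item) for item in obj]
    if isinstance(obj, dict):
        fields = {key: revive(value) for key, value in obj.items()}
        if TAG_FIELD not in fields:
            return fields
        tag = fields.pop(TAG_FIELD)
        cls = REGISTRY.get(tag) if isinstance(tag, str) else None
        if cls is None:
            # Either a malformed message or a tag field collision.
            raise ValueError(f"unknown tag {tag!r} under key {TAG_FIELD!r}")
        try:
            return cls(**fields)
        except TypeError as exc:
            raise ValueError(f"can't decode object to {cls.__name__}") from exc
    return obj


data = revive(json.loads('{"points": [{"type": "Point", "x": 1, "y": 2}], "n": 3}'))
print(data)

# A plain dict that happens to use the tag key is rejected, not passed through.
try:
    revive(json.loads('{"type": "invoice", "total": 3}'))
except ValueError as exc:
    collision = str(exc)
else:
    collision = None
print(collision)
```

The second case shows the cost of the rule: every object carrying the tag key must be a registered type, so ordinary data that uses the same key can no longer be sent without changing the tag field.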

@jcrist jcrist closed this as completed Feb 24, 2023
goodboy added a commit to goodboy/tractor that referenced this issue May 15, 2023