Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add to/from_pydict methods #203

Merged
merged 4 commits into from
May 9, 2022
Merged

Conversation

crflynn
Copy link
Contributor

@crflynn crflynn commented Jan 29, 2021

This PR adds to_pydict() and from_pydict() to parse and serialize messages into python dictionaries, alternative to the current to/from_dict() methods which cater to JSON serialization.

In addition to defining messages, our protobuf definitions double as spark table schemas. Ingesting messages requires declaring the spark schema explicitly, and so we do that by mapping the betterproto message dataclass field types into spark sql types. The ingestion itself is done by decoding into a betterproto message, calling to_dict(), and then adding additional business logic to transform the JSON-prepared datetime strings into datetime objects. To do this we have to write a bit of extra code that looks like this:

def get_decoder(message_type: dataclass) -> Callable:
    """Get a decoder used to create a udf."""

    # determine fields which are datetimes so we can convert them from strings
    datetime_fields = [
        name
        for name, field in message_type().__dataclass_fields__.items()
        if field.type == datetime
    ]

    def decoder(s):
        """betterproto converts datetimes to strings on to_dict() so we convert it back
        using dateutil (iso8601) so that it goes into the spark table as a proper datetime
        """
        asdict = message_type().parse(bytes(s)).to_dict(casing=betterproto.Casing.SNAKE)
        for field in datetime_fields:
            if field in asdict:
                asdict[field] = parser.parse(asdict[field])
        return asdict

    return decoder

With this change our code would look like this, and the output dictionary value types match those declared in the dataclasses:

def get_decoder(message_type: dataclass) -> Callable:
    """Get a decoder used to create a udf."""

    def decoder(s):
        return message_type().parse(bytes(s)).to_pydict(casing=betterproto.Casing.SNAKE)

    return decoder

An example copied from the test:

@dataclass
class TestDatetimeMessage(betterproto.Message):
    bar: datetime = betterproto.message_field(1)
    baz: timedelta = betterproto.message_field(2)

test = TestDatetimeMessage().from_dict(
    {"bar": "2020-01-01T00:00:00Z", "baz": "86400.000s"}
)

print(test.to_pydict())
# {'bar': datetime.datetime(2020, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'baz': datetime.timedelta(days=1)}

This relates to some of the discussion happening in #36 with respect to datetimes. Personally this is the behavior that I expected for the to_dict() function, expecting it to act more like dataclasses.asdict but omitting defaults and handling casing.

Thanks a ton for betterproto. I'm happy to iterate on this if there are any requested changes.

@nat-n nat-n added enhancement New feature or request has test Has a (xfail) test that verifies the bugfix or feature labels Apr 6, 2021
@Gobot1234
Copy link
Collaborator

Hi, sorry for the slow response on this.

I like this idea, however. I'm not a fan of the large amounts of duplication this brings, for which I may have a solution. Would it be possible to add an internal generator method that yields the keys and values for to/from_(py)dict to then process in their own ways?

@guysz
Copy link
Contributor

guysz commented Dec 17, 2021

Voted +1
I can assist if needed

@Gobot1234
Copy link
Collaborator

Voted +1 I can assist if needed

If you'd be willing to pick this up with the feedback I've suggested, I'd be very grateful.

@tcamise-gpsw
Copy link

This does in fact satisfy my request from #316 (after the addition of a meta.proto_type == TYPE_ENUM case in the to_pydict method).

@Gobot1234, in regards to minimizing the code duplication... I took a stab at your generator idea. I don't think it was as clean as you imagined (or maybe I'm missing something). I also tried another approach of merging the two to/from_dict methods together into one and conditionally doing things depending on whether or not it is a to/from_pydict.

The generator approach is here: https://github.com/tcamise-gpsw/python-betterproto/commit/e1cb903f882446124a68bb7e54f234dda614504d
The merging approach is here: https://github.com/tcamise-gpsw/python-betterproto/commit/b749ad4cbf4910915ce3344fa24ae1473fd29643

Do either of these look acceptable? If so, i can open a PR and try to set up the poetry environment for more testing.

@Gobot1234
Copy link
Collaborator

This does in fact satisfy my request from #316 (after the addition of a meta.proto_type == TYPE_ENUM case in the to_pydict method).

@Gobot1234, in regards to minimizing the code duplication... I took a stab at your generator idea. I don't think it was as clean as you imagined (or maybe I'm missing something). I also tried another approach of merging the two to/from_dict methods together into one and conditionally doing things depending on whether or not it is a to/from_pydict.

The generator approach is here: tcamise-gpsw@e1cb903 The merging approach is here: tcamise-gpsw@b749ad4

Do either of these look acceptable? If so, i can open a PR and try to set up the poetry environment for more testing.

Hi, sorry for the late reply, either looks fine to me. Thank you for looking into this

@einarjohnson
Copy link

Any ETA on when this might be released?

@Gobot1234
Copy link
Collaborator

Any ETA on when this might be released?

I'm going to try and get this into b5, but no promises.

# Conflicts:
#	src/betterproto/__init__.py
#	tests/test_features.py
Copy link
Collaborator

@Gobot1234 Gobot1234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming #293 is merged, which should make the Message instances have enum fields always be enums,

This does in fact satisfy my request from #316 (after the addition of a meta.proto_type == TYPE_ENUM case in the to_pydict method).

special casing here shouldn't be necessary

@Gobot1234 Gobot1234 merged commit 6536181 into danielgtaylor:master May 9, 2022
@Gobot1234
Copy link
Collaborator

Thanks for this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request has test Has a (xfail) test that verifies the bugfix or feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants