Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement JsonParserGenerator using smithy-json #498

Merged
merged 8 commits into from
Jun 22, 2021
Merged

Implement JsonParserGenerator using smithy-json #498

merged 8 commits into from
Jun 22, 2021

Conversation

jdisanti
Copy link
Collaborator

More progress towards #161. This PR implements JSON deserialization codegen using the new smithy-json token streaming parser. Cleaning up the old Serde implementation is going to be a separate PR as it's going to require more refactoring.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@jdisanti jdisanti requested a review from rcoh June 12, 2021 00:04
@jdisanti jdisanti marked this pull request as draft June 12, 2021 00:16
Copy link
Collaborator

@rcoh rcoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

Some high level things:

  1. A lot of functions in smithy-json should probably be #[inline], it makes a shockingly big different in Rust in terms of performance.
  2. as_escaped_str() doesn't communicate it's danger enough, IMO. Perhaps as_escaped_unchecked()? And then a specific justification about why this usage is OK? I'm worried about silently sneaking \" into someone's response etc.
  3. It would be good to get a ballpark comparison in terms of wall clock time and cargo llvm-lines between this and serde_json for DynamoDB (I suspect, probably similar?) But just good to make sure it's not way worse.

private val smithyJson = CargoDependency.smithyJson(runtimeConfig).asType()
private val codegenScope = arrayOf(
"Error" to smithyJson.member("deserialize::Error"),
"expect_blob_or_null" to smithyJson.member("deserialize::token::expect_blob_or_null"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally, I would probably have just made token the namespace then done #{token}::expect_bool_or_null in the code but, either way.

*codegenScope
)
rust("let result =")
deserializeMember(member, false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably create a Ctx for this, partially just to follow the pattern of all the other deserializers & partially to make the code a little easier to read without jumping to definition to see what false means

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may end up having to introduce context to correctly handle the non-sparse list nullability issue. Otherwise, it seemed like overkill for one context value. I should have used named arguments though to clarify the boolean value.

}
if (!forceOptional && !symbol.isOptional()) {
if (symbol.canUseDefault()) {
rust(".unwrap_or_default()")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't need to do this if you use the builder (but I could be wrong).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the result of having to create union variants that have zero types. For example:

"int" => Some(crate::model::Choice::Int(
    smithy_json::deserialize::token::expect_number_or_null(tokens.next())?
        .map(|v| v.to_i32())
        .unwrap_or_default(),
)),

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right...unions don't have builders. OK.

} else {
val name = escape(memberShape.memberName)
rustTemplate(
".ok_or_else(|| #{Error}::custom(\"value for '$name' cannot be null\"))?",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably just push this onto the builder so that it fails when you try to .build() on the builder and it keeps this null checking in one place

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a bug: it's going to reject null values in non-sparse lists when it should ignore them (see example below).

let mut items = Vec::new();

// ...

items.push(
    deser_union_choice(tokens)?
        .ok_or_else(|| Error::custom("value for 'member' cannot be null"))?
);

I'll fix this and add a protocol test for it.

val stringFn = if (isEnum) "expect_string_or_null" else "expect_unescaped_string_or_null"
rustTemplate("#{$stringFn}(tokens.next())?", *codegenScope)
if (isEnum) {
rust(".map(|s| #T::from(s.as_escaped_str()))", symbolProvider.toSymbol(target))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth a comment that we don't need to escape here because enum values are restricted

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although actually I've seen some crazy enum values with quotes and stuff. Might need to unescape actually (worth adding a test).

rustTemplate("#{expect_number_or_null}(tokens.next())?.map(|v| v.to_#{T}())", "T" to symbol, *codegenScope)
}

private fun RustWriter.deserializeTimestamp(member: MemberShape) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still deciding if how I feel about pushing this stuff into smithy-json/smithy-xml. I think it's good? I guess it simplifies codegen a little bit.

rustBlock("match tokens.next().transpose()?") {
rustBlockTemplate(
"""
Some(#{Token}::ValueNull { .. }) => return Ok(None),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are unions nullable? I'm not positive they are.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends:

structure SomeStruct {
    unionValue: SomeUnion,
}

// Case 1: Top-level union value is null.
// Nulls shouldn't be explicit, per the spec, but we should handle
// this case gracefully and set unionValue to None in this case:

{ "unionValue": null }

// Case 2: One variant is explicitly set to null.
// This case is invalid because there has to be exactly
// one non-null variant in the union struct per the spec:

{ "unionValue": { "someVariant": null } }

// The above is basically equivalent to:

{ "unionValue": {} }

The match block in question is handling Case 1.

}
}
}
// TODO: Handle unrecognized union variants
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: Handle unrecognized union variants
// TODO: Handle unrecognized union variants https://github.com/awslabs/smithy-rs/issues/185


private fun RustWriter.objectKeyLoop(inner: RustWriter.() -> Unit) {
rustBlock("loop") {
rustBlock("match tokens.next().transpose()?") {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be helpful to offer a wrapper around Peekable inside of smithy-json that has a nicer interface to avoid all the transposition etc.

rustBlockTemplate(
"""
Some(#{Token}::ValueNull { .. }) => return Ok(None),
Some(#{Token}::Start${StringUtils.capitalize(objectOrArray)} { .. }) =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's restricted to this private function maybe just pass "Array" and "Object"?

Copy link
Collaborator

@rcoh rcoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And seems like deserializeFunctionName is generating an extra _ when called on lists which is causing CI to fail

@jdisanti
Copy link
Collaborator Author

And seems like deserializeFunctionName is generating an extra _ when called on lists which is causing CI to fail

It looks like this is actually because MediaPackage has names such as com.amazonaws.mediapackage#__mapOf__string. I think we just need to suppress this warning.

@jdisanti
Copy link
Collaborator Author

Did some crude benchmarking to get a rough idea of how these changes impact performance. Note: this isn't a completely fair comparison right now because the smithy-json version is still using serde to parse errors (that part hasn't been refactored as part of this PR).

Repeating json-rpc10, json-rpc11, rest-json, and rest-json-extras protocol tests 100x in release mode (wall-clock time):

smithy-json: 42.354 seconds
serde-json:  44.656 seconds

Dynamo release build time including dependencies (wall-clock time):

smithy-json: 1m 36s
serde-json:  1m 29s

Running cargo llvm-lines:

smithy-json: 528,876
serde-json:  760,254

I want to see the same comparisons after refactoring the error parsing, and also want to add a real-world Dynamo GetItem comparison.

@rcoh
Copy link
Collaborator

rcoh commented Jun 15, 2021

big +1 to both of those! I would make the integration test gradle job slightly smarter and allow us to add a benches folder in addition to a tests folder. We can eventually even run these benchmarks in CI. I'd also be curious about absolute numbers for serialization

@jdisanti
Copy link
Collaborator Author

Added a DynamoDB benchmark in #507. I went ahead and ran it against both main and this PR (after running ./gradlew :aws:sdk:assemble for each), and the results are not that different:

main:

ser_deser_bench         time:   [224.39 us 226.10 us 228.14 us]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

PR:

ser_deser_bench         time:   [226.49 us 227.60 us 228.89 us]
                        change: [-0.2383% +0.5420% +1.3489%] (p = 0.19 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe

I'm not fully convinced this benchmark is giving me the information I want, but it does corroborate the earlier wall-clock time that wasn't really showing much of a difference.

If we want to add benches to CI later, we probably want to use lai instead of criterion since that will measure instructions and memory accesses rather than time.

@jdisanti
Copy link
Collaborator Author

Did some more benchmarking since simplifying the benchmarks in #507. It looks like performance is in the same ball park after this PR.

main:

deserialization_bench   time:   [12.200 us 12.701 us 13.262 us]
deserialization_bench   time:   [10.637 us 11.046 us 11.456 us]
deserialization_bench   time:   [8.4684 us 8.5261 us 8.5913 us]
deserialization_bench   time:   [8.1389 us 8.2168 us 8.3122 us]

PR:

deserialization_bench   time:   [11.877 us 11.980 us 12.097 us]
deserialization_bench   time:   [11.694 us 11.765 us 11.846 us]
deserialization_bench   time:   [11.302 us 11.402 us 11.516 us]
deserialization_bench   time:   [11.235 us 11.299 us 11.369 us]

It looks like there may be some opportunity for improvement, but we're on the order of a 0-4 microsecond difference. The consistency of measurements of the PR version vs. what's in main is interesting.

@jdisanti jdisanti marked this pull request as ready for review June 21, 2021 18:46
Copy link
Collaborator

@rcoh rcoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! a few comments inline

uri: "/escaped-string-values",
method: "POST",
protocol: "aws.protocols#restJson1",
body: "{\"enum\":\"has\\\"quotes\",\"also\\\"has\\\"quotes\":\"test\"}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you prefer, smithy supports """ which might remove some \

""",
*codegenScope
)
deserializeStructInner(errorShape.members())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this read until }? we should make sure that extra key/value pairs are allowed/ignored

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, deserializeStructInner consumes the final } as well as skips values for keys it doesn't recognize.

@jdisanti jdisanti merged commit 28af7e2 into smithy-lang:main Jun 22, 2021
@jdisanti jdisanti deleted the json-de branch June 22, 2021 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants