Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use faster UTF8 encoding in Content.write() #12475

Merged
merged 3 commits into from
Nov 6, 2024

Conversation

lorban
Copy link
Contributor

@lorban lorban commented Nov 4, 2024

Fixes #12469

Signed-off-by: Ludovic Orban <lorban@bitronix.be>
@lorban lorban self-assigned this Nov 4, 2024
@lorban lorban requested review from sbordet and gregw November 4, 2024 16:08
Copy link
Contributor

@joakime joakime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm that the behavior when dealing with bad input is the same with the new code vs the old code.

Example:
A bad / Malformed input String that would trigger a fault in the conversion.
If the old code throws an exception, this new one should too.
If the old code used replacement characters, this new one should too.

Signed-off-by: Ludovic Orban <lorban@bitronix.be>
@lorban
Copy link
Contributor Author

lorban commented Nov 5, 2024

@joakime is it possible to create a String object containing invalid UTF8? All I've seen in our tests is to create invalid UTF8 in byte arrays then the String constructor is used to build the string, which does apply some correction. Did I get that right?

sbordet
sbordet previously approved these changes Nov 5, 2024
Signed-off-by: Ludovic Orban <lorban@bitronix.be>
Copy link
Contributor

@joakime joakime left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is difficult to create a String that CharsetDecoder.encode(String) would fail on.

@lorban
Copy link
Contributor Author

lorban commented Nov 5, 2024

I found one other place where we use encode that could be replaced with getBytes/wrap.

@lorban
Copy link
Contributor Author

lorban commented Nov 5, 2024

For future reference, here is the benchmark's report:

Benchmark                                          (locale)   Mode  Cnt         Score         Error   Units
Utf8Benchmark.testEncode                              ASCII  thrpt   10   1885900.209 ±   12517.384   ops/s
Utf8Benchmark.testEncode:gc.alloc.rate                ASCII  thrpt   10      1121.696 ±       7.530  MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm           ASCII  thrpt   10       624.007 ±       0.001    B/op
Utf8Benchmark.testEncode:gc.count                     ASCII  thrpt   10        19.000                counts
Utf8Benchmark.testEncode:gc.time                      ASCII  thrpt   10        23.000                    ms
Utf8Benchmark.testEncode                                 FR  thrpt   10   1310399.805 ±   12798.866   ops/s
Utf8Benchmark.testEncode:gc.alloc.rate                   FR  thrpt   10       789.489 ±       7.739  MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm              FR  thrpt   10       632.011 ±       0.001    B/op
Utf8Benchmark.testEncode:gc.count                        FR  thrpt   10        14.000                counts
Utf8Benchmark.testEncode:gc.time                         FR  thrpt   10        18.000                    ms
Utf8Benchmark.testEncode                                 JA  thrpt   10    814449.918 ±   11152.653   ops/s
Utf8Benchmark.testEncode:gc.alloc.rate                   JA  thrpt   10      2925.414 ±      40.086  MB/sec
Utf8Benchmark.testEncode:gc.alloc.rate.norm              JA  thrpt   10      3768.017 ±       0.001    B/op
Utf8Benchmark.testEncode:gc.count                        JA  thrpt   10        33.000                counts
Utf8Benchmark.testEncode:gc.time                         JA  thrpt   10        47.000                    ms
Utf8Benchmark.testWrapGetBytes                        ASCII  thrpt   10  39417563.752 ± 1256275.047   ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate          ASCII  thrpt   10     19538.322 ±     623.689  MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm     ASCII  thrpt   10       520.000 ±       0.001    B/op
Utf8Benchmark.testWrapGetBytes:gc.count               ASCII  thrpt   10        71.000                counts
Utf8Benchmark.testWrapGetBytes:gc.time                ASCII  thrpt   10       144.000                    ms
Utf8Benchmark.testWrapGetBytes                           FR  thrpt   10   3434889.274 ±   64716.469   ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate             FR  thrpt   10      4819.736 ±      90.934  MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm        FR  thrpt   10      1472.004 ±       0.001    B/op
Utf8Benchmark.testWrapGetBytes:gc.count                  FR  thrpt   10        37.000                counts
Utf8Benchmark.testWrapGetBytes:gc.time                   FR  thrpt   10        58.000                    ms
Utf8Benchmark.testWrapGetBytes                           JA  thrpt   10   1399081.733 ±   47082.158   ops/s
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate             JA  thrpt   10      3595.402 ±     121.188  MB/sec
Utf8Benchmark.testWrapGetBytes:gc.alloc.rate.norm        JA  thrpt   10      2696.010 ±       0.001    B/op
Utf8Benchmark.testWrapGetBytes:gc.count                  JA  thrpt   10        32.000                counts
Utf8Benchmark.testWrapGetBytes:gc.time                   JA  thrpt   10        46.000                    ms

Copy link
Contributor

@gregw gregw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything that can be done to improve org.eclipse.jetty.ee11.servlet.HttpOutput#print(java.lang.String, boolean)?

@lorban
Copy link
Contributor Author

lorban commented Nov 6, 2024

@gregw HttpOutput.print() goes into great length to pool the encoder and to to detect encoding errors like overflows/underflows.

We could theoretically replace all that with a much simpler String.getBytes(Charset), which could improve perf but may not work as expected w.r.t encoding. @joakime what's your opinion on that one?

@joakime
Copy link
Contributor

joakime commented Nov 6, 2024

@joakime what's your opinion on that one?

If the behavior of the API to the users is maintained, then I'm in favor of the change.
Is it possible to use HttpOutput.print() with partial code points? (meaning a print() is called which starts the code points, then a subsequent print() results in finishing the code point)
If so, then the String.getBytes(Charset) wouldn't work for us.

@lorban
Copy link
Contributor Author

lorban commented Nov 6, 2024

I'm going to give HttpOutput.print() a try in another PR, as it isn't trivial to change but may work and be worth the effort.

@lorban lorban merged commit 66b494d into jetty-12.0.x Nov 6, 2024
10 checks passed
@lorban lorban deleted the fix/jetty-12.0.x/12469-faster-utf8-write branch November 6, 2024 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

Content.Sink.write(sink, last, utf8Content, callback) could become faster
4 participants