Add a fast path for ASCII strings #620
Conversation
This optimization is based on a few assumptions:

- Most strings are ASCII-only.
- Most strings had their coderange scanned already.

If the above is true, then by checking the string coderange, we can use a much more streamlined function to encode ASCII strings.

Before:

```
== Encoding twitter.json (466906 bytes)
ruby 3.4.0preview2 (2024-10-07 master 32c733f57b) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                json   140.000 i/100ms
                  oj   230.000 i/100ms
           rapidjson   108.000 i/100ms
Calculating -------------------------------------
                json      1.464k (± 1.4%) i/s  (682.83 μs/i) -      7.420k in   5.067573s
                  oj      2.338k (± 1.5%) i/s  (427.64 μs/i) -     11.730k in   5.017336s
           rapidjson      1.075k (± 1.6%) i/s  (930.40 μs/i) -      5.400k in   5.025469s

Comparison:
                json:     1464.5 i/s
                  oj:     2338.4 i/s - 1.60x faster
           rapidjson:     1074.8 i/s - 1.36x slower
```

After:

```
== Encoding twitter.json (466906 bytes)
ruby 3.4.0preview2 (2024-10-07 master 32c733f57b) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                json   189.000 i/100ms
                  oj   228.000 i/100ms
           rapidjson   108.000 i/100ms
Calculating -------------------------------------
                json      1.903k (± 1.2%) i/s  (525.55 μs/i) -      9.639k in   5.066521s
                  oj      2.306k (± 1.3%) i/s  (433.71 μs/i) -     11.628k in   5.044096s
           rapidjson      1.069k (± 2.4%) i/s  (935.38 μs/i) -      5.400k in   5.053794s

Comparison:
                json:     1902.8 i/s
                  oj:     2305.7 i/s - 1.21x faster
           rapidjson:     1069.1 i/s - 1.78x slower
```
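To illustrate the idea (not the actual patch): once a string is known to be pure ASCII, the encoder can skip UTF-8 decoding entirely and copy runs of unescaped bytes in bulk. The sketch below uses hypothetical helper names; in the real extension the ASCII check is O(1) via Ruby's cached coderange (`rb_enc_str_coderange()` returning `ENC_CODERANGE_7BIT`) rather than a scan, and control characters are fully escaped rather than skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the coderange check: the real code asks
 * Ruby for the cached coderange instead of scanning every byte. */
static int is_ascii_only(const char *str, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)str[i] >= 0x80) return 0;
    }
    return 1;
}

/* ASCII fast path sketch: copy runs of clean bytes with memcpy instead
 * of inspecting one UTF-8 codepoint at a time. Control characters are
 * omitted here for brevity; only '"' and '\\' are escaped. */
static size_t append_ascii_json(char *out, const char *str, size_t len)
{
    size_t beg = 0, out_len = 0;
    for (size_t pos = 0; pos < len; pos++) {
        char ch = str[pos];
        if (ch == '"' || ch == '\\') {
            memcpy(out + out_len, str + beg, pos - beg);
            out_len += pos - beg;
            out[out_len++] = '\\';
            out[out_len++] = ch;
            beg = pos + 1;
        }
    }
    memcpy(out + out_len, str + beg, len - beg);
    return out_len + (len - beg);
}
```

The win comes from the common case: a string with nothing to escape becomes a single `memcpy`.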
```c
        break;
    case ENC_CODERANGE_VALID:
        if (RB_UNLIKELY(state->ascii_only)) {
            convert_UTF8_to_ASCII_only_JSON(buffer, obj, state->script_safe);
```
After reflection, the split that might have the most impact is `script_safe`, as it would also speed up the ASCII path.

I pushed an extra commit that re-uses the escape table from a81ec47.

For now it's only applied to the ASCII fast path, but I think we could optimize UTF-8 too, under the assumption that most bytes in a UTF-8 string are ASCII. But I'd rather not complicate this PR further.
This performs noticeably better than the boolean logic.

Before:

```
== Encoding twitter.json (466906 bytes)
ruby 3.4.0preview2 (2024-10-07 master 32c733f57b) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                json   189.000 i/100ms
                  oj   228.000 i/100ms
           rapidjson   108.000 i/100ms
Calculating -------------------------------------
                json      1.903k (± 1.2%) i/s  (525.55 μs/i) -      9.639k in   5.066521s
                  oj      2.306k (± 1.3%) i/s  (433.71 μs/i) -     11.628k in   5.044096s
           rapidjson      1.069k (± 2.4%) i/s  (935.38 μs/i) -      5.400k in   5.053794s

Comparison:
                json:     1902.8 i/s
                  oj:     2305.7 i/s - 1.21x faster
           rapidjson:     1069.1 i/s - 1.78x slower
```

After:

```
== Encoding twitter.json (466906 bytes)
ruby 3.4.0preview2 (2024-10-07 master 32c733f57b) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
                json   224.000 i/100ms
                  oj   230.000 i/100ms
           rapidjson   107.000 i/100ms
Calculating -------------------------------------
                json      2.254k (± 1.6%) i/s  (443.69 μs/i) -     11.424k in   5.069999s
                  oj      2.318k (± 1.4%) i/s  (431.32 μs/i) -     11.730k in   5.060421s
           rapidjson      1.081k (± 1.9%) i/s  (925.05 μs/i) -      5.457k in   5.049738s

Comparison:
                json:     2253.8 i/s
                  oj:     2318.5 i/s - same-ish: difference falls within error
           rapidjson:     1081.0 i/s - 2.08x slower
```

The escape table is taken directly from Mame's PR.

Co-Authored-By: Yusuke Endoh <mame@ruby-lang.org>
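As a rough sketch of what "escape table instead of boolean logic" means: a single 256-entry lookup replaces a chain of per-byte comparisons, so the hot loop is one load and one branch per byte. The table below is illustrative, built at runtime for clarity; the actual commit uses a precomputed static table (with a separate variant for `script_safe`).

```c
#include <assert.h>
#include <stddef.h>

/* One byte per entry: nonzero means the byte needs escaping in JSON.
 * Illustrative only -- the real table is a static initializer. */
static unsigned char needs_escape[256];

static void init_escape_table(void)
{
    for (int i = 0; i < 0x20; i++) needs_escape[i] = 1; /* control chars */
    needs_escape['"'] = 1;
    needs_escape['\\'] = 1;
}

/* Length of the leading run of bytes that can be copied verbatim,
 * i.e. how far the encoder can go before it must emit an escape. */
static size_t scan_clean_run(const unsigned char *str, size_t len)
{
    size_t pos = 0;
    while (pos < len && !needs_escape[str[pos]]) pos++;
    return pos;
}
```

A `script_safe` variant would additionally mark `/` (and handle U+2028/U+2029 separately, since those are multi-byte).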
```c
            }

            beg = pos + 1;
            switch (ch) {
```
Thanks for elaborating on this change in your blog post ❤️

I was wondering if we could entirely eliminate this `switch` at the expense of static memory, by storing escaped values instead of booleans in the lookup table. Just as an example, in Ruby:

```ruby
JSON_ESCAPE_TABLE['"'] = "\\\""
JSON_ESCAPE_TABLE['\\'] = "\\\\"
JSON_ESCAPE_TABLE['\b'] = "\\b"
# ...
# the default: case can be pre-calculated as well for other characters
```

It may not result in a visible improvement in realistic benchmarks, since they don't contain that many characters to escape, but it could be visible on a micro-benchmark specifically crafted to measure escaping performance, at no cost (except static memory) for the main flow.

I could be missing something that makes it hard to store non-boolean values in `escape_table`, but would love to at least learn about this as well. Thanks!
It could be worth trying, but there are a couple of things making this not an obvious win.

Escaped values in this case aren't scalar. You can either store them as a table of `char *`:

```c
static const char *escape_table[256] = {
    0, 0, 0, "\\n", // ....
};
```

If you do this:

- `char *` is `8B` on a 64-bit platform, whereas `char` is just `1B`, so we went from `256B` to `2kiB`, not nothing.
- `char *` doesn't store the length, so presumably to append these you'd need to call `strlen`. They are short strings, so probably not very slow, but given we're just trying to eliminate a case, it may actually be slower.
- The `char *` data isn't stored contiguously, so that's another memory region to keep in CPU cache.
Alternatively, you could store the escaped strings inline, e.g.:

```c
struct escaped_char {
    unsigned char len;
    char bytes[8];
};

static const struct escaped_char escape_table[256] = {
    {0, 0}, {0, 0}, {2, "\\n"}, // ....
};
```

But this would make the escape table even bigger.
Maybe you could only do this for strings that are known to be pure ASCII, so the table only has `128` entries, keeping the memory usage down, but then you can't really re-use it for UTF-8.

Note that all of these are just "static assumptions"; performance isn't always easy to predict, so maybe your suggestion would actually help.
Did this get applied to the Java parser anywhere? I can take it if not, but I don't want to dive in and waste my time if it's already there.
As mentioned previously, I don't know enough about Java to do much with it, and I don't really have any incentive either. If you want to speed up the Java version, feel free to do it.
@byroot Ok, thanks for clarifying. You've done so many PRs lately I lost track! I'll do the port and push a PR. The code in the Java extension will be pretty much the same as what you wrote for the C version.
Good news: much of the structure of the Java code already mimics the C code, and a profile of dumping the twitter.json file shows most of the overhead is in two places: handling each character (because it always uses the heavier UTF-8 logic) and picking the right logic for each object type (most likely a bad lookup of handler logic based on the Ruby class). Ought to be able to speed this up nicely.
Work is proceeding on this and other optimizations for the JRuby extension in #725. |
* Make benchmark runnable without oj available
* Port convert_UTF8_to_ASCII_only_JSON to Java

  This is new specialized logic to reduce overhead when appending ASCII-only strings to the generated JSON. Original code by @byroot. See #620
* Align string generate method with generate_json_string
* Port convert_UTF8_to_JSON from C

  Also includes updated logic for generate (generate_json_string) based on current C code. Original code by @byroot. See #620
* Use external iteration to reduce alloc

  Lots of surrounding state, so just take the hit of a Set and Iterator rather than a big visitor object.
* Remove unused imports
* Inline ConvertBytes logic for long to byte[]

  This change duplicates some code from JRuby to allow rendering the fixnum value into a shared byte array rather than allocating a new one for each value. Since fixnum dumping is a leaf operation, only one is needed per session.
* Eliminate * import
* Restructure handlers for easier profiling

  Anonymous classes show up as unnamed, numbered classes in profiles, which makes them difficult to read.
* Avoid allocation when writing Array delimiters

  Rather than allocating a buffer to hold N copies of arrayNL, just write it N times. We're buffering into a stream anyway. This makes array dumping zero-alloc other than buffer growth.
* Move away from Handler abstraction

  Since there's a fixed number of types we have special dumping logic for, this abstraction just introduces overhead we don't need. This patch starts moving away from indirecting all dumps through the Handler abstraction and directly generating from the type switch. This also aligns better with the main loop of the C code and should inline and optimize better.
* Match C version of fbuffer_append_long
* Minor tweaks to reduce complexity
* Reimplement byte[] stream without synchronization

  The byte[] output stream used here extended ByteArrayOutputStream from the JDK, which synchronizes all mutation operations (like writes). Since this is only going to be used once within a given call stack, it needs no synchronization. This change more than triples the performance of a benchmark dumping an array of empty arrays, and should increase the performance of all dump forms.
* Reduce overhead in repeats

  * Return the incoming array if only one repeat is needed and the array is the exact size.
  * Only retrieve ByteList fields once for repeat writes.
* Use equivalent of rb_sym2str
* Micro-optimizations for ByteList stream
* Cast to byte not necessary
* Refactor for better inlining
* More tiny tweaks to reduce overhead of generateString
* Refactor to avoid repeated boolean checks
* Eliminate memory accesses for digits

  The math is much faster here than array access, due to bounds checking and pointer dereferencing.
* Loosen visibility to avoid accessor methods

  Java will generate accessor methods for private fields, burning some inlining budget.
* Modify parser bench to work without oj or rapidjson
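The "Match C version of fbuffer_append_long" item above refers to converting an integer to its decimal bytes by arithmetic alone, with no digit lookup table. A hedged sketch of that technique (my own illustrative code, shown in C rather than Java; the helper name `long_to_ascii` is invented):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Produce the decimal representation of n. Digits are computed as
 * '0' + n % 10 (no digit-pair lookup table, so no array bounds checks
 * or extra memory traffic), written backwards into a scratch buffer,
 * then copied out. Returns the length, and NUL-terminates out. */
static size_t long_to_ascii(long n, char *out)
{
    char tmp[21]; /* enough for LONG_MIN on 64-bit, plus sign */
    char *p = tmp + sizeof(tmp);
    int neg = n < 0;
    /* Negate via unsigned arithmetic so LONG_MIN doesn't overflow. */
    unsigned long u = neg ? 0UL - (unsigned long)n : (unsigned long)n;

    do {
        *--p = (char)('0' + (u % 10));
        u /= 10;
    } while (u);
    if (neg) *--p = '-';

    size_t len = (size_t)(tmp + sizeof(tmp) - p);
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

Since fixnum dumping is a leaf operation, the scratch buffer can be a single per-session allocation, which is what the JRuby commit above exploits.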
Fix: #562