
Improve JSON (de)serialization perf of longer strings on mono #39733

Merged: 4 commits, Aug 14, 2020

Conversation

steveharter (Member) commented Jul 21, 2020:

Improves serialization and deserialization performance on the Mono interpreter by ~1.4x for a POCO with a 1K string property.

Improves serialization performance on Mono JIT by ~2.5x and deserialization by ~1.1x for the same POCO.

The same tests with a property of only 10 characters (instead of 1,024) were too small to measure effectively (without increasing the loop count).

No regressions noted with Core JIT.

The changes to S.T.E.W.dll (System.Text.Encodings.Web) are responsible for the serialization gains, and the S.T.J.dll (System.Text.Json) changes are responsible for the deserialization gains.

The test POCO is the existing SimpleStructWithProperties, which was chosen because it shows the largest difference between Mono JIT and the Mono interpreter across the various benchmarks:

    public struct SimpleStructWithProperties
    {
        public int Num { get; set; }
        public string Text { get; set; }
    }

The serialization gains come from modifying System.Text.Encodings.Web to avoid accessing a static ReadOnly<byte> table for every character; the assumption is that the Core JIT's tiered compilation avoids the overhead of checking whether the table was initialized (after a few runs). There are additional interpreter gains from avoiding a method call for every character while escaping.
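To make the idea concrete, here is a minimal sketch of the two shapes being contrasted. The class, the table contents, and the method names are hypothetical illustrations, not the actual System.Text.Encodings.Web code:

    using System;

    static class EscapingSketch
    {
        // Hypothetical lookup table: non-zero means "this ASCII char must be escaped".
        private static readonly byte[] s_needsEscaping = BuildTable();

        private static byte[] BuildTable()
        {
            var table = new byte[128];
            for (int i = 0; i < 0x20; i++)
            {
                table[i] = 1;               // control characters always need escaping
            }
            table['"'] = 1;
            table['\\'] = 1;
            return table;
        }

        // "Before" shape: a method call per character, each re-reading the static field
        // (which, on Mono, may pay an initialization check on every access).
        private static bool NeedsEscaping(char c) => c >= 128 || s_needsEscaping[c] != 0;

        public static int FindFirstCharToEncode_PerCharCall(ReadOnlySpan<char> text)
        {
            for (int i = 0; i < text.Length; i++)
            {
                if (NeedsEscaping(text[i]))
                    return i;
            }
            return -1;
        }

        // "After" shape: read the static table into a local once, then test inline.
        public static int FindFirstCharToEncode_Hoisted(ReadOnlySpan<char> text)
        {
            ReadOnlySpan<byte> table = s_needsEscaping;
            for (int i = 0; i < text.Length; i++)
            {
                char c = text[i];
                if (c >= 128 || table[c] != 0)
                    return i;
            }
            return -1;
        }
    }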

The deserialization gains come from modifying System.Text.Json to optimize the scan for the first character that needs unescaping (typically the trailing quote) by avoiding unsafe pointers.
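Conceptually, the span-based scan has the shape below. The signature matches the new IndexOfOrLessThan overload discussed in the review thread further down; the body is a sketch of the idea, not the PR's exact code:

    using System;

    static class ScanSketch
    {
        // Plain indexed scan, no unsafe pointer arithmetic: returns the index of the
        // first byte equal to value0 or value1, or less than lessThan; -1 if none.
        public static int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
        {
            for (int i = 0; i < span.Length; i++)
            {
                byte b = span[i];
                if (b == value0 || b == value1 || b < lessThan)
                    return i;
            }
            return -1;
        }
    }

    // Example: find the closing quote, an escape, or a control character in UTF-8 JSON text:
    //   int index = ScanSketch.IndexOfOrLessThan(utf8Json, (byte)'"', (byte)'\\', 0x20);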

Each test was measured by running the benchmark 5 times and taking the fastest run; each run executed the loop 10,000 times. Times below are in milliseconds.

| Test | Core JIT | Mono JIT | Mono interpreter |
| --- | --- | --- | --- |
| Serialize POCO with 10K string - Before | 76 | 1342 | 6510 |
| Serialize POCO with 10K string - After | 76 | 543 (2.47x) | 4525 (1.44x) |
| Deserialize POCO with 10K string - Before | 45 | 761 | 6890 |
| Deserialize POCO with 10K string - After | 45 | 695 (1.09x) | 5056 (1.36x) |
| Serialize POCO with 1K string - Before | 16 | 132 | 789 |
| Serialize POCO with 1K string - After | 16 | 52 (2.54x) | 601 (1.44x) |
| Deserialize POCO with 1K string - Before | 11 | 74 | 855 |
| Deserialize POCO with 1K string - After | 11 | 67 (1.10x) | 694 (1.36x) |
| Serialize POCO with 100-char string - Before | 8 | 19 | 232 |
| Serialize POCO with 100-char string - After | 8 | 13 (1.46x) | 210 (1.10x) |
| Deserialize POCO with 100-char string - Before | 8 | 13 | 290 |
| Deserialize POCO with 100-char string - After | 8 | 12 (1.08x) | 261 (1.11x) |
| Serialize POCO with 10-char string - Before | 7 | 8 | 158 |
| Serialize POCO with 10-char string - After | 7 | 8 | 157 |
| Deserialize POCO with 10-char string - Before | 6 | 8 | 221 |
| Deserialize POCO with 10-char string - After | 6 | 7 | 207 (1.07x) |
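As a rough sketch of the measurement described above (fastest of 5 runs, 10,000 iterations per run), with the class and method names invented for illustration rather than taken from the actual harness:

    using System;
    using System.Diagnostics;
    using System.Text.Json;

    public static class BenchmarkSketch
    {
        // The POCO from the description above.
        public struct SimpleStructWithProperties
        {
            public int Num { get; set; }
            public string Text { get; set; }
        }

        // Run the body 10,000 times, repeat 5 times, and keep the fastest run.
        private static long Fastest(Action body)
        {
            long best = long.MaxValue;
            for (int run = 0; run < 5; run++)
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < 10_000; i++)
                {
                    body();
                }
                sw.Stop();
                best = Math.Min(best, sw.ElapsedMilliseconds);
            }
            return best;
        }

        public static void Main()
        {
            var value = new SimpleStructWithProperties { Num = 1, Text = new string('a', 1024) };
            string json = JsonSerializer.Serialize(value);

            Console.WriteLine($"Serialize:   {Fastest(() => JsonSerializer.Serialize(value))} ms");
            Console.WriteLine($"Deserialize: {Fastest(() => JsonSerializer.Deserialize<SimpleStructWithProperties>(json))} ms");
        }
    }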

@steveharter steveharter added this to the 5.0.0 milestone Jul 21, 2020
@steveharter steveharter requested a review from jozkee as a code owner July 21, 2020 19:36
@steveharter steveharter self-assigned this Jul 21, 2020
The review thread below is attached to the IndexOfOrLessThan helper in System.Text.Json:

    private static unsafe int IndexOfOrLessThan(ref byte searchSpace, byte value0, byte value1, byte lessThan, int length)
    private static unsafe int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
Member:

What length is the break-even point where this is faster than the vectorized code below?

I am surprised this provides wins over the loop-unrolled sequential scan. Where is the overhead coming from?

This code was inspired by the general-purpose IndexOf method on Span<T>, here:

    public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)

Ideally, we should fix up that method to be close to fastest for all input lengths to avoid having two separate code-paths, since that is used in a lot more places. I am not sure it is worth it to make a change like this here, just specific to JSON reading.

Member Author (steveharter):

> What length is the break-even point where this is faster than the vectorized code below?

The manual loop is always faster when running under the Mono interpreter (I tried from 10 characters up to 10,240 characters, as in the table above).

It would be good to know where else Vector.IsHardwareAccelerated is false so I could run benchmarks on something other than the Mono interpreter.

ahsonkhan (Member) commented Jul 22, 2020:

What about less than 10 characters? For example, it is common for property names to be between 3-6 chars.

> The manual loop is always faster when running under the Mono interpreter

What about the regular JIT/compiled environments? I am concerned this will hurt those environments, particularly for larger strings which could benefit from loop unrolling.

I believe there is a way to turn off IsHardwareAccelerated through some JIT flag or environment variable to simulate it. @AndyAyersMS might know. You could also just assume IsHardwareAccelerated is false in your local testing and remove that branch to see the impact.

Member:

I want to echo what @jkotas mentioned here: #39323 (comment)

We should try to avoid trade-offs that try to improve the interpreter environment at the cost of JIT environments and code maintainability/duplication.

However, if all cases improve, then I can see an argument to be made for such a change.

Member:

Run with COMPlus_EnableHWIntrinsic=0 to disable hardware intrinsics.

Member:

The existing IndexOfOrLessThan method should work just fine on platforms without SIMD/Vectors. Vector.IsHardwareAccelerated will be constant, and the code will fall into the manually unrolled non-SIMD loop.

I believe that this change will make the performance measurably worse on platforms without SIMD/Vector acceleration (e.g. CoreCLR arm) because we will take this version that does not have the loop unrolled.
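For readers following along: a "manually unrolled" scan processes several bytes per loop iteration to amortize loop overhead. A simplified sketch of that shape, assuming the same quote/backslash/control-character search, not the actual SpanHelpers implementation (which unrolls more widely and uses pointer arithmetic and SIMD):

    using System;

    static class UnrolledScanSketch
    {
        // Checks four bytes per iteration, then handles the remaining tail one byte at a time.
        public static int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
        {
            int i = 0;
            for (; i <= span.Length - 4; i += 4)
            {
                if (Match(span[i], value0, value1, lessThan)) return i;
                if (Match(span[i + 1], value0, value1, lessThan)) return i + 1;
                if (Match(span[i + 2], value0, value1, lessThan)) return i + 2;
                if (Match(span[i + 3], value0, value1, lessThan)) return i + 3;
            }
            for (; i < span.Length; i++)
            {
                if (Match(span[i], value0, value1, lessThan)) return i;
            }
            return -1;
        }

        private static bool Match(byte b, byte value0, byte value1, byte lessThan)
            => b == value0 || b == value1 || b < lessThan;
    }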

Member:

There may also be SIMD support for WASM eventually. My understanding is that Google already supports it in V8 and there is an entire repo + proposals/extensions attempting to officially add SIMD as an available extension: WebAssembly/simd#31 and https://webassembly.org/roadmap/

Member Author (steveharter):

> What about less than 10 characters?

The perf is not affected when the string is <= 10 chars under the interpreter (within the margin of error).

I increased the iterations to 100_000 with a string length of 3:

Before:

    Serialization took 1536 ms
    Deserialization took 2141 ms

After:

    Serialization took 1550 ms
    Deserialization took 2072 ms

steveharter (Member Author) commented Aug 4, 2020:

These optimizations are focused on improving the WASM interpreter, which doesn't have SIMD/Vector support in the MVP platform. The numbers appear to show that the JIT doesn't suffer measurably?

This PR was intended to help the Mono interpreter, but it helps even more with Mono JIT. The original code path remains for Core JIT; the new path is taken only when Vector.IsHardwareAccelerated == false, meaning there is no hardware-accelerated SIMD support.

The benchmarks were performed against the raw Mono interpreter (not running under WASM); the interpreter today doesn't support SIMD acceleration regardless of whether it is running under WASM.

Member Author (steveharter):

> What about the regular JIT/compiled environments? I am concerned this will hurt those environments, particularly for larger strings which could benefit from loop unrolling.

Regular JIT/compiled environments use the existing code path unless Vector.IsHardwareAccelerated==false.

> I believe that this change will make the performance measurably worse on platforms without SIMD/Vector acceleration (e.g. CoreCLR arm) because we will take this version that does not have the loop unrolled.

Thanks for the CoreCLR arm reference. As I mentioned previously, I wasn't sure on which other platforms Vector.IsHardwareAccelerated would be false.

For a manual test, I commented out the first branch in IndexOfOrLessThan(ReadOnlySpan<byte> span, ...):

            if (length >= Vector<byte>.Count * 2)
            {
                int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector<byte>.Count - 1);
                nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
            }

to avoid the Vector optimizations. The large 10K string test ran ~1.75x slower, so yes, we should do something. I believe the unsafe code in the loop unrolling is what makes the interpreter slow. There is currently no option to detect the interpreter, and it's probably not something we want to expose anyway.

However, there is RuntimeFeature.IsDynamicCodeCompiled, which is false under the interpreter and could be used in conjunction with Vector.IsHardwareAccelerated to ensure the manual loop added in this PR only runs when both are false. Thoughts?
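A sketch of what that combined guard could look like. This is hypothetical; the stand-in method and both bodies are placeholders for illustration, not code from this PR:

    using System;
    using System.Numerics;
    using System.Runtime.CompilerServices;

    static class GuardSketch
    {
        public static int IndexOfQuoteOrAnyControlOrBackSlash(ReadOnlySpan<byte> utf8)
        {
            // Take the simple loop only when there is no SIMD acceleration AND the code
            // is being interpreted (RuntimeFeature.IsDynamicCodeCompiled is false there).
            if (!Vector.IsHardwareAccelerated && !RuntimeFeature.IsDynamicCodeCompiled)
            {
                // Interpreter-friendly path: plain indexed loop, no unsafe pointers.
                for (int i = 0; i < utf8.Length; i++)
                {
                    byte b = utf8[i];
                    if (b == (byte)'"' || b == (byte)'\\' || b < 0x20)
                        return i;
                }
                return -1;
            }

            return ExistingUnrolledOrVectorizedIndexOf(utf8);
        }

        // Stand-in for the existing unrolled/vectorized implementation.
        private static int ExistingUnrolledOrVectorizedIndexOf(ReadOnlySpan<byte> utf8)
        {
            for (int i = 0; i < utf8.Length; i++)
            {
                byte b = utf8[i];
                if (b == (byte)'"' || b == (byte)'\\' || b < 0x20)
                    return i;
            }
            return -1;
        }
    }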

lewing (Member) commented Jul 25, 2020:

The browser test failure is #39473

steveharter (Member Author) commented Aug 5, 2020:

> We should try to avoid trade-offs that try to improve the interpreter environment at the cost of JIT environments and code maintainability/duplication.
>
> However, if all cases improve, then I can see an argument to be made for such a change.

Another option is to have the interpreter intrinsify JsonReaderHelper.IndexOfQuoteOrAnyControlOrBackSlash. This avoids the S.T.J.dll changes in this PR, but it also means the method can't be renamed, so it doesn't completely address the "maintainability" factor.

steveharter (Member Author) commented:

> This code was inspired by the general-purpose IndexOf method on Span, here:
> runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
>
> Ideally, we should fix up that method to be close to fastest for all input lengths to avoid having two separate code-paths, since that is used in a lot more places. I am not sure it is worth it to make a change like this here, just specific to JSON reading.

For the serialization perf gain scenarios, the changes here are still necessary unless we have the interpreter intrinsify it instead. If we decide to intrinsify, then I will remove the STJ changes from this PR and create a Mono issue to intrinsify the method.

@BrzVlad what do you think about intrinsifying System.Text.Json.JsonReaderHelper.IndexOfQuoteOrAnyControlOrBackSlash(this ReadOnlySpan<byte> span) instead of adding the #if statement and helper method in the STJ source?

> to avoid having two separate code-paths, since that is used in a lot more places.

What are the "two separate code-paths"? Within Span<byte>.IndexOf()? If we apply the same change there, the implementation of that method would go from two code paths, Vector + NoVector (with some shared code), to three: Vector + NoVector + Interpreter.

This method could also be intrinsified by the interpreter (as well as Span<char>.IndexOf()) so the managed code can be left as-is. If we optimize or intrinsify Span<byte>.IndexOf() for the interpreter, we should also consider the other 6 or so variants of it, plus the ones for char.

BrzVlad (Member) commented Aug 5, 2020:

@steveharter Those checks in libraries code are not the prettiest, and intrinsifying seems very easy and would also lead to much faster code. How much time do you estimate is being spent in this method? I don't remember noticing it in my original benchmark.

steveharter (Member Author) commented:

> @steveharter Those checks in libraries code are not the prettiest, and intrinsifying seems very easy and would also lead to much faster code. How much time do you estimate is being spent in this method? I don't remember noticing it in my original benchmark.

@BrzVlad you'll need a test with larger strings (say 1K) to measure it effectively.

I'll remove the changes to S.T.J.dll in this PR and assume IndexOfQuoteOrAnyControlOrBackSlash will be intrinsified.

You can track or measure Span<char>.IndexOf() and related Span methods and determine if those could be intrinsified as well (that won't help with serialization). I don't have any existing benchmarks for those that I'm tracking.

Note that only deserialization is affected by intrinsifying; the other changes in this PR (to the S.T.E.W.dll library) affect serialization, can't be intrinsified, and will stay in this PR.

steveharter (Member Author) commented:

Failure is unrelated: System.Text.Tests.StringBuilderTests.FailureOnLargeString

      Timed out at 10/08/2020 16:43:15 after 60000ms waiting for remote process.
      Wrote mini dump to: C:\h\w\9CF208CE\w\A0620929\uploads\8468.3lcel0eo.iqw.dmp
      	Process ID: 8468
      	Handle: 1032
      	Name: dotnet
      	MainModule: C:\h\w\9CF208CE\p\dotnet.exe
      	StartTime: 10/08/2020 16:42:15
      	TotalProcessorTime: 00:01:00.3593750
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(225,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(58,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/libraries/System.Runtime/tests/System/Text/StringBuilderTests.cs(2237,0): at System.Text.Tests.StringBuilderTests.FailureOnLargeString()

monojenkins pushed a commit to monojenkins/mono that referenced this pull request Aug 12, 2020
Speeds up json deserialization by 5-10%

Following discussion from dotnet/runtime#39733
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020