
Improve JSON (de)serialization perf of longer strings on mono #39733

Merged: 4 commits, Aug 14, 2020

Conversation

steveharter (Member) commented Jul 21, 2020:

Improves serialization and deserialization performance on the Mono interpreter by ~1.4x for a POCO with a 1K string property.

Improves serialization performance on Mono JIT by ~2.5x and deserialization by ~1.1x for the same POCO.

The same tests with a property of only 10 characters (instead of 1,024) were too small to measure effectively (without increasing the loop count).

No regressions noted with Core JIT.

The changes to S.T.E.W.dll (System.Text.Encodings.Web) are responsible for the serialization gains, and the S.T.J.dll (System.Text.Json) changes are responsible for the deserialization gains.

The test POCO is the existing SimpleStructWithProperties, which was chosen because it shows the largest difference between Mono JIT and the Mono interpreter across the various benchmarks:

    public struct SimpleStructWithProperties
    {
        public int Num { get; set; }
        public string Text { get; set; }
    }

The serialization gains come from modifying System.Text.Encodings.Web to avoid accessing a static ReadOnly<byte> table for every character; the assumption is that the Core JIT's tiered compilation avoids the overhead of checking whether the table was initialized (after a few runs). There are additional interpreter gains from avoiding a method call for every character while escaping.
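To make the idea concrete, here is a minimal sketch of the two shapes being contrasted. The class, the table contents, and the method names are hypothetical illustrations, not the actual System.Text.Encodings.Web code:

    using System;

    static class EscapingSketch
    {
        // Hypothetical lookup table: non-zero means "this ASCII char must be escaped".
        private static readonly byte[] s_needsEscaping = BuildTable();

        private static byte[] BuildTable()
        {
            var table = new byte[128];
            for (int i = 0; i < 0x20; i++)
            {
                table[i] = 1;               // control characters always need escaping
            }
            table['"'] = 1;
            table['\\'] = 1;
            return table;
        }

        // "Before" shape: a method call per character, each re-reading the static field
        // (which, on Mono, may pay an initialization check on every access).
        private static bool NeedsEscaping(char c) => c >= 128 || s_needsEscaping[c] != 0;

        public static int FindFirstCharToEncode_PerCharCall(ReadOnlySpan<char> text)
        {
            for (int i = 0; i < text.Length; i++)
            {
                if (NeedsEscaping(text[i]))
                    return i;
            }
            return -1;
        }

        // "After" shape: read the static table into a local once, then test inline.
        public static int FindFirstCharToEncode_Hoisted(ReadOnlySpan<char> text)
        {
            ReadOnlySpan<byte> table = s_needsEscaping;
            for (int i = 0; i < text.Length; i++)
            {
                char c = text[i];
                if (c >= 128 || table[c] != 0)
                    return i;
            }
            return -1;
        }
    }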

The deserialization gains come from modifying System.Text.Json to optimize the scan for the first character that needs unescaping (typically the trailing quote) by avoiding unsafe pointers.
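Conceptually, the span-based scan has the shape below. The signature matches the new IndexOfOrLessThan overload discussed in the review thread further down; the body is a sketch of the idea, not the PR's exact code:

    using System;

    static class ScanSketch
    {
        // Plain indexed scan, no unsafe pointer arithmetic: returns the index of the
        // first byte equal to value0 or value1, or less than lessThan; -1 if none.
        public static int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
        {
            for (int i = 0; i < span.Length; i++)
            {
                byte b = span[i];
                if (b == value0 || b == value1 || b < lessThan)
                    return i;
            }
            return -1;
        }
    }

    // Example: find the closing quote, an escape, or a control character in UTF-8 JSON text:
    //   int index = ScanSketch.IndexOfOrLessThan(utf8Json, (byte)'"', (byte)'\\', 0x20);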

Each test was measured by running the benchmark 5 times and taking the fastest run; each run executed the loop 10,000 times. Times below are in milliseconds.

| Test | Core JIT | Mono JIT | Mono interpreter |
| --- | --- | --- | --- |
| Serialize POCO with 10K string - Before | 76 | 1342 | 6510 |
| Serialize POCO with 10K string - After | 76 | 543 (2.47x) | 4525 (1.44x) |
| Deserialize POCO with 10K string - Before | 45 | 761 | 6890 |
| Deserialize POCO with 10K string - After | 45 | 695 (1.09x) | 5056 (1.36x) |
| Serialize POCO with 1K string - Before | 16 | 132 | 789 |
| Serialize POCO with 1K string - After | 16 | 52 (2.54x) | 601 (1.44x) |
| Deserialize POCO with 1K string - Before | 11 | 74 | 855 |
| Deserialize POCO with 1K string - After | 11 | 67 (1.10x) | 694 (1.36x) |
| Serialize POCO with 100-char string - Before | 8 | 19 | 232 |
| Serialize POCO with 100-char string - After | 8 | 13 (1.46x) | 210 (1.10x) |
| Deserialize POCO with 100-char string - Before | 8 | 13 | 290 |
| Deserialize POCO with 100-char string - After | 8 | 12 (1.08x) | 261 (1.11x) |
| Serialize POCO with 10-char string - Before | 7 | 8 | 158 |
| Serialize POCO with 10-char string - After | 7 | 8 | 157 |
| Deserialize POCO with 10-char string - Before | 6 | 8 | 221 |
| Deserialize POCO with 10-char string - After | 6 | 7 | 207 (1.07x) |
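As a rough sketch of the measurement described above (fastest of 5 runs, 10,000 iterations per run), with the class and method names invented for illustration rather than taken from the actual harness:

    using System;
    using System.Diagnostics;
    using System.Text.Json;

    public static class BenchmarkSketch
    {
        // The POCO from the description above.
        public struct SimpleStructWithProperties
        {
            public int Num { get; set; }
            public string Text { get; set; }
        }

        // Run the body 10,000 times, repeat 5 times, and keep the fastest run.
        private static long Fastest(Action body)
        {
            long best = long.MaxValue;
            for (int run = 0; run < 5; run++)
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < 10_000; i++)
                {
                    body();
                }
                sw.Stop();
                best = Math.Min(best, sw.ElapsedMilliseconds);
            }
            return best;
        }

        public static void Main()
        {
            var value = new SimpleStructWithProperties { Num = 1, Text = new string('a', 1024) };
            string json = JsonSerializer.Serialize(value);

            Console.WriteLine($"Serialize:   {Fastest(() => JsonSerializer.Serialize(value))} ms");
            Console.WriteLine($"Deserialize: {Fastest(() => JsonSerializer.Deserialize<SimpleStructWithProperties>(json))} ms");
        }
    }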

@steveharter steveharter added this to the 5.0.0 milestone Jul 21, 2020
@steveharter steveharter requested a review from jozkee as a code owner July 21, 2020 19:36
@steveharter steveharter self-assigned this Jul 21, 2020
The review thread below is attached to the IndexOfOrLessThan helper in System.Text.Json:

    private static unsafe int IndexOfOrLessThan(ref byte searchSpace, byte value0, byte value1, byte lessThan, int length)
    private static unsafe int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
Member:

What length is the break-even point where this is faster than the vectorized code below?

I am surprised this provides wins over the loop-unrolled sequential scan. Where is the overhead coming from?

This code was inspired by the general-purpose IndexOf method on Span<T>, here:

    public static unsafe int IndexOf(ref byte searchSpace, byte value, int length)

Ideally, we should fix up that method to be close to fastest for all input lengths to avoid having two separate code-paths, since that is used in a lot more places. I am not sure it is worth it to make a change like this here, just specific to JSON reading.

Member Author (steveharter):

> What length is the break-even point where this is faster than the vectorized code below?

The manual loop is always faster when running under the Mono interpreter (I tried from 10 characters up to 10,240 characters, as in the table above).

It would be good to know where else Vector.IsHardwareAccelerated is false so I could run benchmarks on something other than the Mono interpreter.

ahsonkhan (Member) commented Jul 22, 2020:

What about less than 10 characters? For example, it is common for property names to be between 3-6 chars.

> The manual loop is always faster when running under the Mono interpreter

What about the regular JIT/compiled environments? I am concerned this will hurt those environments, particularly for larger strings which could benefit from loop unrolling.

I believe there is a way to turn off IsHardwareAccelerated through some JIT flag or environment variable to simulate it. @AndyAyersMS might know. You could also just assume IsHardwareAccelerated is false in your local testing and remove that branch to see the impact.

Member:

I want to echo what @jkotas mentioned here: #39323 (comment)

We should try to avoid trade-offs that try to improve the interpreter environment at the cost of JIT environments and code maintainability/duplication.

However, if all cases improve, then I can see an argument to be made for such a change.

Member:

Run with COMPlus_EnableHWIntrinsic=0 to disable hardware intrinsics.

Member:

The existing IndexOfOrLessThan method should work just fine on platforms without SIMD/Vectors. Vector.IsHardwareAccelerated will be constant, and the code will fall into the manually unrolled non-SIMD loop.

I believe that this change will make the performance measurably worse on platforms without SIMD/Vector acceleration (e.g. CoreCLR arm) because we will take this version that does not have the loop unrolled.
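For readers following along: a "manually unrolled" scan processes several bytes per loop iteration to amortize loop overhead. A simplified sketch of that shape, assuming the same quote/backslash/control-character search, not the actual SpanHelpers implementation (which unrolls more widely and uses pointer arithmetic and SIMD):

    using System;

    static class UnrolledScanSketch
    {
        // Checks four bytes per iteration, then handles the remaining tail one byte at a time.
        public static int IndexOfOrLessThan(ReadOnlySpan<byte> span, byte value0, byte value1, byte lessThan)
        {
            int i = 0;
            for (; i <= span.Length - 4; i += 4)
            {
                if (Match(span[i], value0, value1, lessThan)) return i;
                if (Match(span[i + 1], value0, value1, lessThan)) return i + 1;
                if (Match(span[i + 2], value0, value1, lessThan)) return i + 2;
                if (Match(span[i + 3], value0, value1, lessThan)) return i + 3;
            }
            for (; i < span.Length; i++)
            {
                if (Match(span[i], value0, value1, lessThan)) return i;
            }
            return -1;
        }

        private static bool Match(byte b, byte value0, byte value1, byte lessThan)
            => b == value0 || b == value1 || b < lessThan;
    }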

Member:

There may also be SIMD support for WASM eventually. My understanding is that Google already supports it in V8 and there is an entire repo + proposals/extensions attempting to officially add SIMD as an available extension: WebAssembly/simd#31 and https://webassembly.org/roadmap/

Member Author (steveharter):

> What about less than 10 characters?

The perf is not affected when the string is <= 10 chars under the interpreter (within the margin of error).

I increased the iterations to 100_000 with a string length of 3:

Before:

    Serialization took 1536 ms
    Deserialization took 2141 ms

After:

    Serialization took 1550 ms
    Deserialization took 2072 ms

steveharter (Member Author) commented Aug 4, 2020:

These optimizations are focused on improving the WASM interpreter, which doesn't have SIMD/Vector support in the MVP platform. The numbers appear to show that the JIT doesn't suffer measurably?

This PR was intended to help the Mono interpreter, but it helps even more with Mono JIT. The original code path remains for Core JIT; the new path is taken only when Vector.IsHardwareAccelerated == false, meaning there is no hardware-accelerated SIMD support.

The benchmarks were performed against the raw Mono interpreter (not running under WASM); the interpreter today doesn't support SIMD acceleration regardless of whether it is running under WASM.

Member Author (steveharter):

> What about the regular JIT/compiled environments? I am concerned this will hurt those environments, particularly for larger strings which could benefit from loop unrolling.

Regular JIT/compiled environments use the existing code path unless Vector.IsHardwareAccelerated==false.

> I believe that this change will make the performance measurably worse on platforms without SIMD/Vector acceleration (e.g. CoreCLR arm) because we will take this version that does not have the loop unrolled.

Thanks for the CoreCLR arm reference. As I mentioned previously, I wasn't sure on which other platforms Vector.IsHardwareAccelerated would be false.

For a manual test, I commented out the first branch in IndexOfOrLessThan(ReadOnlySpan<byte> span, ...):

            if (length >= Vector<byte>.Count * 2)
            {
                int unaligned = (int)Unsafe.AsPointer(ref searchSpace) & (Vector<byte>.Count - 1);
                nLength = (IntPtr)((Vector<byte>.Count - unaligned) & (Vector<byte>.Count - 1));
            }

to avoid the Vector optimizations. The large 10K string test ran ~1.75x slower, so yes, we should do something. I believe the unsafe code in the loop unrolling is what makes the interpreter slow. There is currently no option to detect the interpreter, and it's probably not something we want to expose anyway.

However, there is RuntimeFeature.IsDynamicCodeCompiled, which is false under the interpreter and could be used in conjunction with Vector.IsHardwareAccelerated to ensure the manual loop added in this PR only runs when both are false. Thoughts?
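A sketch of what that combined guard could look like. This is hypothetical; the stand-in method and both bodies are placeholders for illustration, not code from this PR:

    using System;
    using System.Numerics;
    using System.Runtime.CompilerServices;

    static class GuardSketch
    {
        public static int IndexOfQuoteOrAnyControlOrBackSlash(ReadOnlySpan<byte> utf8)
        {
            // Take the simple loop only when there is no SIMD acceleration AND the code
            // is being interpreted (RuntimeFeature.IsDynamicCodeCompiled is false there).
            if (!Vector.IsHardwareAccelerated && !RuntimeFeature.IsDynamicCodeCompiled)
            {
                // Interpreter-friendly path: plain indexed loop, no unsafe pointers.
                for (int i = 0; i < utf8.Length; i++)
                {
                    byte b = utf8[i];
                    if (b == (byte)'"' || b == (byte)'\\' || b < 0x20)
                        return i;
                }
                return -1;
            }

            return ExistingUnrolledOrVectorizedIndexOf(utf8);
        }

        // Stand-in for the existing unrolled/vectorized implementation.
        private static int ExistingUnrolledOrVectorizedIndexOf(ReadOnlySpan<byte> utf8)
        {
            for (int i = 0; i < utf8.Length; i++)
            {
                byte b = utf8[i];
                if (b == (byte)'"' || b == (byte)'\\' || b < 0x20)
                    return i;
            }
            return -1;
        }
    }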

lewing (Member) commented Jul 25, 2020:

The browser test failure is #39473

steveharter (Member Author) commented Aug 5, 2020:

> We should try to avoid trade-offs that try to improve the interpreter environment at the cost of JIT environments and code maintainability/duplication.
>
> However, if all cases improve, then I can see an argument to be made for such a change.

Another option is to have the interpreter intrinsify JsonReaderHelper.IndexOfQuoteOrAnyControlOrBackSlash. This avoids the S.T.J.dll changes in this PR, but it also means the method can't be renamed, so it doesn't completely address the "maintainability" factor.

steveharter (Member Author) commented:

> This code was inspired by the general-purpose IndexOf method on Span, here:
> runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
>
> Ideally, we should fix up that method to be close to fastest for all input lengths to avoid having two separate code-paths, since that is used in a lot more places. I am not sure it is worth it to make a change like this here, just specific to JSON reading.

For the serialization perf gain scenarios, the changes here are still necessary unless we have the interpreter intrinsify it instead. If we decide to intrinsify, then I will remove the STJ changes from this PR and create a Mono issue to intrinsify the method.

@BrzVlad what do you think about intrinsifying System.Text.Json.JsonReaderHelper.IndexOfQuoteOrAnyControlOrBackSlash(this ReadOnlySpan<byte> span) instead of adding the #if statement and helper method in the STJ source?

> to avoid having two separate code-paths, since that is used in a lot more places.

What are the "two separate code-paths"? Within Span<byte>.IndexOf()? If we apply the same change there, the implementation of that method would go from two code paths, Vector + NoVector (with some shared code), to three: Vector + NoVector + Interpreter.

This method could also be intrinsified by the interpreter (as well as Span<char>.IndexOf()) so the managed code can be left as-is. If we optimize or intrinsify Span<byte>.IndexOf() for the interpreter, we should also consider the other 6 or so variants of it, plus the ones for char.

BrzVlad (Member) commented Aug 5, 2020:

@steveharter Those checks in libraries code are not the prettiest, and intrinsifying seems very easy and would also lead to much faster code. How much time do you estimate is being spent in this method? I don't remember noticing it in my original benchmark.

steveharter (Member Author) commented:

> @steveharter Those checks in libraries code are not the prettiest, and intrinsifying seems very easy and would also lead to much faster code. How much time do you estimate is being spent in this method? I don't remember noticing it in my original benchmark.

@BrzVlad you'll need a test with larger strings (say 1K) to measure it effectively.

I'll remove the changes to S.T.J.dll in this PR and assume IndexOfQuoteOrAnyControlOrBackSlash will be intrinsified.

You can track or measure Span<char>.IndexOf() and related Span methods and determine if those could be intrinsified as well (that won't help with serialization). I don't have any existing benchmarks for those that I'm tracking.

Note that only deserialization is affected by intrinsifying; the other changes in this PR (to the S.T.E.W.dll library) affect serialization, can't be intrinsified, and will stay in this PR.

steveharter (Member Author) commented:

Failure is unrelated: System.Text.Tests.StringBuilderTests.FailureOnLargeString

      Timed out at 10/08/2020 16:43:15 after 60000ms waiting for remote process.
      Wrote mini dump to: C:\h\w\9CF208CE\w\A0620929\uploads\8468.3lcel0eo.iqw.dmp
      	Process ID: 8468
      	Handle: 1032
      	Name: dotnet
      	MainModule: C:\h\w\9CF208CE\p\dotnet.exe
      	StartTime: 10/08/2020 16:42:15
      	TotalProcessorTime: 00:01:00.3593750
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(225,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(58,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/libraries/System.Runtime/tests/System/Text/StringBuilderTests.cs(2237,0): at System.Text.Tests.StringBuilderTests.FailureOnLargeString()

monojenkins pushed a commit to monojenkins/mono that referenced this pull request Aug 12, 2020
Speeds up json deserialization by 5-10%

Following discussion from dotnet/runtime#39733
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020