
Vxsort #37159

Merged: 33 commits, Jul 15, 2020
Conversation

@PeterSolMS (Contributor):

Faster sorting code from Dan Shechter, and bigger mark list.

Not ready to be merged yet, but ready for initial review.

@ghost commented May 29, 2020:

Tagging subscribers to this area: @Maoni0
Notify danmosemsft if you want to be subscribed.

@stephentoub (Member):

cc: @damageboy

@@ -0,0 +1,282 @@


Contributor:

I don't mind adding whatever license you feel is friendliest to the original

Contributor:

In the meantime, I've licensed under MIT license:
https://github.com/damageboy/gcsort/blob/master/LICENSE

};
}
void vxsort(uint8_t** low, uint8_t** high, unsigned int depth)
{
Member:

Do we need an "is AVX available" check here and a fallback to the non-vectorized sort?

Contributor:

Yep, unless you'd like me/Peter to copy the rest of the non-vectorized partitioning back into the same sort.

It already shares the heap-sort + insertion-sort code from the original as a fallback just in case we run into one of the existing edge-cases, as with standard introsort.

This should reduce duplication, as I literally copied 2/3 of introsort into vxsort from the get-go.

Contributor:

With clang/gcc I've previously used builtins:

if (__builtin_cpu_supports("avx2")) {
  // AVX2
} else {
  // SAD PANDA
}

To cleanly detect this.

Pretty sure you'll have to do something else for MSVC.
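For reference, the MSVC-side equivalent is usually the CPUID + XGETBV dance. A minimal sketch (editor's illustration, not code from this PR; the helper name is made up):

#include <intrin.h>

bool IsAvx2SupportedMsvc()
{
    int regs[4];

    // CPUID leaf 1: OSXSAVE (ECX bit 27) must be set before XGETBV is usable.
    __cpuidex(regs, 1, 0);
    if ((regs[2] & (1 << 27)) == 0)
        return false;

    // XGETBV on XCR0: bits 1 and 2 mean the OS saves XMM and YMM state.
    if ((_xgetbv(0) & 0x6) != 0x6)
        return false;

    // CPUID leaf 7, sub-leaf 0: AVX2 is reported in EBX bit 5.
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;
}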

Contributor (Author):

Do we have prior art for this? I found EEJitManager::SetCpuInfo which detects the AVX2 instruction set for the JIT.

Member:

For Unix, __builtin_cpu_supports("avx2") looks pretty nice. For Windows, I guess it will need to call GetEnabledXStateFeatures and check for XSTATE_MASK_AVX
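A sketch of what that Windows check could look like (editor's illustration; GetEnabledXStateFeatures and XSTATE_MASK_AVX are real Win32 definitions, the wrapper is hypothetical):

#include <windows.h>

bool IsAvxStateEnabled()
{
    // GetEnabledXStateFeatures reports which extended register states the OS
    // preserves across context switches; AVX needs the YMM state enabled.
    return (GetEnabledXStateFeatures() & XSTATE_MASK_AVX) != 0;
}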

Contributor:

The general mechanism I'm familiar with is function multi-versioning; this has been supported in gcc since forever, and in clang from 7.0 onwards, if I'm not mistaken.

The basic idea would be that a function can be decorated with:

__attribute__ ((target ("default")))
int foo () {
  return 1;
}

__attribute__ ((target ("avx2")))
int foo () {
  return 2;
}

And you get an indirect jump embedded in the code stream courtesy of the compiler.

I'm pretty sure this does not exist for MSVC, where you need to hand-code such things, but then again, you have more apt MSVC gurus around you, so they might know something I don't. It seems that if you would like to avoid depending on the EEJitManager, I could add support for this myself: for clang/gcc via the normal mechanisms, and for MSVC by following what seems to be the standard practice, inside vxsort itself. That way I can switch from AVX2 to AVX512 when my AVX512 support is complete.

Would you like me to take care of all of this internally and expose one sort functionality that does everything?
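For illustration, a hand-rolled MSVC equivalent of that dispatch might look roughly like this (editor's sketch, not vxsort's actual mechanism; IsAvx2SupportedMsvc is the hypothetical helper from above):

static int foo_default() { return 1; }
static int foo_avx2()    { return 2; }

static int foo_resolver();
static int (*foo_ptr)() = foo_resolver;   // starts at the resolver

static int foo_resolver()
{
    // Resolve once, then patch the pointer so later calls jump directly to
    // the chosen implementation (do this before spawning threads that call it).
    foo_ptr = IsAvx2SupportedMsvc() ? foo_avx2 : foo_default;
    return foo_ptr();
}

int foo() { return foo_ptr(); }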

Member:

Is there any reason we can't use the cached checks the VM already has from the HWIntrinsics feature?

Member:

We want to support building the GC into a separate .dll (aka LocalGC).

Contributor:

I'll also add that you don't detect AVX512 yet either, which I'm working on in a vxsort branch, regardless of whether @PeterSolMS ends up taking it.

@@ -0,0 +1,1481 @@

Member:

A comment about how this file was generated would be nice.

@damageboy (Contributor) May 29, 2020:

The generator script is part of my repo:

https://github.com/damageboy/gcsort/blob/master/gcsort/smallsort/bitonic_gen.py

Note that it generates all six 32- and 64-bit types.

Contributor:

By the way, I would highly recommend including the generator and reviewing it instead of the mountains of code it generated.

It might actually make sense that way :)

Contributor:

@jkotas I've updated the original repo to now include proper auto-generation disclaimers:

/////////////////////////////////////////////////////////////////////////////
//
// This file was auto-generated by a tool at 2020-05-30 11:18:44
//
// It is recommended you DO NOT directly edit this file but instead edit
// the code-generator that generated this source file instead.
/////////////////////////////////////////////////////////////////////////////

I'm assuming we'll do another pass of updating the gc copy from my original given there will be more changes stemming from this review process.

Member:

we'd definitely want to include the generator.
BTW @PeterSolMS is OOF, that's why he hasn't said anything; he will be back this coming Tue.

Contributor (Author):

I wonder whether to put the generator under src/tools or src/gc?


@damageboy (Contributor):

For what it's worth, I've done some initial work to clean up the original codebase.

Currently the gcsort repo's latest and greatest builds on macOS, Linux, and Windows, using a combination of MSVC + clang-cl on Windows and clang + gcc on macOS/Linux.

There is no functional change to the C++ code in the original, but knowing that many compilers and operating systems all build and run the tests successfully should provide some reassurance as to the quality of the code.

The macOS build, specifically, cannot run tests on GH Actions right now, as it seems those machines lack AVX2 support, which seems odd enough, but still...


@adamsitnik added the tenet-performance label on Jun 1, 2020
… entries pertaining to the local heap by reading the mark lists from all the heaps appears to be too slow and scales very badly with increasing number of heaps.
@danmoseley (Member):

For the sake of us curious bystanders, do we have an idea of the impact of this cool change? E.g., perf measurements.

@Maoni0 (Member) commented Jun 8, 2020:

measuring and analyzing the perf impact is a big part of what Peter has been doing; it will be included as part of the commit message when we merge.

@danmoseley (Member):

Sounds good - I was just excited :)


template <>
class numeric_limits<int64_t>
do_vxsort_avx512(low, high);
Member:

Are there any numbers showing how AVX512 performs in real world scenarios?

It is fairly well documented that using AVX512 can cause downclocking which in certain circumstances can lead to decreased overall application performance (AVX2 can as well, but to a much lesser extent).

  • The impact is lower on newer processors such as Ice Lake as compared to older Skylake Server
  • The impact is typically relegated to instructions that use the floating-point or multiplication (including FMA) pipelines
  • The impact is typically for 256 and 512-bit variants of the instructions, rather than 128-bit

The official guidance from Intel is the following (from the May 2020 edition of the Intel 64 and IA-32 Architectures Optimization Reference Manual):
[screenshot: Intel's guidance table on AVX-512 usage and frequency licenses]

The section detailing Skylake Server Power Management is a bit longer, with more details on what impacts each frequency level and various graphs/examples of the typical impact, along with the explicit guidance to still profile.

@damageboy (Contributor) Jun 24, 2020:

Downclocking can be pretty severe on 14nm parts right now.
From AnandTech's article Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - The Server CPU Battle of the Decade?
[chart from the AnandTech article]

While that is true, we are still seeing a substantial perf boost on a given machine (copy-pasting some preliminary results, while I'm still improving the code behind the scenes):

For 64-bit sorting, here are the original and the 8-way unrolled results, obtained from a shared instance running a Xeon Silver 4216 with a nominal speed of 2.1 GHz:

-------------------------------------------------------------------------------------------
Benchmark (<type/vector-isa/unroll>/size/threads)                      Time/N (per Element)
-------------------------------------------------------------------------------------------
BM_full_introsort/65536/threads:1                                      56.55590 ns
BM_full_introsort/131072/threads:1                                     60.63450 ns
BM_full_introsort/262144/threads:1                                     65.21260 ns
BM_full_introsort/524288/threads:1                                     69.03150 ns
BM_full_introsort/1048576/threads:1                                    73.16970 ns

BM_vxsort<int64_t, vector_machine::AVX2, 8>/65536/threads:1            20.14540 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/131072/threads:1           21.08380 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/262144/threads:1           22.36290 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/524288/threads:1           23.05820 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/1048576/threads:1          24.21020 ns

BM_vxsort<int64_t, vector_machine::AVX512, 8>/131072/threads:1          8.23500 ns
BM_vxsort<int64_t, vector_machine::AVX512, 8>/262144/threads:1          8.90877 ns
BM_vxsort<int64_t, vector_machine::AVX512, 8>/524288/threads:1          9.94376 ns
BM_vxsort<int64_t, vector_machine::AVX512, 8>/1048576/threads:1        10.70730 ns

So it's pretty clear that AVX512 is out-performing AVX2, post-downclocking.
There is also a very clear reason for this staggering improvement, if you are aware of two key points:

  • AVX2 is missing some int64 functionality (certain int64 ops are more expensive)
  • There is an extra AVX512-specific intrinsic (_mm512_compress_storeu_epiXX) which removes the entire lookup table plus the cache reference that was involved in loading each lookup entry (a huge win; see the sketch below)

As such, the perf bump can be thought of as ONLY being 2x-ish due to downclocking, with a clear expectation of seeing 3x with future 10nm parts and below.
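To make the second bullet concrete, here is a rough sketch of the two partition-store shapes (editor's illustration, heavily simplified relative to the real vxsort code):

#include <immintrin.h>

// AVX2: pack the lanes on the correct side of the pivot by permuting with a
// precomputed table indexed by the 4-bit comparison mask (an extra cache access).
void partition_block_avx2(__m256i data, __m256i pivot,
                          const __m256i* perm_table, int64_t* out)
{
    __m256i gt     = _mm256_cmpgt_epi64(data, pivot);
    int     mask   = _mm256_movemask_pd(_mm256_castsi256_pd(gt));
    __m256i packed = _mm256_permutevar8x32_epi32(data, perm_table[mask]);
    _mm256_storeu_si256((__m256i*)out, packed);
}

// AVX512: the compress-store intrinsic packs the selected lanes by itself,
// so the lookup table (and its cache reference) disappears entirely.
void partition_block_avx512(__m512i data, __m512i pivot, int64_t* out)
{
    __mmask8 le = _mm512_cmple_epi64_mask(data, pivot);
    _mm512_mask_compressstoreu_epi64(out, le, data);
}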

As for int32 (which is also probably going to be part of this PR, given that we will probably dynamically switch to using int32 for smaller mark-lists):

-------------------------------------------------------------------------------------------
Benchmark (<type/vector-isa/unroll>/size/threads)                      Time/N (per Element)
-------------------------------------------------------------------------------------------
BM_vxsort<int32_t, vector_machine::AVX2, 8>/65536/threads:1            5.47985 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/131072/threads:1           5.89806 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/262144/threads:1           6.13415 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/524288/threads:1           6.49738 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/1048576/threads:1          7.10343 ns
 
BM_vxsort<int32_t, vector_machine::AVX512, 8>/65536/threads:1          3.70457 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/131072/threads:1         3.94963 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/262144/threads:1         4.12517 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/524288/threads:1         4.35658 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/1048576/threads:1        4.86922 ns

While the perf boost from AVX2 to AVX512 is smaller for int32, it is still substantial. Also, please remember these are preliminary results that are somewhat less than optimal, given the lack of direct HW access I'm experiencing, and the fact that everything I do with AVX512F is simply more painful because of this.

The benchmarks and code are all up-to-date and accessible from the vxsort-cpp repository.

Finally, there is also a discussion to be had about how long the downclocking effects linger after sorting.
The long answer is that Travis Downs (@trav_downs) measured this independently: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

The short answer is no more than ~700 µs:

What                  Time         Description
--------------------  -----------  ----------------------------------------------------------------
Voltage transition    ~8 to 20 µs  Time required for a voltage transition; depends on the frequency
Frequency transition  ~10 µs       Time required for the halted part of a frequency transition
Relaxation period     ~680 µs      Time required to go back to a lower power license, measured
                                   from the last instruction requiring the higher license

It is important to note that on server CPUs this normally applies to individual cores, whereas on client CPUs it is "global".
For 14nm, there are currently only server parts, the exception being the Intel Ice Lake CPUs.

I think that given we are saving multiple milliseconds with this optimization, even while growing the mark list, paying with ~700 µs of reduced frequency right after that time-saving window is a more than reasonable, and temporary, sacrifice.

My conclusion from this is that the current "state of the union" is roughly the worst it can get (although multi-threaded results are missing, because I don't have access to a dedicated machine where I could produce meaningful results; I barely trust the results I do have, in that sense, since this is not a machine I can kick people off of).

As you mentioned, Ice Lake and Tiger Lake only make things better, and it is hard to imagine AMD doing worse once they come around to supporting AVX512, probably sometime around the transition to TSMC 5nm.

I still think I can kick things up a notch with a few more iterations (e.g., over the coming weekends). Peter has inspired me with an idea that could truly go ballistic, and I have some more ammo left regardless. So this really is as "bad" as it gets.

Hope this helps clear the fog and justify the risk.

Member:

The main thing I'm worried about is "micro-benchmark" vs "real-world".

It's pretty easy to define a micro-benchmark that shows AVX-512 or AVX2 is faster. The concern is how that impacts everything that runs after it is done.

For example, given the numbers above, AVX-512 is 2-5x faster (depending on whether it is compared to AVX2 or SSE). However, that is ~10 ns vs ~50 ns per element. If the downclocking can last up to 700 microseconds, that is 14,000x longer than the SSE scenario (https://www.google.com/search?q=700+microseconds+%2F+50+nanoseconds), and the question is whether the gains outweigh potentially running the entire processor at 50-75% speed for that much longer.

  • This naturally assumes a worst or near-worst case scenario; it may not actually be that bad. I just think it is something that we should measure carefully, with real-world workloads rather than just micro-benchmarks.

@damageboy (Contributor) Jun 24, 2020:

Well, it's 10 ns vs 70 ns per element.
Multiplied by 2^20 elements, it is... more...

Agreed that the only thing that counts at the end of the day is real-life perf for the end-to-end case rather than these micro-benchmarks. But when the micro-benchmark results are this overwhelmingly extreme compared to the starting point, at some point even real life's ability to mess with you comes to an end.

Sorting, especially at these sizes, tends to sit at the extreme end of the spectrum for "optimize at all costs" arguments.

We can hopefully allow larger and larger mark-lists with this, which as far as I can tell has all kinds of all-around healing effects, though I'm not a GC guru.

Contributor (Author):

There is a real-world motivation here - we have seen real-world scenarios where mark list sorting takes several milliseconds (with ~100k mark list entries). We currently have a limitation at 150k entries - when we run out, we have to take a less efficient path in plan_phase, which runs into the tens of milliseconds in real-world scenarios that I've seen.

I will consider limiting the AVX512 sorting to large mark lists though, to make sure we get a sizeable benefit.

@@ -59,6 +59,7 @@ enum CORINFO_InstructionSet
InstructionSet_SSE2_X64=25,
InstructionSet_SSE41_X64=26,
InstructionSet_SSE42_X64=27,
InstructionSet_AVX512=28,
Member:

This needs to be AVX512F to match the actual ISA name. There are a number of AVX512 extensions and it will be important to differentiate them if we eventually extend HWIntrinsics to support them.


@jkotas (Member) Jun 24, 2020:

Would it be better for the GC to have its own local detection (e.g. based on @damageboy's original repo), and not couple it with the JIT/EE interface and intrinsics support in the JIT?

This instruction set enum is autogenerated (look for "DO NOT EDIT THIS FILE! IT IS AUTOGENERATED" at the top of the file). Adding to this enum would mean updating the auto-generator and regenerating all the places where this shows up.

Member:

I don't have a preference here. I would think it should be named AVX512F even if it is a separate define/detection/etc.

Contributor (Author):

I totally missed the fact that the enum is autogenerated - back to the drawing board... Agree with naming it AVX512F, of course.

@safern mentioned this pull request Jul 8, 2020
@@ -129,6 +129,7 @@ class GCConfigStringHolder
INT_CONFIG (GCHeapHardLimitSOHPercent, "GCHeapHardLimitSOHPercent", NULL, 0, "Specifies the GC heap SOH usage as a percentage of the total memory") \
INT_CONFIG (GCHeapHardLimitLOHPercent, "GCHeapHardLimitLOHPercent", NULL, 0, "Specifies the GC heap LOH usage as a percentage of the total memory") \
INT_CONFIG (GCHeapHardLimitPOHPercent, "GCHeapHardLimitPOHPercent", NULL, 0, "Specifies the GC heap POH usage as a percentage of the total memory") \
INT_CONFIG (GCEnabledInstructionSets, "GCEnabledInstructionSets", NULL, -1, "Specifies whether GC can use AVX2 or AVX512F") \
Member:

could you add a comment that explains the usage of this config, since it's not obvious unless you read the detection code (by default we always use the highest available; 1 means AVX2 and 2 means AVX512F, and if both are available you always have to specify either 1 or 3)?

Contributor (Author):

Sure, I just added to the comment that is already part of the line:

"Specifies whether GC can use AVX2 or AVX512F - 0 for neither, 1 for AVX2, 3 for AVX512F"

@@ -19,7 +19,11 @@

#include "gcpriv.h"

#if defined(TARGET_AMD64) && defined(TARGET_WINDOWS)
Member:

I see that you do have detection on TARGET_UNIX as well, but we are only enabling VXSORT on Windows. Are you thinking of enabling it on TARGET_UNIX in a separate PR, or in this one?

Contributor (Author):

I was working on enabling vxsort on Linux; I will add it to this PR if you don't mind.

Contributor (Author):

As it turns out, enabling this for Linux is more involved than I thought - I think this should go in a separate PR after all.

void InitSupportedInstructionSet (int32_t configSetting)
{
s_supportedISA = (SupportedISA)((int)DetermineSupportedISA() & configSetting);
// we are assuming that AVX2 can be used if AVX521F can,
Member:

typo: AVX521F -> AVX512F

Contributor (Author):

Thanks for the catch!

Comment on lines 2095 to 2096
// despite possible downclocking on current devices
const size_t AVX512F_THRESHOLD_SIZE = 128 * 1024;
Member:

is this number based on perf experiments? sorry if I missed the data.

Contributor (Author):

It's based on Dan's numbers for both the downclocking impact (perhaps 30% over 700 microseconds, or ~200 microseconds) and the speed advantage of AVX512F over AVX2 (about 400 microseconds at 128*1024 elements). Dan's speed numbers are for relatively low clock rate machines (2.8 and 2.1 GHz), so on current hardware the time difference may be lower due to higher clock rates on both.

I think it's just a rough estimate, but not very wrong.
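Making that estimate explicit: a 30% slowdown over 700 µs costs roughly 0.3 × 700 ≈ 210 µs, against about 400 µs saved by AVX512F at 128 × 1024 elements, so the threshold still nets out to a win of roughly 200 µs.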

{
int32_t* mark_list_32 = (int32_t*)mark_list;
uint8_t* low = gc_low;
ptrdiff_t range = heap_segment_allocated (ephemeral_heap_segment) - low;
Member:

instead of calculating the range this way, you can check the difference between slow and shigh (which are set by m_boundary in marking), which would give you the actual survived range.

Contributor (Author):

Ah, I see, I wasn't aware of these - thanks!

Member:

sorry, I should've been more clear - shigh is actually a valid object that we marked, just the highest one we marked, so the range should be shigh + Align (size (shigh)) - slow

}

// give up if this is not an ephemeral GC or the mark list size is unreasonably large
if (settings.condemned_generation > 1 || total_mark_list_size > total_ephemeral_size/256)
Member:

we should move the check for eph GC to the beginning of this method so we don't need to look at each heap's eph range at all. also, to check for eph GC it'd be better to do
if (settings.condemned_generation < max_generation)
instead of > 1.

is 256 a number based on experiments? 3% survival seems ok to me for using the mark list, but that's just a guess.

Contributor (Author):

Ok, I moved the test for eph GC up and changed it to compare against max_generation.

256 is a number based on a rough back-of-the-envelope computation. The idea was that walking the heap will be dominated by fetching cache lines, and for each marked object we will fetch roughly 2 cache lines (one for the marked object, and one for the unmarked object after it). The cache line size is 64 bytes, so if the number of mark list elements is ephemeral_size/256, we will fetch about half the cache lines. So at this threshold there should still be a bit of benefit to using the mark list, but probably not a whole lot, because just walking the heap would benefit from prefetching etc.

So that was my reasoning...
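Spelling out that arithmetic: with N = ephemeral_size / 256 mark list entries and about 2 × 64 = 128 bytes fetched per entry, the walk touches N × 128 = ephemeral_size / 2 bytes, i.e. roughly half the heap's cache lines, which is the break-even point the 256 divisor encodes.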

@Maoni0 (Member) commented Jul 10, 2020:

still looking...

 - fix typo in comment in InitSupportedInstructionSet
 - move test for full GC to beginning of sort_mark_list
 - in WKS GC, we can use the tighter range shigh - slow for the surviving objects instead of the full ephemeral range.
 - make the description for the new config setting GCEnabledInstructionSets more explicit by enumerating the legal values and their meanings.
{
dprintf (3, ("Sorting mark lists as 32-bit offsets"));

//#define WRITE_SORT_DATA
@Maoni0 (Member) Jul 15, 2020:

nit - seems weird to have a #define in the middle of its usage...

int log2_item_count = index_of_highest_set_bit (item_count);
double elapsed_cyles_by_n_log_n = (double)elapsed_cycles / item_count / log2_item_count;

// printf ("GC#%d: first phase of sort_mark_list for heap %d took %u cycles to sort %u entries (cost/(n*log2(n) = %5.2f)\n", settings.gc_index, this->heap_number, elapsed_cycles, item_count, elapsed_cyles_by_n_log_n);
@Maoni0 (Member) Jul 15, 2020:

are you actually using the printfs in this method? if not, we should just get rid of them... or convert them to dprintfs... having all these printfs sprinkled around makes the code look very temporary.

also, it looks like elapsed_cycles is not used unless you have the printf or WRITE_SORT_DATA is defined. same with the introsort version.

Contributor (Author):

I converted the printfs to dprintfs.

The call to introsort is a debug-only check. We can get rid of it once we fully trust the vectorized sort.

}

// give up if this is not an ephemeral GC or the mark list size is unreasonably large
if (total_mark_list_size > total_ephemeral_size/256)
Member:

nit - styling

if (total_mark_list_size > (total_ephemeral_size / 256))

Contributor (Author):

Ok, done.

ptrdiff_t entry_count = mark_list_index - mark_list;
// conservatively use AVX2 only for large mark lists,
// and do runtime test to check whether AVX2 is indeed available
if (entry_count > AVX2_THRESHOLD_SIZE && IsSupportedInstructionSet (InstructionSet::AVX2))
Member:

nit -

if ((entry_count > AVX2_THRESHOLD_SIZE) && IsSupportedInstructionSet (InstructionSet::AVX2))

Contributor (Author):

Ok, done.

for (ptrdiff_t i = 0; i < entry_count; i++)
{
uint8_t* item = mark_list[i];
assert (low <= item && item < high);
Member:

you could change this assert to

assert ((low <= item) && (item <= shigh));

Comment on lines 22325 to 22327
do_pack_avx2 (mark_list, entry_count, low);
_sort (&mark_list_32[0], &mark_list_32[entry_count - 1], 0);
do_unpack_avx2 (mark_list_32, entry_count, low);
Member:

can we get the pack/int32-sort/unpack into one method, something like try_vxsort_32, so we can call it both in plan_phase and sort_mark_list?

Contributor (Author):

That's a good idea - in fact, I moved the code that checks the range, the instruction set, etc. into one method, do_vxsort.

@@ -21,6 +21,10 @@

#define SERVER_GC 1

#if defined(TARGET_AMD64)
Member:

I suppose you might as well also change this to check for TARGET_WINDOWS to be consistent.

Contributor (Author):

Right, that slipped through the cracks - thanks!

@@ -21,6 +21,10 @@
#undef SERVER_GC
#endif

#if defined(TARGET_AMD64)
Member:

I suppose you might as well also change this to check for TARGET_WINDOWS to be consistent.

Contributor (Author):

Yep - thanks!

@Maoni0 (Member) commented Jul 15, 2020:

can we add some instructions in the bitonic sort codegen dir for how to run the Python scripts there, in case we need to change the length for which we want to sort with bitonic sort? I doubt we'll be doing that any time soon, but it's always good to know how, if we do want to experiment with it.

@PeterSolMS (Contributor, Author):

I added a comment at the top of vxsort/smallsort/bitonic_gen.py and verified that running it with the suggested arguments actually does produce the code we have.

 - add instructions to bitonic_gen.py
 - centralize range and instruction set checks in do_vxsort
 - add parentheses around expressions.
 - removed some printfs, converted others to dprintf
 - strengthened assert
@Maoni0 (Member) commented Jul 15, 2020:

all the failures are due to an eventpipe test failing, which is tracked by #39361

@Maoni0 merged commit 69b0d16 into dotnet:master Jul 15, 2020
@stephentoub (Member) commented Jul 15, 2020:

> measuring and analyzing the perf impact is a big part of what Peter has been doing; it will be included as part of the commit message when we merge.

Did this happen? I don't see the results in the commit message. Maybe it was in an earlier comment and I just missed it?

@Maoni0 (Member) commented Jul 15, 2020:

it happened and I've chatted with Peter; there's more to do on our side, but we wanted to get this in for Preview 8 so folks can try it and help flush out bugs, if any. in addition to testing on our side, we had one of our 1st party customers try it, and they saw a 10% ephemeral pause reduction. obviously, as with any perf change, this is not a universal number (we had another 1st party customer try this, and they saw no change, as they already didn't spend much time sorting the mark list).

in .NET 6 I'd like to make it so that we always include perf results for perf changes in the GC.

@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020