
Vxsort #37159

Merged: 33 commits, Jul 15, 2020
Conversation

@PeterSolMS (Contributor):

Faster sorting code from Dan Shechter, and bigger mark list.

Not ready to be merged yet, but ready for initial review.

@ghost commented May 29, 2020:

Tagging subscribers to this area: @Maoni0
Notify danmosemsft if you want to be subscribed.

@stephentoub (Member):

cc: @damageboy

@@ -0,0 +1,282 @@


Contributor:

I don't mind adding whatever license you feel is friendliest to the original

Contributor:

In the meantime, I've licensed under MIT license:
https://github.com/damageboy/gcsort/blob/master/LICENSE

};
}
void vxsort(uint8_t** low, uint8_t** high, unsigned int depth)
{
Member:

Do we need an "is AVX available" check here and a fallback to the non-vectorized sort?

Contributor:

Yep, unless you'd like me/Peter to copy the rest of the non-vectorized partitioning back into the same sort.

It already shares the heap-sort + insertion-sort code from the original as a fallback just in case we run into one of the existing edge-cases, as with standard introsort.

This should reduce duplication, as I literally copied 2/3 of introsort into vxsort from the get-go.

Contributor:

With clang/gcc I've previously used builtins:

if (__builtin_cpu_supports("avx2")) {
  // AVX2
} else {
  // SAD PANDA
}

To cleanly detect this.

Pretty sure you'll have to do something else for MSVC.
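For reference, the MSVC-side equivalent is usually the CPUID + XGETBV dance. A minimal sketch (editor's illustration, not code from this PR; the helper name is made up):

#include <intrin.h>

bool IsAvx2SupportedMsvc()
{
    int regs[4];

    // CPUID leaf 1: OSXSAVE (ECX bit 27) must be set before XGETBV is usable.
    __cpuidex(regs, 1, 0);
    if ((regs[2] & (1 << 27)) == 0)
        return false;

    // XGETBV on XCR0: bits 1 and 2 mean the OS saves XMM and YMM state.
    if ((_xgetbv(0) & 0x6) != 0x6)
        return false;

    // CPUID leaf 7, sub-leaf 0: AVX2 is reported in EBX bit 5.
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;
}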

Contributor (Author):

Do we have prior art for this? I found EEJitManager::SetCpuInfo which detects the AVX2 instruction set for the JIT.

Member:

For Unix, __builtin_cpu_supports("avx2") looks pretty nice. For Windows, I guess it will need to call GetEnabledXStateFeatures and check for XSTATE_MASK_AVX
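A sketch of what that Windows check could look like (editor's illustration; GetEnabledXStateFeatures and XSTATE_MASK_AVX are real Win32 definitions, the wrapper is hypothetical):

#include <windows.h>

bool IsAvxStateEnabled()
{
    // GetEnabledXStateFeatures reports which extended register states the OS
    // preserves across context switches; AVX needs the YMM state enabled.
    return (GetEnabledXStateFeatures() & XSTATE_MASK_AVX) != 0;
}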

Contributor:

The general mechanism I'm familiar with is function multi-versioning; this has been supported in gcc since forever, and in clang from 7.0 onwards, if I'm not mistaken.

The basic idea would be that a function can be decorated with:

__attribute__ ((target ("default")))
int foo () {
  return 1;
}

__attribute__ ((target ("avx2")))
int foo () {
  return 2;
}

And you get an indirect jump embedded in the code stream courtesy of the compiler.

I'm pretty sure this does not exist for MSVC, where you need to hand-code such things, but then again, you have more apt MSVC gurus around you, so they might know something I don't. It seems that if you would like to avoid depending on the EEJitManager, I could add support for this myself: for clang/gcc via the normal mechanisms, and for MSVC by following what seems to be the standard practice, inside vxsort itself. That way I can switch from AVX2 to AVX512 when my AVX512 support is complete.

Would you like me to take care of all of this internally and expose one sort functionality that does everything?
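For illustration, a hand-rolled MSVC equivalent of that dispatch might look roughly like this (editor's sketch, not vxsort's actual mechanism; IsAvx2SupportedMsvc is the hypothetical helper from above):

static int foo_default() { return 1; }
static int foo_avx2()    { return 2; }

static int foo_resolver();
static int (*foo_ptr)() = foo_resolver;   // starts at the resolver

static int foo_resolver()
{
    // Resolve once, then patch the pointer so later calls jump directly to
    // the chosen implementation (do this before spawning threads that call it).
    foo_ptr = IsAvx2SupportedMsvc() ? foo_avx2 : foo_default;
    return foo_ptr();
}

int foo() { return foo_ptr(); }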

Member:

Is there any reason we can't use the cached checks the VM already has from the HWIntrinsics feature?

Member:

We want to support building the GC into a separate .dll (aka LocalGC).

Contributor:

I'll also add that you don't detect AVX512 yet either, which I'm working on in a vxsort branch, regardless of whether @PeterSolMS ends up taking it.

@@ -0,0 +1,1481 @@

Member:

A comment about how this file was generated would be nice.

@damageboy (Contributor) May 29, 2020:

The generator script is part of my repo:

https://github.com/damageboy/gcsort/blob/master/gcsort/smallsort/bitonic_gen.py

Note that it generates all six 32- and 64-bit types.

Contributor:

By the way, I would highly recommend including the generator and reviewing it instead of the mountains of code it generated.

It might actually make sense that way :)

Contributor:

@jkotas I've updated the original repo to now include proper auto-generation disclaimers:

/////////////////////////////////////////////////////////////////////////////
//
// This file was auto-generated by a tool at 2020-05-30 11:18:44
//
// It is recommended you DO NOT directly edit this file but instead edit
// the code-generator that generated this source file instead.
/////////////////////////////////////////////////////////////////////////////

I'm assuming we'll do another pass of updating the gc copy from my original given there will be more changes stemming from this review process.

Member:

we'd definitely want to include the generator.
BTW @PeterSolMS is OOF, that's why he hasn't said anything; he will be back this coming Tue.

Contributor (Author):

I wonder whether to put the generator under src/tools or src/gc?


@damageboy (Contributor):

For what it's worth, I've done some initial work to clean up the original codebase.

Currently the gcsort repo's latest and greatest builds on macOS, Linux, and Windows, using a combination of MSVC + clang-cl on Windows and clang + gcc on macOS/Linux.

There is no functional change to the C++ code in the original, but knowing that many compilers and operating systems all build and run the tests successfully should provide some reassurance as to the quality of the code.

The macOS build, specifically, cannot run tests on GH Actions right now, as it seems those machines lack AVX2 support, which seems odd enough, but still...


@adamsitnik added the tenet-performance label on Jun 1, 2020
… entries pertaining to the local heap by reading the mark lists from all the heaps appears to be too slow and scales very badly with increasing number of heaps.
@danmoseley (Member):

For the sake of us curious bystanders, do we have an idea of the impact of this cool change? E.g., perf measurements.

@Maoni0 (Member) commented Jun 8, 2020:

measuring and analyzing the perf impact is a big part of what Peter has been doing; it will be included as part of the commit message when we merge.

@danmoseley (Member):

Sounds good - I was just excited :)


template <>
class numeric_limits<int64_t>
do_vxsort_avx512(low, high);
Member:

Are there any numbers showing how AVX512 performs in real world scenarios?

It is fairly well documented that using AVX512 can cause downclocking which in certain circumstances can lead to decreased overall application performance (AVX2 can as well, but to a much lesser extent).

  • The impact is lower on newer processors such as Ice Lake as compared to older Skylake Server
  • The impact is typically relegated to instructions that use the floating-point or multiplication (including FMA) pipelines
  • The impact is typically for 256 and 512-bit variants of the instructions, rather than 128-bit

The official guidance from Intel is the following (from the May 2020 edition of the Intel 64 and IA-32 Architectures Optimization Reference Manual):
[screenshot: Intel's guidance table on AVX-512 usage and frequency licenses]

The section detailing Skylake Server Power Management is a bit longer, with more details on what impacts each frequency level and various graphs/examples of the typical impact, along with the explicit guidance to still profile.

@damageboy (Contributor) Jun 24, 2020:

Downclocking can be pretty severe on 14nm parts right now.
From AnandTech's article Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - The Server CPU Battle of the Decade?
[chart from the AnandTech article]

While that is true, we are still seeing a substantial perf boost on a given machine (copy-pasting some preliminary results, while I'm still improving the code behind the scenes):

For 64-bit sorting, here are the original and the 8-way unrolled results, obtained from a shared instance running a Xeon Silver 4216 with a nominal speed of 2.1 GHz:

-------------------------------------------------------------------------------------------
Benchmark (<type/vector-isa/unroll>/size/threads)                      Time/N (per Element)
-------------------------------------------------------------------------------------------
BM_full_introsort/65536/threads:1                                      56.55590 ns
BM_full_introsort/131072/threads:1                                     60.63450 ns
BM_full_introsort/262144/threads:1                                     65.21260 ns
BM_full_introsort/524288/threads:1                                     69.03150 ns
BM_full_introsort/1048576/threads:1                                    73.16970 ns

BM_vxsort<int64_t, vector_machine::AVX2, 8>/65536/threads:1            20.14540 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/131072/threads:1           21.08380 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/262144/threads:1           22.36290 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/524288/threads:1           23.05820 ns
BM_vxsort<int64_t, vector_machine::AVX2, 8>/1048576/threads:1          24.21020 ns

BM_vxsort<int64_t, vector_machine::AVX512, 8>/131072/threads:1          8.23500 ns
BM_vxsort<int64_t, vector_machine::AVX512, 8>/262144/threads:1          8.90877 ns
BM_vxsort<int64_t, vector_machine::AVX512, 8>/524288/threads:1          9.94376 ns
BM_vxsort<int64_t, vector_machine::AVX512, 8>/1048576/threads:1        10.70730 ns

So it's pretty clear that AVX512 is out-performing AVX2, post-downclocking.
There is also a very clear reason for this staggering improvement, if you are aware of two key points:

  • AVX2 is missing some int64 functionality (certain int64 ops are more expensive)
  • There is an extra AVX512-specific intrinsic (_mm512_compress_storeu_epiXX) which removes the entire lookup table plus the cache reference that was involved in loading each lookup entry (a huge win; see the sketch below)

As such, the perf bump can be thought of as ONLY being 2x-ish due to downclocking, with a clear expectation of seeing 3x with future 10nm parts and below.
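To make the second bullet concrete, here is a rough sketch of the two partition-store shapes (editor's illustration, heavily simplified relative to the real vxsort code):

#include <immintrin.h>

// AVX2: pack the lanes on the correct side of the pivot by permuting with a
// precomputed table indexed by the 4-bit comparison mask (an extra cache access).
void partition_block_avx2(__m256i data, __m256i pivot,
                          const __m256i* perm_table, int64_t* out)
{
    __m256i gt     = _mm256_cmpgt_epi64(data, pivot);
    int     mask   = _mm256_movemask_pd(_mm256_castsi256_pd(gt));
    __m256i packed = _mm256_permutevar8x32_epi32(data, perm_table[mask]);
    _mm256_storeu_si256((__m256i*)out, packed);
}

// AVX512: the compress-store intrinsic packs the selected lanes by itself,
// so the lookup table (and its cache reference) disappears entirely.
void partition_block_avx512(__m512i data, __m512i pivot, int64_t* out)
{
    __mmask8 le = _mm512_cmple_epi64_mask(data, pivot);
    _mm512_mask_compressstoreu_epi64(out, le, data);
}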

As for int32 (which is also probably going to be part of this PR, given that we will probably dynamically switch to using int32 for smaller mark-lists):

-------------------------------------------------------------------------------------------
Benchmark (<type/vector-isa/unroll>/size/threads)                      Time/N (per Element)
-------------------------------------------------------------------------------------------
BM_vxsort<int32_t, vector_machine::AVX2, 8>/65536/threads:1            5.47985 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/131072/threads:1           5.89806 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/262144/threads:1           6.13415 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/524288/threads:1           6.49738 ns
BM_vxsort<int32_t, vector_machine::AVX2, 8>/1048576/threads:1          7.10343 ns
 
BM_vxsort<int32_t, vector_machine::AVX512, 8>/65536/threads:1          3.70457 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/131072/threads:1         3.94963 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/262144/threads:1         4.12517 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/524288/threads:1         4.35658 ns
BM_vxsort<int32_t, vector_machine::AVX512, 8>/1048576/threads:1        4.86922 ns

While the perf boost from AVX2 to AVX512 is smaller for int32, it is still substantial. Also, please remember these are preliminary results that are somewhat less than optimal, given the lack of direct HW access I'm experiencing, and the fact that everything I do with AVX512F is simply more painful because of this.

The benchmarks and code are all up-to-date and accessible from the vxsort-cpp repository.

Finally, there is also a discussion to be had about how long the downclocking effects linger after sorting.
The long answer is that Travis Downs (@trav_downs) measured this independently: https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

The short answer is no more than ~700 µs:

What                  Time         Description
--------------------  -----------  ----------------------------------------------------------------
Voltage transition    ~8 to 20 µs  Time required for a voltage transition; depends on the frequency
Frequency transition  ~10 µs       Time required for the halted part of a frequency transition
Relaxation period     ~680 µs      Time required to go back to a lower power license, measured
                                   from the last instruction requiring the higher license

It is important to note that on server CPUs this normally applies to individual cores, whereas on client CPUs it is "global".
For 14nm, there are currently only server parts, the exception being the Intel Ice Lake CPUs.

I think that given we are saving multiple milliseconds with this optimization, even while growing the mark list, paying with ~700 µs of reduced frequency right after that time-saving window is a more than reasonable, and temporary, sacrifice.

My conclusion from this is that the current "state of the union" is roughly the worst it can get (although multi-threaded results are missing, because I don't have access to a dedicated machine where I could produce meaningful results; I barely trust the results I do have, in that sense, since this is not a machine I can kick people off of).

As you mentioned, Ice Lake and Tiger Lake only make things better, and it is hard to imagine AMD doing worse once they come around to supporting AVX512, probably sometime around the transition to TSMC 5nm.

I still think I can kick things up a notch with a few more iterations (e.g., over the coming weekends). Peter has inspired me with an idea that could truly go ballistic, and I have some more ammo left regardless. So this really is as "bad" as it gets.

Hope this helps clear the fog and justify the risk.

Member:

The main thing I'm worried about is "micro-benchmark" vs "real-world".

It's pretty easy to define a micro-benchmark that shows AVX-512 or AVX2 is faster. The concern is how that impacts everything that runs after it is done.

For example, given the numbers above, AVX-512 is 2-5x faster (depending on whether it is compared to AVX2 or SSE). However, that is ~10 ns vs ~50 ns per element. If the downclocking can last up to 700 microseconds, that is 14,000x longer than the SSE scenario (https://www.google.com/search?q=700+microseconds+%2F+50+nanoseconds), and the question is whether the gains outweigh potentially running the entire processor at 50-75% speed for that much longer.

  • This naturally assumes a worst or near-worst case scenario; it may not actually be that bad. I just think it is something that we should measure carefully, with real-world workloads rather than just micro-benchmarks.

@damageboy (Contributor) Jun 24, 2020:

Well, it's 10 ns vs 70 ns per element.
Multiplied by 2^20 elements, it is... more...

Agreed that the only thing that counts at the end of the day is real-life perf for the end-to-end case rather than these micro-benchmarks. But when the micro-benchmark results are this overwhelmingly extreme compared to the starting point, at some point even real life's ability to mess with you comes to an end.

Sorting, especially at these sizes, tends to sit at the extreme end of the spectrum for "optimize at all costs" arguments.

We can hopefully allow larger and larger mark-lists with this, which as far as I can tell has all kinds of all-around healing effects, though I'm not a GC guru.

Contributor (Author):

There is a real-world motivation here - we have seen real-world scenarios where mark list sorting takes several milliseconds (with ~100k mark list entries). We currently have a limitation at 150k entries - when we run out, we have to take a less efficient path in plan_phase, which runs into the tens of milliseconds in real-world scenarios that I've seen.

I will consider limiting the AVX512 sorting to large mark lists though, to make sure we get a sizeable benefit.

@@ -59,6 +59,7 @@ enum CORINFO_InstructionSet
InstructionSet_SSE2_X64=25,
InstructionSet_SSE41_X64=26,
InstructionSet_SSE42_X64=27,
InstructionSet_AVX512=28,
Member:

This needs to be AVX512F to match the actual ISA name. There are a number of AVX512 extensions and it will be important to differentiate them if we eventually extend HWIntrinsics to support them.


@jkotas (Member) Jun 24, 2020:

Would it be better for the GC to have its own local detection (e.g. based on @damageboy's original repo), and not couple it with the JIT/EE interface and intrinsics support in the JIT?

This instruction set enum is autogenerated (look for "DO NOT EDIT THIS FILE! IT IS AUTOGENERATED" at the top of the file). Adding to this enum would mean updating the auto-generator and regenerating all the places where this shows up.

Member:

I don't have a preference here. I would think it should be named AVX512F even if it is a separate define/detection/etc.

Contributor (Author):

I totally missed the fact that the enum is autogenerated - back to the drawing board... Agree with naming it AVX512F, of course.

@safern mentioned this pull request Jul 8, 2020
@@ -129,6 +129,7 @@ class GCConfigStringHolder
INT_CONFIG (GCHeapHardLimitSOHPercent, "GCHeapHardLimitSOHPercent", NULL, 0, "Specifies the GC heap SOH usage as a percentage of the total memory") \
INT_CONFIG (GCHeapHardLimitLOHPercent, "GCHeapHardLimitLOHPercent", NULL, 0, "Specifies the GC heap LOH usage as a percentage of the total memory") \
INT_CONFIG (GCHeapHardLimitPOHPercent, "GCHeapHardLimitPOHPercent", NULL, 0, "Specifies the GC heap POH usage as a percentage of the total memory") \
INT_CONFIG (GCEnabledInstructionSets, "GCEnabledInstructionSets", NULL, -1, "Specifies whether GC can use AVX2 or AVX512F") \
Member:

could you add a comment that explains the usage of this config, since it's not obvious unless you read the detection code (by default we always use the highest available; 1 means AVX2 and 2 means AVX512F, and if both are available you always have to specify either 1 or 3)?

Contributor (Author):

Sure, I just added to the comment that is already part of the line:

"Specifies whether GC can use AVX2 or AVX512F - 0 for neither, 1 for AVX2, 3 for AVX512F"

@@ -19,7 +19,11 @@

#include "gcpriv.h"

#if defined(TARGET_AMD64) && defined(TARGET_WINDOWS)
Member:

I see that you do have detection on TARGET_UNIX as well, but we are only enabling VXSORT on Windows. Are you thinking of enabling it on TARGET_UNIX in a separate PR, or in this one?

Contributor (Author):

I was working on enabling vxsort on Linux; I will add it to this PR if you don't mind.

Contributor (Author):

As it turns out, enabling this for Linux is more involved than I thought - I think this should go in a separate PR after all.

void InitSupportedInstructionSet (int32_t configSetting)
{
s_supportedISA = (SupportedISA)((int)DetermineSupportedISA() & configSetting);
// we are assuming that AVX2 can be used if AVX521F can,
Member:

typo: AVX521F -> AVX512F

Contributor (Author):

Thanks for the catch!

Comment on lines 2095 to 2096
// despite possible downclocking on current devices
const size_t AVX512F_THRESHOLD_SIZE = 128 * 1024;
Member:

is this number based on perf experiments? sorry if I missed the data.

Contributor (Author):

It's based on Dan's numbers for both the downclocking impact (perhaps 30% over 700 microseconds, or ~200 microseconds) and the speed advantage of AVX512F over AVX2 (about 400 microseconds at 128*1024 elements). Dan's speed numbers are for relatively low clock rate machines (2.8 and 2.1 GHz), so on current hardware the time difference may be lower due to higher clock rates on both.

I think it's just a rough estimate, but not very wrong.
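Making that estimate explicit: a 30% slowdown over 700 µs costs roughly 0.3 × 700 ≈ 210 µs, against about 400 µs saved by AVX512F at 128 × 1024 elements, so the threshold still nets out to a win of roughly 200 µs.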

{
int32_t* mark_list_32 = (int32_t*)mark_list;
uint8_t* low = gc_low;
ptrdiff_t range = heap_segment_allocated (ephemeral_heap_segment) - low;
Member:

instead of calculating the range this way, you can check the difference between slow and shigh (which are set by m_boundary in marking), which would give you the actual survived range.

Contributor (Author):

Ah, I see, I wasn't aware of these - thanks!

Member:

sorry, I should've been more clear - shigh is actually a valid object that we marked, just the highest one we marked, so the range should be shigh + Align (size (shigh)) - slow

}

// give up if this is not an ephemeral GC or the mark list size is unreasonably large
if (settings.condemned_generation > 1 || total_mark_list_size > total_ephemeral_size/256)
Member:

we should move the check for eph GC to the beginning of this method so we don't need to look at each heap's eph range at all. also, to check for eph GC it'd be better to do
if (settings.condemned_generation < max_generation)
instead of > 1.

is 256 a number based on experiments? 3% survival seems ok to me for using the mark list, but that's just a guess.

Contributor (Author):

Ok, I moved the test for eph GC up and changed it to compare against max_generation.

256 is a number based on a rough back-of-the-envelope computation. The idea was that walking the heap will be dominated by fetching cache lines, and for each marked object we will fetch roughly 2 cache lines (one for the marked object, and one for the unmarked object after it). The cache line size is 64 bytes, so if the number of mark list elements is ephemeral_size/256, we will fetch about half the cache lines. So at this threshold there should still be a bit of benefit to using the mark list, but probably not a whole lot, because just walking the heap would benefit from prefetching etc.

So that was my reasoning...
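Spelling out that arithmetic: with N = ephemeral_size / 256 mark list entries and about 2 × 64 = 128 bytes fetched per entry, the walk touches N × 128 = ephemeral_size / 2 bytes, i.e. roughly half the heap's cache lines, which is the break-even point the 256 divisor encodes.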

@Maoni0 (Member) commented Jul 10, 2020:

still looking...

 - fix typo in comment in InitSupportedInstructionSet
 - move test for full GC to beginning of sort_mark_list
 - in WKS GC, we can use the tighter range shigh - slow for the surviving objects instead of the full ephemeral range.
 - make the description for the new config setting GCEnabledInstructionSets more explicit by enumerating the legal values and their meanings.
{
dprintf (3, ("Sorting mark lists as 32-bit offsets"));

//#define WRITE_SORT_DATA
@Maoni0 (Member) Jul 15, 2020:

nit - seems weird to have a #define in the middle of its usage...

int log2_item_count = index_of_highest_set_bit (item_count);
double elapsed_cyles_by_n_log_n = (double)elapsed_cycles / item_count / log2_item_count;

// printf ("GC#%d: first phase of sort_mark_list for heap %d took %u cycles to sort %u entries (cost/(n*log2(n) = %5.2f)\n", settings.gc_index, this->heap_number, elapsed_cycles, item_count, elapsed_cyles_by_n_log_n);
@Maoni0 (Member) Jul 15, 2020:

are you actually using the printfs in this method? if not, we should just get rid of them... or convert them to dprintfs... having all these printfs sprinkled around makes the code look very temporary.

also, it looks like elapsed_cycles is not used unless you have the printf or WRITE_SORT_DATA is defined. same with the introsort version.

Contributor (Author):

I converted the printfs to dprintfs.

The call to introsort is a debug-only check. We can get rid of it once we fully trust the vectorized sort.

}

// give up if this is not an ephemeral GC or the mark list size is unreasonably large
if (total_mark_list_size > total_ephemeral_size/256)
Member:

nit - styling

if (total_mark_list_size > (total_ephemeral_size / 256))

Contributor (Author):

Ok, done.

ptrdiff_t entry_count = mark_list_index - mark_list;
// conservatively use AVX2 only for large mark lists,
// and do runtime test to check whether AVX2 is indeed available
if (entry_count > AVX2_THRESHOLD_SIZE && IsSupportedInstructionSet (InstructionSet::AVX2))
Member:

nit -

if ((entry_count > AVX2_THRESHOLD_SIZE) && IsSupportedInstructionSet (InstructionSet::AVX2))

Contributor (Author):

Ok, done.

for (ptrdiff_t i = 0; i < entry_count; i++)
{
uint8_t* item = mark_list[i];
assert (low <= item && item < high);
Member:

you could change this assert to

assert ((low <= item) && (item <= shigh));

Comment on lines 22325 to 22327
do_pack_avx2 (mark_list, entry_count, low);
_sort (&mark_list_32[0], &mark_list_32[entry_count - 1], 0);
do_unpack_avx2 (mark_list_32, entry_count, low);
Member:

can we get the pack/int32-sort/unpack into one method, something like try_vxsort_32, so we can call it both in plan_phase and sort_mark_list?

Contributor (Author):

That's a good idea - in fact, I moved the code that checks the range, the instruction set, etc. into one method, do_vxsort.

@@ -21,6 +21,10 @@

#define SERVER_GC 1

#if defined(TARGET_AMD64)
Member:

I suppose you might as well also change this to check for TARGET_WINDOWS to be consistent.

Contributor (Author):

Right, that slipped through the cracks - thanks!

@@ -21,6 +21,10 @@
#undef SERVER_GC
#endif

#if defined(TARGET_AMD64)
Member:

I suppose you might as well also change this to check for TARGET_WINDOWS to be consistent.

Contributor (Author):

Yep - thanks!

@Maoni0 (Member) commented Jul 15, 2020:

can we add some instructions in the bitonic sort codegen dir for how to run the Python scripts there, in case we need to change the length for which we want to sort with bitonic sort? I doubt we'll be doing that any time soon, but it's always good to know how, if we do want to experiment with it.

@PeterSolMS (Contributor, Author):

I added a comment at the top of vxsort/smallsort/bitonic_gen.py and verified that running it with the suggested arguments actually does produce the code we have.

 - add instructions to bitonic_gen.py
 - centralize range and instruction set checks in do_vxsort
 - add parentheses around expressions.
 - removed some printfs, converted others to dprintf
 - strengthened assert
@Maoni0 (Member) commented Jul 15, 2020:

all the failures are due to an eventpipe test failing, which is tracked by #39361

@Maoni0 merged commit 69b0d16 into dotnet:master Jul 15, 2020
@stephentoub (Member) commented Jul 15, 2020:

> measuring and analyzing the perf impact is a big part of what Peter has been doing; it will be included as part of the commit message when we merge.

Did this happen? I don't see the results in the commit message. Maybe it was in an earlier comment and I just missed it?

@Maoni0 (Member) commented Jul 15, 2020:

it happened and I've chatted with Peter; there's more to do on our side, but we wanted to get this in for Preview 8 so folks can try it and help flush out bugs, if any. in addition to testing on our side, we had one of our 1st party customers try it, and they saw a 10% ephemeral pause reduction. obviously, as with any perf change, this is not a universal number (we had another 1st party customer try this, and they saw no change, as they already didn't spend much time sorting the mark list).

in .NET 6 I'd like to make it so that we always include perf results for perf changes in the GC.

@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020