Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Linq.Parallel not finishing in 25 minutes in CI runs #29123

Closed
wfurt opened this issue Mar 29, 2019 · 33 comments
Closed

System.Linq.Parallel not finishing in 25 minutes in CI runs #29123

wfurt opened this issue Mar 29, 2019 · 33 comments
Labels
arch-arm64 area-System.Linq.Parallel disabled-test The test is disabled in source code against the issue os-linux Linux OS (any supported distro) untriaged New issue has not been triaged by the area owner
Milestone

Comments

@wfurt
Copy link
Member

wfurt commented Mar 29, 2019

We have issues with System.Linq.Parallel timing out sometimes. I bumped timout to 25 minutes and the test still did not finish. Somebody should look into it.

[BEGIN EXECUTION]
+ ./RunTests.sh /root/helix/work/correlation /root/helix/work/correlation /root/helix/work/correlation/tools
----- start 19:38:19 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet xunit.console.dll System.Linq.Parallel.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=nonnetcoreapptests -notrait category=nonlinuxtests -notrait category=IgnoreForCI -notrait category=failing
popd
===========================================================================================================
/root/helix/work/workitem /root/helix/work/workitem
  Discovering: System.Linq.Parallel.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Linq.Parallel.Tests (found 1185 of 1188 test cases)
  Starting:    System.Linq.Parallel.Tests (parallel test collections = on, max threads = 46)
   System.Linq.Parallel.Tests: [Long Running Test] 'System.Linq.Parallel.Tests.ParallelQueryCombinationTests.ElementAtOrDefault', Elapsed: 00:02:02
   System.Linq.Parallel.Tests: [Long Running Test] 'System.Linq.Parallel.Tests.ParallelQueryCombinationTests.ElementAt', Elapsed: 00:02:06

[EXECUTION TIMED OUT]
Exit Code:-3Executor timed out after 1500 seconds and was killed
@tarekgh
Copy link
Member

tarekgh commented Mar 31, 2019

@wfurt could you please provide a machine or instructions how we can look at this issue?

@tarekgh
Copy link
Member

tarekgh commented Mar 31, 2019

CC @janvorli if he has some ideas how we can profile this test to know where the time get spent. I am guessing this could be a arm64 synchronization issue?

@wfurt
Copy link
Member Author

wfurt commented Apr 1, 2019

I'll check with @ulisesh . On machine I use it takes ~10 minutes to complete.
But for CI, this failures is pretty consistent.

@janvorli
Copy link
Member

janvorli commented Apr 1, 2019

Was this release / checked or debug?
Do we know how many processor cores the CI machine has?
Is there a link to the log of the failed run?

@tarekgh
Copy link
Member

tarekgh commented Apr 1, 2019

@wfurt do you have a link to the failure logs?

@ulisesh could you please help with the issue as we need to get the machine failing with the CI? I think @wfurt already following up with you.

@wfurt
Copy link
Member Author

wfurt commented Apr 1, 2019

BTW ARM64 builds are now completely busted. When back in normal, my plan was to inject fake test to dump cpu & process info.

@tarekgh
Copy link
Member

tarekgh commented Apr 1, 2019

Cool, thanks @wfurt

@janvorli it looks the failures with this test on the debug and release builds.

@janvorli
Copy link
Member

janvorli commented Apr 1, 2019

Now I've noticed the max threads = 46 in the log. So this runs on one of the beefy many-core ARM64 devices, right? @wfurt, when you were trying to repro the issue, have you used one of those? It really may be a thread sync issue exposed by running on that many cores.

@wfurt
Copy link
Member Author

wfurt commented Apr 1, 2019

ok. I'll work with @ulisesh. It is container and machine I was testing on has only 6 cores. So the behavior could be different.

@tarekgh
Copy link
Member

tarekgh commented Apr 1, 2019

@janvorli

It really may be a thread sync issue exposed by running on that many cores

Do you have more info about the thread sync issue? is this tracked by some other issue?

@janvorli
Copy link
Member

janvorli commented Apr 1, 2019

@tarekgh no, it is just a guess. There is no evidence from other issues that we have such a problem. But this test is specific - there is a large scale parallelism on the CI hardware, which lead me to the guess.

@janvorli
Copy link
Member

janvorli commented Apr 1, 2019

I would suggest first trying to repro it on the 46 core machine and see if it is a hang or if it just takes a long time to complete.

@stephentoub
Copy link
Member

stephentoub commented Apr 1, 2019

Now I've noticed the max threads = 46 in the log

Here's my psychic debugging guess (which could be wrong):
Environment.ProcessorCount returned 46 but the ThreadPool's min thread count was something like 2. This would make some PLINQ queries in the tests behave incredibly slowly.

Even if the min thread count matched, if ProcessorCount returned 46, it would still make some queries operate on significant amounts of data, e.g.
https://github.com/dotnet/corefx/blob/a10890f4ffe0fadf090c922578ba0e606ebdd16c/src/System.Linq.Parallel/tests/WithCancellationTests.cs#L68-L74

@wfurt
Copy link
Member Author

wfurt commented Apr 1, 2019

I can check that @stephentoub. Is there some easy way how to dump ThreadPool's parameters? I can make them part of test Output for future use....

@stephentoub
Copy link
Member

Is there some easy way how to dump ThreadPool's parameters?

I'd suggest dumping the results of:
Environment.ProcessorCount
ThreadPool.GetMinThreads(out int minWorkerThreads, out int minIoThreads);
ThreadPool.GetMaxThreads(out int maxWorkerThreads, out int maxIoThreads);
ThreadPool.GetAvailableThreads(out int availableWorkerThreads, out int availableIoThreads);

@wfurt
Copy link
Member Author

wfurt commented Apr 1, 2019

thanks. I'll do it as part of the test run and I'll share results and timing from 46 core machine.

@wfurt
Copy link
Member Author

wfurt commented Apr 1, 2019

Helpers/Sources.cs:

       public static readonly int OuterLoopCount = 4 * 1024 * Environment.ProcessorCount;

@Clockwork-Muse
Copy link
Contributor

@wfurt - Is this actually doing the outerloop run, though? Although, like @stephentoub points out, similar stuff shows up in a couple of other places.

@tarekgh
Copy link
Member

tarekgh commented Apr 1, 2019

@stephentoub I see your point but I am wondering if we have a scaling problem in general? I expect when having more cores, we can handle more data in parallel (except if there is some other factors like other things running on the machine in same time). I know we cannot judge that before collecting more data as you have suggested.

Also, can we try running this test separately while not running any other test and look if this will make any difference?

@wfurt
Copy link
Member Author

wfurt commented Apr 3, 2019

I did more testing and the test always finished without failures. Note , that it has instructions for XUnits so it only runs one test at the time. All tests bellow are from life machine with no other significant load. In some cases I did multiple runs and I was getting consistent numbers.

At first I wanted to establish run on bare OS vs unlimited Docker. (outerloop)

  Discovering: System.Linq.Parallel.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Linq.Parallel.Tests (found 1186 of 1189 test cases)
  Starting:    System.Linq.Parallel.Tests (parallel test collections = on, max threads = 2)
  Finished:    System.Linq.Parallel.Tests
=== TEST EXECUTION SUMMARY ===
   System.Linq.Parallel.Tests  Total: 201365, Errors: 0, Failed: 0, Skipped: 0, Time: XXX

Base OS Docker
Duration 51m 53m
Source.OuterLoopCount 188416 188416
Environment.ProcessorCount 46 46
MinThreads WorkerThreads=46 IoThreads=46 WorkerThreads=46 IoThreads=46
MaxThreads WorkerThreads=32767 IoThreads=1000 WorkerThreads=32767 IoThreads=1000
AvailableThreads WorkerThreads=32766 IoThreads=1000 WorkerThreads=32766 IoThreads=1000

Docker run is little bit slower but not much. When running on base OS I was also looking CPU utilization with top. dotnet process would take 500-600% cpu e.g. 5-6 core equivalent and the system was 70-80% idle. That generally means that either we cannot properly parallelize the task, we have issues in runtime or we hit other limits - like memory bandwidth.

To get more data, I did more tests with Docker and various number or cores assigned to container.
(with old --cpus allocation see https://github.com/dotnet/coreclr/issues/22302)

Cores Duration Environment.ProcessorCount Source.OuterLoopCount
46 53m 46 188416
24 22m 24
12 11m 12 32768
2 3m 2 8192

Note that the MinThreads is equal to ProcessorCount. I wrote the stats as another test so the AvailableThreads are not collected during test run. I can probably create coredump during run if anybody wants to take a closer look.

I also did quick test where Source.OuterLoopCount = Math.Min(8, Environment.ProcessorCount) * 4 * 1024 ; with goal to not increase test iterations beyond certain point. With OuterLoopCount=32768 and ProcessorCount=46 duration was 50 minutes.
I know this is not only value tests use but I was wondering if we should get data with fixed load e.g. tests not using ProcessorCount at all.

@tarekgh
Copy link
Member

tarekgh commented Apr 3, 2019

This is very useful info @wfurt. Thanks.

If I am not mistaken, the test System.Linq.Parallel.Tests.ParallelQueryCombinationTests is not using Source.OuterLoopCount. so changing this value will not affect this specific test. of course it will affect many of other tests in System.Linq.Parallel.Tests

I am seeing System.Linq.Parallel.Tests.ParallelQueryCombinationTests still be affected dramatically by the processors count by
https://github.com/dotnet/corefx/blob/2be1075a381c11330907392951813c7bbab8b1ad/src/System.Linq.Parallel/tests/Combinatorial/SourcesAndOperators.cs#L123

And I am seeing the UnaryOperations is used multiple time in the test which means the parallelism in the test is much more the processor count. So, I expect increasing the processor count will just increase the time of the test dramatically and not linearly.

@wfurt is it possible to have one more trial to change the line same way you tried Math.Min(8, Environment.ProcessorCount) and get how much this will affect the test running time?

@Clockwork-Muse
Copy link
Contributor

And I am seeing the UnaryOperations is used multiple time in the test which means the parallelism in the test is much more the processor count. So, I expect increasing the processor count will just increase the time of the test dramatically and not linearly.

...uh, what? WithDegreesOfParallelism is a hint, not a guarantee, and defaults to processor count anyways. Furthermore, when PLINQ orchestrates the query it combines the settings, include degrees of parallelism, so multiple operators don't increase the used thread count (and might be capped to processor count too, IIRC).

We probably should change that line to something like:

// DefaultSize is 16
yield return new object[] { Label("WithDegreesOfParallelism", (start, count, source) => source(start, count).WithDegreeOfParallelism(Math.Min(DefaultSize / 4, Environment.ProcessorCount))) };

...because the combinatorial tests use a fixed size of data.

Most of the time taken by the PLINQ outerloop tests is just due to the fact that there's so many combinatorial tests, and all tests have to be run sequentially.

@wfurt
Copy link
Member Author

wfurt commented Apr 3, 2019

either modification gives still runtime around 50 minutes.

assembly name="System.Linq.Parallel.Tests.dll" environment="64-bit .NET Standard [collection-per-assembly, non-parallel]" test-framework="xUnit.net 2.4.1.4059" run-date="2019-04-03" run-time="13:24:32" total="201365" passed="201365" failed="0" skipped="0" time="3073.350" errors="0"

I attached testResults.xml from the run. It seems like longest test was ~ 2s but there are just too many:

grep 'result=' testResults.xml | cut -d '"' -f2,8 | sed -e 's/ /_/g' -e 's/"/ /'| sort -g -k2 -r | head -10

System.Linq.Parallel.Tests.GroupJoinTests.GroupJoin_Unordered_CustomComparator_Longrunning 1.9401561
System.Linq.Parallel.Tests.JoinTests.Join_Unordered_CustomComparator_Longrunning 1.9133081
System.Linq.Parallel.Tests.ParallelQueryCombinationTests.GetEnumerator_Unordered(operation:_Distinct) 1.6766002
System.Linq.Parallel.Tests.ParallelQueryCombinationTests.Concat(source:_ParallelEnumerable.Range|AsOrdered,_operation:_Skip) 1.6596459
System.Linq.Parallel.Tests.WithCancellationTests.WithCancellation_DisposedEnumerator_ChannelCancellation_ProducerBlocked(labeled:_Enumerable.Range-Ordered,_count:_3014656) 1.6504532
System.Linq.Parallel.Tests.ParallelQueryCombinationTests.Aggregate_AggregateException(source:_Default|Failing,_operation:_Except-Right:Default) 1.5845938
System.Linq.Parallel.Tests.WithCancellationTests.WithCancellation_DisposedEnumerator_ChannelCancellation_ProducerBlocked(labeled:_Partitioner-Ordered,_count:_3014656) 1.2649597
System.Linq.Parallel.Tests.ExceptTests.Except_Unordered_SourceMultiple_Longrunning(leftQuery:_[0,_1,_2,_3,_4,_...],_leftCount:_23552,_rightQuery:_[],_rightCount:_0,_start:_0,_count:_23552) 1.1311573
System.Linq.Parallel.Tests.DistinctTests.Distinct_Unordered_SourceMultiple_Longrunning(labeled:_Array,_count:_188416) 1.0686759
System.Linq.Parallel.Tests.WithCancellationTests.WithCancellation_DisposedEnumerator_ChannelCancellation_ProducerBlocked(labeled:_Enumerable.Range,_count:_3014656) 1.0376968

testResults.xml.zip

@tarekgh
Copy link
Member

tarekgh commented Apr 4, 2019

Thanks again @wfurt

I want to clarify Source.OuterLoopCount still contributing a lot in the test which I believe we should change. I was just curious if the other place would make much differences which looks not per last trial.

@wfurt
Copy link
Member Author

wfurt commented Apr 4, 2019

I can easily change both. I have build tree set up for cross-compilation as well as access to test machine.
I think I was expecting some long test, but I did not see any. There just way too many small tests in outerloop.

@wfurt
Copy link
Member Author

wfurt commented Apr 4, 2019

I replaced all instances of ProcessorCount with magic constant 4. With that, Outrloop test run on 46 core systems takes 49 minutes. So it seems like the major slowdown down not come from parallelism in test it self.

@Clockwork-Muse
Copy link
Contributor

@wfurt - by default, PLINQ attempts to use Environment.ProcessorCount internally when kicking off queries.

What happens if you add .WithDegreeOfParallelism(4) to all the sources in the combinatorial tests (this is the simplest option to restrict in most cases, since it will flow through all the other combinations):

So, for the unordered sources:

yield return Label("ParallelEnumerable.Range", (start, count, ignore) => ParallelEnumerable.Range(start, count)).WithDegreeOfParallelism(4);

@wfurt
Copy link
Member Author

wfurt commented May 1, 2019

I was able to get some improvements @Clockwork-Muse.
However I don't have time to keep experimenting so dotnet/corefx#37319 disables the tests for now. Somebody with better inner knowledge of Linq should investigate.

@tarekgh
Copy link
Member

tarekgh commented May 1, 2019

Is it possible we can get a machine or VM for a couple of days we can use to investigate more?

@wfurt
Copy link
Member Author

wfurt commented May 1, 2019

@ulisesh can give you access.

@tarekgh
Copy link
Member

tarekgh commented May 1, 2019

@ulisesh would the machine be isolated and not used for any other purpose? I won't disturb any running service there.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@stephentoub
Copy link
Member

I don't think there's anything here that's actionable, so I'm going to close it. We already changed some of the PLINQ tests based on this to not run based on machine configuration (those tests refer to this issue in the source, and we can leave those references for context), in situations where it would try to take over gigantic machines with tons of cores and throw lots of work at them.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 13, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-System.Linq.Parallel disabled-test The test is disabled in source code against the issue os-linux Linux OS (any supported distro) untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

7 participants