Deadlock when running TestAutoInference and TestPipelineSweeper in parallel #1095

Closed
eerhardt opened this issue Sep 28, 2018 · 1 comment

To reproduce:

On an Azure Standard_DS2_v2 machine (the same machine type used by the Hosted VS2017 pool in Azure DevOps), run the Microsoft.ML.Predictor.Tests tests in a loop for a while. (It took me 3 runs.)

Sometimes the tests will hang indefinitely.

I was able to attach a debugger when this happens, and there are 2 tests running:

  • TestPipelineSweeper.PipelineSweeperRocketEngine
  • TestAutoInference.TestLearnerConstrainingByName

Both tests were blocked at the same call stack:

 	System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 635	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 2978	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalWaitCore(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 2917	C#
 	System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task) Line 146	C#
 	System.Threading.Tasks.Dataflow.dll!System.Threading.Tasks.Dataflow.DataflowBlock.Receive<int>(System.Threading.Tasks.Dataflow.ISourceBlock<int> source, System.TimeSpan timeout, System.Threading.CancellationToken cancellationToken) Line 982	C#
 	System.Threading.Tasks.Dataflow.dll!System.Threading.Tasks.Dataflow.DataflowBlock.Receive<int>(System.Threading.Tasks.Dataflow.ISourceBlock<int> source) Line 888	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Data.ShuffleTransform.RowCursor.MoveNextCore() Line 649	C#
 	Microsoft.ML.Core.dll!Microsoft.ML.Runtime.Data.RootCursorBase.MoveNext() Line 70	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Training.TrainingCursorBase.MoveNext() Line 492	C#
 	Microsoft.ML.StandardLearners.dll!Microsoft.ML.Runtime.Learners.OnlineLinearTrainer<Microsoft.ML.Runtime.Data.BinaryPredictionTransformer<Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>, Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>.TrainCore(Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data) Line 188	C#
 	Microsoft.ML.StandardLearners.dll!Microsoft.ML.Runtime.Learners.OnlineLinearTrainer<Microsoft.ML.Runtime.Data.BinaryPredictionTransformer<Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>, Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>.TrainModelCore(Microsoft.ML.Runtime.TrainContext context) Line 135	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Training.TrainerEstimatorBase<Microsoft.ML.Runtime.Data.BinaryPredictionTransformer<Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>, Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>.Train(Microsoft.ML.Runtime.TrainContext context) Line 89	C#
 	Microsoft.ML.Core.dll!Microsoft.ML.Runtime.TrainerExtensions.Train<Microsoft.ML.Runtime.IPredictorProducing<float>>(Microsoft.ML.Runtime.ITrainer<Microsoft.ML.Runtime.IPredictorProducing<float>> trainer, Microsoft.ML.Runtime.Data.RoleMappedData trainData) Line 95	C#
 	Microsoft.ML.Ensemble.dll!Microsoft.ML.Runtime.Ensemble.EnsembleTrainerBase<float, Microsoft.ML.Runtime.IPredictorProducing<float>, Microsoft.ML.Runtime.Ensemble.Selector.IBinarySubModelSelector, Microsoft.ML.Runtime.Ensemble.OutputCombiners.IBinaryOutputCombiner>.TrainCore.AnonymousMethod__0(Microsoft.ML.Runtime.Ensemble.Subset subset, System.Threading.Tasks.ParallelLoopState state, long index) Line 153	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.PartitionerForEachWorker.AnonymousMethod__1(ref System.Collections.IEnumerator partitionState, int timeout, out bool replicationDelegateYieldedBeforeCompletion) Line 3224	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica<System.__Canon>.ExecuteAction(out bool yieldedBeforeCompletion) Line 124	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica.Execute() Line 80	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica..ctor.AnonymousMethod__4_0(object s) Line 40	C#
 	System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 167	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot) Line 2440	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task task, bool taskWasPreviouslyQueued) Line 75	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task task, bool taskWasPreviouslyQueued) Line 209	C#
 	System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler scheduler, bool waitForCompletion) Line 1126	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Run<System.Collections.IEnumerator>(System.Threading.Tasks.TaskReplicator.ReplicatableUserAction<System.Collections.IEnumerator> action, System.Threading.Tasks.ParallelOptions options, bool stopOnFirstFailure) Line 138	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.PartitionerForEachWorker<Microsoft.ML.Runtime.Ensemble.Subset, object>(System.Collections.Concurrent.Partitioner<Microsoft.ML.Runtime.Ensemble.Subset> source, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<Microsoft.ML.Runtime.Ensemble.Subset> simpleBody, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState> bodyWithState, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long> bodyWithStateAndIndex, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, object, object> bodyWithStateAndLocal, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long, object, object> bodyWithEverything, System.Func<object> localInit, System.Action<object> localFinally) Line 3157	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.ForEachWorker<Microsoft.ML.Runtime.Ensemble.Subset, object>(System.Collections.Generic.IEnumerable<Microsoft.ML.Runtime.Ensemble.Subset> source, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<Microsoft.ML.Runtime.Ensemble.Subset> body, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState> bodyWithState, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long> bodyWithStateAndIndex, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, object, object> bodyWithStateAndLocal, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long, object, object> bodyWithEverything, System.Func<object> localInit, System.Action<object> localFinally) Line 2139	C#
 	System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.ForEach<Microsoft.ML.Runtime.Ensemble.Subset>(System.Collections.Generic.IEnumerable<Microsoft.ML.Runtime.Ensemble.Subset> source, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long> body) Line 1776	C#
 	Microsoft.ML.Ensemble.dll!Microsoft.ML.Runtime.Ensemble.EnsembleTrainerBase<float, Microsoft.ML.Runtime.IPredictorProducing<float>, Microsoft.ML.Runtime.Ensemble.Selector.IBinarySubModelSelector, Microsoft.ML.Runtime.Ensemble.OutputCombiners.IBinaryOutputCombiner>.TrainCore(Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data) Line 143	C#
 	Microsoft.ML.Ensemble.dll!Microsoft.ML.Runtime.Ensemble.EnsembleTrainerBase<float, Microsoft.ML.Runtime.IPredictorProducing<float>, Microsoft.ML.Runtime.Ensemble.Selector.IBinarySubModelSelector, Microsoft.ML.Runtime.Ensemble.OutputCombiners.IBinaryOutputCombiner>.Train(Microsoft.ML.Runtime.TrainContext context) Line 111	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Training.TrainerBase<Microsoft.ML.Runtime.IPredictorProducing<float>>.Microsoft.ML.Runtime.ITrainer.Train(Microsoft.ML.Runtime.TrainContext context) Line 31	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Data.TrainUtils.TrainCore(Microsoft.ML.Runtime.IHostEnvironment env, Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data, Microsoft.ML.Runtime.ITrainer trainer, Microsoft.ML.Runtime.Data.RoleMappedData validData, Microsoft.ML.Runtime.Internal.Calibration.ICalibratorTrainer calibrator, int maxCalibrationExamples, bool? cacheData, Microsoft.ML.Runtime.IPredictor inputPredictor) Line 259	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Data.TrainUtils.Train(Microsoft.ML.Runtime.IHostEnvironment env, Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data, Microsoft.ML.Runtime.ITrainer trainer, Microsoft.ML.Runtime.Internal.Calibration.ICalibratorTrainerFactory calibrator, int maxCalibrationExamples) Line 227	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.EntryPoints.LearnerEntryPointsUtils.Train<Microsoft.ML.Runtime.Ensemble.EnsembleTrainer.Arguments, Microsoft.ML.Runtime.EntryPoints.CommonOutputs.BinaryClassificationOutput>(Microsoft.ML.Runtime.IHost host, Microsoft.ML.Runtime.Ensemble.EnsembleTrainer.Arguments input, System.Func<Microsoft.ML.Runtime.ITrainer> createTrainer, System.Func<string> getLabel, System.Func<string> getWeight, System.Func<string> getGroup, System.Func<string> getName, System.Func<System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<Microsoft.ML.Runtime.Data.RoleMappedSchema.ColumnRole, string>>> getCustom, Microsoft.ML.Runtime.Internal.Calibration.ICalibratorTrainerFactory calibrator, int maxCalibrationExamples) Line 189	C#
 	Microsoft.ML.Ensemble.dll!Microsoft.ML.Ensemble.EntryPoints.Ensemble.CreateBinaryEnsemble(Microsoft.ML.Runtime.IHostEnvironment env, Microsoft.ML.Runtime.Ensemble.EnsembleTrainer.Arguments input) Line 24	C#
 	[Native to Managed Transition]	
 	[Managed to Native Transition]	
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.EntryPoints.EntryPointNode.Run() Line 834	C#
 	Microsoft.ML.Data.dll!Microsoft.ML.Runtime.EntryPoints.EntryPointGraph.RunNode(Microsoft.ML.Runtime.EntryPoints.EntryPointNode node) Line 1034	C#
 	Microsoft.ML.Legacy.dll!Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAllNonMacros() Line 68	C#
 	Microsoft.ML.Legacy.dll!Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAll() Line 56	C#

Both tests were blocked in ShuffleTransform.RowCursor.MoveNextCore, waiting for _toConsume.Receive() to return:

while (_liveCount < _poolRows && !_doneConsuming)
{
    // We are under capacity. Try to get some more.
    int got = _toConsume.Receive();
    if (got == 0)

However, there were no background threads running that would be producing anything to consume. I'm not sure where they went or why they weren't running.
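
To illustrate the symptom in isolation, here is a minimal sketch that is not taken from the ML.NET code. It assumes only the public System.Threading.Tasks.Dataflow API; the class name ReceiveHangSketch and the helper TryReceive are hypothetical. Calling Receive() with no timeout on a source that nobody posts to blocks indefinitely, which matches the stack above. The overload that takes a timeout throws instead, which can make this kind of hang easier to diagnose in a test run:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical illustration, not the ShuffleTransform implementation.
public static class ReceiveHangSketch
{
    // Tries to receive one item and reports what happened instead of blocking forever.
    public static string TryReceive(ISourceBlock<int> source, TimeSpan timeout)
    {
        try
        {
            int got = source.Receive(timeout);
            return "received " + got;
        }
        catch (TimeoutException)
        {
            // Nothing was posted within the timeout: the case where a
            // plain Receive() would keep waiting.
            return "timed out";
        }
        catch (InvalidOperationException)
        {
            // The source completed without offering another item.
            return "completed";
        }
    }

    public static void Main()
    {
        // No producer ever posts to this block.
        var empty = new BufferBlock<int>();
        Console.WriteLine(TryReceive(empty, TimeSpan.FromMilliseconds(200)));

        // A producer posted one item before the consumer asked.
        var fed = new BufferBlock<int>();
        fed.Post(3);
        Console.WriteLine(TryReceive(fed, TimeSpan.FromMilliseconds(200)));

        // The producer signalled completion without posting anything.
        var done = new BufferBlock<int>();
        done.Complete();
        Console.WriteLine(TryReceive(done, TimeSpan.FromMilliseconds(200)));
    }
}
```

This only demonstrates the blocking behavior of Receive; it does not explain why the producer threads were missing in the hung tests.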

I've captured a .dmp file, which is ~200 MB, so I can't link it here. Please contact me if you'd like it and I can get it to you.

/cc @TomFinley @Zruty0

eerhardt added a commit that referenced this issue Oct 1, 2018
* Add a workaround for the tests hanging while loading MKL.

The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock.

Workaround #1073

Another deadlock also occurs when running TestAutoInference and TestPipelineSweeper in parallel. Marking these tests to not run in parallel anymore.

Workaround #1095

Moving back to the Azure Hosted VS2017 pool to run the tests now that we've narrowed the deadlocks down.

codemzs commented Jun 30, 2019

We don't have these tests anymore.

@codemzs codemzs closed this as completed Jun 30, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022