On an Azure Standard_DS2_v2 machine (the same machine size that the Hosted VS2017 pool in Azure DevOps uses), run the Microsoft.ML.Predictor.Tests tests in a loop for a while (it took me 3 runs).
Sometimes the tests hang indefinitely.
I was able to attach a debugger when this happened, and 2 tests were running:
TestPipelineSweeper.PipelineSweeperRocketEngine
TestAutoInference.TestLearnerConstrainingByName
Both tests were blocked in the same call stack:
System.Private.CoreLib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 635 C#
System.Private.CoreLib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 2978 C#
System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalWaitCore(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) Line 2917 C#
System.Private.CoreLib.dll!System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task task) Line 146 C#
System.Threading.Tasks.Dataflow.dll!System.Threading.Tasks.Dataflow.DataflowBlock.Receive<int>(System.Threading.Tasks.Dataflow.ISourceBlock<int> source, System.TimeSpan timeout, System.Threading.CancellationToken cancellationToken) Line 982 C#
System.Threading.Tasks.Dataflow.dll!System.Threading.Tasks.Dataflow.DataflowBlock.Receive<int>(System.Threading.Tasks.Dataflow.ISourceBlock<int> source) Line 888 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Data.ShuffleTransform.RowCursor.MoveNextCore() Line 649 C#
Microsoft.ML.Core.dll!Microsoft.ML.Runtime.Data.RootCursorBase.MoveNext() Line 70 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Training.TrainingCursorBase.MoveNext() Line 492 C#
Microsoft.ML.StandardLearners.dll!Microsoft.ML.Runtime.Learners.OnlineLinearTrainer<Microsoft.ML.Runtime.Data.BinaryPredictionTransformer<Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>, Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>.TrainCore(Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data) Line 188 C#
Microsoft.ML.StandardLearners.dll!Microsoft.ML.Runtime.Learners.OnlineLinearTrainer<Microsoft.ML.Runtime.Data.BinaryPredictionTransformer<Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>, Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>.TrainModelCore(Microsoft.ML.Runtime.TrainContext context) Line 135 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Training.TrainerEstimatorBase<Microsoft.ML.Runtime.Data.BinaryPredictionTransformer<Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>, Microsoft.ML.Runtime.Learners.LinearBinaryPredictor>.Train(Microsoft.ML.Runtime.TrainContext context) Line 89 C#
Microsoft.ML.Core.dll!Microsoft.ML.Runtime.TrainerExtensions.Train<Microsoft.ML.Runtime.IPredictorProducing<float>>(Microsoft.ML.Runtime.ITrainer<Microsoft.ML.Runtime.IPredictorProducing<float>> trainer, Microsoft.ML.Runtime.Data.RoleMappedData trainData) Line 95 C#
Microsoft.ML.Ensemble.dll!Microsoft.ML.Runtime.Ensemble.EnsembleTrainerBase<float, Microsoft.ML.Runtime.IPredictorProducing<float>, Microsoft.ML.Runtime.Ensemble.Selector.IBinarySubModelSelector, Microsoft.ML.Runtime.Ensemble.OutputCombiners.IBinaryOutputCombiner>.TrainCore.AnonymousMethod__0(Microsoft.ML.Runtime.Ensemble.Subset subset, System.Threading.Tasks.ParallelLoopState state, long index) Line 153 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.PartitionerForEachWorker.AnonymousMethod__1(ref System.Collections.IEnumerator partitionState, int timeout, out bool replicationDelegateYieldedBeforeCompletion) Line 3224 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica<System.__Canon>.ExecuteAction(out bool yieldedBeforeCompletion) Line 124 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica.Execute() Line 80 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Replica..ctor.AnonymousMethod__4_0(object s) Line 40 C#
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 167 C#
System.Private.CoreLib.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(ref System.Threading.Tasks.Task currentTaskSlot) Line 2440 C#
System.Private.CoreLib.dll!System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task task, bool taskWasPreviouslyQueued) Line 75 C#
System.Private.CoreLib.dll!System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task task, bool taskWasPreviouslyQueued) Line 209 C#
System.Private.CoreLib.dll!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler scheduler, bool waitForCompletion) Line 1126 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.TaskReplicator.Run<System.Collections.IEnumerator>(System.Threading.Tasks.TaskReplicator.ReplicatableUserAction<System.Collections.IEnumerator> action, System.Threading.Tasks.ParallelOptions options, bool stopOnFirstFailure) Line 138 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.PartitionerForEachWorker<Microsoft.ML.Runtime.Ensemble.Subset, object>(System.Collections.Concurrent.Partitioner<Microsoft.ML.Runtime.Ensemble.Subset> source, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<Microsoft.ML.Runtime.Ensemble.Subset> simpleBody, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState> bodyWithState, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long> bodyWithStateAndIndex, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, object, object> bodyWithStateAndLocal, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long, object, object> bodyWithEverything, System.Func<object> localInit, System.Action<object> localFinally) Line 3157 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.ForEachWorker<Microsoft.ML.Runtime.Ensemble.Subset, object>(System.Collections.Generic.IEnumerable<Microsoft.ML.Runtime.Ensemble.Subset> source, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<Microsoft.ML.Runtime.Ensemble.Subset> body, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState> bodyWithState, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long> bodyWithStateAndIndex, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, object, object> bodyWithStateAndLocal, System.Func<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long, object, object> bodyWithEverything, System.Func<object> localInit, System.Action<object> localFinally) Line 2139 C#
System.Threading.Tasks.Parallel.dll!System.Threading.Tasks.Parallel.ForEach<Microsoft.ML.Runtime.Ensemble.Subset>(System.Collections.Generic.IEnumerable<Microsoft.ML.Runtime.Ensemble.Subset> source, System.Threading.Tasks.ParallelOptions parallelOptions, System.Action<Microsoft.ML.Runtime.Ensemble.Subset, System.Threading.Tasks.ParallelLoopState, long> body) Line 1776 C#
Microsoft.ML.Ensemble.dll!Microsoft.ML.Runtime.Ensemble.EnsembleTrainerBase<float, Microsoft.ML.Runtime.IPredictorProducing<float>, Microsoft.ML.Runtime.Ensemble.Selector.IBinarySubModelSelector, Microsoft.ML.Runtime.Ensemble.OutputCombiners.IBinaryOutputCombiner>.TrainCore(Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data) Line 143 C#
Microsoft.ML.Ensemble.dll!Microsoft.ML.Runtime.Ensemble.EnsembleTrainerBase<float, Microsoft.ML.Runtime.IPredictorProducing<float>, Microsoft.ML.Runtime.Ensemble.Selector.IBinarySubModelSelector, Microsoft.ML.Runtime.Ensemble.OutputCombiners.IBinaryOutputCombiner>.Train(Microsoft.ML.Runtime.TrainContext context) Line 111 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Training.TrainerBase<Microsoft.ML.Runtime.IPredictorProducing<float>>.Microsoft.ML.Runtime.ITrainer.Train(Microsoft.ML.Runtime.TrainContext context) Line 31 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Data.TrainUtils.TrainCore(Microsoft.ML.Runtime.IHostEnvironment env, Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data, Microsoft.ML.Runtime.ITrainer trainer, Microsoft.ML.Runtime.Data.RoleMappedData validData, Microsoft.ML.Runtime.Internal.Calibration.ICalibratorTrainer calibrator, int maxCalibrationExamples, bool? cacheData, Microsoft.ML.Runtime.IPredictor inputPredictor) Line 259 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.Data.TrainUtils.Train(Microsoft.ML.Runtime.IHostEnvironment env, Microsoft.ML.Runtime.IChannel ch, Microsoft.ML.Runtime.Data.RoleMappedData data, Microsoft.ML.Runtime.ITrainer trainer, Microsoft.ML.Runtime.Internal.Calibration.ICalibratorTrainerFactory calibrator, int maxCalibrationExamples) Line 227 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.EntryPoints.LearnerEntryPointsUtils.Train<Microsoft.ML.Runtime.Ensemble.EnsembleTrainer.Arguments, Microsoft.ML.Runtime.EntryPoints.CommonOutputs.BinaryClassificationOutput>(Microsoft.ML.Runtime.IHost host, Microsoft.ML.Runtime.Ensemble.EnsembleTrainer.Arguments input, System.Func<Microsoft.ML.Runtime.ITrainer> createTrainer, System.Func<string> getLabel, System.Func<string> getWeight, System.Func<string> getGroup, System.Func<string> getName, System.Func<System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<Microsoft.ML.Runtime.Data.RoleMappedSchema.ColumnRole, string>>> getCustom, Microsoft.ML.Runtime.Internal.Calibration.ICalibratorTrainerFactory calibrator, int maxCalibrationExamples) Line 189 C#
Microsoft.ML.Ensemble.dll!Microsoft.ML.Ensemble.EntryPoints.Ensemble.CreateBinaryEnsemble(Microsoft.ML.Runtime.IHostEnvironment env, Microsoft.ML.Runtime.Ensemble.EnsembleTrainer.Arguments input) Line 24 C#
[Native to Managed Transition]
[Managed to Native Transition]
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.EntryPoints.EntryPointNode.Run() Line 834 C#
Microsoft.ML.Data.dll!Microsoft.ML.Runtime.EntryPoints.EntryPointGraph.RunNode(Microsoft.ML.Runtime.EntryPoints.EntryPointNode node) Line 1034 C#
Microsoft.ML.Legacy.dll!Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAllNonMacros() Line 68 C#
Microsoft.ML.Legacy.dll!Microsoft.ML.Runtime.EntryPoints.JsonUtils.GraphRunner.RunAll() Line 56 C#
Both tests were blocked in ShuffleTransform.RowCursor.MoveNextCore, waiting for the _toConsume.Receive() call to return.
However, there were no background threads running that would be producing anything to consume. I'm not sure where they went or why they weren't running.
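For anyone trying to narrow this down, here is a minimal sketch of the same blocking pattern outside ML.NET. It assumes only the public System.Threading.Tasks.Dataflow API; the ReceiveProbe class and its method names are hypothetical and are not part of ShuffleTransform. It shows that a Receive call on a block with no live producer never returns by itself, while the overload that takes a timeout turns the hang into a TimeoutException that a test could report.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical diagnostic helper; not part of ML.NET or ShuffleTransform.
public static class ReceiveProbe
{
    // Returns true with the received item if one arrives within the timeout.
    // Returns false if the wait times out, or if the source has completed
    // without producing an item.
    public static bool TryReceiveWithin(ISourceBlock<int> source, TimeSpan timeout, out int item)
    {
        try
        {
            // Bounded counterpart of the parameterless Receive() seen in the call stack.
            item = source.Receive(timeout);
            return true;
        }
        catch (TimeoutException)
        {
            // Nothing was posted in time: the situation seen in the hung tests.
            item = default;
            return false;
        }
        catch (InvalidOperationException)
        {
            // The source completed and can never produce an item.
            item = default;
            return false;
        }
    }

    // Exercises both outcomes against a BufferBlock standing in for _toConsume.
    public static bool Demo()
    {
        var toConsume = new BufferBlock<int>();

        // No producer has posted anything, so a plain Receive() would block forever here.
        bool timedOut = !TryReceiveWithin(toConsume, TimeSpan.FromMilliseconds(200), out _);

        // Once a producer posts an item, the same call returns it.
        toConsume.Post(42);
        bool received = TryReceiveWithin(toConsume, TimeSpan.FromMilliseconds(200), out int value)
                        && value == 42;

        return timedOut && received;
    }
}
```

This does not explain why the producer threads are missing. It only illustrates why the consumer side never wakes up once they are gone, and how a bounded wait would surface that as a test failure instead of a hang.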
I've captured a .dmp file, which is ~200 MB, so I can't link it here. Please contact me if you'd like it and I can get it to you.
* Add a workaround for the tests hanging while loading MKL.
The workaround is to ensure the MKL library is loaded very early in the test process, so it doesn't cause the deadlock.
Workaround #1073
Another deadlock also occurs when running TestAutoInference and TestPipelineSweeper in parallel. Marking these tests to not run in parallel anymore.
Workaround #1095
Moving back to the Azure Hosted VS2017 pool to run the tests now that we've narrowed the deadlocks down.
The blocking _toConsume.Receive() call in ShuffleTransform.RowCursor.MoveNextCore is at machinelearning/src/Microsoft.ML.Data/Transforms/ShuffleTransform.cs, lines 646 to 650 in a02807c.
/cc @TomFinley @Zruty0