Liveness bug in update? #3244

Closed
rcaudy opened this issue Dec 27, 2022 · 4 comments · Fixed by #3449

Assignee: rcaudy
Labels: bug (Something isn't working), core (Core development tasks), query engine, triage
Milestone: Jan 2023

rcaudy commented Dec 27, 2022

I encountered an identical liveness complaint near-simultaneously in the same nightly run from io.deephaven.engine.table.impl.PartitionedTableTest.testExecutionContext:

```
Received error notification: io.deephaven.engine.liveness.LivenessStateException: SelectOrUpdateListener-Update([__CONSTITUENT__]) failed to add QueryTable-1471810811, because either this manager or referent is no longer live
    at io.deephaven.engine.liveness.LivenessManager.manage(LivenessManager.java:23)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer.maybeManage(SelectColumnLayer.java:448)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer.maybeManageAdds(SelectColumnLayer.java:460)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer.doApplyUpdate(SelectColumnLayer.java:411)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer.lambda$doSerialApplyUpdate$1(SelectColumnLayer.java:240)
    at io.deephaven.engine.util.systemicmarking.SystemicObjectTracker.executeSystemically(SystemicObjectTracker.java:64)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer.doSerialApplyUpdate(SelectColumnLayer.java:239)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer$1.lambda$onAllRequiredColumnsCompleted$1(SelectColumnLayer.java:196)
    at io.deephaven.engine.table.impl.select.analyzers.SelectAndViewAnalyzer$ImmediateJobScheduler.submit(SelectAndViewAnalyzer.java:644)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer$1.onAllRequiredColumnsCompleted(SelectColumnLayer.java:194)
    at io.deephaven.engine.table.impl.select.analyzers.SelectAndViewAnalyzer$SelectLayerCompletionHandler.onLayerCompleted(SelectAndViewAnalyzer.java:485)
    at io.deephaven.engine.table.impl.select.analyzers.BaseLayer.applyUpdate(BaseLayer.java:75)
    at io.deephaven.engine.table.impl.select.analyzers.SelectColumnLayer.applyUpdate(SelectColumnLayer.java:134)
    at io.deephaven.engine.table.impl.SelectOrUpdateListener.onUpdate(SelectOrUpdateListener.java:86)
    at io.deephaven.engine.table.impl.InstrumentedTableUpdateListener$Notification.lambda$run$0(InstrumentedTableUpdateListener.java:37)
    at io.deephaven.engine.table.impl.InstrumentedTableListenerBase$NotificationBase.doRunInternal(InstrumentedTableListenerBase.java:294)
    at io.deephaven.engine.table.impl.InstrumentedTableListenerBase$NotificationBase.doRun(InstrumentedTableListenerBase.java:272)
    at io.deephaven.engine.table.impl.InstrumentedTableUpdateListener$Notification.run(InstrumentedTableUpdateListener.java:37)
    at io.deephaven.engine.updategraph.NotificationQueue$Notification.runInContext(NotificationQueue.java:60)
    at io.deephaven.engine.updategraph.UpdateGraphProcessor.runNotification(UpdateGraphProcessor.java:1298)
    at io.deephaven.engine.updategraph.UpdateGraphProcessor$QueueNotificationProcessor.doWork(UpdateGraphProcessor.java:1459)
    at io.deephaven.engine.updategraph.UpdateGraphProcessor.flushNormalNotificationsAndCompleteCycle(UpdateGraphProcessor.java:1177)
    at io.deephaven.engine.updategraph.UpdateGraphProcessor.flushNotificationsAndCompleteCycle(UpdateGraphProcessor.java:1122)
    at io.deephaven.engine.updategraph.UpdateGraphProcessor.completeCycleForUnitTestsInternal(UpdateGraphProcessor.java:954)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
```

Gradle build scans:

CI run:
https://github.com/rcaudy/deephaven-core/actions/runs/3785209777

Other failing tests in those runs are cascading failures due to the UGP shared lock, which is due to be addressed under #3137.

rcaudy added the bug, query engine, triage, and core labels on Dec 27, 2022
rcaudy added this to the Jan 2023 milestone on Dec 27, 2022
rcaudy self-assigned this on Dec 27, 2022

rcaudy commented Jan 19, 2023

But really, it did happen again last night:
https://github.com/nbauernfeind/deephaven-core/actions/runs/3954249966

rcaudy commented Feb 27, 2023

Some notes from Slack about this bug:

A (lightly edited) excerpt from the nugget:

```java
final PartitionedTable.Proxy proxy = table
        .partitionedAggBy(List.of(Partition.of(name + "Constituent")), true, null, "intCol")
        .proxy(false, false);
final Table result = op.apply(proxy).target().merge().sort("intCol");
return result;
```

The exception is from the update listener implementing the transform in op.apply(proxy). Several partitioned tables and proxies are being “leaked”, but the important pair are the ones output by that transform. Only the transformed proxy’s target’s table is directly an ancestor of the result. If the transformed proxy or its target is GCed before the enclosing liveness scope is released, the result won’t be cleaned up until later, when those references are handled by the liveness cleanup reference processor.
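
For intuition, here is a runnable, engine-free sketch of that reachability hazard. `java.lang.ref.Cleaner` stands in for the CleanupReferenceProcessor, and `Intermediate` stands in for the transformed proxy/PT; none of this is the actual engine code:

```java
import java.lang.ref.Cleaner;

public final class ReachabilityHazard {
    private static final Cleaner CLEANER = Cleaner.create();

    // Stands in for the transformed proxy / partitioned table.
    static final class Intermediate {
        Intermediate() {
            // Stands in for registration with the cleanup reference processor.
            CLEANER.register(this, () -> System.out.println("intermediate cleaned up"));
        }

        String derive() {
            return "result"; // stands in for target().merge().sort("intCol")
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final String result = new Intermediate().derive();
        // The Intermediate is unreachable as soon as derive() returns; a GC that
        // runs here can clean it even though we are still working with 'result'.
        System.gc();
        Thread.sleep(100); // give the Cleaner thread a chance to run (non-deterministic)
        System.out.println(result);
    }
}
```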

I can “fix” the problem the same way we fixed it in updateBy, with a computeEnclosed that keeps the enclosed referents strongly reachable until it releases the scope and returns the result.
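
As a rough illustration of that shape (hypothetical names; this is not the actual updateBy helper or the engine's scope machinery):

```java
import java.lang.ref.Reference;
import java.util.function.Function;
import java.util.function.Supplier;

final class Enclosed {
    // Hypothetical computeEnclosed-style helper: 'compute' returns everything it
    // created (intermediates and result together), and we keep that bundle
    // strongly reachable until after the result has been extracted. The real
    // helper would also open and release a liveness scope around 'compute'.
    static <ALL, RESULT> RESULT computeEnclosed(
            final Supplier<ALL> compute, final Function<ALL, RESULT> extractResult) {
        final ALL everything = compute.get();
        try {
            return extractResult.apply(everything);
        } finally {
            // Forbid the JIT/GC from treating 'everything' as dead before this point.
            Reference.reachabilityFence(everything);
        }
    }
}
```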

That said, a user could manage to do something like this, and the end result would be an unpredictable liveness failure from out of nowhere.

One solution would be to make the transform-result-PT keep a hard ref to the proxy it was created for (feels backwards, though), and for the merge to keep a hard ref to the PT it was merged from. We’d be keeping those extra objects around indefinitely, however.

But keeping them around longer, perversely, helps make it easier to ensure proactive cleanup.

Alternatively, we could tryRetainReference the listener at the start of the cycle, short-circuit if that fails, and dropReference it upon completion.

Meaning, prevent it from being in an ambiguous state during actual transform update processing.
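
A sketch of that guard, using the tryRetainReference/dropReference names from io.deephaven.engine.liveness.LivenessReferent (the surrounding class is hypothetical, not the actual SelectOrUpdateListener change):

```java
import io.deephaven.engine.liveness.LivenessReferent;

// Hypothetical wrapper showing the proposed shape: at the start of the cycle,
// pin the referent (here, the listener itself) so it cannot go dead during
// update processing; if it is already dead, skip the cycle cleanly.
final class GuardedCycle {
    private final LivenessReferent listener;

    GuardedCycle(final LivenessReferent listener) {
        this.listener = listener;
    }

    void runCycle(final Runnable processUpdate) {
        if (!listener.tryRetainReference()) {
            return; // already dead: short-circuit instead of failing mid-cycle
        }
        try {
            processUpdate.run(); // the actual transform update processing
        } finally {
            listener.dropReference(); // balanced on every path
        }
    }
}
```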

I suppose proxies could be in a memoization cache of sorts on the PT, which would address that part. They’re immutable, and of more or less constant direct size.
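
For illustration, a self-contained sketch of that memoization (the flag names follow PartitionedTable.proxy(boolean, boolean), but the cache itself is hypothetical):

```java
// Hypothetical memoization of the (immutable, roughly constant-size) proxies
// on the partitioned table itself: repeated proxy() calls return the same
// strongly-held instance, so a proxy can no longer be GCed out from under an
// in-flight transform. 'P' stands in for the proxy type.
final class ProxyMemo<P> {
    interface ProxyFactory<P> {
        P make(boolean requireMatchingKeys, boolean sanityCheckJoins);
    }

    private final ProxyFactory<P> factory;
    private final P[] cache;

    @SuppressWarnings("unchecked")
    ProxyMemo(final ProxyFactory<P> factory) {
        this.factory = factory;
        this.cache = (P[]) new Object[4]; // one slot per flag combination
    }

    synchronized P proxy(final boolean requireMatchingKeys, final boolean sanityCheckJoins) {
        final int index = (requireMatchingKeys ? 2 : 0) | (sanityCheckJoins ? 1 : 0);
        if (cache[index] == null) {
            cache[index] = factory.make(requireMatchingKeys, sanityCheckJoins);
        }
        return cache[index];
    }
}
```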

And it’s not totally terrible to keep the PT around in the UnionSourceManager, for similar reasons.

I think the listener behavior change is also good, as a “belt and suspenders” measure.

rcaudy commented Feb 27, 2023

I've made the change to SelectOrUpdateListener that I proposed. I've also changed our LivenessManager implementations to more aggressively perform cleanup, rather than always waiting for the CleanupReferenceProcessor to dequeue leaked LivenessReferent instances.
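
The second change amounts to cleaning up on the zero-count transition rather than only when GC notices a leak. A generic, hypothetical sketch of that idea (not the actual LivenessManager code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Generic sketch: the explicit drop that takes the count to zero performs the
// cleanup immediately; the GC-driven CleanupReferenceProcessor path is then
// only a backstop for referents that were truly leaked.
final class EagerRefCounted {
    private final AtomicInteger refCount = new AtomicInteger(1);

    boolean tryRetain() {
        int current;
        do {
            current = refCount.get();
            if (current <= 0) {
                return false; // already dead
            }
        } while (!refCount.compareAndSet(current, current + 1));
        return true;
    }

    void drop() {
        if (refCount.decrementAndGet() == 0) {
            cleanup(); // eager: no waiting for a reference queue to drain
        }
    }

    private void cleanup() {
        // release managed referents, close resources, etc.
    }
}
```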
