
[SPARK-24063][SS] Control maximum epoch backlog for ContinuousExecution #21392

Closed · wants to merge 3 commits

Conversation

@spaced4ndy commented May 22, 2018

What changes were proposed in this pull request?

This pull request adds a maxEpochBacklog SQL configuration option. EpochCoordinator tracks whether the length of the queue of waiting epochs has exceeded it; when it has, the stream is stopped with an error indicating that too many epochs have stacked up.
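
For illustration, a conf entry of the kind described here could be declared in SQLConf roughly as below; the key name, default value, and doc text are assumptions made for this sketch, not taken from the PR itself.

// Hypothetical SQLConf entry; the key name and default are illustrative only.
val CONTINUOUS_STREAMING_MAX_EPOCH_BACKLOG =
  buildConf("spark.sql.streaming.continuous.maxEpochBacklog")
    .doc("The maximum number of epochs allowed to queue up waiting to be committed " +
      "before the continuous query is stopped with an error.")
    .intConf
    .createWithDefault(10000)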

How was this patch tested?

Existing unit tests

@spaced4ndy (Author) commented May 22, 2018

@jose-torres Hi Jose, could you take a look at this PR please? I had doubts about how to properly implement the error reporting logic we discussed, and this is what I came up with.
Also, please advise on how I can test these changes. I wrote this several weeks ago, so I may have forgotten something, but if my memory doesn't fail me I was thinking of an approach similar to the tests in ContinuousSuite with custom StreamActions. I wasn't completely sure about the implementation, though. Would that be correct?

val maxBacklogExceeded = epochEndpoint.askSync[Boolean](CheckIfMaxBacklogIsExceeded)
if (maxBacklogExceeded) {
  throw new IllegalStateException(
    "Size of the epochs queue has exceeded maximum allowed epoch backlog.")
@yanlin-Lynn commented May 23, 2018

Throwing an exception will make epochUpdateThread stop working, but won't the application keep running?
I think it's better to block and wait for old epochs to be committed.

Contributor

Agreed that the code as written won't shut down the stream. But I think it does make sense to kill the stream rather than waiting for old epochs. If we end up with a large backlog it's almost surely because some partition isn't making any progress, so I wouldn't expect Spark to ever be able to catch up.

} else {
  logDebug(s"Epoch $epoch has received commits from all partitions " +
    s"and is waiting for epoch ${epoch - 1} to be committed first.")
  epochsWaitingToBeCommitted.add(epoch)
@yanlin-Lynn commented May 23, 2018

Once maxEpochBacklogExceeded is set to true, can it never be set back to false again?

@spaced4ndy (Author)

Based on what I discussed with Jose, the stream should be killed if the backlog exceeds the value of a certain config option, so yes, there's no reason to set it back to false later. At least that's how I see it.

@jose-torres (Contributor) left a comment

The simplest way to test the changes would be to use a TestSparkSession with too few cores, e.g. local[1] with 2 input partitions. Then you can start a stream with something like a 1 ms checkpoint interval and a max backlog of 10 to quickly get a failure. (All epochs will end up in the backlog if there aren't enough cores to schedule one of the partitions, since no global commit will ever succeed.)
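
A sketch of what such a test might look like, assuming it lives next to ContinuousSuite and reuses the Spark 2.4-era test harness (ContinuousSuiteBase, TestSparkSession, testStream with useV2Sink); the config key and error message below mirror this PR's proposal and are not part of released Spark.

// Sketch only: relies on the hypothetical maxEpochBacklog config from this PR.
class ContinuousEpochBacklogSuite extends ContinuousSuiteBase {
  import testImplicits._

  // Only one core, so one of the two input partitions can never be scheduled
  // and no global commit ever succeeds; every epoch lands in the backlog.
  override protected def createSparkSession = new TestSparkSession(
    new SparkContext(
      "local[1]",
      "continuous-stream-test-sql-context",
      sparkConf.set("spark.sql.testkey", "true")))

  test("query fails when the epoch backlog exceeds the configured maximum") {
    withSQLConf(("spark.sql.streaming.continuous.maxEpochBacklog", "10")) {
      val df = spark.readStream
        .format("rate")
        .option("numPartitions", "2")
        .option("rowsPerSecond", "500")
        .load()
        .select('value)

      testStream(df, useV2Sink = true)(
        StartStream(Trigger.Continuous(1)),  // ~1 ms checkpoint interval
        ExpectFailure[IllegalStateException] { e =>
          assert(e.getMessage.contains("exceeded maximum allowed epoch backlog"))
        })
    }
  }
}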

/**
 * Returns a boolean indicating whether the size of the epochs queue has exceeded the
 * maximum epoch backlog.
 */
private[sql] case object CheckIfMaxBacklogIsExceeded extends EpochCoordinatorMessage
Contributor

I'm not sure we need to make a side-channel in the RPC handler for this. I'd try to just make the query fail when the condition is reached in the first place.

@spaced4ndy (Author) commented May 23, 2018

Do you mean making the query fail right from EpochCoordinator? If so, I wanted to do that but didn't figure out how to terminate the query with an exception.
EpochCoordinator takes query: ContinuousExecution as a parameter, but I don't see a suitable method on query; the closest I found is stop(), I guess.
Or am I looking in a completely wrong direction? Please give me a hint.

Contributor

I think we'd probably want to add some method like private[streaming] stopWithException(e) to ContinuousExecution.
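
A rough sketch of what such a method on ContinuousExecution might look like, mirroring the existing stop(); the failureReason field and the exact way the error would be surfaced to the query are assumptions, not the actual implementation.

// Sketch only: follows the same shutdown sequence as ContinuousExecution.stop(),
// but records the cause so the query can surface an error instead of appearing
// to have been stopped cleanly. failureReason is a hypothetical field.
private val failureReason = new java.util.concurrent.atomic.AtomicReference[Throwable]()

private[streaming] def stopWithException(exception: Throwable): Unit = {
  failureReason.set(exception)
  logError(s"Query $prettyIdString received fatal error", exception)
  // Mark the state TERMINATED and interrupt the query execution thread,
  // exactly as stop() does.
  state.set(TERMINATED)
  if (queryExecutionThread.isAlive) {
    sparkSession.sparkContext.cancelJobGroup(runId.toString)
    queryExecutionThread.interrupt()
    queryExecutionThread.join()
  }
}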

@spaced4ndy (Author) commented May 24, 2018

Okay, I thought about something like this but wasn't sure if it's fine to do within the scope of this change. Thanks.

@HyukjinKwon (Member)

Hi all, any update on this PR?

@spaced4ndy (Author)

@HyukjinKwon Hi, I stopped working on it a while ago.

@HyukjinKwon (Member)

In that case, would you mind if we leave this closed for now and reopen it when you start working on it again? I am trying to keep only active PRs open.

@spaced4ndy (Author)

Sure, no problem.

@AmplabJenkins

Can one of the admins verify this patch?
