[SPARK-27355][SS] Make query execution more sensitive to epoch message late or lost #24283
Conversation
cc @jose-torres
Test build #104248 has finished for PR 24283 at commit
Review threads (outdated, resolved) on:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
...re/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochCoordinator.scala (two threads)
Test build #104290 has finished for PR 24283 at commit
As Attila pointed out, it contains unnecessary changes, which may have distracted me, but at first glance I don't see the main value.
Not sure I understand the main reasoning. If one Kafka partition has a problem because that specific server is slower than the others, should we just kill the query after 10 epochs? If it's an intermittent problem, I'm not sure that's the right thing to do.
@gaborgsomogyi Thanks for your reply. #23156 introduced a maximum queue threshold before stopping the stream with an error. In #23156, we used the same threshold for different queues, i.e.
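For context, a minimal sketch of the kind of guard #23156 describes, with hypothetical names (`EpochBacklogGuard`, `stopQuery`); this is not the actual EpochCoordinator code. Every internal queue is checked against the same size threshold, which is what makes a single value hard to choose:

```scala
import scala.collection.mutable

// Hypothetical guard: one shared threshold applied to several queues
// that fill at very different rates.
class EpochBacklogGuard(maxQueueSize: Int, stopQuery: String => Unit) {
  private val partitionOffsets = mutable.Queue[Long]()
  private val partitionCommits = mutable.Queue[Long]()
  private val epochsWaiting = mutable.Queue[Long]()

  private def checkBoundaries(): Unit = {
    // The partition-level queues grow roughly with (#partitions x #epochs),
    // while the epoch queue grows with #epochs only, yet all three share
    // the same bound.
    if (partitionOffsets.size > maxQueueSize ||
        partitionCommits.size > maxQueueSize ||
        epochsWaiting.size > maxQueueSize) {
      stopQuery(s"Backlog exceeded $maxQueueSize; stopping the stream.")
    }
  }

  def onPartitionOffset(epoch: Long): Unit = { partitionOffsets += epoch; checkBoundaries() }
  def onPartitionCommit(epoch: Long): Unit = { partitionCommits += epoch; checkBoundaries() }
  def onEpochWaiting(epoch: Long): Unit = { epochsWaiting += epoch; checkBoundaries() }
}
```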
I see the main intention now. I agree that the different queues fill up at different speeds, and this was considered when the configuration was added. The threshold is configurable, which I think makes the current implementation flexible enough to handle this situation. From my point of view, additional fine-tuning doesn't really help but makes this code more complex, and that complexity has to be maintained.
@gaborgsomogyi As you said, different queues fill up at different speeds, so we cannot set one proper value for this config. If we set it too small, the queue fills up quickly when the number of partitions is large enough. But if we set it too large, we may wait for many epochs before failing when the partition count is small, like 1. So can we merge these two configs and keep just the epoch threshold? Then there is no need to check
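To illustrate the proposal, a minimal sketch with hypothetical names (`EpochLagGuard`, `maxEpochBacklog`, `stopQuery`): the check depends only on how far the newest epoch runs ahead of the last committed one, so it is independent of the partition count:

```scala
// Hypothetical unified check: fail as soon as the epoch gap grows past
// the threshold, instead of sizing per-partition queues.
class EpochLagGuard(maxEpochBacklog: Int, stopQuery: String => Unit) {
  private var lastCommittedEpoch: Long = 0L

  def onEpochReady(epoch: Long): Unit = {
    // An epoch can only commit after all earlier epochs; if the gap grows
    // past the threshold, some earlier epoch message is late or lost.
    if (epoch - lastCommittedEpoch > maxEpochBacklog) {
      stopQuery(s"Epoch $epoch is more than $maxEpochBacklog epochs ahead of " +
        s"last committed epoch $lastCommittedEpoch; an epoch message may be late or lost.")
    }
  }

  def onEpochCommitted(epoch: Long): Unit = {
    lastCommittedEpoch = math.max(lastCommittedEpoch, epoch)
  }
}
```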
Test build #104373 has finished for PR 24283 at commit
ping @gaborgsomogyi @zsxwing
@gaborgsomogyi I have unified the two configs. We will only check late epochs, but not check
Test build #104826 has finished for PR 24283 at commit
Test build #104828 has finished for PR 24283 at commit
ping @gaborgsomogyi and @attilapiros
Test build #105197 has finished for PR 24283 at commit
Test build #105201 has finished for PR 24283 at commit
retest this please
Test build #105243 has finished for PR 24283 at commit
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
In SPARK-23503, we enforced sequencing of committed epochs for Continuous Execution. If the message for epoch n is late or lost and epoch (n + 1) becomes ready for commit first, epoch (n + 1) must wait for epoch n to be committed. In the extreme case, the query waits for `epochBacklogQueueSize` (10000 by default) epochs before failing. There is no need to wait that long before failing the query when an epoch message may be late or lost, so this PR makes the failure condition more sensitive (see the sketch below).
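A minimal sketch of the sequencing behavior described above, with hypothetical names (`SequencedCommitter`, `commitEpoch`); it is not the actual EpochCoordinator code. Epoch n + 1, even if fully ready, is buffered until epoch n commits, so a single late or lost message for epoch n stalls everything after it:

```scala
import scala.collection.mutable

// Hypothetical committer that drains ready epochs strictly in order.
class SequencedCommitter(commitEpoch: Long => Unit) {
  private var lastCommitted: Long = 0L
  private val waiting = mutable.SortedSet[Long]()

  def onEpochReady(epoch: Long): Unit = {
    waiting += epoch
    // Commit only the contiguous prefix: epoch n + 1 never commits before n.
    while (waiting.nonEmpty && waiting.head == lastCommitted + 1) {
      val next = waiting.head
      waiting -= next
      commitEpoch(next)
      lastCommitted = next
    }
  }
}
```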
How was this patch tested?
Updated existing unit tests.