
[SPARK-27355][SS] Make query execution more sensitive to epoch message late or lost #24283

Closed · wants to merge 6 commits

Conversation

@uncleGen (Contributor) commented Apr 3, 2019

What changes were proposed in this pull request?

In SPARK-23503, we enforce sequencing of committed epochs for Continuous Execution: if the commit message for epoch n is lost and epoch (n + 1) becomes ready to commit before epoch n does, epoch (n + 1) must wait for epoch n to be committed first. In the extreme case, the query waits for epochBacklogQueueSize epochs (10000 by default) before it finally fails. There is no need to wait that long before failing the query when an epoch message may be late or lost. This PR makes the failure condition more sensitive to late or lost epoch messages.
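
To make the described behavior concrete, here is a minimal, self-contained sketch of the ordering constraint and the backlog limit. The class and method names are invented for illustration; this is not Spark's actual EpochCoordinator code.

```scala
import scala.collection.mutable

// Illustrative sketch only: epochs must be committed strictly in order, so a lost
// commit message for epoch n stalls every later epoch until a backlog limit is reached.
class EpochCommitSketch(epochBacklogQueueSize: Int = 10000) {
  private var lastCommittedEpoch = 0L
  private val epochsWaitingToBeCommitted = mutable.SortedSet.empty[Long]

  // Called once every partition has reported its commit for `epoch`.
  def epochReadyToCommit(epoch: Long): Unit = {
    epochsWaitingToBeCommitted += epoch
    // Drain the queue only while the next expected epoch is present; a gap
    // (e.g. a lost message for epoch n) blocks everything behind it.
    while (epochsWaitingToBeCommitted.contains(lastCommittedEpoch + 1)) {
      lastCommittedEpoch += 1
      epochsWaitingToBeCommitted -= lastCommittedEpoch
    }
    // Pre-PR behavior: the query only fails once the backlog of blocked epochs
    // reaches epochBacklogQueueSize (10000 by default).
    if (epochsWaitingToBeCommitted.size > epochBacklogQueueSize) {
      throw new IllegalStateException(
        s"Epoch backlog exceeded $epochBacklogQueueSize waiting epochs; stopping the query.")
    }
  }
}
```

Under this model, losing the commit for epoch n while later epochs keep arriving means roughly 10000 further epochs pass before the failure surfaces, which is the delay this PR aims to shorten.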

How was this patch tested?

Updated the existing unit tests.

@uncleGen (Contributor, Author) commented Apr 3, 2019

cc @jose-torres

@SparkQA commented Apr 3, 2019

Test build #104248 has finished for PR 24283 at commit 8c71b2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 4, 2019

Test build #104290 has finished for PR 24283 at commit b2281df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi (Contributor) commented Apr 5, 2019

As Attila pointed out, this contains some unnecessary changes, which may have distracted me, but at first glance I don't see the main value.

“There is no need to wait that long before failing the query when an epoch message may be late or lost.”

I'm not sure I understand the main reasoning. If one Kafka partition has a problem because its server is slower than the others, should we just kill the query after 10 epochs? If it's an intermittent problem, I'm not sure that's the right thing to do.

@uncleGen (Contributor, Author) commented Apr 8, 2019

@gaborgsomogyi Thanks for your reply. #23156 introduced a maximum queue threshold before stopping the stream with an error, and it used the same threshold for the different queues, i.e. partitionCommits, partitionOffsets and epochsWaitingToBeCommitted. Generally, partitionCommits and partitionOffsets grow much faster than epochsWaitingToBeCommitted. The stream may fail within 10 epochs if the partition number is 100, but we may wait for 10000 epochs before failure if the partition number is 1 (if I understand correctly), which is a very long time before the query fails. Admittedly, this may just be a harsh boundary condition. The main goal of this PR is to split these two thresholds so that query execution is more sensitive to an epoch message being late or lost. If you feel that 10 epochs is too sensitive for intermittent problems, we can relax the condition to 100 or some other value. A rough sketch of the shared-threshold check is shown below.
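
The sketch below illustrates the shared-threshold point: one limit, epochBacklogQueueSize, applied to three queues that grow at very different rates. The class name, key and value types are placeholders, and the growth-rate comments restate the reasoning above rather than the exact semantics of the real queues.

```scala
import scala.collection.mutable

// Rough sketch (not the actual #23156 code) of a single shared limit across three queues.
class SharedThresholdSketch(epochBacklogQueueSize: Int = 10000) {
  // Assumed to be keyed per (epoch, partition), so they grow with the partition
  // count and fill up quickly when the query has many partitions.
  val partitionOffsets = mutable.Map.empty[(Long, Int), String]
  val partitionCommits = mutable.Map.empty[(Long, Int), Boolean]
  // Grows by one entry per pending epoch, so with a single partition it can
  // take up to 10000 epochs to hit the same limit.
  val epochsWaitingToBeCommitted = mutable.SortedSet.empty[Long]

  def checkBacklogs(): Unit = {
    val backlogs = Seq(
      "partition offset queue" -> partitionOffsets.size,
      "partition commit queue" -> partitionCommits.size,
      "epoch queue" -> epochsWaitingToBeCommitted.size)
    for ((name, size) <- backlogs if size > epochBacklogQueueSize) {
      throw new IllegalStateException(
        s"Size of the $name exceeded the shared limit of $epochBacklogQueueSize; stopping the query.")
    }
  }
}
```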

@gaborgsomogyi (Contributor) commented:
I see the main intention now. I agree that the different queues fill up at different speeds; this was considered when the configuration was added.

I think the fact that the threshold is configurable makes the current implementation flexible enough to handle this situation. From my point of view, additional fine-tuning doesn't really help; it just makes the code more complex to maintain.
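
For reference, the existing threshold can be tuned per query through the SQL configuration. A usage sketch, assuming the key introduced by #23156 is spark.sql.streaming.continuous.epochBacklogQueueSize (verify the exact name against your Spark version):

```scala
import org.apache.spark.sql.SparkSession

// Assumed config key from #23156; check the exact name in your Spark version.
val spark = SparkSession.builder()
  .appName("continuous-processing-backlog-tuning")
  .config("spark.sql.streaming.continuous.epochBacklogQueueSize", "1000")
  .getOrCreate()
```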

@uncleGen (Contributor, Author) commented Apr 8, 2019

@gaborgsomogyi As you said, the queues fill up at different speeds, so we cannot pick a single proper value for this config. If we set it too small, the per-partition queues fill up quickly when the number of partitions is large; if we set it too large, we may wait for many epochs before failure when the number of partitions is small, e.g. 1. So can we merge these two configs and keep only the epoch threshold? Then there is no need to check partitionCommits and partitionOffsets again. At one level this still makes sense: the queues would remain bounded and could not easily consume all the memory.

@SparkQA commented Apr 8, 2019

Test build #104373 has finished for PR 24283 at commit ba4d665.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen (Contributor, Author) commented:
ping @gaborgsomogyi @zsxwing

@uncleGen (Contributor, Author) commented:
@gaborgsomogyi I have unified the two configs. We now only check for late epochs and no longer check epochBacklogQueueSize separately. As discussed above, checking for late epochs is enough to keep memory usage bounded and to avoid waiting a long time before the query fails when an epoch message is late or lost. A sketch of this idea follows.
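
One plausible shape of the unified check is sketched below; the name lateEpochsThreshold, its default, and the exact comparison are hypothetical and not necessarily what this PR implements.

```scala
// Hypothetical sketch of a "late epoch" check: fail as soon as the next epoch to
// commit has fallen too far behind the newest epoch that is ready, independent of
// the partition count. The threshold name and default are invented for illustration.
class LateEpochCheckSketch(lateEpochsThreshold: Int = 10) {
  private var lastCommittedEpoch = 0L

  // Called whenever a later epoch becomes ready while earlier ones are still pending.
  def checkLateEpochs(latestReadyEpoch: Long): Unit = {
    val lateBy = latestReadyEpoch - lastCommittedEpoch
    if (lateBy > lateEpochsThreshold) {
      // Bounding the number of pending epochs also bounds the per-partition queues,
      // so memory stays limited without a separate epochBacklogQueueSize check.
      throw new IllegalStateException(
        s"Epoch ${lastCommittedEpoch + 1} has not been committed after $lateBy newer epochs; " +
          "its commit message may be late or lost. Stopping the query.")
    }
  }
}
```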

@SparkQA commented Apr 23, 2019

Test build #104826 has finished for PR 24283 at commit d2c4296.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 23, 2019

Test build #104828 has finished for PR 24283 at commit 7cfd180.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen (Contributor, Author) commented Apr 25, 2019

ping @gaborgsomogyi and @attilapiros

@SparkQA commented May 7, 2019

Test build #105197 has finished for PR 24283 at commit a489327.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 7, 2019

Test build #105201 has finished for PR 24283 at commit a489327.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@uncleGen (Contributor, Author) commented May 8, 2019

retest this please

@SparkQA commented May 8, 2019

Test build #105243 has finished for PR 24283 at commit a489327.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions (bot) commented:

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

github-actions bot added the Stale label on Dec 31, 2019
github-actions bot closed this on Jan 1, 2020