[SPARK-6222][Streaming] Dont delete checkpoint data when doing pre-batch-start checkpoint #5008
@harishreedharan Can you take a look?
Test build #28551 has finished for PR 5008 at commit
Test build #28552 has finished for PR 5008 at commit
@tdas - Actually, this fixes one part of the problem, the part caused by starting the checkpoint at the time the job is generated. But this can still cause an issue if you set … I have seen people set …
For the case where concurrentJobs == 1, this works. So let's merge this in while I work on a cleaner approach for the case where concurrentJobs > 1.
@harishreedharan Check out the unit test.
@harishreedharan If you set cleanCheckpointDataLater = true in the clearMetadata function to emulate current master, then this unit test fails. You can also check out this new test suite file in a Spark repo and test it.
@pwendell If @harishreedharan gives an LGTM on this patch and needs it merged urgently for CDH 5.3 (while I am in flight), please merge this.
Test build #28721 has finished for PR 5008 at commit
Test build #627 has finished for PR 5008 at commit
LGTM. Ran the failing test on a real cluster: no data loss anymore when … The unit test looks good too. I ran it against current master and it fails there, so the test works as well.
Just landed in NY. Will merge when I get a chance.
Great!
…tch-start checkpoint

This is an alternative approach to #4964.
I think this is a simpler fix that can be backported easily to other branches (1.2 and 1.3).
All it does is introduce a flag so that the pre-batch-start checkpoint does not trigger the clearing of checkpoint data.
There is no unit test yet. I will add one once this approach has been commented upon. Not sure if this is easily testable.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5008 from tdas/SPARK-6222 and squashes the following commits:

7315bc2 [Tathagata Das] Removed empty line.
c438de4 [Tathagata Das] Revert unnecessary change.
5e98374 [Tathagata Das] Added unit test
50cb60b [Tathagata Das] Fixed style issue
295ca5c [Tathagata Das] Fixing SPARK-6222

(cherry picked from commit 645cf3f)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
Hi @harishreedharan @tdas, can you please help me? I am losing data when I kill the Spark app (driver). Below is my case: I have one JeroMQ publisher … After looking at the above patch, do I need to add any parameters in SparkConf, like for WAL we have …
Hi, I managed to solve this issue. Now I am able to recover all the data, but as discussed in this thread: [Data Duplicate issue](https://www.mail-archive.com/user@spark.apache.org/msg52687.html)
This is an alternative approach to #4964.
I think this is a simpler fix that can be backported easily to other branches (1.2 and 1.3).
All it does is introduce a flag so that the pre-batch-start checkpoint does not trigger the clearing of checkpoint data.
There is no unit test yet. I will add one once this approach has been commented upon. Not sure if this is easily testable.
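
To make the flag idea in this description concrete, here is a minimal toy model in Python. It is only a sketch and not Spark's code: `ToyJobGenerator`, `receive`, `generate_jobs`, `on_batch_completed`, and the parameter name `clear_checkpoint_data_later` are hypothetical stand-ins, and the cleanup rule (drop stored data for batches at or before the checkpointed time) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJobGenerator:
    """Toy stand-in for a streaming job generator (hypothetical, not Spark's class)."""

    checkpoints: list = field(default_factory=list)  # batch times checkpointed so far
    pending_data: set = field(default_factory=set)   # batch times whose data is still stored

    def receive(self, batch_time):
        # Record that data for this batch has been received and stored.
        self.pending_data.add(batch_time)

    def do_checkpoint(self, batch_time, clear_checkpoint_data_later):
        # Always write the checkpoint; only clean up old data when the flag allows it.
        self.checkpoints.append(batch_time)
        if clear_checkpoint_data_later:
            self.clear_checkpoint_data(batch_time)

    def clear_checkpoint_data(self, batch_time):
        # Assumed cleanup rule: drop data for batches at or before batch_time.
        self.pending_data = {t for t in self.pending_data if t > batch_time}

    def generate_jobs(self, batch_time):
        # Pre-batch-start checkpoint: the batch has not run yet, so keep its data.
        self.do_checkpoint(batch_time, clear_checkpoint_data_later=False)

    def on_batch_completed(self, batch_time):
        # Post-batch checkpoint: the batch is done, so cleanup is allowed.
        self.do_checkpoint(batch_time, clear_checkpoint_data_later=True)


gen = ToyJobGenerator()
gen.receive(1)
gen.generate_jobs(1)
print(gen.pending_data)  # {1}
gen.on_batch_completed(1)
print(gen.pending_data)  # set()
```

In this model, the data for batch 1 survives the pre-batch-start checkpoint and is only dropped after the batch completes. Calling `do_checkpoint` with the flag set to `True` from `generate_jobs` would drop the data before the batch had run, which is the kind of loss the flag is meant to avoid.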