control-service: data job synchronizer error handling #2742
Conversation
8964d74 to 4f9ef0e
Why
As part of VEP-2272, we need to introduce a process for synchronizing data jobs from the database to Kubernetes. The initial implementation depends on the data_job_deployment table for both read and write operations. Because the deployment process runs asynchronously with unpredictable duration, users may believe that their deployment has completed when it has not. To address this, we are introducing a new table called "desired_data_job_deployment" and renaming the current one to "actual_data_job_deployment". Read operations will read from "actual_data_job_deployment", while write operations will write to "desired_data_job_deployment".
What
We've implemented the new deployment tables and modified the deployment logic to work with them. Note that this is only the second phase of the implementation. The following enhancements are planned for future PRs:
- Annotate the method DataJobsSynchronizer.synchronizeDataJobs() with @Scheduled.
- Integrate improved exception handling.
- Add more tests in subsequent updates.
- Tune the ThreadPool configuration and expose it through application.properties.
Testing Done
Integration tests
Signed-off-by: Miroslav Ivanov miroslavi@vmware.com
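To illustrate the read/write split described above, here is a minimal sketch assuming hypothetical Spring Data repositories and entities (DesiredDataJobDeployment / ActualDataJobDeployment and their repositories); the class names and method signatures are illustrative, not taken from the PR.

// Minimal sketch of the read/write split: writes record only the desired state,
// reads return what is actually deployed. Names are assumptions, not the PR's code.
import java.util.Optional;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Service;

interface DesiredDataJobDeploymentRepository
    extends JpaRepository<DesiredDataJobDeployment, String> {}

interface ActualDataJobDeploymentRepository
    extends JpaRepository<ActualDataJobDeployment, String> {}

@Service
public class DeploymentService {

  private final DesiredDataJobDeploymentRepository desiredRepo;
  private final ActualDataJobDeploymentRepository actualRepo;

  public DeploymentService(
      DesiredDataJobDeploymentRepository desiredRepo,
      ActualDataJobDeploymentRepository actualRepo) {
    this.desiredRepo = desiredRepo;
    this.actualRepo = actualRepo;
  }

  // Write path: user-initiated deployments persist only the desired deployment state.
  public void saveDesiredDeployment(DesiredDataJobDeployment deployment) {
    desiredRepo.save(deployment);
  }

  // Read path: API reads return the deployment state actually applied to Kubernetes.
  public Optional<ActualDataJobDeployment> readActualDeployment(String jobName) {
    return actualRepo.findById(jobName);
  }
}

The asynchronous synchronizer is then the only component that reconciles the two tables, which is what makes the eventual @Scheduled annotation on DataJobsSynchronizer.synchronizeDataJobs() a natural follow-up.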
3094e98 to 5f90aa1
Why
As part of VEP-2272, we need to introduce a process for synchronizing data jobs from the database to Kubernetes. In the event of a data job deployment failure (due to user or platform errors), the current implementation attempts to deploy the data job during every synchronization cycle.
What
We have added a data job deployment status field to the desired_data_job_deployment table. This status is used to determine whether the deployment failed in the previous cycle and should be skipped in the current one.
Testing done
Unit tests
Signed-off-by: Miroslav Ivanov miroslavi@vmware.com
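As a rough illustration of the new status field, here is a hedged JPA sketch; the actual column name, enum values, and entity mapping in the PR may differ.

// Sketch of the status column on the desired deployment entity.
// Assumes an existing DeploymentStatus enum with USER_ERROR and PLATFORM_ERROR
// values; uses javax.persistence (jakarta.persistence on newer Spring Boot).
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EnumType;
import javax.persistence.Enumerated;
import javax.persistence.Id;
import javax.persistence.Table;

@Entity
@Table(name = "desired_data_job_deployment")
public class DesiredDataJobDeployment {

  @Id
  private String dataJobName;

  // Result of the last deployment attempt; the synchronizer skips jobs whose
  // previous deployment failed with a user or platform error.
  @Enumerated(EnumType.STRING)
  @Column(name = "deployment_status") // assumed column name
  private DeploymentStatus status;

  public DeploymentStatus getStatus() {
    return status;
  }

  public void setStatus(DeploymentStatus status) {
    this.status = status;
  }
}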
1c40bb8 to ab3c4b2
LGTM
if (DeploymentStatus.USER_ERROR.equals(desiredJobDeployment.getStatus())
    || DeploymentStatus.PLATFORM_ERROR.equals(desiredJobDeployment.getStatus())) {
  log.debug(
      "Skipping the data job [job_name={}] deployment due to the previously failed deployment"
Is this to prevent re-tries?
Yes
Why
As part of VEP-2272, we need to introduce a process for synchronizing data jobs from the database to Kubernetes. In the event of a data job deployment failure (due to user or platform errors), the current implementation attempts to deploy the data job during every synchronization cycle.
What
We have added a data job deployment status field to the desired_data_job_deployment table. This status is used to determine whether the deployment failed in the previous cycle and should be skipped in the current one. The failed deployment status can be reset by invoking the updateDeployment API, even without a job code change (this will be implemented as part of https://github.com/vmware/versatile-data-kit/pull/2731/files).
This is just the third phase of the implementation; further enhancements are planned for future PRs.
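A hypothetical sketch of the reset described above: when the update-deployment API is invoked, a previously failed status is cleared so that the next synchronization cycle retries the deployment. The method name, the NONE enum value, and desiredRepo (the DesiredDataJobDeploymentRepository from the earlier sketch) are illustrative assumptions, not taken from PR #2731.

// Hypothetical reset on update: clear a failed status so the synchronizer
// retries the deployment in the next cycle.
public void updateDesiredDeployment(DesiredDataJobDeployment desired) {
  if (DeploymentStatus.USER_ERROR.equals(desired.getStatus())
      || DeploymentStatus.PLATFORM_ERROR.equals(desired.getStatus())) {
    desired.setStatus(DeploymentStatus.NONE); // assumed "no result yet" value
  }
  desiredRepo.save(desired);
}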
Testing done
Unit tests
Signed-off-by: Miroslav Ivanov miroslavi@vmware.com