WX-1595 GCP Batch backend refactor to include the PAPI request manager #7412
Conversation
BatchApiRequestWorkerSpec works! The goal is to port the PAPIv2 request manager behavior into GCP Batch. We just need to rename the methods to remove the PAPI names.
Now, we just need to fix the runtime errors, wiring the correct messages, etc.
The queue must be cleared after executing the requests.
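The behavior described here, dispatching the queued requests to a worker and then clearing the queue so the same requests are not re-sent on the next poll, can be sketched with a minimal immutable queue. The names below are illustrative, not the actual Cromwell types:

```scala
// Hypothetical sketch: a request manager hands up to `n` queued requests
// to a worker, and the returned state has those requests removed, so a
// later poll cannot dispatch them twice.
final case class QueueState(queue: Vector[String]) {
  def enqueue(r: String): QueueState = copy(queue = queue :+ r)

  // Take up to `n` requests for a worker and clear them from the queue.
  def dispatch(n: Int): (Vector[String], QueueState) =
    (queue.take(n), copy(queue = queue.drop(n)))
}
```

Returning a new state instead of mutating in place makes "the queue must be cleared" a property of the dispatch call itself rather than a separate cleanup step that can be forgotten.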
The abort request handler was returning the wrong message; also, keep the old behavior where an abort request is handled only once.
This ports many details from PAPIv2; work is still pending on the tests. NOTE: This could break the current integration.
I hope to circle back to this soon, but in the meantime it looks like we picked up some compilation issues from merging all the other PRs.
I should be able to solve these before you wake up, thanks!
I have resolved the problems; it is also worth executing the new tests from #7440.
I need to put this down for the moment to finish #7439, which is currently affecting users; I hope to get back to it tomorrow morning.
This allows us to handle the abort result instead of blindly marking the job as aborted.
Good news, I was able to fix this. It turns out that when this flag is true, Cromwell blindly marks the job as aborted; now Cromwell waits until the abort request is executed and the job can no longer be retrieved from GCP.
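A rough sketch of that abort flow: instead of marking the job aborted as soon as the abort request is sent, poll until GCP can no longer return the job. Both `awaitAborted` and `fetchJob` are hypothetical names, assuming a lookup that returns `None` once the job is gone:

```scala
// Hypothetical sketch of the described abort flow. `fetchJob` stands in
// for a GCP lookup; once it returns None, the abort has taken effect and
// only then should the job be marked as aborted.
def awaitAborted(fetchJob: () => Option[String], maxPolls: Int): Boolean = {
  var polls = 0
  while (polls < maxPolls) {
    if (fetchJob().isEmpty) return true // job gone: abort is effective
    polls += 1
  }
  false // still retrievable after maxPolls: do not mark as aborted yet
}
```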
Awesome, will have time to look today!
Now, we look for the submit requests before sending an abort request, canceling jobs that were not submitted to GCP.
Turns out that this is now unnecessary because we are already mapping query errors to a RunStatus.
When aborting an individual job, only that job is aborted instead of all the jobs from that workflow.
In theory, this solves #7407
case GcpBatchBackendSingletonActor.Event.JobSubmitted(job) =>
  log.info(s"Job submitted to GCP: ${job.getName}")
case job: StandardAsyncJob =>
  log.info(s"A job was submitted successfully: ${job.jobId}")
I have no problem merging as-is with these logs in place, seeing as we'll probably have some more rounds of debugging. That said, we'll probably want to reduce the info-level ones eventually.
For the time being, I have marked the noisy logs as debug and kept the rest as info; they will help with debugging some of the currently open issues.
}
// If error code 10, add some extra messaging to the server logging
else if (runStatus.errorCode.getCode.value() == BatchMysteriouslyCrashedErrorCode) {
  jobLogger.info(s"Job Failed with Error Code 10 for a machine where Preemptible is set to $preemptible")
Does Batch use the same range of error codes? The enum com.google.cloud.batch.v1.JobStatus.State only goes up to 6, it looks like.
@@ -1179,7 +1383,7 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
    }
  }

- // No need for Cromwell-performed localization in the PAPI backend, ad hoc values are localized directly from GCS to the VM by PAPI.
+ // No need for Cromwell-performed localization in the Batch backend, ad hoc values are localized directly from GCS to the VM by Batch.
I'm so excited about this 🚀
Be aware that I did not do anything here; it's the code/comment ported from PAPI.
@@ -1138,6 +1363,9 @@ class GcpBatchAsyncBackendJobExecutionActor(override val standardParams: Standar
  batchOutputs collectFirst {
    case batchOutput if batchOutput.name == makeSafeReferenceName(path) =>
      val pathAsString = batchOutput.cloudPath.pathAsString

+     // TODO: batchOutput.cloudPath.exists invokes GCP, which causes a test ported from papi-common to fail
+     // because GCP is not configured in tests, shall we do anything?
      if (batchOutput.isFileParameter && !batchOutput.cloudPath.exists) {
Let's leave it as a TODO for now. It seems awkward to use a throw to signal "it's OK to do nothing".
case apiQuery: BatchApiRequest =>
  log.debug("Forwarding API query to Batch request manager actor")
  jesApiQueryManager.forward(apiQuery)
Legacy naming: not a super high priority, but "JES" stands for "Job Execution Service" and is an ancestor of Batch.
JES -> Pipelines API v1 -> Pipelines API v2alpha -> Pipelines API v2beta -> Genomics -> Life Sciences -> Batch
I have resolved this.
import scala.util.Try

// Mirrors com.google.api.client.googleapis.batch.BatchRequest but this is immutable
class GcpBatchGroupedRequests(requests: List[(BatchApiRequest, Promise[Try[BatchApiResponse]])]) {
Is there still a value in grouping them in that case, or should we just fire them off as they come in?
Turns out that this is not the correct fix.
- Switch the noisy logs to debug level.
- Remove status codes ported from PAPI because they do not have any usefulness in Batch.
- Remove all the test cases involving the PAPI codes.
- Clean RunStatus from the unused args.
- Rename JES occurrences to Batch.
Due to the activity noise, the comments are hidden; I'll post here for better visibility.
Originally, this was created because we hoped that Google had an alternative to batched requests; by now, Google has confirmed that there is no way to do that. These are some notes from our internal discussions:
Given that the current code has been tested so many times, my suggestion is to keep the grouping and potentially remove it in another iteration.
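The grouping that is being kept can be sketched as an immutable container along the lines of the GcpBatchGroupedRequests class discussed above: each enqueue returns a new instance plus a promise for that request's eventual response. The request/response type parameters below are simplified stand-ins, not the actual BatchApiRequest/BatchApiResponse types:

```scala
import scala.concurrent.Promise
import scala.util.Try

// Minimal sketch of the immutable grouping idea: enqueue never mutates
// shared state; it returns a new group plus a Promise that the executor
// can complete once the grouped call for this request resolves.
final class GroupedRequests[Req, Resp] private (
  val entries: List[(Req, Promise[Try[Resp]])]
) {
  def enqueue(r: Req): (GroupedRequests[Req, Resp], Promise[Try[Resp]]) = {
    val p = Promise[Try[Resp]]()
    (new GroupedRequests(entries :+ ((r, p))), p)
  }
  def size: Int = entries.size
}

object GroupedRequests {
  def empty[Req, Resp]: GroupedRequests[Req, Resp] = new GroupedRequests(Nil)
}
```

Immutability here sidesteps the shared-mutable-state issues of com.google.api.client.googleapis.batch.BatchRequest when the group is built up across actor messages.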
Google has confirmed that there are more error codes than what the gRPC response provides; still, these can be found in the job events, hence they need to be parsed from the strings (PAPI does something similar). This has not been done in this PR, which is why I have removed a lot of code that is not necessary. In a following PR, we should implement part of this in order to handle preemption errors. See https://cloud.google.com/batch/docs/troubleshooting#reserved-exit-codes Thanks.
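For the follow-up PR, extracting a reserved exit code from a job-event string could look roughly like this. The "exit code NNNNN" message format is an assumption for illustration; the real event wording would need to be checked against the troubleshooting page linked above:

```scala
// Hypothetical sketch: pull a Batch reserved exit code out of a job event
// description, since the gRPC status alone does not carry it. The regex
// assumes the event text contains "exit code <number>".
val ExitCodePattern = """exit code (\d+)""".r.unanchored

def exitCodeFromEvent(description: String): Option[Int] =
  description match {
    case ExitCodePattern(code) => Some(code.toInt)
    case _                     => None
  }
```

A preemption handler could then match the parsed code against the reserved range (e.g. the code Batch documents for Spot VM preemption) instead of string-matching the whole event.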
WARNING: This PR is huge and needs to be reviewed carefully; we have already performed many manual tests and ported many other tests from PAPI.
Intro
The main goal is to refactor the Batch backend to include the PipelinesApiRequestManager and PipelinesApiRequestWorker behavior.
This also fixes a few missing details from the initial Batch integration (#7177), for example:
I have been trying to split this into multiple smaller PRs; please let me know if you find any piece that can be submitted independently. Previous PRs:
Questions (already resolved)
Questions
- Should this be false? This is set by PAPI, but it causes a centaur test to fail.
- request-workers also increases the worker's delay to pull work. For example, setting this value to 100 or above would cause the delay to become ~18m, which seems excessive (see BatchApiRequestManager.scala). Putting an upper limit on the delay seems worth it, any thoughts?

Execution events details
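The delay concern can be illustrated with a capped scaling function. The linear scaling formula below is an assumption purely for illustration, not the actual BatchApiRequestManager logic:

```scala
import scala.concurrent.duration._

// Sketch of the concern raised above: if the pull-work delay grows with
// the request-workers count, large counts produce very long delays, so an
// upper limit keeps them bounded. Base delay and cap are illustrative.
def pollDelay(requestWorkers: Int,
              base: FiniteDuration = 10.seconds,
              cap: FiniteDuration = 1.minute): FiniteDuration = {
  val scaled = base * requestWorkers.toLong
  if (scaled > cap) cap else scaled
}
```

With a cap in place, raising request-workers past the break-even point no longer inflates the delay toward the ~18m pathological case.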
What GCP provides:
What we define as execution events:
Load test results
We have executed many load tests, this is the latest one involving 14k jobs.
Overall, all our tests indicate that Batch finishes executing the jobs faster than PAPIv2.
Load tests settings
We ran Cromwell in server mode with the following settings:
JVM Options:
-Xms512m -Xmx64g
NOTE: Initially we found a bottleneck on Batch, but Google enabled an experimental setting to schedule many jobs concurrently, which reduced the total execution time.
Server capacity (from Google Cloud):