-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] Support job logs that are not UTF-8 compatible #12666
Conversation
An additional detail that I fixed in this change is that we no longer need to load the log file into memory on the worker in order to send it to the batch pod (by using
but these felt like bigger changes that would've needed to touch more of the code so I left those out for now. |
batch/batch/front_end/front_end.py
Outdated
job_log_strings: Dict[str, Optional[str]] = {} | ||
for container, log in job_log_bytes.items(): | ||
try: | ||
job_log_strings[container] = log.decode('utf-8') if log else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check that log is not None here. I'm worried about the empty string case.
batch/batch/front_end/front_end.py
Outdated
job_log_strings_or_bytes = {} | ||
for container, log in job_log_bytes.items(): | ||
try: | ||
job_log_strings_or_bytes[container] = log.decode('utf-8') if log else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same issue with not checking if log is None here for empty strings.
@@ -2152,17 +2198,20 @@ async def ui_get_job(request, userdata, batch_id): | |||
return await render_template('batch', request, userdata, 'job.html', page_context) | |||
|
|||
|
|||
# This should really be the exact same thing as the REST endpoint | |||
@routes.get('/batches/{batch_id}/jobs/{job_id}/log/{container}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this endpoint is even used at all. You can probably just delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is used for downloading logs from the UI. I meant to refactor this and remove the comment. I'll do that now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is used for downloading logs from the UI. I meant to refactor this and remove the comment. I'll do that now
batch/batch/worker/worker.py
Outdated
@@ -2265,7 +2247,8 @@ async def create_container_and_connect( | |||
except (FileNotFoundError, ConnectionRefusedError) as err: | |||
attempts += 1 | |||
if attempts == 240: | |||
jvm_output = await container.container.get_log() or '' | |||
# NOTE we assume that the JVM logs hail emits will be valid utf-8 | |||
jvm_output = (await fs.read(container.container.log_path)).decode('utf-8') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we guarantee there is always a file existing at container.container.log_path
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm maybe not here, I'll protect against this
@@ -1202,13 +1202,6 @@ def container_is_running(self): | |||
def container_finished(self): | |||
return self.process is not None and self.process.returncode is not None | |||
|
|||
async def get_log(self, offset: Optional[int] = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we never use the offset anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this was dead code
'GET', | ||
f'http://{ip_address}:5000/api/v1alpha/batches/{batch_id}/jobs/{job_id}/log/{container}', | ||
) as resp: | ||
return await resp.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand, the FileResponse here doesn't deserve special treatment as it's just a special case of a regular Response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it subclasses Response
. The client can't tell, but it's a way for the server to stream bytes directly from an open file handle instead of loading the whole thing into memory on the server and then sending those bytes.
@@ -323,6 +329,10 @@ async def wait(self): | |||
if i < 64: | |||
i = i + 1 | |||
|
|||
async def container_log(self, container_name: str) -> bytes: | |||
async with await self._batch._client._get(f'/api/v1alpha/batches/{self.batch_id}/jobs/{self.job_id}/log/{container_name}') as resp: | |||
return await resp.read() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be a FileResponse as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client can't tell whether or not it's a FileResponse
.
Just so I understand correctly (and sorry if this is obvious), the current job logs interface is still the same. But if you want a container's logs, then you'll get bytes which the user will have to decode themselves. How does that affect the file download button in the UI and the hailctl batch logs functionality you have? Will you see text or a random byte string? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see question
Good question. For the download button, the file you download is still a normal text file and I've confirmed that I can download and view a log file as I would expect. For the |
Ok. Last question -- do we need to do something for the aioclient for the new |
I would leave it up to the caller of the |
batch/batch/front_end/front_end.py
Outdated
return web.json_response(job_log_strings) | ||
|
||
|
||
async def get_job_container_log(request, userdata, batch_id): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The checks are failing because userdata
is not used. I think we don't need to use it because the decorators on the actual routes does the user validation check before it gets to this point. Would be great if you could double check that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup you're right it's used by the decorator and unused in the function I added.
The crux of this PR is the 9 line change in
file_store.py
-- we shouldn't decode job logs tostr
before uploading because we can't trust jobs to output only valid UTF-8 to stdout. The FS can take bytes anyway so it is an unnecessary conversion. The rest of the complexity of this PR is figuring out how to communicate those logs over the Batch API in a backwards-compatible and straight-forward way. What makes that tricky is that when we say "job log" in the code we mostly talk about a dictionary or JSON object of container to log string. Valid JSON must be UTF-8, so JSON starts to make less and less sense for sending log files. Log files are easy to send though if we just want one, so I added an endpoint to request the log for a specific container in a job and we can easily fulfill that without unnecessary encoding/decoding between str and bytes.Resolves #12666