Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Support job logs that are not UTF-8 compatible #12666

Merged
merged 8 commits into from
Feb 17, 2023

Conversation

daniel-goldstein
Copy link
Contributor

@daniel-goldstein daniel-goldstein commented Feb 7, 2023

The crux of this PR is the 9 line change in file_store.py -- we shouldn't decode job logs to str before uploading because we can't trust jobs to output only valid UTF-8 to stdout. The FS can take bytes anyway so it is an unnecessary conversion. The rest of the complexity of this PR is figuring out how to communicate those logs over the Batch API in a backwards-compatible and straight-forward way. What makes that tricky is that when we say "job log" in the code we mostly talk about a dictionary or JSON object of container to log string. Valid JSON must be UTF-8, so JSON starts to make less and less sense for sending log files. Log files are easy to send though if we just want one, so I added an endpoint to request the log for a specific container in a job and we can easily fulfill that without unnecessary encoding/decoding between str and bytes.

Resolves #12666

@daniel-goldstein
Copy link
Contributor Author

daniel-goldstein commented Feb 8, 2023

An additional detail that I fixed in this change is that we no longer need to load the log file into memory on the worker in order to send it to the batch pod (by using FileResponse). It would also be nice to do this

  • when the worker is uploading the file
  • on the batch pod by somehow chaining the StreamResponse from the worker/blob storage to the client

but these felt like bigger changes that would've needed to touch more of the code so I left those out for now.

jigold
jigold previously requested changes Feb 10, 2023
job_log_strings: Dict[str, Optional[str]] = {}
for container, log in job_log_bytes.items():
try:
job_log_strings[container] = log.decode('utf-8') if log else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check that log is not None here. I'm worried about the empty string case.

job_log_strings_or_bytes = {}
for container, log in job_log_bytes.items():
try:
job_log_strings_or_bytes[container] = log.decode('utf-8') if log else None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same issue with not checking if log is None here for empty strings.

@@ -2152,17 +2198,20 @@ async def ui_get_job(request, userdata, batch_id):
return await render_template('batch', request, userdata, 'job.html', page_context)


# This should really be the exact same thing as the REST endpoint
@routes.get('/batches/{batch_id}/jobs/{job_id}/log/{container}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this endpoint is even used at all. You can probably just delete it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used for downloading logs from the UI. I meant to refactor this and remove the comment. I'll do that now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used for downloading logs from the UI. I meant to refactor this and remove the comment. I'll do that now

@@ -2265,7 +2247,8 @@ async def create_container_and_connect(
except (FileNotFoundError, ConnectionRefusedError) as err:
attempts += 1
if attempts == 240:
jvm_output = await container.container.get_log() or ''
# NOTE we assume that the JVM logs hail emits will be valid utf-8
jvm_output = (await fs.read(container.container.log_path)).decode('utf-8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we guarantee there is always a file existing at container.container.log_path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm maybe not here, I'll protect against this

@@ -1202,13 +1202,6 @@ def container_is_running(self):
def container_finished(self):
return self.process is not None and self.process.returncode is not None

async def get_log(self, offset: Optional[int] = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we never use the offset anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this was dead code

'GET',
f'http://{ip_address}:5000/api/v1alpha/batches/{batch_id}/jobs/{job_id}/log/{container}',
) as resp:
return await resp.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I understand, the FileResponse here doesn't deserve special treatment as it's just a special case of a regular Response?

Copy link
Contributor Author

@daniel-goldstein daniel-goldstein Feb 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it subclasses Response. The client can't tell, but it's a way for the server to stream bytes directly from an open file handle instead of loading the whole thing into memory on the server and then sending those bytes.

@@ -323,6 +329,10 @@ async def wait(self):
if i < 64:
i = i + 1

async def container_log(self, container_name: str) -> bytes:
async with await self._batch._client._get(f'/api/v1alpha/batches/{self.batch_id}/jobs/{self.job_id}/log/{container_name}') as resp:
return await resp.read()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be a FileResponse as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client can't tell whether or not it's a FileResponse.

@jigold
Copy link
Contributor

jigold commented Feb 10, 2023

Just so I understand correctly (and sorry if this is obvious), the current job logs interface is still the same. But if you want a container's logs, then you'll get bytes which the user will have to decode themselves. How does that affect the file download button in the UI and the hailctl batch logs functionality you have? Will you see text or a random byte string?

jigold
jigold previously requested changes Feb 10, 2023
Copy link
Contributor

@jigold jigold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see question

@daniel-goldstein
Copy link
Contributor Author

Just so I understand correctly (and sorry if this is obvious), the current job logs interface is still the same. But if you want a container's logs, then you'll get bytes which the user will have to decode themselves. How does that affect the file download button in the UI and the hailctl batch logs functionality you have? Will you see text or a random byte string?

Good question. For the download button, the file you download is still a normal text file and I've confirmed that I can download and view a log file as I would expect. For the hailctl batch logs functionality, I added logic to the CLI in this PR where I download the bytes of the log and if I can decode it as UTF-8 I do, so it prints exactly as before, and if I can't I just print the bytes.

@jigold
Copy link
Contributor

jigold commented Feb 10, 2023

Ok. Last question -- do we need to do something for the aioclient for the new container_log method or are you assuming a client can make the utf-8 conversion themselves?

@daniel-goldstein
Copy link
Contributor Author

Ok. Last question -- do we need to do something for the aioclient for the new container_log method or are you assuming a client can make the utf-8 conversion themselves?

I would leave it up to the caller of the aioclient.BatchClient.container_log method. At the end of the day it's returning a file which is more generally in bytes than str, so I feel like it makes sense for the method to return bytes.

jigold
jigold previously requested changes Feb 16, 2023
return web.json_response(job_log_strings)


async def get_job_container_log(request, userdata, batch_id):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checks are failing because userdata is not used. I think we don't need to use it because the decorators on the actual routes does the user validation check before it gets to this point. Would be great if you could double check that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup you're right it's used by the decorator and unused in the function I added.

@danking danking merged commit f7b3f10 into hail-is:main Feb 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants