Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[qob] prevent race to write to log file #13729

Merged
merged 1 commit into from
Sep 27, 2023
Merged

Conversation

danking
Copy link
Contributor

@danking danking commented Sep 27, 2023

Fixes #13716

This finally block is currently located after gather.take() but before we cancel (aka shutdown) all threads. As a result, it is possible for us to shudown the logging (thus flushing, closing, and destorying old appenders) then restart logging (thus opening the file in overwrite mode) and blow away whatever was there.

I have verified in my namespace across 10s of thosuands of JVM Jobs that this never blows away the log.

This finally block is currently located after `gather.take()` but *before* we cancel (aka shutdown)
all threads. As a result, it is possible for us to shudown the logging (thus flushing, closing,
and destorying old appenders) then restart logging (thus opening the file in overwrite mode) and
blow away whatever was there.

I have verified in my namespace across 10s of thosuands of JVM Jobs that this never blows away the
log.
@danking danking merged commit d2ad33d into hail-is:main Sep 27, 2023
8 checks passed
danking added a commit that referenced this pull request Oct 16, 2023
…13715)

CHANGELOG: Fixes #13697, a long standing issue with QoB, in which a
failing partition job or driver job is not failed in the Batch UI.

I am not sure why we did not do this this way in the first place. If a
JVMJob raises an exception, Batch will mark the job as failed. Ergo, we
should raise an exception when a driver or a worker fails!

Here's an example: I used a simple pipeline that write to a bucket to
which I have read-only access. You can see an example Batch (where every
partition fails): https://batch.hail.is/batches/8046901. [1]

```python3
import hail as hl
hl.utils.range_table(3, n_partitions=3).write('gs://neale-bge/foo.ht')
```

NB: I removed the `log.error` in `handleForPython` because that log is
never necessary. That function converts a stack of exceptions into a
triplet of the short message, the full exception with stack trace, and a
Hail error id (if present). That triplet is always passed along to
someone else who logs the exception.

(FWIW, the error id indicates a Python source location that is
associated with the error. On the Python-side, we can look up that error
id and provide a better stack trace.)

[1] You'll notice the logs are missing. I noticed this as well, it's a
new bug. I fixed it in #13729.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[qob] a JVM job's logs are highly likely to be replaced with a message about EOFException in cancel thread
2 participants