-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[qob] fix job logs #12941
[qob] fix job logs #12941
Conversation
Citation for log4j1 programmatic configuration breaking log4j2: https://logging.apache.org/log4j/2.x/manual/migration.html#limitations-of-the-log4j-1-x-bridge |
Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1 programmatic reconfiguration system, you will break log4j2 for you and everyone else. The only way to recover from such a breakage is to use the log4j2 programmatic reconfiguration system. Changes in this PR: 1. Include JVM output in error logs when the JVM crashes. This should help debugging of JVM crashing in production until the JVM logs are shown on a per-worker page. 2. JVMEntryway is now a real gradle project. I need to compile against log4j, and I didn't want to do that by hand with `javac`. Ignore gradlew, gradlew.bat, and gradle/wrapper, they're programmatically generated by gradle. 3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into the QoB job log. I also log exceptions from the main thread or the cancel thread into the job log. We also flush the logs after the main thread completes, the cancel thread completes, and when the try-catch exits. This should ensure that regardless of what goes wrong (even if both threads fail to start) we at least see the arguments that the JVMEntryway received. 4. Use log4j2 programmatic reconfiguration after every job. This restores log4j2 to well enough working order that, *if you do not try to reconfigure it using log4j1 programmatic configuration*, logs will work. All old versions of Hail use log4j1 programmatic configuration. As a result, **all old versions of Hail will still have no logs**. However, new versions of Hail will log correctly even if an old version of Hail used the JVM before it. 5. `QoBAppender`. This is how we always should have done logging. A custom appender which we can flush and then redirect to a new file at our whim. I followed the log4j2 best practices for creating a new appender. All these annotations, factory methods, and managers are The Right Way, for better or worse. If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to `QoBAppender`). I would like to eliminate reconfiguration because log4j2 reconfiguration leaves around oprhaned appenders and appender managers. Maybe I'm implementing the Appender or Appender Manager interfaces wrong, but I've read over that code a bunch of times and I cannot sort out what I am missing.
097f259
to
fc4162f
Compare
We might want to do this if we get rid of GSA keys. We can't have any more jars that presume the existence of some key file. It would also be a good time to fully delete the |
I'm personally OK not supporting a long lineage of Hail versions on QoB. If we don't have to make people update for a few versions, great, but we should be OK making people run a pip upgrade. |
@tpoterba @daniel-goldstein Can one of you assign yourselves and review it? I don't have enough experience with Java etc. to properly do this review. |
I can review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems mostly fine, but a lot of space -> tab changes made it hard to read the changes in JVMEntryway.java
. I'm assuming it'll be easier once those are cleaned up.
batch/batch/worker/worker.py
Outdated
# the JVM log might inadvetantly contain sensitive information. | ||
'unexpected end of stream in jvm' | ||
) from eos_exception | ||
assert False, f'unexpected message type: {message}\nJVM Output:\n\n{jvm_output}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we can do something more than assert False
. Can we instead log.exception
and raise some ValueError
akin to our HailInternalError
that tells a user they should reach out to us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a log and switched from AssertionError to ValueError. Do you want everything switched to inform the user to contact us? All three of these branches (entryway exception, JVM EOS, and invalid message) are generally not user caused (except maybe the JVM EOS if the user has large data rows).
if (t2 != null) { | ||
entrywayException.addSuppressed(t2); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidental tabs?
URL url = loader.getResource("log4j2.properties"); | ||
System.err.println("reconfiguring logging " + url.toString()); | ||
context.setConfigLocation(url.toURI()); // this will force a reconfiguration | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also lots of tabs in the new code here.
private var _instances: mutable.Map[Layout[_], QoBOutputStreamManager] = mutable.Map() | ||
|
||
def getInstance(layout: Layout[_]): QoBOutputStreamManager = synchronized { | ||
_instances.getOrElseUpdate(layout, new QoBOutputStreamManager(layout)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this create a new object only to likely destroy it immediately after? Is that something to worry about? Unclear to me whether any of these log4j classes register themselves somewhere upon initialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map.getOrElseUpdate
is like Java's computeIfAbsent
. If the key exists it returns the value. Otherwise, it evaluates the thunk argument, inserts it into the map, and returns it.
getInstance
is called in the constructor to QoBAppender
. QoBAppender
is created by the Log4J2 infrastructure via createAppender
. Log4J2 finds that method by way of all these annotations (@PluginFactory
, etc.).
Dunno what happened with the tabs. I ran M-x untabify and pushed. |
Banning versions completely is a little tricky because the user can specify a JAR url directly instead of a version. JARs don't currently have a simple way to report pip version to the worker, though we could cook something up. We could also just delete the old JARs. |
I feel like we should make a (cached) request to ensure that the JAR exists in the front-end upon job submission and return a 400 if it doesn't exist instead of waiting for the worker to error. It would:
|
I endorse such a feature! In the meantime, I think we now have a PR that restores job logs for new versions of Hail. Let's get it out to the users post-haste! |
Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1 programmatic reconfiguration system, you will break log4j2 for you and everyone else. The only way to recover from such a breakage is to use the log4j2 programmatic reconfiguration system.
Changes in this PR:
Include JVM output in error logs when the JVM crashes. This should help debugging of JVM crashing in production until the JVM logs are shown on a per-worker page.
JVMEntryway is now a real gradle project. I need to compile against log4j, and I didn't want to do that by hand with
javac
. Ignore gradlew, gradlew.bat, and gradle/wrapper, they're programmatically generated by gradle.Add logging to JVMEntryway. JVMEntryway now logs its arguments into the QoB job log. I also log exceptions from the main thread or the cancel thread into the job log. We also flush the logs after the main thread completes, the cancel thread completes, and when the try-catch exits. This should ensure that regardless of what goes wrong (even if both threads fail to start) we at least see the arguments that the JVMEntryway received.
Use log4j2 programmatic reconfiguration after every job. This restores log4j2 to well enough working order that, if you do not try to reconfigure it using log4j1 programmatic configuration, logs will work. All old versions of Hail use log4j1 programmatic configuration. As a result, all old versions of Hail will still have no logs. However, new versions of Hail will log correctly even if an old version of Hail used the JVM before it.
QoBAppender
. This is how we always should have done logging. A custom appender which we can flush and then redirect to a new file at our whim. I followed the log4j2 best practices for creating a new appender. All these annotations, factory methods, and managers are The Right Way, for better or worse.If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to
QoBAppender
).I would like to eliminate reconfiguration because log4j2 reconfiguration leaves around oprhaned appenders and appender managers. Maybe I'm implementing the Appender or Appender Manager interfaces wrong, but I've read over that code a bunch of times and I cannot sort out what I am missing.