[qob] fix job logs #12941

danking · 2023-04-27T00:53:17Z

Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1 programmatic reconfiguration system, you will break log4j2 for you and everyone else. The only way to recover from such a breakage is to use the log4j2 programmatic reconfiguration system.

Changes in this PR:

Include JVM output in error logs when the JVM crashes. This should help debugging of JVM crashing in production until the JVM logs are shown on a per-worker page.
JVMEntryway is now a real gradle project. I need to compile against log4j, and I didn't want to do that by hand with javac. Ignore gradlew, gradlew.bat, and gradle/wrapper, they're programmatically generated by gradle.
Add logging to JVMEntryway. JVMEntryway now logs its arguments into the QoB job log. I also log exceptions from the main thread or the cancel thread into the job log. We also flush the logs after the main thread completes, the cancel thread completes, and when the try-catch exits. This should ensure that regardless of what goes wrong (even if both threads fail to start) we at least see the arguments that the JVMEntryway received.
Use log4j2 programmatic reconfiguration after every job. This restores log4j2 to well enough working order that, if you do not try to reconfigure it using log4j1 programmatic configuration, logs will work. All old versions of Hail use log4j1 programmatic configuration. As a result, all old versions of Hail will still have no logs. However, new versions of Hail will log correctly even if an old version of Hail used the JVM before it.
QoBAppender. This is how we always should have done logging. A custom appender which we can flush and then redirect to a new file at our whim. I followed the log4j2 best practices for creating a new appender. All these annotations, factory methods, and managers are The Right Way, for better or worse.

If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to QoBAppender).

I would like to eliminate reconfiguration because log4j2 reconfiguration leaves around oprhaned appenders and appender managers. Maybe I'm implementing the Appender or Appender Manager interfaces wrong, but I've read over that code a bunch of times and I cannot sort out what I am missing.

danking · 2023-04-27T00:53:24Z

cc: @daniel-goldstein @tpoterba

danking · 2023-04-27T00:53:51Z

Citation for log4j1 programmatic configuration breaking log4j2: https://logging.apache.org/log4j/2.x/manual/migration.html#limitations-of-the-log4j-1-x-bridge

Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1 programmatic reconfiguration system, you will break log4j2 for you and everyone else. The only way to recover from such a breakage is to use the log4j2 programmatic reconfiguration system. Changes in this PR: 1. Include JVM output in error logs when the JVM crashes. This should help debugging of JVM crashing in production until the JVM logs are shown on a per-worker page. 2. JVMEntryway is now a real gradle project. I need to compile against log4j, and I didn't want to do that by hand with `javac`. Ignore gradlew, gradlew.bat, and gradle/wrapper, they're programmatically generated by gradle. 3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into the QoB job log. I also log exceptions from the main thread or the cancel thread into the job log. We also flush the logs after the main thread completes, the cancel thread completes, and when the try-catch exits. This should ensure that regardless of what goes wrong (even if both threads fail to start) we at least see the arguments that the JVMEntryway received. 4. Use log4j2 programmatic reconfiguration after every job. This restores log4j2 to well enough working order that, *if you do not try to reconfigure it using log4j1 programmatic configuration*, logs will work. All old versions of Hail use log4j1 programmatic configuration. As a result, **all old versions of Hail will still have no logs**. However, new versions of Hail will log correctly even if an old version of Hail used the JVM before it. 5. `QoBAppender`. This is how we always should have done logging. A custom appender which we can flush and then redirect to a new file at our whim. I followed the log4j2 best practices for creating a new appender. All these annotations, factory methods, and managers are The Right Way, for better or worse. If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to `QoBAppender`). I would like to eliminate reconfiguration because log4j2 reconfiguration leaves around oprhaned appenders and appender managers. Maybe I'm implementing the Appender or Appender Manager interfaces wrong, but I've read over that code a bunch of times and I cannot sort out what I am missing.

daniel-goldstein · 2023-04-28T18:53:39Z

If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to QoBAppender).

We might want to do this if we get rid of GSA keys. We can't have any more jars that presume the existence of some key file. It would also be a good time to fully delete the memory service, even though old jars should be able to tolerate that.

tpoterba · 2023-04-28T19:26:56Z

I'm personally OK not supporting a long lineage of Hail versions on QoB. If we don't have to make people update for a few versions, great, but we should be OK making people run a pip upgrade.

jigold · 2023-04-28T19:42:55Z

@tpoterba @daniel-goldstein Can one of you assign yourselves and review it? I don't have enough experience with Java etc. to properly do this review.

daniel-goldstein · 2023-04-28T19:48:14Z

I can review

daniel-goldstein

This seems mostly fine, but a lot of space -> tab changes made it hard to read the changes in JVMEntryway.java. I'm assuming it'll be easier once those are cleaned up.

daniel-goldstein · 2023-04-28T20:17:33Z

batch/batch/worker/worker.py

+                        # the JVM log might inadvetantly contain sensitive information.
+                        'unexpected end of stream in jvm'
+                    ) from eos_exception
+                assert False, f'unexpected message type: {message}\nJVM Output:\n\n{jvm_output}'


I feel like we can do something more than assert False. Can we instead log.exception and raise some ValueError akin to our HailInternalError that tells a user they should reach out to us.

I added a log and switched from AssertionError to ValueError. Do you want everything switched to inform the user to contact us? All three of these branches (entryway exception, JVM EOS, and invalid message) are generally not user caused (except maybe the JVM EOS if the user has large data rows).

daniel-goldstein · 2023-04-28T20:45:11Z

batch/jvm-entryway/src/main/java/is/hail/JVMEntryway.java

+	    if (t2 != null) {
+	      entrywayException.addSuppressed(t2);
+	    }
+	  }


Accidental tabs?

daniel-goldstein · 2023-04-28T20:45:50Z

batch/jvm-entryway/src/main/java/is/hail/JVMEntryway.java

+	  URL url = loader.getResource("log4j2.properties");
+	  System.err.println("reconfiguring logging " + url.toString());
+	  context.setConfigLocation(url.toURI()); // this will force a reconfiguration
+	}


Also lots of tabs in the new code here.

daniel-goldstein · 2023-04-28T20:50:56Z

batch/jvm-entryway/src/main/scala/is/hail/QoBAppender.scala

+  private var _instances: mutable.Map[Layout[_], QoBOutputStreamManager] = mutable.Map()
+
+  def getInstance(layout: Layout[_]): QoBOutputStreamManager = synchronized {
+    _instances.getOrElseUpdate(layout, new QoBOutputStreamManager(layout))


Does this create a new object only to likely destroy it immediately after? Is that something to worry about? Unclear to me whether any of these log4j classes register themselves somewhere upon initialization.

map.getOrElseUpdate is like Java's computeIfAbsent. If the key exists it returns the value. Otherwise, it evaluates the thunk argument, inserts it into the map, and returns it.

getInstance is called in the constructor to QoBAppender. QoBAppender is created by the Log4J2 infrastructure via createAppender. Log4J2 finds that method by way of all these annotations (@PluginFactory, etc.).

danking · 2023-04-29T01:12:11Z

Dunno what happened with the tabs. I ran M-x untabify and pushed.

danking · 2023-04-29T01:22:41Z

Banning versions completely is a little tricky because the user can specify a JAR url directly instead of a version. JARs don't currently have a simple way to report pip version to the worker, though we could cook something up. We could also just delete the old JARs.

X

daniel-goldstein · 2023-05-01T14:11:06Z

Banning versions completely is a little tricky because the user can specify a JAR url directly instead of a version. JARs don't currently have a simple way to report pip version to the worker, though we could cook something up. We could also just delete the old JARs.

I feel like we should make a (cached) request to ensure that the JAR exists in the front-end upon job submission and return a 400 if it doesn't exist instead of waiting for the worker to error. It would:

Allow us to remove support by deleting old jars
Fail fast (I know I have accidentally messed up deploying a dev jar and had to wait until a worker came online to find out)
Avoid alerts from workers that can't find dev jars due to mistakes like I mention above

danking · 2023-05-01T17:25:59Z

I endorse such a feature! In the meantime, I think we now have a PR that restores job logs for new versions of Hail. Let's get it out to the users post-haste!

danking assigned jigold Apr 27, 2023

danking force-pushed the fix-these-damn-worker-logs branch from 097f259 to fc4162f Compare April 27, 2023 00:55

fix

80a782c

danking added the prio:high label Apr 27, 2023

blacken

5b8e028

danking removed the prio:high label Apr 27, 2023

daniel-goldstein self-assigned this Apr 28, 2023

daniel-goldstein previously requested changes Apr 28, 2023

View reviewed changes

Dan King added 3 commits April 28, 2023 21:02

no tabs

5593158

remove tabs

9521af5

fix

bb65a3b

Dan King added 2 commits April 28, 2023 21:14

value error instead of assertoin

97150da

blacken

6a7d047

daniel-goldstein approved these changes May 1, 2023

View reviewed changes

danking merged commit 1940547 into hail-is:main May 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[qob] fix job logs #12941

[qob] fix job logs #12941

danking commented Apr 27, 2023

danking commented Apr 27, 2023

danking commented Apr 27, 2023

daniel-goldstein commented Apr 28, 2023

tpoterba commented Apr 28, 2023

jigold commented Apr 28, 2023

daniel-goldstein commented Apr 28, 2023

daniel-goldstein left a comment

daniel-goldstein Apr 28, 2023

danking Apr 29, 2023

daniel-goldstein Apr 28, 2023

daniel-goldstein Apr 28, 2023

daniel-goldstein Apr 28, 2023

danking Apr 29, 2023

danking commented Apr 29, 2023

danking commented Apr 29, 2023

daniel-goldstein commented May 1, 2023 •

edited

Loading

danking commented May 1, 2023

[qob] fix job logs #12941

[qob] fix job logs #12941

Conversation

danking commented Apr 27, 2023

danking commented Apr 27, 2023

danking commented Apr 27, 2023

daniel-goldstein commented Apr 28, 2023

tpoterba commented Apr 28, 2023

jigold commented Apr 28, 2023

daniel-goldstein commented Apr 28, 2023

daniel-goldstein left a comment

Choose a reason for hiding this comment

daniel-goldstein Apr 28, 2023

Choose a reason for hiding this comment

danking Apr 29, 2023

Choose a reason for hiding this comment

daniel-goldstein Apr 28, 2023

Choose a reason for hiding this comment

daniel-goldstein Apr 28, 2023

Choose a reason for hiding this comment

daniel-goldstein Apr 28, 2023

Choose a reason for hiding this comment

danking Apr 29, 2023

Choose a reason for hiding this comment

danking commented Apr 29, 2023

danking commented Apr 29, 2023

daniel-goldstein commented May 1, 2023 • edited Loading

danking commented May 1, 2023

daniel-goldstein commented May 1, 2023 •

edited

Loading