Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[qob] fix job logs #12941

Merged
merged 8 commits into from
May 1, 2023
Merged

[qob] fix job logs #12941

merged 8 commits into from
May 1, 2023

Conversation

danking
Copy link
Contributor

@danking danking commented Apr 27, 2023

Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1 programmatic reconfiguration system, you will break log4j2 for you and everyone else. The only way to recover from such a breakage is to use the log4j2 programmatic reconfiguration system.

Changes in this PR:

  1. Include JVM output in error logs when the JVM crashes. This should help debugging of JVM crashing in production until the JVM logs are shown on a per-worker page.

  2. JVMEntryway is now a real gradle project. I need to compile against log4j, and I didn't want to do that by hand with javac. Ignore gradlew, gradlew.bat, and gradle/wrapper, they're programmatically generated by gradle.

  3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into the QoB job log. I also log exceptions from the main thread or the cancel thread into the job log. We also flush the logs after the main thread completes, the cancel thread completes, and when the try-catch exits. This should ensure that regardless of what goes wrong (even if both threads fail to start) we at least see the arguments that the JVMEntryway received.

  4. Use log4j2 programmatic reconfiguration after every job. This restores log4j2 to well enough working order that, if you do not try to reconfigure it using log4j1 programmatic configuration, logs will work. All old versions of Hail use log4j1 programmatic configuration. As a result, all old versions of Hail will still have no logs. However, new versions of Hail will log correctly even if an old version of Hail used the JVM before it.

  5. QoBAppender. This is how we always should have done logging. A custom appender which we can flush and then redirect to a new file at our whim. I followed the log4j2 best practices for creating a new appender. All these annotations, factory methods, and managers are The Right Way, for better or worse.

If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to QoBAppender).

I would like to eliminate reconfiguration because log4j2 reconfiguration leaves around oprhaned appenders and appender managers. Maybe I'm implementing the Appender or Appender Manager interfaces wrong, but I've read over that code a bunch of times and I cannot sort out what I am missing.

@danking
Copy link
Contributor Author

danking commented Apr 27, 2023

cc: @daniel-goldstein @tpoterba

@danking
Copy link
Contributor Author

danking commented Apr 27, 2023

Citation for log4j1 programmatic configuration breaking log4j2: https://logging.apache.org/log4j/2.x/manual/migration.html#limitations-of-the-log4j-1-x-bridge

Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1 programmatic reconfiguration system,
you will break log4j2 for you and everyone else. The only way to recover from such a breakage is to
use the log4j2 programmatic reconfiguration system.

Changes in this PR:

1. Include JVM output in error logs when the JVM crashes. This should help debugging of JVM crashing
   in production until the JVM logs are shown on a per-worker page.

2. JVMEntryway is now a real gradle project. I need to compile against log4j, and I didn't want to
   do that by hand with `javac`. Ignore gradlew, gradlew.bat, and gradle/wrapper, they're
   programmatically generated by gradle.

3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into the QoB job log. I also log
   exceptions from the main thread or the cancel thread into the job log. We also flush the logs
   after the main thread completes, the cancel thread completes, and when the try-catch exits. This
   should ensure that regardless of what goes wrong (even if both threads fail to start) we at least
   see the arguments that the JVMEntryway received.

4. Use log4j2 programmatic reconfiguration after every job. This restores log4j2 to well enough
   working order that, *if you do not try to reconfigure it using log4j1 programmatic
   configuration*, logs will work. All old versions of Hail use log4j1 programmatic
   configuration. As a result, **all old versions of Hail will still have no logs**. However,
   new versions of Hail will log correctly even if an old version of Hail used the JVM before it.

5. `QoBAppender`. This is how we always should have done logging. A custom appender which we can
   flush and then redirect to a new file at our whim. I followed the log4j2 best practices for
   creating a new appender. All these annotations, factory methods, and managers are The Right Way,
   for better or worse.

If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2
reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to
`QoBAppender`).

I would like to eliminate reconfiguration because log4j2 reconfiguration leaves around oprhaned
appenders and appender managers. Maybe I'm implementing the Appender or Appender Manager interfaces
wrong, but I've read over that code a bunch of times and I cannot sort out what I am missing.
@danking danking removed the prio:high label Apr 27, 2023
@daniel-goldstein
Copy link
Contributor

If we ever ban old versions of Hail from the cluster, then we can also eliminate the log4j2 reconfiguration. New versions of Hail work fine without any runtime log configuration (thanks to QoBAppender).

We might want to do this if we get rid of GSA keys. We can't have any more jars that presume the existence of some key file. It would also be a good time to fully delete the memory service, even though old jars should be able to tolerate that.

@tpoterba
Copy link
Contributor

I'm personally OK not supporting a long lineage of Hail versions on QoB. If we don't have to make people update for a few versions, great, but we should be OK making people run a pip upgrade.

@jigold
Copy link
Contributor

jigold commented Apr 28, 2023

@tpoterba @daniel-goldstein Can one of you assign yourselves and review it? I don't have enough experience with Java etc. to properly do this review.

@daniel-goldstein
Copy link
Contributor

I can review

@daniel-goldstein daniel-goldstein self-assigned this Apr 28, 2023
Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems mostly fine, but a lot of space -> tab changes made it hard to read the changes in JVMEntryway.java. I'm assuming it'll be easier once those are cleaned up.

# the JVM log might inadvetantly contain sensitive information.
'unexpected end of stream in jvm'
) from eos_exception
assert False, f'unexpected message type: {message}\nJVM Output:\n\n{jvm_output}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we can do something more than assert False. Can we instead log.exception and raise some ValueError akin to our HailInternalError that tells a user they should reach out to us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a log and switched from AssertionError to ValueError. Do you want everything switched to inform the user to contact us? All three of these branches (entryway exception, JVM EOS, and invalid message) are generally not user caused (except maybe the JVM EOS if the user has large data rows).

if (t2 != null) {
entrywayException.addSuppressed(t2);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental tabs?

URL url = loader.getResource("log4j2.properties");
System.err.println("reconfiguring logging " + url.toString());
context.setConfigLocation(url.toURI()); // this will force a reconfiguration
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also lots of tabs in the new code here.

private var _instances: mutable.Map[Layout[_], QoBOutputStreamManager] = mutable.Map()

def getInstance(layout: Layout[_]): QoBOutputStreamManager = synchronized {
_instances.getOrElseUpdate(layout, new QoBOutputStreamManager(layout))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this create a new object only to likely destroy it immediately after? Is that something to worry about? Unclear to me whether any of these log4j classes register themselves somewhere upon initialization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map.getOrElseUpdate is like Java's computeIfAbsent. If the key exists it returns the value. Otherwise, it evaluates the thunk argument, inserts it into the map, and returns it.

getInstance is called in the constructor to QoBAppender. QoBAppender is created by the Log4J2 infrastructure via createAppender. Log4J2 finds that method by way of all these annotations (@PluginFactory, etc.).

@danking
Copy link
Contributor Author

danking commented Apr 29, 2023

Dunno what happened with the tabs. I ran M-x untabify and pushed.

@danking
Copy link
Contributor Author

danking commented Apr 29, 2023

Banning versions completely is a little tricky because the user can specify a JAR url directly instead of a version. JARs don't currently have a simple way to report pip version to the worker, though we could cook something up. We could also just delete the old JARs.

@daniel-goldstein
Copy link
Contributor

daniel-goldstein commented May 1, 2023

Banning versions completely is a little tricky because the user can specify a JAR url directly instead of a version. JARs don't currently have a simple way to report pip version to the worker, though we could cook something up. We could also just delete the old JARs.

I feel like we should make a (cached) request to ensure that the JAR exists in the front-end upon job submission and return a 400 if it doesn't exist instead of waiting for the worker to error. It would:

  • Allow us to remove support by deleting old jars
  • Fail fast (I know I have accidentally messed up deploying a dev jar and had to wait until a worker came online to find out)
  • Avoid alerts from workers that can't find dev jars due to mistakes like I mention above

@danking
Copy link
Contributor Author

danking commented May 1, 2023

I endorse such a feature! In the meantime, I think we now have a PR that restores job logs for new versions of Hail. Let's get it out to the users post-haste!

@danking danking merged commit 1940547 into hail-is:main May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants