Skip to content

Commit

Permalink
[qob] fix job logs (#12941)
Browse files Browse the repository at this point in the history
Spark 3.3.0 uses log4j2. Note the "2". If you use the log4j1
programmatic reconfiguration system, you will break log4j2 for you and
everyone else. The only way to recover from such a breakage is to use
the log4j2 programmatic reconfiguration system.

Changes in this PR:

1. Include JVM output in error logs when the JVM crashes. This should
help debugging of JVM crashing in production until the JVM logs are
shown on a per-worker page.

2. JVMEntryway is now a real gradle project. I need to compile against
log4j, and I didn't want to do that by hand with `javac`. Ignore
gradlew, gradlew.bat, and gradle/wrapper, they're programmatically
generated by gradle.

3. Add logging to JVMEntryway. JVMEntryway now logs its arguments into
the QoB job log. I also log exceptions from the main thread or the
cancel thread into the job log. We also flush the logs after the main
thread completes, the cancel thread completes, and when the try-catch
exits. This should ensure that regardless of what goes wrong (even if
both threads fail to start) we at least see the arguments that the
JVMEntryway received.

4. Use log4j2 programmatic reconfiguration after every job. This
restores log4j2 to well enough working order that, *if you do not try to
reconfigure it using log4j1 programmatic configuration*, logs will work.
All old versions of Hail use log4j1 programmatic configuration. As a
result, **all old versions of Hail will still have no logs**. However,
new versions of Hail will log correctly even if an old version of Hail
used the JVM before it.

5. `QoBAppender`. This is how we always should have done logging. A
custom appender which we can flush and then redirect to a new file at
our whim. I followed the log4j2 best practices for creating a new
appender. All these annotations, factory methods, and managers are The
Right Way, for better or worse.

If we ever ban old versions of Hail from the cluster, then we can also
eliminate the log4j2 reconfiguration. New versions of Hail work fine
without any runtime log configuration (thanks to `QoBAppender`).

I would like to eliminate reconfiguration because log4j2 reconfiguration
leaves around oprhaned appenders and appender managers. Maybe I'm
implementing the Appender or Appender Manager interfaces wrong, but I've
read over that code a bunch of times and I cannot sort out what I am
missing.
  • Loading branch information
danking authored May 1, 2023
1 parent e0dbb8b commit 1940547
Show file tree
Hide file tree
Showing 14 changed files with 508 additions and 55 deletions.
10 changes: 3 additions & 7 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -165,14 +165,10 @@ hail-buildkit-image: ci/buildkit/Dockerfile
./docker-build.sh ci buildkit/Dockerfile.out $(HAIL_BUILDKIT_IMAGE)
echo $(HAIL_BUILDKIT_IMAGE) > $@

batch/jars/junixsocket-selftest-2.3.3-jar-with-dependencies.jar:
mkdir -p batch/jars
cd batch/jars && curl -LO https://github.com/kohlschutter/junixsocket/releases/download/junixsocket-parent-2.3.3/junixsocket-selftest-2.3.3-jar-with-dependencies.jar
batch/jvm-entryway/build/libs/jvm-entryway.jar: $(shell git ls-files batch/jvm-entryway)
cd batch/jvm-entryway && ./gradlew shadowJar

batch/src/main/java/is/hail/JVMEntryway.class: batch/src/main/java/is/hail/JVMEntryway.java batch/jars/junixsocket-selftest-2.3.3-jar-with-dependencies.jar
javac -cp batch/jars/junixsocket-selftest-2.3.3-jar-with-dependencies.jar $<

batch-worker-image: batch/src/main/java/is/hail/JVMEntryway.class $(SERVICES_IMAGE_DEPS) $(shell git ls-files batch)
batch-worker-image: batch/jvm-entryway/build/libs/jvm-entryway.jar $(SERVICES_IMAGE_DEPS) $(shell git ls-files batch)
$(eval BATCH_WORKER_IMAGE := $(DOCKER_PREFIX)/batch-worker:$(TOKEN))
python3 ci/jinja2_render.py '{"hail_ubuntu_image":{"image":"'$$(cat hail-ubuntu-image)'"},"global":{"cloud":"$(CLOUD)"}}' batch/Dockerfile.worker batch/Dockerfile.worker.out
./docker-build.sh . batch/Dockerfile.worker.out $(BATCH_WORKER_IMAGE)
Expand Down
3 changes: 1 addition & 2 deletions batch/Dockerfile.worker
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,5 @@ COPY batch/batch /batch/batch/

RUN hail-pip-install /hailtop /gear /batch

COPY batch/jars/junixsocket-selftest-2.3.3-jar-with-dependencies.jar /jvm-entryway/
COPY batch/src/main/java/is /jvm-entryway/is
COPY batch/jvm-entryway/build/libs/jvm-entryway.jar /jvm-entryway/
COPY letsencrypt/subdomains.txt /
33 changes: 24 additions & 9 deletions batch/batch/worker/worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -2338,7 +2338,7 @@ async def create_and_start(
'java',
f'-Xmx{heap_memory_mib}M',
'-cp',
f'/jvm-entryway:/jvm-entryway/junixsocket-selftest-2.3.3-jar-with-dependencies.jar:{JVM.SPARK_HOME}/jars/*',
f'/jvm-entryway/jvm-entryway.jar:{JVM.SPARK_HOME}/jars/*',
'is.hail.JVMEntryway',
socket_file,
]
Expand Down Expand Up @@ -2638,14 +2638,29 @@ async def execute(self, classpath: str, scratch_dir: str, log_file: str, jar_url
elif message == JVM.FINISH_USER_EXCEPTION:
exception = await read_str(reader)
raise JVMUserError(exception)
elif message == JVM.FINISH_ENTRYWAY_EXCEPTION:
log.warning(f'{self}: entryway exception encountered (interrupted: {wait_for_interrupt.done()})')
exception = await read_str(reader)
raise ValueError(exception)
elif message == JVM.FINISH_JVM_EOS:
assert eos_exception is not None
log.warning(f'{self}: unexpected end of stream in jvm (interrupted: {wait_for_interrupt.done()})')
raise ValueError('unexpected end of stream in jvm') from eos_exception
else:
jvm_output = ''
if os.path.exists(self.container.container.log_path):
jvm_output = (await self.fs.read(self.container.container.log_path)).decode('utf-8')

if message == JVM.FINISH_ENTRYWAY_EXCEPTION:
log.warning(
f'{self}: entryway exception encountered (interrupted: {wait_for_interrupt.done()})\nJVM Output:\n\n{jvm_output}'
)
exception = await read_str(reader)
raise ValueError(exception)
if message == JVM.FINISH_JVM_EOS:
assert eos_exception is not None
log.warning(
f'{self}: unexpected end of stream in jvm (interrupted: {wait_for_interrupt.done()})\nJVM Output:\n\n{jvm_output}'
)
raise ValueError(
# Do not include the JVM log in the exception as this is sent to the user and
# the JVM log might inadvetantly contain sensitive information.
'unexpected end of stream in jvm'
) from eos_exception
log.exception(f'{self}: unexpected message type: {message}\nJVM Output:\n\n{jvm_output}')
raise ValueError(f'{self}: unexpected message type: {message}')


class Worker:
Expand Down
2 changes: 2 additions & 0 deletions batch/jvm-entryway/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
.bloop
.metals
62 changes: 62 additions & 0 deletions batch/jvm-entryway/build.gradle
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
buildscript {
repositories {
mavenCentral()
}
}

plugins {
id "application"
id 'java'
id 'scala'
id 'com.github.johnrengelman.shadow' version '8.1.1'
}

repositories {
mavenCentral()
maven { url "https://repository.cloudera.com/artifactory/cloudera-repos/" }
}

project.ext {
sparkVersion = System.getProperty("spark.version", "3.3.0")
scalaVersion = System.getProperty("scala.version", "2.12.13")
}

sourceSets {
main {
scala {
// compile java and scala together so they can interdepend
srcDirs = ['src/main/scala', 'src/main/java']
}
java {
srcDirs = []
}
}
}

dependencies {
implementation 'com.kohlschutter.junixsocket:junixsocket-core:2.6.2'
compileOnly 'org.scala-lang:scala-library:' + scalaVersion
compileOnly 'org.scala-lang:scala-reflect:' + scalaVersion
compileOnly('org.apache.spark:spark-core_2.12:' + sparkVersion) {
exclude module: 'hadoop-client'
}
}

jar {
manifest {
attributes 'Main-Class': application.mainClass
}
}

shadowJar {
archiveBaseName.set('jvm-entryway')
archiveClassifier.set('')
archiveVersion.set('')
}

application {
mainClassName = "is.hail.JVMEntryway"
// these can help debug log4j
// applicationDefaultJvmArgs = ["-Dlog4j.debug"]
// applicationDefaultJvmArgs = ["-Dlog4j2.debug"]
}
Binary file not shown.
5 changes: 5 additions & 0 deletions batch/jvm-entryway/gradle/wrapper/gradle-wrapper.properties
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.1.1-bin.zip
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
185 changes: 185 additions & 0 deletions batch/jvm-entryway/gradlew
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
#!/usr/bin/env sh

#
# Copyright 2015 the original author or authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

##############################################################################
##
## Gradle start up script for UN*X
##
##############################################################################

# Attempt to set APP_HOME
# Resolve links: $0 may be a link
PRG="$0"
# Need this for relative symlinks.
while [ -h "$PRG" ] ; do
ls=`ls -ld "$PRG"`
link=`expr "$ls" : '.*-> \(.*\)$'`
if expr "$link" : '/.*' > /dev/null; then
PRG="$link"
else
PRG=`dirname "$PRG"`"/$link"
fi
done
SAVED="`pwd`"
cd "`dirname \"$PRG\"`/" >/dev/null
APP_HOME="`pwd -P`"
cd "$SAVED" >/dev/null

APP_NAME="Gradle"
APP_BASE_NAME=`basename "$0"`

# Add default JVM options here. You can also use JAVA_OPTS and GRADLE_OPTS to pass JVM options to this script.
DEFAULT_JVM_OPTS='"-Xmx64m" "-Xms64m"'

# Use the maximum available, or set MAX_FD != -1 to use that value.
MAX_FD="maximum"

warn () {
echo "$*"
}

die () {
echo
echo "$*"
echo
exit 1
}

# OS specific support (must be 'true' or 'false').
cygwin=false
msys=false
darwin=false
nonstop=false
case "`uname`" in
CYGWIN* )
cygwin=true
;;
Darwin* )
darwin=true
;;
MINGW* )
msys=true
;;
NONSTOP* )
nonstop=true
;;
esac

CLASSPATH=$APP_HOME/gradle/wrapper/gradle-wrapper.jar


# Determine the Java command to use to start the JVM.
if [ -n "$JAVA_HOME" ] ; then
if [ -x "$JAVA_HOME/jre/sh/java" ] ; then
# IBM's JDK on AIX uses strange locations for the executables
JAVACMD="$JAVA_HOME/jre/sh/java"
else
JAVACMD="$JAVA_HOME/bin/java"
fi
if [ ! -x "$JAVACMD" ] ; then
die "ERROR: JAVA_HOME is set to an invalid directory: $JAVA_HOME
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi
else
JAVACMD="java"
which java >/dev/null 2>&1 || die "ERROR: JAVA_HOME is not set and no 'java' command could be found in your PATH.
Please set the JAVA_HOME variable in your environment to match the
location of your Java installation."
fi

# Increase the maximum file descriptors if we can.
if [ "$cygwin" = "false" -a "$darwin" = "false" -a "$nonstop" = "false" ] ; then
MAX_FD_LIMIT=`ulimit -H -n`
if [ $? -eq 0 ] ; then
if [ "$MAX_FD" = "maximum" -o "$MAX_FD" = "max" ] ; then
MAX_FD="$MAX_FD_LIMIT"
fi
ulimit -n $MAX_FD
if [ $? -ne 0 ] ; then
warn "Could not set maximum file descriptor limit: $MAX_FD"
fi
else
warn "Could not query maximum file descriptor limit: $MAX_FD_LIMIT"
fi
fi

# For Darwin, add options to specify how the application appears in the dock
if $darwin; then
GRADLE_OPTS="$GRADLE_OPTS \"-Xdock:name=$APP_NAME\" \"-Xdock:icon=$APP_HOME/media/gradle.icns\""
fi

# For Cygwin or MSYS, switch paths to Windows format before running java
if [ "$cygwin" = "true" -o "$msys" = "true" ] ; then
APP_HOME=`cygpath --path --mixed "$APP_HOME"`
CLASSPATH=`cygpath --path --mixed "$CLASSPATH"`

JAVACMD=`cygpath --unix "$JAVACMD"`

# We build the pattern for arguments to be converted via cygpath
ROOTDIRSRAW=`find -L / -maxdepth 1 -mindepth 1 -type d 2>/dev/null`
SEP=""
for dir in $ROOTDIRSRAW ; do
ROOTDIRS="$ROOTDIRS$SEP$dir"
SEP="|"
done
OURCYGPATTERN="(^($ROOTDIRS))"
# Add a user-defined pattern to the cygpath arguments
if [ "$GRADLE_CYGPATTERN" != "" ] ; then
OURCYGPATTERN="$OURCYGPATTERN|($GRADLE_CYGPATTERN)"
fi
# Now convert the arguments - kludge to limit ourselves to /bin/sh
i=0
for arg in "$@" ; do
CHECK=`echo "$arg"|egrep -c "$OURCYGPATTERN" -`
CHECK2=`echo "$arg"|egrep -c "^-"` ### Determine if an option

if [ $CHECK -ne 0 ] && [ $CHECK2 -eq 0 ] ; then ### Added a condition
eval `echo args$i`=`cygpath --path --ignore --mixed "$arg"`
else
eval `echo args$i`="\"$arg\""
fi
i=`expr $i + 1`
done
case $i in
0) set -- ;;
1) set -- "$args0" ;;
2) set -- "$args0" "$args1" ;;
3) set -- "$args0" "$args1" "$args2" ;;
4) set -- "$args0" "$args1" "$args2" "$args3" ;;
5) set -- "$args0" "$args1" "$args2" "$args3" "$args4" ;;
6) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" ;;
7) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" ;;
8) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" ;;
9) set -- "$args0" "$args1" "$args2" "$args3" "$args4" "$args5" "$args6" "$args7" "$args8" ;;
esac
fi

# Escape application args
save () {
for i do printf %s\\n "$i" | sed "s/'/'\\\\''/g;1s/^/'/;\$s/\$/' \\\\/" ; done
echo " "
}
APP_ARGS=`save "$@"`

# Collect all arguments for the java command, following the shell quoting and substitution rules
eval set -- $DEFAULT_JVM_OPTS $JAVA_OPTS $GRADLE_OPTS "\"-Dorg.gradle.appname=$APP_BASE_NAME\"" -classpath "\"$CLASSPATH\"" org.gradle.wrapper.GradleWrapperMain "$APP_ARGS"

exec "$JAVACMD" "$@"
Loading

0 comments on commit 1940547

Please sign in to comment.