Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(jest-worker): fix hanging when workers are killed or unexpectedly exit #13566

Merged
merged 2 commits into from
Nov 7, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
### Fixes

- `[jest-mock]` Treat cjs modules as objects so they can be mocked ([#13513](https://github.com/facebook/jest/pull/13513))
- `[jest-worker]` Throw an error instead of hanging when jest workers are killed ([#13566](https://github.com/facebook/jest/pull/13566))

### Chore & Maintenance

Expand Down
42 changes: 41 additions & 1 deletion packages/jest-worker/src/workers/ChildProcessWorker.ts
Original file line number Diff line number Diff line change
Expand Up @@ -343,7 +343,7 @@ export default class ChildProcessWorker
}
}

private _onExit(exitCode: number | null) {
private _onExit(exitCode: number | null, signal: NodeJS.Signals | null) {
this._workerReadyPromise = undefined;
this._resolveWorkerReady = undefined;

Expand Down Expand Up @@ -372,6 +372,46 @@ export default class ChildProcessWorker
this._child.send(this._request);
}
} else {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this else branch has gone through several changes in the past.

// At this point, it's not clear why the child process exited. There could
// be several reasons:
Comment on lines +375 to +376
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we can reasonably guess when the Node.js internal max heap limit has been hit, it's difficult to guess other failure scenarios from exitCode and signal.

There was a previous discussion of this in the linked issue: #13183 (comment)

I feel propagating an error in the else block is the best way to handle unknown situations.

//
// 1. The child process exited successfully after finishing its work.
// This is the most likely case.
// 2. The child process crashed in a manner that wasn't caught through
// any of the heuristic-based checks above.
// 3. The child process was killed by another process or daemon unrelated
// to Jest. For example, oom-killer on Linux may have picked the child
// process to kill because overall system memory is constrained.
//
// If there's a pending request to the child process in any of those
// situations, the request still needs to be handled in some manner before
// entering the shutdown phase. Otherwise the caller expecting a response
// from the worker will never receive indication that something unexpected
// happened and hang forever.
//
// In normal operation, the request is handled and cleared before the
// child process exits. If it's still present, it's not clear what
// happened and probably best to throw an error. In practice, this usually
// happens when the child process is killed externally.
Comment on lines +392 to +395
Copy link
Contributor Author

@gluxon gluxon Nov 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a better design pattern to guarantee this._request is handled in some manner. At the moment this happens through a difficult to follow sequence:

  1. The this._onMessage method calls this._onProcessEnd. https://github.com/facebook/jest/blob/a0703c94936e2459ae8600a30895da6f6b8b05de/packages/jest-worker/src/workers/ChildProcessWorker.ts#L262-L263
  2. The this._onProcessEnd method sets this._request to null. It can reasonably do this since the child process sent a successful response. https://github.com/facebook/jest/blob/a0703c94936e2459ae8600a30895da6f6b8b05de/packages/jest-worker/src/workers/ChildProcessWorker.ts#L432-L434

The this._onMessage has several cases that may or may not call this._onProcessEnd. Without better guarantees, it's possible a refactor of this class in the future reintroduces the bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to take a follow-up if you figure out a better pattern for this. But I think it's fine as is as well

//
// There's a reasonable argument that the child process should be retried
// with request re-sent in this scenario. However, if the problem was due
// to situations such as oom-killer attempting to free up system
// resources, retrying would exacerbate the problem.
const isRequestStillPending = !!this._request;
if (isRequestStillPending) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I observed a few tests within Jest fail without the isRequestStillPending check here. I don't believe this can happen in practice since instances of jest-worker are usually jest-runner, which is always sent a request. Some fixture workers in jest-worker appear to work differently.

Kept the check just in case. It felt safer to only throw an error in situations we're certain something went wrong.

❯ yarn jest WorkerEdgeCases
 FAIL  packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.ts (32.687 s)
  ✓ workers/processChild.js should exist (1 ms)
  ✓ workers/threadChild.js should exist
  ✓ types.js should exist
  ProcessWorker
    ✕ should get memory usage (10003 ms)
    ✕ should recycle on idle limit breach (10001 ms)
    should automatically recycle on idle limit breach
      ✓ initial state (2 ms)
      ✓ new worker starts (86 ms)
      ✓ worker continues to run after kill delay (604 ms)
      ✓ expected state order (1 ms)
    should cleanly exit on out of memory crash
      ✓ starting state
      ✓ worker ready
      ✓ worker crashes and exits (136 ms)
      ✓ worker stays dead (3 ms)
      ✓ expected state order
    should handle regular fatal crashes
      ✓ starting state (1 ms)
      ✓ processes restart (4002 ms)
      ✓ processes exits (2 ms)
  ThreadWorker
    ✓ should get memory usage (74 ms)
    ✓ should recycle on idle limit breach (646 ms)
    should automatically recycle on idle limit breach
      ✓ initial state (1 ms)
      ✓ new worker starts (71 ms)
      ✓ worker continues to run after kill delay (603 ms)
      ✓ expected state order (2 ms)
    should cleanly exit on out of memory crash
      ✓ starting state (1 ms)
      ✓ worker ready (1 ms)
      ✓ worker crashes and exits (144 ms)
      ✓ worker stays dead (1 ms)
      ✓ expected state order
    should handle regular fatal crashes
      ✓ starting state (1 ms)
      ✓ processes restart (4002 ms)
      ✓ processes exits (1 ms)
  should not hang on external process kill
    ✓ onEnd callback is called (584 ms)

  ● ProcessWorker › should get memory usage

    TypeError: this._onProcessEnd is not a function

      at ChildProcessWorker._onProcessEnd [as _onExit] (packages/jest-worker/src/workers/ChildProcessWorker.ts:410:12)

  ● ProcessWorker › should get memory usage

    thrown: "Exceeded timeout of 10000 ms for a test.
    Use jest.setTimeout(newTimeout) to increase the timeout value, if this is a long-running test."

      at test (packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.ts:111:3)
          at Array.forEach (<anonymous>)

  ● ProcessWorker › should recycle on idle limit breach

    TypeError: this._onProcessEnd is not a function

      at ChildProcessWorker._onProcessEnd [as _onExit] (packages/jest-worker/src/workers/ChildProcessWorker.ts:410:12)

  ● ProcessWorker › should recycle on idle limit breach

    thrown: "Exceeded timeout of 10000 ms for a test.
    Use jest.setTimeout(newTimeout) to increase the timeout value, if this is a long-running test."

      at test (packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.ts:127:3)
          at Array.forEach (<anonymous>)

Test Suites: 1 failed, 1 total
Tests:       2 failed, 30 passed, 32 total
Snapshots:   0 total
Time:        32.854 s
Ran all test suites matching /WorkerEdgeCases/i.

// If a signal is present, we can be reasonably confident the process
// was killed externally. Log this fact so it's more clear to users that
// something went wrong externally, rather than a bug in Jest itself.
const error = new Error(
signal != null
? `A jest worker process (pid=${this._child.pid}) was terminated by another process: signal=${signal}, exitCode=${exitCode}. Operating system logs may contain more information on why this occurred.`
: `A jest worker process (pid=${this._child.pid}) crashed for an unknown reason: exitCode=${exitCode}`,
);

this._onProcessEnd(error, null);
}

this._shutdown();
}
}
Expand Down
65 changes: 65 additions & 0 deletions packages/jest-worker/src/workers/__tests__/WorkerEdgeCases.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -386,3 +386,68 @@ describe.each([
});
});
});

// This describe block only applies to the child process worker since it's
// generally not possible for external processes to abruptly kill a thread of
// another process.
describe('should not hang on external process kill', () => {
let worker: ChildProcessWorker;

beforeEach(() => {
const options = {
childWorkerPath: processChildWorkerPath,
maxRetries: 0,
silent: true,
workerPath: join(__dirname, '__fixtures__', 'SelfKillWorker'),
} as unknown as WorkerOptions;

worker = new ChildProcessWorker(options);
});

afterEach(async () => {
await new Promise<void>(resolve => {
setTimeout(async () => {
if (worker) {
worker.forceExit();
await worker.waitForExit();
}

resolve();
}, 500);
});
});

// Regression test for https://github.com/facebook/jest/issues/13183
test('onEnd callback is called', async () => {
let onEndPromiseResolve: () => void;
let onEndPromiseReject: (err: Error) => void;
const onEndPromise = new Promise<void>((resolve, reject) => {
onEndPromiseResolve = resolve;
onEndPromiseReject = reject;
});

const onStart = jest.fn();
const onEnd = jest.fn((err: Error | null) => {
if (err) {
return onEndPromiseReject(err);
}
onEndPromiseResolve();
});
const onCustom = jest.fn();

await worker.waitForWorkerReady();

// The SelfKillWorker simulates an external process calling SIGTERM on it,
// but just SIGTERMs itself underneath the hood to make this test easier.
worker.send(
[CHILD_MESSAGE_CALL, true, 'selfKill', []],
onStart,
onEnd,
onCustom,
);

// The onEnd callback should be called when the child process exits.
await expect(onEndPromise).rejects.toBeInstanceOf(Error);
expect(onEnd).toHaveBeenCalled();
});
});
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
/**
* Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.
*
* This source code is licensed under the MIT license found in the
* LICENSE file in the root directory of this source tree.
*/
const {isMainThread} = require('worker_threads');

async function selfKill() {
// This test is intended for the child process worker. If the Node.js worker
// thread mode is accidentally tested instead, let's prevent a confusing
// situation where process.kill stops the Jest test harness itself.
if (!isMainThread) {
// process.exit is documented to only stop the current thread rather than
// the process in a worker_threads environment.
process.exit();
}

process.kill(process.pid);
}

module.exports = {
selfKill,
};