[Heartbeat] Use prctl to ensure subprocesses are cleaned up on exit #32393

andrewvc · 2022-07-18T20:41:50Z

Fixes #32363 by instructing the Linux kernel to automatically kill node subprocesses if their parent dies. In testing it appears Chromium always dies as well, although I'm not entirely sure why. Either Chrome sets the right flags itself, or the death signal propagates. Either way, in testing this works very solidly.
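
For readers unfamiliar with the mechanism, here is a minimal sketch of how a parent-death signal can be requested from Go. It is not the code in this PR: the helper name commandWithPdeathsig and the use of sleep as a stand-in for the node subprocess are assumptions made for illustration. It relies on the Pdeathsig field of syscall.SysProcAttr, which exists only on Linux (and FreeBSD) and is Go's way of setting prctl(PR_SET_PDEATHSIG) in the child. SIGKILL is used to match the "Use sigkill not sigterm" commit.

```go
//go:build linux

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// commandWithPdeathsig is a hypothetical helper (not taken from the PR).
// It builds a command whose process will receive SIGKILL from the kernel
// when the parent that started it dies.
func commandWithPdeathsig(name string, args ...string) *exec.Cmd {
	cmd := exec.Command(name, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Linux-only field; the kernel delivers this signal to the child
		// on parent death.
		Pdeathsig: syscall.SIGKILL,
	}
	return cmd
}

func main() {
	// "sleep 2" stands in for a long-running node subprocess.
	cmd := commandWithPdeathsig("sleep", "2")
	if err := cmd.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("child pid:", cmd.Process.Pid)
	// If this program were killed here, the kernel would kill the child too.
	_ = cmd.Wait()
}
```

One caveat worth knowing: on Linux the signal is tied to the thread that created the child, not to the whole parent process, so a Go program may need runtime.LockOSThread around the spawn if that thread could exit early. Whether that matters depends on how the subprocess is started.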

We don't have sufficient automated test infrastructure to write a good automated test here, so this will have to rely on manual testing.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run the inline heartbeat YAML below. The journey hangs for a long time, leaving you with long-lived chrome / node processes. Once it has started, run pkill heartbeat, which should kill heartbeat and all of its subprocesses. On the main branch, heartbeat and chromium will die, but node will persist.

- type: browser
  enabled: true
  id: Inline mem 3
  name: Inline mem3
  source:
    inline:
      script: |-
        step("load homepage", async () => {
            await page.goto('https://www.elastic.co');
        });
        step("hover over products menu", async () => {
            await page.hover('css=[data-nav-item=products]');
        });
        step("failme", async () => {
           await (new Promise(done => {
             setTimeout(done, 100000);
           }));
           await page.hover('css=[data-nav-item=notathingonpage]');
        });
  schedule: "@every 1m"

elasticmachine · 2022-07-18T20:47:20Z

Pinging @elastic/uptime (Team:Uptime)

elasticmachine · 2022-07-18T20:49:37Z

💚 Build Succeeded

Build stats

Start Time: 2022-07-22T17:46:58.875+0000
Duration: 46 min 47 sec

Test stats 🧪

Test	Results
Failed	0
Passed	142
Skipped	0
Total	142

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
/package : Generate the packages and run the E2E tests.
/beats-tester : Run the installation tests with beats-tester.
run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

emilioalvap

LGTM, tested E2E with Fleet integrations.

lucasfcosta · 2022-08-04T10:22:58Z

@andrewvc I've E2E tested this PR and it still didn't kill the Node processes (I'm using an x64 Mac btw)

I tested it by updating my monitor so that it would run for a long time:

heartbeat.monitors:
- type: browser
  id: exits-monitor-3
  name: exits-monitor-3
  schedule: '@every 10s'
  source:
    inline:
      script: |-
        step("second", async () => {
          await page.waitForTimeout(100_000)
          await page.goto('https://www.google.com');
        });
        step("second", async () => {
          await page.goto('https://www.bbc.co.uk');
        });

Then, I verified I did have node processes spawned by Heartbeat, killed Heartbeat, and verified whether there were any node processes hanging (see image below).

Did I do anything wrong with regards to testing or should I open an issue for this?

andrewvc · 2022-08-04T21:25:16Z

@lucasfcosta sorry if it was unclear, but this is expected! This relies on Linux-specific behavior, so it won't work on other platforms. That's fine, however, since synthetics only really needs to work on Linux: it requires a Linux container.

The meat of the PR is in https://github.com/elastic/beats/blob/main/x-pack/heartbeat/monitors/browser/synthexec/synthexec_linux.go . Note the _linux suffix, which Go uses to compile the file only for the matching platform. I wish there was a cross-platform way, but this is a Linux-specific syscall. I'm sure we could work it out for macOS, but it's probably not worth the complexity given that no actual user will run it that way.

andrewvc · 2022-08-04T21:29:49Z

The easiest way to test this, I should add, would be to check out the synthetics demo repo and modify https://github.com/elastic/synthetics-demo/blob/main/heartbeat/run.sh to set LATEST_RELEASE_TAG to 8.4.0-SNAPSHOT. Then you can try attaching a console to the container and killing heartbeat processes.

I think that should work. I tested it myself on my own Linux WSL box.

lucasfcosta · 2022-08-08T10:01:00Z

@andrewvc indeed that solved the problem!

Here's the script I used to run it:

#!/bin/bash -e
LATEST_RELEASE_TAG="8.4.0-SNAPSHOT"
DOCKER_IMAGE=docker.elastic.co/beats/heartbeat:$LATEST_RELEASE_TAG
echo "Using docker image $DOCKER_IMAGE"
docker run \
  -it \
  --entrypoint=/bin/bash \
  --rm \
  --name=heartbeat \
  --user=heartbeat \
  --net=elastic-package-stack_default \
  --volume="/tmp/hb1.yml:/usr/share/heartbeat/heartbeat.yml:ro" \
  $DOCKER_IMAGE

Once I ran that script, I started heartbeat with ./heartbeat -e and ran the commands shown below to list the process tree, kill Heartbeat, and check the process tree again.
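
The exact commands used for that check are not reproduced in this thread. As a rough equivalent, the sketch below scans /proc for processes with a given name, so it can be run before and after killing Heartbeat to see whether any node processes are left behind. The function name pidsByName is hypothetical, and the sketch assumes a Linux /proc layout and that the subprocess reports its name as "node" in /proc/<pid>/comm.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pidsByName returns the PIDs found under procRoot whose comm file
// (the short process name) equals name. Hypothetical helper for illustration.
func pidsByName(procRoot, name string) ([]int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a numeric process directory
		}
		comm, err := os.ReadFile(filepath.Join(procRoot, e.Name(), "comm"))
		if err != nil {
			continue // the process may have exited during the scan
		}
		if strings.TrimSpace(string(comm)) == name {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	// Assumption: leftover subprocesses show up with the name "node".
	pids, err := pidsByName("/proc", "node")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%d node process(es) running: %v\n", len(pids), pids)
}
```

With the fix in place, a run after killing Heartbeat inside the container should report zero node processes; on a build without it, the node PIDs would still be listed.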

Use prctl to ensure subprocesses are cleaned up on exit

cb1f08a

andrewvc added bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v8.4.0 labels Jul 18, 2022

andrewvc self-assigned this Jul 18, 2022

botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jul 18, 2022

Add changelog

77507fc

andrewvc requested a review from emilioalvap July 18, 2022 20:44

andrewvc marked this pull request as ready for review July 18, 2022 20:47

andrewvc requested a review from a team as a code owner July 18, 2022 20:47

andrewvc added 2 commits July 18, 2022 15:48

Improve comments

605a9ec

Improve comments

5e7fe40

andrewvc added 7 commits July 18, 2022 15:50

Remove unnecessary closes

52961b3

Fix imports

df0843e

Try syscall out, maybe easier

ca4494f

Fix syntax

73c3d94

Fix platform issues

d783263

Add license

6dd8d75

Only target linux for pdeathsig

9f490c7

emilioalvap force-pushed the ensure-cleanup branch from 60e2448 to 9f490c7 Compare July 21, 2022 17:06

andrewvc added 2 commits July 21, 2022 20:35

Merge remote-tracking branch 'origin/main' into ensure-cleanup

77bf7e0

Use sigkill not sigterm

0db35d4

emilioalvap approved these changes Jul 22, 2022

andrewvc merged commit ec62a35 into elastic:main Jul 22, 2022

andrewvc deleted the ensure-cleanup branch July 22, 2022 20:58

lucasfcosta mentioned this pull request Aug 4, 2022

[Heartbeat] Zombie-ish processes created under agent #32363

Closed
