Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Heartbeat] Use prctl to ensure subprocesses are cleaned up on exit #32393

Merged
merged 13 commits into from
Jul 22, 2022

Conversation

andrewvc
Copy link
Contributor

@andrewvc andrewvc commented Jul 18, 2022

Fixes #32363 by instructing the linux kernel to automatically kill node subprocesses if their parents die. In testing it appears chromium always dies as well, although I'm not entirely sure why. Either chrome sets the right flags itself, or the death signal propagates. Either way, in testing this works very solidly.

We don't have sufficient automated test infrastructure to write a good automated test here, so this will have to reply on manual testing.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run the below heartbeat inline yaml, this journey will hang for a long time, giving you chrome / node processes that last for a long while. Once it is started, run pkill heartbeat which should kill heartbeat and all subprocesses. On the main branch heartbeat and chromium will die, and node will persiste.

- type: browser
  enabled: true
  id: Inline mem 3
  name: Inline mem3
  source:
    inline:
      script:
        step("load homepage", async () => {
            await page.goto('https://www.elastic.co');
        });
        step("hover over products menu", async () => {
            await page.hover('css=[data-nav-item=products]');
        });
        step("failme", async () => {
           await (new Promise(done => {
             setTimeout(done, 100000);
           }));
           await page.hover('css=[data-nav-item=notathingonpage]');
        });
  schedule: "@every 1m"

@andrewvc andrewvc added bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v8.4.0 labels Jul 18, 2022
@andrewvc andrewvc self-assigned this Jul 18, 2022
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jul 18, 2022
@andrewvc andrewvc requested a review from emilioalvap July 18, 2022 20:44
@andrewvc andrewvc marked this pull request as ready for review July 18, 2022 20:47
@andrewvc andrewvc requested a review from a team as a code owner July 18, 2022 20:47
@elasticmachine
Copy link
Collaborator

Pinging @elastic/uptime (Team:Uptime)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Jul 18, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-07-22T17:46:58.875+0000

  • Duration: 46 min 47 sec

Test stats 🧪

Test Results
Failed 0
Passed 142
Skipped 0
Total 142

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Collaborator

@emilioalvap emilioalvap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, tested E2E with Fleet integrations.

@andrewvc andrewvc merged commit ec62a35 into elastic:main Jul 22, 2022
@andrewvc andrewvc deleted the ensure-cleanup branch July 22, 2022 20:58
@lucasfcosta
Copy link
Contributor

@andrewvc I've E2E tested this PR and it still didn't kill the Node processes (I'm using an x64 Mac btw)

I tested it by updating my monitor so that it would run for a long time:

heartbeat.monitors:
- type: browser
  id: exits-monitor-3
  name: exits-monitor-3
  schedule: '@every 10s'
  source:
    inline:
      script: |-
        step("second", async () => {
          await page.waitForTimeout(100_000)
          await page.goto('https://www.google.com');
        });
        step("second", async () => {
          await page.goto('https://www.bbc.co.uk');
        });

Then, I verified I did have node processes spawned by Heartbeat, killed Heartbeat, and verified whether there were any node processes hanging (see image below).

Screenshot 2022-08-04 at 11 20 14

Did I do anything wrong with regards to testing or should I open an issue for this?

@andrewvc
Copy link
Contributor Author

andrewvc commented Aug 4, 2022

@lucasfcosta sorry if it was unclear, but this is expected! This relies on linux specific behavior, so it won't work on other platforms. It's fine, however, in that synthetics only really needs to work on linux since it requires a linux container.

The meat of the PR is in https://github.com/elastic/beats/blob/main/x-pack/heartbeat/monitors/browser/synthexec/synthexec_linux.go , note the _linux which go uses to selectively compile only for the matching platform. I wish there was a cross-platform way, but this is a linux-specific syscall. I'm sure we could work it out for OSX, but it's probably not worth the complexity given that no actual user will run it that way.

@andrewvc
Copy link
Contributor Author

andrewvc commented Aug 4, 2022

The easiest way to test this, I should add, would be to check out the synthetics demo repo, modify https://github.com/elastic/synthetics-demo/blob/main/heartbeat/run.sh to set LATEST_RELEASE_TAG to 8.4.0-SNAPSHOT. Then, you can try attaching a console to the container and killing heartbeat processes.

I think that should work. I tested it myself on my own linux WSL box.

@lucasfcosta
Copy link
Contributor

@andrewvc indeed that solved the problem!

Here's the script I used to run it:

#!/bin/bash -e
LATEST_RELEASE_TAG="8.4.0-SNAPSHOT"
DOCKER_IMAGE=docker.elastic.co/beats/heartbeat:$LATEST_RELEASE_TAG
echo "Using docker image $DOCKER_IMAGE"
docker run \
  -it \
  --entrypoint=/bin/bash \
  --rm \
  --name=heartbeat \
  --user=heartbeat \
  --net=elastic-package-stack_default \
  --volume="/tmp/hb1.yml:/usr/share/heartbeat/heartbeat.yml:ro" \
  $DOCKER_IMAGE

Once I ran that script, I started heartbeat with ./heartbeat -e and ran the commands shown below to list the process tree, kill Heartbeat, and check the process tree again.

Screenshot 2022-08-08 at 10 59 53

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
…32393)

Fixes #32363 by instructing the linux kernel to automatically kill node subprocesses if their parents die. In testing it appears chromium always dies as well, although I'm not entirely sure why. Either chrome sets the right flags itself, or the death signal propagates. Either way, in testing this works very solidly.

We don't have sufficient automated test infrastructure to write a good automated test here, so this will have to reply on manual testing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Heartbeat Team:obs-ds-hosted-services Label for the Observability Hosted Services team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Heartbeat] Zombie-ish processes created under agent
4 participants