Stop runner from hanging indefinitely within ubuntu docker images [elastic/beats#29681] #441

Merged 2 commits into elastic:master from fix-ubuntu-runs on Jan 13, 2022

@lucasfcosta lucasfcosta commented Jan 11, 2022

⚠️ There is more E2E test information here in case you'd like to use the Heartbeat image built from source.

Summary

This PR contributes to elastic/beats#29681 as it makes the synthetics runner work with the ubuntu images we will use.

After this PR, the Heartbeat container should be able to execute the synthetics runner.

The problem in detail

The Heartbeat container wasn't able to run synthetics to completion, as the runner would hang indefinitely.

It hung because the image lacked the graphics library (mesagl) that Chromium needs for hardware acceleration (more specifically, Chromium seems to use osmesa for off-screen rendering). Without that library, newPage hangs if it is called too soon after the browser launches.

⚠️ The real problem, it seems, is not the lack of GL itself: the default GL implementation causes the newPage method to hang if we call it immediately after launching the browser. With mesa that method apparently would not hang, but I still have to test that manually.

When that library wasn't available and hardware acceleration was turned on, creating a new page immediately after launching the browser would hang indefinitely. Adding an await new Promise(resolve => setTimeout(resolve, 5000)) statement after the browser launch was enough to make journeys work, because it let newPage return instead of hanging.
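One way to surface a hang like this, instead of waiting forever, is to race the call against a timer. The sketch below is not part of this PR: withTimeout is a hypothetical helper, and the commented usage assumes a browser object obtained from Playwright.

```javascript
// Hypothetical helper (not part of this PR): rejects if `promise` has not
// settled within `ms` milliseconds, so a hanging call fails fast.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever promise settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch, assuming `browser` came from Playwright's chromium.launch():
//   const page = await withTimeout(browser.newPage(), 10000, 'newPage');
```

A timeout only reports the hang; it does not fix the underlying cause described above.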

Therefore, I have disabled hardware acceleration by launching Chromium with the --disable-gpu flag, which makes it work fine in our Ubuntu image. I decided to use that flag rather than changing the base image because that's the exact fix that was applied in Playwright's master branch.

⚠️ This Chromium mailing list thread indicates that the flag is necessary to run within Linux containers.

How to test this PR

  1. Build the Heartbeat image (your cwd must be the x-pack/heartbeat folder) using env PLATFORMS="+all linux/amd64" mage package
  2. Run the image, overriding the entrypoint, with a script similar to the one below
    #!/bin/bash
    docker run \
      -it \
      --entrypoint=/bin/bash \
      --rm \
      --name=heartbeat \
      --user=heartbeat \
      --volume="$PWD/heartbeat.yml:/usr/share/heartbeat/heartbeat.yml:ro" \
      --volume="$PWD/monitors.d:/usr/share/heartbeat/monitors.d:ro" \
      --volume="/your_path_to_the_synthetics_module/synthetics:/usr/share/synthetics" \
      docker.elastic.co/beats/heartbeat:8.1.0
    
    This will execute the image you've just built.
  3. Once inside the container, you can either build the mounted synthetics module and run dist/cli.js or npm link it, and then run heartbeat, which will cause it to use the locally linked module.

The process is the same for elastic-agent, just make sure you change the image name to elastic-agent-complete.

How I debugged it (in case it's helpful for others in the future):

  • When executing the runner within the container, build it in a mounted directory and run dist/cli.js with node (whenever you change a file, make sure to npm run build).
  • Expose the container's port 9229 with the docker argument -p 9229:9229 so that you can connect to the Node debugger from chrome://inspect on localhost when running node --inspect-brk=0.0.0.0 dist/cli.js my_synthetic_path. This is how I figured out that simply waiting keeps the process from hanging: stepping through the lines gave the browser enough time to initialize.

apmmachine commented Jan 11, 2022

💚 Build Succeeded

Build stats

  • Start Time: 2022-01-13T17:13:40.507+0000

  • Duration: 13 min 27 sec

  • Commit: 6cfcecf

Test stats 🧪

Test Results
Failed 0
Passed 146
Skipped 2
Total 148

@vigneshshanmugam vigneshshanmugam left a comment

Discussed offline with @lucasfcosta. We are going to do the following:

  • Check whether disable-gpu works on both images.
  • If it does, check whether the trace numbers differ with and without the flag.
  • Apply the flag only in headless mode.
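Restricting the flag to headless runs could look roughly like the sketch below. buildLaunchOptions is a hypothetical helper, not the code in this PR; headless and args are existing options of Playwright's chromium.launch(), and the commented usage assumes the playwright package is installed.

```javascript
// Hypothetical helper: adds --disable-gpu only for headless runs,
// leaving headful launches untouched.
function buildLaunchOptions({ headless = true, args = [] } = {}) {
  const extraArgs = headless ? ['--disable-gpu'] : [];
  return { headless, args: [...args, ...extraArgs] };
}

// Usage sketch, assuming Playwright is available:
//   const { chromium } = require('playwright');
//   const browser = await chromium.launch(buildLaunchOptions({ headless: true }));
```

Keeping the decision in one small function makes it easy to assert on the resulting arguments without starting a browser.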

lucasfcosta commented Jan 13, 2022

@vigneshshanmugam I have now:

  • Tested this PR with the 7.16.3 images and they work fine.
    • for the old Heartbeat image
    • for the old Elastic Agent Image
  • I have now updated this PR so that we only pass the disable-gpu flag for headless runs (@vigneshshanmugam, can you also double-check these tests? We can only run them with an actual wsEndpoint, so they need a conditional execution flow).
  • I have gathered trace metrics by:
    1. Running a journey which visits www.blueyard.com (a site which uses WebGL heavily) using the --rich-events flag
    2. Piping those results into jq and filtering the metric outputs
      The resulting command was:
      synthetics --rich-events ../synth-example | jq -c 'select(.type == "step/metrics")' > /tmp/with-gpu-enabled.json

Here are the results I've obtained for GPU enabled and disabled settings:

with-gpu-enabled.json

{"type":"step/metrics","@timestamp":1642072821592514.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"navigationStart","type":"mark","start":{"us":10673903804}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821592701,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"firstContentfulPaint","type":"mark","start":{"us":10674706722}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821592801.8,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"largestContentfulPaint","type":"mark","start":{"us":10675141846}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821592873.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"domContentLoaded","type":"mark","start":{"us":10675776934}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821593010.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"loadEvent","type":"mark","start":{"us":10685391016}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821593085,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"layoutShift","type":"mark","start":{"us":10677093709},"score":2.2897720336914065e-06}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821593172.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"experience":{"cls":2.2897720336914065e-06,"fcp":{"us":802918},"lcp":{"us":1238042},"dcl":{"us":1873130},"load":{"us":11487212}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}

with-gpu-disabled.json

{"type":"step/metrics","@timestamp":1642072775738702.2,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"navigationStart","type":"mark","start":{"us":10628714031}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738786.8,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"firstContentfulPaint","type":"mark","start":{"us":10629456146}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738829.2,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"largestContentfulPaint","type":"mark","start":{"us":10629827301}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738869.8,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"domContentLoaded","type":"mark","start":{"us":10630517348}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738899.2,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"loadEvent","type":"mark","start":{"us":10639842605}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738926,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"layoutShift","type":"mark","start":{"us":10631345774},"score":2.2009920535816084e-06}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738955.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"experience":{"cls":2.2009920535816084e-06,"fcp":{"us":742115},"lcp":{"us":1113270},"dcl":{"us":1803317},"load":{"us":11128574}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}

The experience summary in a table:

             With GPU Enabled            With GPU Disabled
CLS          0.0000022897720336914065    0.0000022009920535816084
FCP (µs)     802918                      742115
DCL (µs)     1873130                     1803317
LOAD (µs)    11487212                    11128574
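To put these numbers in relative terms, the sketch below computes the percentage change from the GPU-enabled run to the GPU-disabled run. The values are copied from the two experience summaries in the JSON outputs above (including LCP, which the table omits); percentChange is a hypothetical helper introduced only for this illustration.

```javascript
// Experience summaries copied from the two JSON outputs above (microseconds).
const enabled = { fcp: 802918, lcp: 1238042, dcl: 1873130, load: 11487212 };
const disabled = { fcp: 742115, lcp: 1113270, dcl: 1803317, load: 11128574 };

// Hypothetical helper: percentage change from `before` to `after`.
const percentChange = (before, after) => ((after - before) / before) * 100;

for (const metric of Object.keys(enabled)) {
  const change = percentChange(enabled[metric], disabled[metric]);
  console.log(`${metric}: ${change.toFixed(2)}%`);
}
```

With these values, each timing comes out roughly 3% to 10% lower with the GPU disabled. This is a single run per setting, so it says little about run-to-run variation.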

(wsEndpoint ? it : it.skip)(
  'does not the disable-gpu flag to start browser when running headful',
  async () => {
    if (!wsEndpoint) return;

Member

Having this early return should be enough to skip the tests, right? Do we still need the it / it.skip switch?

Contributor Author

True, that should be enough; the it / it.skip switch can be removed.

@vigneshshanmugam vigneshshanmugam left a comment

Thanks for the detailed testing on the images and the detailed information 🎉

@lucasfcosta
Contributor Author

As mentioned by @vigneshshanmugam we'll drop further analysis on these given there's not much variation between runs and that they don't make much sense in headless mode:

@lucasfcosta Let's drop the perf measurement. I just realized it doesn't make much sense in headless mode, since lots of optimisations are dropped anyway. Apologies for not thinking this through.

Will merge as soon as CI passes and release.

@lucasfcosta lucasfcosta merged commit f8e3df1 into elastic:master Jan 13, 2022
@lucasfcosta lucasfcosta deleted the fix-ubuntu-runs branch January 13, 2022 17:27
@lucasfcosta
Contributor Author

@vigneshshanmugam just before we push this forward: I reran the comparison with the enabled/disabled settings, 10 runs each, and got bigger differences this time. That may indicate my last run didn't actually rebuild the synthetics executable for each setting, which would explain why the results were so similar.

The difference is still small, but I wouldn't call it insignificant, so I wanted to confirm that you're still okay with releasing. (CC @andrewvc)

Results are in ms where applicable (so not for CLS).

[Screenshot 2022-01-13 at 18 13 07]

Below you can see the testing procedure to obtain the results above

Testing procedure

  1. I had two versions of the gatherer.ts file, one which disables the GPU and one which doesn't.
  2. I ran npm link for each, which causes a rebuild and re-link.
  3. I ran the script below to save files to /tmp (I renamed disabled to enabled for when GPU acceleration was enabled).
    for i in {0..9}
    do
      echo "running $i"
      synthetics ../synth-example --rich-events > /tmp/disabled-results-$i.json
    done
    
  4. Once results were saved, I ran the following script to calculate averages.
    const fs = require('fs');

    // Parse a file of newline-delimited JSON events, skipping blank lines.
    const readMetrics = path =>
      fs
        .readFileSync(path, { encoding: 'utf8' })
        .split('\n')
        .filter(line => line.trim())
        .map(line => JSON.parse(line));

    const disabledFilename = i => `/tmp/disabled-results-${i}.json`;
    const enabledFilename = i => `/tmp/enabled-results-${i}.json`;

    // Average of the values selected by `pick` across all summaries.
    const average = (summaries, pick) =>
      summaries.reduce((acc, s) => acc + pick(s), 0) / summaries.length;

    const getResults = filenameFn => {
      // Collect the experience summary of each of the 10 runs.
      const experienceSummaries = [];
      for (let i = 0; i <= 9; i++) {
        const metrics = readMetrics(filenameFn(i));
        const exp = metrics.find(
          metricData => metricData.root_fields?.browser?.experience
        );
        experienceSummaries.push(exp.root_fields.browser.experience);
      }

      // CLS is unitless; timings are converted from microseconds to milliseconds.
      const avgCls = average(experienceSummaries, s => s.cls);
      const avgFcp = average(experienceSummaries, s => s.fcp.us) / 1000;
      const avgLcp = average(experienceSummaries, s => s.lcp.us) / 1000;
      const avgDcl = average(experienceSummaries, s => s.dcl.us) / 1000;
      const avgLoad = average(experienceSummaries, s => s.load.us) / 1000;
      return { avgCls, avgFcp, avgLcp, avgDcl, avgLoad };
    };

    const disabledResults = getResults(disabledFilename);
    const enabledResults = getResults(enabledFilename);

    console.log("Disabled");
    console.table(disabledResults);

    console.log("Enabled");
    console.table(enabledResults);

Here's the journey being run by synthetics for this test:

const { step, journey } = require('@elastic/synthetics');

journey('my cool journey', ({ page }) => {
  step('visit website', async () => {
    await page.goto('https://www.blueyard.com/');
  });
});

@lucasfcosta
Contributor Author

@vigneshshanmugam, @andrewvc, and I have discussed the performance differences described in the comment above.

We've decided to release the package as is.

We've also reached the following conclusions:

  • These measurements are supposed to be compared to each other, not be taken as absolute values. Otherwise, the testing would not accurately represent a particular platform the users are targeting.
  • We will investigate whether we can enable GPU acceleration on our Ubuntu images. @vigneshshanmugam suggested this reference material for that: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#chrome-headless-doesnt-launch-on-unix
  • We will check whether docker containers have GPU access. If not, that means we've never actually been able to use GPU accelerated Chromium instances.
  • We will timebox the investigation on graphic libraries and the Docker GPU access to half a day.

lucasfcosta commented Jan 17, 2022

As a side note, I've just run the same procedure outlined in this comment to benchmark journeys with and without GPU acceleration within a Docker container. For that, I've used Heartbeat's container with a locally built version of the runner.

The results I got this time demonstrate that there's no significant difference between the 10 averaged runs for journeys with and without GPU acceleration.
[Screenshot 2022-01-17 at 10 41 57]
[Screenshot 2022-01-17 at 10 33 52]
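
Whether a difference between two averages is meaningful depends on how much the individual runs vary. As a rough sketch (not something done in this PR), each average could be reported together with its sample standard deviation. summarize is a hypothetical helper; the commented usage assumes the experienceSummaries array built by the averaging script.

```javascript
// Hypothetical helper: mean and sample standard deviation of a list of numbers.
function summarize(values) {
  const n = values.length;
  if (n < 2) throw new Error('need at least two values');
  const mean = values.reduce((acc, v) => acc + v, 0) / n;
  // Sample variance divides by n - 1 rather than n.
  const variance =
    values.reduce((acc, v) => acc + (v - mean) ** 2, 0) / (n - 1);
  return { mean, stdDev: Math.sqrt(variance) };
}

// Usage sketch with the `experienceSummaries` array from the averaging script:
//   console.table(summarize(experienceSummaries.map(s => s.fcp.us / 1000)));
```

If the gap between the two averages is small compared with the standard deviations, the runs do not clearly separate the two settings.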
