Stop runner from hanging indefinitely within ubuntu docker images [elastic/beats#29681] #441

lucasfcosta · 2022-01-11T18:11:20Z

⚠️ There is more E2E test information here in case you'd like to use the Heartbeat image built from source.

Summary

This PR contributes to elastic/beats#29681 as it makes the synthetics runner work with the ubuntu images we will use.

After this PR, the Heartbeat container should be able to execute the synthetics runner.

The problem in detail

The Heartbeat container wasn't able to run synthetics to completion, as the runner would hang indefinitely.

It couldn't do so because the image didn't have the graphics library (mesagl) that Chromium needs for hardware acceleration (more specifically, it seems Chromium uses osmesa for off-screen rendering), and without it newPage hangs if called too soon.

⚠️ The real problem, it seems, is not the lack of GL itself: the default GL implementation causes the newPage method to hang if we call it immediately after launching the browser. With mesa that method apparently would not hang, but I still have to test that manually.

When that library wasn't available and hardware acceleration was turned on, creating a new page immediately after launching the browser would hang indefinitely. Just adding an await new Promise(resolve => setTimeout(resolve, 5000)) statement after the browser launch made journeys work, because it gave the browser enough time for newPage to return instead of hanging.

Therefore, I have disabled hardware acceleration by launching Chromium with the --disable-gpu flag, which makes it work fine in our Ubuntu image. I decided to use that flag rather than changing the base image because that's the exact fix that was applied in Playwright's master branch.

⚠️ You can see in this chromium mailing list thread that this flag is needed to run within Linux containers.
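As a rough illustration of the fix described above, the sketch below builds Chromium launch options that include --disable-gpu. The helper name buildLaunchOptions and its parameters are hypothetical and are not taken from the runner's code; the headless condition mirrors the restriction adopted later in this thread.

```javascript
// Hypothetical helper: builds Chromium launch options for Playwright.
// `buildLaunchOptions`, `headless` and `extraArgs` are illustrative names,
// not the runner's real API.
function buildLaunchOptions({ headless = true, extraArgs = [] } = {}) {
  const args = [...extraArgs];
  // Add the flag only for headless runs, and never add it twice
  if (headless && !args.includes('--disable-gpu')) {
    args.push('--disable-gpu');
  }
  return { headless, args };
}

// Usage with Playwright (requires the `playwright` package to be installed):
//   const { chromium } = require('playwright');
//   const browser = await chromium.launch(buildLaunchOptions({ headless: true }));

console.log(buildLaunchOptions({ headless: true }));
console.log(buildLaunchOptions({ headless: false }));
```

Keeping the decision in a small pure function makes it easy to unit test without starting a browser.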

How to test this PR

Build the Heartbeat image (your cwd must be the x-pack/heartbeat folder) using env PLATFORMS="+all linux/amd64" mage package

Run the image, overriding the entrypoint, with a script similar to the one below:

#!/bin/bash
docker run \
  -it \
  --entrypoint=/bin/bash \
  --rm \
  --name=heartbeat \
  --user=heartbeat \
  --volume="$PWD/heartbeat.yml:/usr/share/heartbeat/heartbeat.yml:ro" \
  --volume="$PWD/monitors.d:/usr/share/heartbeat/monitors.d:ro" \
  --volume="/your_path_to_the_synthetics_module/synthetics:/usr/share/synthetics" \
  docker.elastic.co/beats/heartbeat:8.1.0

This will execute the image you've just built.

Once inside the container, you can either build the mounted synthetics module and run dist/cli.js or npm link it, and then run heartbeat, which will cause it to use the locally linked module.

The process is the same for elastic-agent, just make sure you change the image name to elastic-agent-complete.

How I debugged it (in case it's helpful for others in the future):

When executing the runner within the container, build the mounted directory and run dist/cli.js with node (whenever you change a file, make sure to npm run build).
Expose the container's 9229 port through the docker arg -p 9229:9229 so that you can connect to the Node debugger in chrome://inspect from localhost when running node --inspect-brk=0.0.0.0 dist/cli.js my_synthetic_path (this is how I figured out that simply waiting kept the process from hanging: stepping through lines gave the browser enough time to initialize).
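When reproducing a hang like this one, it can help to make the hang fail loudly instead of blocking forever. The sketch below is not part of this PR; withTimeout is a hypothetical helper that rejects if a promise (for example the one returned by newPage) does not settle in time.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} did not settle within ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; the timer is always cleared afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch (assumes a Playwright `browser` object is in scope):
//   const page = await withTimeout(browser.newPage(), 10000, 'newPage');
```

With a wrapper like this, a stuck call surfaces as an error with a label, which is easier to spot in container logs than a process that never exits.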

apmmachine · 2022-01-11T18:18:23Z

💚 Build Succeeded

Build stats

Start Time: 2022-01-13T17:13:40.507+0000
Duration: 13 min 27 sec
Commit: 6cfcecf

Test stats 🧪

Test	Results
Failed	0
Passed	146
Skipped	2
Total	148

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.

vigneshshanmugam

Discussed offline with @lucasfcosta. We are going to do the following:

- Check if disable-gpu can work on both images.
- If yes, check whether the trace numbers differ with and without the flag.
- Do this only for headless mode.

lucasfcosta · 2022-01-13T13:11:25Z

@vigneshshanmugam I have now:

- Tested this PR with the 7.16.3 images, and they work fine:
  - for the old Heartbeat image
  - for the old Elastic Agent image
- Updated this PR so that we only pass the disable-gpu flag for headless runs (@vigneshshanmugam can you also double check these tests? We can only run them with an actual wsEndpoint and therefore must have a conditional execution flow).
- Gathered trace metrics by:
  1. Running a journey which visits www.blueyard.com (a site which uses WebGL heavily) using the --rich-events flag
  2. Piping those results into jq and filtering the metric outputs

  The resulting command was:
  synthetics --rich-events ../synth-example | jq -c 'select(.type == "step/metrics")' > /tmp/with-gpu-enabled.json
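For readers without jq, the same filter can be approximated in Node. This is a minimal sketch that assumes the runner output is newline-delimited JSON; filterEvents and the sample lines are made up for illustration.

```javascript
// Keeps only events of the given type from newline-delimited JSON text.
// `filterEvents` is a hypothetical helper, not part of the synthetics CLI.
function filterEvents(ndjson, type = 'step/metrics') {
  return ndjson
    .split('\n')
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line))
    .filter(event => event.type === type);
}

// Illustrative sample input, not real runner output
const sample = [
  '{"type":"journey/start","journey":{"name":"my cool journey"}}',
  '{"type":"step/metrics","step":{"name":"google","index":1}}',
].join('\n');

for (const event of filterEvents(sample)) {
  console.log(JSON.stringify(event));
}
```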

Here are the results I've obtained for GPU enabled and disabled settings:

with-gpu-enabled.json

{"type":"step/metrics","@timestamp":1642072821592514.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"navigationStart","type":"mark","start":{"us":10673903804}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821592701,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"firstContentfulPaint","type":"mark","start":{"us":10674706722}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821592801.8,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"largestContentfulPaint","type":"mark","start":{"us":10675141846}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821592873.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"domContentLoaded","type":"mark","start":{"us":10675776934}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821593010.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"loadEvent","type":"mark","start":{"us":10685391016}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821593085,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"layoutShift","type":"mark","start":{"us":10677093709},"score":2.2897720336914065e-06}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072821593172.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"experience":{"cls":2.2897720336914065e-06,"fcp":{"us":802918},"lcp":{"us":1238042},"dcl":{"us":1873130},"load":{"us":11487212}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}

with-gpu-disabled.json

{"type":"step/metrics","@timestamp":1642072775738702.2,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"navigationStart","type":"mark","start":{"us":10628714031}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738786.8,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"firstContentfulPaint","type":"mark","start":{"us":10629456146}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738829.2,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"largestContentfulPaint","type":"mark","start":{"us":10629827301}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738869.8,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"domContentLoaded","type":"mark","start":{"us":10630517348}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738899.2,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"loadEvent","type":"mark","start":{"us":10639842605}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738926,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"relative_trace":{"name":"layoutShift","type":"mark","start":{"us":10631345774},"score":2.2009920535816084e-06}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}
{"type":"step/metrics","@timestamp":1642072775738955.5,"journey":{"name":"my cool journey","id":"my cool journey"},"step":{"name":"google","index":1},"root_fields":{"browser":{"experience":{"cls":2.2009920535816084e-06,"fcp":{"us":742115},"lcp":{"us":1113270},"dcl":{"us":1803317},"load":{"us":11128574}}},"os":{"platform":"darwin"},"package":{"name":"@elastic/synthetics","version":"1.0.0-beta.18"}},"package_version":"1.0.0-beta.18"}

The experience summary in a table:

Metric	With GPU Enabled	With GPU Disabled
CLS	0.0000022897720336914065	0.0000022009920535816084
FCP (micros)	802918	742115
LCP (micros)	1238042	1113270
DCL (micros)	1873130	1803317
LOAD (micros)	11487212	11128574
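To put the table in perspective, the relative change between the two settings can be computed directly. The sketch below uses the FCP, DCL and LOAD values from the table above; percentChange is a hypothetical helper name.

```javascript
// Values copied from the experience summary table, in microseconds.
const enabled = { fcp: 802918, dcl: 1873130, load: 11487212 };
const disabled = { fcp: 742115, dcl: 1803317, load: 11128574 };

// Relative change of `after` versus `before`, as a percentage.
function percentChange(before, after) {
  return ((after - before) / before) * 100;
}

for (const metric of Object.keys(enabled)) {
  const change = percentChange(enabled[metric], disabled[metric]);
  console.log(`${metric}: ${change.toFixed(2)}%`);
}
```

For this single run the GPU-disabled timings come out roughly 3% to 8% lower, which is too little data to draw a conclusion from on its own.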

vigneshshanmugam · 2022-01-13T16:43:10Z

__tests__/core/gatherer.test.ts

+  (wsEndpoint ? it : it.skip)(
+    'does not use the disable-gpu flag to start browser when running headful',
+    async () => {
+      if (!wsEndpoint) return;


Having this should be enough to skip the tests right? Do we need it or it.skip?

True, it should be enough, it can be removed.

vigneshshanmugam

Thanks for the detailed testing on the images and the detailed information 🎉

lucasfcosta · 2022-01-13T16:56:34Z

As mentioned by @vigneshshanmugam, we'll drop further analysis of these metrics, given there's not much variation between runs and they don't make much sense in headless mode:

@lucasfcosta Let's drop the perf measurement, I just realized they don't make much sense in headless mode. Lots of optimisations are dropped anyway. Apologies for not thinking through much here.

Will merge as soon as CI passes and release.

lucasfcosta · 2022-01-13T18:22:46Z

@vigneshshanmugam just before we push this forward: I did a rerun of 10 runs each with the enabled/disabled settings and got bigger differences this time. That may indicate my last run didn't actually rebuild the synthetics executable for each setting, which is why the results were so similar.

The difference is still small, but I wouldn't say it's insignificant, so I wanted to confirm that you're still okay with releasing. (CC @andrewvc)

Results are in ms when applicable (therefore not for CLS).

Below you can see the testing procedure used to obtain the results above.

Testing procedure

I had two versions of the gatherer.ts file, one which disables the GPU and one which doesn't.
I ran npm link for each, which causes a rebuild and re-link.

I ran the script below to save files to /tmp (I renamed disabled to enabled for when GPU acceleration was enabled).

for i in {0..9}
do
  echo "running $i"
  synthetics ../synth-example --rich-events > /tmp/disabled-results-$i.json
done

Once results were saved, I ran the following script to calculate averages.

const fs = require('fs');

const readMetrics = path =>
  fs
    .readFileSync(path, { encoding: 'utf8' })
    .split('\n')
    .filter(v => !!v.trim())
    .map(line => line && JSON.parse(line));

const disabledFilename = i => `/tmp/disabled-results-${i}.json`;
const enabledFilename = i => `/tmp/enabled-results-${i}.json`;

const getResults = filenameFn => {
  let experienceSummaries = [];
  for (let i = 0; i <= 9; i++) {
    const metrics = readMetrics(filenameFn(i));
    const exp = metrics.find(
      metricData => metricData.root_fields?.browser?.experience
    );
    experienceSummaries.push(exp.root_fields.browser.experience);
  }

  const avgCls = experienceSummaries.reduce((acc, s) => s.cls + acc, 0) / experienceSummaries.length;
  const avgFcp = experienceSummaries.reduce((acc, s) => s.fcp.us + acc, 0) / experienceSummaries.length / 1000;
  const avgLcp = experienceSummaries.reduce((acc, s) => s.lcp.us + acc, 0) / experienceSummaries.length / 1000;
  const avgDcl = experienceSummaries.reduce((acc, s) => s.dcl.us + acc, 0) / experienceSummaries.length / 1000;
  // Average the load metric (the original line summed fcp here by mistake)
  const avgLoad = experienceSummaries.reduce((acc, s) => s.load.us + acc, 0) / experienceSummaries.length / 1000;
  return { avgCls, avgFcp, avgLcp, avgDcl, avgLoad };
};

const disabledResults = getResults(disabledFilename);
const enabledResults = getResults(enabledFilename);

console.log("Disabled");
console.table(disabledResults);

console.log("Enabled");
console.table(enabledResults);
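Averages alone do not show how noisy the runs are. As a possible extension, not something done in this PR, the sketch below reports a sample standard deviation next to the mean, which helps judge whether a small difference between settings is meaningful. The helper names and the sample values are made up for illustration.

```javascript
// Arithmetic mean of a list of numbers.
const mean = values => values.reduce((acc, v) => acc + v, 0) / values.length;

// Sample standard deviation (n - 1 denominator); needs at least two values.
function sampleStdDev(values) {
  const m = mean(values);
  const squared = values.reduce((acc, v) => acc + (v - m) ** 2, 0);
  return Math.sqrt(squared / (values.length - 1));
}

// Summarises one metric (in microseconds) across runs, reported in ms.
function summarize(samplesUs) {
  return {
    meanMs: mean(samplesUs) / 1000,
    stdDevMs: sampleStdDev(samplesUs) / 1000,
  };
}

// Made-up FCP samples, for illustration only
console.table(summarize([800000, 750000, 820000, 790000]));
```

If the gap between the enabled and disabled means is smaller than the spread within each setting, the difference is probably noise.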

Here's the journey being run by synthetics for this test:

const { step, journey } = require('@elastic/synthetics');

journey('my cool journey', ({ page }) => {
  step('visit website', async () => {
    await page.goto('https://www.blueyard.com/');
  });
});

lucasfcosta · 2022-01-13T19:08:19Z

@vigneshshanmugam @andrewvc and I have discussed the differences in performance in the comment above.

⭐ We've decided to release the package as is.

We've also reached the following conclusions:

- These measurements are supposed to be compared to each other, not be taken as absolute values. Otherwise, the testing would not accurately represent a particular platform the users are targeting.
- We will investigate whether we can enable GPU acceleration on our Ubuntu images. @vigneshshanmugam suggested this reference material for that: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#chrome-headless-doesnt-launch-on-unix
- We will check whether docker containers have GPU access. If not, that means we've never actually been able to use GPU accelerated Chromium instances.
- We will timebox the investigation on graphic libraries and the Docker GPU access to half a day.

lucasfcosta · 2022-01-17T10:43:54Z

As a side note, I've just run the same procedure outlined in this comment to benchmark journeys with and without GPU acceleration within a Docker container. For that, I've used Heartbeat's container with a locally built version of the runner.

The results I got this time demonstrate that there's no significant difference between the 10 averaged runs for journeys with and without GPU acceleration.

lucasfcosta closed this Jan 12, 2022

lucasfcosta force-pushed the fix-ubuntu-runs branch from aafdb49 to df88c62 Compare January 12, 2022 10:26

lucasfcosta reopened this Jan 12, 2022

fix: stop runner from hanging indefinitely within ubuntu docker images [elastic/beats#29681]

5c3eb68

lucasfcosta force-pushed the fix-ubuntu-runs branch from 9e284c2 to 5c3eb68 Compare January 12, 2022 10:34

lucasfcosta mentioned this pull request Jan 12, 2022

[Heartbeat] Missing some i18n fonts in heartbeat docker image elastic/beats#29495

Closed

lucasfcosta marked this pull request as ready for review January 12, 2022 15:38

vigneshshanmugam reviewed Jan 12, 2022

lucasfcosta force-pushed the fix-ubuntu-runs branch from a133c5d to 605e5b9 Compare January 13, 2022 11:16

vigneshshanmugam approved these changes Jan 13, 2022

fix: enable --disable-gpu flag only for headless runs

6cfcecf

lucasfcosta force-pushed the fix-ubuntu-runs branch from 605e5b9 to 6cfcecf Compare January 13, 2022 16:55

lucasfcosta merged commit f8e3df1 into elastic:master Jan 13, 2022

lucasfcosta deleted the fix-ubuntu-runs branch January 13, 2022 17:27

lucasfcosta mentioned this pull request Jan 25, 2022

Add fonts to support more different types of characters for multiple languages elastic/beats#29861

Merged

9 tasks
