-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add summary to journeys which don't emit journey:end (early node subprocess exits) #29606
Conversation
This pull request does not have a backport label. Could you fix it @lucasfcosta? 🙏
NOTE: |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
💚 Flaky test reportTests succeeded. 🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
47a1b96
to
8e67094
Compare
2af725c
to
d5f1daa
Compare
Pinging @elastic/uptime (Team:Uptime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Lucas, Left some comments.
@@ -112,6 +114,9 @@ func (je *journeyEnricher) enrichSynthEvent(event *beat.Event, se *SynthEvent) e | |||
} | |||
|
|||
switch se.Type { | |||
case "cmd/status": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should guard this summary creation event. If the user has afterAll hook that throws error, we would have both journey/end
followed by cmd/status
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a sort of 'task failed successfully' situation. I guess there's a question here, does a failing after hook invalidate the test result at all? It feels weird to, say, trigger an alert based on an afterAll
hook.
I'm wondering if we really would need to introduce a new concept of warnings into the schema / UI. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, Lots of test runners don't consider failures in the afterAll hooks to be failures. They mainly exist to keep some cleanup mechanism. But having warnings in the UI sounds like a cool idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment on failures on afterX
hooks
The afterAll
hook triggering a failure is a tricky situation. I previously did some work on Jest's hooks here (jestjs/jest#8654) where this same discussion has come up.
Given how complex the discussion on hooks semantics is, we didn't get to a conclusion in that PR (which is why it's still open and not much has happened since then).
Anyway, I do feel like failures in afterHooks
should be test failures. If you think about how hooks could be implemented using a procedural API under the hood, they end up being part of the upper test scope, and thus would trigger failures. For inspiration, see Deno's RFC on their testing API: denoland/deno#10771.
Proposed way forward
I'd consider that failing a journey would be out of the scope of this PR if you agree.
For now I'll follow @vigneshshanmugam suggestion of having both events given there's not necessarily success/failure tied to that (as journey/end
indicates the journey for the hook finished anyway). However, **the cmd/status
event will not include a summary as it refers to a journey itself.
IMO, moving forward, it would be great to have warnings based on cmd/status
events as @andrewvc mentioned too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider that failing a journey would be out of the scope of this PR if you agree.
++, Lets do this separately and open a issue in Synthetics runner.
Anyway, I do feel like failures in afterHooks should be test failures.
Mainly, in the runner i didn't go with this idea as we didn't have a good way of reporting the error. Warnings
would make a lot of sense here.
}) | ||
// We want one of the test journeys to end with a cmd/status indicating it failed | ||
if jIdx != 4 { | ||
testEvents = append(testEvents, &SynthEvent{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per previous comment, lets add both and see what happens with summary creation.
@@ -243,7 +244,7 @@ func lineToSynthEventFactory(typ string) func(bytes []byte, text string) (res *S | |||
logp.Info("%s: %s", typ, text) | |||
return &SynthEvent{ | |||
Type: typ, | |||
TimestampEpochMicros: float64(time.Now().UnixNano() / int64(time.Millisecond)), | |||
TimestampEpochMicros: float64(time.Now().UnixMicro()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about the intended consequence of this change, I will leave it to @andrewvc who knows it better than me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is because timestamps calculated in that way would yield negative values, and thus not appear in my queries (I was filtering for events in the last X years).
UnixMicro
is directly what we want, so it also makes the code a bit more concise, unless I'm missing any other behaviours from the stdlib
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had no idea the UnixMicro
method existed, As far as I'm concerned this is a nice improvement, so I'm +1 on fixing this here.
@vigneshshanmugam @andrewvc: I've just pushed an update to this PR which I've tested very thoroughly and for which I've also added a significant amount of new unit testing, but I'd like you to please make sure that:
How to test the creation of summary documents when
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thanks for adding extra guards and the tests 🎉
Also, I love reading the detailed descriptions. Super useful and establishes the context nicely.
// If a command failed _after_ the journey was complete, as it happens | ||
// when an `afterAll` hook fails, for example, we don't wan't to include | ||
// a summary in the cmd/status event. | ||
if je.journeyComplete == false { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: !je.journeyComplete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! Thank you @vigneshshanmugam ❤️
d55b4f5
to
a5264c2
Compare
…b-for-macos * upstream/master: (172 commits) [Elastic Agent] Fix issue with ensureServiceToken. (elastic#29800) [Winlogbeat] Add provider name to Security routing pipeline check (elastic#29781) Add summary to journeys which don't emit journey:end (early node subprocess exits) (elastic#29606) Prepare 8.0.0-rc1 changelog (elastic#29795) (elastic#29806) Change docker image from CentOS 7 to Ubuntu 20.04 (elastic#29681) libbeat/processors/add_process_metadata: implement a process cache eviction policy (elastic#29717) [Automation] Update elastic stack version to 8.1.0-7004acda for testing (elastic#29783) Missing changelog entry for elastic#29773 (elastic#29791) Add a readme for k8s autodiscover provider (elastic#28213) Remove overriding of index pattern on the Kubernetes overview dashboard (elastic#29676) jjbb: remove obsoleted branches (<7.16) (elastic#29707) Add k8s metadata in state_cronjob metricset (elastic#29572) ibmmq: Fix timestamp parsing (elastic#29773) Do not add date to index if `@meta.index` is set (elastic#29775) ci: uses aliases for the branches (elastic#29706) Filebeat tests: Restore `@timestamp` field validation (elastic#29772) Forward port 7.16.3 changelog to master (elastic#29777) auditd: Store program arguments in process.args array (elastic#29601) System/socket: Support kernel_clone() replacement for _do_fork() (elastic#29744) Do not mention removal if version is not specified in `cfgwarn` messages (elastic#29727) ...
What does this PR do?
Fixes #28770.
This PR ensures that Heartbeat will add a summary to journeys in which the
node
subprocess for thesynthetics
runner crashes before emitting ajourney:end
event.This change will ensure that all journeys will contain a summary, even the ones which cause the runner itself to exit earlier.
To better understand how Heartbeat interfaces with the
synthetics
runner, and thus understand this fix, see the section below.Understanding this fix: how Heartbeat works with
elastic-synthetics
Before I explain how to test this PR, it's worth explaining how Heartbeat and the Synthetics runner work together.
Currently,
heartbeat
will use theelastic-synthetics
executable, which comes from thesynthetics
runner project, to runinline
journeys (like the ones you define within aheartbeat.yml
file). This executable will (or at least should) be available in thePATH
.beats/x-pack/heartbeat/monitors/browser/synthexec/synthexec.go
Line 69 in bea8e45
See the executable name defined here:
https://github.com/elastic/synthetics/blob/ea6eb357953e5a4c5ec3cd5a1ab031a03a6c918a/package.json#L22
When running the
inline
journey using theelastic-synthetics
executable, it will pass it the--rich-events
flag, which causes the runner itself to emit rich JSON events.beats/x-pack/heartbeat/monitors/browser/synthexec/synthexec.go
Line 123 in bea8e45
Heartbeat can then easily use these JSON events to create
SynthEvents
, which it will then try to "enrich".beats/x-pack/heartbeat/monitors/browser/synthexec/synthexec.go
Lines 240 to 252 in bea8e45
To see the runner emitting these events, try creating a simple journey, like the one below:
And then, run it using
cat ~/Repositories/synthetics/test/test.js | npx @elastic/synthetics --inline --rich-events
. You will see that for each part of the journey execution, it will emit a different type of event, as defined in the code excerpt below, which comes from the recorder:https://github.com/elastic/synthetics/blob/30c536e53d93495fa8824e31ffd99c01a95c0376/src/core/runner.ts#L117-L141
For each of these events, Heartbeat would try to "enrich" them with more data:
beats/x-pack/heartbeat/monitors/browser/synthexec/enrich.go
Lines 104 to 146 in bea8e45
Especially for
journey:end
, Heartbeat would try to add asummary
to the emitted document, so that we know data about the whole run.beats/x-pack/heartbeat/monitors/browser/synthexec/enrich.go
Lines 148 to 179 in bea8e45
Now, the problem we've had before this PR is that journeys which cause the
elastic-synthetics
executable (ran in anode
subprocess) to crash, would not emit ajourney:end
event. Consequently, we wouldn't have asummary
for those journeys.You can see that no
journey:end
event is emitted by the synthetics runner when you run the following journey, for example:After this change, we'll use the
cmd/status
SynthEvent
also as a type of event which can be enriched with asummary
.beats/x-pack/heartbeat/monitors/browser/synthexec/synthexec.go
Lines 196 to 202 in 1386a04
Now, a critical part of this change was that we need a timestamp for the summary events when journeys fail and we emit a
cmd/status
. We need this timestamp so that we can set theend
of the journey.beats/x-pack/heartbeat/monitors/browser/synthexec/enrich.go
Lines 70 to 84 in 1386a04
The problem with the
cmd/status
SynthEvent
was that it did not have aTimestampEpochMicros
field. It did not have this field because we're not parsing@timestamp
fields emitted by thesynthetics-runner
. Instead, we're just constructing aSynthEvent
ourselves.beats/x-pack/heartbeat/monitors/browser/synthexec/synthexec.go
Lines 198 to 201 in 1386a04
I then tried to use the same method we use to generate timestamps for when capturing line events, shown below:
beats/x-pack/heartbeat/monitors/browser/synthexec/synthexec.go
Line 246 in 1386a04
That procedure however, didn't yield correct results, causing timestamps generated in the method below to date back to 1970.
beats/x-pack/heartbeat/monitors/browser/synthexec/synthtypes.go
Lines 89 to 95 in 1386a04
That problem took me a lot of debugging time, until I noticed summaries were just published a long time in the past.
After figuring that out, I simply used the new
UnixMicro
method (as intime.Now().UnixMicro()
) to generate timestamps forcmd/status
events, and it all worked out fine 😄Why is it important?
This change is important because, in Kibana, we perform queries which look at a journey's summary to determine what to display.
Previously, the lack of a
summary
for our journeys has caused a monitor/check-group not to be displayed in Kibana, as pointed out in elastic/kibana#116850 (comment). That problem happened because we were looking for documents with asummary
as a way of indicating whether a monitor/check-group had finished running. However, monitors which caused thenode
subprocess to exit early would never be shown because Heartbeat would never add asummary
document them.Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
I'd appreciate if reviewers could:
How to test this PR locally
journey:end
event by running it with the synthetics runner (you must have built it before running the command below).heartbeat.yml
pointing to your desired Elastic Stack for testing, and paste that journey into an inline monitor:mage build
withinx-pack/heartbeat
).heartbeat.yml
file from step 3Other recommended checks
Uptime > Monitors
pageRelated issues
heartbeat/summary
docs kibana#117228Use cases
This PR allows users to see their monitors in the
Uptime > Monitors
page even when those monitors crash before being finished, which means it's easier to debug what happened to them.Further Changes
As you test this PR, you will notice that the
Uptime > Monitors
page in Kibana will only display the step which did not crash. This behaviour occurs because we don't have astep/end
event for the step which crashed.Additionally, you will notice that journeys with syntax errors do not even start and therefore do not have summary. That's a different case from an early exit, instead, it's not even a start. Similar problems may occur when a journey can't even start for any other reason.
Therefore, I suggest that we create the two following issues:
journey/start
events (or, alternatively,journey/prepare
) even when there are syntax errors or other conditions which prevent journeys from even startingI considered the two issues above to be out of scope for this PR. Please let me know if you agree and if you're happy with the current behaviour.