Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Logging]metrics ops logs include event loop delay histogram data #120451

Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 33 additions & 29 deletions src/core/server/metrics/logging/get_ops_metrics_log.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ const testMetrics = {
memory: { heap: { used_in_bytes: 100 } },
uptime_in_millis: 1500,
event_loop_delay: 50,
event_loop_delay_histogram: { percentiles: { '50': 50, '75': 75, '95': 95, '99': 99 } },
},
os: {
load: {
Expand All @@ -56,7 +57,7 @@ describe('getEcsOpsMetricsLog', () => {
it('provides correctly formatted message', () => {
const result = getEcsOpsMetricsLog(createMockOpsMetrics(testMetrics));
expect(result.message).toMatchInlineSnapshot(
`"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: 50.000"`
`"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] mean delay: 50.000 delay histogram: { 50: 50.000; 95: 95.000; 99: 99.000 }"`
);
});

Expand All @@ -70,46 +71,49 @@ describe('getEcsOpsMetricsLog', () => {
const missingMetrics = {
...baseMetrics,
process: {},
processes: [],
os: {},
} as unknown as OpsMetrics;
const logMeta = getEcsOpsMetricsLog(missingMetrics);
expect(logMeta.message).toMatchInlineSnapshot(`""`);
});

it('provides an ECS-compatible response', () => {
const logMeta = getEcsOpsMetricsLog(createBaseOpsMetrics());
expect(logMeta).toMatchInlineSnapshot(`
const logMeta = getEcsOpsMetricsLog(createMockOpsMetrics(testMetrics));
expect(logMeta.meta).toMatchInlineSnapshot(`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshot was getting too big to easily notice a change, so I've changed this assertion to only test the meta field.

Object {
"message": "memory: 1.0B load: [1.00,1.00,1.00] delay: 1.000",
"meta": Object {
"event": Object {
"category": Array [
"process",
"host",
],
"kind": "metric",
"type": Array [
"info",
],
},
"host": Object {
"os": Object {
"load": Object {
"15m": 1,
"1m": 1,
"5m": 1,
},
"event": Object {
"category": Array [
"process",
"host",
],
"kind": "metric",
"type": Array [
"info",
],
},
"host": Object {
"os": Object {
"load": Object {
"15m": 30,
"1m": 10,
"5m": 20,
},
},
"process": Object {
"eventLoopDelay": 1,
"memory": Object {
"heap": Object {
"usedInBytes": 1,
},
},
"process": Object {
"eventLoopDelay": 50,
"eventLoopDelayHistogram": Object {
"50": 50,
"95": 95,
"99": 99,
},
"memory": Object {
"heap": Object {
"usedInBytes": 100,
},
"uptime": 0,
},
"uptime": 1,
},
}
`);
Expand Down
32 changes: 29 additions & 3 deletions src/core/server/metrics/logging/get_ops_metrics_log.ts
Original file line number Diff line number Diff line change
Expand Up @@ -30,10 +30,29 @@ export function getEcsOpsMetricsLog(metrics: OpsMetrics) {
// HH:mm:ss message format for backward compatibility
const uptimeValMsg = uptimeVal ? `uptime: ${numeral(uptimeVal).format('00:00:00')} ` : '';

// Event loop delay is in ms
// Event loop delay metrics are in ms
const eventLoopDelayVal = process?.event_loop_delay;
const eventLoopDelayValMsg = eventLoopDelayVal
? `delay: ${numeral(process?.event_loop_delay).format('0.000')}`
? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}`
Copy link
Contributor Author

@TinaHeiligers TinaHeiligers Dec 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the histogram percentiles in the log message, we need to let folks know that the first delay entry is the mean.

Copy link
Member

@lukeelmers lukeelmers Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API?

"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"

(Edit: Mostly just trying to think how we can keep this as succinct as possible)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API.

We could but that would be a breaking change for folks ingesting the log message. ATM, if someone was parsing the message string they might expect delay: number and end up with delay: [Object].

Admittedly, this PR also changes the message string but keeps the delay type the same. Appending the additional data seems like a less risky option here.

Admittedly, ingesting log message data isn't the best approach and folks should be using the meta instead. However, we removed meta from all the logs expect for debug mode so the extra info wouldn't be available for other levels.

Copy link
Member

@lukeelmers lukeelmers Dec 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but that would be a breaking change for folks ingesting the log message.

Yeah, although I suppose we are adding mean delay so a grok pattern against just delay could match two results now (both mean and histogram). EDIT: Actually I guess not -- I misread, we have mean delay and delay histogram 🙂

However, in the past we have not considered the message format to be stable, and have made breaking changes to it in minors anyway. So I am not super concerned about this either way.

: '';

const eventLoopDelayPercentiles = process?.event_loop_delay_histogram?.percentiles;

// Extract 50th, 95th and 99th percentiles for log meta
const eventLoopDelayHistVals = eventLoopDelayPercentiles
? {
50: eventLoopDelayPercentiles[50],
95: eventLoopDelayPercentiles[95],
99: eventLoopDelayPercentiles[99],
}
: undefined;
// Format message from 50th, 95th and 99th percentiles
const eventLoopDelayHistMsg = eventLoopDelayPercentiles
? ` delay histogram: { 50: ${numeral(eventLoopDelayPercentiles['50']).format(
'0.000'
)}; 95: ${numeral(eventLoopDelayPercentiles['95']).format('0.000')}; 99: ${numeral(
eventLoopDelayPercentiles['99']
).format('0.000')} }`
: '';

const loadEntries = {
Expand Down Expand Up @@ -65,6 +84,7 @@ export function getEcsOpsMetricsLog(metrics: OpsMetrics) {
},
},
eventLoopDelay: eventLoopDelayVal,
eventLoopDelayHistogram: eventLoopDelayHistVals,
},
host: {
os: {
Expand All @@ -75,7 +95,13 @@ export function getEcsOpsMetricsLog(metrics: OpsMetrics) {
};

return {
message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}`,
message: [
processMemoryUsedInBytesMsg,
uptimeValMsg,
loadValsMsg,
eventLoopDelayValMsg,
eventLoopDelayHistMsg,
].join(''),
meta,
};
}
1 change: 1 addition & 0 deletions src/core/server/metrics/metrics_service.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,7 @@ describe('MetricsService', () => {
},
"process": Object {
"eventLoopDelay": undefined,
"eventLoopDelayHistogram": undefined,
"memory": Object {
"heap": Object {
"usedInBytes": undefined,
Expand Down