DAOS-8331 client: Export client metrics via agent #13545

Closed
wants to merge 1 commit into from

@mjmac mjmac commented Dec 28, 2023

Adds new agent config parameters and code to
optionally export client metrics in Prometheus
format.

Example daos_agent.yml updates:
  telemetry_port: 9192 # export on port 9192
  telemetry_retain: 5m # retain metrics for 5 minutes
                       # after client exit

Change-Id: I77864682cc19fa4c33f326d879e20704ef57a7ea
Required-githooks: true
Signed-off-by: Michael MacDonald <mjmac@google.com>

github-actions bot commented Dec 28, 2023

Bug-tracker data:
Ticket title is 'Client side metrics/stats support for DAOS'
Status is 'Awaiting Verification'
Labels: 'HPE'
https://daosio.atlassian.net/browse/DAOS-8331

@mjmac mjmac force-pushed the mjmac/agent_prom branch 5 times, most recently from 7a6fb84 to e2f54b6 Compare January 5, 2024 18:27
@daosbuild1
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13545/6/display/redirect

@mjmac mjmac force-pushed the mjmac/agent_prom branch 2 times, most recently from 208898e to 3f99401 Compare January 9, 2024 20:33
Contributor

@daltonbohning daltonbohning left a comment

Should there be new entries added to utils/config/daos_agent.yml?

Also, could I ask for this small change? It would allow functional tests (specifically, the performance tests) to set the config: 578d907. Since it's unused, no special testing is needed.

@mjmac
Contributor Author

mjmac commented Jan 11, 2024

> Should there be new entries added to utils/config/daos_agent.yml?

Yes, good catch. I forgot about those.

> Also, could I ask for this small change? It would allow functional tests - specifically, the performance tests - to set the config: 578d907. And since it's unused, no special testing is needed

I'll merge that in, thanks. Actually, I may try to add a ftest for this work, so that change makes it even easier.

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/284/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/279/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/366/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/361/log

@mjmac
Contributor Author

mjmac commented Jan 31, 2024

> I'll merge that in, thanks. Actually, I may try to add a ftest for this work, so that change makes it even easier.

Just refreshed this patch. I did add the agent_utils_params.py changes. I have not gotten to adding the ftest yet. As these metrics are still somewhat of a WIP, IMO it's premature to add tests that are expecting fixed sets of metrics while we're iterating. I agree with @wangdi1 that we should add the ftest later.

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13545/10/execution/node/347/log

github-actions bot commented Feb 20, 2024

Functional on EL 9 Test Results (old)

135 tests   131 ✅  1h 31m 39s ⏱️
 41 suites    4 💤
 41 files      0 ❌

Results for commit 9894532.

♻️ This comment has been updated with latest results.

github-actions bot commented Feb 20, 2024

Functional on EL 8.8 Test Results (old)

135 tests   131 ✅  1h 29m 5s ⏱️
 41 suites    4 💤
 41 files      0 ❌

Results for commit 9894532.

♻️ This comment has been updated with latest results.

github-actions bot commented Feb 21, 2024

Functional Hardware Medium Test Results (old)

130 tests   104 ✅  2h 9m 52s ⏱️
 34 suites   26 💤
 34 files      0 ❌

Results for commit 9894532.

♻️ This comment has been updated with latest results.

github-actions bot commented Feb 21, 2024

Functional Hardware Medium Verbs Provider Test Results (old)

55 tests   54 ✅  4h 7m 31s ⏱️
 7 suites   1 💤
 7 files     0 ❌

Results for commit 9894532.

♻️ This comment has been updated with latest results.

github-actions bot commented Feb 21, 2024

Functional Hardware Large Test Results (old)

64 tests   64 ✅  28m 42s ⏱️
14 suites   0 💤
14 files     0 ❌

Results for commit 9894532.

♻️ This comment has been updated with latest results.

Adds new agent config parameters and code to
optionally export client metrics in Prometheus
format.

Example daos_agent.yml updates:
  telemetry_port: 9192 # export on port 9192
  telemetry_retain: 5m # retain metrics for 5 minutes
                       # after client exit

Run-GHA: true
Change-Id: I77864682cc19fa4c33f326d879e20704ef57a7ea
Required-githooks: true
Signed-off-by: Michael MacDonald <mjmac@google.com>

@mjmac
Contributor Author

mjmac commented Feb 26, 2024

Requesting early reviews while waiting for the base patch to land, TIA.

Contributor

@tanabarr tanabarr left a comment

Not overly familiar with the telemetry code but these changes LGTM.

src/control/cmd/daos_agent/config.go (resolved)
src/control/lib/telemetry/promexp/engine.go (resolved)

Contributor

@daltonbohning daltonbohning left a comment

ftest code changes LGTM. Thanks for adding!

Comment on lines +222 to +223
	if (tm_shmem.other_rw == 1)
		flags |= 0666;
Contributor

This is concerning to me -- allowing anyone to write to the root of client telemetry means anyone can change it, potentially crashing the agent or client process, or doing worse. IMO another way is needed to link client job telemetry from the root.

@@ -2717,7 +2727,7 @@ rm_ephemeral_dir(struct d_tm_context *ctx, struct d_tm_node_t *link)
 	head = &shmem->sh_subregions;
 	for (cur = conv_ptr(shmem, head->next); cur != head; cur = conv_ptr(shmem, cur->next)) {
 		curr = d_list_entry(cur, __typeof__(*curr), rl_link);
-		rc = rm_ephemeral_dir(ctx, curr->rl_link_node);
+		rc = rm_ephemeral_dir(ctx, conv_ptr(shmem, curr->rl_link_node));
Contributor

Good catch.
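
As background on why this kind of fix matters, here is a hedged Go model of the general idea. It assumes that links stored inside a shared-memory segment are recorded relative to the creating process's mapping, so they have to be translated before another process can follow them. The `segment` type and `convPtr` method are invented for this sketch and are not the DAOS implementation of `conv_ptr`.

```go
package main

import "fmt"

// segment models a shared-memory region. Links stored inside it are recorded
// relative to the address at which the creator mapped it (creatorBase), so
// they are not usable by another process without translation.
type segment struct {
	creatorBase uint64   // address of the mapping in the creating process (assumed)
	data        []string // stand-in for node storage; index = local offset
}

// convPtr translates a creator-relative address into a local offset,
// playing the role that conv_ptr plays in the quoted C code.
func (s *segment) convPtr(stored uint64) (int, error) {
	if stored < s.creatorBase {
		return 0, fmt.Errorf("address %#x is below segment base %#x", stored, s.creatorBase)
	}
	off := stored - s.creatorBase
	if off >= uint64(len(s.data)) {
		return 0, fmt.Errorf("address %#x is outside the segment", stored)
	}
	return int(off), nil
}

func main() {
	seg := &segment{creatorBase: 0x7f0000001000, data: []string{"root", "job-1", "job-2"}}
	stored := seg.creatorBase + 2 // link as written by the creating process

	idx, err := seg.convPtr(stored)
	if err != nil {
		panic(err)
	}
	fmt.Println(seg.data[idx])
}
```

Using the stored value directly, as the removed line did with `curr->rl_link_node`, would be the equivalent of indexing `seg.data` with `stored` itself.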

Comment on lines +73 to +77
	var engLabels labelMap
	engLabels, name = extractEngineLabels(log, strings.Join(comps[compsIdx:], string(telemetry.PathSep)))
	for k, v := range engLabels {
		labels[k] = v
	}
Contributor

Is extractEngineLabels meant to be used in the client telemetry path?

@@ -1,447 +1,40 @@
 //
-// (C) Copyright 2021-2022 Intel Corporation.
+// (C) Copyright 2024 Intel Corporation.
Contributor

nit - should be 2021-2024, right?

}
}

func (s *sourceMetricSchema) Prune() {
Contributor

nit - needs documentation comment

addFn func(logging.Logger, telemetry.Metric) *sourceMetric
}

MetricSource struct {
Contributor

nit - needs documentation comment

s.enabled.SetFalse()
}

func (s *MetricSource) Collect(log logging.Logger, ch chan<- *sourceMetric) {
Contributor

nit - needs documentation comment

s.smSchema.Prune()
}

func (s *MetricSource) PruneSegments(log logging.Logger, maxSegAge time.Duration) {
Contributor

needs documentation comment


// adjustAttachInfo performs any necessary adjustments to the attach info
// before returning it.
func (c *InfoCache) adjustAttachInfo(resp *control.GetAttachInfoResp) {
Contributor

We make other modifications to the GetAttachInfoResp struct in mgmt_rpc.go before returning to the client. InfoCache has been caching only what the server sends back. I'm not exactly opposed to doing it this way, but it feels a little odd making our client-side modifications in two different layers.
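
One way to picture the layering question is the following Go sketch, in which the cache holds only the unmodified server response and every client-side change happens in one function applied to a copy. All types, fields, and the environment variable name are hypothetical; this is neither the code in this PR nor the existing mgmt_rpc.go logic.

```go
package main

import "fmt"

// attachInfo is a hypothetical, trimmed-down stand-in for the attach info response.
type attachInfo struct {
	Provider string
	EnvVars  []string
}

// infoCache stores only what the server sent, unmodified.
type infoCache struct {
	cached *attachInfo
}

// get returns a deep copy so callers cannot mutate the cached server response.
func (c *infoCache) get() *attachInfo {
	cp := *c.cached
	cp.EnvVars = append([]string(nil), c.cached.EnvVars...)
	return &cp
}

// adjustForClient is the single place where client-side changes are applied.
// The environment variable name is invented for illustration.
func adjustForClient(ai *attachInfo, telemetryEnabled bool) *attachInfo {
	if telemetryEnabled {
		ai.EnvVars = append(ai.EnvVars, "EXAMPLE_CLIENT_METRICS=1")
	}
	return ai
}

func main() {
	cache := &infoCache{cached: &attachInfo{Provider: "ofi+tcp"}}
	resp := adjustForClient(cache.get(), true)
	fmt.Println(resp.EnvVars, cache.cached.EnvVars)
}
```

Keeping the cached copy pristine means a later change to the client-side adjustments cannot leak into what is cached, at the cost of one copy per request.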

@mjmac mjmac closed this Mar 24, 2024
@mjmac
Contributor Author

mjmac commented Mar 24, 2024

Closed in favor of the approach in #14030.

@mjmac mjmac deleted the mjmac/agent_prom branch May 9, 2024 12:20