
DAOS-8331 client: add client side metrics #13517

Closed · wants to merge 30 commits into from
Conversation

@wangdi1 (Contributor) commented Dec 17, 2023

  1. Move TLS to common, so that both the client and the server can have TLS to which metrics can be attached.

  2. Add object metrics on the client side, enabled by "export DAOS_CLIENT_METRICS=1". Client metrics are organized as "root/jobid/pid/xxxxx" (see the sketch below).

     The root/jobid/pid nodes are stored in an independent shared memory segment, which is only destroyed once all jobs are gone.

     During each DAOS thread's initialization, another shmem segment (pid/xxx) is created, and all metrics of that thread are attached to it. These metrics are destroyed once the thread exits.

  3. Add DAOS_METRIC_DUMP_ENV to dump the metrics of the current thread once it exits.

  4. Some fixes in telemetry around conv_ptr when re-opening the shared memory.

  5. Add a daos_metrics --jobid XXX option to retrieve all metrics of the job.

Required-githooks: true
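As a rough, hedged sketch of the producer side described in item 2 (not the PR's actual code): a client process opens the job-level telemetry root in shared memory and then creates its "jobid/pid" directory node, using the d_tm_init()/d_tm_add_metric() calls quoted later in this review. The root id, shmem size, flags, and jobid are placeholder parameters here, not the PR's constants.

#include <stdint.h>
#include <unistd.h>
#include <gurt/telemetry_common.h>
#include <gurt/telemetry_producer.h>

/*
 * Hedged sketch only: root_id, shmem_size, tm_flags and jobid stand in for
 * the PR's actual constants and variables.
 */
static int
client_metrics_attach(int root_id, uint64_t shmem_size, int tm_flags, const char *jobid)
{
	struct d_tm_node_t *job_node = NULL;
	pid_t               pid      = getpid();
	int                 rc;

	/* Open (or create) the job-level telemetry root in shared memory. */
	rc = d_tm_init(root_id, shmem_size, tm_flags);
	if (rc != 0)
		return rc;

	/* Create the "jobid/pid" directory node that this process's metrics
	 * (e.g. client object metrics) are attached under. */
	rc = d_tm_add_metric(&job_node, D_TM_DIRECTORY, "job id directory", "dir",
			     "%s/%u", jobid, pid);
	return rc;
}

On the consumer side, daos_metrics --jobid then walks root/jobid/* to find each pid directory and reads the metrics attached under it.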

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@wangdi1 wangdi1 requested review from a team as code owners December 17, 2023 05:13

github-actions bot commented Dec 17, 2023

Bug-tracker data:
Ticket title is 'Client side metrics/stats support for DAOS'
Status is 'Awaiting Verification'
Labels: 'HPE'
https://daosio.atlassian.net/browse/DAOS-8331


@ashleypittman (Contributor) commented:

This is failing with a segfault in pydaos, as a start you could disable this test by commenting out the two lines at https://github.com/daos-stack/daos/blob/master/utils/node_local_test.py#L6031-L6032 which would allow the rest of the testing to proceed which may give some indication of the cause.

@wangdi1 (Contributor Author) commented Dec 18, 2023

This is failing with a segfault in pydaos, as a start you could disable this test by commenting out the two lines at https://github.com/daos-stack/daos/blob/master/utils/node_local_test.py#L6031-L6032 which would allow the rest of the testing to proceed which may give some indication of the cause.

Thanks! @ashleypittman


ops |= D_TM_ITER_RESET;
iter_cb = iter_reset;
break;
case 'j':
Contributor:
Please update the usage text in print_usage() to include this new flag.

Contributor Author:
Sure

iter_arg.iter_cb = iter_cb;
iter_arg.delay = delay;
iter_arg.num_iter = num_iter;
d_tm_iterate(ctx, root, 0, D_TM_DIRECTORY, NULL, format, extra_descriptors,
Contributor:
I played around with the PR on my development node where I'm running the server and some client processes. When I run daos_metrics -j , I get the engine metrics mixed in and the client metrics are all zero. I wonder if it's because the wrong context is being used in the iterator?

Contributor Author:
Did you provide the correct jobid after -j? The iteration walks every pid under root/jobid and then iterates each pid's metrics; see d_tm_iterate->iter_per_pid->process_metrics(). It should only include client object metrics for the moment.

I tried this on my side with multiple jobids, and it works here.

@daosbuild1 (Collaborator) commented:

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/6/execution/node/1111/log

@mjmac mjmac requested a review from kjacque December 19, 2023 18:35
@ashleypittman (Contributor) left a comment:

-1 for the namespace pollution.

If this is creating shared memory on the client then we need some documentation or user education on how to manage this and detect/clean up any leaked resources - even if that is just "the agent does it".

Comment on lines 19 to 20
bool client_metric;
bool client_metric_retain;
Contributor:
These (and many others) should be static if possible or renamed with a daos_ prefix if not. There are much stricter naming requirements for symbols in client libraries than there are for the server.
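A minimal sketch of the two options being suggested (plain C, not code from the PR); the daos_-prefixed name below is only an illustration of the naming convention, not a name this PR actually uses:

#include <stdbool.h>

/* Option 1: keep the symbol private to the translation unit. */
static bool client_metric;

/* Option 2: if other files inside the client library need the symbol, give
 * it a library-scoped prefix so it cannot collide with symbols in end-user
 * applications that link against libdaos (illustrative name only). */
bool daos_client_metric;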

Contributor Author:
sure


rc = d_tm_init(DC_TM_JOB_ROOT_ID, MAX_IDS_SIZE(INIT_JOB_NUM), metrics_tag);
if (rc != 0) {
D_ERROR("init job root id %u: %d\n", DC_TM_JOB_ROOT_ID, rc);
Contributor:
I won't comment on all of these, but this one in particular looks like we should log the failure properly.

Suggested change
D_ERROR("init job root id %u: %d\n", DC_TM_JOB_ROOT_ID, rc);
DL_ERROR(rc, "init job root id %u", DC_TM_JOB_ROOT_ID);

Contributor Author:
sure

rc = d_tm_add_metric(&job_node, D_TM_DIRECTORY,
"job id directory", "dir",
"%s/%u", dc_jobid, pid);
/* Close job root sheme */
Contributor:
These codespell annotations all need to be fixed.

Contributor Author:
sure

src/common/tls.c Outdated
#include <daos/tls.h>

/* The array remember all of registered module keys on one node. */
struct daos_module_key *daos_module_keys[DAOS_MODULE_KEYS_NR] = { NULL };
Contributor:
This should be static again, and you don't need to initialise it.

Suggested change
struct daos_module_key *daos_module_keys[DAOS_MODULE_KEYS_NR] = { NULL };
static struct daos_module_key *daos_module_keys[DAOS_MODULE_KEYS_NR];

Contributor Author:
sure

src/common/tls.c Outdated
/**
* Init thread context
*
* \param[in]dtls Init the thread context to allocate the
Contributor:
I don't think doxygen will like this, you need to adjust the whitespace. Applies elsewhere as well.

Suggested change
* \param[in]dtls Init the thread context to allocate the
* \param[in] dtls Init the thread context to allocate the

Contributor Author:
sure


D_INFO("Delete share memory for id %u destory %s\n", tm_shmem.id,
Contributor:
There's a DF_BOOL macro to use here rather than yes/no trigraph.

Contributor Author:
sure

Contributor:
We don't require all our code to be well formatted with clang-format yet, but for new public header files we really should. I can push an update to this PR for just these new files if you would like; alternatively it should just be a case of "clang-format -i ".

#define METRIC_DUMP_ENV "DAOS_METRIC_DUMP_ENV"
#define DAOS_CLIENT_METRICS_ENV "DAOS_CLIENT_METRICS"
#define DAOS_CLIENT_METRICS_RETAIN_ENV "DAOS_CLIENT_METRICS_RETAIN"
extern bool client_metric;
Contributor:
You can't have a variable called "client_metric" in a shared library that you're linking with end-user applications.

Contributor Author:
sure

}

/* For now TLS is only enabled if metrics are enabled */
#define METRIC_DUMP_ENV "DAOS_METRIC_DUMP_ENV"
Contributor:
Sorry, I forgot to mention this on my other review... Suggest that this env variable should be something more like "DAOS_CLIENT_METRICS_FILE" for consistency.

Contributor Author:
yeah, sure.

@daosbuild1 (Collaborator) commented:

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13517/8/execution/node/335/log

@wangdi1 (Contributor Author) commented Feb 8, 2024

maybe we need to do some more access synchronization around creating/deleting ephemeral directories. It's more overhead, but in a world of multiple writers, it seems necessary now.

FWIW, I am running with 32 client processes per node, which might be why I'm the first to see this.

I added ephemeral directory synchronization, @kjacque; please take a look. @daltonbohning, please try the patch. Thanks!

@wangdi1 @kjacque Does it make sense for me to test this with multiple client nodes, or would 1 client and many processes be similar? I.e. is the directory creation per agent/client node?

Yes, you can test it with a single client and multiple threads, and the metrics directory is created per client node. @daltonbohning

@daltonbohning (Contributor) commented:

@wangdi1 Two issues with the latest.

  1. Most mdtest runs just hang forever with no errors in logs.

  2. One reported ERROR (aiori-DFS.c:493): 244: -1038: Failed to initialize daos
    And from the logs

DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:629 add_child() can't get parent node shmem region, key=0x0
DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:676 add_child() Failed to add metric [c142-072.frontera.tacc.utexas.edu-12827]: DER_NO_SHMEM(-1038): 'Not enough shared memory free'
0DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:2605 d_tm_add_ephemeral_dir() can't set up the link node, DER_NO_SHMEM(-1038): 'Not enough shared memory free'
DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:2648 d_tm_add_ephemeral_dir() Failed to add ephemeral dir [c142-072.frontera.tacc.utexas.edu-12827]: DER_NO_SHMEM(-1038): 'Not enough shared memory free'
DAOS[12827/12827/0] daos ERR  src/client/api/metrics.c:67 dc_tm_init() add metric c142-072.frontera.tacc.utexas.edu-12827/12827 failed.
: DER_NO_SHMEM(-1038): 'Not enough shared memory free'

If this code is complicated / problematic then we probably need CI tests to verify.

@wangdi1 (Contributor Author) commented Feb 8, 2024

@wangdi1 Two issues with the latest.

  1. Most mdtest runs just hang forever with no errors in logs.
  2. One reported ERROR (aiori-DFS.c:493): 244: -1038: Failed to initialize daos
    And from the logs
DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:629 add_child() can't get parent node shmem region, key=0x0
DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:676 add_child() Failed to add metric [c142-072.frontera.tacc.utexas.edu-12827]: DER_NO_SHMEM(-1038): 'Not enough shared memory free'
0DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:2605 d_tm_add_ephemeral_dir() can't set up the link node, DER_NO_SHMEM(-1038): 'Not enough shared memory free'
DAOS[12827/12827/0] telem ERR  src/gurt/telemetry.c:2648 d_tm_add_ephemeral_dir() Failed to add ephemeral dir [c142-072.frontera.tacc.utexas.edu-12827]: DER_NO_SHMEM(-1038): 'Not enough shared memory free'
DAOS[12827/12827/0] daos ERR  src/client/api/metrics.c:67 dc_tm_init() add metric c142-072.frontera.tacc.utexas.edu-12827/12827 failed.
: DER_NO_SHMEM(-1038): 'Not enough shared memory free'

If this code is complicated / problematic then we probably need CI tests to verify.

Could you please delete those shared memory segments before running the job, e.g. via ipcrm -a on all client nodes? There is a known shmem size calculation issue if you run too many ranks per node. I think @mjmac pointed this out before, but we decided to resolve it in a follow-on patch.

Could you please run this with fewer ranks per client, e.g. <= 16 ranks per client? @daltonbohning

I will see what I can do here. Thanks.

@daltonbohning (Contributor) commented:

Could you please delete those shared memory segments before running the job, e.g. via ipcrm -a on all client nodes? There is a known shmem size calculation issue if you run too many ranks per node. I think @mjmac pointed this out before, but we decided to resolve it in a follow-on patch.

Does it matter to run ipcrm -a if there are known issues with shmem calculation?

Could you please run this with fewer ranks per client, e.g. <= 16 ranks per client?

Even if it works, the performance numbers won't be meaningful because we typically run with 32 ranks per client.

IMO, we should resolve the synchronization issues before running on Frontera. It's time-consuming to test small changes like this on Frontera, and we're running v2.4.2 testing now. Let me try to get a reproducer on boro so we can debug this more quickly.

@daltonbohning (Contributor) commented:

IMO, we should resolve the synchronization issues before running on Frontera. It's time-consuming to test small changes like this on Frontera, and we're running v2.4.2 testing now. Let me try to get a reproducer on boro so we can debug this more quickly.

And if I can get a reproducer on boro, I can help with a CI test

@daltonbohning (Contributor) commented:

Different issue on boro. I get past daos_init and ior or mdtest runs okay. But sometimes during cleanup (maybe daos_fini?) I see

telem ERR  src/gurt/telemetry.c:253 destroy_shmem() Unable to remove shared memory segment (shmid=917536). shmctl failed, Invalid argument.
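For context, and purely my reading rather than anything from the PR: with System V shared memory, shmctl(IPC_RMID) reports EINVAL once the segment id is no longer valid, for example when another process has already removed the segment, which would fit several client processes racing on teardown. A minimal sketch of that failure mode:

#include <errno.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Remove a SysV segment, tolerating the case where it is already gone. */
static int
remove_shmem_segment(int shmid)
{
	if (shmctl(shmid, IPC_RMID, NULL) != 0) {
		if (errno == EINVAL || errno == EIDRM)
			/* Stale id: the segment was already removed elsewhere. */
			fprintf(stderr, "shmid %d already removed\n", shmid);
		else
			perror("shmctl(IPC_RMID)");
		return -1;
	}
	return 0;
}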

@wangdi1 (Contributor Author) commented Feb 8, 2024

Different issue on boro. I get past daos_init and ior or mdtest runs okay. But sometimes during cleanup (maybe daos_fini?) I see

telem ERR  src/gurt/telemetry.c:253 destroy_shmem() Unable to remove shared memory segment (shmid=917536). shmctl failed, Invalid argument.

Only with the D_CLIENT_METRICS_ENABLE=1 setting? How many clients and ranks per client in your run? @daltonbohning, thanks.

@daltonbohning (Contributor) commented:

Different issue on boro. I get past daos_init and ior or mdtest runs okay. But sometimes during cleanup (maybe daos_fini?) I see

telem ERR  src/gurt/telemetry.c:253 destroy_shmem() Unable to remove shared memory segment (shmid=917536). shmctl failed, Invalid argument.

Only with the D_CLIENT_METRICS_ENABLE=1 setting? How many clients and ranks per client in your run? @daltonbohning, thanks.

Only with D_CLIENT_METRICS_ENABLE=1 and with as few as 1 client node, 2 client procs. FWIW, I'm running the server on the same node.

@wangdi1 (Contributor Author) commented Feb 8, 2024

Could you please delete those shared memory segments before running the job, e.g. via ipcrm -a on all client nodes? There is a known shmem size calculation issue if you run too many ranks per node. I think @mjmac pointed this out before, but we decided to resolve it in a follow-on patch.

Does it matter to run ipcrm -a if there are known issues with shmem calculation?

Yes, it will help to resolve this shmem calculation issue. The updated PR also changes the shmem format a bit, so not deleting the old shmem might cause a hang. I will play with a larger scale on boro and see how it goes. Thanks.

@daltonbohning (Contributor) commented:

@wangdi1 So far I can't reproduce the hangs or segfaults on boro, even with 144 ranks on one client. I'll queue some more jobs on Frontera, but I'm not sure when they'll finish. We're running 2.4.2 testing and Frontera will be unavailable Friday through Tuesday.

Comment on lines +2470 to +2472
if (d_list_empty(&shmem->sh_subregions))
cur = (d_list_t *)(shmem->sh_base_addr +
(uint64_t)(&((struct d_tm_shmem_hdr *)(0))->sh_subregions));
Contributor:
I don't follow what this change is doing. This is a pointer to the list head, right? If the list is empty, isn't head->next == head?

Contributor Author:
The list is initialized by D_INIT_LIST_HEAD(&header->sh_subregions), so when it is empty the head does not hold a shmem-relative address, if I understand correctly.
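To spell out the pointer arithmetic (my paraphrase with stand-in types, not code from the PR): for an empty list the head's next/prev hold the creator's virtual address of the head itself, so a reader that mapped the region at a different address has to rebuild that value as the region's recorded base address plus the member offset, which is exactly what the cast-from-zero idiom above computes.

#include <stddef.h>
#include <stdint.h>

/* Stand-ins for struct d_tm_shmem_hdr and d_list_t, for illustration only. */
struct shmem_hdr_example {
	uint64_t sh_base_addr;                        /* creator's mapping address */
	struct { void *next, *prev; } sh_subregions;  /* list head */
};

static uint64_t
empty_list_head_in_region(const struct shmem_hdr_example *hdr)
{
	/* Same value as sh_base_addr plus
	 * (uint64_t)(&((struct shmem_hdr_example *)0)->sh_subregions)
	 * in the snippet above. */
	return hdr->sh_base_addr + offsetof(struct shmem_hdr_example, sh_subregions);
}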

@@ -2556,10 +2572,13 @@ d_tm_add_ephemeral_dir(struct d_tm_node_t **node, size_t size_bytes,
if (rc != 0)
D_GOTO(fail, rc);

if (tm_shmem.ephemeral_dir_lock)
Contributor:
Are we only concerned with the local setting, and not the flag in the shmem header? If we don't want each shmem region to potentially have its own settings, we should only have the flag in one place.

@mjmac (Contributor) commented Feb 20, 2024

@wangdi1, @kjacque: Can we please arrive at a consensus on how to get this landed? I have two follow-on PRs that are waiting for this one (#13545, #5382). The most recent comments by Kris look like only small adjustments. There will undoubtedly be some more adjustments needed as we start to really use this feature, but as long as we're not introducing any regressions for the non-telemetry use cases, then I don't think it needs to be perfect before landing.

Di Wang added 3 commits February 20, 2024 15:53
Required-githooks: true

Signed-off-by: Di Wang <di.wang@intel.com>
Required-githooks: true

Signed-off-by: Di Wang <di.wang@intel.com>
Resolve comments from Kris.

Required-githooks: true

Signed-off-by: Di Wang <di.wang@intel.com>
@wangdi1 wangdi1 requested a review from kjacque February 20, 2024 18:31

Bug-tracker data:
Ticket title is 'Client side metrics/stats support for DAOS'
Status is 'Awaiting Verification'
Labels: 'HPE'
https://daosio.atlassian.net/browse/DAOS-8331

@mjmac (Contributor) commented Feb 21, 2024

@wangdi1, @kjacque: Thanks for the quick turnaround on the feedback. FYI, I found an issue in my testing that I've fixed in my follow-on: https://github.com/daos-stack/daos/pull/13545/files#diff-49b244164a1877abd5a7363633cb22baec5bce3b270d978f1beea89a00761bbcR2730

With that fix in place, the agent was able to manage the per-job/pid subdirs with no problems, and ran for a few hours with telemetry managed for hundreds of client processes.

@mchaarawi, @ashleypittman, @daltonbohning: Can we please land these two PRs so that we can keep making progress? I think Dalton's testing has shown that there is no appreciable performance impact with metrics enabled or disabled, correct?

@mchaarawi (Contributor) commented:

@wangdi1, @kjacque: Thanks for the quick turnaround on the feedback. FYI, I found an issue in my testing that I've fixed in my follow-on: https://github.com/daos-stack/daos/pull/13545/files#diff-49b244164a1877abd5a7363633cb22baec5bce3b270d978f1beea89a00761bbcR2730

With that fix in place, the agent was able to manage the per-job/pid subdirs with no problems, and ran for a few hours with telemetry managed for hundreds of client processes.

@mchaarawi, @ashleypittman, @daltonbohning: Can we please land these two PRs so that we can keep making progress? I think Dalton's testing has shown that there is no appreciable performance impact with metrics enabled or disabled, correct?

I don't see any feedback from Dalton on his latest run as to whether it segfaulted again or not, but we probably need to run the test again since it was two weeks ago and more changes have landed on this PR.

@daltonbohning (Contributor) commented:

I don't see any feedback from Dalton on his latest run as to whether it segfaulted again or not, but we probably need to run the test again since it was two weeks ago and more changes have landed on this PR.

The last I ran with D_CLIENT_METRICS_ENABLE=1 I was still getting segfaults or hangs on init.
With D_CLIENT_METRICS_ENABLE=0, I don't yet have a definite answer on the performance impact.
I should finish 2.4.2 RC testing soon and can get back to this.

@daltonbohning (Contributor) commented:

I don't see any feedback from Dalton on his latest run as to whether it segfaulted again or not, but we probably need to run the test again since it was two weeks ago and more changes have landed on this PR.

The last I ran with D_CLIENT_METRICS_ENABLE=1 I was still getting segfaults or hangs on init. With D_CLIENT_METRICS_ENABLE=0, I don't yet have a definite answer on the performance impact. I should finish 2.4.2 RC testing soon and can get back to this.

Looking back at my mdtest hard results, it seems unlikely that D_CLIENT_METRICS_ENABLE=0 introduces significant overhead, though I need to run more tests.
So @mchaarawi if only the non-default (D_CLIENT_METRICS_ENABLE=1) is an issue, should we move forward with this PR and fix that after? Due to some issues on Frontera I'm not sure when I can re-test this.

@mjmac (Contributor) commented Feb 26, 2024

For what it's worth, my follow-on PR based on this one passes all of the linter checks: #13545

Can we please get another +1 on it and get it landed? @johannlombardi, @mchaarawi, @ashleypittman? Thanks in advance. :)

@mchaarawi (Contributor) commented:

For what it's worth, my follow-on PR based on this one passes all of the linter checks: #13545

Can we please get another +1 on it and get it landed? @johannlombardi, @mchaarawi, @ashleypittman? Thanks in advance. :)

I'm going to object to this until 1) we have proved that the segfault issue is resolved/tested, and 2) perf numbers show there is no effect on the non-metric path.

@mjmac (Contributor) commented Feb 26, 2024

I'm going to object to this until 1) we have proved that the segfault issue is resolved/tested, and 2) perf numbers show there is no effect on the non-metric path.

OK. @daltonbohning: Please advise when you'll have time to re-run the performance tests.

@daltonbohning (Contributor) commented:

I'm going to object to this until 1) we have proved that the segfault issue is resolved/tested, and 2) perf numbers show there is no effect on the non-metric path.

OK. @daltonbohning: Please advise when you'll have time to re-run the performance tests.

Sorry, I'm behind. I should get back to this either this week or early next week.

@daltonbohning (Contributor) commented Mar 8, 2024

With the latest version, I still see mdtest hanging with D_CLIENT_METRICS_ENABLE=1. And it seems to only happen (or at least be more likely) with multiple client nodes.
