Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing ldlm pool stats? #67

Closed
knweiss opened this issue May 15, 2017 · 8 comments
Closed

Missing ldlm pool stats? #67

knweiss opened this issue May 15, 2017 · 8 comments
Assignees

Comments

@knweiss
Copy link

knweiss commented May 15, 2017

Someone please double check (I don't have access to a Lustre system right now):

lustre_exporter (master) $ git grep "pool/" sources/procfs.go 
sources/procfs.go:                      {"pool/granted", "granted", "Number of granted locks", s.counterMetric},
sources/procfs.go:                      {"pool/grant_rate", "grant_rate", "Lock grant rate", s.counterMetric},
sources/procfs.go:                      {"pool/cancel_rate", "cancel_rate", "Lock cancel rate", s.counterMetric},
sources/procfs.go:                      {"pool/grant_speed", "grant_speed", "Lock grant speed", s.counterMetric},

However, in the Lustre source code:

lustre-release (master) $ git grep -A2 'lprocfs_counter_init('  |grep -A2 LDLM_POOL
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_GRANTED_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "granted", "locks");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_GRANT_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "grant", "locks");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_CANCEL_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "cancel", "locks");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_GRANT_RATE_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "grant_rate", "locks/s");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_CANCEL_RATE_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "cancel_rate", "locks/s");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_GRANT_PLAN_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "grant_plan", "locks/s");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_SLV_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "slv", "slv");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_SHRINK_REQTD_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "shrink_request", "locks");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_SHRINK_FREED_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "shrink_freed", "locks");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_RECALC_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "recalc_freed", "locks");
--
lustre/ldlm/ldlm_pool.c:        lprocfs_counter_init(pl->pl_stats, LDLM_POOL_TIMING_STAT,
lustre/ldlm/ldlm_pool.c-                             LPROCFS_CNTR_AVGMINMAX | LPROCFS_CNTR_STDDEV,
lustre/ldlm/ldlm_pool.c-                             "recalc_timing", "sec");
lustre-release (master) $ git grep -A2 'lprocfs_counter_init('  |grep LDLM_POOL |wc -l
      11
@joehandzik
Copy link
Contributor

Hmm, we didn't see any of those stats in our environment. Let us know if you see a use case for any of these in your environment, and actually see them populated.

@roclark
Copy link
Contributor

roclark commented May 15, 2017

So it looks like the ones we are not currently tracking are the following:

grant
cancel
grant_plan
slv
shrink_request
shrink_freed
recalc_freed
recalc_timing

@roclark roclark self-assigned this May 19, 2017
@roclark
Copy link
Contributor

roclark commented May 19, 2017

@knweiss I am not finding much information on the grant and cancel metrics, do you happen to have any information or examples on them? Otherwise, I created PR #79 which addresses the other metrics.

@knweiss
Copy link
Author

knweiss commented May 21, 2017

Caveat: I'm not very familiar with the Lustre source code so please take this with a grain of salt.

@roclark After taking a glance at the ldlm_pool.c I came to this conclusion: The grant and cancel metrics are per pool counters that get increased in ldlm_pool_add() and ldlm_pool_del() respectively each time a lock is granted/canceled.

AFAICS this metric invariant holds:

granted == grant - cancel

However, his begs the question: Is granted a gauge and not a counter? There's definitely an atomic_dec():

void ldlm_pool_del(struct ldlm_pool *pl, struct ldlm_lock *lock)
{
[...]
        atomic_dec(&pl->pl_granted);
[...]
}

OTOH there's only a single lprocfs_counter_add() to the LDLM_POOL_GRANTED_STAT and I don't see how its argument granted could become negative:

static void ldlm_pool_recalc_stats(struct ldlm_pool *pl)
{
[...]
        int granted = ldlm_pool_granted(pl);             // reads pl_granted
[...]
        lprocfs_counter_add(pl->pl_stats, LDLM_POOL_GRANTED_STAT,
                            granted);
[...]
}

Also, this makes me wonder if some metrics should be dropped from the lustre-exporter as Prometheus could calculate them instead? (If the user knows these relationships that is...)

A similar case is mentioned in http://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide

dlm lock grant speed = grant_rate - cancel_rate. You can use this to derive cancel_rate 'CR'. Or you can just get 'CR' from the stats file I assume.

@roclark
Copy link
Contributor

roclark commented May 22, 2017

@knweiss Thanks for the detailed dive into the code! I've been looking at the source as well, and to me, it sounds like granted should be a gauge. I was also looking at the equation granted == grant - cancel. Unless I am misinterpreting the source, I don't believe you can have a cancel without at least one grant, otherwise there is nothing to cancel. So, grant should always be greater than or equal to cancel in my mind, but they can increase asynchronously (to an extent). Take the following example where the numbers refer to various operations:

grant #1    -> granted == 1 - 0 = 1 (increase)
grant #2    -> granted == 2 - 0 = 2 (increase)
grant #3    -> granted == 3 - 0 = 3 (increase)
cancel #1   -> granted == 3 - 1 = 2 (decrease)
grant #4    -> granted == 4 - 1 = 3 (increase)
cancel #2   -> granted == 4 - 2 = 2 (decrease)
cancel #3   -> granted == 4 - 3 = 1 (decrease)

Not sure if you agree or if I am way off, but this is how I understand the granted stat and why I see it as a gauge instead of a counter (agreeing with your initial thoughts).

As for removing items from the lustre_exporter, I am all for it if they are direct calculations. Although...

If the user knows these relationships that is...

This is the key point in your statement to me. At least to me, grant speed isn't immediately obvious that it is grant_rate - cancel_rate, though it is logical once known. I'm not sure if it would be wise to document this anywhere or just to keep it. Calculations like these could easily be done in Grafana/Prometheus as opposed to the exporter. I will let @joehandzik add any input on whether he thinks we should keep/remove the metrics.

@knweiss
Copy link
Author

knweiss commented May 22, 2017

@roclark Yes, that's my understanding as well but what do you think about this line?

        lprocfs_counter_add(pl->pl_stats, LDLM_POOL_GRANTED_STAT,
                            granted);

If granted is always positive this add can only increase LDLM_POOL_GRANTED_STAT. That's what confuses me and I wonder if I'm missing something here.

@roclark
Copy link
Contributor

roclark commented May 22, 2017

Hmm... you've convinced me. I don't see any reason why it wouldn't be a counter from that line... If you're missing something, I must be too.

In short... are we all agreeing to set this as a counter?

@joehandzik
Copy link
Contributor

Fine with me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants