Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ops: add ttl metrics for certificate expiration #130110

Merged
merged 1 commit into from
Sep 13, 2024

Conversation

angles-n-daemons
Copy link
Contributor

@angles-n-daemons angles-n-daemons commented Sep 4, 2024

ops: add ttl metrics for certificate expiration

Currently, cockroach only exposes point in time certificate
expiration metrics. If the certificate is to expire 1 day from now,
we expose a gauge security.certificate.expiration.<cert-type>
which is the unix timestamp when it will expire. This PR also
exposes a ttl metric security.certificate.ttl.<cert-type> so that
consumers of this information can run operations based on their
distance to certificate expiration without additional
transformations.

Additionally, this PR refactors how the expiration gauges are set,
so that reads of the gauge directly reference the value of the
certificate.

Epic: CRDB-40209
Fixes: #77376

Release note (ops change): new metrics which expose the ttl for various
certificates

@cockroach-teamcity
Copy link
Member

This change is Reviewable

)

func makeMetrics() Metrics {
func expirationGauge(metadata metric.Metadata, ci *CertInfo) *metric.Gauge {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these lock the certificate manager while they're being read?

// the correct expiration and ttl values are set on the metrics vars that are
// exposed to our collectors. It uses a single key pair to approximate the
// behavior across other key pairs.
func TestMetricsValues(t *testing.T) {
Copy link
Contributor Author

@angles-n-daemons angles-n-daemons Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only testing the CA certs for TTL and expiration, open to going beyond that as well

@@ -100,6 +100,8 @@ func (p PemUsage) String() string {
return "Client"
case TenantPem:
return "Tenant Client"
case TenantSigningPem:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing string value

Copy link
Member

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Overall changes look good, just some small comments on naming and commit hygiene.

Reviewed 5 of 6 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons)


-- commits line 2 at r2:
Please add the headling ops:... to the commit message too.


-- commits line 7 at r2:
I'm a little confused about this release note - not sure if this security: part is misleading. It seems like this release note is more suitable as the commit message body. We don't need to add any comments on refactoring in release notes - they are for documenting user facing changes.
https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages


pkg/security/certificate_metrics.go line 175 at r1 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

Should these lock the certificate manager while they're being read?

This fn seems fine, as it's not actually reading from the cm directly and the ci passed in isn't expected to change. Left a comment on the caller of this fn regarding the locking.


pkg/security/certificate_metrics.go line 132 at r2 (raw file):

	metaClientTTL = metric.Metadata{
		Name:        "security.certificate.ttl.client",
		Help:        "Seconds till expiration for the client certificates., labeled by SQL user. 0 means expired, no certificate or error.",

nit: extra . after certificates


pkg/security/certificate_metrics.go line 202 at r2 (raw file):

// metric to zero.
// cm.mu must be held to protect the certificates. Metrics do their own atomicity.
func (cm *CertificateManager) createMetrics() Metrics {

It seems like this has to be called when the cm is locked since it's accessing fields protected under mu. Do you mind just renaming this to createMetricsLocked to signal that? I see that it's being called in LoadCertificates which locks it so the usages are good!

Copy link
Contributor

@pritesh-lahoti pritesh-lahoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 6 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @angles-n-daemons and @xinhaoz)


pkg/security/certificate_manager.go line 115 at r2 (raw file):

	tenantIdentifier uint64

	// timeSource, if set, specifies the time source with which the metrics are set

nit

Suggestion:

set.

pkg/security/certificate_manager.go line 132 at r2 (raw file):

// WithTimeSource allows the caller to pass a time source to be used
// by the Metrics struct (mostly for testing)

nit

Suggestion:

(mostly for testing).

pkg/security/certificate_metrics.go line 160 at r2 (raw file):

		Unit:        metric.Unit_TIMESTAMP_SEC,
	}

nit: extra new line


pkg/security/certificate_loader_test.go line 130 at r2 (raw file):

}

var defaultExpiration = timeutil.Now().Add(time.Hour)

Should this initialization happen within the test?
This value might get set way before the test using it is actually being executed.


pkg/security/certificate_metrics_test.go line 46 at r2 (raw file):

		{"ca.crt", 0777, caCert},
		{"ca.key", 0777, []byte("ca.key")},
	} {

Could this also have the other node/client/tenant certificates?

@angles-n-daemons angles-n-daemons changed the title ops: add ttl metrics for certificate expiration obs: add ttl metrics for certificate expiration Sep 10, 2024
@angles-n-daemons angles-n-daemons changed the title obs: add ttl metrics for certificate expiration ops: add ttl metrics for certificate expiration Sep 10, 2024
Copy link
Contributor Author

@angles-n-daemons angles-n-daemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pritesh-lahoti and @xinhaoz)


-- commits line 2 at r2:

Previously, xinhaoz (Xin Hao Zhang) wrote…

Please add the headling ops:... to the commit message too.

ah makes sense, will do


-- commits line 7 at r2:

Previously, xinhaoz (Xin Hao Zhang) wrote…

I'm a little confused about this release note - not sure if this security: part is misleading. It seems like this release note is more suitable as the commit message body. We don't need to add any comments on refactoring in release notes - they are for documenting user facing changes.
https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/73072807/Git+Commit+Messages

gotcha, thanks for pointing this out


pkg/security/certificate_manager.go line 115 at r2 (raw file):

Previously, pritesh-lahoti wrote…

nit

👀


pkg/security/certificate_manager.go line 132 at r2 (raw file):

Previously, pritesh-lahoti wrote…

nit

👀


pkg/security/certificate_metrics.go line 132 at r2 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

nit: extra . after certificates

👀


pkg/security/certificate_metrics.go line 160 at r2 (raw file):

Previously, pritesh-lahoti wrote…

nit: extra new line

👀


pkg/security/certificate_metrics.go line 202 at r2 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

It seems like this has to be called when the cm is locked since it's accessing fields protected under mu. Do you mind just renaming this to createMetricsLocked to signal that? I see that it's being called in LoadCertificates which locks it so the usages are good!

make sense! I'll make that naming change, definitely seems better


pkg/security/certificate_loader_test.go line 130 at r2 (raw file):

Previously, pritesh-lahoti wrote…

Should this initialization happen within the test?
This value might get set way before the test using it is actually being executed.

What's your preference? Right now this primarily used as a noop


pkg/security/certificate_metrics_test.go line 46 at r2 (raw file):

Previously, pritesh-lahoti wrote…

Could this also have the other node/client/tenant certificates?

yeah I'll give this a go

@angles-n-daemons
Copy link
Contributor Author

Missing: update on the client certificate TTL aggregate gauge. Need to add that before merging

@angles-n-daemons angles-n-daemons requested review from arjunmahishi and aa-joshi and removed request for a team September 12, 2024 01:56
Copy link
Contributor

@pritesh-lahoti pritesh-lahoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @angles-n-daemons, @arjunmahishi, and @xinhaoz)


pkg/security/cert_expiry_cache.go line 123 at r4 (raw file):

// GetTTL retrieves seconds till cert expiration for the given username, if it exists.
// An expiration of 0 indicates an entry was not found.

Suggestion:

A TTL of 0

pkg/security/cert_expiry_cache.go line 224 at r4 (raw file):

// timeNow returns the current time depending on the time source of the cache.
func (c *ClientCertExpirationCache) timeNow() int64 {

nit: extra line


pkg/util/metric/aggmetric/gauge.go line 183 at r4 (raw file):

}

// Updates the function on the gauge

nit

Suggestion:

// UpdateFn updates the function on the gauge.

pkg/security/certificate_loader_test.go line 130 at r2 (raw file):

Previously, angles-n-daemons (Brian Dillmann) wrote…

What's your preference? Right now this primarily used as a noop

I believe moving it within the test will be resilient to potential flakiness.

Currently, cockroach only exposes point in time certificate
expiration metrics. If the certificate is to expire 1 day from now,
we expose a gauge `security.certificate.expiration.<cert-type>`
which is the unix timestamp when it will expire. This PR also
exposes a ttl metric `security.certificate.ttl.<cert-type>` so that
consumers of this information can run operations based on their
distance to certificate expiration without additional
transformations.

Additionally, this PR refactors how the expiration gauges are set,
so that reads of the gauge directly reference the value of the
certificate.

Epic: CRDB-40209
Fixes: cockroachdb#77376

Release note (ops change): new metrics which expose the ttl for various
certificates
Copy link
Contributor

@pritesh-lahoti pritesh-lahoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for all the changes!

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aa-joshi, @angles-n-daemons, @arjunmahishi, and @xinhaoz)

Copy link
Member

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: !
Remember to update your PR description to sync with the commit message .

@angles-n-daemons
Copy link
Contributor Author

bors r+

@craig craig bot merged commit 66edff7 into cockroachdb:master Sep 13, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

linter: Add PR check to encourage including issue or epic refs in the PR/commit messages
4 participants