Support for TLS client settings for clustering #1724

Merged 9 commits on Oct 2, 2024
Changes from 2 commits
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -26,6 +26,8 @@ Main (unreleased)

- Changed OTEL alerts in Alloy mixin to use success rate for tracing. (@thampiotr)

- Support TLS client settings for clustering (@tiagorossig)

v1.4.0-rc.3
-----------------

10 changes: 9 additions & 1 deletion docs/sources/reference/cli/run.md
@@ -52,6 +52,11 @@ The following flags are supported:
* `--cluster.max-join-peers`: Number of peers to join from the discovered set (default `5`).
* `--cluster.name`: Name to prevent nodes without this identifier from joining the cluster (default `""`).
* `--cluster.use-discovery-v1`: Use the older, v1 version of cluster peer discovery mechanism (default `false`). Note that this flag will be deprecated in the future and eventually removed.
* `--cluster.enable-tls`: Specifies whether TLS should be used for communication between peers (default `false`).
* `--cluster.tls-ca-path`: Path to the CA certificate file used for peer communication over TLS.
* `--cluster.tls-cert-path`: Path to the certificate file used for peer communication over TLS.
* `--cluster.tls-key-path`: Path to the key file used for peer communication over TLS.
* `--cluster.tls-server-name`: Server name used for peer communication over TLS.
* `--config.format`: The format of the source file. Supported formats: `alloy`, `otelcol`, `prometheus`, `promtail`, `static` (default `"alloy"`).
* `--config.bypass-conversion-errors`: Enable bypassing errors when converting (default `false`).
* `--config.extra-args`: Extra arguments from the original format used by the converter.
@@ -100,7 +105,7 @@ The rest of the `--cluster.*` command-line flags can be used to configure how no
Each cluster member’s name must be unique within the cluster.
Nodes which try to join with a conflicting name are rejected and will fall back to bootstrapping a new cluster of their own.

Peers communicate over HTTP/2 on the built-in HTTP server.
Peers communicate over HTTP/2 on the built-in HTTP server.
Each node must be configured to accept connections on `--server.http.listen-addr` and the address defined or inferred in `--cluster.advertise-address`.

If the `--cluster.advertise-address` flag isn't explicitly set, {{< param "PRODUCT_NAME" >}} tries to infer a suitable one from `--cluster.advertise-interfaces`.
@@ -112,6 +117,9 @@ The comma-separated list of addresses provided in `--cluster.join-addresses` can
In both cases, the port number can be specified with a `:<port>` suffix. If a port isn't provided, the port used for the HTTP listener is used by default.
If you do not provide the port number explicitly, you must ensure that all instances use the same port for the HTTP listener.

Set the `--cluster.enable-tls` flag to enable TLS for peer-to-peer communication. When TLS is enabled, you must also configure the TLS client with the CA certificate (`--cluster.tls-ca-path`),
the certificate (`--cluster.tls-cert-path`), the key (`--cluster.tls-key-path`), and the server name (`--cluster.tls-server-name`).
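
As a rough illustration of how the flags in this paragraph relate, the sketch below registers them with Go's standard `flag` package and rejects `--cluster.enable-tls` when a file path is missing. The `tlsFlags` type, `parseTLSFlags`, and `validate` are hypothetical names for this example only. The PR itself registers the flags through `cmd.Flags()` and does not show a validation step, so the check here is an assumption about what a caller might want, not Alloy's behavior.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// tlsFlags is a hypothetical holder for the five --cluster.* TLS flags.
type tlsFlags struct {
	enable     bool
	caPath     string
	certPath   string
	keyPath    string
	serverName string
}

// validate reports an error when TLS is enabled but a file path is missing.
// Assumption: only the three file paths are checked, because those are the
// files a TLS config loader has to read.
func (f tlsFlags) validate() error {
	if !f.enable {
		return nil
	}
	if f.caPath == "" || f.certPath == "" || f.keyPath == "" {
		return errors.New("cluster TLS is enabled but the CA, certificate, or key path is missing")
	}
	return nil
}

// parseTLSFlags registers the flags on a fresh FlagSet, parses args, and
// validates the result.
func parseTLSFlags(args []string) (tlsFlags, error) {
	var f tlsFlags
	fs := flag.NewFlagSet("run", flag.ContinueOnError)
	fs.BoolVar(&f.enable, "cluster.enable-tls", false, "use TLS between peers")
	fs.StringVar(&f.caPath, "cluster.tls-ca-path", "", "path to the CA certificate file")
	fs.StringVar(&f.certPath, "cluster.tls-cert-path", "", "path to the certificate file")
	fs.StringVar(&f.keyPath, "cluster.tls-key-path", "", "path to the key file")
	fs.StringVar(&f.serverName, "cluster.tls-server-name", "", "server name for TLS")
	if err := fs.Parse(args); err != nil {
		return f, err
	}
	return f, f.validate()
}

func main() {
	// TLS enabled without any paths: the sketch rejects this combination.
	_, err := parseTLSFlags([]string{"--cluster.enable-tls"})
	fmt.Println("missing paths:", err)

	// TLS enabled with all three paths (placeholder file names).
	f, err := parseTLSFlags([]string{
		"--cluster.enable-tls",
		"--cluster.tls-ca-path=ca.pem",
		"--cluster.tls-cert-path=cert.pem",
		"--cluster.tls-key-path=key.pem",
		"--cluster.tls-server-name=alloy.example",
	})
	fmt.Println("complete:", f.enable, err)
}
```

Failing early on an incomplete flag set gives a clearer message than a later file-read error on an empty path.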

The `--cluster.discover-peers` command-line flag expects a list of tuples in the form of `provider=XXX key=val key=val ...`.
Clustering uses the [go-discover] package to discover peers and fetch their IP addresses, based on the chosen provider and the filtering key-values it supports.
Clustering supports the default set of providers available in go-discover and registers the `k8s` provider on top.
3 changes: 3 additions & 0 deletions go.mod
@@ -917,3 +917,6 @@ replace github.com/prometheus/procfs => github.com/prometheus/procfs v0.12.0
// It's important to remove it asap because in version v0.13.1 there is a fix for Beyla.
// PR to track it: https://github.com/opencontainers/runc/pull/4397
replace github.com/opencontainers/runc => github.com/rafaelroquetto/runc v1.1.14-1

// Temporary replace until ckit changes are merged upstream
replace github.com/grafana/ckit => github.com/tiagorossig/ckit v0.0.0-20240920184404-077657c65a6f
4 changes: 2 additions & 2 deletions go.sum
@@ -1198,8 +1198,6 @@ github.com/grafana/cadvisor v0.0.0-20240729082359-1f04a91701e2 h1:ju6EcY2aEobeBg
github.com/grafana/cadvisor v0.0.0-20240729082359-1f04a91701e2/go.mod h1:8sLW/G7rcFe1CKMaA4pYT4mX3P1xQVGqM6luzEzx/2g=
github.com/grafana/catchpoint-prometheus-exporter v0.0.0-20240606062944-e55f3668661d h1:6sNPBwOokfCxAyateu7iLdtyWDUzaLLShPs7F4eTLfw=
github.com/grafana/catchpoint-prometheus-exporter v0.0.0-20240606062944-e55f3668661d/go.mod h1:aGPSALDAkw18nn8M7gumhM/MbJG+zgOA3jNWTwPYtLg=
github.com/grafana/ckit v0.0.0-20240913130805-0ee98bafad88 h1:GgbYRGz2+/Vgz8/lk19Ht8TQDsAudl51Qenuw+COs5k=
github.com/grafana/ckit v0.0.0-20240913130805-0ee98bafad88/go.mod h1:dDqep1rKTbq2ppMYEgIM88GaPXHp4i6Cp3qantiloA0=
github.com/grafana/cloudflare-go v0.0.0-20230110200409-c627cf6792f2 h1:qhugDMdQ4Vp68H0tp/0iN17DM2ehRo1rLEdOFe/gB8I=
github.com/grafana/cloudflare-go v0.0.0-20230110200409-c627cf6792f2/go.mod h1:w/aiO1POVIeXUQyl0VQSZjl5OAGDTL5aX+4v0RA1tcw=
github.com/grafana/dskit v0.0.0-20240104111617-ea101a3b86eb h1:AWE6+kvtE18HP+lRWNUCyvymyrFSXs6TcS2vXIXGIuw=
@@ -2394,6 +2392,8 @@ github.com/testcontainers/testcontainers-go v0.33.0 h1:zJS9PfXYT5O0ZFXM2xxXfk4J5
github.com/testcontainers/testcontainers-go v0.33.0/go.mod h1:W80YpTa8D5C3Yy16icheD01UTDu+LmXIA2Keo+jWtT8=
github.com/tg123/go-htpasswd v1.2.2 h1:tmNccDsQ+wYsoRfiONzIhDm5OkVHQzN3w4FOBAlN6BY=
github.com/tg123/go-htpasswd v1.2.2/go.mod h1:FcIrK0J+6zptgVwK1JDlqyajW/1B4PtuJ/FLWl7nx8A=
github.com/tiagorossig/ckit v0.0.0-20240920184404-077657c65a6f h1:k8YmA8sNzKYlbHtrMkZXikDSajqkYuLdM9Eerw14qBA=
github.com/tiagorossig/ckit v0.0.0-20240920184404-077657c65a6f/go.mod h1:dDqep1rKTbq2ppMYEgIM88GaPXHp4i6Cp3qantiloA0=
github.com/tidwall/gjson v1.6.0/go.mod h1:P256ACg0Mn+j1RXIDXoss50DeIABTYK1PULOJHhxOls=
github.com/tidwall/match v1.0.1/go.mod h1:LujAq0jyVjBy028G1WhWfIzbpQfMO8bBZ6Tyb0+pL9E=
github.com/tidwall/pretty v1.0.0/go.mod h1:XNkn88O1ChpSDQmQeStsy+sBenx6DDtFZJxhVysOjyk=
10 changes: 10 additions & 0 deletions internal/alloycli/cluster_builder.go
@@ -38,6 +38,11 @@ type clusterOptions struct {
ClusterName string
EnableStateUpdatesLimiter bool
EnableDiscoveryV2 bool
EnableTLS bool
TLSCAPath string
TLSCertPath string
TLSKeyPath string
TLSServerName string
}

func buildClusterService(opts clusterOptions) (*cluster.Service, error) {
@@ -54,6 +59,11 @@ func buildClusterService(opts clusterOptions) (*cluster.Service, error) {
ClusterMaxJoinPeers: opts.ClusterMaxJoinPeers,
ClusterName: opts.ClusterName,
EnableStateUpdatesLimiter: opts.EnableStateUpdatesLimiter,
EnableTLS: opts.EnableTLS,
TLSCAPath: opts.TLSCAPath,
TLSCertPath: opts.TLSCertPath,
TLSKeyPath: opts.TLSKeyPath,
TLSServerName: opts.TLSServerName,
}

if config.NodeName == "" {
20 changes: 20 additions & 0 deletions internal/alloycli/cmd_run.go
@@ -131,6 +131,16 @@ depending on the nature of the reload error.
IntVar(&r.clusterMaxJoinPeers, "cluster.max-join-peers", r.clusterMaxJoinPeers, "Number of peers to join from the discovered set")
cmd.Flags().
StringVar(&r.clusterName, "cluster.name", r.clusterName, "The name of the cluster to join")
cmd.Flags().
BoolVar(&r.clusterEnableTLS, "cluster.enable-tls", r.clusterEnableTLS, "Specifies whether TLS should be used for communication between peers")
cmd.Flags().
StringVar(&r.clusterTLSCAPath, "cluster.tls-ca-path", r.clusterTLSCAPath, "Path to the CA certificate file")
cmd.Flags().
StringVar(&r.clusterTLSCertPath, "cluster.tls-cert-path", r.clusterTLSCertPath, "Path to the certificate file")
cmd.Flags().
StringVar(&r.clusterTLSKeyPath, "cluster.tls-key-path", r.clusterTLSKeyPath, "Path to the key file")
cmd.Flags().
StringVar(&r.clusterTLSServerName, "cluster.tls-server-name", r.clusterTLSServerName, "Server name to use for TLS communication")
// TODO(alloy/#1274): make this flag a no-op once we have more confidence in this feature, and add issue to
// remove it in the next major release
cmd.Flags().
@@ -168,6 +178,11 @@ type alloyRun struct {
clusterMaxJoinPeers int
clusterName string
clusterUseDiscoveryV1 bool
clusterEnableTLS bool
clusterTLSCAPath string
clusterTLSCertPath string
clusterTLSKeyPath string
clusterTLSServerName string
configFormat string
configBypassConversionErrors bool
configExtraArgs string
@@ -258,6 +273,11 @@ func (fr *alloyRun) Run(configPath string) error {
//TODO(alloy/#1274): graduate to GA once we have more confidence in this feature
EnableStateUpdatesLimiter: fr.minStability.Permits(featuregate.StabilityPublicPreview),
EnableDiscoveryV2: !fr.clusterUseDiscoveryV1,
EnableTLS: fr.clusterEnableTLS,
TLSCertPath: fr.clusterTLSCertPath,
TLSCAPath: fr.clusterTLSCAPath,
TLSKeyPath: fr.clusterTLSKeyPath,
TLSServerName: fr.clusterTLSServerName,
})
if err != nil {
return err
82 changes: 66 additions & 16 deletions internal/service/cluster/cluster.go
@@ -5,6 +5,7 @@ package cluster
import (
"context"
"crypto/tls"
"crypto/x509"
"fmt"
"math/rand"
"net"
@@ -76,6 +77,11 @@ type Options struct {

NodeName string // Name to use for this node in the cluster.
AdvertiseAddress string // Address to advertise to other nodes in the cluster.
EnableTLS bool // Specifies whether TLS should be used for communication between peers.
TLSCAPath string // Path to the CA file.
TLSCertPath string // Path to the certificate file.
TLSKeyPath string // Path to the key file.
TLSServerName string // Server name to use for TLS communication.
RejoinInterval time.Duration // How frequently to rejoin the cluster to address split brain issues.
ClusterMaxJoinPeers int // Number of initial peers to join from the discovered set.
ClusterName string // Name to prevent nodes without this identifier from joining the cluster.
@@ -121,26 +127,35 @@ func New(opts Options) (*Service, error) {
Log: l,
Sharder: shard.Ring(tokensPerNode),
Label: opts.ClusterName,
EnableTLS: opts.EnableTLS,
}

httpClient := &http.Client{
Transport: &http2.Transport{
AllowHTTP: true,
DialTLSContext: func(ctx context.Context, network, addr string, _ *tls.Config) (net.Conn, error) {
// Set a maximum timeout for establishing the connection. If our
// context has a deadline earlier than our timeout, we shrink the
// timeout to it.
//
// TODO(rfratto): consider making the max timeout configurable.
timeout := 30 * time.Second
if dur, ok := deadlineDuration(ctx); ok && dur < timeout {
timeout = dur
}

return net.DialTimeout(network, addr, timeout)
},
httpTransport := &http2.Transport{
AllowHTTP: false,
DialTLSContext: func(ctx context.Context, network, addr string, _ *tls.Config) (net.Conn, error) {
return net.DialTimeout(network, addr, calcTimeout(ctx))
},
}
if opts.EnableTLS {
tlsConfig, err := loadTLSConfigFromFile(opts.TLSCAPath, opts.TLSCertPath, opts.TLSKeyPath, opts.TLSServerName)
if err != nil {
return nil, fmt.Errorf("failed to load TLS config from file: %w", err)
}
level.Debug(l).Log(
"msg", "loaded TLS config for cluster http transport",
"TLSCAPath", opts.TLSCAPath,
"TLSCertPath", opts.TLSCertPath,
"TLSKeyPath", opts.TLSKeyPath,
"TLSServerName", opts.TLSServerName,
)
httpTransport.TLSClientConfig = tlsConfig
httpTransport.DialTLSContext = func(ctx context.Context, network, addr string, cfg *tls.Config) (net.Conn, error) {
return tls.DialWithDialer(&net.Dialer{Timeout: calcTimeout(ctx)}, network, addr, cfg)
}
}
httpClient := &http.Client{
Transport: httpTransport,
}

node, err := ckit.NewNode(httpClient, ckitConfig)
if err != nil {
@@ -163,6 +178,41 @@ func New(opts Options) (*Service, error) {
}, nil
}

func loadTLSConfigFromFile(TLSCAPath string, TLSCertPath string, TLSKeyPath string, serverName string) (*tls.Config, error) {
pem, err := os.ReadFile(TLSCAPath)
if err != nil {
return nil, fmt.Errorf("failed to read TLS CA file: %w", err)
}
caCertPool := x509.NewCertPool()
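// AppendCertsFromPEM reports false when no certificate could be parsed.
if !caCertPool.AppendCertsFromPEM(pem) {
// TLSCAPath is a string, not an error, so it is formatted with %s rather than %w.
return nil, fmt.Errorf("failed to append CA from PEM with path %s", TLSCAPath)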
}

cert, err := tls.LoadX509KeyPair(TLSCertPath, TLSKeyPath)
if err != nil {
return nil, fmt.Errorf("failed to load X509 key pair: %w", err)
}

return &tls.Config{
Certificates: []tls.Certificate{cert},
RootCAs: caCertPool,
ServerName: serverName,
}, nil
}
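
The CA-loading step above depends on the boolean returned by `x509.CertPool.AppendCertsFromPEM`, which is `false` when no certificate in the input could be parsed. The following is a minimal, self-contained sketch of that check. `newCertPoolFromPEM` and `selfSignedCAPEM` are hypothetical helpers written for this example and are not part of the PR. The throwaway self-signed certificate exists only so the sketch runs without files on disk.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// newCertPoolFromPEM (hypothetical helper) builds a certificate pool from PEM
// bytes and fails if no certificate could be parsed from them.
func newCertPoolFromPEM(pemBytes []byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no valid certificates found in PEM data")
	}
	return pool, nil
}

// selfSignedCAPEM (hypothetical helper) generates a short-lived, self-signed
// CA certificate in memory, for demonstration only.
func selfSignedCAPEM() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "example-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
	}
	// Template and parent are the same certificate, so it is self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	caPEM, err := selfSignedCAPEM()
	if err != nil {
		panic(err)
	}

	_, err = newCertPoolFromPEM(caPEM)
	fmt.Println("valid PEM error:", err)

	_, err = newCertPoolFromPEM([]byte("not a certificate"))
	fmt.Println("invalid PEM error:", err)
}
```

Calling `AppendCertsFromPEM` once and checking its result is enough. A second call would only attempt to add the same certificates again.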

// calcTimeout returns the maximum timeout for establishing the connection.
// If our context has a deadline earlier than our timeout, we shrink the
// timeout to it.
//
// TODO(rfratto): consider making the max timeout configurable.
func calcTimeout(ctx context.Context) time.Duration {
timeout := 30 * time.Second
if dur, ok := deadlineDuration(ctx); ok && dur < timeout {
timeout = dur
}
return timeout
}

func deadlineDuration(ctx context.Context) (d time.Duration, ok bool) {
if t, ok := ctx.Deadline(); ok {
return time.Until(t), true