Alertmanager Distributor #3671

Merged
merged 23 commits into from
Feb 18, 2021
13 changes: 12 additions & 1 deletion CHANGELOG.md
@@ -10,6 +10,18 @@
- `-<prefix>.s3.sse.kms-encryption-context`
* [FEATURE] Querier: Enable `@ <timestamp>` modifier in PromQL using the new `-querier.at-modifier-enabled` flag. #3744
* [FEATURE] Overrides Exporter: Add `overrides-exporter` module for exposing per-tenant resource limit overrides as metrics. It is not included in `all` target, and must be explicitly enabled. #3785
* [FEATURE] Experimental thanosconvert: introduce an experimental tool `thanosconvert` to migrate Thanos block metadata to Cortex metadata. #3770
* [FEATURE] Alertmanager: It now shards the `/api/v1/alerts` API using the ring when sharding is enabled. #3671
* Added `-alertmanager.max-recv-msg-size` (defaults to 16M) to limit the size of HTTP request body handled by the alertmanager.
* New flags added for communication between alertmanagers:
* `-alertmanager.max-recv-msg-size`
* `-alertmanager.alertmanager-client.remote-timeout`
* `-alertmanager.alertmanager-client.tls-enabled`
* `-alertmanager.alertmanager-client.tls-cert-path`
* `-alertmanager.alertmanager-client.tls-key-path`
* `-alertmanager.alertmanager-client.tls-ca-path`
* `-alertmanager.alertmanager-client.tls-server-name`
* `-alertmanager.alertmanager-client.tls-insecure-skip-verify`
* [ENHANCEMENT] Ruler: Add TLS and explicit basic authentication configuration options for the HTTP client the ruler uses to communicate with the alertmanager. #3752
* `-ruler.alertmanager-client.basic-auth-username`: Configure the basic authentication username used by the client. Takes precedence over a URL-configured username.
* `-ruler.alertmanager-client.basic-auth-password`: Configure the basic authentication password used by the client. Takes precedence over a URL-configured password.
@@ -18,7 +30,6 @@
* `-ruler.alertmanager-client.tls-insecure-skip-verify`: Boolean to disable verifying the certificate.
* `-ruler.alertmanager-client.tls-key-path`: File path to the TLS key certificate.
* `-ruler.alertmanager-client.tls-server-name`: Expected name on the TLS certificate.
* [FEATURE] Experimental thanosconvert: introduce an experimental tool `thanosconvert` to migrate Thanos block metadata to Cortex metadata. #3770
* [ENHANCEMENT] Ingester: exposed metric `cortex_ingester_oldest_unshipped_block_timestamp_seconds`, tracking the unix timestamp of the oldest TSDB block not shipped to the storage yet. #3705
* [ENHANCEMENT] Prometheus upgraded. #3739
* Avoid unnecessary `runtime.GC()` during compactions.
1 change: 1 addition & 0 deletions Makefile
@@ -87,6 +87,7 @@ pkg/scheduler/schedulerpb/scheduler.pb.go: pkg/scheduler/schedulerpb/scheduler.p
pkg/storegateway/storegatewaypb/gateway.pb.go: pkg/storegateway/storegatewaypb/gateway.proto
pkg/chunk/grpc/grpc.pb.go: pkg/chunk/grpc/grpc.proto
tools/blocksconvert/scheduler.pb.go: tools/blocksconvert/scheduler.proto
pkg/alertmanager/alertmanagerpb/alertmanager.pb.go: pkg/alertmanager/alertmanagerpb/alertmanager.proto

all: $(UPTODATE_FILES)
test: protos
9 changes: 9 additions & 0 deletions development/tsdb-blocks-storage-s3/config/cortex.yaml
@@ -86,6 +86,15 @@ ruler:

alertmanager:
enable_api: true
sharding_enabled: true
sharding_ring:
replication_factor: 3
heartbeat_period: 5s
heartbeat_timeout: 15s
kvstore:
store: consul
consul:
host: consul:8500
storage:
type: s3
s3:
40 changes: 35 additions & 5 deletions development/tsdb-blocks-storage-s3/docker-compose.yml
@@ -109,7 +109,7 @@ services:
volumes:
- ./config:/cortex/config
- .data-ingester-2:/tmp/cortex-tsdb-ingester:delegated

querier:
build:
context: .
@@ -215,18 +215,48 @@ services:
volumes:
- ./config:/cortex/config

- alertmanager:
+ alertmanager-1:
build:
context: .
dockerfile: dev.dockerfile
image: cortex
command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18031 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8031 -server.grpc-listen-port=9031 -alertmanager.web.external-url=localhost:8031"]
depends_on:
- consul
- minio
ports:
- 8031:8031
- 18031:18031
volumes:
- ./config:/cortex/config

alertmanager-2:
build:
context: .
dockerfile: dev.dockerfile
image: cortex
command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18032 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8032 -server.grpc-listen-port=9032 -alertmanager.web.external-url=localhost:8032"]
depends_on:
- consul
- minio
ports:
- 8032:8032
- 18032:18032
volumes:
- ./config:/cortex/config

alertmanager-3:
build:
context: .
dockerfile: dev.dockerfile
image: cortex
- command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18010 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8010 -server.grpc-listen-port=9010 -alertmanager.web.external-url=localhost:8010"]
+ command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18033 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8033 -server.grpc-listen-port=9033 -alertmanager.web.external-url=localhost:8033"]
depends_on:
- consul
- minio
ports:
- - 8010:8010
- - 18010:18010
+ - 8033:8033
+ - 18033:18033
volumes:
- ./config:/cortex/config

38 changes: 38 additions & 0 deletions docs/configuration/config-file-reference.md
@@ -1576,6 +1576,10 @@ The `alertmanager_config` configures the Cortex alertmanager.
# CLI flag: -alertmanager.configs.poll-interval
[poll_interval: <duration> | default = 15s]

# Maximum size (bytes) of an accepted HTTP request body.
# CLI flag: -alertmanager.max-recv-msg-size
[max_recv_msg_size: <int> | default = 16777216]

# Deprecated. Use -alertmanager.cluster.listen-address instead.
# CLI flag: -cluster.listen-address
[cluster_bind_address: <string> | default = "0.0.0.0:9094"]
@@ -1844,6 +1848,40 @@ cluster:
# Enable the experimental alertmanager config api.
# CLI flag: -experimental.alertmanager.enable-api
[enable_api: <boolean> | default = false]

alertmanager_client:
# Timeout for downstream alertmanagers.
# CLI flag: -alertmanager.alertmanager-client.remote-timeout
[remote_timeout: <duration> | default = 2s]

# Enable TLS in the GRPC client. This flag needs to be enabled when any other
# TLS flag is set. If set to false, insecure connection to gRPC server will be
# used.
# CLI flag: -alertmanager.alertmanager-client.tls-enabled
[tls_enabled: <boolean> | default = false]

# Path to the client certificate file, which will be used for authenticating
# with the server. Also requires the key path to be configured.
# CLI flag: -alertmanager.alertmanager-client.tls-cert-path
[tls_cert_path: <string> | default = ""]

# Path to the key file for the client certificate. Also requires the client
# certificate to be configured.
# CLI flag: -alertmanager.alertmanager-client.tls-key-path
[tls_key_path: <string> | default = ""]

# Path to the CA certificates file to validate server certificate against. If
# not set, the host's root CA certificates are used.
# CLI flag: -alertmanager.alertmanager-client.tls-ca-path
[tls_ca_path: <string> | default = ""]

# Override the expected name on the server certificate.
# CLI flag: -alertmanager.alertmanager-client.tls-server-name
[tls_server_name: <string> | default = ""]

# Skip validating server certificate.
# CLI flag: -alertmanager.alertmanager-client.tls-insecure-skip-verify
[tls_insecure_skip_verify: <boolean> | default = false]
```

### `table_manager_config`
132 changes: 132 additions & 0 deletions pkg/alertmanager/alertmanager_client.go
@@ -0,0 +1,132 @@
package alertmanager

import (
"flag"
"time"

"github.com/go-kit/kit/log"
"github.com/pkg/errors"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"google.golang.org/grpc"
"google.golang.org/grpc/health/grpc_health_v1"

"github.com/cortexproject/cortex/pkg/alertmanager/alertmanagerpb"
"github.com/cortexproject/cortex/pkg/ring/client"
"github.com/cortexproject/cortex/pkg/util/grpcclient"
"github.com/cortexproject/cortex/pkg/util/tls"
)

// ClientsPool is the interface used to get the client from the pool for a specified address.
type ClientsPool interface {
// GetClientFor returns the alertmanager client for the given address.
GetClientFor(addr string) (Client, error)
}

// Client is the interface that should be implemented by any client used to read/write data to an alertmanager via GRPC.
type Client interface {
alertmanagerpb.AlertmanagerClient

// RemoteAddress returns the address of the remote alertmanager and is used to uniquely
// identify an alertmanager instance.
RemoteAddress() string
}

// ClientConfig is the configuration struct for the alertmanager client.
type ClientConfig struct {
RemoteTimeout time.Duration `yaml:"remote_timeout"`
TLSEnabled bool `yaml:"tls_enabled"`
TLS tls.ClientConfig `yaml:",inline"`
}

// RegisterFlagsWithPrefix registers flags with prefix.
func (cfg *ClientConfig) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
f.BoolVar(&cfg.TLSEnabled, prefix+".tls-enabled", cfg.TLSEnabled, "Enable TLS in the GRPC client. This flag needs to be enabled when any other TLS flag is set. If set to false, insecure connection to gRPC server will be used.")
f.DurationVar(&cfg.RemoteTimeout, prefix+".remote-timeout", 2*time.Second, "Timeout for downstream alertmanagers.")
cfg.TLS.RegisterFlagsWithPrefix(prefix, f)
}
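
The flag names listed in the CHANGELOG follow from the prefix passed to `RegisterFlagsWithPrefix`. A minimal standard-library sketch of that pattern is below. `clientConfig`, `registerFlagsWithPrefix`, and `parseClientFlags` are hypothetical stand-ins, not the PR's types; the TLS path options are omitted, and the prefix `alertmanager.alertmanager-client` is inferred from the documented flag names rather than taken from the call site.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// clientConfig is a trimmed, hypothetical stand-in for the PR's ClientConfig.
type clientConfig struct {
	RemoteTimeout time.Duration
	TLSEnabled    bool
}

// registerFlagsWithPrefix registers both options under prefix, so the final
// flag name is prefix + ".remote-timeout" or prefix + ".tls-enabled".
func (cfg *clientConfig) registerFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.BoolVar(&cfg.TLSEnabled, prefix+".tls-enabled", false, "Enable TLS in the gRPC client.")
	f.DurationVar(&cfg.RemoteTimeout, prefix+".remote-timeout", 2*time.Second, "Timeout for downstream alertmanagers.")
}

// parseClientFlags parses args with a fresh FlagSet and returns the result.
func parseClientFlags(args []string) (clientConfig, error) {
	var cfg clientConfig
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	// Assumed prefix, inferred from the flag names in the CHANGELOG entry.
	cfg.registerFlagsWithPrefix("alertmanager.alertmanager-client", fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseClientFlags([]string{
		"-alertmanager.alertmanager-client.remote-timeout=5s",
		"-alertmanager.alertmanager-client.tls-enabled=true",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.RemoteTimeout, cfg.TLSEnabled)
}
```

With no arguments, the sketch falls back to the 2s timeout and TLS disabled, matching the defaults documented in the config file reference.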

type alertmanagerClientsPool struct {
pool *client.Pool
}

func newAlertmanagerClientsPool(discovery client.PoolServiceDiscovery, amClientCfg ClientConfig, logger log.Logger, reg prometheus.Registerer) ClientsPool {
// We prefer sane defaults instead of exposing further config options.
grpcCfg := grpcclient.Config{
MaxRecvMsgSize: 16 * 1024 * 1024,
MaxSendMsgSize: 4 * 1024 * 1024,
GRPCCompression: "",
RateLimit: 0,
RateLimitBurst: 0,
BackoffOnRatelimits: false,
TLSEnabled: amClientCfg.TLSEnabled,
TLS: amClientCfg.TLS,
}

requestDuration := promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Name: "cortex_alertmanager_distributor_client_request_duration_seconds",
Help: "Time spent executing requests from an alertmanager to another alertmanager.",
Buckets: prometheus.ExponentialBuckets(0.008, 4, 7),
}, []string{"operation", "status_code"})

factory := func(addr string) (client.PoolClient, error) {
return dialAlertmanagerClient(grpcCfg, addr, requestDuration)
}

poolCfg := client.PoolConfig{
CheckInterval: time.Minute,
HealthCheckEnabled: true,
HealthCheckTimeout: 10 * time.Second,
}

clientsCount := promauto.With(reg).NewGauge(prometheus.GaugeOpts{
Namespace: "cortex",
Name: "alertmanager_distributor_clients",
Help: "The current number of alertmanager distributor clients in the pool.",
})

return &alertmanagerClientsPool{pool: client.NewPool("alertmanager", poolCfg, discovery, factory, clientsCount, logger)}
}

func (f *alertmanagerClientsPool) GetClientFor(addr string) (Client, error) {
c, err := f.pool.GetClientFor(addr)
if err != nil {
return nil, err
}
return c.(Client), nil
}

func dialAlertmanagerClient(cfg grpcclient.Config, addr string, requestDuration *prometheus.HistogramVec) (*alertmanagerClient, error) {
opts, err := cfg.DialOption(grpcclient.Instrument(requestDuration))
if err != nil {
return nil, err
}
conn, err := grpc.Dial(addr, opts...)
if err != nil {
return nil, errors.Wrapf(err, "failed to dial alertmanager %s", addr)
}

return &alertmanagerClient{
AlertmanagerClient: alertmanagerpb.NewAlertmanagerClient(conn),
HealthClient: grpc_health_v1.NewHealthClient(conn),
conn: conn,
}, nil
}

type alertmanagerClient struct {
alertmanagerpb.AlertmanagerClient
grpc_health_v1.HealthClient
conn *grpc.ClientConn
}

func (c *alertmanagerClient) Close() error {
return c.conn.Close()
}

func (c *alertmanagerClient) String() string {
return c.RemoteAddress()
}

func (c *alertmanagerClient) RemoteAddress() string {
return c.conn.Target()
}
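
The `requestDuration` histogram in `newAlertmanagerClientsPool` uses `prometheus.ExponentialBuckets(0.008, 4, 7)`. As a rough illustration of what that covers, the sketch below recomputes the bucket bounds with the standard library only. `exponentialBuckets` is a hypothetical re-implementation of the documented behaviour (each bound is `factor` times the previous one), not the Prometheus client code itself.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start, where
// each bound is factor times the previous one. It is a stand-in for
// prometheus.ExponentialBuckets written for illustration only.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Same arguments as the histogram in newAlertmanagerClientsPool.
	for _, b := range exponentialBuckets(0.008, 4, 7) {
		fmt.Printf("%.3fs\n", b)
	}
}
```

The seven bounds run from 8ms up to roughly 32.8s, so the default 2s `remote_timeout` falls just under the fifth bound (2.048s).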
4 changes: 2 additions & 2 deletions pkg/alertmanager/alertmanager_ring.go
@@ -26,7 +26,7 @@ const (
RingNumTokens = 128
)

- // RingOp is the operation used for distributing tenants between alertmanagers.
+ // RingOp is the operation used for reading/writing to the alertmanagers.
var RingOp = ring.NewOp([]ring.IngesterState{ring.ACTIVE}, func(s ring.IngesterState) bool {
// Only ACTIVE Alertmanager get requests. If instance is not ACTIVE, we need to find another Alertmanager.
return s != ring.ACTIVE
@@ -77,7 +77,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
cfg.InstanceInterfaceNames = []string{"eth0", "en0"}
f.Var((*flagext.StringSlice)(&cfg.InstanceInterfaceNames), rfprefix+"instance-interface-names", "Name of network interface to read address from.")
f.StringVar(&cfg.InstanceAddr, rfprefix+"instance-addr", "", "IP address to advertise in the ring.")
- f.IntVar(&cfg.InstancePort, rfprefix+"instance-port", 0, "Port to advertise in the ring (defaults to server.http-listen-port).")
+ f.IntVar(&cfg.InstancePort, rfprefix+"instance-port", 0, "Port to advertise in the ring (defaults to server.grpc-listen-port).")
f.StringVar(&cfg.InstanceID, rfprefix+"instance-id", hostname, "Instance ID to register in the ring.")

cfg.RingCheckPeriod = 5 * time.Second
Expand Down
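
The `instance-port` help text now says the advertised port defaults to `server.grpc-listen-port`. The code that applies that default is not part of the visible diff, so the following is only a sketch of the fallback rule under stated assumptions: `advertisedPort` is a hypothetical helper, and treating the flag's default of `0` as "unset" is an assumption based on the flag definition above.

```go
package main

import "fmt"

// advertisedPort returns the port to register in the ring. A configured
// instance port wins; otherwise the gRPC listen port is used.
// Hypothetical helper: treating 0 as "unset" is an assumption based on the
// flag's default value, not code taken from the PR.
func advertisedPort(instancePort, grpcListenPort int) int {
	if instancePort > 0 {
		return instancePort
	}
	return grpcListenPort
}

func main() {
	// alertmanager-1 in the docker-compose file listens for gRPC on 9031.
	fmt.Println(advertisedPort(0, 9031))
	// An explicit instance port overrides the default.
	fmt.Println(advertisedPort(9095, 9031))
}
```

Advertising the gRPC port fits the rest of the change, since the new client pool dials other alertmanagers over gRPC using the addresses found in the ring.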