Alertmanager Distributor (#3671)
* Alertmanager Distributor

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix lint

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* WIP: HTTP over GRPC for alertmanager Distributors

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix builds

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Simplify distributor code, fix tests to use factory, add distributor to MultitenantAlertmanager

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Distribute requests from alertmanager

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix lint and tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix integration tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Modify AM client grpc config

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix Marco's comments part 2

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix most of Peter's comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix remaining comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix integration tests

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix review comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix Marco's comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

* Fix Peter's and Josh's comments

Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>

Co-authored-by: gotjosh <josue@grafana.com>
codesome and gotjosh authored Feb 18, 2021
1 parent a2be3d8 commit 6c0bebf
Showing 18 changed files with 989 additions and 19 deletions.
13 changes: 12 additions & 1 deletion CHANGELOG.md
@@ -13,6 +13,18 @@
- `-<prefix>.s3.sse.kms-encryption-context`
* [FEATURE] Querier: Enable `@ <timestamp>` modifier in PromQL using the new `-querier.at-modifier-enabled` flag. #3744
* [FEATURE] Overrides Exporter: Add `overrides-exporter` module for exposing per-tenant resource limit overrides as metrics. It is not included in `all` target, and must be explicitly enabled. #3785
* [FEATURE] Experimental thanosconvert: introduce an experimental tool `thanosconvert` to migrate Thanos block metadata to Cortex metadata. #3770
* [FEATURE] Alertmanager: It now shards the `/api/v1/alerts` API using the ring when sharding is enabled. #3671
  * Added `-alertmanager.max-recv-msg-size` (defaults to 16M) to limit the size of HTTP request body handled by the alertmanager.
  * New flags added for communication between alertmanagers:
    * `-alertmanager.max-recv-msg-size`
    * `-alertmanager.alertmanager-client.remote-timeout`
    * `-alertmanager.alertmanager-client.tls-enabled`
    * `-alertmanager.alertmanager-client.tls-cert-path`
    * `-alertmanager.alertmanager-client.tls-key-path`
    * `-alertmanager.alertmanager-client.tls-ca-path`
    * `-alertmanager.alertmanager-client.tls-server-name`
    * `-alertmanager.alertmanager-client.tls-insecure-skip-verify`
* [ENHANCEMENT] Ruler: Add TLS and explicit basic authentication configuration options for the HTTP client the ruler uses to communicate with the alertmanager. #3752
  * `-ruler.alertmanager-client.basic-auth-username`: Configure the basic authentication username used by the client. Takes precedence over a URL configured username.
  * `-ruler.alertmanager-client.basic-auth-password`: Configure the basic authentication password used by the client. Takes precedence over a URL configured password.
@@ -21,7 +33,6 @@
  * `-ruler.alertmanager-client.tls-insecure-skip-verify`: Boolean to disable verifying the certificate.
  * `-ruler.alertmanager-client.tls-key-path`: File path to the TLS key certificate.
  * `-ruler.alertmanager-client.tls-server-name`: Expected name on the TLS certificate.
-* [FEATURE] Experimental thanosconvert: introduce an experimental tool `thanosconvert` to migrate Thanos block metadata to Cortex metadata. #3770
* [ENHANCEMENT] Ingester: exposed metric `cortex_ingester_oldest_unshipped_block_timestamp_seconds`, tracking the unix timestamp of the oldest TSDB block not shipped to the storage yet. #3705
* [ENHANCEMENT] Prometheus upgraded. #3739
  * Avoid unnecessary `runtime.GC()` during compactions.
1 change: 1 addition & 0 deletions Makefile
@@ -89,6 +89,7 @@ pkg/scheduler/schedulerpb/scheduler.pb.go: pkg/scheduler/schedulerpb/scheduler.p
pkg/storegateway/storegatewaypb/gateway.pb.go: pkg/storegateway/storegatewaypb/gateway.proto
pkg/chunk/grpc/grpc.pb.go: pkg/chunk/grpc/grpc.proto
tools/blocksconvert/scheduler.pb.go: tools/blocksconvert/scheduler.proto
pkg/alertmanager/alertmanagerpb/alertmanager.pb.go: pkg/alertmanager/alertmanagerpb/alertmanager.proto

all: $(UPTODATE_FILES)
test: protos
9 changes: 9 additions & 0 deletions development/tsdb-blocks-storage-s3/config/cortex.yaml
@@ -86,6 +86,15 @@ ruler:

alertmanager:
  enable_api: true
  sharding_enabled: true
  sharding_ring:
    replication_factor: 3
    heartbeat_period: 5s
    heartbeat_timeout: 15s
    kvstore:
      store: consul
      consul:
        host: consul:8500
  storage:
    type: s3
    s3:
40 changes: 35 additions & 5 deletions development/tsdb-blocks-storage-s3/docker-compose.yml
@@ -109,7 +109,7 @@ services:
    volumes:
      - ./config:/cortex/config
      - .data-ingester-2:/tmp/cortex-tsdb-ingester:delegated

  querier:
    build:
      context: .
@@ -215,18 +215,48 @@ services:
     volumes:
       - ./config:/cortex/config

-  alertmanager:
+  alertmanager-1:
+    build:
+      context: .
+      dockerfile: dev.dockerfile
+    image: cortex
+    command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18031 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8031 -server.grpc-listen-port=9031 -alertmanager.web.external-url=localhost:8031"]
+    depends_on:
+      - consul
+      - minio
+    ports:
+      - 8031:8031
+      - 18031:18031
+    volumes:
+      - ./config:/cortex/config
+
+  alertmanager-2:
+    build:
+      context: .
+      dockerfile: dev.dockerfile
+    image: cortex
+    command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18032 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8032 -server.grpc-listen-port=9032 -alertmanager.web.external-url=localhost:8032"]
+    depends_on:
+      - consul
+      - minio
+    ports:
+      - 8032:8032
+      - 18032:18032
+    volumes:
+      - ./config:/cortex/config
+
+  alertmanager-3:
     build:
       context: .
       dockerfile: dev.dockerfile
     image: cortex
-    command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18010 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8010 -server.grpc-listen-port=9010 -alertmanager.web.external-url=localhost:8010"]
+    command: ["sh", "-c", "sleep 3 && exec ./dlv exec ./cortex --listen=:18033 --headless=true --api-version=2 --accept-multiclient --continue -- -config.file=./config/cortex.yaml -target=alertmanager -server.http-listen-port=8033 -server.grpc-listen-port=9033 -alertmanager.web.external-url=localhost:8033"]
     depends_on:
       - consul
       - minio
     ports:
-      - 8010:8010
-      - 18010:18010
+      - 8033:8033
+      - 18033:18033
     volumes:
       - ./config:/cortex/config

38 changes: 38 additions & 0 deletions docs/configuration/config-file-reference.md
@@ -1572,6 +1572,10 @@ The `alertmanager_config` configures the Cortex alertmanager.
# CLI flag: -alertmanager.configs.poll-interval
[poll_interval: <duration> | default = 15s]
# Maximum size (bytes) of an accepted HTTP request body.
# CLI flag: -alertmanager.max-recv-msg-size
[max_recv_msg_size: <int> | default = 16777216]
# Deprecated. Use -alertmanager.cluster.listen-address instead.
# CLI flag: -cluster.listen-address
[cluster_bind_address: <string> | default = "0.0.0.0:9094"]
@@ -1840,6 +1844,40 @@ cluster:
# Enable the experimental alertmanager config api.
# CLI flag: -experimental.alertmanager.enable-api
[enable_api: <boolean> | default = false]

alertmanager_client:
  # Timeout for downstream alertmanagers.
  # CLI flag: -alertmanager.alertmanager-client.remote-timeout
  [remote_timeout: <duration> | default = 2s]

  # Enable TLS in the GRPC client. This flag needs to be enabled when any other
  # TLS flag is set. If set to false, insecure connection to gRPC server will be
  # used.
  # CLI flag: -alertmanager.alertmanager-client.tls-enabled
  [tls_enabled: <boolean> | default = false]

  # Path to the client certificate file, which will be used for authenticating
  # with the server. Also requires the key path to be configured.
  # CLI flag: -alertmanager.alertmanager-client.tls-cert-path
  [tls_cert_path: <string> | default = ""]

  # Path to the key file for the client certificate. Also requires the client
  # certificate to be configured.
  # CLI flag: -alertmanager.alertmanager-client.tls-key-path
  [tls_key_path: <string> | default = ""]

  # Path to the CA certificates file to validate server certificate against. If
  # not set, the host's root CA certificates are used.
  # CLI flag: -alertmanager.alertmanager-client.tls-ca-path
  [tls_ca_path: <string> | default = ""]

  # Override the expected name on the server certificate.
  # CLI flag: -alertmanager.alertmanager-client.tls-server-name
  [tls_server_name: <string> | default = ""]

  # Skip validating server certificate.
  # CLI flag: -alertmanager.alertmanager-client.tls-insecure-skip-verify
  [tls_insecure_skip_verify: <boolean> | default = false]
```
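
The `alertmanager_client` options above all map to CLI flags that share the `-alertmanager.alertmanager-client` prefix. The sketch below shows how prefixed registration of this kind can be done with Go's standard `flag` package. It is an illustration only: `clientConfig`, `registerFlagsWithPrefix` and `parseClientConfig` are hypothetical names, and only the `remote-timeout` and `tls-enabled` options are modeled, with the defaults documented above.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// clientConfig is a hypothetical, trimmed-down stand-in for the alertmanager
// client configuration. It models only two of the documented options.
type clientConfig struct {
	RemoteTimeout time.Duration
	TLSEnabled    bool
}

// registerFlagsWithPrefix registers the config fields on f, prepending prefix
// to every flag name so the same struct can be reused under different prefixes.
func (cfg *clientConfig) registerFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.BoolVar(&cfg.TLSEnabled, prefix+".tls-enabled", false, "Enable TLS in the gRPC client.")
	f.DurationVar(&cfg.RemoteTimeout, prefix+".remote-timeout", 2*time.Second, "Timeout for downstream alertmanagers.")
}

// parseClientConfig parses args into a fresh config using the documented prefix.
func parseClientConfig(args []string) (clientConfig, error) {
	var cfg clientConfig
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	cfg.registerFlagsWithPrefix("alertmanager.alertmanager-client", fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseClientConfig([]string{"-alertmanager.alertmanager-client.remote-timeout=5s"})
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.RemoteTimeout, cfg.TLSEnabled)
}
```

Passing the prefix as an argument keeps the flag names in one place, which is the same reason the documented flags all begin with the same string.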
### `table_manager_config`
132 changes: 132 additions & 0 deletions pkg/alertmanager/alertmanager_client.go
@@ -0,0 +1,132 @@
package alertmanager

import (
	"flag"
	"time"

	"github.com/go-kit/kit/log"
	"github.com/pkg/errors"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"google.golang.org/grpc"
	"google.golang.org/grpc/health/grpc_health_v1"

	"github.com/cortexproject/cortex/pkg/alertmanager/alertmanagerpb"
	"github.com/cortexproject/cortex/pkg/ring/client"
	"github.com/cortexproject/cortex/pkg/util/grpcclient"
	"github.com/cortexproject/cortex/pkg/util/tls"
)

// ClientsPool is the interface used to get the client from the pool for a specified address.
type ClientsPool interface {
	// GetClientFor returns the alertmanager client for the given address.
	GetClientFor(addr string) (Client, error)
}

// Client is the interface that should be implemented by any client used to read/write data to an alertmanager via GRPC.
type Client interface {
	alertmanagerpb.AlertmanagerClient

	// RemoteAddress returns the address of the remote alertmanager and is used to uniquely
	// identify an alertmanager instance.
	RemoteAddress() string
}

// ClientConfig is the configuration struct for the alertmanager client.
type ClientConfig struct {
	RemoteTimeout time.Duration    `yaml:"remote_timeout"`
	TLSEnabled    bool             `yaml:"tls_enabled"`
	TLS           tls.ClientConfig `yaml:",inline"`
}

// RegisterFlagsWithPrefix registers flags with prefix.
func (cfg *ClientConfig) RegisterFlagsWithPrefix(prefix string, f *flag.FlagSet) {
	f.BoolVar(&cfg.TLSEnabled, prefix+".tls-enabled", cfg.TLSEnabled, "Enable TLS in the GRPC client. This flag needs to be enabled when any other TLS flag is set. If set to false, insecure connection to gRPC server will be used.")
	f.DurationVar(&cfg.RemoteTimeout, prefix+".remote-timeout", 2*time.Second, "Timeout for downstream alertmanagers.")
	cfg.TLS.RegisterFlagsWithPrefix(prefix, f)
}

type alertmanagerClientsPool struct {
	pool *client.Pool
}

func newAlertmanagerClientsPool(discovery client.PoolServiceDiscovery, amClientCfg ClientConfig, logger log.Logger, reg prometheus.Registerer) ClientsPool {
	// We prefer sane defaults instead of exposing further config options.
	grpcCfg := grpcclient.Config{
		MaxRecvMsgSize:      16 * 1024 * 1024,
		MaxSendMsgSize:      4 * 1024 * 1024,
		GRPCCompression:     "",
		RateLimit:           0,
		RateLimitBurst:      0,
		BackoffOnRatelimits: false,
		TLSEnabled:          amClientCfg.TLSEnabled,
		TLS:                 amClientCfg.TLS,
	}

	requestDuration := promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
		Name:    "cortex_alertmanager_distributor_client_request_duration_seconds",
		Help:    "Time spent executing requests from an alertmanager to another alertmanager.",
		Buckets: prometheus.ExponentialBuckets(0.008, 4, 7),
	}, []string{"operation", "status_code"})

	factory := func(addr string) (client.PoolClient, error) {
		return dialAlertmanagerClient(grpcCfg, addr, requestDuration)
	}

	poolCfg := client.PoolConfig{
		CheckInterval:      time.Minute,
		HealthCheckEnabled: true,
		HealthCheckTimeout: 10 * time.Second,
	}

	clientsCount := promauto.With(reg).NewGauge(prometheus.GaugeOpts{
		Namespace: "cortex",
		Name:      "alertmanager_distributor_clients",
		Help:      "The current number of alertmanager distributor clients in the pool.",
	})

	return &alertmanagerClientsPool{pool: client.NewPool("alertmanager", poolCfg, discovery, factory, clientsCount, logger)}
}

func (f *alertmanagerClientsPool) GetClientFor(addr string) (Client, error) {
	c, err := f.pool.GetClientFor(addr)
	if err != nil {
		return nil, err
	}
	return c.(Client), nil
}

func dialAlertmanagerClient(cfg grpcclient.Config, addr string, requestDuration *prometheus.HistogramVec) (*alertmanagerClient, error) {
	opts, err := cfg.DialOption(grpcclient.Instrument(requestDuration))
	if err != nil {
		return nil, err
	}
	conn, err := grpc.Dial(addr, opts...)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to dial alertmanager %s", addr)
	}

	return &alertmanagerClient{
		AlertmanagerClient: alertmanagerpb.NewAlertmanagerClient(conn),
		HealthClient:       grpc_health_v1.NewHealthClient(conn),
		conn:               conn,
	}, nil
}

type alertmanagerClient struct {
	alertmanagerpb.AlertmanagerClient
	grpc_health_v1.HealthClient
	conn *grpc.ClientConn
}

func (c *alertmanagerClient) Close() error {
	return c.conn.Close()
}

func (c *alertmanagerClient) String() string {
	return c.RemoteAddress()
}

func (c *alertmanagerClient) RemoteAddress() string {
	return c.conn.Target()
}
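
The request-duration histogram in the file above is configured with `prometheus.ExponentialBuckets(0.008, 4, 7)`. To make the resulting bucket boundaries concrete, the sketch below applies the same start/factor/count rule using only the standard library. `exponentialBuckets` is a hypothetical helper written for this illustration and is not the Prometheus function itself.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical stand-in that returns count upper
// bounds, beginning at start and multiplying by factor at every step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// Same arguments as the request-duration histogram in the diff.
	for _, b := range exponentialBuckets(0.008, 4, 7) {
		fmt.Printf("%.3fs\n", b)
	}
}
```

Under these assumptions the seven boundaries run from 8 ms up to about 32.8 s, each four times the previous one.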
4 changes: 2 additions & 2 deletions pkg/alertmanager/alertmanager_ring.go
@@ -26,7 +26,7 @@ const (
 	RingNumTokens = 128
 )

-// RingOp is the operation used for distributing tenants between alertmanagers.
+// RingOp is the operation used for reading/writing to the alertmanagers.
 var RingOp = ring.NewOp([]ring.IngesterState{ring.ACTIVE}, func(s ring.IngesterState) bool {
 	// Only ACTIVE Alertmanager get requests. If instance is not ACTIVE, we need to find another Alertmanager.
 	return s != ring.ACTIVE
@@ -77,7 +77,7 @@ func (cfg *RingConfig) RegisterFlags(f *flag.FlagSet) {
 	cfg.InstanceInterfaceNames = []string{"eth0", "en0"}
 	f.Var((*flagext.StringSlice)(&cfg.InstanceInterfaceNames), rfprefix+"instance-interface-names", "Name of network interface to read address from.")
 	f.StringVar(&cfg.InstanceAddr, rfprefix+"instance-addr", "", "IP address to advertise in the ring.")
-	f.IntVar(&cfg.InstancePort, rfprefix+"instance-port", 0, "Port to advertise in the ring (defaults to server.http-listen-port).")
+	f.IntVar(&cfg.InstancePort, rfprefix+"instance-port", 0, "Port to advertise in the ring (defaults to server.grpc-listen-port).")
 	f.StringVar(&cfg.InstanceID, rfprefix+"instance-id", hostname, "Instance ID to register in the ring.")

 	cfg.RingCheckPeriod = 5 * time.Second