
Implement lifecycle events, results, diagnostics with InfluxDB #3

Merged
merged 13 commits into master from feat/metrics
Apr 30, 2020

Conversation

@raulk (Contributor) commented Apr 27, 2020

This PR addresses points A, C, D of testground/testground#907.

Closes testground/testground#908.
Closes testground/testground#909.
Closes testground/testground#910.

SDK API design

There are three kinds of observations to record in the metrics store (InfluxDB), each with a different circuit.

a. Lifecycle events

Purpose

To facilitate real-time progress monitoring, either by a human, or by the forthcoming watchtower/supervisor service.

Expected volume: low; discrete events.

Availability / ingestion

Inserted immediately into the metrics store via direct call to the InfluxDB API.

API design

runenv.RecordStart()
runenv.RecordFailure(reason)
runenv.RecordCrash(err)
runenv.RecordSuccess()
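
For illustration, a minimal sketch of a test plan wired to these calls. doWork is a hypothetical stand-in for the actual test logic, and the split between RecordFailure and RecordCrash shown in the comments is an assumption, not something this PR prescribes:

package main

import "github.com/testground/sdk-go/runtime"

func main() { runtime.Invoke(run) } // assuming the SDK's Invoke entrypoint

// doWork is a hypothetical stand-in for the actual test logic.
func doWork(runenv *runtime.RunEnv) error { return nil }

func run(runenv *runtime.RunEnv) error {
	runenv.RecordStart() // lifecycle: instance has started
	if err := doWork(runenv); err != nil {
		runenv.RecordFailure(err) // an asserted failure; RecordCrash(err) for unexpected errors
		return err
	}
	runenv.RecordSuccess() // lifecycle: instance completed successfully
	return nil
}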

b. Diagnostics

Purpose

To track insights about the test instance execution itself, e.g. sync service stats, stage entry/exit, network API events, and runtime stats (e.g. Go runtime metrics).

Conceptually speaking, diagnostics are metadata about the test execution; they are relevant for debugging, troubleshooting, optimisation, and real-time monitoring.
They capture insights about the “border” between testground and the test plan itself. As such:

  • Usefulness for test developers is indirect: helps them understand the pressure their test plan exerts on testground.
  • Usefulness for testground developers is direct: helps us comprehend how testground is performing.

Expected volume: low.

Availability / ingestion

Inserted immediately into the metrics store via direct call to the InfluxDB API.
Recorded in file diagnostics.out.

API design

runenv.D().RecordPoint(name, value, unit)
runenv.D().Counter()			via go-metrics
runenv.D().EWMA()				via go-metrics
runenv.D().Gauge()				via go-metrics
runenv.D().GaugeF()				via go-metrics
runenv.D().Histogram()			via go-metrics
runenv.D().Meter()				via go-metrics
runenv.D().Timer()				via go-metrics
runenv.D().SetFrequency():
  - sets the frequency to materialise aggregated metrics.
runenv.D().Disable():
  - (not implemented) disable diagnostics for this instance.
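
A usage sketch, following the signatures listed above (the metric names and the 5-second frequency are invented for illustration):

runenv.D().SetFrequency(5 * time.Second)                // materialise aggregates every 5s
runenv.D().RecordPoint("sync.connect_time", 42.0, "ms") // one-off point
msgs := runenv.D().Counter("sync.msgs_received")        // aggregated via go-metrics
msgs.Inc(1)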

c. Results

Purpose

Recording observations about the subsystems and components under test. Conceptually speaking, results are a part of the test output.

Results are the end goal of running a test plan. Results feed comparative series over runs of a test plan, over time and across dependency sets.

Availability / ingestion

Batch-insertion into InfluxDB when the test plan instance concludes.

In the future, potentially via a collector daemon such as vector, filebeat, fluentd.

API design

runenv.R().RecordPoint(name, value, unit)
runenv.R().Counter()			via go-metrics
runenv.R().EWMA()				via go-metrics
runenv.R().Gauge()				via go-metrics
runenv.R().GaugeF()				via go-metrics
runenv.R().Histogram()			via go-metrics
runenv.R().Meter()				via go-metrics
runenv.R().Timer()				via go-metrics
runenv.R().SetFrequency():
  - sets the frequency to materialise aggregated metrics.
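
An equivalent sketch for results (names are illustrative; the differences from diagnostics are the R() accessor and the batched ingestion path):

runenv.R().RecordPoint("time_to_first_byte", 123.0, "ms")
runenv.R().Counter("fetch.success").Inc(1)
runenv.R().Gauge("peers.connected").Update(16)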

@coryschwartz left a comment:

Overall I like this, and it works in my tests.
Test plans are able to record points as well as rcrowley/go-metrics aggregated metrics, and this seems like a good spot to target. I would feel comfortable with this being merged.

}

func (b *batcher) send() {
points, err := client.NewBatchPoints(client.BatchPointsConfig{Database: "testground"})


Do we need to specify the precision here? I actually don't know.

@raulk (author) replied:

The default precision is ns, that's why I didn't specify it ;-)
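
For reference, the explicit equivalent against the influxdb v1 client (a sketch; "ns" mirrors the default used when Precision is left empty):

points, err := client.NewBatchPoints(client.BatchPointsConfig{
	Database:  "testground",
	Precision: "ns", // explicit form of the default
})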


t.Run("batches_by_length", test(newBatcher(re, nil, 10, 24*time.Hour,
retry.Attempts(3),
retry.Delay(100*time.Millisecond),


Under what conditions does a retry occur? I'm thinking 100 milliseconds may not be long enough if the database or the network is overloaded. I would increase this to random delays over a longer period, or exponential backoff.

I see the library has BackoffDelay and RandomDelay functions.

@raulk (author) replied:

Basically all errors from the influxdb client trigger a retry. The default options are here:

https://github.com/testground/sdk-go/pull/3/files#diff-dc88d3a8519776f0b76229b9c65c7e10R18-R26


The library already backs off by default; I don't think we need anything else here.


If all attempts fail, the metrics will still stay in queue, and we will try to flush them with every batch tick.

Also when we close the RunEnv, we dispatch all queued metrics at once.

I think we're covered here for a first version -- we can adjust the defaults based on the feedback we gather.
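
If we do end up tuning, a sketch of what that could look like with this retry library's options (values invented, not what this PR ships):

opts := []retry.Option{
	retry.Attempts(5),
	retry.Delay(500 * time.Millisecond),
	// combine exponential backoff with jitter
	retry.DelayType(retry.CombineDelay(retry.BackOffDelay, retry.RandomDelay)),
}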

)

func init() {
_ = os.Setenv("INFLUXDB_URL", "http://localhost:9999")


This reminds me, we need to set INFLUXDB_URL on local docker runner.

@raulk (author) replied:

Done here: testground/testground#934

I ended up picking INFLUXDB_ADDR as the name because the influxdb v1 client refers to the URL as an "Addr", which is misleading: in Go, an addr usually denotes a host:port, not a full URL.

I'm putting forth a tiny PR to rename this back to INFLUXDB_URL, which is more correct.

And setting this env var is not needed in tests, because we inject a test client anyway.

@raulk commented Apr 30, 2020

@coryschwartz I'm taking your comment to be an approval. Pinging @nonsense in case he wants to add any thoughts here. If nothing comes in within the next few minutes, I'll merge this and cut a release, as we need to integrate downstream.

@raulk commented Apr 30, 2020

@coryschwartz I renamed the aggregated metrics accessors to drop the New prefix. Since we use "GetOrRegister" under the hood, it's not correct to connote that we're always constructing.

@raulk raulk merged commit 207954a into master Apr 30, 2020
@raulk raulk deleted the feat/metrics branch April 30, 2020 10:54
@Robmat05 Robmat05 added the done Completed/Fixed label Apr 30, 2020
@Robmat05 Robmat05 added this to the Testground v0.5 milestone Apr 30, 2020
case nil, prometheus.AlreadyRegisteredError:
default:
panic(err)
func (m *Metrics) writeToInfluxDBSink(measurement string) MetricSinkFn {
@nonsense (Member) commented:

@raulk @coryschwartz I think we have to rethink this method. I don't think we want to send our whole metrics registry into two measurements, but rather create a measurement for every type of thing we want to measure.

Each metric we measure is generally one measurement IMO, and since there are 5-6 types (counter, gauge, timer, histogram, etc.), each type has its own set of fields.
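
A sketch of what I mean (measurement names and fields invented):

// one measurement per metric type, each with its own field set, e.g.:
//   diagnostics.counters   -> fields: count
//   diagnostics.histograms -> fields: min, max, mean, p95, ...
p, err := client.NewPoint("diagnostics.counters", tags, fields, time.Now())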

"error": evt.Error,
}

p, err := client.NewPoint("events", tags, f)
@nonsense (Member) commented:

Same comment as above - I think the schema here should be events.%s where %s is the event name. Then every event generally has its own set of fields and tags that are filled for every single entry.
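
i.e. something like the following (a sketch; evt.Type as the source of the event name is an assumption):

// one measurement per event type, e.g. "events.start", "events.failure"
p, err := client.NewPoint(fmt.Sprintf("events.%s", evt.Type), tags, f)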
