
Implement lifecycle events, results, diagnostics with InfluxDB #3

Merged
merged 13 commits into master from feat/metrics
Apr 30, 2020

Conversation

@raulk (Contributor) commented Apr 27, 2020

This PR addresses points A, C, D of testground/testground#907.

Closes testground/testground#908.
Closes testground/testground#909.
Closes testground/testground#910.

SDK API design

There are three kinds of observations to record in the metrics store (InfluxDB), each with a different circuit.

a. Lifecycle events

Purpose

To facilitate real-time progress monitoring, either by a human, or by the forthcoming watchtower/supervisor service.

Expected volume: low; discrete events.

Availability / ingestion

Inserted immediately into the metrics store via direct call to the InfluxDB API.

API design

runenv.RecordStart()
runenv.RecordFailure(reason)
runenv.RecordCrash(err)
runenv.RecordSuccess()
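
For illustration, a minimal sketch of a test plan wired to these calls. doWork is a hypothetical stand-in for the actual test logic, and the split between RecordFailure and RecordCrash shown in the comments is an assumption, not something this PR prescribes:

package main

import "github.com/testground/sdk-go/runtime"

func main() { runtime.Invoke(run) } // assuming the SDK's Invoke entrypoint

// doWork is a hypothetical stand-in for the actual test logic.
func doWork(runenv *runtime.RunEnv) error { return nil }

func run(runenv *runtime.RunEnv) error {
	runenv.RecordStart() // lifecycle: instance has started
	if err := doWork(runenv); err != nil {
		runenv.RecordFailure(err) // an asserted failure; RecordCrash(err) for unexpected errors
		return err
	}
	runenv.RecordSuccess() // lifecycle: instance completed successfully
	return nil
}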

b. Diagnostics

Purpose

To track insights about the test instance execution itself, e.g. sync service stats, stage entry/exit, network API events, and runtime stats (e.g. Go runtime metrics).

Conceptually speaking, diagnostics are metadata about the test execution; they are relevant for debugging, troubleshooting, optimisation, and real-time monitoring.
They capture insights about the “border” between testground and the test plan itself. As such:

  • Usefulness for test developers is indirect: helps them understand the pressure their test plan exerts on testground.
  • Usefulness for testground developers is direct: helps us comprehend how testground is performing.

Expected volume: low.

Availability / ingestion

Inserted immediately into the metrics store via direct call to the InfluxDB API.
Recorded in file diagnostics.out.

API design

runenv.D().RecordPoint(name, value, unit)
runenv.D().Counter()			via go-metrics
runenv.D().EWMA()				via go-metrics
runenv.D().Gauge()				via go-metrics
runenv.D().GaugeF()				via go-metrics
runenv.D().Histogram()			via go-metrics
runenv.D().Meter()				via go-metrics
runenv.D().Timer()				via go-metrics
runenv.D().SetFrequency():
  - sets the frequency to materialise aggregated metrics.
runenv.D().Disable():
  - (not implemented) disable diagnostics for this instance.
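
A usage sketch, following the signatures listed above (the metric names and the 5-second frequency are invented for illustration):

runenv.D().SetFrequency(5 * time.Second)                // materialise aggregates every 5s
runenv.D().RecordPoint("sync.connect_time", 42.0, "ms") // one-off point
msgs := runenv.D().Counter("sync.msgs_received")        // aggregated via go-metrics
msgs.Inc(1)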

c. Results

Purpose

Recording observations about the subsystems and components under test. Conceptually speaking, results are a part of the test output.

Results are the end goal of running a test plan. Results feed comparative series over runs of a test plan, over time and across dependency sets.

Availability / ingestion

Batch-insertion into InfluxDB when the test plan instance concludes.

In the future, potentially via a collector daemon such as vector, filebeat, fluentd.

API design

runenv.R().RecordPoint(name, value, unit)
runenv.R().Counter()			via go-metrics
runenv.R().EWMA()				via go-metrics
runenv.R().Gauge()				via go-metrics
runenv.R().GaugeF()				via go-metrics
runenv.R().Histogram()			via go-metrics
runenv.R().Meter()				via go-metrics
runenv.R().Timer()				via go-metrics
runenv.R().SetFrequency():
  - sets the frequency to materialise aggregated metrics.
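
An equivalent sketch for results (names are illustrative; the differences from diagnostics are the R() accessor and the batched ingestion path):

runenv.R().RecordPoint("time_to_first_byte", 123.0, "ms")
runenv.R().Counter("fetch.success").Inc(1)
runenv.R().Gauge("peers.connected").Update(16)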

@coryschwartz left a comment:

Overall I like this, and it works in my tests.
Test plans are able to record points as well as rcrowley/go-metrics aggregated metrics, and this seems like a good spot to target. I would feel comfortable with this being merged.

}

func (b *batcher) send() {
points, err := client.NewBatchPoints(client.BatchPointsConfig{Database: "testground"})


Do we need to specify the precision here? I actually don't know.

@raulk (author) replied:

The default precision is ns, that's why I didn't specify it ;-)
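
For reference, the explicit equivalent against the influxdb v1 client (a sketch; "ns" mirrors the default used when Precision is left empty):

points, err := client.NewBatchPoints(client.BatchPointsConfig{
	Database:  "testground",
	Precision: "ns", // explicit form of the default
})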


t.Run("batches_by_length", test(newBatcher(re, nil, 10, 24*time.Hour,
retry.Attempts(3),
retry.Delay(100*time.Millisecond),


Under what conditions does a retry occur? I'm thinking 100 milliseconds may not be long enough if the database or the network is overloaded. I would increase this to random delays over a longer period, or exponential backoff.

I see the library has BackoffDelay and RandomDelay functions.

@raulk (author) replied:

Basically all errors from the influxdb client trigger a retry. The default options are here:

https://github.com/testground/sdk-go/pull/3/files#diff-dc88d3a8519776f0b76229b9c65c7e10R18-R26


The library already backs off by default; I don't think we need anything else here.


If all attempts fail, the metrics will still stay in queue, and we will try to flush them with every batch tick.

Also when we close the RunEnv, we dispatch all queued metrics at once.

I think we're covered here for a first version -- we can adjust the defaults based on the feedback we gather.
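
If we do end up tuning, a sketch of what that could look like with this retry library's options (values invented, not what this PR ships):

opts := []retry.Option{
	retry.Attempts(5),
	retry.Delay(500 * time.Millisecond),
	// combine exponential backoff with jitter
	retry.DelayType(retry.CombineDelay(retry.BackOffDelay, retry.RandomDelay)),
}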

)

func init() {
_ = os.Setenv("INFLUXDB_URL", "http://localhost:9999")


This reminds me, we need to set INFLUXDB_URL on local docker runner.

@raulk (author) replied:

Done here: testground/testground#934

I ended up picking INFLUXDB_ADDR as the name because the influxdb v1 client refers to the URL as an "Addr", which is misleading: in Go, an addr usually denotes a host:port, not a full URL.

I'm putting forth a tiny PR to rename this back to INFLUXDB_URL, which is more correct.

And setting this env var is not needed in tests, because we inject a test client anyway.

@raulk commented Apr 30, 2020

@coryschwartz I'm taking your comment to be an approval. Pinging @nonsense in case he wants to add any thoughts here. If nothing comes in within the next few minutes, I'll merge this and cut a release, as we need to integrate downstream.

@raulk commented Apr 30, 2020

@coryschwartz I renamed the aggregated metrics accessors to drop the New prefix. Since we use "GetOrRegister" under the hood, it's not correct to connote that we're always constructing.

@raulk raulk merged commit 207954a into master Apr 30, 2020
@raulk raulk deleted the feat/metrics branch April 30, 2020 10:54
@Robmat05 Robmat05 added the done Completed/Fixed label Apr 30, 2020
@Robmat05 Robmat05 added this to the Testground v0.5 milestone Apr 30, 2020
case nil, prometheus.AlreadyRegisteredError:
default:
panic(err)
func (m *Metrics) writeToInfluxDBSink(measurement string) MetricSinkFn {
@nonsense (Member) commented:

@raulk @coryschwartz I think we have to rethink this method. I don't think we want to send our whole metrics registry into two measurements, but rather create a measurement for every type of thing we want to measure.

Each metric we measure is generally one measurement IMO, and since there are 5-6 types (counter, gauge, timer, histogram, etc.), each type has its own set of fields.
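
A sketch of what I mean (measurement names and fields invented):

// one measurement per metric type, each with its own field set, e.g.:
//   diagnostics.counters   -> fields: count
//   diagnostics.histograms -> fields: min, max, mean, p95, ...
p, err := client.NewPoint("diagnostics.counters", tags, fields, time.Now())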

"error": evt.Error,
}

p, err := client.NewPoint("events", tags, f)
@nonsense (Member) commented:

Same comment as above - I think the schema here should be events.%s where %s is the event name. Then every event generally has its own set of fields and tags that are filled for every single entry.
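
i.e. something like the following (a sketch; evt.Type as the source of the event name is an assumption):

// one measurement per event type, e.g. "events.start", "events.failure"
p, err := client.NewPoint(fmt.Sprintf("events.%s", evt.Type), tags, f)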
