Implement lifecycle events, results, diagnostics with InfluxDB #3
Conversation
Overall I like this, and it works in my tests.
Test plans are able to record points as well as rcrowley/go-metrics metrics, and this seems like a good spot to target. I would feel comfortable with this being merged.
func (b *batcher) send() {
	points, err := client.NewBatchPoints(client.BatchPointsConfig{Database: "testground"})
Do we need to specify the precision here? I actually don't know.
The default precision is ns, that's why I didn't specify it ;-)
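For reference, a minimal sketch of what pinning the precision explicitly could look like, assuming the influxdb1-client v2 package is the client in play here:

package main

import (
	"log"

	client "github.com/influxdata/influxdb1-client/v2"
)

func main() {
	// Pin the precision explicitly rather than relying on the "ns" default.
	bp, err := client.NewBatchPoints(client.BatchPointsConfig{
		Database:  "testground",
		Precision: "ns",
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("batch precision:", bp.Precision())
}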
t.Run("batches_by_length", test(newBatcher(re, nil, 10, 24*time.Hour,
	retry.Attempts(3),
	retry.Delay(100*time.Millisecond),
What are the conditions under which a retry occurs? I am thinking 100 milliseconds may not be long enough if the database or the network is overloaded. I would, I think, increase this to random delays over a longer period, or exponential backoff.
I see there is a BackoffDelay and a RandomDelay function there.
Basically all errors from the influxdb client trigger a retry. The default options are here:
https://github.com/testground/sdk-go/pull/3/files#diff-dc88d3a8519776f0b76229b9c65c7e10R18-R26
The library already backs off by default, so I don't think we need anything else here. If all attempts fail, the metrics will stay in the queue, and we will try to flush them with every batch tick. Also, when we close the RunEnv, we dispatch all queued metrics at once.
I think we're covered here for a first version -- we can adjust the defaults based on the feedback we gather.
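For illustration, a minimal sketch of switching the fixed delay to exponential backoff, assuming these options come from the avast/retry-go library (an assumption based on the Attempts/Delay options in the diff):

package main

import (
	"errors"
	"log"
	"time"

	"github.com/avast/retry-go"
)

// sendBatch is a hypothetical stand-in for the batcher's InfluxDB write.
func sendBatch() error {
	return errors.New("influxdb write failed")
}

func main() {
	err := retry.Do(
		sendBatch,
		retry.Attempts(3),
		retry.Delay(100*time.Millisecond),
		// BackOffDelay grows the delay between attempts instead of keeping it fixed.
		retry.DelayType(retry.BackOffDelay),
	)
	log.Println("send result:", err)
}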
runtime/runenv_test.go
func init() {
	_ = os.Setenv("INFLUXDB_URL", "http://localhost:9999")
This reminds me, we need to set INFLUXDB_URL on the local docker runner.
Done here: testground/testground#934
I ended up picking INFLUXDB_ADDR as a name because the influxdb v1 client refers to the URL as an "Addr", which is wrong because in Go an addr has a different connotation. I'm putting forth a tiny PR to rename this back to INFLUXDB_URL, which is more correct.
And setting this env var is not needed in tests, because we inject a test client anyway.
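For reference, a small sketch of wiring the env var into the v1 client, assuming the influxdb1-client v2 package; note its HTTPConfig field really is called Addr, which is where the naming detour came from:

package main

import (
	"log"
	"os"

	client "github.com/influxdata/influxdb1-client/v2"
)

func main() {
	url := os.Getenv("INFLUXDB_URL")
	if url == "" {
		url = "http://localhost:9999" // same fallback as the test above
	}
	// The v1 client calls the URL "Addr", hence the earlier naming confusion.
	c, err := client.NewHTTPClient(client.HTTPConfig{Addr: url})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
	log.Println("influxdb client ready for", url)
}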
@coryschwartz I'm taking your comment to be an approval. Pinging @nonsense in case he wants to add any thoughts here. If nothing comes in within the next few minutes, I'll merge this and make a release, as we need to integrate downstream.
@coryschwartz I renamed the aggregated metrics accessors to drop the New prefix. Since we use "GetOrRegister" under the hood, it's not correct to connote that we're always constructing.
case nil, prometheus.AlreadyRegisteredError:
default:
	panic(err)

func (m *Metrics) writeToInfluxDBSink(measurement string) MetricSinkFn {
@raulk @coryschwartz I think we have to rethink this method. I don't think we want to send our whole metrics registry into two measurements, but rather create a measurement for every type of thing we want to measure.
Each metric we register is generally one measurement IMO, and since there are 5-6 types (counter, gauge, timer, histogram, etc.), they have different fields.
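To make that concrete, a hedged sketch of emitting one measurement per metric with type-specific fields, using the rcrowley/go-metrics and influxdb1-client packages; the metric name fetch_latency and the tags are purely illustrative:

package main

import (
	"log"
	"time"

	client "github.com/influxdata/influxdb1-client/v2"
	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	r := metrics.NewRegistry()
	h := metrics.GetOrRegisterHistogram("fetch_latency", r, metrics.NewExpDecaySample(1028, 0.015))
	h.Update(42)

	// One measurement per metric, with histogram-specific fields, instead of
	// funnelling the whole registry into a single shared measurement.
	fields := map[string]interface{}{
		"min":  h.Min(),
		"max":  h.Max(),
		"mean": h.Mean(),
		"p99":  h.Percentile(0.99),
	}
	tags := map[string]string{"plan": "example", "run": "1"}

	p, err := client.NewPoint("fetch_latency", tags, fields, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	log.Println(p.String())
}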
"error": evt.Error, | ||
} | ||
|
||
p, err := client.NewPoint("events", tags, f) |
Same comment as above - I think the schema here should be events.%s, where %s is the event name. Then every event generally has its own set of fields and tags that are filled for every single entry.
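A tiny sketch of that schema, with a purely illustrative event name and fields:

package main

import (
	"fmt"
	"log"
	"time"

	client "github.com/influxdata/influxdb1-client/v2"
)

func main() {
	// Hypothetical event name and fields; only the measurement naming matters here.
	eventName := "stage_start"
	tags := map[string]string{"plan": "example"}
	fields := map[string]interface{}{"error": ""}

	// One measurement per event type, e.g. "events.stage_start".
	p, err := client.NewPoint(fmt.Sprintf("events.%s", eventName), tags, fields, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	log.Println(p.String())
}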
This PR addresses points A, C, D of testground/testground#907.
Closes testground/testground#908.
Closes testground/testground#909.
Closes testground/testground#910.
SDK API design
There are three kinds of observations to record in the metrics store (InfluxDB), each with a different circuit.
a. Lifecycle events
Purpose
To facilitate real-time progress monitoring, either by a human, or by the forthcoming watchtower/supervisor service.
Expected volume: discrete events.
Availability / ingestion
Inserted immediately into the metrics store via direct call to the InfluxDB API.
API design
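To give a feel for the shape, a rough sketch of what such a surface could look like; every name below is a hypothetical placeholder, not necessarily what this PR ships:

package runtime

// Illustrative only: all names here are hypothetical placeholders.

// RunEnv stands in for the test instance's run environment.
type RunEnv struct{ /* fields elided */ }

// RecordStart emits a lifecycle event marking the start of the instance,
// written immediately to InfluxDB.
func (re *RunEnv) RecordStart() {}

// RecordSuccess emits a lifecycle event marking successful completion.
func (re *RunEnv) RecordSuccess() {}

// RecordFailure emits a lifecycle event carrying the failure error.
func (re *RunEnv) RecordFailure(err error) {}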
b. Diagnostics
Purpose
To track insights about the test instance execution itself, e.g. sync service stats, stages entry/exit, network API events, runtime stats (e.g. go runtime metrics).
Conceptually speaking, metrics are metadata about the test execution; they are relevant for debugging, troubleshooting, optimisation, and real-time monitoring.
They capture insights about the “border” between testground and the test plan itself. As such:
Expected volume: low.
Availability / ingestion
Inserted immediately into the metrics store via direct call to the InfluxDB API.
Recorded in file diagnostics.out.
API
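Again only as a sketch, diagnostics could be exposed as a go-metrics-style namespace on the RunEnv; D() and the accessor names are hypothetical placeholders:

package runtime

import metrics "github.com/rcrowley/go-metrics"

// Illustrative only: D() and the accessor names are hypothetical placeholders.

// Metrics wraps a go-metrics registry plus the sinks it flushes to.
type Metrics struct{ registry metrics.Registry }

// RunEnv stands in for the test instance's run environment.
type RunEnv struct{ diagnostics *Metrics }

// D returns the diagnostics namespace: pushed to InfluxDB immediately and
// mirrored to diagnostics.out.
func (re *RunEnv) D() *Metrics { return re.diagnostics }

// Counter returns (or registers) a counter under the given name.
func (m *Metrics) Counter(name string) metrics.Counter {
	return metrics.GetOrRegisterCounter(name, m.registry)
}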
c. Results
Purpose
Recording observations about the subsystems and components under test. Conceptually speaking, results are a part of the test output.
Results are the end goal of running a test plan. Results feed comparative series over runs of a test plan, across time and across dependency sets.
Availability / ingestion
Batch-insertion into InfluxDB when the test plan instance concludes.
In the future, potentially via a collector daemon such as vector, filebeat, or fluentd.
API design
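As a final sketch, results could share the accessor pattern but live in a separate namespace that is batch-flushed when the instance concludes; R(), RecordPoint and the flush-on-close behaviour are hypothetical placeholders:

package runtime

// Illustrative only: all names here are hypothetical placeholders.

// point is a queued result observation.
type point struct {
	name  string
	value float64
}

// Metrics collects result observations and queues them for a batched write.
type Metrics struct{ queue []point }

// RunEnv stands in for the test instance's run environment.
type RunEnv struct{ results *Metrics }

// R returns the results namespace.
func (re *RunEnv) R() *Metrics { return re.results }

// RecordPoint queues a single named observation; the queue is written to
// InfluxDB in one batch when the instance concludes.
func (m *Metrics) RecordPoint(name string, value float64) {
	m.queue = append(m.queue, point{name: name, value: value})
}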