Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate trace logs for a number of important InfluxDB operations #9456

Merged
merged 1 commit into from
Feb 21, 2018

Conversation

stuartcarnie
Copy link
Contributor

@stuartcarnie stuartcarnie commented Feb 16, 2018

Notes

Traces create a child logger with additional context that is used to log any subsequent events during the trace. Specifically

  • op.name is a unique identifier for the operation. We can filter on all operations of a specific name. An example of an op.name might be tsm1.compact_group
  • op.event currently has two values start or end indicating the start or end of an operation
    • for example, you can grep by values in op.name AND op.event to find all starting operation logs
  • Traces begin and end with a consistent log entry
    • traces begin with a friendly description of the operation, like TSM compaction and the op.event context
    • Traces end with the same description of the operation, like TSM compaction
  • trace_id is a unique identifier used for a specific instance of a trace and can be used to correlate all related log events
  • log_id is added to every log entry for a single execution of the influxd process, allowing all log entries for a single run to be easily identified. There are also other ways a log file could be split by a single execution, however, the consistent log_id will make it easier when searching log aggregations services.
  • op.elapsed is the amount of time the operation was spent executing and is logged with ending trace entry
  • a few consistent context keys have been established when logging related events
    • db.instance identifies the name of the database
    • db.rp identifies the name of the retention policy
    • db.shard_id identifies the id of the shard
    • db.shard_group identifies the shard group identifier

Tooling

There are a number of tools for processing and filtering log files in JSON or logfmt

  • hutils from heroku are designed to work with logfmt
  • influxlog – WIP will be our tool to quickly report on various interesting metrics and / or conditions
    • this will go hand-in-hand with all the usual text filtering
  • lnav – useful for watching and filtering files real-time

Operations

  • tsdb.Open traces all events related to the initial opening of the tsdb.Store
    • op.name : tsdb.open
  • retention policy shard deletions
    • op.name : retention.delete_check
  • all TSM compaction strategies
    • op.name : tsm1.compact_group
    • strategy : (level|full)
    • level : [1,3]
    • optimize : (true|false)
  • series file compactions
    • op.name : series_partition.compaction
  • continuous query execution (if logging enabled)
    • op.name : continuous_querier.execute
  • TSI log file compaction
    • op.name : index.tsi.compact_log_file
  • TSI level compaction
    • op.name : index.tsi.compact_to_level

Some examples of related trace log entries in logfmt:

ts=2018-02-21T20:22:06.293163Z lvl=info msg="Cache snapshot (start)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWKbqG000 op.name=cache.snapshot op.event=start
ts=2018-02-21T20:22:06.373734Z lvl=info msg="Snapshot for path written" log_id=06QW8yAl000 engine=tsm1 path=/Users/stuartcarnie/.influxdb/data/bar/autogen/438 duration=80.582ms
ts=2018-02-21T20:22:06.373775Z lvl=info msg="Cache snapshot (end)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWKbqG000 op.name=cache.snapshot op.event=end op.elapsed=80ms
ts=2018-02-21T20:22:09.292390Z lvl=info msg="Cache snapshot (start)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWKnZ0000 op.name=cache.snapshot op.event=start
ts=2018-02-21T20:22:09.375122Z lvl=info msg="Snapshot for path written" log_id=06QW8yAl000 engine=tsm1 path=/Users/stuartcarnie/.influxdb/data/bar/autogen/438 duration=82.744ms
ts=2018-02-21T20:22:09.375169Z lvl=info msg="Cache snapshot (end)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWKnZ0000 op.name=cache.snapshot op.event=end op.elapsed=82ms
ts=2018-02-21T20:22:12.292417Z lvl=info msg="Cache snapshot (start)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWKzH0000 op.name=cache.snapshot op.event=start
ts=2018-02-21T20:22:12.371576Z lvl=info msg="Snapshot for path written" log_id=06QW8yAl000 engine=tsm1 path=/Users/stuartcarnie/.influxdb/data/bar/autogen/438 duration=79.174ms
ts=2018-02-21T20:22:12.371621Z lvl=info msg="Cache snapshot (end)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWKzH0000 op.name=cache.snapshot op.event=end op.elapsed=79ms
ts=2018-02-21T20:22:13.292489Z lvl=info msg="TSM compaction (start)" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group op.event=start
ts=2018-02-21T20:22:13.292571Z lvl=info msg="Beginning compaction" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group files=8
ts=2018-02-21T20:22:13.292618Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=0 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000122-000000001.tsm
ts=2018-02-21T20:22:13.292653Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=1 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000123-000000001.tsm
ts=2018-02-21T20:22:13.292670Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=2 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000124-000000001.tsm
ts=2018-02-21T20:22:13.292685Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=3 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000125-000000001.tsm
ts=2018-02-21T20:22:13.292700Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=4 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000126-000000001.tsm
ts=2018-02-21T20:22:13.292727Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=5 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000127-000000001.tsm
ts=2018-02-21T20:22:13.292755Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=6 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000128-000000001.tsm
ts=2018-02-21T20:22:13.292773Z lvl=info msg="Compacting file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=7 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000129-000000001.tsm
ts=2018-02-21T20:22:14.262006Z lvl=info msg="Compacted file" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group index=0 file=/Users/stuartcarnie/.influxdb/data/bar/autogen/438/000000129-000000002.tsm.tmp
ts=2018-02-21T20:22:14.262060Z lvl=info msg="Finished compacting files" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group groups=8 files=1 duration=969.583ms
ts=2018-02-21T20:22:14.262078Z lvl=info msg="TSM compaction (end)" log_id=06QW8yAl000 engine=tsm1 level=1 strategy=level trace_id=06QWL2B0000 op.name=tsm1.compact_group op.event=end op.elapsed=969ms
ts=2018-02-21T20:22:15.292772Z lvl=info msg="Cache snapshot (start)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWL9~0000 op.name=cache.snapshot op.event=start
ts=2018-02-21T20:22:15.394407Z lvl=info msg="Snapshot for path written" log_id=06QW8yAl000 engine=tsm1 path=/Users/stuartcarnie/.influxdb/data/bar/autogen/438 duration=101.658ms
ts=2018-02-21T20:22:15.394459Z lvl=info msg="Cache snapshot (end)" log_id=06QW8yAl000 engine=tsm1 trace_id=06QWL9~0000 op.name=cache.snapshot op.event=end op.elapsed=101ms

@ghost ghost assigned stuartcarnie Feb 16, 2018
@ghost ghost added the review label Feb 16, 2018
@stuartcarnie stuartcarnie changed the title Generate trace logs for a number of significant InfluxDB operations Generate trace logs for a number of important InfluxDB operations Feb 16, 2018
@stuartcarnie stuartcarnie force-pushed the sgc-logging branch 2 times, most recently from ff19a5b to 36fd733 Compare February 16, 2018 04:55
@influxdata influxdata deleted a comment from hercules-influx Feb 16, 2018
@influxdata influxdata deleted a comment from hercules-influx Feb 16, 2018
@stuartcarnie stuartcarnie force-pushed the sgc-logging branch 3 times, most recently from 861a5a8 to 8b9725d Compare February 17, 2018 20:11
@hercules-influx
Copy link
Contributor

During a run of megacheck the following issues were discovered:

@e-dard
Copy link
Contributor

e-dard commented Feb 18, 2018

@stuartcarnie I’m not sure why megacheck is moaning. Does it pass locally?

@stuartcarnie
Copy link
Contributor Author

Runs fine locally @e-dard - not sure why it just reports an empty output in this PR

@stuartcarnie
Copy link
Contributor Author

@e-dard I'm an idiot – I re-pushed the branch and missed a couple of files, which is why it wasn't building

@e-dard
Copy link
Contributor

e-dard commented Feb 20, 2018 via email

@stuartcarnie
Copy link
Contributor Author

@e-dard I suspect that was the cause

Copy link
Contributor

@jsternberg jsternberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks good to me. I just have a bunch of requests and then to update the logging messages to follow the brand new style guide and we can get this merged.

logger/logger.go Outdated
@@ -6,7 +6,7 @@ import (
"time"

"github.com/jsternberg/zap-logfmt"
isatty "github.com/mattn/go-isatty"
"github.com/mattn/go-isatty"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be kept? It's something that goimports adds automatically. The intention is because the package has isatty as its name. While it's not necessary to "rename" it, the rename is added to make it explicit that the package's name differs from its basename so it's easier for a human to see, "Oh, go-isatty is what's creating the itatty package name."

@@ -79,7 +79,7 @@ func (s *Service) run() {
return

case <-ticker.C:
s.logger.Info("Retention policy shard deletion check commencing.")
log, logEnd := logger.NewOperation(s.logger, "Retention policy deletion check", "retention.delete_check")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call and the eventual invocation are far enough away that I think this entire section deserves a separate function. Either an inline closure or, since the indentation seems to be becoming too deep, you might want to consider moving this operation into its own function.

I think since the body is this long, a defer is more suitable here.

logger/fields.go Outdated
OperationElapsedKey = "op_elapsed"

// DBInstanceKey is the logging context key used for identifying name of the relevant database.
DBInstanceKey = "db_instance"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These instance keys should match what we use through most of the database code. So I think things like db, rp, shard_id, and group_id are more appropriate.

I like the idea of standardizing these though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll notice these keys are specifically namespaced:

	// DBInstanceKey is the logging context key used for identifying name of the relevant database.
	DBInstanceKey = "db_instance"

	// DBRetentionKey is the logging context key used for identifying name of the relevant retention policy.
	DBRetentionKey = "db_rp"

	// DBShardGroupKey is the logging context key used for identifying relevant shard group.
	DBShardGroupKey = "db_shard_group"

	// DBShardIDKey is the logging context key used for identifying name of the relevant shard group.
	DBShardIDKey = "db_shard_id"

I would prefer to keep them this way. I was planning to update your structured logging calls to use these keys.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolution is we re-introduced periods for name spacing, as latest go-logfmt parses the keys fine

logger/fields.go Outdated
)

const (
eventStart = "evt.start"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we are abbreviating evt? It seems event would be more user readable and it's only two extra characters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agree

logger/logger.go Outdated
@@ -43,7 +43,7 @@ func (c *Config) New(defaultOutput io.Writer) (*zap.Logger, error) {
encoder,
zapcore.Lock(zapcore.AddSync(w)),
c.Level,
)), nil
), zap.Fields(zap.String("cid", NextTraceID()))), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does cid stand for? Could we replace this with trace_id or at least tid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR request contains all the keys, including cid. I'd rather than that correlation_id

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the PR request, the correlation id sounds more like a process id and using pid might be simpler since that's already an established abbreviation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was explicitly changed from pid (the original choice) to a unique identifier, so that the same process identifier would not occur in the same set of log files ever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you misunderstood me. I didn't mean the physical process id.

Either way, is there something other than cid? I'm not particularly wed to a specific value, but cid doesn't follow any commonly known conventions. At the least, can we get a comment stating what it means in the code? The code itself says NextTraceID() and then assigns it to a value with cid so there's no mention of correlation in the code or comments.

I think I would be fine with that even if it isn't the ideal. I agree that correlation is a long word.

Any comments in this PR are not good enough documentation IMO. When someone is looking at a log file or looking at the code, they are not going to find this PR without some digging.

Copy link
Contributor Author

@stuartcarnie stuartcarnie Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, I've renamed NextTraceID as it is just an unique identifier. Also, will use log_id vs cid

elapsed := time.Since(*started)
e.Cache.UpdateCompactTime(elapsed)
e.logger.Info(fmt.Sprintf("Snapshot for path %s written in %v", e.path, elapsed))
logEnd()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my knowledge, NewOperation logs a start event, but this only logs an end event if started is not nil. Can you move logEnd out of this block to ensure it is always called?

It looks like, from the design of the code, it is impossible for started to be nil so this might not be necessary. The structure is a bit confusing though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought so too – started is always set. Most of the changes were avoiding logic changes to be of the lowest risk / impact. I'll see if it can be improved safely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the code, since started was always set

@@ -1990,12 +2001,14 @@ func (e *Engine) fullCompactionStrategy(group CompactionGroup, optimize bool) *c

if optimize {
s.description = "optimize"
s.name = "tsm.compact_group.optimize"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be tsm1? I think that's used in other places. If tsm1 isn't logged in any other locations, disregard this comment. If it is, then we should specify that this is for tsm1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to tsm1

logger/fields.go Outdated
DBShardIDKey = "db_shard_id"

// TraceS is a standard prefix for log messages that start a trace.
TraceS = "TRACE->S: "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, as a side note, can this be an extra field that's added as log context instead of a prefix? I think adding something like:

zap.String("trace", "start")
zap.String("trace", "end")

I like those better since they are less intrusive and easier to ignore if you don't like the prefix keys. I would like it if the messages themselves were human readable sentences.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More on this, something like:

func TraceStart() zapcore.Field {
        return zap.String("trace", "start")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsternberg per the PR notes again, TRACE->S: and TRACE->E were explicitly added so it was easy to identify trace start and end messages. Also, per the PR notes,

op_event currently has two values evt.start or evt.end indicating the start or end of an operation

Which adds those to the context

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed prefixes

@jsternberg
Copy link
Contributor

Another comment on the op_name convention. It seems to me like this is the purpose of the message field and I'm not sure why we need both.

@stuartcarnie
Copy link
Contributor Author

stuartcarnie commented Feb 21, 2018

@jsternberg

Another comment on the op_name convention. It seems to me like this is the purpose of the message field and I'm not sure why we need both.

Per the PR notes:

op_name is a unique identifier for the operation. We can filter on all operations of a specific name. An example of an op_name might be tsm.compact_group

It uses an identifier type syntax, with no spaces, so that they are easily searched. The msg text is for readability

Copy link
Contributor

@jsternberg jsternberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuartcarnie is addressing the comments and I'll double check later, but if the comments I've left are addressed, then I approve.

@stuartcarnie stuartcarnie force-pushed the sgc-logging branch 3 times, most recently from 863f09b to d86a20d Compare February 21, 2018 20:20
@stuartcarnie stuartcarnie removed the request for review from e-dard February 21, 2018 20:23
* tsdb Store.Open traces all events related to opening files
    * op.name : tsdb.open
* retention policy shard deletions
    * op.name : retention.delete_check
* all TSM compaction strategies
    * op.name : tsm1.compact_group
* series file compactions
    * op.name : series_partition.compaction
* continuous query execution (if logging enabled)
    * op.name : continuous_querier.execute
* TSI log file compaction
    * op_name: index.tsi.compact_log_file
* TSI level compaction
    * op.name: index.tsi.compact_to_level
@stuartcarnie stuartcarnie merged commit d40d3ec into master Feb 21, 2018
@stuartcarnie stuartcarnie deleted the sgc-logging branch February 21, 2018 22:09
@ghost ghost removed the review label Feb 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants