-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generate trace logs for a number of important InfluxDB operations #9456
Conversation
ff19a5b
to
36fd733
Compare
861a5a8
to
8b9725d
Compare
During a run of megacheck the following issues were discovered:
|
@stuartcarnie I’m not sure why megacheck is moaning. Does it pass locally? |
Runs fine locally @e-dard - not sure why it just reports an empty output in this PR |
8b9725d
to
06027a6
Compare
@e-dard I'm an idiot – I re-pushed the branch and missed a couple of files, which is why it wasn't building |
Ah so maybe when it’s returning with an empty comment it’s actually the
build that’s broken... I’ll try and make that clearer.
On Mon, 19 Feb 2018 at 02:32, Stuart Carnie ***@***.***> wrote:
@e-dard <https://github.com/e-dard> I'm an idiot – I re-pushed the branch
and missed a couple of files, which is why it wasn't building
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9456 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAeo6YtrSSLLYmvVrJc_hO1BQZ7QZfyTks5tWN1BgaJpZM4SH1RQ>
.
--
*Edd Robinson*
InfluxDB Senior Engineer
▼▴
*InfluxData.com <http://influxdata.com/>*
Github <http://www.github.com/e-dard> / LinkedIn
<https://www.linkedin.com/in/eddrobinson> / Personal <http://eddrobinson.io>
|
@e-dard I suspect that was the cause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks good to me. I just have a bunch of requests and then to update the logging messages to follow the brand new style guide and we can get this merged.
logger/logger.go
Outdated
@@ -6,7 +6,7 @@ import ( | |||
"time" | |||
|
|||
"github.com/jsternberg/zap-logfmt" | |||
isatty "github.com/mattn/go-isatty" | |||
"github.com/mattn/go-isatty" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be kept? It's something that goimports adds automatically. The intention is because the package has isatty
as its name. While it's not necessary to "rename" it, the rename is added to make it explicit that the package's name differs from its basename so it's easier for a human to see, "Oh, go-isatty is what's creating the itatty package name."
services/retention/service.go
Outdated
@@ -79,7 +79,7 @@ func (s *Service) run() { | |||
return | |||
|
|||
case <-ticker.C: | |||
s.logger.Info("Retention policy shard deletion check commencing.") | |||
log, logEnd := logger.NewOperation(s.logger, "Retention policy deletion check", "retention.delete_check") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This call and the eventual invocation are far enough away that I think this entire section deserves a separate function. Either an inline closure or, since the indentation seems to be becoming too deep, you might want to consider moving this operation into its own function.
I think since the body is this long, a defer is more suitable here.
logger/fields.go
Outdated
OperationElapsedKey = "op_elapsed" | ||
|
||
// DBInstanceKey is the logging context key used for identifying name of the relevant database. | ||
DBInstanceKey = "db_instance" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These instance keys should match what we use through most of the database code. So I think things like db
, rp
, shard_id
, and group_id
are more appropriate.
I like the idea of standardizing these though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you'll notice these keys are specifically namespaced:
// DBInstanceKey is the logging context key used for identifying name of the relevant database.
DBInstanceKey = "db_instance"
// DBRetentionKey is the logging context key used for identifying name of the relevant retention policy.
DBRetentionKey = "db_rp"
// DBShardGroupKey is the logging context key used for identifying relevant shard group.
DBShardGroupKey = "db_shard_group"
// DBShardIDKey is the logging context key used for identifying name of the relevant shard group.
DBShardIDKey = "db_shard_id"
I would prefer to keep them this way. I was planning to update your structured logging calls to use these keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolution is we re-introduced periods for name spacing, as latest go-logfmt parses the keys fine
logger/fields.go
Outdated
) | ||
|
||
const ( | ||
eventStart = "evt.start" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we are abbreviating evt
? It seems event
would be more user readable and it's only two extra characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, agree
logger/logger.go
Outdated
@@ -43,7 +43,7 @@ func (c *Config) New(defaultOutput io.Writer) (*zap.Logger, error) { | |||
encoder, | |||
zapcore.Lock(zapcore.AddSync(w)), | |||
c.Level, | |||
)), nil | |||
), zap.Fields(zap.String("cid", NextTraceID()))), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does cid
stand for? Could we replace this with trace_id
or at least tid
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR request contains all the keys, including cid
. I'd rather than that correlation_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the PR request, the correlation id sounds more like a process id and using pid
might be simpler since that's already an established abbreviation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was explicitly changed from pid
(the original choice) to a unique identifier, so that the same process identifier would not occur in the same set of log files ever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you misunderstood me. I didn't mean the physical process id.
Either way, is there something other than cid
? I'm not particularly wed to a specific value, but cid
doesn't follow any commonly known conventions. At the least, can we get a comment stating what it means in the code? The code itself says NextTraceID()
and then assigns it to a value with cid
so there's no mention of correlation in the code or comments.
I think I would be fine with that even if it isn't the ideal. I agree that correlation is a long word.
Any comments in this PR are not good enough documentation IMO. When someone is looking at a log file or looking at the code, they are not going to find this PR without some digging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, I've renamed NextTraceID
as it is just an unique identifier. Also, will use log_id
vs cid
tsdb/engine/tsm1/engine.go
Outdated
elapsed := time.Since(*started) | ||
e.Cache.UpdateCompactTime(elapsed) | ||
e.logger.Info(fmt.Sprintf("Snapshot for path %s written in %v", e.path, elapsed)) | ||
logEnd() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, NewOperation
logs a start event, but this only logs an end event if started is not nil. Can you move logEnd
out of this block to ensure it is always called?
It looks like, from the design of the code, it is impossible for started to be nil so this might not be necessary. The structure is a bit confusing though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought so too – started is always set. Most of the changes were avoiding logic changes to be of the lowest risk / impact. I'll see if it can be improved safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code, since started
was always set
tsdb/engine/tsm1/engine.go
Outdated
@@ -1990,12 +2001,14 @@ func (e *Engine) fullCompactionStrategy(group CompactionGroup, optimize bool) *c | |||
|
|||
if optimize { | |||
s.description = "optimize" | |||
s.name = "tsm.compact_group.optimize" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be tsm1
? I think that's used in other places. If tsm1
isn't logged in any other locations, disregard this comment. If it is, then we should specify that this is for tsm1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to tsm1
logger/fields.go
Outdated
DBShardIDKey = "db_shard_id" | ||
|
||
// TraceS is a standard prefix for log messages that start a trace. | ||
TraceS = "TRACE->S: " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, as a side note, can this be an extra field that's added as log context instead of a prefix? I think adding something like:
zap.String("trace", "start")
zap.String("trace", "end")
I like those better since they are less intrusive and easier to ignore if you don't like the prefix keys. I would like it if the messages themselves were human readable sentences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More on this, something like:
func TraceStart() zapcore.Field {
return zap.String("trace", "start")
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsternberg per the PR notes again, TRACE->S:
and TRACE->E
were explicitly added so it was easy to identify trace start and end messages. Also, per the PR notes,
op_event
currently has two valuesevt.start
orevt.end
indicating the start or end of an operation
Which adds those to the context
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed prefixes
Another comment on the |
Per the PR notes:
It uses an identifier type syntax, with no spaces, so that they are easily searched. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stuartcarnie is addressing the comments and I'll double check later, but if the comments I've left are addressed, then I approve.
863f09b
to
d86a20d
Compare
* tsdb Store.Open traces all events related to opening files * op.name : tsdb.open * retention policy shard deletions * op.name : retention.delete_check * all TSM compaction strategies * op.name : tsm1.compact_group * series file compactions * op.name : series_partition.compaction * continuous query execution (if logging enabled) * op.name : continuous_querier.execute * TSI log file compaction * op_name: index.tsi.compact_log_file * TSI level compaction * op.name: index.tsi.compact_to_level
d86a20d
to
d135aec
Compare
Notes
Traces create a child logger with additional context that is used to log any subsequent events during the trace. Specifically
op.name
is a unique identifier for the operation. We can filter on all operations of a specific name. An example of anop.name
might betsm1.compact_group
op.event
currently has two valuesstart
orend
indicating the start or end of an operationop.name
ANDop.event
to find all starting operation logsop.event
contexttrace_id
is a unique identifier used for a specific instance of a trace and can be used to correlate all related log eventslog_id
is added to every log entry for a single execution of theinfluxd
process, allowing all log entries for a single run to be easily identified. There are also other ways a log file could be split by a single execution, however, the consistentlog_id
will make it easier when searching log aggregations services.op.elapsed
is the amount of time the operation was spent executing and is logged with ending trace entrydb.instance
identifies the name of the databasedb.rp
identifies the name of the retention policydb.shard_id
identifies the id of the sharddb.shard_group
identifies the shard group identifierTooling
There are a number of tools for processing and filtering log files in JSON or logfmt
Operations
tsdb.Store
op.name
:tsdb.open
op.name
:retention.delete_check
op.name
:tsm1.compact_group
strategy
:(level|full)
level
:[1,3]
optimize
:(true|false)
op.name
:series_partition.compaction
op.name
:continuous_querier.execute
op.name
:index.tsi.compact_log_file
op.name
:index.tsi.compact_to_level
Some examples of related trace log entries in
logfmt
: