
More detail in Telegraf logs for log parser errors #3187

Closed
birengoodco opened this issue Aug 30, 2017 · 12 comments · Fixed by #3188

Comments

@birengoodco

Bug report

I just got the below error in my telegraf log, and given that I have multiple log parsers configured in my telegraf.conf, it's fairly difficult to figure out which log parser it's talking about.

Also, I looked through the logs to find a record where there might be a trailing backslash, and I couldn't find one... so not sure what this error is talking about, but ultimately, this error caused Telegraf to die, and so I lost lots of data overnight.

It would be great to have just a little bit more information in the standard logging.

2017-08-30T08:32:49Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:50Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:50Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:50Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:50Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:51Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:52Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:53Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:54Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:32:54Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
2017-08-30T08:33:36Z E! Error parsing log line: Metric cannot have tag value ending with a backslash
panic: runtime error: slice bounds out of range

goroutine 36 [running]:
github.com/influxdata/telegraf/metric.(*metric).Fields(0xc4219b0780, 0xc420794c40)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/metric/metric.go:315 +0x5c8
github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).parser(0xc420272180)
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:261 +0x339
created by github.com/influxdata/telegraf/plugins/inputs/logparser.(*LogParserPlugin).Start
	/home/ubuntu/telegraf-build/src/github.com/influxdata/telegraf/plugins/inputs/logparser/logparser.go:156 +0x56d

Relevant telegraf.conf:

System info:

Ubuntu 14.04
Telegraf 1.4.0-RC2

@danielnelson
Contributor

Sorry about that, what the log message is referring to is that one of the metrics collected has a tag value that ends with a backslash. This value is impossible to encode in line protocol due to some unfortunate escaping rules. I'll add a few more details to the log message.
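For illustration only (this line is invented, not taken from the reporter's logs): in line protocol an unescaped space separates the tags from the fields, so a metric with a tag value of foo\ would have to be serialized as

    weblog,dir=foo\ requests=1i 1504081969000000000

where the trailing backslash escapes the space that should end the tag section. The parser can no longer tell where the tag value stops, which is why such a value cannot be represented unambiguously.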

I don't think it's very likely that this error caused the crash, though it does indicate that the logfiles being parsed have complex escaping.

It's most likely that there was a line in your logfile that triggered an error in Telegraf, and I'd like to try to find which line this was. You should be able to use the Telegraf log to determine the approximate time in your logfile. Can you copy the log files it was processing to a location where they will not be modified, and run an instance of telegraf with only your logparser configuration, with from_beginning = true and a single file output? If we can find the line that is triggering the crash it would be extremely helpful.
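A stripped-down config for that kind of repro run might look like the following sketch; the file paths and the grok pattern are placeholders, not the reporter's actual setup:

    [[inputs.logparser]]
      # Point at an unmodified copy of the original log file
      files = ["/tmp/repro/access.log"]
      from_beginning = true
      [inputs.logparser.grok]
        patterns = ["%{COMBINED_LOG_FORMAT}"]

    [[outputs.file]]
      files = ["/tmp/repro/metrics.out"]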

@birengoodco
Author

Alright... so I processed every single log file individually, and none of them resulted in a panic. So this is clearly not helpful haha... so perhaps something external caused it? I'm not sure...

@danielnelson
Contributor

I have had the same sort of trouble in this piece of code before. Since we use a Go map, reproducing the error can depend on the random order in which the map is iterated over. Do you think you could try running it multiple times, perhaps 10 times?
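As a side note on why repeated runs help: the Go runtime deliberately randomizes map iteration order, so a bug that only triggers for a particular key ordering may appear on some runs and not others. A minimal standalone sketch (not Telegraf code) showing the effect:

    package main

    import "fmt"

    func main() {
    	fields := map[string]interface{}{"a": 1, "b": 2, "c": 3}
    	// The runtime randomizes range order over maps; run this program
    	// several times and the keys print in different orders.
    	for k := range fields {
    		fmt.Print(k, " ")
    	}
    	fmt.Println()
    }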

@birengoodco
Author

Just so I'm clear on what you want me to try... You want me to iterate 10 times on each file individually? Or just iterate through all files in one shot over and over again? I would think the former? Cause otherwise, the issue is we don't even know which file the error is happening in, right?

@danielnelson
Contributor

Ultimately, the goal is to use the process of elimination to narrow it down to a small set of lines or even the exact line, but first let's just confirm we can reproduce it. So if doing them in one shot is easier, I think you could start with that, but after reproducing it we will want to narrow it down by doing it one file at a time.
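One way to script the repeated runs, assuming the copied logs and the stripped-down config live under /tmp/repro and GNU timeout is available (the paths and the 120-second window are arbitrary placeholders):

    for i in $(seq 1 10); do
      # Stop each run after a couple of minutes; a Go panic exits early
      # with a non-zero status other than timeout's 124, so break on that.
      timeout 120 telegraf --config /tmp/repro/telegraf.conf
      status=$?
      if [ $status -ne 0 ] && [ $status -ne 124 ]; then
        echo "run $i crashed with status $status"
        break
      fi
    done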

@danielnelson
Contributor

I think I have found the cause of this crash, will try to have a fix tonight.

@birengoodco
Author

HUZZAH!!!!!! I was going to comment back in the morning... I was running the tests and replicated the failure a couple times, and was trying to narrow it down to which file was causing it... it's slow going haha... so this is good news :) Thanks!

@danielnelson
Contributor

Since you were able to reproduce a few times, maybe you can try to trigger it again once rc4 is out?

@birengoodco
Author

Yeah, absolutely!! Will give it a whirl as soon as it's out!

@danielnelson
Contributor

Here are the official builds of rc4: https://github.com/influxdata/telegraf/releases/tag/1.4.0-rc4

@birengoodco
Author

Looks like the fix must've worked... I've re-run the test a number of times now, and haven't been able to reproduce the panic. Awesome! Thanks so much!

@danielnelson
Contributor

Thanks for doing all that testing.
