Introduce auto detection of format #18095

ycombinator · 2020-04-29T14:08:40Z

What does this PR do?

This PR introduces auto-detection of Logstash's log file format (plaintext or JSON) and calls the appropriate ingest pipeline for parsing.

Why is it important?

The logstash Filebeat module has always had the ability to parse either plaintext or JSON logs emitted by Logstash. Prior to this PR, users had to choose a format manually by specifying the var.format configuration setting in their Logstash module configuration.

With this PR they will no longer need to manually choose the format; the module will auto-detect it for them. This is in line with what we do in the elasticsearch Filebeat module.
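
A minimal sketch of the idea, not the module's actual implementation (the module does this through configuration and ingest pipelines): treat a line as JSON if it parses as a JSON object, and fall back to plaintext otherwise. The function names `detect_format` and `choose_pipeline`, and the pipeline names they return, are hypothetical.

```python
import json


def detect_format(line: str) -> str:
    """Return "json" if the line parses as a JSON object, else "plain"."""
    try:
        parsed = json.loads(line)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return "plain"
    # A bare number, string, or array is valid JSON but not a JSON log event.
    return "json" if isinstance(parsed, dict) else "plain"


def choose_pipeline(line: str) -> str:
    """Map the detected format to a hypothetical pipeline name."""
    return {"json": "log-json", "plain": "log-plain"}[detect_format(line)]
```

With this sketch, a line such as `{"level":"INFO","message":"started"}` would be routed to the JSON pipeline, while a line starting with `[2019-11-20T19:04:48,468]` would be routed to the plaintext one.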

This change is also a requirement for migrating modules to packages (see elastic/package-registry#270 (comment)).

Checklist

My code follows the style guidelines of this project
~~I have commented my code, particularly in hard-to-understand areas~~
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
~~I have added tests that prove my fix is effective or that my feature works~~ Tests already exist.
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Closes Make logstash Filebeat module use multiple pipelines #9964

elasticmachine · 2020-04-29T14:09:03Z

Pinging @elastic/integrations-services (Team:Services)

elasticmachine · 2020-04-29T15:00:20Z

💚 Build Succeeded


Build stats

Build Cause: [Pull request Introduce auto detection of format #18095 updated]
Start Time: 2020-05-13T22:08:18.762+0000
Duration: 54 min 11 sec (3251124)

Test stats 🧪

Test	Results
Failed	0
Passed	2781
Skipped	418
Total	3199

elasticmachine · 2020-05-13T10:21:08Z

Pinging @elastic/stack-monitoring (Stack monitoring)

mtojek

LGTM. It's great that it was possible to achieve the goal only by using configuration. The only concerns I have are mixed files (plain text combined together with JSON):

[2019-11-20T19:04:48,468][WARN ][org.logstash.dissect.Dissector][the_pipeline_id] Dissector mapping, pattern not found {"field"=>"message", "pattern"=>"%{LogLineTimeStamp->}\t%{Healthy}\t%{Fatals}\t%{Errors}\t%{Warnings}\t%{TimeToBuildPatternsCache}\t%{CachedPatternsCount}\t%{MessagesEnqueued}\t%{DropMsgNoSubscribers}\t%{MessagesEnqueued}\t%{TotalDests}\t%{CycleProcTime}\t%{TimeSinceNap}\t%{QUtilPermilAvg}\t%{QUtilPermilMax}\t%{QUtilPermilCount}\t%{NotifierRequests}\t%{NotifierProcessedRequests}\t%{NotifierRequestsChangeDynamicSubs}\t%{NotifierSentRequestsChangeExtDynamicSubs}\t%{NotifierProcessedRequestsDropped}\t%{NotifierBadTargets}\t%{NotifierCycleTimeNetAvg}\t%{NotifierCycleTimeNetCount}\t%{NotifierUtilAvg->}", "event"=>{"fields"=>{"pipeline"=>"mypipeline", "indexprefix"=>"idx", "regid"=>"w", "env"=>"production"}, "beat"=>{"version"=>"6.8.3", "hostname"=>"myhostname", "name"=>"myname"}, "message"=>"msg", "tags"=>["production", "beats_input_codec_plain_applied"], "host"=>{"name"=>"myhostname"}}}

I'm not blocking this PR.

mtojek · 2020-05-13T10:58:58Z

filebeat/docs/modules/logstash.asciidoc

@@ -9,7 +9,7 @@ This file is generated! See scripts/docs_collector.py
 == Logstash module

 The +{modulename}+ module parse logstash regular logs and the slow log, it will support the plain text format


nit: parses

mtojek · 2020-05-13T11:01:01Z

filebeat/module/logstash/log/config/log.yml

 multiline:
-  pattern: ^\[[0-9]{4}-[0-9]{2}-[0-9]{2}
+  pattern: ^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|{)


Is it possible to trick this pattern with a { character? I mean a plain text file with curly brackets inside.

Yes, it's a good point. If there's a multiline log event where, say, the 2nd line starts with a {, then this pattern breaks down. Unfortunately, I'm not really sure how to handle this scenario well.

How about extending this pattern to {"level" ?

For JSON-formatted logs, each log line is a JSON object. Being an object, I don't want to depend on a specific property, e.g. level, being the first one.
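
The failure mode discussed above can be reproduced with a small simulation. This is a sketch under assumptions: it assumes the multiline settings behave like `negate: true` with `match: after` (only the `pattern` line appears in the diff), the helper `group_events` is hypothetical, and the sample log lines are made up. The `{` is escaped for Python's `re` module.

```python
import re

# The updated pattern from log.yml, with "{" escaped for Python.
NEW_EVENT = re.compile(r"^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|\{)")


def group_events(lines):
    """Group lines into events: a line matching NEW_EVENT starts a new event,
    any other line is appended to the current event."""
    events = []
    for line in lines:
        if NEW_EVENT.match(line) or not events:
            events.append([line])
        else:
            events[-1].append(line)
    return events


# One logical plaintext event whose second line starts with "{".
plain_multiline = [
    "[2019-11-20T19:04:48,468][WARN ][some.logger] failed to parse:",
    "{",
    '  "field": "message"',
    "}",
]

events = group_events(plain_multiline)
print(len(events))  # prints 2: the lone "{" line wrongly starts a second event
```

A plaintext line that contains `{` anywhere after its first character is unaffected, because the pattern is anchored at the start of the line.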

ycombinator · 2020-05-13T11:19:49Z

The only concerns I have are mixed files (plain text combined together with JSON).

If a plain text log event has JSON anywhere after the first character, it should be handled fine. The problem only comes with plain text log events that have { as the first character of a line, which could happen in multiline plaintext events (see our other discussion in the comment about this).

mtojek · 2020-05-13T11:22:48Z

I wonder if we can use the fact that a plain text file will always be a plain text file (and also the other way round).

elasticmachine · 2020-05-14T11:37:57Z

💚 Build Succeeded


Build stats

Build Cause: [Pull request Introduce auto detection of format #18095 updated]
Start Time: 2020-05-14T11:41:50.115+0000
Duration: 53 min 42 sec (3161776)

Test stats 🧪

Test	Results
Failed	0
Passed	2781
Skipped	418
Total	3199

ycombinator · 2020-05-14T11:39:01Z

@mtojek Turns out it's not a matter of simply adding a closing ] to the regex pattern. That is, we can't just change it from:

^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|({.+}))

to:

^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}\]|({.+}))

That's because the timestamp pattern is incomplete in the regex. It only accounts for the date part, not the time part. So either we have to change the regex to:

^((\[[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\])|({.+}))

or to:

^((\[[0-9]{4}-[0-9]{2}-[0-9]{2}[^\]]+\])|({.+}))

I'm not sure either change, for the sake of completeness, is worth the extra processing. The purpose of this regex is simply to detect if a new multiline event should be started (and the previous one completed) or not. So I'm going to leave the regex as-is.
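
To make the comparison concrete, the four patterns from this comment can be checked against a few sample lines. This is only an illustrative sketch: the dictionary keys, the helper `starts_new_event`, and the JSON and continuation sample lines are hypothetical, and the braces are escaped for Python's `re` module (Filebeat itself uses Go regular expressions).

```python
import re

# Patterns copied from the comment above, with braces escaped for Python.
PATTERNS = {
    "date_only": r"^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}|(\{.+\}))",
    "date_then_bracket": r"^(\[[0-9]{4}-[0-9]{2}-[0-9]{2}\]|(\{.+\}))",
    "full_timestamp": r"^((\[[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}\])|(\{.+\}))",
    "any_until_bracket": r"^((\[[0-9]{4}-[0-9]{2}-[0-9]{2}[^\]]+\])|(\{.+\}))",
}

SAMPLES = {
    "plain": "[2019-11-20T19:04:48,468][WARN ][org.logstash.dissect.Dissector] msg",
    "json": '{"level":"INFO","message":"started"}',
    "continuation": "{ not closed on this line",
}


def starts_new_event(pattern_name: str, line: str) -> bool:
    """True if the line would begin a new multiline event under the pattern."""
    return re.match(PATTERNS[pattern_name], line) is not None


for name in PATTERNS:
    print(name, {label: starts_new_event(name, line) for label, line in SAMPLES.items()})
```

In this sketch only `date_then_bracket` fails to recognise the plaintext line, because a `T` rather than a `]` follows the date. All four variants accept the complete JSON line and reject the line that opens with `{` but never closes it.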

* Introduce auto detection of format
* Update docs
* Auto detect format for slowlogs
* Exclude JSON logs from multiline matching
* Adding CHANGELOG entry
* Fix typo
* Parsing everything as JSON first
* Going back to old processor definitions
* Adding Known Issues section in doc
* Completing regex pattern
* Updating regex pattern
* Generating docs



botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Apr 29, 2020

ycombinator added Team:Services (Deprecated) Label for the former Integrations-Services team and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 29, 2020

ycombinator added the in progress Pull request is currently in progress. label Apr 29, 2020

andresrc added [zube]: Inbox [zube]: In Progress and removed [zube]: Inbox labels May 2, 2020

ycombinator added 4 commits May 13, 2020 02:29

Introduce auto detection of format

11c23c0

Update docs

d9e3b5c

Auto detect format for slowlogs

7b33c05

Exclude JSON logs from multiline matching

db8f990

ycombinator force-pushed the fb-ls-multi-pipeline branch from d305993 to db8f990 Compare May 13, 2020 10:15

Adding CHANGELOG entry

478c4d3

ycombinator marked this pull request as ready for review May 13, 2020 10:20

ycombinator added [zube]: In Review Filebeat Filebeat needs_backport PR is waiting to be backported to other branches. Feature:Stack Monitoring v7.9.0 v8.0.0 and removed [zube]: In Progress in progress Pull request is currently in progress. labels May 13, 2020

ycombinator requested a review from mtojek May 13, 2020 10:22

mtojek approved these changes May 13, 2020


Fix typo

e984cd1

zube bot added [zube]: In Review and removed [zube]: Done labels May 14, 2020

zube bot closed this May 14, 2020

zube bot added [zube]: Done and removed [zube]: In Review labels May 14, 2020

zube bot reopened this May 14, 2020

zube bot added [zube]: In Review and removed [zube]: Done labels May 14, 2020

zube bot closed this May 14, 2020

zube bot added [zube]: Done and removed [zube]: In Review labels May 14, 2020

zube bot reopened this May 14, 2020

zube bot added [zube]: In Review and removed [zube]: Done labels May 14, 2020

ycombinator added 3 commits May 14, 2020 04:25

Adding Known Issues section in doc

89c3ae7

Completing regex pattern

203e1cb

Updating regex pattern

7094eb8

Generating docs

6a7740c

ycombinator merged commit 6bdc7d7 into elastic:master May 14, 2020

ycombinator deleted the fb-ls-multi-pipeline branch May 14, 2020 22:50

zube bot added [zube]: Done and removed [zube]: In Review labels May 14, 2020

ycombinator mentioned this pull request May 14, 2020

[7.x] Introduce auto detection of format (#18095) #18555

Merged

ycombinator removed the needs_backport PR is waiting to be backported to other branches. label May 15, 2020

andresrc removed the [zube]: Done label May 25, 2020
