Feature: Dissect processor #6925
Conversation
This is not true anymore; we have changed the behavior so the last key is not greedy.
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
@ph I would be interested to know what the performance impact would be of extracting something from the very end of this long string.
Since this code has to read all the bytes to find delimiters and keys, the performance will depend on the following things:
- length of the string;
- number of keys;
- appending vs. simple key extraction.
If we take a string as long as the current stack trace and extract one key at the end, assuming we have some repetitive patterns allowing us to create a tokenizer, we get these results.
With dissect: "MACHINE[%{key}]"
=> 350 ns/op
Only a regexp match: regexp.MustCompile("MACHINE\\[(\\w+)\\]$")
=> 656 ns/op
Dissect is still faster in that case, assuming we have an easy pattern.
Note: in the regexp I am giving a hint that this pattern is at the end of the line; we cannot do that with dissect right now with the current syntax.
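To make the trade-off above concrete, here is a small sketch (not the benchmark code from this PR; the helper names and the sample line are invented for illustration) contrasting the two strategies: an end-anchored regexp versus a single-pass scan for the literal delimiters, which is the kind of work a dissect tokenizer does.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// The anchored regexp from the comment above.
var machineRE = regexp.MustCompile(`MACHINE\[(\w+)\]$`)

// extractWithRegexp extracts the key with the end-anchored regexp.
func extractWithRegexp(s string) string {
	m := machineRE.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[1]
}

// extractWithScan reads the bytes once looking for the literal
// delimiters, the way a dissect-style tokenizer would.
func extractWithScan(s string) string {
	start := strings.Index(s, "MACHINE[")
	if start == -1 {
		return ""
	}
	start += len("MACHINE[")
	end := strings.Index(s[start:], "]")
	if end == -1 {
		return ""
	}
	return s[start : start+end]
}

func main() {
	line := "at java.lang.Thread.run(Thread.java:748) MACHINE[wopr]"
	fmt.Println(extractWithRegexp(line)) // wopr
	fmt.Println(extractWithScan(line))   // wopr
}
```

Both return the same key; the scan avoids regexp machinery entirely, which is where the ns/op difference quoted above comes from.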
Note that if the key exists in the `target_field`, the original value will be replaced.
I suggest we don't allow this; someone would have to use drop_field first. This would be the same behaviour as we have in the rename processor.
I am OK with that.
`field`:: (Optional) The event field to tokenize. Default is `message`.

`target_field`:: (Optional) The name of the field where the values will be extracted. When an empty
string is defined, the processor will create the keys at the root of the event.
It's more the target object or prefix? If I understand this correctly, the data will be under extracted.key1, extracted.key2?
@ruflin Yes, it is exactly that; it will use another key as the prefix. Are you suggesting renaming it to target_prefix or just prefix?
Yes, as target_field suggests to me that it is the field the data ends up in.
delimiterRE = regexp.MustCompile("(?s)(.*?)%\\{([^}]*?)}")
suffixRE = regexp.MustCompile("(.+?)(/(\\d{1,2}))?(->)?$")

skipFieldPrefix = "?"
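As context for how `delimiterRE` splits a tokenizer, here is a hedged sketch (the expression is copied from the diff above; the driver code around it is mine): each match lazily captures the literal delimiter text before a `%{...}` group, plus the raw key inside the braces.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same expression as in the diff: group 1 is the delimiter preceding a
// %{...} token, group 2 is the raw key inside the braces.
var delimiterRE = regexp.MustCompile("(?s)(.*?)%\\{([^}]*?)}")

func main() {
	tokenizer := "%{key1} - %{key2}"
	for _, m := range delimiterRE.FindAllStringSubmatch(tokenizer, -1) {
		fmt.Printf("delimiter=%q key=%q\n", m[1], m[2])
	}
	// delimiter="" key="key1"
	// delimiter=" - " key="key2"
}
```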
I wonder if we really need to support all of these prefixes or if we can leave the more complex ones to LS?
Since the syntax is still quite small, I prefer we keep the two implementations as close as possible. If we ever make changes or additions to the syntax we might discuss it again.
Also the indirect field is really useful; take this example:
%{?key} %{&key}
applied to "hello world" gives {"hello" => "world"}
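To illustrate the skip/indirect semantics from that example, here is a toy sketch (hard-coded for the two-token case, not the real dissect implementation): the value captured by the skipped key `%{?key}` becomes the field *name* used by the indirect key `%{&key}`.

```go
package main

import (
	"fmt"
	"strings"
)

// toyIndirect mimics the tokenizer "%{?key} %{&key}" for a space-separated
// two-part message. %{?key} captures the first part but does not emit it as
// a field; %{&key} looks up that captured value and uses it as the key for
// the second part.
func toyIndirect(message string) map[string]string {
	parts := strings.SplitN(message, " ", 2)
	if len(parts) != 2 {
		return nil
	}
	return map[string]string{parts[0]: parts[1]}
}

func main() {
	fmt.Println(toyIndirect("hello world")) // map[hello:world]
}
```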
"github.com/stretchr/testify/assert"
)

var export = flag.Bool("dissect-tests", false, "export dissect tests to JSON.")
As this test is in the dissect package, I wonder if this flag will show up globally. Probably not, as _test files should not be compiled, but better double check.
It's only scoped to this package and won't be present in the final binary.
Note that if the key exists in the `target_field`, the original value will be replaced.

TODO: need to finish
@dedemorton Just to keep you in the loop: since both Beats and LS will share the same syntax and behavior, should we try to write a precise definition and examples of the syntax and share it between the two projects?
@ph I think that's a good idea, but there are a few things to consider:
- Where will the shared asciidoc source files live? Will there be a new repo for code/docs shared across ingest?
- Will the behavior across LS and Beats always be identical? Will the projects share code? If we add a new processor to Beats, will we always add it to Logstash in the same release (and vice versa)?
> Where will the shared asciidoc source files live? Will there be a new repo for code/docs shared across ingest?

Good question. I think for now we can duplicate the content but have the same description.

> Will the behavior across LS and Beats always be identical? Will the projects share code? If we add a new processor to Beats, will we always add it to Logstash in the same release (and vice versa)?

For now we only have dissect, and in the short term we are not planning anything else.
For now I would keep it separate to make sure it does not hold back the progress of this PR. We can still do a follow-up PR / discussion on how to merge it.
@ruflin I will do the docs changes as a separate PR.
Hopefully we can have this feature in Beats first.
@DanielYWoo IIRC Dissect appeared in LS two years ago.
auditbeat/auditbeat.reference.yml
Outdated
@@ -250,6 +250,14 @@ auditbeat.modules:
# - from: "a"
# to: "b"
#
# The following example use dissect syntax to tokenize the string into fields:
Can you just say, "The following example tokenizes the string into fields:" That way, if you change the name of the processor someday, you won't have to update the description. :-)
If you feel you need to mention dissect syntax, then say, "The following example uses..." (plural).
:D
@@ -761,3 +762,33 @@ The fields added to the event are looking as following:
-------------------------------------------------------------------------------

NOTE: The host information is refreshed every 5 minutes.

[[dissect]]
=== Dissect
We're trying to use goal-oriented titles here. How about "Dissect strings" or "Tokenize strings"?
I will go with "Dissect strings".
[[dissect]]
=== Dissect

The `dissect` processor allows you to use the `dissect` syntax to tokenizer incoming string into
It sounds like you're planning more work for the docs, so I won't make too many comments. But I would shorten this to say: "The `dissect` processor tokenizes incoming strings...."
any update on this? Eager to see it released
}

type dissectTest struct {
	Name string `json:"name"`
What is the json part in the struct here for?
Shared tests between LS and Beats see logstash-plugins/logstash-filter-dissect#55
+1 on sharing. Best to leave a comment here pointing to the LS code to make this clear, as otherwise someone could clean up the code in the future.
libbeat/processors/dissect/config.go
Outdated
var defaultConfig = config{
	Field: "message",
	TargetPrefix: "extracted",
Should we make it `dissect` to make it clear it comes from dissect?
`field`:: (Optional) The event field to tokenize. Default is `message`.

`target_prefix`:: (Optional) The name of the field where the values will be extracted. When an empty
string is defined, the processor will create the keys at the root of the event.
Can you document what happens if `""` is specified and a key already exists?
#
#processors:
#- dissect:
#   tokenizer: "%{key1} - %{key2}"
Should we have some guidelines / requirements on what users should use for the keys? Like don't use spaces, for example? As dissect allows any match to be a key, this will become interesting.
Space is fine; I added a note about the suffix and prefix modifiers.
`target_prefix`:: (Optional) The name of the field where the values will be extracted. When an empty
string is defined, the processor will create the keys at the root of the event. Default is
`extracted`.
One thing is also missing here: what happens on error, if dissect does not match or only partially matches? Is an error reported?
libbeat/processors/dissect/field.go
Outdated
return fmt.Sprintf("field: %s, ordinal: %d, greedy: %v", f.key, f.ordinal, f.IsGreedy())
}

type normalField struct {
Would be nice to have some docs here or at the top of this file explaining the difference between baseField, normalField, and namedSkipField. This would make it easier for someone contributing to the code in the future. You explain the different cases well in the PR message, but it's not documented in the code.
libbeat/processors/dissect/field.go
Outdated
return rawKey, 0, false
}

if m[0][3] != "" {
When I saw this code my first worry was a panic in case index 3 does not exist. Same for 4 below. Should we have a check, or should this never be possible based on the code path?
This is a pretty good question, let me verify that.
The len(m) check will take care of an unmatched regular expression, and when the regular expression matches it will always return 4 groups; if no value is found or extracted, the group will be an empty string. So the slice will always have 4 elements.
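The guarantee described above can be checked directly: Go's `FindAllStringSubmatch` always returns the full match plus one entry per capture group, with unmatched optional groups as empty strings. A small sketch using the `suffixRE` expression from the diff (the driver code is mine):

```go
package main

import (
	"fmt"
	"regexp"
)

// Same expression as in the diff: key name, optional /N ordinal,
// optional -> (greedy/padding) modifier.
var suffixRE = regexp.MustCompile("(.+?)(/(\\d{1,2}))?(->)?$")

func main() {
	for _, key := range []string{"key", "key/1", "key->", "key/2->"} {
		m := suffixRE.FindAllStringSubmatch(key, -1)
		// The submatch slice always holds the full match plus 4 groups;
		// groups that did not participate are empty strings, never missing.
		fmt.Printf("%-8s ordinal=%q modifier=%q\n", key, m[0][3], m[0][4])
	}
}
```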
pos := 0
for id, m := range matches {
	d := newDelimiter(tokenizer[m[2]:m[3]])
Same worries as above.
Should be fine for the same reason as above.
for k, v := range m {
	if _, err := event.GetValue(k); err == common.ErrKeyNotFound {
		event.PutValue(k, v)
Should we use safemapstr.Put(...) here so people can't shoot themselves in the foot?
I think in that case it's preferable to have an error; this won't lead to surprises, e.g. when you want to use the key for an index name.
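The loop under discussion only writes a key when it is absent from the event. A toy sketch of that "set only if absent" behavior (plain maps stand in for the beat.Event API; this is an illustration, not the processor code): an existing field survives and the conflict can then be surfaced as an error instead of a silent overwrite.

```go
package main

import "fmt"

// mergeExtracted copies extracted keys into the event only when the key
// does not already exist, mirroring the GetValue/PutValue guard above.
// It returns the list of keys that conflicted.
func mergeExtracted(event, extracted map[string]string) []string {
	var conflicts []string
	for k, v := range extracted {
		if _, exists := event[k]; exists {
			conflicts = append(conflicts, k)
			continue
		}
		event[k] = v
	}
	return conflicts
}

func main() {
	event := map[string]string{"message": "original"}
	conflicts := mergeExtracted(event, map[string]string{"message": "new", "code": "1"})
	fmt.Println(event["message"], event["code"], conflicts) // original 1 [message]
}
```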
@ruflin I've gone through the first round of comments and updated/rebased the code.
Code LGTM. I'm planning to pull the branch and play with the binary beginning next week.
// Raw returns the raw tokenizer used to generate the actual parser.
func (d *Dissector) Raw() string {
	return d.Raw()
This should be `return d.raw`, I think. I just tried to run it locally and got:
runtime: goroutine stack exceeds 1000000000-byte limit
fatal error: stack overflow
runtime stack:
runtime.throw(0x4b6f321, 0xe)
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81
runtime.newstack()
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/stack.go:1054 +0x71f
runtime.morestack()
/usr/local/Cellar/go/1.10.2/libexec/src/runtime/asm_amd64.s:480 +0x89
goroutine 1 [running]:
github.com/elastic/beats/libbeat/processors/dissect.(*Dissector).Raw(0xc4204375a0, 0x0, 0x0)
/Users/ruflin/Dev/gopath/src/github.com/elastic/beats/libbeat/processors/dissect/dissect.go:48 +0x4e fp=0xc4404a6378 sp=0xc4404a6370 pc=0x484e19e
github.com/elastic/beats/libbeat/processors/dissect.(*Dissector).Raw(0xc4204375a0, 0x0, 0x0)
/Users/ruflin/Dev/gopath/src/github.com/elastic/beats/libbeat/processors/dissect/dissect.go:49 +0x2b fp=0xc4404a63a0 sp=0xc4404a6378 pc=0x484e17b
github.com/elastic/beats/libbeat/processors/dissect.(*Dissector).Raw(0xc4204375a0, 0x0, 0x0)
I tried locally the following config (after applying the fix to the Raw() method mentioned above):

processors:
- dissect:
    tokenizer: "/var/%{prefix}/%{path}"
    field: "source"
    target_prefix: "extracted"

The idea here is to extract information from the file path. This has been a requested feature in the past, especially for the docker case (which we solved differently now). Also, users requested to get only the file name and not the full path. The above works as expected on logs coming from /var/log/* and the output contains the following fields:

"extracted": {
  "prefix": "log",
  "path": "fsck_apfs.log"
},

@ph I think the Raw() problem is really something that should pop up in our tests. I suggest adding at least 1 or 2 system tests to cover such cases.

Correct, that should have popped; I think I changed the name at some point. Let me correct the situation with tests; I should have them today.
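The path-extraction config tried above can be mimicked with a toy version of the "/var/%{prefix}/%{path}" tokenizer (hand-rolled for this one pattern to show which fields the processor produces; this is not the dissect code itself):

```go
package main

import (
	"fmt"
	"strings"
)

// dissectVarPath emulates tokenizer "/var/%{prefix}/%{path}": the literal
// "/var/" must match, %{prefix} captures up to the next "/", and %{path},
// as the last key, takes the remainder of the string.
func dissectVarPath(source string) (map[string]string, bool) {
	const lead = "/var/"
	if !strings.HasPrefix(source, lead) {
		return nil, false
	}
	rest := source[len(lead):]
	i := strings.Index(rest, "/")
	if i == -1 {
		return nil, false
	}
	return map[string]string{
		"prefix": rest[:i],
		"path":   rest[i+1:],
	}, true
}

func main() {
	fields, ok := dissectVarPath("/var/log/fsck_apfs.log")
	fmt.Println(ok, fields["prefix"], fields["path"]) // true log fsck_apfs.log
}
```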
@ruflin I've fixed the Raw case and also added 2 new system tests for the following scenario.
auditbeat/auditbeat.reference.yml
Outdated
#- dissect:
#   tokenizer: "%{key1} - %{key2}"
#   field: "message"
#   target_prefix: "extracted"
We now have `dissect` as the default in the code. This should also be adjusted here.
One minor comment left. Otherwise LGTM
"dissect": {
	"tokenizer": "\"%{key} world\"",
	"field": "message",
	"target_prefix": "extracted"
If you leave this one out, you can test the defaults.
I have added something specific to test all the defaults instead; I prefer clearer intent.
Good comments; I will make the changes. Thanks @ruflin.
Implement the Dissect Tokenizer from `logstash-input-dissect`[1]. This tokenizer allows you to define patterns of strings and extract the relevant information. It also permits some string manipulations when extracting the keys.

Example tokenizer:

```yaml
tokenizer: "%{at} - [%{machine}] %{code} - %{message}"
message: "10/10/2017 - [wopr] 1 - oh fire fire!"
result:
  at: "10/10/2017"
  machine: "wopr"
  code: "1"
  message: "oh fire fire!"
```

```yaml
tokenizer: "%{?key} %{&key}"
message: "hello world"
result:
  hello: "world"
```

Example of configuration:

```yaml
processors:
  - dissect:
      tokenizer: "%{key1} - %{key2}"
      field: "message"
      target_field: "extracted"
```

Dissect supports a few more features:

- indirect field
- append
- skip field
- greedy padding for CSV files

[1]: https://github.com/logstash-plugins/logstash-filter-dissect
@ruflin I did the last changes, I think we are good now, thanks!
}

var defaultConfig = config{
	Field: "message",
Based on elastic/ecs#3 this could become log.message (not now)
Hi guys, what's the next release date?
Implement the Dissect Tokenizer from logstash-input-dissect [1]. This tokenizer allows you to define patterns of strings and extract the relevant information. It also permits some string manipulations when extracting the keys.
Example tokenizer:
Example of configuration:
Dissect supports a few more features:
Performance
While working on this feature I did some work to reduce allocations and increase speed. I defined a few simple regular expressions that just do a regexp match on the data without doing anything else, and dissect is faster and provides more flexibility.
Note to reviewers:
@dedemorton I will do a follow-up PR with added documentation.
To understand dissect, I suggest looking either at the test suite or at the Logstash dissect documentation.