
Feature: Dissect processor #6925

Merged

merged 1 commit into elastic:master on May 29, 2018

Conversation

ph
Contributor

@ph ph commented Apr 23, 2018

Implement the Dissect Tokenizer from `logstash-input-dissect`[1]

This tokenizer allows you to define string patterns and extract the
relevant information. It also lets you do some string manipulation
when extracting the keys.

Example tokenizer:

tokenizer: "%{at} - [%{machine}] %{code} - %{message}"
message: "10/10/2017 - [wopr] 1 - oh fire fire!"
result:
  at: "10/10/2017"
  machine: "wopr"
  code: "1"
  message: "of fire fire!"
tokenizer: "%{?key} %{&key}"
message: "hello world"
result:
  hello: "world"

Example of configuration:

```yaml
processors:
 - dissect:
    tokenizer: "%{key1} - %{key2}"
    field: "message"
    target_prefix: "extracted"
```

Dissect supports a few more features (sketched below):

  • Indirect field
  • Append
  • Skip field
  • Greedy padding for CSV files
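
A quick sketch of what these look like as tokenizer patterns (illustrative examples of mine, mirroring the Logstash dissect syntax):

```yaml
# Indirect field: %{?key} captures a field *name*, %{&key} supplies its value.
tokenizer: "%{?key} %{&key}"
# Append: successive %{+field} captures are concatenated into one field.
tokenizer: "%{+greeting} %{+greeting}"
# Skip field: an empty or ?-prefixed key is matched but not kept.
tokenizer: "%{} - %{message}"
# Greedy padding: the -> suffix eats repeated delimiters, e.g. padded CSV columns.
tokenizer: "%{id->},%{value}"
```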

Performance

While working on this feature I did some work to reduce allocations and increase speed. As a baseline, I defined a few simple regular expressions that only perform a regexp match on the data without doing anything else; dissect is faster and provides more flexibility.

goos: darwin
goarch: amd64
pkg: github.com/elastic/beats/libbeat/processors/dissect
BenchmarkDissect/Complex_stack_trace-8           2000000               607 ns/op             464 B/op          3 allocs/op
BenchmarkDissect/fails_when_delimiter_is_not_found_at_the_beginning_of_the_string-8             2000000000               0.00 ns/op            0 B/op          0 allocs/op
BenchmarkDissect/fails_when_delimiter_is_not_found_after_the_key-8                              2000000000               0.00 ns/op            0 B/op          0 allocs/op
BenchmarkDissect/simple_dissect-8                                                               10000000               199 ns/op             352 B/op          3 allocs/op
BenchmarkDissect/dissect_two_replacement-8                                                       5000000               255 ns/op             368 B/op          3 allocs/op
BenchmarkDissect/one_level_dissect_not_end_of_string-8                                          10000000               233 ns/op             352 B/op          3 allocs/op
BenchmarkDissect/one_level_dissect-8                                                            10000000               209 ns/op             352 B/op          3 allocs/op
BenchmarkDissect/multiple_keys_dissect_end_of_string-8                                           5000000               264 ns/op             368 B/op          3 allocs/op
BenchmarkDissect/multiple_keys_not_end_of_string-8                                               5000000               289 ns/op             368 B/op          3 allocs/op
BenchmarkDissect/simple_ordered-8                                                                3000000               440 ns/op             392 B/op          5 allocs/op
BenchmarkDissect/simple_append-8                                                                 3000000               431 ns/op             392 B/op          5 allocs/op
BenchmarkDissect/indirect_field-8                                                                5000000               269 ns/op             368 B/op          3 allocs/op
BenchmarkDissect/skip_field-8                                                                    5000000               254 ns/op             368 B/op          3 allocs/op
BenchmarkDissect/named_skiped_field_with_indirect-8                                              5000000               298 ns/op             368 B/op          3 allocs/op
BenchmarkDissect/missing_fields-8                                                                3000000               475 ns/op             432 B/op          3 allocs/op
BenchmarkDissect/ignore_right_padding-8                                                          5000000               371 ns/op             384 B/op          3 allocs/op
BenchmarkDissect/padding_on_the_last_key_need_a_delimiter-8                                      5000000               385 ns/op             384 B/op          3 allocs/op
BenchmarkDissect/ignore_left_padding-8                                                           5000000               364 ns/op             384 B/op          3 allocs/op
BenchmarkDissect/when_the_delimiters_contains_`{`_and_`}`-8                                      5000000               316 ns/op             384 B/op          3 allocs/op
BenchmarkDissect/Regular_expression-8                                                            3000000               560 ns/op             240 B/op          1 allocs/op
BenchmarkDissect/Larger_regular_expression-8                                                      500000              3136 ns/op             528 B/op          3 allocs/op
BenchmarkDissect/regular_expression_to_match_end_of_line-8                                       2000000               651 ns/op             304 B/op          3 allocs/op


Note to reviewers:

@dedemorton I will do a follow-up PR with added documentation.

To understand dissect, I suggest looking at either the test suite or the Logstash dissect documentation.

@tsg tsg added the in progress Pull request is currently in progress. label Apr 23, 2018
@ph
Contributor Author

ph commented Apr 23, 2018

I've talked with @guyboertje and it's not a bug but a feature, useful for debugging and making sure our pattern is right. To make it work you have to create an empty key at the end to grab the greedy content. I will make the change to this PR so the behavior is the same.

This is no longer true: we have changed the behavior so the last key is not greedy.

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)`
Contributor

@ph I would be interested to know the performance impact of extracting something from the very end of this long string.

Contributor Author

Since this code has to read all the bytes to find delimiters and keys, the performance will be dependent on the following things:

  • Length of the string;
  • Number of keys;
  • Appending vs. simple key extraction.

If we take a string as long as the current stack trace and extract one key at the end, assuming we have some repetitive patterns allowing us to create a tokenizer, we get these results.

With dissect: "MACHINE[%{key}]" => 350 ns/op
Only a regexp match: regexp.MustCompile("MACHINE\\[(\\w+)\\]$") => 656 ns/op

It will still be faster, in that case, assuming we have an easy pattern.

Note: in the regexp I am giving a hint that this pattern is at the end of the line; we cannot do that with dissect right now with the current syntax.
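
For anyone wanting to reproduce the regexp side of this comparison, a minimal sketch (the long input line and benchmark name are my own construction; the dissect numbers come from this PR's benchmark suite):

```go
package dissect_test

import (
	"regexp"
	"strings"
	"testing"
)

// A stack-trace-sized line with a repetitive pattern and the key at the very end.
var line = strings.Repeat("at java.lang.Thread.run(Thread.java:748)\n", 40) +
	"MACHINE[wopr]"

var machineRE = regexp.MustCompile(`MACHINE\[(\w+)\]$`)

func BenchmarkRegexpEndOfLine(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if machineRE.FindStringSubmatch(line) == nil {
			b.Fatal("no match")
		}
	}
}
```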


Note that if the key exists in the `target_field`, the original value will be replaced.
Member

I suggest we don't allow this; someone would have to use drop_fields before. This would be the same behaviour as we have in the rename processor.

Contributor Author

I am OK with that.
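
A sketch of the suggested combination, assuming the rename-style behavior is adopted: clear a conflicting target first with the existing drop_fields processor.

```yaml
processors:
 - drop_fields:
     fields: ["extracted"]      # remove any pre-existing target first
 - dissect:
     tokenizer: "%{key1} - %{key2}"
     field: "message"
     target_prefix: "extracted"
```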


`field`:: (Optional) The event field to tokenize. Default is `message`.

`target_field`:: (Optional) The name of the field where the values will be extracted. When an empty
Member

Is it more the target object or a prefix? If I understand this correctly, the data will be under extracted.key1 and extracted.key2?

Contributor Author

@ruflin Yes, it is exactly that: it will use another key as the prefix. Are you suggesting renaming it to target_prefix or just prefix?

Member

Yes; target field suggests to me that it is the field the data ends up in.
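
For illustration, with the configuration above and a message of "hello - world", the event would end up looking like this (sketch):

```json
{
  "message": "hello - world",
  "extracted": {
    "key1": "hello",
    "key2": "world"
  }
}
```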

delimiterRE = regexp.MustCompile("(?s)(.*?)%\\{([^}]*?)}")
suffixRE = regexp.MustCompile("(.+?)(/(\\d{1,2}))?(->)?$")

skipFieldPrefix = "?"
Member

I wonder if we really need to support all of these prefixes or if we can leave the more complex ones to LS?

Contributor Author

Since the syntax is still quite small, I prefer we keep the two implementations as close as possible. If we ever make changes or additions to the syntax, we can discuss it again.

Contributor Author

Also, the indirect field is really useful; take this example:

%{?key} %{&key} -> "hello world" -> {"hello" => "world"}

"github.com/stretchr/testify/assert"
)

var export = flag.Bool("dissect-tests", false, "export dissect tests to JSON.")
Member

As this test is in the dissect package, I wonder if this flag will show up globally. Probably not, as _test files are not compiled into the final binary, but better to double-check.

Contributor Author

It's only scoped to the tests and won't be present in the final binary.


Note that if the key exists in the `target_field`, the original value will be replaced.

TODO: need to finish
Contributor Author

@dedemorton Just to keep you in the loop: since both Beats and LS will share the same syntax and behavior, should we try to write a single definition of the syntax with examples and share it between the two projects?

Contributor

@ph I think that's a good idea, but there are a few things to consider:

  • Where will the shared asciidoc source files live? Will there be a new repo for code/docs shared across ingest?
  • Will the behavior across LS and Beats always be identical? Will the projects share code? If we add a new processor to Beats, will we always add it to Logstash in the same release (and vice versa)?

Contributor Author

Where will the shared asciidoc source files live? Will there be a new repo for code/docs shared across ingest?

Good question, I think for now we can duplicate the content but have the same description.

Will the behavior across LS and Beats always be identical? Will the projects share code? If we add a new processor to Beats, will we always add it to Logstash in the same release (and vice versa)?

For now we only have dissect and in the short term we are not planning anything else.

Member

For now I would keep it separate to make sure it does not hold back the progress of this PR. We can still do a follow-up PR / discussion on how to merge it.

Contributor Author

@ruflin I will make the docs PR as a separate PR.


Hopefully we can have this feature in Beats first.

Contributor Author

@DanielYWoo IIRC Dissect appeared in LS two years ago.

@@ -250,6 +250,14 @@ auditbeat.modules:
# - from: "a"
# to: "b"
#
# The following example use dissect syntax to tokenize the string into fields:
Contributor

Can you just say, "The following example tokenizes the string into fields:" That way, if you change the name of the processor someday, you won't have to update the description. :-)

If you feel you need to mention dissect syntax, then say, "The following example uses..." (plural).

Contributor Author

:D

@@ -761,3 +762,33 @@ The fields added to the event are looking as following:
-------------------------------------------------------------------------------

NOTE: The host information is refreshed every 5 minutes.

[[dissect]]
=== Dissect
Contributor

We're trying to use goal-oriented titles here. How about:

Dissect strings or Tokenize strings?

Contributor Author

I will go with Dissect strings

[[dissect]]
=== Dissect

The `dissect` processor allows you to use the `dissect` syntax to tokenizer incoming string into
Contributor

It sounds like you're planning more work for the docs, so I won't make too many comments. But I would shorten this to say:

The dissect processor tokenizes incoming strings....

@ph ph force-pushed the feature/dissect-tokenizer branch from c124419 to ab3b4ba Compare May 17, 2018 15:21
@ph ph added review and removed in progress Pull request is currently in progress. labels May 17, 2018
@DanielYWoo

Any update on this? Eager to see it released.

}

type dissectTest struct {
Name string `json:"name"`
Member

What is the json part in the struct here for?

Contributor Author

Shared tests between LS and Beats see logstash-plugins/logstash-filter-dissect#55

Member

+1 on sharing. Best to leave a comment here pointing to the LS code to make this clear, as otherwise someone could clean up the code in the future.


var defaultConfig = config{
Field: "message",
TargetPrefix: "extracted",
Member

Should we make it dissect to make it clear it comes from dissect?


`field`:: (Optional) The event field to tokenize. Default is `message`.

`target_prefix`:: (Optional) The name of the field where the values will be extracted. When an empty
Member

Can you document what happens if "" is specified and a key already exists?

#
#processors:
#- dissect:
# tokenizer: "%{key1} - %{key2}"
Member

Should we have some guidelines / requirements on what users should use for the keys? Like don't use spaces for example? As dissect allows any match to be a key, this will become interesting.

Contributor Author

@ph ph May 24, 2018

Space is fine; I added a note about suffix and prefix modifiers.


`target_prefix`:: (Optional) The name of the field where the values will be extracted. When an empty
string is defined, the processor will create the keys at the root of the event. Default is
`extracted`.

Member

One thing also missing here: what happens if dissect does not match, or matches only partially? Is an error reported?

return fmt.Sprintf("field: %s, ordinal: %d, greedy: %v", f.key, f.ordinal, f.IsGreedy())
}

type normalField struct {
Member

Would be nice to have some docs here or at the top of this file to explain the difference between baseField, normalField, and namedSkipField. This would make it easier for someone contributing to the code in the future.

You explain the different cases well in the PR message, but it's not documented in the code.

return rawKey, 0, false
}

if m[0][3] != "" {
Member

When I saw this code, my first worry was a panic in case index 3 does not exist. Same for 4 below. Should we have a check, or should this never be possible based on the code path?

Contributor Author

This is a pretty good question, let me verify that.

Contributor Author

The len(m) check takes care of an unmatched regular expression, and when the regular expression matches it always returns 4 groups; if no value is found or extracted, the group will be an empty string.

But the slice will always contain the 4 groups.
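
This is easy to verify: Go's FindAllStringSubmatch pads capture groups that did not participate in a match with empty strings, so every match slice carries the full match plus all four groups. A small sketch using the suffixRE from the diff:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	suffixRE := regexp.MustCompile("(.+?)(/(\\d{1,2}))?(->)?$")
	for _, key := range []string{"key", "key/2", "key->"} {
		m := suffixRE.FindAllStringSubmatch(key, -1)
		// len(m[0]) is always 5: the full match plus 4 groups,
		// with "" for any group that did not participate.
		fmt.Println(len(m[0]), m[0])
	}
}
```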


pos := 0
for id, m := range matches {
d := newDelimiter(tokenizer[m[2]:m[3]])
Member

Same worries as above.

Contributor Author

Should be fine for the same reason as above.


for k, v := range m {
if _, err := event.GetValue(k); err == common.ErrKeyNotFound {
event.PutValue(k, v)
Member

Should we use safemapstr.Put(...) here so people can't shoot themselves in the foot?

Contributor Author

@ph ph May 24, 2018

I think in that case it's preferable to have an error; this won't lead to surprises, e.g. when you want to use the key for an index name.
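
For context, the foot-gun under discussion is dotted keys: MapStr treats dots as path separators, so a dissected key containing a dot would silently change the event's shape instead of producing a flat key. A sketch, assuming the common.MapStr API of the time:

```go
package main

import (
	"fmt"

	"github.com/elastic/beats/libbeat/common"
)

func main() {
	m := common.MapStr{}
	// Put expands the dot into nesting: {"user": {"name": "wopr"}},
	// not a literal "user.name" key.
	m.Put("user.name", "wopr")
	fmt.Println(m)
}
```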

@ph ph force-pushed the feature/dissect-tokenizer branch from 520028c to b6549fe Compare May 24, 2018 15:57
@ph
Contributor Author

ph commented May 24, 2018

@ruflin I've gone through the first round of comments and updated/rebased the code.

@ruflin
Member

ruflin commented May 25, 2018

Code LGTM. I'm planning to pull the branch and play with the binary beginning next week.


// Raw returns the raw tokenizer used to generate the actual parser.
func (d *Dissector) Raw() string {
return d.Raw()
Member

This should be return d.raw I think. I just tried running it locally and got:

runtime: goroutine stack exceeds 1000000000-byte limit
fatal error: stack overflow

runtime stack:
runtime.throw(0x4b6f321, 0xe)
        /usr/local/Cellar/go/1.10.2/libexec/src/runtime/panic.go:616 +0x81
runtime.newstack()
        /usr/local/Cellar/go/1.10.2/libexec/src/runtime/stack.go:1054 +0x71f
runtime.morestack()
        /usr/local/Cellar/go/1.10.2/libexec/src/runtime/asm_amd64.s:480 +0x89

goroutine 1 [running]:
github.com/elastic/beats/libbeat/processors/dissect.(*Dissector).Raw(0xc4204375a0, 0x0, 0x0)
        /Users/ruflin/Dev/gopath/src/github.com/elastic/beats/libbeat/processors/dissect/dissect.go:48 +0x4e fp=0xc4404a6378 sp=0xc4404a6370 pc=0x484e19e
github.com/elastic/beats/libbeat/processors/dissect.(*Dissector).Raw(0xc4204375a0, 0x0, 0x0)
        /Users/ruflin/Dev/gopath/src/github.com/elastic/beats/libbeat/processors/dissect/dissect.go:49 +0x2b fp=0xc4404a63a0 sp=0xc4404a6378 pc=0x484e17b
github.com/elastic/beats/libbeat/processors/dissect.(*Dissector).Raw(0xc4204375a0, 0x0, 0x0)
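
The fix is exactly what the comment says: return the unexported field instead of recursing. A sketch, assuming the tokenizer string is stored in d.raw:

```go
// Raw returns the raw tokenizer used to generate the actual parser.
func (d *Dissector) Raw() string {
	return d.raw // was `return d.Raw()`, which recurses until the stack overflows
}
```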

@ruflin
Member

ruflin commented May 28, 2018

I tried the following config locally (after applying the fix to the Raw() method mentioned above):

processors:
 - dissect:
     tokenizer: "/var/%{prefix}/%{path}"
     field: "source"
     target_prefix: "extracted"

The idea here is to extract information from the file path. This has been a requested feature in the past, especially for the Docker case (which we now solve differently). Users have also requested getting only the file name instead of the full path.

The above works as expected on logs coming from /var/log/* and the output contains the following fields:

  "extracted": {
    "prefix": "log",
    "path": "fsck_apfs.log"
  },

@ph I think the Raw() problem is really something that should pop up in our tests. I suggest adding at least 1 or 2 system tests to cover such cases.

@ph
Contributor Author

ph commented May 28, 2018 via email

@ph
Contributor Author

ph commented May 28, 2018

@ruflin I've fixed the Raw case and also added 2 new system tests for the following scenarios (sketched below):

  • Good tokenizer => Expect keys to be extracted in a specific field.
  • Bad tokenizer => Expect no keys to be extracted but the event should not be dropped.
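
A unit-level sketch of the two expectations (hypothetical test names, and assuming Dissect returns an error when the tokenizer does not match; the PR's actual coverage is in system tests):

```go
package dissect_test

import (
	"testing"

	"github.com/elastic/beats/libbeat/processors/dissect"
)

func TestGoodTokenizerExtractsKeys(t *testing.T) {
	d, err := dissect.New("%{key1} - %{key2}")
	if err != nil {
		t.Fatal(err)
	}
	m, err := d.Dissect("hello - world")
	if err != nil || m["key1"] != "hello" || m["key2"] != "world" {
		t.Fatalf("expected extracted keys, got %v (err: %v)", m, err)
	}
}

func TestBadTokenizerKeepsEvent(t *testing.T) {
	d, err := dissect.New("%{key1};%{key2}")
	if err != nil {
		t.Fatal(err)
	}
	// The ";" delimiter never appears, so parsing fails with an error;
	// the processor reports it and leaves the original event untouched.
	if _, err := d.Dissect("hello - world"); err == nil {
		t.Fatal("expected a parsing error")
	}
}
```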

#- dissect:
# tokenizer: "%{key1} - %{key2}"
# field: "message"
# target_prefix: "extracted"
Member

The code now has dissect as the default. This should also be adjusted here.

Member

@ruflin ruflin left a comment

One minor comment left. Otherwise LGTM

"dissect": {
"tokenizer": "\"%{key} world\"",
"field": "message",
"target_prefix": "extracted"
Member

If you leave this one out, you can test the defaults.

Contributor Author

I have added something specific to test all the defaults instead; I prefer clearer intent.

@ph
Contributor Author

ph commented May 29, 2018 via email

@ph ph force-pushed the feature/dissect-tokenizer branch from aa49bba to 7bf3b5a Compare May 29, 2018 13:13
Implement the Dissect Tokenizer from `logstash-input-dissect`[1]

This tokenizer allows you to define string patterns and extract the
relevant information. It also lets you do some string manipulation
when extracting the keys.

Example tokenizer:

```yaml
tokenizer: "%{at} - [%{machine}] %{code} - %{message}"
message: "10/10/2017 - [wopr] 1 - oh fire fire!"
result:
  at: "10/10/2017"
  machine: "wopr"
  code: "1"
  message: "of fire fire!"
```

```yaml
tokenizer: "%{?key} %{&key}"
message: "hello world"
result:
  hello: "world"
```

Example of configuration:

```yaml
processors:
 - dissect:
    tokenizer: "%{key1} - %{key2}"
    field: "message"
    target_prefix: "extracted"
```

Dissect supports a few more features:

- Indirect field
- Append
- Skip field
- Greedy padding for CSV files

[1]: https://github.com/logstash-plugins/logstash-filter-dissect
@ph ph force-pushed the feature/dissect-tokenizer branch from 7bf3b5a to 8dedd14 Compare May 29, 2018 13:19
@ph
Contributor Author

ph commented May 29, 2018

@ruflin I've made the last changes; I think we are good now, thanks!

}

var defaultConfig = config{
Field: "message",
Member

Based on elastic/ecs#3 this could become log.message (not now)

@ruflin ruflin merged commit 81c8671 into elastic:master May 29, 2018
@DanielYWoo

Hi guys, what's the next release date?
