Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vrl: null and unicode escapes not working in string literals #148

Open
hhromic opened this issue Mar 21, 2022 · 23 comments
Open

vrl: null and unicode escapes not working in string literals #148

hhromic opened this issue Mar 21, 2022 · 23 comments
Labels
type: bug A code related bug vrl: parser Changes to the syntax parser

Comments

@hhromic
Copy link
Contributor

hhromic commented Mar 21, 2022

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I recently found out that the null and unicode character escape sequences are not working in string literal expressions in VRL.
The documentation makes sense, therefore I think something is off in the VRL parser.

Examples from the VRL REPL:

$ "hello\0world"
error[E202]: syntax error
  ┌─ :1:1
  │
1 │ "hello\0world"
  │ ^^^^^^^^^^^^^^ unexpected error: invalid escape character: \0
  │
  = see language documentation at https://vrl.dev
$ "hello\u{1F30E}world"  # also tested with \U{1F30E} and \u1F30E just in case
error[E202]: syntax error
  ┌─ :1:1
  │
1 │ "hello\u{1F30E}world"
  │ ^^^^^^^^^^^^^^^^^^^^^ unexpected error: invalid escape character: \u
  │
  = see language documentation at https://vrl.dev

All the other documented escape sequences seems to be working fine (although not all printed correctly):

$ "hello\nworld"
"hello\nworld"

$ "hello\"world"
"hello\"world"

$ "hello\'world"
"hello'world"

$ "hello\\world"
"hello\\world"

$ "hello\nworld"
"hello\nworld"

$ "hello\rworld"  # \r (carriage-return) was rendered instead of printed with an escape
world"

$ "hello\tworld"  # \t (tab) was rendered instead of printed with an escape
"hello  world"

Configuration

N/A

Version

vector 0.20.0 (x86_64-unknown-linux-gnu 2a706a3 2022-02-11)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@hhromic hhromic added the type: bug A code related bug label Mar 21, 2022
@hhromic
Copy link
Contributor Author

hhromic commented Mar 21, 2022

@hhromic
Copy link
Contributor Author

hhromic commented Mar 21, 2022

After further investigation, it seems that these escape sequences used to work before, but they were not ported over in the VRL parser/compiler re-implementation PR vectordotdev/vector#6353.

@jszwedko
Copy link
Member

Interesting, thanks for the sleuthing! cc/ @StephenWakely @JeanMertz for thoughts on this.

@JeanMertz
Copy link
Contributor

This is correct. I didn't port them over, not for any particular reason, other than wanting to start out simple, and adding more escape sequences as requested by the community.

I think it makes sense to handle escape characters similar to how the Rust compiler handles them.

@hhromic
Copy link
Contributor Author

hhromic commented Mar 23, 2022

I'm glad the omission was just for simplicity of the initial implementation instead of a limitation :)
And yes, having parity with Rust's escape character mechanism would be very nice and complete indeed.

In fairness, I don't think the missing escape characters I reported are a big use-case, I haven't seen anyone asking for them except for me in this issue after a long time since the new lexer was introduced.

The use-case for me that triggered discovering the missing escape characters is that I was testing my implementations of ip_ntop() and ip_pton() using the VRL REPL and I wasn't able to generate strings with arbitrary bytes. In other words, the escape I needed was the binary \xNN escape. Then the documentation pointed me to \uNNNNNN (which is not exactly the same but close-enough), which didn't work either.

@JeanMertz JeanMertz added vrl: parser Changes to the syntax parser type: enhancement A value-adding code change that enhances its existing functionality and removed type: bug A code related bug labels Jun 7, 2022
@hhromic hhromic changed the title VRL: null and unicode escapes not working in string literals vrl: null and unicode escapes not working in string literals Sep 13, 2022
@EdN-terascope
Copy link

In VRL, I'm having troubles with the limits Rust has for escaped characters and regex. PCRE2, Python or Golang style is preferable to Rust or ECMAScript here.

Thx

@jszwedko jszwedko added type: bug A code related bug and removed type: enhancement A value-adding code change that enhances its existing functionality labels Mar 9, 2023
@fuchsnj fuchsnj transferred this issue from vectordotdev/vector Mar 28, 2023
@arthmoeros
Copy link
Contributor

arthmoeros commented Apr 28, 2023

Is anyone by any chances working on this?, I did try to add null escape in the mentioned new lexer, but I can't get it to work, still says \0 is an invalid escape.

I'm not very well versed on Rust so maybe I'm missing something, this is my fork, if anybody could point me in the right direction, any help would be appreciated.

main...arthmoeros:vrl:null_escape_lexer

We do have a use case where we need to remove null characters in a message, a workaround would also be welcome. In the meantime I will try with a lua transform.

@StephenWakely
Copy link
Contributor

@arthmoeros Your code looks fine and works for me.

I'm curious how you are testing it? Perhaps you are still running the Vector project that is pulling VRL from Github and not from your local repo?

@arthmoeros
Copy link
Contributor

@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.

arthmoeros/vector@4f8eb9e

So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)

@StephenWakely
Copy link
Contributor

@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.

arthmoeros/vector@4f8eb9e

So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)

Ah, you want to specify branch rather than rev.

vrl = { package = "vrl", git = "https://github.com/arthmoeros/vrl", branch = "null_escape_lexer" }

An easier way to test directly in the VRL repo is to just run the cli project:

> cd lib/cli
> cargo run

@arthmoeros
Copy link
Contributor

@StephenWakely I compiled Vector changing every reference of VRL using my Github repo, I just pushed my fork of Vector where I made the change.
arthmoeros/vector@4f8eb9e
So maybe I'm doing something wrong on Vector compilation? (I dont really know what though hehe)

Ah, you want to specify branch rather than rev.

vrl = { package = "vrl", git = "https://github.com/arthmoeros/vrl", branch = "null_escape_lexer" }

An easier way to test directly in the VRL repo is to just run the cli project:

> cd lib/cli
> cargo run

Wonderful, thanks for the tip on the cli project, worked like charm and indeed easier to test.

Just submitted the PR: #219

Thanks a lot for your help!

@dss010101
Copy link

would this be the reason that this:

source = '''
    . |= parse_regex!(
            .message, 
            pattern = [r"\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)"]
    )
    .log_time, err = parse_timestamp(.log_time, "%Y-%m-%d %H:%M:%S,%f")
    .process = to_int!(.process)
    .thread = to_int!(.thread)
'''

causes this exception:


        | 2023-05-15T04:21:51.724291Z  INFO vector::app: Loading configs. paths=["/etc/vector/vector.toml"]
        | 2023-05-15T04:21:51.747923Z ERROR vector::topology: Configuration error. error=Transform "log_parser":
        | error[E202]: syntax error
        |   ┌─ :1:1
        |   │
        | 1 │ ╭     . |= parse_regex!(
        | 2 │ │             .message,
        | 3 │ │             pattern = [r"\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)"]
        | 4 │ │     )
        |   · │
        | 7 │ │     .thread = to_int!(.thread)
        | 8 │ │
        |   │ ╰^ unexpected error: invalid escape character: \[

Is there a workaround for this?

@StephenWakely
Copy link
Contributor

Is there a workaround for this?

Hi. There are multiple issues with your code, none of which are related to this issue.

  • Named parameters use : - so use pattern: rather than pattern = for the pattern parameter.
  • Regex literals use r'..' rather than r".." - this is the primary cause the error you are getting.
  • The pattern parameter is a single regex, not an array - so don't enclose the pattern with [...].

This should work:

. |= parse_regex!(
            .message, 
            pattern: r'\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)'
    )

@dss010101
Copy link

Is there a workaround for this?

Hi. There are multiple issues with your code, none of which are related to this issue.

  • Named parameters use : - so use pattern: rather than pattern = for the pattern parameter.
  • Regex literals use r'..' rather than r".." - this is the primary cause the error you are getting.
  • The pattern parameter is a single regex, not an array - so don't enclose the pattern with [...].

This should work:

. |= parse_regex!(
            .message, 
            pattern: r'\[(?P<log_time>.+)\]\[(?P<process>\d+):(?P<thread>\d+)\]\[(?P<severity>\w+)\]\[(?P<module>\w+)\.(?P<func>\w+)\] (?P<msg>.*$)'
    )

ok thank you - now i know. i actually had success with the grok patterns. but have a question. In the below i have several parsing patterns.
if the 'msg' field i parse out may contain a pattern like 'app_id=app_1234', can i post parse and add 'app_id' as a field, rather than include another parser pattern for it as i do below? It seems it would be cleaner to post parse and add optional fields that may or may not be in the msg. Are there any examples of doing this?

source = '''
    . |= parse_groks!(
            .message, 
            patterns:[
                "\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]\\[%{WORD:severity}\\]\\[%{WORD:module}.%{WORD:func}\\] app_id:%{NOTSPACE:app_id} %{GREEDYDATA:msg}",
                "\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]\\[%{WORD:severity}\\]\\[%{WORD:module}.%{WORD:func}\\] %{GREEDYDATA:msg}",
                "\\[%{TIMESTAMP_ISO8601:log_time}\\]\\[%{INT:process}:%{INT:thread}\\]%{WORD:severity}:%{WORD:module}:%{GREEDYDATA:msg}"
            ]
    )

'''

example logs with/and without app_id:

[2023-05-15 11:52:29,401][2555:140431094425152][INFO][default.run] initalizing application
[2023-05-15 11:52:29,401][2555:140431094425152][INFO][default.run] app_id:my_app_12345 initialized.

@dss010101
Copy link

dss010101 commented May 16, 2023

  • single regex, not an array - so don't enclose the pattern with [...].

Is it only parse_groks that allow any array of patterns then? as im learning more - i read that grok internally uses regex anyway, so i thought it may be a bit more performant if i use parse_regex. but i have logs that have different formats...and wondering if there is a way to pass these different patterns to parse_regex. or do i have to have multiple parse_regex calls?

@jszwedko
Copy link
Member

  • single regex, not an array - so don't enclose the pattern with [...].

Is it only parse_groks that allow any array of patterns then? as im learning more - i read that grok internally uses regex anyway, so i thought it may be a bit more performant if i use parse_regex. but i have logs that have different formats...and wondering if there is a way to pass these different patterns to parse_regex. or do i have to have multiple parse_regex calls?

👋 do you mind opening this as a separate discussion post since it is unrelated to this issue?

@dss010101
Copy link

Done...sorry for hijacking threads...#246

@polarathene
Copy link

Just chiming in to mention that the docs example for strip_ansi_escape_codes!() fails with \e escape code. It suggests to use the VRL REPL which likewise fails:

image

$ vector vrl
$ strip_ansi_escape_codes("\e[46mfoo\e[0m bar")

error[E202]: syntax error
  ┌─ :1:1

1 │ strip_ansi_escape_codes("\e[46mfoo\e[0m bar")
  │ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ unexpected error: invalid escape character: \e

  = see language documentation at https://vrl.dev
  = try your code in the VRL REPL, learn more at https://vrl.dev/examples

In my case, I was actually trying to apply ANSI escape codes for coloured output in the console sink, which either outputs unescaped (or partial) ANSI codes as plain text, or fails with an error like above.

I wasn't sure if it was unsupported by the sink, but perhaps it's related to this issue (Vectors own internal log output leveraged colour without issue):

image

@fingon
Copy link

fingon commented Mar 7, 2024

It is somewhat frustrating that I cannot cut-n-paste (json encoded, e.g. \u..) ANSI escape strings from logs to vector test so that I could ensure that my pipeline (correctly) handles their stripping. In regexps it is less important, I guess, although it would be nice to have it working there too.

@DylanRJohnston
Copy link

Should probably adjust the documentation on string expressions until support for these escape sequences is re-added https://vector.dev/docs/reference/vrl/expressions/#string-characteristics. I thought I was doing something wrong.

@pront
Copy link
Member

pront commented Jan 14, 2025

Should probably adjust the documentation on string expressions until support for these escape sequences is re-added https://vector.dev/docs/reference/vrl/expressions/#string-characteristics. I thought I was doing something wrong.

Fair point. I will try to find some time this week.

@ulidtko
Copy link

ulidtko commented Feb 18, 2025

Same applies to \x sequences — understood by the compiler, but syntax error in REPL.

I.e. this works in actual config:

.message = replace(.message, r'\x1b\[[0-9;]*m', "") # strip colors

But try testing that in the repl 😉 You'll be surprised (unpleasantly).


Should probably adjust the documentation

From the pointers given by @hhromic, implementing support in the lexer seems more productive than eternalizing the bug by documenting it.

I'm confused though, is the lexer of repl different from the one of compiler?.. That seems weird 🤔

@zpriddy
Copy link

zpriddy commented Feb 26, 2025

I just ran across this issue this week. Logs were pulled from GCP Logging via PubSub. There were terminal color encodings that were included in the log itself.

I took some tips from another post for testing in the VRL console. Adding the log event to a file and running the VRL with --input as vector vrl --input file_with_events.json worked with the unicode in the event.

Also found out that using s'... subj=system_u:system_r:init_t:s0 key=(null)\u001dARCH=x86_64 SYSCALL= ... did work for stripping the escape..

For the time I just updated the agent to strip escapes before writing the logs for the aggregator to pick up..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug A code related bug vrl: parser Changes to the syntax parser
Projects
None yet
Development

No branches or pull requests