Fix BufferedTokenizer to properly resume after a buffer full condition respecting the encoding of the input string #16968
base: main
Conversation
cc @andsel
Uncovered use cases

This is a bugfix on the original code to solve the problem of respecting the input encoding. Check with the pipeline below:
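The exact pipeline is a plausible reconstruction: the `json_lines` codec and the 100-byte `decode_size_limit_bytes` limit are assumptions inferred from the loading script below.

```
bin/logstash -e "input { tcp { port => 1234 codec => json_lines { decode_size_limit_bytes => 100 } } } output { stdout { codec => rubydebug } }"
```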
and a loading script like:

```ruby
require 'socket'
require 'json'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)
data = {"a" => "a" * 10}.to_json + "\n" + {"b" => "b" * 105}.to_json
socket.write(data)
socket.close
```

it produces an output like:
Ideal solution

To solve this problem, the …
Release notes
[rn:skip]
What does this PR do?
This is a second take at fixing the processing of tokens from the tokenizer after a buffer full error. The first try #16482 was rolled back because of the encoding error #16694.
The first attempt failed to return the tokens in the same encoding as the input.
This PR does a couple of things:
- uses the `concat` method instead of `addAll`, which avoids converting RubyString to String and back to RubyString;
- when returning the head `StringBuilder`, it enforces the encoding of the input charset (see the sketch below).
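A minimal Ruby sketch of the encoding point, not the actual Logstash code, illustrating what "enforcing the encoding of the input charset" means when the accumulated head is handed back:

```ruby
# Illustration only: if the accumulated head is returned without re-applying the
# input encoding, the bytes get a wrong label and are misread downstream.
input = "caf\xE9\n".force_encoding(Encoding::ISO_8859_1)  # 0xE9 is 'é' in ISO-8859-1

head  = input.chomp.dup.force_encoding(Encoding::BINARY)  # an encoding-losing copy
fixed = head.dup.force_encoding(input.encoding)           # what the fix enforces

puts head.encoding   # => ASCII-8BIT (wrong label)
puts fixed.encoding  # => ISO-8859-1 (matches the input)
```

The actual change lives in the Java `BufferedTokenizerExt`; the Ruby snippet only mirrors the intent.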
Why is it important/What is the impact to the user?

It permits the tokenizer to be used effectively also in contexts where a line is bigger than the configured size limit.
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files (and/or docker env variables)

Author's Checklist
How to test this PR locally
The test plan has two sides: verifying that the tokenizer properly resumes after a buffer full condition, and verifying that the encoding of the input is respected.
How to test the encoding is respected
Start up a REPL with Logstash and exercise the tokenizer:
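A sketch of what such a REPL session could look like, assuming the Java tokenizer is exposed to Ruby as `FileWatch::BufferedTokenizer` with an `extract` method (adjust names if the entry point differs):

```ruby
# bin/logstash -i irb
buftok = FileWatch::BufferedTokenizer.new("\n", 100)   # delimiter, size limit

# A chunk in ISO-8859-1 containing '£' (byte 0xA3)
data = "first \xA3\nsecond".force_encoding(Encoding::ISO_8859_1)

buftok.extract(data).each do |token|
  puts "#{token.inspect} -> #{token.encoding}"         # expect ISO-8859-1
end
```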
or use the following script
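A plausible version of such a script (hostname, port, and payload are assumptions matching the pipeline below):

```ruby
require 'socket'

socket = TCPSocket.open('localhost', 1234)
# ISO-8859-1 payload containing '£' (byte 0xA3)
data = "a line with a pound sign \xA3\n".force_encoding(Encoding::ISO_8859_1)
socket.write(data)
socket.close
```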
with Logstash run as:

```
bin/logstash -e "input { tcp { port => 1234 codec => line { charset => 'ISO8859-1' } } } output { stdout { codec => rubydebug } }"
```
In the output the `£` has to be present, and not a mis-decoded sequence such as `Â£`.
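For reference, a short illustration (assuming the broken behaviour manifests as a double-encoded pound sign) of how `£` turns into `Â£` when its UTF-8 bytes are re-labelled with the wrong charset:

```ruby
utf8 = "£"                                               # UTF-8 bytes: 0xC2 0xA3
mislabeled = utf8.dup.force_encoding(Encoding::ISO_8859_1)
puts mislabeled.encode(Encoding::UTF_8)                   # => "Â£" (the broken output)
```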
Related issues
- Encoding error in `BufferedTokenizerExt`: #16694