Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/filelog] Flush can send partial input #32170

Closed
OverOrion opened this issue Apr 4, 2024 · 11 comments
Closed

[receiver/filelog] Flush can send partial input #32170

OverOrion opened this issue Apr 4, 2024 · 11 comments
Labels

Comments

@OverOrion
Copy link
Contributor

Component(s)

pkg/stanza

What happened?

Description

As mentioned in #31512 the filelogreceiver could "loose" some characters. The issue is a bit long now, so this is here to summarize the problem, minimize the reproduction steps and point to the relevant lines of code.

The problem is how the flush logic behaves with a scanner. By default the scanner expects newline terminated lines, but if it can't find it, then it will return the current buffer once flush timeout expires. The problem with this is that there is no communication between the flushing and scanner, so the following is possible:

  1. scanner scanned some bytes, but has not reached EOF yet
  2. forced flush happens, then the scanner yields its' inner buffer

The problem here seems to be with the different lifetimes of FlushState and Scanner as there is a single FlushState instance for a reader which will have different Scanner instances

func New(r io.Reader, maxLogSize int, bufferSize int, startOffset int64, splitFunc bufio.SplitFunc) *Scanner {

This means that these scanner instances will all all share the same fate:

  1. Read n bytes (n == initial buffer size), try to read more but since it can't find the newline terminator it won't return with tokens.
  2. Becase it did not run successfully, then a new one will be constructed:
    s := scanner.New(r, r.maxLogSize, r.initialBufferSize, r.Offset, r.splitFunc)
    // Iterate over the tokenized file, emitting entries as we go
    for {
    select {
    case <-ctx.Done():
    return
    default:
    }
    ok := s.Scan()
    if !ok {
    if err := s.Error(); err != nil {
    r.logger.Errorw("Failed during scan", zap.Error(err))
    } else if r.deleteAtEOF {
    r.delete()
    }
    return
    }
  3. Once the flush timer expires, then the current Scanner will be force flushed, yielding n bytes only (this could be different, depenending on when the flush timeout reaps it)

The reconstruction is needed because:

  1. once a scanner reached the end of its' input then it won't be usable anymore and
  2. because this is how the Collector gets the new input from a file: a new Scanner with the offset

Steps to Reproduce

Input creation without newline ending

printf "2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpPQ78jzoFu" > input.log

Collector

  1. The easiest is to change the default scanner buffer size to something small (50 bytes, as the input is just a little over 100 bytes)
  2. Build and run the collector with the given configuration
  3. Check the output log file, the input is chunked into 50 bytes
{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"observedTimeUnixNano":"1712240671818953417","body":{"stringValue":"2024-03-19T11:21:00.839338492-05:00 stdout P 2024-"},"attributes":[{"key":"log.file.name","value":{"stringValue":"input.log"}}],"traceId":"","spanId":""}]}]}]}
{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"observedTimeUnixNano":"1712240677018432716","body":{"stringValue":"03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpP"},"attributes":[{"key":"log.file.name","value":{"stringValue":"input.log"}}],"traceId":"","spanId":""}]}]}]}
{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"observedTimeUnixNano":"1712240682018598612","body":{"stringValue":"Q78jzoFu"},"attributes":[{"key":"log.file.name","value":{"stringValue":"input.log"}}],"traceId":"","spanId":""}]}]}]}

Expected Result

Whole line as is

Actual Result

Chunked line

Possible solutions

  • A possible workaround would be if flushing only happened when atEOF was true, but it's only for file based sources, so it would not work for TCP for example.
  • A different "polling" method for Scanners so they would not have to be recreated just to read new input/lines, and the flush timeout would only send the the buffer if the scanning can't advance anymore
  • The bufio.Scanner might need to be retired in favor of something else (something based on bufio.Reader perhaps?), combined with keeping track of the current partial token. This is something that needs to be given some thought because many things rely on the scanner currently.

What do you think @djaglowski @ChrsMark?

Also huge kudos to @MrAnno for pair debugging this issue with me 🚀

Collector version

e4c5b51

Environment information

Environment

OS: Ubuntu 23.10
Compiler(if manually compiled): go 1.21.6

OpenTelemetry Collector configuration

receivers:
  filelog:
    start_at: beginning
    include:
    - /home/orion/input.log

exporters:
  file/simple:
    path: ./partial_output
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [debug, file/simple]

Log output

{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"observedTimeUnixNano":"1712240671818953417","body":{"stringValue":"2024-03-19T11:21:00.839338492-05:00 stdout P 2024-"},"attributes":[{"key":"log.file.name","value":{"stringValue":"input.log"}}],"traceId":"","spanId":""}]}]}]}
{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"observedTimeUnixNano":"1712240677018432716","body":{"stringValue":"03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpP"},"attributes":[{"key":"log.file.name","value":{"stringValue":"input.log"}}],"traceId":"","spanId":""}]}]}]}
{"resourceLogs":[{"resource":{},"scopeLogs":[{"scope":{},"logRecords":[{"observedTimeUnixNano":"1712240682018598612","body":{"stringValue":"Q78jzoFu"},"attributes":[{"key":"log.file.name","value":{"stringValue":"input.log"}}],"traceId":"","spanId":""}]}]}]}

Additional context

I also added some good ole' print statements to Func()

func (s *State) Func(splitFunc bufio.SplitFunc, period time.Duration) bufio.SplitFunc {
and to s.Bytes()
which helped with debugging.

// First scanner instance
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024- // First 50 bytes
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpP // First 2*50 bytes
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpPQ78jzoFu // Leftovers
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpPQ78jzoFu //EOF

// Second scanner instance, same fate
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpP
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpPQ78jzoFu
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-03-13 11:51:00,838 [scheduler-2] INFO  dLphJ63kHpPQ78jzoFu

// Third scanner instance
inside Func, data is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-

// Flush timeout, sending current scanner's buffer, calling s.Bytes()
 tokenizing, len(bytes): 50,
tokenizing, bytes is: 2024-03-19T11:21:00.839338492-05:00 stdout P 2024-
@OverOrion OverOrion added bug Something isn't working needs triage New item requiring triage labels Apr 4, 2024
Copy link
Contributor

github-actions bot commented Apr 4, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski djaglowski changed the title stanza flush can send partial input [receiver/filelog] Flush can send partial input Apr 4, 2024
Copy link
Contributor

github-actions bot commented Apr 4, 2024

Pinging code owners for receiver/filelog: @djaglowski. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Copy link
Member

djaglowski commented Apr 4, 2024

Thanks for investigating and writing this up @OverOrion.

To summarize the expected vs problematic behavior:

Expected: If a file ends with an unterminated log and the flush timer expires, we should flush a token containing the content after the offset, up until either the end of file or max log size.

Problem: When a file ends with an unterminated log and the flush timer expires, we are flushing a token containing the content after the offset, but only up until initial buffer size.


My understanding of the incorrect behavior does not rely on multiple sequential scanners. I would describe it as follows:

A scanner is created with an initial buffer size.

  • It reads into its buffer until full but doesn't find a complete token.
  • It immediately and automatically enlarges the buffer and tries again.
  • It continues to enlarge the buffer and try again until one of following happens:
  1. The end of a token is found, in which case the token is returned (correct behavior).
  2. The buffer reaches its max size, in which case the entire buffer is returned (correct behavior).
  3. The EOF is found, in which case no token is returned (correct behavior).
  4. The flush timer expires, in which case the contents of the buffer are returned immediately (incorrect behavior).

The correct behavior would be the same, except the last two items should be related:

3a. The EOF is found before the flush timer has expired. No token is returned.
3b. The EOF is found after the flush timer has expired. The entire buffer is returned.

A possible workaround would be if flushing only happened when atEOF was true

I think this is the solution to enable the correct behavior described above. In fact, I believe this behavior previously existed but the nuance was not tested or documented. Let's make sure to include both this time so we don't regress again in the future.

@atoulme atoulme removed the needs triage New item requiring triage label Apr 5, 2024
@ChrsMark
Copy link
Member

ChrsMark commented Apr 5, 2024

Thank's @OverOrion for digging into this and sharing the details!

The EOF is found, in which case no token is returned (correct behavior).

I wonder if it's correct to wait for the flush timer to expire in order to return the buffer in case the EOF is found instead of doing it immediately when EOF is found 🤔 .

If we return at EOF immediately, a bigger force_flush_period would give more time to the Scanner to increase the buffer and consume the remaining message until EOF. Then the complete message would be returned. I have replicated this at ChrsMark@e12a690 and seems to work with the example tested by @OverOrion and the big ones we used at #31512.

If we want to keep this in order to preserve 3a (from #32170 (comment)) then 3b should be like:

3b: If flush timer has expired but the EOF is not found skip flush and try again?

In that case the flush timer actually has more or less no effect, right?

@ChrsMark
Copy link
Member

ChrsMark commented Apr 5, 2024

Another option could be to move the We're seeing new data so postpone the next flush block:

// We're seeing new data so postpone the next flush
if len(data) > s.LastDataLength {
s.LastDataChange = time.Now()
s.LastDataLength = len(data)
}

before the timeout check (previous if block).

This if-block only makes sense if the goal is to actually consume the rest of the data but by leaving it in the end will not ensure that on the next call the timeout won't be reached. So it actually has an a non deterministic impact which depends on "timing".

However this would be completely identical with preserving the reading till the EOF.

@djaglowski
Copy link
Member

I wonder if it's correct to wait for the flush timer to expire in order to return the buffer in case the EOF is found instead of doing it immediately when EOF is found 🤔 .

This is worth a separate issue to discuss if you want. In short though, I think this makes a lot of assumptions about how files are written which I've never been comfortable making. Does every application & OS write complete logs atomically? Otherwise we're just emitting partial logs which would have been complete if we just waited a little longer. Maybe there's a case to be made but I think this could potentially create big problems.

If we return at EOF immediately, a bigger force_flush_period would give more time to the Scanner to increase the buffer and consume the remaining message until EOF.

If we want to keep this in order to preserve 3a (from #32170 (comment)) then 3b should be like:

3b: If flush timer has expired but the EOF is not found skip flush and try again?

In that case the flush timer actually has more or less no effect, right?

The key concept which I maybe didn't articulate well is that flushing is never a necessity prior to EOF. It shouldn't be part of the logic at all. (This is essentially what the bug boils down to, that the flush timer is being applied when it really shouldn't even be considered.) As long as we're not yet at EOF, we should just keep consuming the data rapidly and never look at the flush timer. Once at EOF, it's a relevant consideration.

Typically when flushing is necessary at all, it's not because the timer expires while reading the file. It's because file is sitting idle and the timer expired in between polls. It can work either way, but it's not intended to be a consideration which is relevant during the normal course of consuming the file.

@djaglowski
Copy link
Member

This if-block only makes sense if the goal is to actually consume the rest of the data but by leaving it in the end will not ensure that on the next call the timeout won't be reached. So it actually has an a non deterministic impact which depends on "timing".

Not sure I understand what you are suggesting but the purpose of that is to reset the timer because the timer is relative to the last time a log was emitted. In other words, when the final (partial) log in a file is reached, this is basically when the timer should start.

@ChrsMark
Copy link
Member

ChrsMark commented Apr 5, 2024

This is worth a separate issue to discuss if you want. In short though, I think this makes a lot of assumptions about how files are written which I've never been comfortable making. Does every application & OS write complete logs atomically? Otherwise we're just emitting partial logs which would have been complete if we just waited a little longer. Maybe there's a case to be made but I think this could potentially create big problems.

I see yeap. It makes sense to not rely on such an assumption.

This if-block only makes sense if the goal is to actually consume the rest of the data but by leaving it in the end will not ensure that on the next call the timeout won't be reached. So it actually has an a non deterministic impact which depends on "timing".

Not sure I understand what you are suggesting but the purpose of that is to reset the timer because the timer is relative to the last time a log was emitted. In other words, when the final (partial) log in a file is reached, this is basically when the timer should start.

My point was mainly that by inverting the order of the final 2 if-blocks we ensure that first the check for remaining data happens and only if we have no remaining data we proceed to the timeout check. So the flow should be:

...
// We're seeing new data so postpone the next flush
if len(data) > s.LastDataLength {
	s.LastDataChange = time.Now()
	s.LastDataLength = len(data)
}

// Flush timed out
if time.Since(s.LastDataChange) > period {
	s.LastDataChange = time.Now()
	s.LastDataLength = 0
	return len(data), data, nil
}

// Ask for more data
return 0, nil, nil
...

If I don't miss anything this looks equivalent to introducing the EOF requirement/check?
I verified this with some manual tests as well.

Overall I believe we are aligned here. We can keep any remaining discussions at #32100.

@djaglowski
Copy link
Member

My point was mainly that by inverting the order of the final 2 if-blocks we ensure that first the check for remaining data happens and only if we have no remaining data we proceed to the timeout check.

That makes sense. Thanks for clarifying.

djaglowski added a commit that referenced this issue Apr 23, 2024
**Description:**
Flush could have sent partial input before EOF was reached, this PR
fixes it.

**Link to tracking Issue:** #31512, #32170

**Testing:** Added unit test `TestFlushPeriodEOF`

**Documentation:** Added a note to `force_flush_period` option

---------

Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
@ChrsMark
Copy link
Member

Since #32100 was merged I guess we can close this?

@ChrsMark
Copy link
Member

@crobert-1 since we closed #31512 I think we can close this one as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants