Skip to content
This repository has been archived by the owner on May 25, 2022. It is now read-only.

Source identifier #341

Merged
merged 8 commits into from
Jan 20, 2022
Merged

Conversation

rockb1017
Copy link
Contributor

fixes #276
Added sourceIdentifier parameter to prevent logs from multiple files being combined into one event.

Had to update the flush timeout logic to apply the timeout separately for each source.

@rockb1017 rockb1017 requested a review from a team January 11, 2022 19:43
@codecov
Copy link

codecov bot commented Jan 11, 2022

Codecov Report

Merging #341 (7a7bcb7) into main (0197a90) will increase coverage by 0.0%.
The diff coverage is 81.3%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main    #341   +/-   ##
=====================================
  Coverage   77.2%   77.2%           
=====================================
  Files         94      94           
  Lines       4448    4471   +23     
=====================================
+ Hits        3434    3454   +20     
- Misses       697     698    +1     
- Partials     317     319    +2     
Impacted Files Coverage Δ
...perator/builtin/transformer/recombine/recombine.go 76.3% <81.3%> (+2.0%) ⬆️

Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rockb1017 Thanks for working on this. I like how it's looking but have some suggestions on one aspect of the design. Curious if you agree.

docs/operators/recombine.md Outdated Show resolved Hide resolved
operator/builtin/transformer/recombine/recombine.go Outdated Show resolved Hide resolved
operator/builtin/transformer/recombine/recombine.go Outdated Show resolved Hide resolved
} else {
s = "DefaultSourceIdentifier"
}
if r.batchSize >= r.maxBatchSize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way you've chosen to use this is interesting. We are batching entries by source, but counting them all together as an overall maximum. My first instinct was to suggest that we just allow each batch to hit a max size, but the overall max has a couple nice benefits for memory usage. For one, it roughly limits the overall size of memory that may be need to be allocated by the operator. The other benefit is that it provides a simple way to garbage collect old sources over time.

All that said, I wonder if this could lead to difficult-to-explain behavior in some cases. For example, as a user, I think I would expect that max_batch_size will only place a per-source limit on the number of entries that would be combined into a single message. Let's say I know my system splits all logs into two parts, so I just set max_batch_size: 2. Now if I have more than one source, I will have lots of premature flushes.

An alternate design, which I think would solve all the same problems while being a little easier to understand:

  • Let's apply max_batch_size to each source individually.
    • This allows us to remove r.batchSize and just depend on len(batchMap[s])
  • Add another config setting called max_sources with a reasonable default, maybe 1000
    • At the end of addToBatch, check if len(batchMap) > r.maxSources
      • If so, we need to determine flushing logic. I think it's probably ok to just flush all at this point
  • Whenever we flush entries for a source, delete(batchMap, s) to ensure we are tracking number of sources correctly

rockb1017 and others added 2 commits January 14, 2022 10:46
Comment on lines +284 to +287
for source := range r.batchMap {
for _, entry := range r.batchMap[source] {
r.Write(ctx, entry)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we combined entries for each source, but given that this is a fairly atypical bailout condition, I think it's ok to proceed with individual entries. I'll make a ticket to capture this as a future improvement.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

filelog: Multiline merging mixes up logs from different files
2 participants