Conversation
Codecov Report
@@           Coverage Diff           @@
##            main    #341    +/-  ##
======================================
  Coverage   77.2%   77.2%
======================================
  Files         94      94
  Lines       4448    4471    +23
======================================
+ Hits        3434    3454    +20
- Misses       697     698     +1
- Partials     317     319     +2
@rockb1017 Thanks for working on this. I like how it's looking but have some suggestions on one aspect of the design. Curious if you agree.
} else {
    s = "DefaultSourceIdentifier"
}

if r.batchSize >= r.maxBatchSize {
The way you've chosen to use this is interesting. We are batching entries by source, but counting them all together against an overall maximum. My first instinct was to suggest that we just allow each batch to hit a max size, but the overall max has a couple of nice benefits for memory usage. For one, it roughly limits the overall amount of memory that may need to be allocated by the operator. The other benefit is that it provides a simple way to garbage collect old sources over time.
All that said, I wonder if this could lead to difficult-to-explain behavior in some cases. For example, as a user, I think I would expect that `max_batch_size` will only place a per-source limit on the number of entries that would be combined into a single message. Let's say I know my system splits all logs into two parts, so I just set `max_batch_size: 2`. Now if I have more than one source, I will have lots of premature flushes.
An alternate design, which I think would solve all the same problems while being a little easier to understand:
- Let's apply `max_batch_size` to each source individually.
  - This allows us to remove `r.batchSize` and just depend on `len(batchMap[s])`.
- Add another config setting called `max_sources` with a reasonable default, maybe 1000.
  - At the end of `addToBatch`, check `if len(batchMap) > r.maxSources`.
    - If so, we need to determine flushing logic. I think it's probably ok to just flush all at this point.
- Whenever we flush entries for a source, `delete(batchMap, s)` to ensure we are tracking the number of sources correctly.
Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
for source := range r.batchMap {
    for _, entry := range r.batchMap[source] {
        r.Write(ctx, entry)
    }
}
It would be nice if we combined entries for each source, but given that this is a fairly atypical bailout condition, I think it's ok to proceed with individual entries. I'll make a ticket to capture this as a future improvement.
Fixes #276.
Added a `sourceIdentifier` parameter to prevent logs from multiple files from being combined into one event.
Had to update the flush timeout logic to apply the timeout separately for each source.