Source identifier #341

rockb1017 · 2022-01-11T19:43:15Z

fixes #276
Added sourceIdentifier parameter to prevent logs from multiple files being combined into one event.

Had to update the flush timeout logic to apply the timeout separately for each source.

codecov · 2022-01-11T19:45:34Z

Codecov Report

Merging #341 (7a7bcb7) into main (0197a90) will increase coverage by 0.0%.
The diff coverage is 81.3%.

@@          Coverage Diff          @@
##            main    #341   +/-   ##
=====================================
  Coverage   77.2%   77.2%           
=====================================
  Files         94      94           
  Lines       4448    4471   +23     
=====================================
+ Hits        3434    3454   +20     
- Misses       697     698    +1     
- Partials     317     319    +2

Impacted Files	Coverage Δ
...perator/builtin/transformer/recombine/recombine.go	`76.3% <81.3%> (+2.0%)`	⬆️

djaglowski

@rockb1017 Thanks for working on this. I like how it's looking but have some suggestions on one aspect of the design. Curious if you agree.

djaglowski · 2022-01-12T16:58:26Z

operator/builtin/transformer/recombine/recombine.go

+	} else {
+		s = "DefaultSourceIdentifier"
+	}
+	if r.batchSize >= r.maxBatchSize {


The way you've chosen to use this is interesting. We are batching entries by source, but counting them all together against an overall maximum. My first instinct was to suggest that we just allow each batch to hit a max size, but the overall max has a couple of nice benefits for memory usage. For one, it roughly limits the overall amount of memory that may need to be allocated by the operator. The other benefit is that it provides a simple way to garbage collect old sources over time.

All that said, I wonder if this could lead to difficult-to-explain behavior in some cases. For example, as a user, I think I would expect that max_batch_size will only place a per-source limit on the number of entries that would be combined into a single message. Let's say I know my system splits all logs into two parts, so I just set max_batch_size: 2. Now if I have more than one source, I will have lots of premature flushes.

An alternate design, which I think would solve all the same problems while being a little easier to understand:

Let's apply max_batch_size to each source individually.

This allows us to remove r.batchSize and just depend on len(batchMap[s])

Add another config setting called max_sources with a reasonable default, maybe 1000

At the end of addToBatch, check if len(batchMap) > r.maxSources

If so, we need to determine flushing logic. I think it's probably ok to just flush all at this point

Whenever we flush entries for a source, delete(batchMap, s) to ensure we are tracking number of sources correctly
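
The alternate design above could be sketched roughly as follows. This is a simplified illustration, not the operator's actual code: entries are plain strings instead of entry.Entry values, and the recombiner type, its flushed field, newRecombiner, and flushSource are hypothetical names introduced only for this example. It assumes "flush all" on exceeding max_sources emits the buffered entries uncombined.

```go
package main

import (
	"fmt"
	"strings"
)

// recombiner is a hypothetical, simplified stand-in for the recombine operator.
type recombiner struct {
	maxBatchSize int                 // per-source limit on buffered entries
	maxSources   int                 // limit on the number of tracked sources
	batchMap     map[string][]string // buffered entries keyed by source identifier
	flushed      []string            // records emitted output (illustration only)
}

func newRecombiner(maxBatchSize, maxSources int) *recombiner {
	return &recombiner{
		maxBatchSize: maxBatchSize,
		maxSources:   maxSources,
		batchMap:     make(map[string][]string),
	}
}

// addToBatch buffers an entry under its source and applies both limits.
func (r *recombiner) addToBatch(source, body string) {
	r.batchMap[source] = append(r.batchMap[source], body)

	// Per-source limit: len(batchMap[source]) replaces a global batchSize counter.
	if len(r.batchMap[source]) >= r.maxBatchSize {
		r.flushSource(source)
	}

	// Source limit, checked at the end: bail out by flushing everything.
	if len(r.batchMap) > r.maxSources {
		r.flushUncombined()
	}
}

// flushSource combines one source's entries, emits them, and deletes the key
// so that len(batchMap) keeps reflecting the number of live sources.
func (r *recombiner) flushSource(source string) {
	entries := r.batchMap[source]
	delete(r.batchMap, source)
	if len(entries) > 0 {
		r.flushed = append(r.flushed, strings.Join(entries, "\n"))
	}
}

// flushUncombined emits every buffered entry individually and clears the map.
// Deleting keys while ranging over a map is safe in Go.
func (r *recombiner) flushUncombined() {
	for source, entries := range r.batchMap {
		r.flushed = append(r.flushed, entries...)
		delete(r.batchMap, source)
	}
}

func main() {
	// Two interleaved sources, each splitting its logs into two parts.
	r := newRecombiner(2, 1000)
	r.addToBatch("a", "a1")
	r.addToBatch("b", "b1")
	r.addToBatch("a", "a2")
	r.addToBatch("b", "b2")
	fmt.Printf("%q\n", r.flushed) // prints ["a1\na2" "b1\nb2"]
}
```

With max_batch_size: 2 applied per source, the interleaved entries from the two sources are still recombined pairwise, which is the behavior the max_batch_size: 2 example expects.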

djaglowski · 2022-01-20T14:23:53Z

operator/builtin/transformer/recombine/recombine.go

+	for source := range r.batchMap {
+		for _, entry := range r.batchMap[source] {
+			r.Write(ctx, entry)
+		}


It would be nice if we combined entries for each source, but given that this is a fairly atypical bailout condition, I think it's ok to proceed with individual entries. I'll make a ticket to capture this as a future improvement.
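
For reference, the future improvement mentioned here might look something like the sketch below. It is only an illustration under assumptions: entries are plain strings, combineAll is a hypothetical helper rather than anything in the operator, and sources are visited in sorted order purely to make the output deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// combineAll is a hypothetical helper: instead of writing each buffered entry
// on its own, it joins the entries of every source into one combined body,
// then removes that source from the map.
func combineAll(batchMap map[string][]string, sep string) []string {
	// Collect and sort source identifiers so the result order is stable.
	sources := make([]string, 0, len(batchMap))
	for source := range batchMap {
		sources = append(sources, source)
	}
	sort.Strings(sources)

	combined := make([]string, 0, len(sources))
	for _, source := range sources {
		if entries := batchMap[source]; len(entries) > 0 {
			combined = append(combined, strings.Join(entries, sep))
		}
		delete(batchMap, source)
	}
	return combined
}

func main() {
	batchMap := map[string][]string{
		"file_b.log": {"b1", "b2"},
		"file_a.log": {"a1"},
	}
	for _, body := range combineAll(batchMap, "\n") {
		fmt.Printf("%q\n", body)
	}
}
```

The trade-off is a little extra work at the bailout point in exchange for partial batches that stay grouped by source.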

tidy

888d113

rockb1017 requested a review from a team January 11, 2022 19:43

recombine - add sourceIdentifier, update doc

8802b65

rockb1017 force-pushed the sourceIdentifier branch from 20c97e0 to 8802b65 January 11, 2022 19:44

djaglowski reviewed Jan 12, 2022

rockb1017 and others added 2 commits January 14, 2022 10:46

Update operator/builtin/transformer/recombine/recombine.go

143adf8

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>

adding max_sources

90fc4ac

djaglowski reviewed Jan 17, 2022

rockb1017 added 4 commits January 18, 2022 12:17

improve performance and add test cases

303b83d

fix max_source test

4f54ae7

improve max batch size test

019ccb6

revert flushUncombined

7a7bcb7

rockb1017 force-pushed the sourceIdentifier branch from b57be02 to 7a7bcb7 January 20, 2022 08:09

rockb1017 requested a review from djaglowski January 20, 2022 08:58

djaglowski approved these changes Jan 20, 2022

djaglowski mentioned this pull request May 24, 2022

[pkg/stanza] Recombine - When exceeding max_sources, combine entries before flushing open-telemetry/opentelemetry-collector-contrib#10281

Closed

djaglowski merged commit d1c9d7a into open-telemetry:main Jan 20, 2022

rockb1017 deleted the sourceIdentifier branch January 20, 2022 16:33
