Log agent based on container runtime as buffer instead of own filesystem buffer #479

Closed · 2 of 5 tasks
a-thaler opened this issue Oct 23, 2023 · 2 comments
Labels: area/logs LogPipeline · kind/feature

a-thaler commented Oct 23, 2023

Description
As part of the epic kyma-project/kyma#11236, the log agent was designed to support multiple pipelines, where each pipeline can run in isolation. Whenever a pipeline runs into connectivity or backpressure issues and the logs cannot be shipped fast enough, that situation should never impact another pipeline. That could only be achieved by introducing filesystem-based buffering, so that the central tail input reads and dispatches into persistent per-pipeline buffers. Whenever a buffer runs full, data gets evicted from that buffer, but the tail plugin continues to read and dispatch, keeping the other pipelines unaffected.
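
For illustration only, a minimal sketch of that central design in Fluent Bit classic configuration (paths, tags, hosts, and limits are placeholders, not the configuration generated by the operator): one tail input reads all container logs into a filesystem-backed buffer, a rewrite_tag filter re-emits records under per-pipeline tags, and each output keeps its own buffer limit.

```
[SERVICE]
    # node-local storage area for the filesystem buffer
    storage.path              /data/flb-storage/
    storage.backlog.mem_limit 5M

[INPUT]
    Name         tail
    Tag          tele.*
    Path         /var/log/containers/*.log
    DB           /data/flb-state/tail.db
    storage.type filesystem

[FILTER]
    Name  rewrite_tag
    Match tele.*
    # placeholder rule: re-emit each record under a per-pipeline tag
    Rule  $log .* pipeline-a.$TAG false

[OUTPUT]
    Name                     http
    Match                    pipeline-a.*
    Host                     backend-a.example.com
    # per-pipeline filesystem buffer; oldest chunks are dropped when full
    storage.total_limit_size 1G
```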

The problems

  • The log agent introduces routing of log data and gives the impression that this can happen reliably, without big data loss during downtimes or backpressure of the backend. In reality, the buffer is based on the node filesystem and is very limited; under high load, only short outages can be compensated (down to 5 min). As a result, data gets dropped while the actual source of the data still keeps it. By simply pausing the reading of the sources, much longer downtimes could be compensated (up to 1 day).
  • The assumption that people want to route the log data into different backends is not proven. So far, there has been no such use case. The assumption was made because of the legacy Loki stack, which had to be operated in parallel.
  • A buffer based on the node filesystem is dangerous in general, as it can harm the whole node if size limits are not effective.
  • A custom sidecar had to be developed to track the buffer sizes via metrics, so that we can reliably alert on data loss (the regular metrics do not indicate it).
  • The buffer implementation is unstable; the tail buffer runs full under specific conditions and does not recover.
  • Features of the central pipeline can only be configured centrally, not per pipeline. The k8sfilter, for example, cannot be configured individually, and complex logic is needed to control the enrichment with annotations.

The goal
Tear down the assumption that users want to route into different backends. Support multiple pipelines, but let them be coupled, so that if one has backpressure, the others will suffer as well. We see scenarios where multiple pipelines are defined for the same backend only. This brings the big advantage of passing the backpressure back to the source of the logs, the container runtime, without introducing any buffer layer in between.
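
A minimal sketch of the intended behaviour, assuming a plain tail input with only an in-memory buffer (the tag, path, and limit are placeholders): once the configured memory limit is reached because the outputs cannot keep up, Fluent Bit pauses the input, and the log files kept by the container runtime effectively become the buffer until they are rotated away.

```
[INPUT]
    Name          tail
    Tag           pipeline-a.*
    Path          /var/log/containers/*.log
    DB            /data/flb-state/pipeline-a.db
    # in-memory buffering only: when this limit is reached, the input is
    # paused and unread data stays in the runtime-managed log files
    Mem_Buf_Limit 5MB
    storage.type  memory
```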

The challenges

  • detect the pause mode via metrics, so that we know that a pipeline is in trouble (see the sketch after this list)
  • detect data loss caused by source rotation, so that the user can be informed
  • migration: assure that the switch will work out
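
For the first point, a sketch of exposing Fluent Bit's built-in metrics for alerting; the endpoint path and port are the Fluent Bit defaults, and which counter best signals a paused input depends on the Fluent Bit version, so the metric names in the comment are suggestions rather than the final choice.

```
[SERVICE]
    # Prometheus metrics are served under /api/v1/metrics/prometheus;
    # output counters such as fluentbit_output_retries_failed_total and
    # fluentbit_output_dropped_records_total can back a backpressure alert
    http_server     On
    http_listen     0.0.0.0
    http_port       2020
    # optional: adds storage-layer metrics (chunk counts, overlimit state)
    storage.metrics On
```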

Actions

Implementation plan, first chunk

  • Generate a dedicated Fluent Bit pipeline per LogPipeline by generating the tail input and k8sfilter per pipeline (see the configuration sketch after this list)
    • the new tail input already uses an in-memory buffer
    • each tail input uses a dedicated database for offset tracking
    • each pipeline gets its own k8sfilter instance
  • Apply the namespace/container filtering directly to the tail input
  • Remove the rewrite_tag filter from the individual pipelines
  • Remove the buffer settings from the outputs
  • Adjust the documentation about application logs regarding the central pipeline and rewrite_tag usage
  • Document the new metrics and limitations
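
A sketch of what one generated section per LogPipeline could look like under this plan (tags, namespace patterns, and the backend are placeholders, not the actual generated configuration): the tail input filters by namespace through its path globs and tracks offsets in a dedicated database, the kubernetes filter is instantiated per pipeline, and the output carries no filesystem-buffer settings anymore.

```
[INPUT]
    Name          tail
    Tag           pipeline-a.*
    # filtering happens via path globs instead of a rewrite_tag filter
    Path          /var/log/containers/*_my-namespace_*.log
    Exclude_Path  /var/log/containers/*_kube-system_*.log
    # dedicated offset-tracking database per pipeline
    DB            /data/flb-state/pipeline-a.db
    Mem_Buf_Limit 5MB

[FILTER]
    Name      kubernetes
    Match     pipeline-a.*
    Merge_Log On

[OUTPUT]
    Name  http
    Match pipeline-a.*
    Host  backend-a.example.com
    Port  443
    tls   On
    # no storage.total_limit_size or other buffer settings anymore
```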

Leftovers after first chunk

  • Remove the sidecar
  • Configure the skipping of annotations/labels directly on the k8sfilter (see the sketch after this list)
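
The second leftover maps to the existing Annotations and Labels switches of the kubernetes filter; a minimal sketch, assuming one filter instance per pipeline (tag is a placeholder):

```
[FILTER]
    Name        kubernetes
    Match       pipeline-a.*
    # control metadata enrichment directly instead of removing keys later
    Annotations Off
    Labels      On
```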

Criteria

  • As a user, the only difference is a changed behaviour in buffering and data-loss limitations, which I can read about in the documentation
  • As an operator, the operational flow has changed
    • Since I no longer need to run an additional sidecar, only one metrics endpoint (that of Fluent Bit) gets scraped
    • The buffer metrics changed, but I can still have alerts that indicate backpressure; I can read in the user documentation which new metrics should be used instead
  • As an operator, I see an improved Fluent Bit config which
    • configures the annotations/labels feature of the k8sfilter directly
    • has dedicated Fluent Bit pipelines per LogPipeline
  • As an operator, I see a performance gain
    • because namespace/container filtering happens directly on the tail input, the rewrite_tag filter is no longer used
    • no filesystem buffer is in use anymore, and the attached volume is gone
a-thaler added the area/logs LogPipeline label on Oct 23, 2023

a-thaler commented Nov 9, 2023

Regarding the detection of a pause situation, this new feature, introduced with 2.2.0, might be helpful: fluent/fluent-bit#8044


chrkl commented Dec 11, 2023

The Fluent Bit configuration was changed to use a dedicated tail plugin per LogPipeline by #590. However, we decided to keep the file-system buffer as documented by #624.

chrkl closed this as completed on Dec 11, 2023
a-thaler added this to the 1.6.0 milestone on Dec 11, 2023