add support for queue settings under outputs #36693
Conversation
This pull request does not have a backport label. To fix this, you need to add the backport labels for the needed branches.
Force-pushed from 55975d4 to 1b4bd24.
Could we make it an error to set both `queue` and `output.*.queue`?

added validation, see what you think.
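For illustration, a minimal sketch of the kind of check being asked for here, written against config.C from elastic-agent-libs; the helper name and how it would be wired into the Beat's config loading are assumptions, not the PR's actual code:

package queueconflict

import (
	"errors"

	"github.com/elastic/elastic-agent-libs/config"
)

// validateQueueSource is a hypothetical helper, not the PR's actual code: it
// rejects configurations that define queue settings both at the top level and
// under the output, since only one source of queue settings is allowed.
func validateQueueSource(topLevelQueue, outputQueue *config.C) error {
	if topLevelQueue != nil && outputQueue != nil {
		return errors.New("queue is configured both at the top level and under the output; only one is allowed")
	}
	return nil
}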
libbeat/docs/queueconfig.asciidoc
Outdated
You can configure the type and behavior of the internal queue by
setting options in the `queue` section of the +{beatname_lc}.yml+
config file or by setting options in the `queue` section of the
output. Only one queue type can be configured. If both the top level
If both the top level queue section and the output section are specified, the output section takes precedence.

This is no longer true now that it will fail validation.
queue:
  mem:
    events: 8096
I assume the disk queue can also be enabled this way? Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that, if a Beat is managed by agent, the disk queue is disabled?
I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
I could be convinced otherwise.
I assume the disk queue can also be enabled this way?
Yep
Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that, if a Beat is managed by agent, the disk queue is disabled?
I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
I could be convinced otherwise.
I'll code up a check and we can evaluate. My personal preference is not for Beats to have different features when run under agent. It makes it harder to debug if the same config does different things when run under agent or not. But given that users can input any yaml, this might be acceptable.
The code for the check is small, but there will be work in ensuring the error message shows up in an obvious way in the elastic agent status output and in fleet.
I think I am fine with allowing the disk queue, but considering it unsupported and just not documenting it until we do some performance testing. This also allows us to retroactively declare we support it once the testing is done or the work is done to expose the configurations in fleet.
Take a look at checkAgentDiskQueue in beat.go.
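For context, a rough standalone sketch of what such a check could look like; the function name, signature, and field layout here are assumptions and may differ from the checkAgentDiskQueue added in this PR:

package diskqueuecheck

import (
	"errors"
	"fmt"

	"github.com/elastic/elastic-agent-libs/config"
)

// checkDiskQueueUnderAgent is a hypothetical sketch: when the Beat is managed
// by Elastic Agent, reject any queue configuration that enables the disk queue.
func checkDiskQueueUnderAgent(underAgent bool, queueCfg *config.C) error {
	if !underAgent || queueCfg == nil {
		return nil
	}
	// Unpack only the part we care about: whether a disk section is present.
	var q struct {
		Disk *config.C `config:"disk"`
	}
	if err := queueCfg.Unpack(&q); err != nil {
		return fmt.Errorf("parsing queue settings: %w", err)
	}
	if q.Disk != nil {
		return errors.New("the disk queue is not supported when the Beat is managed by Elastic Agent")
	}
	return nil
}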
We could enable the disk queue as a tech preview or beta and perform the testing in a later sprint. For the most part the functionality has been in Beats for a while.
Actually I've changed my mind on this: the disk queue needs changes in Elastic Agent to work properly, so we shouldn't allow using it.
Agent needs to create a unique disk queue path per running Beat, or all of them will try to share the same queue, which would lead to data loss or duplication. We also need to put the disk queue outside of the agent data directory by default so it doesn't have to be copied between upgrades. This isn't as simple as allowing us to turn it on in the configuration.
I will create a follow-up issue specifically for allowing this. For now let's forbid using the disk queue under agent to avoid allowing configurations that don't work properly.
See elastic/elastic-agent#3490 for the complications involved in enabling the disk queue for the Elastic Agent.
Thanks for looking into this, Craig.
Force-pushed from d61263d to 61c2d28.
libbeat/cmd/instance/beat.go
Outdated
// checkAgentDiskQueue should be run after management.NewManager() so
// that publisher.UnderAgent will be set with correct value
func checkAgentDiskQueue(bc *beatConfig) error {
I tested this and it doesn't seem to work:
sudo elastic-agent inspect
...
outputs:
  default:
    api_key: sfga4ooBeuYcqqQuT4_b:h4h_6lRaTEqN48FSOwST-A
    hosts:
      - https://60a01e9179764ca0b9c2e2fbf47d6d67.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
path:
  config: /Library/Elastic/Agent
  data: /Library/Elastic/Agent/data
  home: /Library/Elastic/Agent/data/elastic-agent-123ba9
  logs: /Library/Elastic/Agent
sudo elastic-agent status
┌─ fleet
│ └─ status: (HEALTHY) Connected
└─ elastic-agent
└─ status: (HEALTHY) Running
Steps to reproduce:
- Build metricbeat and filebeat from this branch. I used cd x-pack/metricbeat && mage build.
- Build or download an 8.11 snapshot agent.
- Copy the built metricbeat and filebeat into the data/elastic-agent-$hash/components directory of the agent: cp ~/go/src/github.com/elastic/beats/x-pack/metricbeat/metricbeat /Users/cmackenzie/Downloads/elastic-agent-8.11.0-SNAPSHOT-darwin-aarch64/data/elastic-agent-123ba9/components
- Install the agent. I enrolled it with Fleet but you could test this with a standalone agent too.
- Configure the disk queue in the output as shown above.
- Expect to see an error but see nothing.
I also saw no obvious logs about the queue type being configured in the logs or metrics. I suspect the memory queue configuration might be getting ignored as well.
I did confirm the running version of the beat was the most recent commit from this branch.
For Fleet I had the following in the advanced yaml parameters box of the output, with the system integration installed with logs and monitoring enabled.
queue.disk:
  max_size: 10GB
This is what I see in the beat-rendered-config.yml for the log-default component:
outputs:
  elasticsearch:
    api_key: <REDACTED>
    bulk_max_size: 50
    hosts:
      - https://<REDACTED>.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
Something on the agent output reload path might not be reloading the queue settings:
beats/x-pack/libbeat/management/managerV2.go
Lines 689 to 726 in 61c2d28
func (cm *BeatV2Manager) reloadOutput(unit *client.Unit) (bool, error) {
    // Assuming that the output reloadable isn't a list, see createBeater() in cmd/instance/beat.go
    output := cm.registry.GetReloadableOutput()
    if output == nil {
        return false, fmt.Errorf("failed to find beat reloadable type 'output'")
    }

    if unit == nil {
        // output is being stopped
        err := output.Reload(nil)
        if err != nil {
            return false, fmt.Errorf("failed to reload output: %w", err)
        }
        cm.lastOutputCfg = nil
        cm.lastBeatOutputCfg = nil
        return false, nil
    }

    expected := unit.Expected()
    if expected.Config == nil {
        // should not happen; hard stop
        return false, fmt.Errorf("output unit has no config")
    }

    if cm.lastOutputCfg != nil && gproto.Equal(cm.lastOutputCfg, expected.Config) {
        // configuration for the output did not change; do nothing
        cm.logger.Debug("Skipped reloading output; configuration didn't change")
        return false, nil
    }

    cm.logger.Debugf("Got output unit config '%s'", expected.Config.GetId())

    if cm.stopOnOutputReload && cm.lastOutputCfg != nil {
        cm.logger.Info("beat is restarting because output changed")
        _ = unit.UpdateState(client.UnitStateStopping, "Restarting", nil)
        cm.Stop()
        return true, nil
    }
I changed it to be a validation check, which should happen whenever the config is unpacked. So that should fix the disk queue problem.
The reload may be more interesting. I think this will mean every output change requires a full restart, since you need the queue settings earlier to set up the pipeline.
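A hedged sketch of that mechanism, relying on go-ucfg's convention of calling a Validate method when a struct is unpacked from the config; the type, field, and variable names below are illustrative, not the PR's actual code:

package queuevalidate

import (
	"errors"

	"github.com/elastic/elastic-agent-libs/config"
)

// underAgent stands in for however the Beat records that it is agent-managed.
var underAgent bool

// queueSettings is a hypothetical stand-in for the unpacked queue section.
type queueSettings struct {
	Mem  *config.C `config:"mem"`
	Disk *config.C `config:"disk"`
}

// Validate is called by go-ucfg when the struct is unpacked, so this check
// runs every time a configuration carrying queue settings is (re)unpacked.
func (q *queueSettings) Validate() error {
	if underAgent && q.Disk != nil {
		return errors.New("the disk queue is not supported when running under Elastic Agent")
	}
	return nil
}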
We already restart the Beats when the output configuration changes; it looks optional in the code, but it is enabled in all the Beat spec files: https://github.com/elastic/elastic-agent/blob/123ba9ce80c9865f72fa3659b5cafe9b51954f49/specs/filebeat.spec.yml#L32-L33
- "-E"
- "management.restart_on_output_change=true"
We also need restarts whenever any of the TLS configuration changes, for example.
- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
Force-pushed from 61c2d28 to 476780e.
Quick note on why this is working with a standalone Beat but not elastic-agent: for elastic-agent, the Beat is started with a blank config, and the elastic-agent then sends the output over the control protocol, which is handled by the manager. Right now, the manager's output reloader is created from a publish.OutputReloader, and we don't have a publish object until we create a new pipeline, which requires the queue settings. So getting this to work with elastic-agent will require some more re-working of how the output reload works and its relationship to a pipeline.
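To make that ordering concrete, here is a heavily simplified sketch with hypothetical types (not libbeat's actual API): the pipeline is built from the queue settings at startup, and only the built pipeline can hand the manager an output reloader, so queue settings arriving later under the output can't take effect without a rebuild.

package reloadsketch

import (
	"errors"

	"github.com/elastic/elastic-agent-libs/config"
)

// outputReloader stands in for the reloadable object the manager registers.
type outputReloader interface {
	Reload(cfg *config.C) error
}

// fakePipeline stands in for libbeat's publisher pipeline: it is constructed
// from the queue settings and is the only thing that can provide a reloader.
type fakePipeline struct {
	queueCfg *config.C
}

// newFakePipeline mirrors the ordering problem: the queue settings are needed
// here, at startup, before any output unit (and any queue settings nested
// under it) has arrived over the control protocol.
func newFakePipeline(queueCfg *config.C) (*fakePipeline, outputReloader) {
	p := &fakePipeline{queueCfg: queueCfg}
	return p, p
}

// Reload receives the output config later, but by then the queue has already
// been built from the blank startup config, so queue settings nested under the
// output cannot be applied without rebuilding the pipeline (in practice, by
// restarting the Beat).
func (p *fakePipeline) Reload(cfg *config.C) error {
	return errors.New("queue already constructed; restart required to apply queue settings from the output")
}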
Proposed commit message
- add support for `idle_connection_timeout` for ES output. This allows connections to be closed if they aren't being used.
- add support for queue settings under output. Validation ensures only the top level or the output level is specified.
Checklist
- Added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Author's Checklist
How to test this PR locally
Start Filebeat with the following config file:
It should show queue.max_events in the metrics to be 1024, and connections to Elasticsearch should only stay open for 3 seconds. Connection status can be checked with netstat -an.
Related issues