
add support for queue settings under outputs #36693

Closed · wants to merge 7 commits

Conversation

@leehinman (Contributor) commented Sep 27, 2023

Proposed commit message

  • Add support for idle_connection_timeout for the ES output. This allows connections to be closed if they aren't being used.

  • Add support for queue settings under the output. Validation ensures the queue is specified at only the top level or the output level, not both (see the illustrative example below).
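
For illustration (this example is not from the PR itself), a config like the following sets the queue in both places and should now fail validation:

# Invalid: the queue is configured both at the top level and under the
# output. With this change, validation rejects such a config.
queue.mem:
  events: 4096

output.elasticsearch:
  hosts: ["localhost:9200"]
  queue.mem:
    events: 1024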

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Start Filebeat with the following config file:

output.elasticsearch:
  idle_connection_timeout: 3s
  queue.mem:
    events: 1024
    flush.min_events: 2
    flush.timeout: 15s

The metrics should show queue.max_events as 1024, and connections to Elasticsearch should stay open for only 3 seconds. Connection status can be checked with netstat -an.
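
If it helps, the queue metrics can also be checked over the Beat's HTTP monitoring endpoint by adding the following to the same config file (a suggestion for convenience, not part of the PR's test steps):

# Exposes Beat metrics over HTTP; with the defaults, the stats
# (including the queue values) can be fetched from
# http://localhost:5066/stats
http.enabled: true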

Related issues

  • Closes elastic#35615

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 27, 2023
@mergify bot (Contributor) commented Sep 27, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @leehinman? 🙏.
To do so, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch, where /d is the digit

@leehinman leehinman added enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Sep 27, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 27, 2023
@leehinman leehinman force-pushed the 35615_queue_under_output branch from 55975d4 to 1b4bd24 Compare September 27, 2023 18:32
@elasticmachine (Collaborator) commented Sep 27, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-09-29T18:54:30.727+0000

  • Duration: 134 min 30 sec

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 28382
  • Skipped: 2013
  • Total: 30395

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@leehinman leehinman marked this pull request as ready for review September 27, 2023 21:52
@leehinman leehinman requested review from a team as code owners September 27, 2023 21:52
@andrewkroh (Member) left a comment

Could we make it an error to set both ^queue and output.*.queue?

Review threads (outdated, resolved): libbeat/docs/queueconfig.asciidoc, libbeat/cmd/instance/beat.go
@jlind23 jlind23 requested a review from cmacknz September 28, 2023 06:19
@leehinman (Contributor Author)

> Could we make it an error to set both ^queue and output.*.queue?

Added validation, see what you think; a rough sketch of the idea is below.
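
The diff itself isn't quoted in this thread; the following is a simplified sketch of what such validation can look like (the types, field names, and error text are illustrative stand-ins, not the PR's actual code):

package config

import "fmt"

// queueConfig is a simplified stand-in for libbeat's queue settings.
type queueConfig struct {
    Mem  map[string]interface{}
    Disk map[string]interface{}
}

func (q *queueConfig) isSet() bool {
    return q != nil && (len(q.Mem) > 0 || len(q.Disk) > 0)
}

// beatConfig is a simplified stand-in holding the top-level queue plus
// the queue section found under each configured output.
type beatConfig struct {
    Queue        *queueConfig
    OutputQueues map[string]*queueConfig
}

// Validate rejects configs that set the queue both at the top level and
// under an output, which is the behavior added by this PR.
func (c *beatConfig) Validate() error {
    if !c.Queue.isSet() {
        return nil
    }
    for name, q := range c.OutputQueues {
        if q.isSet() {
            return fmt.Errorf("queue is configured at the top level and under output.%s; only one may be specified", name)
        }
    }
    return nil
}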

You can configure the type and behavior of the internal queue by
setting options in the `queue` section of the +{beatname_lc}.yml+
config file or by setting options in the `queue` section of the
output. Only one queue type can be configured. If both the top level
Member

> If both the top level queue section and the output section are specified the output section
> takes precedence.

This is no longer true now that it will fail validation.

Comment on lines +308 to +310
queue:
  mem:
    events: 8096
Member

I assume the disk queue can also be enabled this way? Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that if a Beat is managed by agent the disk queue is disabled.

I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.

I could be convinced otherwise.

Contributor Author

> I assume the disk queue can also be enabled this way?

Yep

> Do we want to enable configuring the disk queue for agent? Or should we have an explicit check that if a Beat is managed by agent the disk queue is disabled.
>
> I lean towards the latter because we haven't planned any testing effort around the disk queue in agent yet, and the goal of this implementation was mostly to allow performance tuning of the memory queue settings in the field.
>
> I could be convinced otherwise.

I'll code up a check and we can evaluate. My personal preference is not for Beats to have different features when run under agent. It makes it harder to debug if the same config does different things when run under agent or not. But given that users can input any YAML, this might be acceptable.

Member

The code for the check is small, but there will be work in ensuring the error message shows up in an obvious way in the elastic-agent status output and in Fleet.

I think I am fine with allowing the disk queue, but considering it unsupported and just not documenting it until we do some performance testing. This also allows us to retroactively declare we support it once the testing is done or the work is done to expose the configurations in Fleet.

Contributor Author

Take a look at checkAgentDiskQueue in beat.go.
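
The function isn't quoted in full in this thread; a minimal sketch of the guard, with simplified parameters standing in for the real beatConfig and publisher.UnderAgent plumbing, might look like this:

package instance

import "errors"

// checkAgentDiskQueueSketch is an illustrative stand-in for the real
// checkAgentDiskQueue in libbeat/cmd/instance/beat.go: when the Beat is
// managed by Elastic Agent, a configured disk queue is rejected.
func checkAgentDiskQueueSketch(underAgent bool, queueType string) error {
    if underAgent && queueType == "disk" {
        return errors.New("disk queue is not supported when running under Elastic Agent")
    }
    return nil
}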

Contributor

We could enable the disk queue as a tech preview or beta and perform the testing in a later sprint. For the most part, the functionality has been in Beats for a while.

Member

Actually, I've changed my mind on this: the disk queue needs changes in Elastic Agent to work properly, so we shouldn't allow using it.

Agent needs to create a unique disk queue path per running Beat, or all of them will try to share the same queue which would lead to data loss or duplication. We also need to put the disk queue outside of the agent data directory by default so it doesn't have to be copied between upgrades. This isn't as simple as allowing us to turn it on in the configuration.

I will create a follow-up issue specifically for allowing this. For now, let's forbid using the disk queue under agent to avoid allowing configurations that don't work properly.

Member

See elastic/elastic-agent#3490 for the complications involved in enabling the disk queue for the Elastic Agent.

Contributor

Thanks for looking into this, Craig.


// checkAgentDiskQueue should be run after management.NewManager() so
// that publisher.UnderAgent will be set with the correct value
func checkAgentDiskQueue(bc *beatConfig) error {
Member

I tested this and it doesn't seem to work:

sudo elastic-agent inspect
...
outputs:
  default:
    api_key: sfga4ooBeuYcqqQuT4_b:h4h_6lRaTEqN48FSOwST-A
    hosts:
    - https://60a01e9179764ca0b9c2e2fbf47d6d67.eastus2.staging.azure.foundit.no:443
    queue:
      disk:
        max_size: 10GB
    type: elasticsearch
path:
  config: /Library/Elastic/Agent
  data: /Library/Elastic/Agent/data
  home: /Library/Elastic/Agent/data/elastic-agent-123ba9
  logs: /Library/Elastic/Agent
sudo elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   └─ status: (HEALTHY) Running

Steps to reproduce:

  1. Build metricbeat and filebeat from this branch. I used cd x-pack/metricbeat && mage build.
  2. Build or download an 8.11 snapshot agent.
  3. Copy the built metricbeat and filebeat into the data/elastic-agent-$hash/components directory of the agent: cp ~/go/src/github.com/elastic/beats/x-pack/metricbeat/metricbeat /Users/cmackenzie/Downloads/elastic-agent-8.11.0-SNAPSHOT-darwin-aarch64/data/elastic-agent-123ba9/components
  4. Install the agent. I enrolled it with Fleet but you could test this with a standalone agent too.
  5. Configure the disk queue in the output as shown above.
  6. Expect to see an error but see nothing.

I also saw nothing obvious about the queue type being configured in the logs or metrics. I suspect the memory queue configuration might be getting ignored as well.

I did confirm the running version of the Beat was built from the most recent commit on this branch.

Member

For Fleet, I had the following in the advanced YAML parameters box of the output, with the system integration installed and its logs and monitoring enabled.

queue.disk:
  max_size: 10GB

Member

This is what I see in the beat-rendered-config.yml for the log-default component:

outputs:
    elasticsearch:
        api_key: <REDACTED>
        bulk_max_size: 50
        hosts:
            - https://<REDACTED>.eastus2.staging.azure.foundit.no:443
        queue:
            disk:
                max_size: 10GB
        type: elasticsearch

Member

Something on the agent output reload path might not be reloading the queue settings:

func (cm *BeatV2Manager) reloadOutput(unit *client.Unit) (bool, error) {
    // Assuming that the output reloadable isn't a list, see createBeater() in cmd/instance/beat.go
    output := cm.registry.GetReloadableOutput()
    if output == nil {
        return false, fmt.Errorf("failed to find beat reloadable type 'output'")
    }

    if unit == nil {
        // output is being stopped
        err := output.Reload(nil)
        if err != nil {
            return false, fmt.Errorf("failed to reload output: %w", err)
        }
        cm.lastOutputCfg = nil
        cm.lastBeatOutputCfg = nil
        return false, nil
    }

    expected := unit.Expected()
    if expected.Config == nil {
        // should not happen; hard stop
        return false, fmt.Errorf("output unit has no config")
    }

    if cm.lastOutputCfg != nil && gproto.Equal(cm.lastOutputCfg, expected.Config) {
        // configuration for the output did not change; do nothing
        cm.logger.Debug("Skipped reloading output; configuration didn't change")
        return false, nil
    }

    cm.logger.Debugf("Got output unit config '%s'", expected.Config.GetId())

    if cm.stopOnOutputReload && cm.lastOutputCfg != nil {
        cm.logger.Info("beat is restarting because output changed")
        _ = unit.UpdateState(client.UnitStateStopping, "Restarting", nil)
        cm.Stop()
        return true, nil
    }

Contributor Author

I changed it to be a validation check, which happens whenever the config is unpacked, so that should fix the disk queue problem.

The reload may be more interesting. I think this will mean every output change requires a full restart, since you need the queue settings earlier to set up the pipeline.

Member

We already restart the Beats when the output configuration changes. It looks optional in the code, but it is enabled in all the Beat spec files: https://github.com/elastic/elastic-agent/blob/123ba9ce80c9865f72fa3659b5cafe9b51954f49/specs/filebeat.spec.yml#L32-L33

        - "-E"
        - "management.restart_on_output_change=true"

We also need restarts whenever any of the TLS configuration changes for example.

- add support for `idle_connection_timeout` for ES output
- add support for queue settings under output

Closes elastic#35615
@leehinman leehinman force-pushed the 35615_queue_under_output branch from 61c2d28 to 476780e Compare September 29, 2023 18:54
@leehinman (Contributor Author)

A quick note on why this works with a standalone Beat but not under elastic-agent:

Under elastic-agent, the Beat is started with a blank config. That means that when promoteOutputQueueSettings runs, there aren't any outputs defined.

elastic-agent sends the output over the control protocol, which is handled by the manager. Right now, the manager's output reloader is created from a publish.OutputReloader, and we don't have a publish object until we create a new pipeline, which requires the queue settings. So getting this to work with elastic-agent will require some more reworking of how the output reload works and its relationship to a pipeline.
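
To make the ordering problem concrete, here is an illustrative sketch of the promotion step (simplified types, not the PR's actual code). With a blank config there are no outputs, so nothing is promoted; the real output arrives later over the control protocol, after the pipeline and its queue already exist:

package config

// Simplified stand-ins for the real config types.
type outputSettings map[string]interface{}

type startupConfig struct {
    Queue   map[string]interface{}    // top-level queue settings
    Outputs map[string]outputSettings // output name -> settings
}

// promoteOutputQueueSettingsSketch copies a queue section found under an
// output up to the top level so the pipeline (and its queue) can be built
// from it. Under elastic-agent the Beat starts with a blank config, so
// Outputs is empty at this point and nothing is promoted.
func promoteOutputQueueSettingsSketch(c *startupConfig) {
    for _, out := range c.Outputs {
        if q, ok := out["queue"].(map[string]interface{}); ok {
            c.Queue = q
            delete(out, "queue")
        }
    }
}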

Labels
docs enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Development

Successfully merging this pull request may close these issues.

Allow queue configuration to be specified under the output type
5 participants