Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cloudflare Logpush Integration not working reliably with S3/SQS #5526

Closed
nathangiuliani opened this issue Mar 14, 2023 · 21 comments
Closed

Cloudflare Logpush Integration not working reliably with S3/SQS #5526

nathangiuliani opened this issue Mar 14, 2023 · 21 comments
Assignees
Labels

Comments

@nathangiuliani
Copy link

I’m having a hell of a time trying to make use of the Cloudflare LogPush integration. I realise this is currently in beta, however going by #5033, it appears this is likely close to release.

Our Cloudflare instance is configured to logpush all log types (both zone and account) to a single S3 bucket. This is essentially our log archive. There are over 1.5 million objects in this s3 bucket, and it will grow to significantly more than this.

When using the SQS mode, log entries are missed. In fact, it’s more accurate to say that only a small portion of the log entries are actually ingested.

I can reproduce this fairly easily with a new elastic instance, s3 bucket and sqs queue. Copying all of our audit logs (the smallest data set we have) with the default SQS configuration for the integration resulted in somewhere around 35 entries being processed – of 1850. Interestingly, disabling all other log types (so that only audit logs is enabled) resulted in a much larger portion of the logs being ingested – around 1000 entries.

The SQS queue stats quite clearly show the messages being sent, received, and ultimately deleted by the elastic agent.

Looking at the agent debug logs, I don’t spot anything obvious other than bunch of no file_selectors matched messages – but I believe this is due to the multiple log types being configured, and each one not being triggered on each poll of the queue.

Using the S3 poll method results in all log entries being ingested without issue.

Using S3 polling mode for all log types works, but due to the large number of objects in the bucket, this can lag behind – up to multiple days. We’ve also seen the EC2 instance running the elastic agent completely die due to running out of memory.

I initially thought I had this issue fixed by rolling back to agent 8.5.2 due to the issue described in elastic/beats#34219 (comment). I saw this stack trace many times while trying to get this to work. Agent 8.5.2 seems to fix this issue and cause the agent to process incoming items a lot faster due to not constantly crashing, but the behaviour is still the same. In SQS mode most log entries are missed, and in S3 mode it appears to be unable to cope with the number of items in the bucket.

Any advice you can offer would be appreciated, including what logs you’d like to see. The debug logs seem to include a lot of data, so we may need to look at a more secure avenue to deliver these, or some specific log entries only.

@elasticmachine
Copy link

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@ShourieG ShourieG self-assigned this Mar 15, 2023
@kcreddy
Copy link
Contributor

kcreddy commented Mar 15, 2023

@nathangiuliani, thanks for raising the issue.
Would it be possible to share/confirm following for further investigation:

  1. Share Agent diagnostics to know versions, policy and configuration.
  2. The sample dataset that was used when not all messages were processed. This can help us reproduce the issue and confirm bug if any.
  3. Could you increase Maximum Concurrent SQS Messages from the UI for Cloudflare integration to a higher value and report any improvement?
  4. Have you tried using Parallel processing of the messages in the queue?
  5. Were there any WARN level logs indicating the messages were just deleted from the queue?
    When a non-retryable error is reached, the input deletes the message from the queue, and doesn't process it as per here
  6. Need debug logs from the agent during this period to understand further if issue cannot be reproducible from sample dataset.

@ShourieG
Copy link
Contributor

ShourieG commented Mar 15, 2023

Hi @nathangiuliani,

To add to @kcreddy 's suggetions can you tell us if you are seeing any errors similar to attempt to use a closed processor or tried to close already closed "processor_name" processor in the logs while using SQS mode ?

Also while using S3 mode have you tried increasing max no. of workers and see if it's helping with the processing speed ?

@ShourieG ShourieG removed their assignment Mar 15, 2023
@nathangiuliani
Copy link
Author

nathangiuliani commented Mar 16, 2023

Thanks for your responses @ShourieG and @kcreddy.

@kcreddy:

Share Agent diagnostics to know versions, policy and configuration.
The sample dataset that was used when not all messages were processed. This can help us reproduce the issue and confirm bug if any.

Unfortunately I may need to sanitise the logs and data before it can be shared. I'll see what I can do.

Could you increase Maximum Concurrent SQS Messages from the UI for Cloudflare integration to a higher value and report any improvement?

Yes, see scenario 3 below.

Have you tried using Parallel processing of the messages in the queue?

No, just a single agent. While this may help with the final live configuration, I can't see how that would help with such a small sample data set - it is all able to be processed within the sqs visibility timeout. See scenario 2 below.

Were there any WARN level logs indicating the messages were just deleted from the queue?

There doesn't appear to be - I've searched for "log.level":"warn" and ("deleted" or "sqs" or "queue"). Let me know if this is not appropriate.

When a non-retryable error is reached, the input deletes the message from the queue, and doesn't process it as per here

I can't find either of those error strings in the logs during the testing period - only one occurrence from a message that was in the queue from initial creation prior to being populated with cloudflare log files.

Need debug logs from the agent during this period to understand further if issue cannot be reproducible from sample dataset.

Understood, will try to sanitise them as above.

@ShourieG:

To add to @kcreddy 's suggestions can you tell us if you are seeing any errors similar to attempt to use a closed processor or tried to close already closed "processor_name" processor in the logs while using SQS mode ?

No, I can't find this in any of the logs.

Also while using S3 mode have you tried increasing max no. of workers and see if it's helping with the processing speed ?

Yes, this helps somewhat, but the time delay seems to be after processing of files has completed, as filebeat updates the status of all the files in the bucket.

I've done some more testing today, with a few scenarios. In each case, the s3 bucket was emptied, sqs queue was purged, and elastic agent was removed and re-installed.

In all scenarios, we are using elasticsearch 8.6.2 deployed on elastic cloud, and I'm only enabling the SQS queue mode.

Scenario 1: all logs types enabled in the integration, elastic agent 8.5.2

Uploading our sample audit log data set with 1850 entries resulted in 53 audit log entries being ingested. Agent logs show SQS messages being received and some S3 files processed. I couldn't spot any errors or warnings. SQS queue stats show messages being received and deleted as expected.

I retried this scenario a few times and it's reproducible, but each time a slightly different number of messages are processed. The second time it was 69.

I also left this for half an hour after the ingestion of events stopped, nothing else was processed, and the sqs queue was empty.

Scenario 2: only audit logs enabled in the integration, all others log types disabled. elastic agent 8.5.2

Uploading our sample audit log data set with 1850 entries resulted in all 1850 audit log entries being ingested.

Scenario 3: all logs types enabled in the integration, but max_number_of_messages set to 100 for audit logs. elastic agent 8.5.2

Uploading our sample audit log data set with 1850 entries resulted in 548 audit log entries being ingested.

Scenario 4: all logs types enabled in the integration, elastic agent 8.6.2

Uploading our sample audit log data set with 1850 entries resulted in continual crashes of the aws-s3 component, as per elastic/beats#34219 (comment). After 10 minutes (far longer than any scenario above), 6 audit log entries had been ingested.

@ShourieG ShourieG self-assigned this Mar 16, 2023
@ShourieG
Copy link
Contributor

@nathangiuliani thanks for sharing the extensive test cases. It seems to me that enabling more than one log type is causing some sort of processor reuse which might be causing filebeat to drop events. However the diagnostics/debug logs are required to dig deeper, but I have a feeling it is probably due to the issue reported here : elastic/beats#34219 (comment) .

It's mostly an issue related to beats and not the integration. This bug was introduced in 7.9 and mostly went unnoticed until recently, the underlying root cause is filebeats trying to reuse a closed processor. The solution to this in already merged in the following PRs : elastic/beats#34761 and elastic/beats#34647. These fixes are confirmed with the release of 8.7.

However it would be really helpful if you could share the sanitised logs, as it will help us confirm if its indeed being caused due to the issue mentioned above.

@nathangiuliani
Copy link
Author

@ShourieG take a look at support case 01331462.

@ShourieG
Copy link
Contributor

ShourieG commented Mar 16, 2023

@nathangiuliani thanks for the update. Our support engineers will be opening an official SDH where we can continue our investigation. We as dev engineers don't have direct access to support case conversations.

@ShourieG
Copy link
Contributor

ShourieG commented Mar 16, 2023

@nathangiuliani after going through the logs and the config, we saw that all the data streams are using the same sqs queue: https://sqs.ap-southeast-2.amazonaws.com/024370163937/hmb-cloudflare-logs-dev-s3-event-queue.

This is most likely what is causing the issue as the general requirement is for each data stream to have its own separate queue, since currently each data stream uses a separate instance of the aws input. Having separate queues should solve the issue of dropped events in 8.5.2. This would be possible easily if there are different bucket paths. Then each bucket path can be routed to different queues. 8.6.2 on the other hand has a different bug right now which should be fixed in 8.7.
Also the docs currently do not mention this clearly, apologies for that. We will be updating the docs soon to have these Notes/pointers.

@tomcart90
Copy link

tomcart90 commented Mar 16, 2023

I've been debugging essentially the same issue for the past few days. I hadn't realized that it was recommended to have an SQS queue per data stream/type until reading the last comment here. However, I found that having the logpush job push to data-set specific keys within the s3 bucket (e.g. http_requests/) and configuring the [SQS] File Selectors option in the agent policy for the integration to point at the specific paths (e.g. - regex: '^http_requests/') solved this problem. Is that a valid alternative @ShourieG?

Edit: also, just to add to that: in the integration UI there's only a single queue URL input field which would suggest only a single queue should be used?

Edit: also, is there an ETA on 8.7? I'm also seeing the issue reported here: elastic/beats#34219 (comment).

@ShourieG
Copy link
Contributor

@tomcart90 A similar thing was done here in this issue and the problem still cropped up. I guess the more number of data streams you have active, can aggrevate the issue if you have a single queue. So in order to have the integration work reliably at scale, currently separate queues are recommended for now until an enhancement is made.

@nathangiuliani
Copy link
Author

I think it's fair to say that in SQS mode it doesn't work at all if you have the streams sharing a single SQS queue. Thinking through what's happening, it makes sense, but the documentation and even UI don't reflect this at all. Essentially we'd need to add several instances of the integration, each with only a single data stream enabled, as the SQS URL is only set per-instance of the integration, not per-data stream.

I'll give this configuration a go today and report back. Obviously it's pretty challenging to confirm it's working reliably though, especially given how many files logpush generates.

@nathangiuliani
Copy link
Author

In regards to your documentation, the wording of this line definitely makes more sense now, but it's still not really clear. My current understanding is:

  • A separate SQS queue is required for each data stream
  • Each data stream also has a corresponding s3 bucket notification, with a filter configured for the matching prefix. Note the documentation does not use the term filter which is how it's presented in the UI, terraform, etc.
  • The above points also means running multiple copies of the integration, each with only a single data stream enabled

I've got this going in a test instance now, with a much larger data set (but still only one days worth for each type other than audit logs...) being ingested:

type              files   lines
----              -----   -----
audit-logs         7856   10374
dns-logs          23049  975339
firewall-events    4902   36710
http-requests     16831 1191184
network-analytics 25700 4738402

So far it's looking good.

Happy to submit a PR with documentation adjustments once we've confirmed my understanding is correct.

@tomcart90
Copy link

Thanks for the investigation @nathangiuliani, that's saved me having to do the same! Good to know we need a separate instance of the integration per data stream!

@nathangiuliani
Copy link
Author

Closer, but unfortunately it looks like it's still missing some events :(

type              files   lines		docs
----              -----   -----		------
audit-logs         7856   10374		10368
dns-logs          23049  975339		975094
firewall-events    4902   36710		36677
http-requests     16831 1191184		1190933
network-analytics 25700 4738402		4737150

I've used some scripting and an aggregation to compare the s3 key names and doc_count vs local file name and line count for audit-logs and firewall-events, and in both cases, 5 files were missed. I haven't bothered with the others, but seeing as it's the end of the working week, I cleared the data streams and bucket and have uploaded the files to the bucket again, and will review again on Monday.

At one point during the day the agent logged a few errors:

12:55:55.301
elastic_agent
[elastic_agent][error] filebeat stderr: "panic: close of closed channel\n\ngoroutine 103 [running]:\ngithub.com/elastic/beats/v7/filebeat/beater.(*Filebeat).Stop(0x4000a1b7a0)\n\tgithub.com/elastic/beats/v7/filebeat/beater/filebeat.go:435 +0x48\ngithub.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).launch.func5()\n\tgithub.com/elastic/beats/v7/libbeat/cmd/instance/beat.go:462 +0x54\n"
12:55:55.302
elastic_agent
[elastic_agent][error] filebeat stderr: "sync.(*Once).doSlow(0x40003e2390?, 0xaaaae84cf5ff?)\n\tsync/once.go:68 +0x10c\nsync.(*Once).Do(...)\n\tsync/once.go:59\ngithub.com/elastic/elastic-agent-libs/service.HandleSignals.func1()\n\tgithub.com/elastic/elastic-agent-libs@v0.2.11/service/service.go:60 +0x168\ncreated by github.com/elastic/elastic-agent-libs/service.HandleSignals\n\tgithub.com/elastic/elastic-agent-libs@v0.2.11/service/service.go:49 +0x168\n"

there's some similar ones for metricbeat at the same time.

and this one a few times also

13:34:23.098
elastic_agent.filebeat
[elastic_agent.filebeat][warn] failed to get 'filebeat.modules' from config: key not found. Continuing to next one

not sure if related?

@ShourieG
Copy link
Contributor

@nathangiuliani the panic "closed of closed channel" error is a known issue and will be addressed with 8.7.

You can ignore the warning "failed to get filebeat.modules" message, as that has no impact on this issue, also these two are not related.

@nathangiuliani
Copy link
Author

So the test I left running after my last post worked - I had a 100% match of number of documents to lines in the files for all data streams.

The only things I did differently for the previous test was to change the max messages in flight settings, and adjust the agent logging level once or twice.

Have done a bit more testing and have been able to get it to skip messages by adjusting the log level. It's not completely consistent, but if left alone it seems to be fine.

I think I’m ready to just accept that’s how it is for now.

@ShourieG
Copy link
Contributor

@nathangiuliani, Awesome to hear that its working, also thanks a lot for the extensive tests on your end. We are taking all the feedback from this experience and will push updates in the future to make this input more stable and scalable.

@tomcart90
Copy link

@nathangiuliani, out of interest what settings are you using for max messages in flight? What log level did you settle on?

@andrewkroh andrewkroh added Integration:cloudflare_logpush Cloudflare Logpush and removed Integration:cloudflare Cloudflare labels Mar 20, 2023
@nathangiuliani
Copy link
Author

@ShourieG did you want me to submit a PR with an updated Readme, or is this better handled by your team internally?

@tomcart90 I'm using 50 for each data stream/log type while loading in a large number of existing logs. This has made a huge difference to the speed of ingesting the existing logs. I don't think it's required when everything is up to date though, I'll have to keep an eye on the SQS queue stats once everything is up to date.

As a general update, our archive of all log streams other than dns logs and network analytics are loaded into our production elastic instance, and are now just ingesting new files as they are created. It's working pretty well, and we see new events within a couple of minutes (e.g. from when the event happened).

@ShourieG
Copy link
Contributor

@nathangiuliani We are more than happy to accept community PRs, so you can definitely go ahead and create one and we will merge it after a proper review.

@ShourieG
Copy link
Contributor

Closing this issue as the issue has been resolved for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants