Cloudflare Logpush Integration not working reliably with S3/SQS #5526
Pinging @elastic/security-external-integrations (Team:Security-External Integrations)
@nathangiuliani, thanks for raising the issue.
Hi @nathangiuliani, to add to @kcreddy's suggestions: can you tell us if you are seeing any errors similar to attempt to use a closed processor or tried to close already closed "processor_name" processor in the logs while using SQS mode? Also, while using S3 mode, have you tried increasing the max number of workers to see if that helps with the processing speed?
Thanks for your responses @ShourieG and @kcreddy.
Unfortunately I may need to sanitise the logs and data before it can be shared. I'll see what I can do.
Yes, see scenario 3 below.
No, just a single agent. While this may help with the final live configuration, I can't see how that would help with such a small sample data set - it is all able to be processed within the sqs visibility timeout. See scenario 2 below.
There doesn't appear to be - I've searched for
I can't find either of those error strings in the logs during the testing period - only one occurrence, from a message that was in the queue from its initial creation, before it was populated with Cloudflare log files.
Understood, will try to sanitise them as above.
No, I can't find this in any of the logs.
Yes, this helps somewhat, but the time delay seems to be after processing of files has completed, as filebeat updates the status of all the files in the bucket.

I've done some more testing today, with a few scenarios. In each case, the S3 bucket was emptied, the SQS queue was purged, and the Elastic Agent was removed and re-installed. In all scenarios, we are using Elasticsearch 8.6.2 deployed on Elastic Cloud, and I'm only enabling the SQS queue mode.

Scenario 1: all log types enabled in the integration, Elastic Agent 8.5.2. Uploading our sample audit log data set with 1850 entries resulted in 53 audit log entries being ingested. Agent logs show SQS messages being received and some S3 files processed. I couldn't spot any errors or warnings. SQS queue stats show messages being received and deleted as expected. I retried this scenario a few times and it's reproducible, but each time a slightly different number of messages are processed; the second time it was 69. I also left this for half an hour after the ingestion of events stopped - nothing else was processed, and the SQS queue was empty.

Scenario 2: only audit logs enabled in the integration, all other log types disabled, Elastic Agent 8.5.2. Uploading our sample audit log data set with 1850 entries resulted in all 1850 audit log entries being ingested.

Scenario 3: all log types enabled in the integration, but max_number_of_messages set to 100 for audit logs, Elastic Agent 8.5.2. Uploading our sample audit log data set with 1850 entries resulted in 548 audit log entries being ingested.

Scenario 4: all log types enabled in the integration, Elastic Agent 8.6.2. Uploading our sample audit log data set with 1850 entries resulted in continual crashes of the aws-s3 component, as per elastic/beats#34219 (comment). After 10 minutes (far longer than any scenario above), 6 audit log entries had been ingested.
@nathangiuliani thanks for sharing the extensive test cases. It seems to me that enabling more than one log type is causing some sort of processor reuse, which might be causing filebeat to drop events. However, the diagnostics/debug logs are required to dig deeper. I have a feeling it is probably due to the issue reported here: elastic/beats#34219 (comment). It's mostly an issue related to beats and not the integration. This bug was introduced in 7.9 and mostly went unnoticed until recently; the underlying root cause is filebeat trying to reuse a closed processor. The solution to this is already merged in the following PRs: elastic/beats#34761 and elastic/beats#34647. These fixes are confirmed for the 8.7 release. However, it would be really helpful if you could share the sanitised logs, as they will help us confirm whether it's indeed being caused by the issue mentioned above.
@ShourieG take a look at support case 01331462. |
@nathangiuliani thanks for the update. Our support engineers will be opening an official SDH where we can continue our investigation. We as dev engineers don't have direct access to support case conversations. |
@nathangiuliani after going through the logs and the config, we saw that all the data streams are using the same SQS queue: https://sqs.ap-southeast-2.amazonaws.com/024370163937/hmb-cloudflare-logs-dev-s3-event-queue. This is most likely what is causing the issue, as the general requirement is for each data stream to have its own separate queue, since currently each data stream uses a separate instance of the aws input. Having separate queues should solve the issue of dropped events in 8.5.2. This is easy to do if there are different bucket paths, as each bucket path can then be routed to a different queue. 8.6.2, on the other hand, has a different bug right now, which should be fixed in 8.7.
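For reference, a minimal sketch of the S3-side setup that comment describes (routing per-prefix object-created events to separate SQS queues), assuming boto3; the bucket name, prefixes, and queue ARNs below are placeholders:

```python
# Sketch only: route per-prefix S3 "object created" events to separate SQS queues,
# so each data stream's integration instance can be given its own queue URL.
# Bucket name, prefixes, and queue ARNs are hypothetical; the queues must also
# have an access policy that allows S3 to send messages to them.
import boto3

s3 = boto3.client("s3")

prefix_to_queue_arn = {
    "audit_logs/": "arn:aws:sqs:ap-southeast-2:123456789012:cloudflare-audit-logs",
    "http_requests/": "arn:aws:sqs:ap-southeast-2:123456789012:cloudflare-http-requests",
}

queue_configurations = [
    {
        "QueueArn": queue_arn,
        "Events": ["s3:ObjectCreated:*"],
        "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}},
    }
    for prefix, queue_arn in prefix_to_queue_arn.items()
]

s3.put_bucket_notification_configuration(
    Bucket="my-cloudflare-logs-bucket",
    NotificationConfiguration={"QueueConfigurations": queue_configurations},
)
```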
I've been debugging essentially the same issue for the past few days. I hadn't realized that it was recommended to have an SQS queue per data stream/type until reading the last comment here. However, I found that having the logpush job push to data-set specific keys within the s3 bucket (e.g. http_requests/) and configuring the …

Edit: also, just to add to that: in the integration UI there's only a single queue URL input field, which would suggest only a single queue should be used?

Edit: also, is there an ETA on 8.7? I'm also seeing the issue reported here: elastic/beats#34219 (comment).
@tomcart90 A similar thing was done in this issue and the problem still cropped up. I guess having a larger number of active data streams can aggravate the issue if you have a single queue. So, in order to have the integration work reliably at scale, separate queues are recommended for now, until an enhancement is made.
I think it's fair to say that in SQS mode it doesn't work at all if you have the streams sharing a single SQS queue. Thinking through what's happening, it makes sense, but the documentation and even UI don't reflect this at all. Essentially we'd need to add several instances of the integration, each with only a single data stream enabled, as the SQS URL is only set per-instance of the integration, not per-data stream. I'll give this configuration a go today and report back. Obviously it's pretty challenging to confirm it's working reliably though, especially given how many files logpush generates.
Regarding your documentation, the wording of this line definitely makes more sense now, but it's still not really clear. My current understanding is:
I've got this going in a test instance now, with a much larger data set (but still only one day's worth for each type other than audit logs...) being ingested:
So far it's looking good. Happy to submit a PR with documentation adjustments once we've confirmed my understanding is correct.
Thanks for the investigation @nathangiuliani, that's saved me having to do the same! Good to know we need a separate instance of the integration per data stream!
Closer, but unfortunately it looks like it's still missing some events :(
I've used some scripting and an aggregation to compare the S3 key names and doc_count against the local file names and line counts for … (a rough sketch of this kind of comparison follows this comment). At one point during the day the agent logged a few errors:
There are some similar ones for metricbeat at the same time, and this one appeared a few times also:
Not sure if it's related?
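A rough sketch of the kind of per-file comparison mentioned above (an Elasticsearch terms aggregation on the S3 object key versus local line counts); the index pattern, field name, credentials, and local directory layout are assumptions, not taken from the thread:

```python
# Sketch only: compare per-object document counts in Elasticsearch against line
# counts of local copies of the same S3 objects. Assumes the aws-s3 input stores
# the object key in aws.s3.object.key and that the local directory mirrors the
# bucket prefix layout, so relative paths equal S3 keys.
import gzip
import pathlib
import requests

ES_URL = "https://my-deployment.es.example.com:9243"  # placeholder endpoint
INDEX = "logs-cloudflare_logpush.audit-*"              # placeholder index pattern
AUTH = ("elastic", "changeme")                         # placeholder credentials

# Terms aggregation: one bucket per S3 object key, with its document count.
query = {
    "size": 0,
    "aggs": {"per_key": {"terms": {"field": "aws.s3.object.key", "size": 10000}}},
}
resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, auth=AUTH)
resp.raise_for_status()
es_counts = {
    bucket["key"]: bucket["doc_count"]
    for bucket in resp.json()["aggregations"]["per_key"]["buckets"]
}

# Count lines in the local (gzipped NDJSON) copies of the same objects.
local_dir = pathlib.Path("./audit_logs")  # placeholder local copy of the prefix
for path in sorted(local_dir.rglob("*.gz")):
    with gzip.open(path, "rt") as fh:
        lines = sum(1 for _ in fh)
    key = str(path.relative_to(local_dir))
    ingested = es_counts.get(key, 0)
    status = "OK" if ingested == lines else "MISSING"
    print(f"{status}\t{key}\tlocal={lines}\tingested={ingested}")
```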
@nathangiuliani the "close of closed channel" panic is a known issue and will be addressed in 8.7. You can ignore the "failed to get filebeat.modules" warning message, as it has no impact on this issue; the two are not related.
So the test I left running after my last post worked - I had a 100% match of the number of documents to lines in the files for all data streams. The only things I did differently in the previous test were changing the max messages in flight settings and adjusting the agent logging level once or twice. I have done a bit more testing and have been able to get it to skip messages by adjusting the log level. It's not completely consistent, but if left alone it seems to be fine. I think I'm ready to just accept that's how it is for now.
@nathangiuliani, awesome to hear that it's working, and thanks a lot for the extensive tests on your end. We are taking all the feedback from this experience and will push updates in the future to make this input more stable and scalable.
@nathangiuliani, out of interest, what settings are you using for max messages in flight? What log level did you settle on?
@ShourieG did you want me to submit a PR with an updated Readme, or is this better handled by your team internally?

@tomcart90 I'm using 50 for each data stream/log type while loading in a large number of existing logs. This has made a huge difference to the speed of ingesting the existing logs. I don't think it's required once everything is up to date, though; I'll have to keep an eye on the SQS queue stats once everything is caught up.

As a general update, our archive of all log streams other than DNS logs and network analytics is loaded into our production Elastic instance, and we are now just ingesting new files as they are created. It's working pretty well, and we see new events within a couple of minutes (e.g. from when the event happened).
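A small sketch of one way to keep an eye on the SQS queue stats mentioned above, assuming boto3; the queue URLs are placeholders:

```python
# Sketch only: print approximate backlog and in-flight counts per data-stream queue,
# which is useful after changing "max messages in flight" style settings.
import boto3

sqs = boto3.client("sqs", region_name="ap-southeast-2")

queue_urls = [
    "https://sqs.ap-southeast-2.amazonaws.com/123456789012/cloudflare-audit-logs",
    "https://sqs.ap-southeast-2.amazonaws.com/123456789012/cloudflare-http-requests",
]

for url in queue_urls:
    attrs = sqs.get_queue_attributes(
        QueueUrl=url,
        AttributeNames=[
            "ApproximateNumberOfMessages",           # backlog waiting to be processed
            "ApproximateNumberOfMessagesNotVisible", # currently in flight
        ],
    )["Attributes"]
    name = url.rsplit("/", 1)[-1]
    print(f"{name}: queued={attrs['ApproximateNumberOfMessages']} "
          f"in_flight={attrs['ApproximateNumberOfMessagesNotVisible']}")
```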
@nathangiuliani We are more than happy to accept community PRs, so you can definitely go ahead and create one and we will merge it after a proper review.
Closing this issue, as it has been resolved for now.
I’m having a hell of a time trying to make use of the Cloudflare Logpush integration. I realise this is currently in beta; however, going by #5033, it appears it is likely close to release.
Our Cloudflare instance is configured to logpush all log types (both zone and account) to a single S3 bucket. This is essentially our log archive. There are over 1.5 million objects in this s3 bucket, and it will grow to significantly more than this.
When using the SQS mode, log entries are missed. In fact, it’s more accurate to say that only a small portion of the log entries are actually ingested.
I can reproduce this fairly easily with a new elastic instance, s3 bucket and sqs queue. Copying all of our audit logs (the smallest data set we have) with the default SQS configuration for the integration resulted in somewhere around 35 of the 1850 entries being processed. Interestingly, disabling all other log types (so that only audit logs are enabled) resulted in a much larger portion of the logs being ingested – around 1000 entries.
The SQS queue stats quite clearly show the messages being sent, received, and ultimately deleted by the elastic agent.
Looking at the agent debug logs, I don’t spot anything obvious other than a bunch of no file_selectors matched messages – but I believe this is due to multiple log types being configured, and each one not being triggered on each poll of the queue.
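As an illustration of the assumed behaviour behind those messages (each data stream's input instance applies its own file_selectors regexes to every S3 key it is notified about, so keys belonging to other log types simply don't match), a hypothetical example; the regexes and keys are made up:

```python
# Illustration only, not the aws-s3 input's actual code: show why a shared queue
# produces "no file_selectors matched" messages for streams whose regex doesn't
# match the notified object key.
import re

# Hypothetical per-data-stream selector regexes.
file_selectors = {
    "audit": re.compile(r"audit_logs/.*\.log\.gz$"),
    "http_request": re.compile(r"http_requests/.*\.log\.gz$"),
}

incoming_keys = [
    "audit_logs/20230301/part-000.log.gz",
    "http_requests/20230301/part-042.log.gz",
]

for stream, selector in file_selectors.items():
    for key in incoming_keys:
        if selector.search(key):
            print(f"{stream}: would process {key}")
        else:
            print(f"{stream}: no file_selectors matched for {key}")
```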
Using the S3 poll method results in all log entries being ingested without issue.
Using S3 polling mode for all log types works, but due to the large number of objects in the bucket, this can lag behind – up to multiple days. We’ve also seen the EC2 instance running the elastic agent completely die due to running out of memory.
I initially thought I had this fixed by rolling back to agent 8.5.2 due to the issue described in elastic/beats#34219 (comment); I saw that stack trace many times while trying to get this to work. Agent 8.5.2 seems to avoid that issue and processes incoming items a lot faster because it is not constantly crashing, but the behaviour is otherwise still the same: in SQS mode most log entries are missed, and in S3 mode it appears unable to cope with the number of items in the bucket.
Any advice you can offer would be appreciated, including what logs you’d like to see. The debug logs seem to include a lot of data, so we may need to look at a more secure avenue to deliver these, or some specific log entries only.