Make UDP-receiver/operator asynchronous & concurrent #27613
Created PR - #27620
Component(s)
pkg/stanza, receiver/udplog
Is your feature request related to a problem? Please describe.
TL;DR: In high-scale scenarios, the UDP receiver has bursts of data loss because it works synchronously on a single thread.
In high-scale scenarios, it's easy to lose UDP packets. If the receiving side slows down even for a very short time (for example, the otel-exporter sends data to an endpoint that briefly takes longer to respond), the sender gets no indication of it (unlike with TCP) and keeps sending data at the same rate.
During that time, the receiver's network buffer fills up and data is lost. A short burst of more data than usual causes data loss for the same reason.
This happens in part because the current UDP receiver works synchronously: if the exporter slows down even briefly in a high-scale scenario, data is lost.
Describe the solution you'd like
The UDP-receiver (more accurately, the udp input operator in stanza [stanza\operator\input\udp]) needs to process logs in an asynchronous manner to reduce data-loss and increase processing rate. That's important for high-rate scenarios.
Code is already ready for PR, btw.
Our stress tests indicate that changing the UDP stanza input operator to use 2 goroutines solved continuous data-loss issues (not to mention increasing the processing rate of the otel collector).
1. 1st goroutine ('reader') only reads from UDP and puts the data into a channel - no processing is done there at all (including splitting, adding attributes, etc.).
2. 2nd goroutine ('processor') reads from that channel, performs the processing offered by the UDP operator, and pushes into the next otel step (in our case, it would be a batch processor).
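The two-goroutine split described above could look roughly like the sketch below. It is not the code from the PR: `runPipeline`, `demoPipeline`, the 64 KiB read buffer and the queue size are all assumptions made for illustration. The source is taken as an `io.Reader` so the example can run against an in-memory pipe; a `*net.UDPConn` also satisfies `io.Reader`, with each `Read` returning one datagram.

```go
package main

import (
	"io"
	"sync"
)

// runPipeline is a hypothetical helper, not the stanza operator's API.
// It returns once src reports an error (for example after being closed)
// and every packet already queued has been passed to handle.
func runPipeline(src io.Reader, queueSize int, handle func([]byte)) {
	packets := make(chan []byte, queueSize)
	var wg sync.WaitGroup
	wg.Add(2)

	// Reader: only reads and enqueues; no splitting or attribute work.
	go func() {
		defer wg.Done()
		defer close(packets)
		buf := make([]byte, 64*1024) // assumed maximum datagram size
		for {
			n, err := src.Read(buf)
			if n > 0 {
				pkt := make([]byte, n)
				copy(pkt, buf[:n]) // buf is reused, so queue a copy
				packets <- pkt
			}
			if err != nil {
				return
			}
		}
	}()

	// Processor: does the per-packet work and hands results downstream.
	go func() {
		defer wg.Done()
		for pkt := range packets {
			handle(pkt)
		}
	}()

	wg.Wait()
}

// demoPipeline feeds three packets through an in-memory pipe that stands
// in for the UDP socket and returns what the processor received.
func demoPipeline() []string {
	pr, pw := io.Pipe()
	var got []string
	done := make(chan struct{})
	go func() {
		runPipeline(pr, 4, func(p []byte) { got = append(got, string(p)) })
		close(done)
	}()
	for _, s := range []string{"a", "b", "c"} {
		pw.Write([]byte(s))
	}
	pw.Close() // the reader sees io.EOF and stops
	<-done
	return got
}
```

The buffered channel is what absorbs short downstream slowdowns: while the processor is blocked, the reader can keep draining the socket until `queueSize` packets are waiting.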
It's better to add concurrency to the mix (for example, allow the 'processor' to run with 5 goroutines), since our tests indicate it improved the processing rate even further. The internal processing in the UDP receiver may be somewhat expensive, since it involves splitting and adding attributes. It might help some consumers to have multiple such 'processor' routines that work concurrently before sending the data downstream.
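A minimal sketch of running several 'processor' goroutines against the same channel follows. `processConcurrently` and `countProcessed` are hypothetical names, and the worker count would presumably come from configuration.

```go
package main

import "sync"

// processConcurrently is a hypothetical helper: it starts `workers`
// goroutines that all drain the same channel, and returns after the
// channel is closed and every worker has finished.
func processConcurrently(packets <-chan []byte, workers int, handle func([]byte)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pkt := range packets {
				handle(pkt) // splitting, attributes, push downstream
			}
		}()
	}
	wg.Wait()
}

// countProcessed queues n packets, processes them with the given number
// of workers, and returns how many packets were handled.
func countProcessed(n, workers int) int {
	packets := make(chan []byte, n)
	for i := 0; i < n; i++ {
		packets <- []byte{byte(i)}
	}
	close(packets)

	var mu sync.Mutex
	handled := 0
	processConcurrently(packets, workers, func([]byte) {
		mu.Lock() // handle runs on several goroutines at once
		handled++
		mu.Unlock()
	})
	return handled
}
```

One consequence worth noting: with more than one worker, packets can reach the next step in a different order than they arrived, so this only suits consumers that do not depend on ordering.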
This would require a graceful shutdown mechanism that allows the receiver to finish handling the items already read and pushed to the channel, so they can be pushed downstream during shutdown (while stopping the 'reader' routine from reading more items from the UDP port).
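One way the shutdown sequence could be arranged is sketched below, assuming the reader is stopped by closing its source and the processor then drains whatever is left in the channel. The `pipeline` type, `startPipeline`, `Stop` and `demoShutdown` are illustrative names only, not the operator's real API.

```go
package main

import (
	"io"
	"sync"
)

// pipeline is a hypothetical type for this sketch.
type pipeline struct {
	src     io.ReadCloser
	packets chan []byte
	wg      sync.WaitGroup
}

func startPipeline(src io.ReadCloser, queueSize int, handle func([]byte)) *pipeline {
	p := &pipeline{src: src, packets: make(chan []byte, queueSize)}
	p.wg.Add(2)
	go func() { // reader: stops on the first read error
		defer p.wg.Done()
		defer close(p.packets) // tells the processor nothing more will come
		buf := make([]byte, 64*1024)
		for {
			n, err := p.src.Read(buf)
			if n > 0 {
				pkt := make([]byte, n)
				copy(pkt, buf[:n])
				p.packets <- pkt
			}
			if err != nil {
				return
			}
		}
	}()
	go func() { // processor: keeps going until the channel is empty and closed
		defer p.wg.Done()
		for pkt := range p.packets {
			handle(pkt)
		}
	}()
	return p
}

// Stop closes the source so no new packets are read, then waits until
// every packet already queued has been handled.
func (p *pipeline) Stop() error {
	err := p.src.Close()
	p.wg.Wait()
	return err
}

// demoShutdown queues three packets while the processor is held back,
// then stops the pipeline and returns how many packets were handled.
func demoShutdown() int {
	pr, pw := io.Pipe() // stands in for the UDP socket
	gate := make(chan struct{})
	handled := 0
	p := startPipeline(pr, 8, func([]byte) {
		<-gate // simulate a slow downstream step
		handled++
	})
	for _, s := range []string{"a", "b", "c"} {
		pw.Write([]byte(s))
	}
	close(gate) // downstream recovers
	p.Stop()
	return handled
}
```

Closing the source is what unblocks a pending `Read`; a `*net.UDPConn` behaves the same way when `Close` is called on it. A real implementation would probably also bound how long `Stop` may wait.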
The suggested feature allows the customer to "pay" with available memory (which can be much bigger than the max size you can set the network buffer to be) to reduce the risk of data loss due to these issues. Of course, this won't help if our otel collector can handle X EPS but consistently gets 1.1X EPS. The intention is only to prevent data loss in scenarios where the otel collector gets a data rate it's usually able to handle, but has short-term latency.
Our tests indicate that using more goroutines here (2+) didn't have a major effect on CPU usage overall (but there was a small one, obviously). Again, it should be the consumer's choice to "pay" with more CPU to reduce the risk of data loss.
Describe alternatives you've considered
Additional context
We have a scenario that requires our otel collector to process high-scale data that's read from a UDP port.
Along with the UDP receiver we have an otel-batch-processor, and our otel-exporter sends the logs over the network (after being compressed). Our custom otel-exporter is maximally optimized (including using lots of concurrent channels, putting as much data as possible in each network request, compressing, etc.).