Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UDP service listener performance enhancements #4676

Merged
merged 2 commits into from
Nov 7, 2015
Merged

UDP service listener performance enhancements #4676

merged 2 commits into from
Nov 7, 2015

Conversation

sparrc
Copy link
Contributor

@sparrc sparrc commented Nov 5, 2015

No description provided.

@@ -98,11 +101,13 @@ func (s *Service) Open() (err error) {
s.Logger.Printf("Failed to set up UDP listener at address %s: %s", s.addr, err)
return err
}
s.conn.SetReadBuffer(s.config.ReadBuffer)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the most significant change of this entire PR, increasing the read buffer size has a dramatic improvement in the ability to handle large bursts of UDP traffic. We are defaulting to 8mb. The previous default was whatever the OS was configured for, which is usually ~128kb

@sparrc
Copy link
Contributor Author

sparrc commented Nov 5, 2015

@otoolep @nkatsaros

Change for the UDP service listener. This does 2 things:

  1. Increases the default UDP read buffer size to 8MB (from OS default ~128kb). It also makes this buffer size configurable for users wanting it to handle even more traffic.
  2. Splits parsing incoming UDP packets into a separate goroutine (parser()), so that serve() only needs to read a packet, check for error, and send into a channel.

@otoolep
Copy link
Contributor

otoolep commented Nov 5, 2015

cc @daviesalex @dswarbrick

@@ -64,6 +66,7 @@ func NewService(c Config) *Service {
return &Service{
config: d,
done: make(chan struct{}),
bytebuf: make(chan []byte, 1000),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this magic number come from?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a better name. parserChan perhaps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to parserChan

I really don't have any particular reason to set it to 1000. In my testing I rarely saw the length of this channel get beyond 10. So I thought I'd just go up two orders of magnitude, since 1000 isn't going to consume much memory at all and should allow it to handle very high loads with smaller packets. I will make it a constant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@otoolep
Copy link
Contributor

otoolep commented Nov 5, 2015

OK, this generally looks good but I have some feedback before we can merge.

We need to remove the magic number 1000. Let's make it a constant at the top of the file. Do you really think it needs to be 1000?

What testing have you actually performed to show that this improves performance?

@otoolep
Copy link
Contributor

otoolep commented Nov 5, 2015

I would also like us to wait for feedback from the others CC on this ticket before we merge (wait 24-48 hours anyway) since they have all been good enough to help us with UDP performance.

@nkatsaros
Copy link
Contributor

If you SIGINT influxd you'll get [udp] 2015/11/05 14:20:03 Failed to read UDP message: read udp [::]:8089: use of closed network connection. This happens because s.conn.Close() is called in the middle of a read.

Fixing it requires changing the order you close stuff, adding a timeout to the UDP read, and doing a continue in the UDP read loop if you get a timeout error. That will allow you to stop the read loop cleanly.

It's not a big deal, but I can see the message causing confusion.

@sparrc
Copy link
Contributor Author

sparrc commented Nov 5, 2015

@otoolep Re: performance testing

I've been doing local testing with 64kb packet sizes.

When bursting very large numbers of UDP packets (100,000 metrics in one batch, so 100s of 64kb packets within nanoseconds) I would see ~20% loss on the old listener. After splitting out parser() I was seeing ~15% loss. After increasing the read buffer I'm now seeing 0% loss.

I wrote some test scripts at various burst sizes, I can send you them if you want for reference

@otoolep
Copy link
Contributor

otoolep commented Nov 5, 2015

When I last tested Go's network code, adding that timeout on the read had a significant impact on performance. I would not add it without testing. It could be a distinct PR.

@otoolep
Copy link
Contributor

otoolep commented Nov 5, 2015

Sounds great @sparrc -- sounds like you may have cracked it. What you were seeing sounds rather like what others were seeing.

@nkatsaros
Copy link
Contributor

+1 on increasing the read buffer size. It made my UDP service much more performant as well. I also stumbled upon 8*1024*1024, but it was arbitrary and I don't remember how I picked it.

@sparrc
Copy link
Contributor Author

sparrc commented Nov 5, 2015

@nkatsaros I think that issue with the SIGINT is not specific to this change though, correct? We could fix that in a separate PR. I also wonder if simply changing the ordering of these lines would fix that: https://github.com/influxdb/influxdb/blob/master/services/udp/service.go#L178-L180

@nkatsaros
Copy link
Contributor

No, it wasn't specific to this change and it wasn't related to performance but I figured I'd mention it. I don't think changing the order of those lines would fix it unless you were actively receiving UDP packets.

@otoolep
Copy link
Contributor

otoolep commented Nov 5, 2015

+1 on a distinct PR for that.

@dswarbrick
Copy link

Out of curiosity, is the collectd input plugin derived from this same UDP listener code? Back when we still were using the collectd input, I was aware that InfluxDB was simply using the OS default receive buffer size. I don't remember exactly which version this was, but we experimented with increasing the OS defaults, and confirmed that InfluxDB was inheriting these. I think we pushed it as high as 32 MB, which is starting to get a bit silly. The buffer still ultimately filled up however, and the kernel started to drop packets. Bear in mind however that this was long before tsm1 engine, when we still had major IOPS issues. I suspect the DB simply wasn't keeping up, and the buffer bloat eventually overflowed.

@sparrc
Copy link
Contributor Author

sparrc commented Nov 5, 2015

@dswarbrick, that is true, @otoolep opened a case to increase the buffer size of those input plugins also #4678

@sparrc
Copy link
Contributor Author

sparrc commented Nov 6, 2015

@otoolep I made the doc changes in #4681, OK to merge?

@sparrc
Copy link
Contributor Author

sparrc commented Nov 6, 2015

@otoolep OK, now I've changed read-buffer to default to 0, which means to use the OS default buffer size. More documentation will be in PR #4681

@otoolep
Copy link
Contributor

otoolep commented Nov 7, 2015

Looks great @sparrc

@otoolep
Copy link
Contributor

otoolep commented Nov 7, 2015

+1

@sparrc sparrc merged commit 78e6979 into master Nov 7, 2015
@mark-rushakoff mark-rushakoff deleted the udp-perf branch February 13, 2016 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants