Envoy is spinning CPU on Windows #13189

davinci26 · 2020-09-18T23:34:30Z

Title: Envoy is spinning CPU on Windows

Description:

In 2019 version of Windows Server there is no support for level-based events which causes Envoy to spin CPU to process "empty" write ready events.

To resolve this issue we consider the following three approaches:

Selectively turn on and off the level based events. The idea is to only listen to file event notifications when we want to write to the socket (either because the previous write call got blocked or new data came in).
Change the socket implementation to use the I/O completion ports paradigm.
Rely on a different backend (see libevent replacement with libev #4952)

cc: @envoyproxy/windows-dev

The text was updated successfully, but these errors were encountered:

davinci26 · 2020-09-18T23:43:39Z

Given that the fact that newer versions of Windows will support edge-based events and that it requires the least amount of changes, I want to start experimenting with approach (1).

I got some input from @antoniovicente on the approach and I will start working on it for the beta release.

antoniovicente · 2020-09-19T00:26:15Z

It would be good to see if approach (1) works out. It certainly seems the least disruptive. The level trigger mode of operation could also be an option under linux, so testing sockets that operate in this paradigm should be possible on multiple platforms. I hope that the changes can be relatively localized: either Network::Connection and/or IoHandle.

Keep me posted on progress.

yanavlasov · 2020-09-22T00:32:24Z

Don't we already subscribe to write ready events when we have data to write?

davinci26 · 2020-09-22T01:19:41Z

Don't we already subscribe to write ready events when we have data to write?
I missed a only in my initial post (I will edit for clarity).

The high level concept is to optimize the time that we are subscribed to write-ready events. The idea is to only listen for write ready events on specific conditions Like when the previous write call to the socket got blocked or when we have data to the buffer.

mattklein123 · 2020-09-22T02:40:35Z

Don't you have the same problem with read events when we don't consume the entire read buffer?

antoniovicente · 2020-09-22T02:57:55Z

Don't you have the same problem with read events when we don't consume the entire read buffer?

In the current implementation, the call to readDisable(true) when above high watermark (or after we have a full H1 request message) removes read from the fd registration mask, so we already avoid having spurious read upcalls.

Of course there's this change I want to make once we have benchmarks that measure impact. The removal of spurious changes to the fd registraion mask should provide performance gains. master...antoniovicente:optimize_read_disable

mattklein123 · 2020-09-22T04:04:54Z

Of course there's this change I want to make once we have benchmarks that measure impact. The removal of spurious changes to the fd registraion mask should provide performance gains. master...antoniovicente:optimize_read_disable

Ah OK, cool. I had assumed we already did that. :)

piscisaureus · 2020-11-01T11:11:08Z

In lieu of windows/wepoll support for EPOLLET, I'd suggest to use EPOLLONESHOT instead.

EPOLLONESHOT is similar to EPOLLET in the sense that when an event (like EPOLLIN) is reported, it is also automatically disabled.

The difference is that in edge triggered mode the event is also automatically turned back on after calling read()/write(), whereas in ONESHOT mode this needs to be done "manually" by envoy.

So every time you're "done" reading from/writing to a socket (typically that is when you get a EWOULDBLOCK error, although there can be other reasons to "move on" from interacting with a particular socket), you have to re-enable the EPOLLIN/EPOLLOUT event with epoll_ctl().

It might require a bit of work but it should be straightforward, almost mechanical.

W.r.t wepoll performance - this change would not introduce additional overhead. In default (level-triggered) mode wepoll is doing exactly the same thing internally.

I'd suggest to to avoid hacks that switch level-triggered events on and off all the time. It's a lot more complicated, and it can make wepoll a bit slower (that depends on the exact epoll_ctl()/epoll_wait() call pattern).

@davinci26, @antoniovicente

davinci26 · 2020-11-01T21:25:12Z

@piscisaureus thanks a lot for your input! Highly appreciated.

EPOLLONESHOT is similar to EPOLLET in the sense that when an event (like EPOLLIN) is reported, it is also automatically disabled. The difference is that in edge triggered mode the event is also automatically turned back on after calling read()/write(), whereas in ONESHOT mode this needs to be done "manually" by envoy.

Conceptually this is pretty similar to what we are trying to do. This is good validation for what we are trying to achieve.

you have to re-enable the EPOLLIN/EPOLLOUT event with epoll_ctl()

Does libevent give you access to modify the flags of the backend? I have to investigate that further. @nigriMSFT do you happen to know?

In default (level-triggered) mode wepoll is doing exactly the same thing internally.

I am not sure I fully understand what you mean here.

I'd suggest to to avoid hacks that switch level-triggered events on and off all the time. It's a lot more complicated, and it can make wepoll a bit slower (that depends on the exact epoll_ctl()/epoll_wait() call pattern).

I have implemented this as part of #13787. But I do not see much extra complication. Am I missing something?

davinci26 added triage Issue requires triage design proposal Needs design doc/proposal before implementation area/windows labels Sep 18, 2020

davinci26 self-assigned this Sep 19, 2020

yanavlasov removed the triage Issue requires triage label Sep 22, 2020

sunjayBhatia added this to the windows-beta milestone Oct 2, 2020

This was referenced Oct 6, 2020

Adds block property to socket io results #13409

Closed

Stop exposing file_event_ from ioHandle #13478

Merged

davinci26 mentioned this issue Oct 27, 2020

[Level Events] manage level events registration mask #13787

Merged

mattklein123 closed this as completed in #13787 Nov 20, 2020

wrowe removed this from the windows-ga milestone Jan 15, 2021

davinci26 mentioned this issue Mar 5, 2021

NT support #129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Envoy is spinning CPU on Windows #13189

Envoy is spinning CPU on Windows #13189

davinci26 commented Sep 18, 2020 •

edited

Loading

davinci26 commented Sep 18, 2020

antoniovicente commented Sep 19, 2020

yanavlasov commented Sep 22, 2020

davinci26 commented Sep 22, 2020

mattklein123 commented Sep 22, 2020

antoniovicente commented Sep 22, 2020

mattklein123 commented Sep 22, 2020

piscisaureus commented Nov 1, 2020 •

edited

Loading

davinci26 commented Nov 1, 2020

Envoy is spinning CPU on Windows #13189

Envoy is spinning CPU on Windows #13189

Comments

davinci26 commented Sep 18, 2020 • edited Loading

davinci26 commented Sep 18, 2020

antoniovicente commented Sep 19, 2020

yanavlasov commented Sep 22, 2020

davinci26 commented Sep 22, 2020

mattklein123 commented Sep 22, 2020

antoniovicente commented Sep 22, 2020

mattklein123 commented Sep 22, 2020

piscisaureus commented Nov 1, 2020 • edited Loading

davinci26 commented Nov 1, 2020

davinci26 commented Sep 18, 2020 •

edited

Loading

piscisaureus commented Nov 1, 2020 •

edited

Loading