Warn on ingress drop in NFV #1045

wingo · 2016-10-19T15:17:36Z

WDYT @lukego? We found it useful to know when the slow part of our system was the NFV and not the guest. Otherwise you can NACK it, that's fine, we keep it in the lwaftr as it is.

Following up with your comments from #1039 (comment):

Please allow me to zoom out for a moment, not to suggest changing this patch, but rather to map out the context for this whole ingress-drops business mostly for the sake of future directions.

Here is how I have it all in my head:

The problem we want to solve is to detect when a Snabb process is overloaded and then to run a callback. The callback could take corrective action (JIT flush) or provide diagnostics (log message).

Definition: Snabb is overloaded when the average packet arrival rate exceeds the average packet processing rate.

This means that during overload a backlog will grow on one or more ingress interfaces, and once the backlog exceeds the queue size we will start to drop packets. Conversely, during "underload" we are processing packets faster than they are arriving, on average, and this means that any backlog on ingress interfaces will be shrinking and each interface will regularly be drained empty during a pull().

So how do we detect overload? There are a bunch of conditions and we could use one or more of them as proxies:
Packets are being dropped on ingress. Pro: Closely related to overload that exceeds tolerable bounds. Con: Has to be implemented specially for each I/O driver.
I/O interface runs out of packet buffers for receive (requires a complete refill on pull()). Pro/Con: Similar to (1) but perhaps easier/harder to implement for certain I/O drivers.
I/O interface consistently outputs the maximum burst size on pull() i.e. ~100 packets per breath (engine.pull_npackets) meaning that the packet arrival rate is >= the packet processing rate (because otherwise the queue would be shrinking towards zero). Pro: Should work for any I/O driver with an ingress queue size >= ~100. Con: ?
So putting this into the context of detecting ingress overload on a Virtio-net interface if we don't immediately have a "dropped packets counter" for (1) then we could use (2) by detecting if every descriptor on the receive vring is occupied with a packet (no room for more) or (3) by checking if the number of packets pulled by the virtio app is consistently engine.pull_npackets.

How much of this makes sense and what am I missing?

All of what you say makes sense :) The overloaded/underloaded terms make sense and I would be happy to detect that via any of the options, though (3) does sound appealing. Still, detecting the precise number of packets dropped on ingress is important to some of our users and that's a useful thing too, and might even be orthogonal to a (3) solution.

I look forward to your trace-specific watchdog work -- that sounds much much better than doing a heavy-handled jit.flush()!

Before, snabbnfv would run the engine for one-second intervals, checking for restart every second and restarting the engine. However this interacts poorly with the latency-tracking mechanism, which expects to be able to call non-performant functions like ffi.typeof() at setup, because it's just setup. With this fix we have less trace creation and interpreter bailout at run-time.

wingo · 2016-10-19T15:18:28Z

Note that the first patch is already part of wingo-next as it came from #1039.

Ingress drop monitor can warn, or warn and flush. Change lwaftr command line argument to take a parameter.

This has the side effect of removing the ingress drop monitor from the NFV; see upstream #1045.

wingo · 2016-10-20T08:35:50Z

I believe that the changes to IngressDropMonitor in #1045 are effectively included in this one just because it's coming already from the lwaftr, but this branch does not add the drop monitor to the NFV. I will keep #1045 open though because it includes the change to fix the tips URL.

lukego · 2016-10-20T09:19:44Z

Yes, I am on board with taking in the extensions to ingress drop monitor and running it on NFV.

Just wanted to take the opportunity to muse on the future because the current implementation feels like a short-term expedient thing to me (e.g. will only detect network-sourced overloads and not VM-sourced overloads).

I am also not sure yet on whether it's better to make global JIT flushes or selective ones or a combination of both. Interested in getting to the bottom of this and having a really robust solution. Some musing at LuaJIT/LuaJIT#218 (comment).

eugeneia · 2016-10-20T14:15:24Z

LGTM!

@lukego Seeing you are already onto #1039 I assigned this to you, OK?

wingo · 2016-10-20T14:19:28Z

I suggest putting off merging this until #1047 lands, FWIW. I think it has a minor conflict.

This also addresses the review comments from upstream #1047, noting that the shm.exists() API changed.

wingo · 2016-10-21T08:30:16Z

Merged to lwaftr which is under review in #1047.

Extend find-limit to support multiple NICs.

wingo added 2 commits October 19, 2016 15:20

Refactor ingress drop monitor to have configurable actions

6c1f585

Ingress drop monitor can warn, or warn and flush. Change lwaftr command line argument to take a parameter.

Enable warning ingress drop monitor on the NFV

de93573

wingo force-pushed the ingress-drop-nfv branch from f6c1691 to de93573 Compare October 19, 2016 15:20

wingo referenced this pull request in Igalia/snabb Oct 19, 2016

Merge branch 'wingo-next' into merge-wingo-next

d7bd8a9

This has the side effect of removing the ingress drop monitor from the NFV; see upstream #1045.

wingo mentioned this pull request Oct 19, 2016

Merge wingo-next Igalia/snabb#502

Merged

eugeneia assigned eugeneia and lukego and unassigned eugeneia Oct 20, 2016

wingo referenced this pull request in Igalia/snabb Oct 21, 2016

Merge #1045 (ingress-drop-nfv) into lwaftr

d4ba4b9

This also addresses the review comments from upstream #1047, noting that the shm.exists() API changed.

wingo mentioned this pull request Oct 21, 2016

Merge upstream ingress drop Igalia/snabb#506

Merged

wingo added the merged label Oct 21, 2016

eugeneia merged commit de93573 into snabbco:master Nov 7, 2016

wingo deleted the ingress-drop-nfv branch June 5, 2017 13:00

dpino pushed a commit to dpino/snabb that referenced this pull request Apr 19, 2018

Merge pull request snabbco#1045 from Igalia/extend-find-limit

44bb708

Extend find-limit to support multiple NICs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Warn on ingress drop in NFV #1045

Warn on ingress drop in NFV #1045

wingo commented Oct 19, 2016

wingo commented Oct 19, 2016

wingo commented Oct 20, 2016

lukego commented Oct 20, 2016

eugeneia commented Oct 20, 2016

wingo commented Oct 20, 2016

wingo commented Oct 21, 2016

Warn on ingress drop in NFV #1045

Warn on ingress drop in NFV #1045

Conversation

wingo commented Oct 19, 2016

wingo commented Oct 19, 2016

wingo commented Oct 20, 2016

lukego commented Oct 20, 2016

eugeneia commented Oct 20, 2016

wingo commented Oct 20, 2016

wingo commented Oct 21, 2016