Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on ingress drop in NFV #1045

Merged
merged 3 commits into from
Nov 7, 2016
Merged

Conversation

wingo
Copy link
Contributor

@wingo wingo commented Oct 19, 2016

WDYT @lukego? We found it useful to know when the slow part of our system was the NFV and not the guest. Otherwise you can NACK it, that's fine, we keep it in the lwaftr as it is.

Following up with your comments from #1039 (comment):

Please allow me to zoom out for a moment, not to suggest changing this patch, but rather to map out the context for this whole ingress-drops business mostly for the sake of future directions.

Here is how I have it all in my head:

The problem we want to solve is to detect when a Snabb process is overloaded and then to run a callback. The callback could take corrective action (JIT flush) or provide diagnostics (log message).

Definition: Snabb is overloaded when the average packet arrival rate exceeds the average packet processing rate.

This means that during overload a backlog will grow on one or more ingress interfaces, and once the backlog exceeds the queue size we will start to drop packets. Conversely, during "underload" we are processing packets faster than they are arriving, on average, and this means that any backlog on ingress interfaces will be shrinking and each interface will regularly be drained empty during a pull().

So how do we detect overload? There are a bunch of conditions and we could use one or more of them as proxies:

Packets are being dropped on ingress. Pro: Closely related to overload that exceeds tolerable bounds. Con: Has to be implemented specially for each I/O driver.
I/O interface runs out of packet buffers for receive (requires a complete refill on pull()). Pro/Con: Similar to (1) but perhaps easier/harder to implement for certain I/O drivers.
I/O interface consistently outputs the maximum burst size on pull() i.e. ~100 packets per breath (engine.pull_npackets) meaning that the packet arrival rate is >= the packet processing rate (because otherwise the queue would be shrinking towards zero). Pro: Should work for any I/O driver with an ingress queue size >= ~100. Con: ?

So putting this into the context of detecting ingress overload on a Virtio-net interface if we don't immediately have a "dropped packets counter" for (1) then we could use (2) by detecting if every descriptor on the receive vring is occupied with a packet (no room for more) or (3) by checking if the number of packets pulled by the virtio app is consistently engine.pull_npackets.

How much of this makes sense and what am I missing?

All of what you say makes sense :) The overloaded/underloaded terms make sense and I would be happy to detect that via any of the options, though (3) does sound appealing. Still, detecting the precise number of packets dropped on ingress is important to some of our users and that's a useful thing too, and might even be orthogonal to a (3) solution.

I look forward to your trace-specific watchdog work -- that sounds much much better than doing a heavy-handled jit.flush()!

Before, snabbnfv would run the engine for one-second intervals, checking
for restart every second and restarting the engine.  However this
interacts poorly with the latency-tracking mechanism, which expects to
be able to call non-performant functions like ffi.typeof() at setup,
because it's just setup.  With this fix we have less trace creation and
interpreter bailout at run-time.
@wingo
Copy link
Contributor Author

wingo commented Oct 19, 2016

Note that the first patch is already part of wingo-next as it came from #1039.

wingo added 2 commits October 19, 2016 15:20
Ingress drop monitor can warn, or warn and flush.  Change lwaftr command
line argument to take a parameter.
wingo referenced this pull request in Igalia/snabb Oct 19, 2016
This has the side effect of removing the ingress drop monitor from the
NFV; see upstream #1045.
@wingo
Copy link
Contributor Author

wingo commented Oct 20, 2016

I believe that the changes to IngressDropMonitor in #1045 are effectively included in this one just because it's coming already from the lwaftr, but this branch does not add the drop monitor to the NFV. I will keep #1045 open though because it includes the change to fix the tips URL.

@lukego
Copy link
Member

lukego commented Oct 20, 2016

Yes, I am on board with taking in the extensions to ingress drop monitor and running it on NFV.

Just wanted to take the opportunity to muse on the future because the current implementation feels like a short-term expedient thing to me (e.g. will only detect network-sourced overloads and not VM-sourced overloads).

I am also not sure yet on whether it's better to make global JIT flushes or selective ones or a combination of both. Interested in getting to the bottom of this and having a really robust solution. Some musing at LuaJIT/LuaJIT#218 (comment).

@eugeneia eugeneia assigned eugeneia and lukego and unassigned eugeneia Oct 20, 2016
@eugeneia
Copy link
Member

LGTM!

@lukego Seeing you are already onto #1039 I assigned this to you, OK?

@wingo
Copy link
Contributor Author

wingo commented Oct 20, 2016

I suggest putting off merging this until #1047 lands, FWIW. I think it has a minor conflict.

wingo referenced this pull request in Igalia/snabb Oct 21, 2016
This also addresses the review comments from upstream #1047, noting that
the shm.exists() API changed.
@wingo wingo added the merged label Oct 21, 2016
@wingo
Copy link
Contributor Author

wingo commented Oct 21, 2016

Merged to lwaftr which is under review in #1047.

@eugeneia eugeneia merged commit de93573 into snabbco:master Nov 7, 2016
@wingo wingo deleted the ingress-drop-nfv branch June 5, 2017 13:00
dpino pushed a commit to dpino/snabb that referenced this pull request Apr 19, 2018
Extend find-limit to support multiple NICs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants