
Optimize lib.ipsec.esp #845

Merged
merged 8 commits into snabbco:ipsec on Apr 26, 2016
Conversation

eugeneia
Member

This optimizes the encapsulation and decapsulation routines provided by lib.ipsec.esp:

  • The routines now use raw buffer mutations instead of packet.append, and no longer allocate a new packet; instead they mutate the input packet in place. For this I added a function packet.resize, which is essentially a setter for packet.length that additionally checks for overflow. (A sketch of the in-place shape follows this list.)
  • Following the awesome JIT tracology research by @alexandergall, the branch-heavy sequence number tracking is now implemented as a C function. (This is actually not a caveat; it would probably have to become a C function anyway in the course of implementing efficient anti-replay functionality.)
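For illustration, here is a minimal sketch of the in-place shape (the function body, ESP field sizes, and all names other than packet.resize are illustrative, not the actual lib.ipsec.esp code):

local packet = require("core.packet")

-- Illustrative ESP overhead for AES-128-GCM: 8-byte header (SPI plus
-- sequence number), 8-byte IV, 2-byte pad-length/next-header trailer,
-- 16-byte ICV (variable padding omitted for brevity).
local ESP_OVERHEAD = 8 + 8 + 2 + 16

local function encapsulate_in_place (p)
   -- Grow the input packet in place instead of allocating a fresh packet
   -- and copying into it with packet.append().
   packet.resize(p, p.length + ESP_OVERHEAD)
   -- ... write header/IV/trailer and encrypt directly in p.data here ...
   return p
end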

Benchmark results:

$ sudo ./snabb snabbmark esp 1e7 1024 encapsulate
Encapsulation (packet size = 1024): 13.78 Gbit/s
$ sudo ./snabb snabbmark esp 1e7 1024 decapsulate
Decapsulation (packet size = 1024): 14.23 Gbit/s

Cc @lukego

@eugeneia eugeneia self-assigned this Mar 24, 2016
@eugeneia eugeneia mentioned this pull request Apr 20, 2016
function resize (p, len)
   assert(len <= max_payload, "packet payload overflow")
   p.length = len
end

lukego
Member

This resize function seems potentially error-prone to me: if you resize the packet but don't fill in all of the data then you will be leaking unknown information from the previous user of the packet.

One alternative would be for encapsulation/encryption to use a new packet.pad(p, n) that always fills with zeros and for decapsulation/decryption to use a new function akin to packet.from_pointer() to set payload and size at the same time.
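To make that concrete, here is a minimal sketch of the two proposed helpers (neither exists at this point; the names follow the suggestion above, and max_payload is assumed to be core.packet's existing constant):

local ffi = require("ffi")

-- Grow the packet by n bytes, always zero-filling the new region so that
-- no stale data from a previous user of the buffer can leak.
function pad (p, n)
   assert(p.length + n <= max_payload, "packet payload overflow")
   ffi.fill(p.data + p.length, n)
   p.length = p.length + n
end

-- Set payload and length at the same time, akin to packet.from_pointer():
-- every byte up to the new length is overwritten, so nothing can leak.
function set_payload (p, ptr, len)
   assert(len <= max_payload, "packet payload overflow")
   ffi.copy(p.data, ptr, len)
   p.length = len
end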

eugeneia
Member Author

I see two options:

  1. Divide resize into grow and shrink (makes the code slightly awkward, and the API kind of encourages branching, although not in this specific case):
--- Set packet data length.
-function resize (p, len)
-   assert(len <= max_payload, "packet payload overflow")
-   p.length = len
+-- Grow packet data length.
+function grow (p, o)
+   local len = p.length
+   local new_len = len + o
+   assert(new_len <= max_payload, "packet payload overflow")
+   ffi.fill(p.data+len, o)
+   p.length = new_len
+end
+
+-- Shrink packet data length.
+function shrink (p, o)
+   local len = p.length
+   local new_len = len-o
+   assert(new_len >= 0, "packet payload underflow")
+   p.length = new_len
+end
  2. Always zero packet memory in free (gives pretty solid protection against information leaks, but what is the performance impact?):
 function free (p)
    counter.add(engine.frees)
    counter.add(engine.freebytes, p.length)
+   -- Zero packet data
+   ffi.fill(p.data, p.length)
 [...]

@lukego
Member

lukego commented Apr 21, 2016

Looks like solid optimization work, and awesome performance :-).

@@ -102,6 +102,7 @@ function length (p) return p.length end
 -- Set packet data length.
 function resize (p, len)
    assert(len <= max_payload, "packet payload overflow")
+   ffi.fill(p.data + p.length, math.max(0, len - p.length))
    p.length = len
 end
eugeneia
Member Author

@lukego:

This resize function seems potentially error-prone to me: if you resize the packet but don't fill in all of the data then you will be leaking unknown information from the previous user of the packet.

How about this: I take a slight performance hit (still >13 Gbit/s with the changes), which is somewhat amortized because the explicit call to ffi.fill for zeroing the padding is now obsolete.

Member

Sounds good to me. Can be further optimized in a more complicated way if/when necessary.

Side note: a "Gbps" figure is only meaningful when you also say which processor you are using. I don't know if you are testing a 2GHz CPU or a 4GHz CPU or something in between. This is something we need to be careful of so that users don't deploy on a 2GHz machine and expect the same per-core performance that we have measured on a 4GHz machine. Just something to keep in mind.

Member

Side note on side note: I wonder if bits per cycle (aka Gbps per GHz) could be a handy metric for this kind of thing. For example if you are seeing 13 Gigabits per second on a 3.5 GHz lugano server then you could say you are seeing 3.7 bits-per-cycle with 1KB packet size on Haswell.
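(The arithmetic, as a trivial Lua one-liner using the numbers above:)

-- bits per cycle = throughput (Gbit/s) / clock rate (GHz)
print(13 / 3.5) --> 3.7142857142857, i.e. ~3.7 bits per cycle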

I will also be interested in seeing comparative performance between Haswell (v3), Ivy Bridge (v2), and Broadwell (v4). Haswell's AES-NI hardware is supposed to be twice as fast as Ivy Bridge's for AES-GCM, and I am not sure whether this improves again in Broadwell. Also, Intel provide two reference implementations, one optimized for Ivy Bridge and one for Haswell; we have only implemented the Haswell-oriented one and not checked how it performs on Ivy Bridge. (Intel say the Haswell code works on older CPUs but with lower performance -- I am assuming the penalty is very slight, but I have not quantified it.)

Member

(These performance measurement activities are perhaps best addressed by just making sure that make benchmarks covers IPsec and waiting until we have Hydra running that on all available CPU generations. Morally we should not have to be doing this kind of testing by hand these days.)

eugeneia
Member Author

This was indeed tested on lugano-1 (Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz). I think the most interesting part is actually the (annotated) profiler output:

$ sudo lock ./snabb snabbmark esp 1e7 1024 encapsulate Fpv
locking /var/lock/lab.. flock: getting lock took 0.000002 seconds
81%  lib/ipsec/aes_128_gcm.lua:encrypt
  <- 100%  TRACE  24          ->loop     
 4%  lib/ipsec/esp.lua:encapsulate
  <- 92%  TRACE  24          ->loop     
  <-  8%  JIT Compiler
 3%  program/snabbmark/snabbmark.lua:esp
  <- 100%  Garbage Collector
Encapsulation (packet size = 1024): 13.15 Gbit/s
$ sudo lock ./snabb snabbmark esp 1e7 1024 decapsulate Fpv
locking /var/lock/lab.. flock: getting lock took 0.000002 seconds
83%  lib/ipsec/aes_128_gcm.lua:decrypt
  <- 100%  TRACE   7          ->loop     
 5%  lib/ipsec/esp.lua:decapsulate
  <- 100%  TRACE   7          ->loop     
 3%  program/snabbmark/snabbmark.lua:esp
  <- 94%  Garbage Collector
  <-  6%  TRACE   7          ->loop     
 3%  core/packet.lua:clone
  <- 100%  TRACE   7          ->loop     
Decapsulation (packet size = 1024): 13.86 Gbit/s

We see 80% of the time used by our AES DynASM routines (“a” for annotate shows this more definitively). I read this as 20% overhead due to ESP on 1KB packets (vs plain AES-GCM).

@eugeneia eugeneia merged commit 41de2a2 into snabbco:ipsec Apr 26, 2016
eugeneia added a commit that referenced this pull request Apr 26, 2016
dpino pushed a commit to dpino/snabb that referenced this pull request Jun 28, 2017
Add choice statement printer for YANG data