Optimize lib.ipsec.esp #845
Conversation
```lua
   assert(len <= max_payload, "packet payload overflow")
   p.length = len
end
```
This `resize` function seems potentially error-prone to me: if you resize the packet but don't fill in all of the data then you will be leaking unknown information from the previous user of the packet.
One alternative would be for encapsulation/encryption to use a new `packet.pad(p, n)` that always fills with zeros, and for decapsulation/decryption to use a new function akin to `packet.from_pointer()` that sets payload and size at the same time.
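A minimal sketch of what the suggested `packet.pad(p, n)` could look like, runnable standalone under LuaJIT; the `struct packet` layout and `max_payload` value here are simplified stand-ins for illustration, not Snabb's actual definitions:

```lua
-- Hypothetical packet.pad: grow the payload by n bytes, zero-filling the
-- newly exposed region so no stale data from a previous packet user leaks.
local ffi = require("ffi")

local max_payload = 10240

-- Simplified stand-in for Snabb's packet struct (assumed layout).
ffi.cdef[[
   struct packet {
      uint16_t length;
      unsigned char data[10240];
   };
]]

local function pad (p, n)
   local len = p.length
   assert(len + n <= max_payload, "packet payload overflow")
   ffi.fill(p.data + len, n)   -- zero-fill the grown region
   p.length = len + n
end

local p = ffi.new("struct packet")
p.length = 4
p.data[6] = 0xff               -- stale byte beyond the current payload
pad(p, 8)
assert(p.length == 12)
assert(p.data[6] == 0)         -- pad zeroed the stale byte
```

A companion for decapsulation, akin to `packet.from_pointer`, would set the payload contents and `p.length` in one step, so the packet is never left with a length that exceeds its initialized data.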
I see two options:
- Divide `resize` up into `grow` and `shrink` (makes the code slightly awkward, and the API kind of encourages branching, although not in this specific case):
```diff
--- Set packet data length.
-function resize (p, len)
-   assert(len <= max_payload, "packet payload overflow")
-   p.length = len
-end
+-- Grow packet data length.
+function grow (p, o)
+   local len = p.length
+   local new_len = len + o
+   assert(new_len <= max_payload, "packet payload overflow")
+   ffi.fill(p.data+len, o)
+   p.length = new_len
+end
+
+-- Shrink packet data length.
+function shrink (p, o)
+   local len = p.length
+   local new_len = len - o
+   assert(new_len >= 0, "packet payload underflow")
+   p.length = new_len
+end
```
- Always zero packet memory in `free` (gives pretty solid protection against information leaks; performance impact?):
```diff
 function free (p)
    counter.add(engine.frees)
    counter.add(engine.freebytes, p.length)
+   -- Zero packet data
+   ffi.fill(p.data, p.length)
 [...]
```
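Option 1 can be exercised standalone under LuaJIT; the `struct pkt` below is a simplified stand-in for Snabb's `core.packet`, assumed for illustration:

```lua
local ffi = require("ffi")

local max_payload = 1024

-- Simplified stand-in for Snabb's packet struct (assumed layout).
ffi.cdef[[
   struct pkt {
      uint16_t length;
      unsigned char data[1024];
   };
]]

-- Grow packet data length, zeroing the newly exposed bytes.
local function grow (p, o)
   local len = p.length
   local new_len = len + o
   assert(new_len <= max_payload, "packet payload overflow")
   ffi.fill(p.data + len, o)
   p.length = new_len
end

-- Shrink packet data length; no fill needed, data stays within bounds.
local function shrink (p, o)
   local new_len = p.length - o
   assert(new_len >= 0, "packet payload underflow")
   p.length = new_len
end

local p = ffi.new("struct pkt")
p.data[0] = 0xff        -- stale byte from a previous "user" of the packet
grow(p, 4)              -- grows from 0 to 4, zeroing bytes 0..3
shrink(p, 2)
assert(p.length == 2 and p.data[0] == 0)
```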
Looks like solid optimization work, and awesome performance :-).
```diff
@@ -102,6 +102,7 @@ function length (p) return p.length end
 -- Set packet data length.
 function resize (p, len)
    assert(len <= max_payload, "packet payload overflow")
+   ffi.fill(p.data + p.length, math.max(0, len - p.length))
```
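The zero-filling `resize` can be sketched standalone (simplified packet struct assumed for illustration): the `math.max` guard makes the fill length zero when shrinking, so only a growing resize pays for the zeroing.

```lua
local ffi = require("ffi")

local max_payload = 1024

-- Simplified stand-in for Snabb's packet struct (assumed layout).
ffi.cdef[[
   struct pk {
      uint16_t length;
      unsigned char data[1024];
   };
]]

-- Set packet data length, zero-filling any newly exposed bytes.
local function resize (p, len)
   assert(len <= max_payload, "packet payload overflow")
   ffi.fill(p.data + p.length, math.max(0, len - p.length))
   p.length = len
end

local p = ffi.new("struct pk")
p.data[3] = 0xde        -- stale byte beyond the current (zero-length) payload
resize(p, 8)            -- growing zeroes bytes 0..7
assert(p.data[3] == 0)
resize(p, 2)            -- shrinking fills nothing (fill length is 0)
assert(p.length == 2)
```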
> This resize function seems potentially error-prone to me: if you resize the packet but don't fill in all of the data then you will be leaking unknown information from the previous user of the packet.
How about this: I take a slight performance hit (still >13 Gbit/s with the changes), which is somewhat amortized by the now-obsolete call to `ffi.fill` for zeroing the padding.
Sounds good to me. Can be further optimized in a more complicated way if/when necessary.
Side note: a "Gbps" figure is only meaningful when you also say which processor you are using. I don't know if you are testing a 2GHz CPU or a 4GHz CPU or something in between. This is something we need to be careful of so that users don't deploy on a 2GHz machine and expect the same per-core performance that we have measured on a 4GHz machine. Just something to keep in mind.
Side note on side note: I wonder if bits per cycle (aka Gbps per GHz) could be a handy metric for this kind of thing. For example if you are seeing 13 Gigabits per second on a 3.5 GHz lugano server then you could say you are seeing 3.7 bits-per-cycle with 1KB packet size on Haswell.
I will also be interested in seeing comparative performance between Haswell (v3) and Ivy Bridge (v2) and Broadwell (v4). Haswell's AES-NI hardware is supposed to be twice as fast as Ivy Bridge for AES-GCM and I am not sure whether this improves again in Broadwell. Also Intel provide two reference implementations, one optimized for Ivy Bridge and one for Haswell, and we have only implemented the Haswell-oriented one and not checked how that performs on Ivy Bridge. (Intel say the Haswell code works on older CPUs but with lower performance -- I am assuming this is very slight but have not quantified.)
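The suggested metric is just throughput divided by clock rate; a trivial helper (the function name is made up) makes the arithmetic concrete:

```lua
-- bits per cycle = (Gbit/s) / (Gcycles/s); the two "giga" factors cancel.
local function bits_per_cycle (gbit_per_s, ghz)
   return gbit_per_s / ghz
end

-- 13 Gbit/s on a 3.5 GHz Haswell, as in the example above:
print(string.format("%.2f bits/cycle", bits_per_cycle(13, 3.5)))  -- 3.71 bits/cycle
```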
(These performance measurement activities are perhaps best addressed by just making sure that `make benchmarks` covers IPsec and waiting until we have Hydra running that on all available CPU generations. Morally we should not have to be doing this kind of testing by hand these days.)
This was indeed tested on lugano-1 (Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz). I think the most interesting part is actually the (annotated) profiler output:
```
$ sudo lock ./snabb snabbmark esp 1e7 1024 encapsulate Fpv
locking /var/lock/lab.. flock: getting lock took 0.000002 seconds
81% lib/ipsec/aes_128_gcm.lua:encrypt
 <- 100% TRACE 24 ->loop
4% lib/ipsec/esp.lua:encapsulate
 <- 92% TRACE 24 ->loop
 <- 8% JIT Compiler
3% program/snabbmark/snabbmark.lua:esp
 <- 100% Garbage Collector
Encapsulation (packet size = 1024): 13.15 Gbit/s

$ sudo lock ./snabb snabbmark esp 1e7 1024 decapsulate Fpv
locking /var/lock/lab.. flock: getting lock took 0.000002 seconds
83% lib/ipsec/aes_128_gcm.lua:decrypt
 <- 100% TRACE 7 ->loop
5% lib/ipsec/esp.lua:decapsulate
 <- 100% TRACE 7 ->loop
3% program/snabbmark/snabbmark.lua:esp
 <- 94% Garbage Collector
 <- 6% TRACE 7 ->loop
3% core/packet.lua:clone
 <- 100% TRACE 7 ->loop
Decapsulation (packet size = 1024): 13.86 Gbit/s
```
We see 80% of the time used by our AES DynASM routines (“a” for annotate shows this more definitively). I read this as 20% overhead due to ESP on 1KB packets (vs plain AES-GCM).
This optimizes the encapsulation and decapsulation routines provided by `lib.ipsec.esp`: they no longer use `packet.append` and no longer create a new packet; instead they only mutate the input packet. For this I added a function `packet.resize` that is really just a setter for `packet.length` that additionally checks for overflow.

Benchmark results:
Cc @lukego