Implement freelist with FFI cdata object instead of Lua table #771

xrme · 2016-02-22T18:50:00Z

This (tiny) commit changes the representation of freelists from a Lua table to an ffi cdata object.

This is useful because it lets assembly language code access freelists in a straightforward way.

I believe I figured out the correct LuaJIT FFI notation for parameterizing the type of object that the freelist will hold, but I don't know if it is worth keeping that ability around. The only things that are ever freelisted are packets. It therefore might make sense to remove the type parameterization as an unnecessary abstraction.

This enables assembly language code to access freelists in a fairly straightforward manner.

mention-bot · 2016-02-22T18:50:01Z

By analyzing the blame information on this pull request, we identified @lukego and @nnikolaev-virtualopensystems to be potential reviewers

ghost · 2016-02-23T07:01:53Z

src/core/freelist.lua

-            -- XXX Better LuaJIT idiom for specifying the array type?
-            list = ffi.new(type.."[?]", size) }
+   local element_ct = ffi.typeof(type)
+   local fl_type = ffi.typeof("struct { int nfree, max; $ list[?]; }",


Why ints for free and max? Should it be a 64bit type?
What about cache alignment? I'd personally prefer list to start on an aligned address.

Did you measure the performance impact?

Performance impact should be measured automatically by the CI. If this testing is not sufficient then we should improve it.

FYI: You can see the performance regression tests at the top of the results page that is linked from the little green tick / red cross next to the commit.

Here is the snippet from this PR:

Checking for performance regressions: BENCH basic1-100e6 -> 0.967078 of 14.58 (SD: 0.0979796 ) BENCH packetblaster-64 -> 0.997168 of 10.592 (SD: 0.0292575 ) BENCH packetblaster-synth-64 -> 0.991587 of 10.698 (SD: 0.00979796 ) BENCH snabbnfv-iperf-1500 -> 0.899233 of 5.478 (SD: 0.326399 ) BENCH snabbnfv-iperf-jumbo -> 0.997787 of 6.326 (SD: 0.191896 ) BENCH snabbnfv-loadgen-dpdk -> 0.991281 of 2.6838 (SD: 0.0322639 )

@eugeneia this actually looks suspicious to me. Is SnabbBot being quite conservative about flagging problems with there is high variance in results? If I am reading this right then the iperf-1500 result had around -10% for the mean value but this is accepted due to high variance / error bars i.e. likely false negative?

Unrelated to above comments, but related to the line in question: If there is ever more than one freelist used by a function, then this switch could cause branchy traces as each ffi.typeof('struct {}') will create a fresh type. Makes me nervous!

@lukego Standard deviation is only printed and not used in any equation. Any result below 85% of master results is considered a failure. Might be a ridiculous margin, I had no idea of what would be a good margin at the time of picking this number. If I remember correctly 0.85 was chosen because it yielded an acceptable false positive rate. If I were to revisit that code, I would not calculate average at all and use minimum instead.

@eugeneia Thanks for explaining. One day we can revisit this once we also have a better handle on the whole "scientific testing" concept. For now it is probably sufficient that we check more closely before making a release. I would not like to ship even a 1% performance regression if we can avoid it but I understand that it is non-trivial to verify this (particularly on a reasonable processing budget).

lukego · 2016-02-23T14:18:41Z

The only things that are ever freelisted are packets. It therefore might make sense to remove the type parameterization as an unnecessary abstraction.

This sounds like a good idea in light of @wingo's observation.

LuaJIT generates code that is very type-specialized and every non-trivial call to ffi.typeof() returns a new type from the JIT perspective. This means that hypothetically if a function were using multiple freelist objects then the JIT would see them all as distinct datatypes and the type guards in generated code would often fail and cause branches off the fast path.

This can be avoided by hoisting the ffi.typeof() call up to the module level.

I don't think this would actually happen today with the code as written - we're only working on one freelist - but if you were already thinking of simplifying the type then this would be another reason to do so.

This whole "FFI ctype diversity" JIT issue is something that @alexandergall discovered relatively recently in #612.

lukego · 2016-02-23T14:18:51Z

@wingo will you be upstream for this one?

wingo · 2016-02-23T14:23:06Z

avec plaisir :)

wingo · 2016-02-23T14:32:55Z

There are a number of modules that require the freelist module:

./core/packet.lua:8:local freelist = require("core.freelist")
./apps/intel/intel_app.lua:10:local freelist = require("core.freelist")
./apps/basic/basic_apps.lua:4:local freelist = require("core.freelist")
./apps/solarflare/solarflare.lua:4:local freelist = require("core.freelist")
./lib/virtio/virtq_device.lua:6:local freelist  = require("core.freelist")
./lib/virtio/net_device.lua:6:local freelist  = require("core.freelist")
./program/snabbmark/snabbmark.lua:8:local freelist      = require("core.freelist")

But AFAICT only the packet module uses it. @xrme, if you have no plans to use the freelist module more widely, would you mind inlining the freelist module into packet.lua and removing freelist.lua? Then we can get rid of the ffi.typeof of the tagless struct to avoid this hazard in the future. Probably you will have to update the documentation as well, but it will just be deletion. Thanks!

xrme · 2016-02-24T02:33:15Z

The apparent performance regression is bothering me.

I don't think I understand the comment by @nnikolaev-virtualopensystems regarding the alignment of the list vector. Is there some reason it should be more than 8-byte aligned?

ghost · 2016-02-24T06:39:52Z

When I access element number 1 from the list, I prefer that the next couple of elements are also fetched in the cacheline (which is 64 bytes on Intel's Xeon). So 64 byte alignment of performance sensitive structures might be crucuial in some cases.

wingo · 2016-02-24T08:08:25Z

Regarding the perf regression, I only see that on the snabbnfv-iperf-1500 case, and then only sometimes. That case seems to have a large standard deviation so perhaps that's normal. I don't suspect this change as having significant perf impact.

ghost · 2016-02-24T08:11:30Z

But maybe we can gain some performance here?

wingo · 2016-02-24T08:14:20Z

src/core/packet.lua

+    int nfree, max;
+    struct packet *list[?];
+};
+]]


Thanks for doing the refactor, looks great :) I have two more nits before landing:

Would you mind changing to unsigned integers please, unless you have a reason to use signed integers

I wonder about "struct freelist" -- I can see this name as easily colliding with a name from some other user's module, struct tags being global and all. Unless you need this struct to have a tag, in this case local freelist_t = ffi.typeof[[struct { ... }]] is more appropriate. In that case I would change the ffi.new call below to be local packets_fl = freelist_t(max_packets, 0, max_packets).

Thanks!

I did some quick basic benchmark testing and got slightly better results with:

ffi.cdef[[ struct freelist { uint64_t nfree, max; __attribute__((aligned(64))) struct packet *list[?]; }; ]]

I understand that it is probably annoying to insist here - but this and struct packet are the most used (at runtime) data structures in the project, so I believe every detail here is important. So if this is going upstream, IMHO we should look at it very carefully.

Interesting! Would you mind quantifying what you did to benchmark this change?

for i in {1..10}; do sudo ./bench/basic1-100e6; done
Without the patch:

27.6 29.5 29.4 29.4 29.0 29.2 29.4 29.4 29.3 28.5

With the patch:

28.3 30.0 29.9 29.8 30.0 29.8 29.3 30.0 30.0 29.9

I know, it's not much. But the basic1 benchmark has only 3 links to process. I believe the effect can (and will) be multiplied once you scale your configuration.

Sounds good to me, only thing is I'm not sure that align((64)) works with ffi.new. It certainly wouldn't work if it were just malloc. Incidentally, why do you consider alignment to be important if we are always working on the back of the list, a dynamic location which will likely be unaligned? I also note that it's not the number of links in the network but the number of places the packet ingresses and egresses from Snabb. For me I leave it up to @xrme to determine whether he wants to incorporate a change like this, I won't block the commit because of it, though I'm happy to accept follow-ups :)

A 10-minute glance at the LuaJIT source tree suggests to me that it probably does, but I haven't absolutely verified it.
http://luajit.org/ext_ffi_semantics.html mentions the alignment attributes LuaJIT accepts. The mechanism looks to be CTA_ALIGN (defined in lj_ctype.h, set in lj_cparse.c). Then, in lib_ffi.c, LJLIB_CF(ffi_new) calls lj_cdata_newx (which is in lj_cdata.c), which in turn calls lj_cdata_newv, which in turn says /* Allocate variable-sized or specially aligned C data object. */
TL;DR: I haven't tested it, but the LuaJIT code certainly makes it look plausible.

wingo · 2016-02-24T08:18:36Z

@nnikolaev-virtualopensystems regarding alignment, since on this freelist we're always reading and writing on the back, and we expect the freelist to never be empty, I don't see an immediate way that differing alignment can help; do you? Happy to learn :)

Regarding perf, I'm always happy for more perf :) but @xrme is in the middle of a refactor to pull multiple packets off the freelist at a time, from assembly, promising some significant gains; with that in mind I think it's OK to keep the focus on simplicity for now :)

lukego · 2016-02-24T08:40:47Z

The apparent performance regression is bothering me.

This is bothering me too. I want a well-defined way to reach definitive conclusions on these issues. So that we could say with certainty what is the impact of the FFI freelist and of different possible alignments. I am feeling around for steps we could take in this direction.

The basic1 benchmark should be the most sensitive to freelist performance. It does something like 100M packet freelist operations per second. (The iperf-1500 benchmark is actually dominated by totally different factors e.g. SIMD checksum routine performance which is more reason to doubt the significance of that result.)

I compared 100 runs on this branch with 100 runs of the master branch on lugano-1 and collected the results in a spreadsheet. I see 0.9% difference. I have not done an analysis of variance to determine whether this is more likely due to background variation (significant) vs the code change, I am not quite at this level of sophisticated yet, but I would suggest that the difference seems very small indeed and not likely something we should worry about here and now.

I want to start doing more thorough A/B performance evaluations of release candidates on the next branch so that we can look at the sum total impact of the changes we have merged and then go digging in more detail if we find significant regressions. It's probably overkill to do this much analysis on individual PRs.

xrme · 2016-02-24T15:47:52Z

I wonder about "struct freelist" -- I can see this name as easily colliding with a name from some other user's module, struct tags being global and all. Unless you need this struct to have a tag...

DynASM has a feature where you can say

|.type link, struct link

-- Emit code to receive packet pointer into register p
-- from the ring buffer pointer in register link.
-- i is a temp register
local function emit_receive(p, link, i)
   local mask = (C.LINK_RING_SIZE - 1)
   | mov Rd(i), link:Rq(link)->read
   -- assumes packets[] is first in struct link
   -- could use displacement of offsetof(link,packets)
   | mov Rq(p), [Rq(link)+Rq(i)*8]
   | add Rd(i), 1
   | and Rd(i), mask
   | mov dword link:Rq(link)->read, Rd(i)
end

See http://corsix.github.io/dynasm-doc/reference.html#_type

I'd like to be able to use notation like mov Rd(i), link:Rq(link)->read, and I think I need struct packet to exist to do that.

wingo · 2016-02-24T16:12:19Z

Neat feature, thanks for explaining it! I had no idea :)

wingo · 2016-02-24T16:15:05Z

Merged. If you have anything you want added, let's do it in another PR. Cheers :)

lukego · 2016-02-27T10:49:04Z

For an exercise I made a CSV file from Nikolay's data:

$ cat nn.csv
base,patched
27.6,28.3
29.5,30.0
29.4,29.9
29.4,29.8
29.0,30.0
29.2,29.8
29.4,29.3
29.4,30.0
29.3,30.0
28.5,29.9

and wrote an R one-liner to summarize that:

$ R
> summary(read.csv(file='nn.csv'))
      base          patched
 Min.   :27.60   Min.   :28.3
 1st Qu.:29.05   1st Qu.:29.8
 Median :29.35   Median :29.9
 Mean   :29.07   Mean   :29.7
 3rd Qu.:29.40   3rd Qu.:30.0
 Max.   :29.50   Max.   :30.0

and then I reformatted the data into another format, which I am sure could have been done within the R script instead:

$ head -3 nikolay.csv 
code,Mpps
base,27.6
base,29.5

so that I can easily ask R for a post-hoc analysis using Tukey's test:

$ R
> nn <- read.csv(file='nikolay.csv')
> df <- data.frame(code=nn$code, Mpps=nn$Mpps)
> TukeyHSD(aov(Mpps ~ code, data=df))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Mpps ~ code, data = df)

$code
             diff        lwr      upr     p adj
patched-base 0.63 0.09845798 1.161542 0.0227726

which tells us that the measured average difference from patched to base is +0.63 Mpps and we are 95% confident that the real difference is between +0.1 Mpps and +1.16 Mpps. (lwr and upr are the lower and upper bounds that we are 95% confident of.)

This says to me that the change looks promising but it would be interesting to run with more measurements so that we can say with 95% confidence that the difference is more than 0.1%.

Nikolay if you want to submit that change on its own PR I would be happy to poke around at it as an exercise in data analysis.

Implement freelist with FFI cdata object instead of Lua table

7ea1bc3

This enables assembly language code to access freelists in a fairly straightforward manner.

ghost reviewed Feb 23, 2016
View reviewed changes

wingo self-assigned this Feb 23, 2016

lukego mentioned this pull request Feb 23, 2016

Zen of snabb: ABI > API #774

Open

xrme added 2 commits February 23, 2016 12:08

Don't require the freelist module in modules that don't use it.

499a579

Remove freelist.lua. Inline its functionality into packet.lua.

47b9b37

wingo reviewed Feb 24, 2016
View reviewed changes

Use uint64_t for nfree and max in freelist structure.

28593e6

wingo added a commit to wingo/snabb that referenced this pull request Feb 24, 2016

Merge 'xrme/ffi-freelist' (snabbco#771) into wingo-next

c623fe3

wingo added the merged label Feb 24, 2016

This was referenced Apr 26, 2016

Merge wingo-next into next #894

Closed

Merge wingo-next into next #895

Merged

eugeneia merged commit 28593e6 into snabbco:master May 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement freelist with FFI cdata object instead of Lua table #771

Implement freelist with FFI cdata object instead of Lua table #771

xrme commented Feb 22, 2016

mention-bot commented Feb 22, 2016

ghost Feb 23, 2016

lukego Feb 23, 2016

lukego Feb 23, 2016

wingo Feb 23, 2016

eugeneia Feb 23, 2016

lukego Feb 23, 2016

lukego commented Feb 23, 2016

lukego commented Feb 23, 2016

wingo commented Feb 23, 2016

wingo commented Feb 23, 2016

xrme commented Feb 24, 2016

ghost commented Feb 24, 2016

wingo commented Feb 24, 2016

ghost commented Feb 24, 2016

wingo Feb 24, 2016

ghost Feb 24, 2016

wingo Feb 24, 2016

ghost Feb 24, 2016

wingo Feb 24, 2016

kbara Feb 24, 2016

wingo commented Feb 24, 2016

lukego commented Feb 24, 2016

xrme commented Feb 24, 2016

wingo commented Feb 24, 2016

wingo commented Feb 24, 2016

lukego commented Feb 27, 2016

Implement freelist with FFI cdata object instead of Lua table #771

Implement freelist with FFI cdata object instead of Lua table #771

Conversation

xrme commented Feb 22, 2016

mention-bot commented Feb 22, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lukego commented Feb 23, 2016

lukego commented Feb 23, 2016

wingo commented Feb 23, 2016

wingo commented Feb 23, 2016

xrme commented Feb 24, 2016

ghost commented Feb 24, 2016

wingo commented Feb 24, 2016

ghost commented Feb 24, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wingo commented Feb 24, 2016

lukego commented Feb 24, 2016

xrme commented Feb 24, 2016

wingo commented Feb 24, 2016

wingo commented Feb 24, 2016

lukego commented Feb 27, 2016