Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement freelist with FFI cdata object instead of Lua table #771

Merged
merged 4 commits into from
May 12, 2016
Merged

Implement freelist with FFI cdata object instead of Lua table #771

merged 4 commits into from
May 12, 2016

Conversation

xrme
Copy link
Contributor

@xrme xrme commented Feb 22, 2016

This (tiny) commit changes the representation of freelists from a Lua table to an ffi cdata object.

This is useful because it lets assembly language code access freelists in a straightforward way.

I believe I figured out the correct LuaJIT FFI notation for parameterizing the type of object that the freelist will hold, but I don't know if it is worth keeping that ability around. The only things that are ever freelisted are packets. It therefore might make sense to remove the type parameterization as an unnecessary abstraction.

This enables assembly language code to access freelists in a fairly
straightforward manner.
@mention-bot
Copy link

By analyzing the blame information on this pull request, we identified @lukego and @nnikolaev-virtualopensystems to be potential reviewers

-- XXX Better LuaJIT idiom for specifying the array type?
list = ffi.new(type.."[?]", size) }
local element_ct = ffi.typeof(type)
local fl_type = ffi.typeof("struct { int nfree, max; $ list[?]; }",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ints for free and max? Should it be a 64bit type?
What about cache alignment? I'd personally prefer list to start on an aligned address.

Did you measure the performance impact?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance impact should be measured automatically by the CI. If this testing is not sufficient then we should improve it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: You can see the performance regression tests at the top of the results page that is linked from the little green tick / red cross next to the commit.

Here is the snippet from this PR:

Checking for performance regressions:
BENCH basic1-100e6 -> 0.967078 of 14.58 (SD: 0.0979796 )
BENCH packetblaster-64 -> 0.997168 of 10.592 (SD: 0.0292575 )
BENCH packetblaster-synth-64 -> 0.991587 of 10.698 (SD: 0.00979796 )
BENCH snabbnfv-iperf-1500 -> 0.899233 of 5.478 (SD: 0.326399 )
BENCH snabbnfv-iperf-jumbo -> 0.997787 of 6.326 (SD: 0.191896 )
BENCH snabbnfv-loadgen-dpdk -> 0.991281 of 2.6838 (SD: 0.0322639 )

@eugeneia this actually looks suspicious to me. Is SnabbBot being quite conservative about flagging problems with there is high variance in results? If I am reading this right then the iperf-1500 result had around -10% for the mean value but this is accepted due to high variance / error bars i.e. likely false negative?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to above comments, but related to the line in question: If there is ever more than one freelist used by a function, then this switch could cause branchy traces as each ffi.typeof('struct {}') will create a fresh type. Makes me nervous!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukego Standard deviation is only printed and not used in any equation. Any result below 85% of master results is considered a failure. Might be a ridiculous margin, I had no idea of what would be a good margin at the time of picking this number. If I remember correctly 0.85 was chosen because it yielded an acceptable false positive rate. If I were to revisit that code, I would not calculate average at all and use minimum instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eugeneia Thanks for explaining. One day we can revisit this once we also have a better handle on the whole "scientific testing" concept. For now it is probably sufficient that we check more closely before making a release. I would not like to ship even a 1% performance regression if we can avoid it but I understand that it is non-trivial to verify this (particularly on a reasonable processing budget).

@lukego
Copy link
Member

lukego commented Feb 23, 2016

The only things that are ever freelisted are packets. It therefore might make sense to remove the type parameterization as an unnecessary abstraction.

This sounds like a good idea in light of @wingo's observation.

LuaJIT generates code that is very type-specialized and every non-trivial call to ffi.typeof() returns a new type from the JIT perspective. This means that hypothetically if a function were using multiple freelist objects then the JIT would see them all as distinct datatypes and the type guards in generated code would often fail and cause branches off the fast path.

This can be avoided by hoisting the ffi.typeof() call up to the module level.

I don't think this would actually happen today with the code as written - we're only working on one freelist - but if you were already thinking of simplifying the type then this would be another reason to do so.

This whole "FFI ctype diversity" JIT issue is something that @alexandergall discovered relatively recently in #612.

@lukego
Copy link
Member

lukego commented Feb 23, 2016

@wingo will you be upstream for this one?

@wingo
Copy link
Contributor

wingo commented Feb 23, 2016

avec plaisir :)

@wingo wingo self-assigned this Feb 23, 2016
@wingo
Copy link
Contributor

wingo commented Feb 23, 2016

There are a number of modules that require the freelist module:

./core/packet.lua:8:local freelist = require("core.freelist")
./apps/intel/intel_app.lua:10:local freelist = require("core.freelist")
./apps/basic/basic_apps.lua:4:local freelist = require("core.freelist")
./apps/solarflare/solarflare.lua:4:local freelist = require("core.freelist")
./lib/virtio/virtq_device.lua:6:local freelist  = require("core.freelist")
./lib/virtio/net_device.lua:6:local freelist  = require("core.freelist")
./program/snabbmark/snabbmark.lua:8:local freelist      = require("core.freelist")

But AFAICT only the packet module uses it. @xrme, if you have no plans to use the freelist module more widely, would you mind inlining the freelist module into packet.lua and removing freelist.lua? Then we can get rid of the ffi.typeof of the tagless struct to avoid this hazard in the future. Probably you will have to update the documentation as well, but it will just be deletion. Thanks!

@xrme
Copy link
Contributor Author

xrme commented Feb 24, 2016

The apparent performance regression is bothering me.

I don't think I understand the comment by @nnikolaev-virtualopensystems regarding the alignment of the list vector. Is there some reason it should be more than 8-byte aligned?

@ghost
Copy link

ghost commented Feb 24, 2016

When I access element number 1 from the list, I prefer that the next couple of elements are also fetched in the cacheline (which is 64 bytes on Intel's Xeon). So 64 byte alignment of performance sensitive structures might be crucuial in some cases.

@wingo
Copy link
Contributor

wingo commented Feb 24, 2016

Regarding the perf regression, I only see that on the snabbnfv-iperf-1500 case, and then only sometimes. That case seems to have a large standard deviation so perhaps that's normal. I don't suspect this change as having significant perf impact.

@ghost
Copy link

ghost commented Feb 24, 2016

But maybe we can gain some performance here?

int nfree, max;
struct packet *list[?];
};
]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing the refactor, looks great :) I have two more nits before landing:

  1. Would you mind changing to unsigned integers please, unless you have a reason to use signed integers
  2. I wonder about "struct freelist" -- I can see this name as easily colliding with a name from some other user's module, struct tags being global and all. Unless you need this struct to have a tag, in this case local freelist_t = ffi.typeof[[struct { ... }]] is more appropriate. In that case I would change the ffi.new call below to be local packets_fl = freelist_t(max_packets, 0, max_packets).

Thanks!

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some quick basic benchmark testing and got slightly better results with:

ffi.cdef[[
struct freelist {
    uint64_t nfree, max;
    __attribute__((aligned(64)))
    struct packet *list[?];
};
]]

I understand that it is probably annoying to insist here - but this and struct packet are the most used (at runtime) data structures in the project, so I believe every detail here is important. So if this is going upstream, IMHO we should look at it very carefully.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Would you mind quantifying what you did to benchmark this change?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for i in {1..10}; do sudo ./bench/basic1-100e6; done
Without the patch:

27.6
29.5
29.4
29.4
29.0
29.2
29.4
29.4
29.3
28.5

With the patch:

28.3
30.0
29.9
29.8
30.0
29.8
29.3
30.0
30.0
29.9

I know, it's not much. But the basic1 benchmark has only 3 links to process. I believe the effect can (and will) be multiplied once you scale your configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me, only thing is I'm not sure that align((64)) works with ffi.new. It certainly wouldn't work if it were just malloc. Incidentally, why do you consider alignment to be important if we are always working on the back of the list, a dynamic location which will likely be unaligned? I also note that it's not the number of links in the network but the number of places the packet ingresses and egresses from Snabb. For me I leave it up to @xrme to determine whether he wants to incorporate a change like this, I won't block the commit because of it, though I'm happy to accept follow-ups :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 10-minute glance at the LuaJIT source tree suggests to me that it probably does, but I haven't absolutely verified it.
http://luajit.org/ext_ffi_semantics.html mentions the alignment attributes LuaJIT accepts. The mechanism looks to be CTA_ALIGN (defined in lj_ctype.h, set in lj_cparse.c). Then, in lib_ffi.c, LJLIB_CF(ffi_new) calls lj_cdata_newx (which is in lj_cdata.c), which in turn calls lj_cdata_newv, which in turn says /* Allocate variable-sized or specially aligned C data object. */
TL;DR: I haven't tested it, but the LuaJIT code certainly makes it look plausible.

@wingo
Copy link
Contributor

wingo commented Feb 24, 2016

@nnikolaev-virtualopensystems regarding alignment, since on this freelist we're always reading and writing on the back, and we expect the freelist to never be empty, I don't see an immediate way that differing alignment can help; do you? Happy to learn :)

Regarding perf, I'm always happy for more perf :) but @xrme is in the middle of a refactor to pull multiple packets off the freelist at a time, from assembly, promising some significant gains; with that in mind I think it's OK to keep the focus on simplicity for now :)

@lukego
Copy link
Member

lukego commented Feb 24, 2016

The apparent performance regression is bothering me.

This is bothering me too. I want a well-defined way to reach definitive conclusions on these issues. So that we could say with certainty what is the impact of the FFI freelist and of different possible alignments. I am feeling around for steps we could take in this direction.

The basic1 benchmark should be the most sensitive to freelist performance. It does something like 100M packet freelist operations per second. (The iperf-1500 benchmark is actually dominated by totally different factors e.g. SIMD checksum routine performance which is more reason to doubt the significance of that result.)

I compared 100 runs on this branch with 100 runs of the master branch on lugano-1 and collected the results in a spreadsheet. I see 0.9% difference. I have not done an analysis of variance to determine whether this is more likely due to background variation (significant) vs the code change, I am not quite at this level of sophisticated yet, but I would suggest that the difference seems very small indeed and not likely something we should worry about here and now.

I want to start doing more thorough A/B performance evaluations of release candidates on the next branch so that we can look at the sum total impact of the changes we have merged and then go digging in more detail if we find significant regressions. It's probably overkill to do this much analysis on individual PRs.

@xrme
Copy link
Contributor Author

xrme commented Feb 24, 2016

I wonder about "struct freelist" -- I can see this name as easily colliding with a name from some other user's module, struct tags being global and all. Unless you need this struct to have a tag...

DynASM has a feature where you can say

|.type link, struct link

-- Emit code to receive packet pointer into register p
-- from the ring buffer pointer in register link.
-- i is a temp register
local function emit_receive(p, link, i)
   local mask = (C.LINK_RING_SIZE - 1)
   | mov Rd(i), link:Rq(link)->read
   -- assumes packets[] is first in struct link
   -- could use displacement of offsetof(link,packets)
   | mov Rq(p), [Rq(link)+Rq(i)*8]
   | add Rd(i), 1
   | and Rd(i), mask
   | mov dword link:Rq(link)->read, Rd(i)
end

See http://corsix.github.io/dynasm-doc/reference.html#_type

I'd like to be able to use notation like mov Rd(i), link:Rq(link)->read, and I think I need struct packet to exist to do that.

@wingo
Copy link
Contributor

wingo commented Feb 24, 2016

Neat feature, thanks for explaining it! I had no idea :)

wingo added a commit to wingo/snabb that referenced this pull request Feb 24, 2016
@wingo wingo added the merged label Feb 24, 2016
@wingo
Copy link
Contributor

wingo commented Feb 24, 2016

Merged. If you have anything you want added, let's do it in another PR. Cheers :)

@lukego
Copy link
Member

lukego commented Feb 27, 2016

For an exercise I made a CSV file from Nikolay's data:

$ cat nn.csv
base,patched
27.6,28.3
29.5,30.0
29.4,29.9
29.4,29.8
29.0,30.0
29.2,29.8
29.4,29.3
29.4,30.0
29.3,30.0
28.5,29.9

and wrote an R one-liner to summarize that:

$ R
> summary(read.csv(file='nn.csv'))
      base          patched
 Min.   :27.60   Min.   :28.3
 1st Qu.:29.05   1st Qu.:29.8
 Median :29.35   Median :29.9
 Mean   :29.07   Mean   :29.7
 3rd Qu.:29.40   3rd Qu.:30.0
 Max.   :29.50   Max.   :30.0

and then I reformatted the data into another format, which I am sure could have been done within the R script instead:

$ head -3 nikolay.csv 
code,Mpps
base,27.6
base,29.5

so that I can easily ask R for a post-hoc analysis using Tukey's test:

$ R
> nn <- read.csv(file='nikolay.csv')
> df <- data.frame(code=nn$code, Mpps=nn$Mpps)
> TukeyHSD(aov(Mpps ~ code, data=df))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Mpps ~ code, data = df)

$code
             diff        lwr      upr     p adj
patched-base 0.63 0.09845798 1.161542 0.0227726

which tells us that the measured average difference from patched to base is +0.63 Mpps and we are 95% confident that the real difference is between +0.1 Mpps and +1.16 Mpps. (lwr and upr are the lower and upper bounds that we are 95% confident of.)

This says to me that the change looks promising but it would be interesting to run with more measurements so that we can say with 95% confidence that the difference is more than 0.1%.

Nikolay if you want to submit that change on its own PR I would be happy to poke around at it as an exercise in data analysis.

This was referenced Apr 26, 2016
@eugeneia eugeneia merged commit 28593e6 into snabbco:master May 12, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants