-
Notifications
You must be signed in to change notification settings - Fork 298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement freelist with FFI cdata object instead of Lua table #771
Conversation
This enables assembly language code to access freelists in a fairly straightforward manner.
By analyzing the blame information on this pull request, we identified @lukego and @nnikolaev-virtualopensystems to be potential reviewers |
-- XXX Better LuaJIT idiom for specifying the array type? | ||
list = ffi.new(type.."[?]", size) } | ||
local element_ct = ffi.typeof(type) | ||
local fl_type = ffi.typeof("struct { int nfree, max; $ list[?]; }", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why ints for free
and max
? Should it be a 64bit type?
What about cache alignment? I'd personally prefer list
to start on an aligned address.
Did you measure the performance impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Performance impact should be measured automatically by the CI. If this testing is not sufficient then we should improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: You can see the performance regression tests at the top of the results page that is linked from the little green tick / red cross next to the commit.
Here is the snippet from this PR:
Checking for performance regressions:
BENCH basic1-100e6 -> 0.967078 of 14.58 (SD: 0.0979796 )
BENCH packetblaster-64 -> 0.997168 of 10.592 (SD: 0.0292575 )
BENCH packetblaster-synth-64 -> 0.991587 of 10.698 (SD: 0.00979796 )
BENCH snabbnfv-iperf-1500 -> 0.899233 of 5.478 (SD: 0.326399 )
BENCH snabbnfv-iperf-jumbo -> 0.997787 of 6.326 (SD: 0.191896 )
BENCH snabbnfv-loadgen-dpdk -> 0.991281 of 2.6838 (SD: 0.0322639 )
@eugeneia this actually looks suspicious to me. Is SnabbBot being quite conservative about flagging problems with there is high variance in results? If I am reading this right then the iperf-1500 result had around -10% for the mean value but this is accepted due to high variance / error bars i.e. likely false negative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to above comments, but related to the line in question: If there is ever more than one freelist used by a function, then this switch could cause branchy traces as each ffi.typeof('struct {}') will create a fresh type. Makes me nervous!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lukego Standard deviation is only printed and not used in any equation. Any result below 85% of master results is considered a failure. Might be a ridiculous margin, I had no idea of what would be a good margin at the time of picking this number. If I remember correctly 0.85 was chosen because it yielded an acceptable false positive rate. If I were to revisit that code, I would not calculate average at all and use minimum instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eugeneia Thanks for explaining. One day we can revisit this once we also have a better handle on the whole "scientific testing" concept. For now it is probably sufficient that we check more closely before making a release. I would not like to ship even a 1% performance regression if we can avoid it but I understand that it is non-trivial to verify this (particularly on a reasonable processing budget).
This sounds like a good idea in light of @wingo's observation. LuaJIT generates code that is very type-specialized and every non-trivial call to This can be avoided by hoisting the I don't think this would actually happen today with the code as written - we're only working on one freelist - but if you were already thinking of simplifying the type then this would be another reason to do so. This whole "FFI ctype diversity" JIT issue is something that @alexandergall discovered relatively recently in #612. |
@wingo will you be upstream for this one? |
avec plaisir :) |
There are a number of modules that require the freelist module:
But AFAICT only the packet module uses it. @xrme, if you have no plans to use the freelist module more widely, would you mind inlining the freelist module into packet.lua and removing freelist.lua? Then we can get rid of the ffi.typeof of the tagless struct to avoid this hazard in the future. Probably you will have to update the documentation as well, but it will just be deletion. Thanks! |
The apparent performance regression is bothering me. I don't think I understand the comment by @nnikolaev-virtualopensystems regarding the alignment of the |
When I access element number 1 from the list, I prefer that the next couple of elements are also fetched in the cacheline (which is 64 bytes on Intel's Xeon). So 64 byte alignment of performance sensitive structures might be crucuial in some cases. |
Regarding the perf regression, I only see that on the snabbnfv-iperf-1500 case, and then only sometimes. That case seems to have a large standard deviation so perhaps that's normal. I don't suspect this change as having significant perf impact. |
But maybe we can gain some performance here? |
int nfree, max; | ||
struct packet *list[?]; | ||
}; | ||
]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing the refactor, looks great :) I have two more nits before landing:
- Would you mind changing to unsigned integers please, unless you have a reason to use signed integers
- I wonder about "struct freelist" -- I can see this name as easily colliding with a name from some other user's module, struct tags being global and all. Unless you need this struct to have a tag, in this case
local freelist_t = ffi.typeof[[struct { ... }]]
is more appropriate. In that case I would change the ffi.new call below to belocal packets_fl = freelist_t(max_packets, 0, max_packets)
.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some quick basic benchmark testing and got slightly better results with:
ffi.cdef[[
struct freelist {
uint64_t nfree, max;
__attribute__((aligned(64)))
struct packet *list[?];
};
]]
I understand that it is probably annoying to insist here - but this and struct packet
are the most used (at runtime) data structures in the project, so I believe every detail here is important. So if this is going upstream, IMHO we should look at it very carefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! Would you mind quantifying what you did to benchmark this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for i in {1..10}; do sudo ./bench/basic1-100e6; done
Without the patch:
27.6
29.5
29.4
29.4
29.0
29.2
29.4
29.4
29.3
28.5
With the patch:
28.3
30.0
29.9
29.8
30.0
29.8
29.3
30.0
30.0
29.9
I know, it's not much. But the basic1 benchmark has only 3 links to process. I believe the effect can (and will) be multiplied once you scale your configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, only thing is I'm not sure that align((64)) works with ffi.new. It certainly wouldn't work if it were just malloc. Incidentally, why do you consider alignment to be important if we are always working on the back of the list, a dynamic location which will likely be unaligned? I also note that it's not the number of links in the network but the number of places the packet ingresses and egresses from Snabb. For me I leave it up to @xrme to determine whether he wants to incorporate a change like this, I won't block the commit because of it, though I'm happy to accept follow-ups :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 10-minute glance at the LuaJIT source tree suggests to me that it probably does, but I haven't absolutely verified it.
http://luajit.org/ext_ffi_semantics.html mentions the alignment attributes LuaJIT accepts. The mechanism looks to be CTA_ALIGN (defined in lj_ctype.h, set in lj_cparse.c). Then, in lib_ffi.c, LJLIB_CF(ffi_new) calls lj_cdata_newx (which is in lj_cdata.c), which in turn calls lj_cdata_newv, which in turn says /* Allocate variable-sized or specially aligned C data object. */
TL;DR: I haven't tested it, but the LuaJIT code certainly makes it look plausible.
@nnikolaev-virtualopensystems regarding alignment, since on this freelist we're always reading and writing on the back, and we expect the freelist to never be empty, I don't see an immediate way that differing alignment can help; do you? Happy to learn :) Regarding perf, I'm always happy for more perf :) but @xrme is in the middle of a refactor to pull multiple packets off the freelist at a time, from assembly, promising some significant gains; with that in mind I think it's OK to keep the focus on simplicity for now :) |
This is bothering me too. I want a well-defined way to reach definitive conclusions on these issues. So that we could say with certainty what is the impact of the FFI freelist and of different possible alignments. I am feeling around for steps we could take in this direction. The I compared 100 runs on this branch with 100 runs of the master branch on I want to start doing more thorough A/B performance evaluations of release candidates on the |
DynASM has a feature where you can say
See http://corsix.github.io/dynasm-doc/reference.html#_type I'd like to be able to use notation like |
Neat feature, thanks for explaining it! I had no idea :) |
Merged. If you have anything you want added, let's do it in another PR. Cheers :) |
For an exercise I made a CSV file from Nikolay's data:
and wrote an R one-liner to summarize that:
and then I reformatted the data into another format, which I am sure could have been done within the R script instead:
so that I can easily ask R for a post-hoc analysis using Tukey's test:
which tells us that the measured average difference from This says to me that the change looks promising but it would be interesting to run with more measurements so that we can say with 95% confidence that the difference is more than 0.1%. Nikolay if you want to submit that change on its own PR I would be happy to poke around at it as an exercise in data analysis. |
This (tiny) commit changes the representation of freelists from a Lua table to an ffi cdata object.
This is useful because it lets assembly language code access freelists in a straightforward way.
I believe I figured out the correct LuaJIT FFI notation for parameterizing the type of object that the freelist will hold, but I don't know if it is worth keeping that ability around. The only things that are ever freelisted are packets. It therefore might make sense to remove the type parameterization as an unnecessary abstraction.