An API for Cache Size(s) #108
I'd rather we just use cache-oblivious algorithms where possible. These generally lead to cleaner code than blocked algorithms, exploit multiple levels of cache, and often have comparable performance to straightforward blocking implementations (assuming they are implemented properly, i.e. you have a large enough base case).
Yes, I second that. Tuning code to cache sizes tends to be pretty brittle and only beats cache-oblivious algorithms if you truly go to town and optimize for a very particular architecture. It may be worth doing this in a BLAS, for example, but I think that if we can get close without the brittleness, it's a win.
I actually first tried to implement the algorithms in TensorOperations.jl using cache-oblivious algorithms. While I was already able to create some speed-up for my version of e.g. permutedims, there certainly was some overhead. I used recursive algorithms because I did not see an easy way of implementing this within a normal iterative approach, and I also had to tweak a little to get the recursive approach to work in combination with the Cartesian macros. So the term "properly implemented" probably did not apply to my code. I can give this a new try. How would you choose the size of the base case? Using the size of the level 1 cache? 😄

I also have some further questions about how cache-friendly algorithms are currently approached in the Julia Base library. In e.g. Base.transpose! I see the following: first the value sqrthalfcache = 1<<7 is defined, which assumes that the level 1 cache size is 32k (1<<15). But then the blocksize is defined as ifloor(sqrthalfcache/elsz/1.4), meaning that the actual memory size of one block will be elsz * blocksize^2 = sqrthalfcache^2 / elsz / 2 = cachesize / elsz / 4 (assuming 1.4 ≈ sqrt(2)). The total cache taken by this algorithm (if I need to count both the source block and the destination block) is 2 * elsz * blocksize^2 = cachesize / elsz / 2. So there is definitely one division by elsz too many in this approach, and there seems to be an overall safety factor of 2, which seems like a lot (definitely much more than in generic_matmatmul).

I guess my main question is: in something like transpose or permutedims, do I only need to count the data that is read (the source array) as occupying the cache, or also the data that is simply written (the destination array)?
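For concreteness, the arithmetic above can be written out as follows (a sketch with an assumed 8-byte element size; not the actual Base.transpose! source):

```julia
# Sketch of the blocking arithmetic described above (assumed values; not the
# actual Base.transpose! source).
cachesize = 1 << 15                                  # assumed 32k L1 cache
sqrthalfcache = 1 << 7                               # sqrt(cachesize / 2)
elsz = 8                                             # assumed element size (Float64)
blocksize = floor(Int, sqrthalfcache / elsz / 1.4)   # 1.4 ≈ sqrt(2)
oneblock  = elsz * blocksize^2                       # ≈ cachesize / elsz / 4 bytes
twoblocks = 2 * oneblock                             # src + dest ≈ cachesize / elsz / 2
```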
I think there's a big difference between providing this information (which would be for external use as well as for Julia) and mandating that all Julia algorithms use it! Provide the best solution you can. If that doesn't use cache size, cool. But saying that this information "should not" be provided so that people are forced to use cache-oblivious algorithms is needlessly authoritarian. If cache-oblivious is best, it should be used whether this is here or not. This API can remain for other cases. In other words: whether or not you use cache-oblivious algorithms is completely orthogonal to this API.
I guess that's a fair point. Sometimes cache-oblivious algorithms are not known, or one can't be bothered to come up with one and just wants to know the cache size. If people are going to want to know that, we really ought to provide one good standard way of finding out, rather than forcing everyone to invent their own half-assed way of figuring out cache sizes.
Yes, these are two different issues. We should certainly use cache-oblivious algorithms where possible, but also provide a standard way to figure out cache sizes.
@Jutho, the size of the base case is chosen to be roughly the smallest size that is big enough to amortize the recursion overhead. This is usually much smaller than any realistic L1 cache. (See, for example, how cache-oblivious transpositions are implemented in FFTW.)
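For illustration, a minimal sketch of such a cache-oblivious transpose in Julia (BASE = 16 is a placeholder cutoff chosen to amortize recursion overhead, not a tuned or cache-derived value):

```julia
# Minimal cache-oblivious transpose sketch: recursively split the longer
# dimension until the tile is small enough, then use plain loops.
const BASE = 16   # placeholder base-case cutoff, not derived from any cache size

function cotranspose!(B, A, i0=1, i1=size(A, 1), j0=1, j1=size(A, 2))
    m, n = i1 - i0 + 1, j1 - j0 + 1
    if m <= BASE && n <= BASE
        # Base case: plain loops over a small tile.
        for j in j0:j1, i in i0:i1
            B[j, i] = A[i, j]
        end
    elseif m >= n
        # Split the longer dimension and recurse on both halves.
        im = i0 + m ÷ 2 - 1
        cotranspose!(B, A, i0, im, j0, j1)
        cotranspose!(B, A, im + 1, i1, j0, j1)
    else
        jm = j0 + n ÷ 2 - 1
        cotranspose!(B, A, i0, i1, j0, jm)
        cotranspose!(B, A, i0, i1, jm + 1, j1)
    end
    return B
end
```

A caller would allocate the destination itself, e.g. cotranspose!(similar(A, size(A, 2), size(A, 1)), A).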
The hwloc library (http://www.open-mpi.org/projects/hwloc/) provides an API to identify cores, caches, and memory sizes, and is very portable. (OpenMPI uses it.) hwloc generates a tree that describes the hardware, and there is a C API to walk/query the tree. In Julia, this tree could be translated directly to a Julia tree structure, which may be easier to examine in Julia code. As an added bonus, hwloc also identifies PCI devices, accelerators, etc., depending on how it is configured.
That would be really handy. It might make sense to just include it in Base Julia if we use it – the dynamic library it builds is only 172K on my system, which is pretty small compared to the rest of Julia.
+1. I'd never heard of hwloc before, but after looking at it for a few minutes it appears very useful, well-behaved, and straightforward to build. They are even nice enough to provide Win32 and Win64 MinGW-compiled binaries (they have a nice nickname for them: winballs), which would make packaging via BinDeps really easy. Wrapping their C API in Julia code would be a great project.
I really like the term "winballs".
As mentioned above, hwloc is a library that can determine the local cache sizes. See https://github.com/eschnett/hwloc.jl, although the API for accessing cache sizes is still very low-level.
This would be how to obtain the L1 data cache size via hwloc:
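(A sketch under assumptions: the traversal follows the hwloc object model, but the exact hwloc.jl names used below, such as topology_load, type_, attr, and children, may not match the package's current API.)

```julia
# Walk the hwloc hardware tree depth-first and return the size of the first
# level-1 cache object found. Field names here are assumptions based on the
# hwloc object model, not a verified hwloc.jl API.
using Hwloc

function l1dcache_size(obj)
    if obj.type_ == :Cache && obj.attr.depth == 1   # a level-1 cache object
        return obj.attr.size                        # cache size in bytes
    end
    for child in obj.children                       # recurse into subtree
        sz = l1dcache_size(child)
        sz !== nothing && return sz
    end
    return nothing
end

topology = Hwloc.topology_load()                    # root of the hardware tree
println("L1 data cache: ", l1dcache_size(topology), " bytes")
```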
From JuliaLang/julia#6517 - Currently the cache size information is independently hardwired in several methods throughout Julia Base, e.g. in Base.transpose! and in Base.LinAlg.generic_matmatmul, and now also in my implementation of permutedims! (actually in the auxiliary function blockdims). It would probably make sense to centralise this information and maybe even to try to get it from the system (or to set it at build time) instead of fixing it at 32k. - @Jutho
There isn't a new issue to discuss this, but there probably should be. L1 and L2 cache sizes would be nice to have available, but it may require some effort to get this right across OSes and processor types. - @ViralBShah
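For the sake of discussion, a hypothetical sketch of what such a centralised query point could look like (all names here are invented for illustration; the 32k constant is only a fallback, not a detected value):

```julia
# Hypothetical central query point: callers such as transpose! or
# generic_matmatmul would ask this instead of hardwiring 32k locally.
module CacheSizes

const DEFAULT_L1 = 1 << 15   # 32k fallback when detection is unavailable

# Could be set at build time, or replaced by an hwloc-based query.
l1cachesize() = DEFAULT_L1

end # module

# Example caller: derive a block size from the centralised value.
blocksize(elsz) = floor(Int, sqrt(CacheSizes.l1cachesize() ÷ 2) / elsz)
```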