
An API for Cache Size(s) #108

Open
andrewcooke opened this issue Apr 26, 2014 · 13 comments
Labels
help wanted Extra attention is needed

Comments

@andrewcooke

From JuliaLang/julia#6517 - Currently the cache size information is independently hardwired in several methods throughout Julia Base, e.g. in Base.transpose! and in Base.LinAlg.generic_matmatmul, and now also in my implementation of permutedims! (actually in the auxiliary function blockdims). It would probably make sense to centralise this information and maybe even to try to get it from the system (or to set it at build time) instead of fixing it at 32k. - @Jutho

There isn't a new issue to discuss this, but there probably should be. L1 and L2 cache sizes would be nice to have available, but it may require some effort to get this right across OSes and processor types. - @ViralBShah
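A centralised API might look something like the following sketch. All names here are hypothetical (nothing like this exists in Base yet); the 32k and 256k defaults simply mirror the kind of values currently hardwired in individual methods, and would ideally be replaced by values detected from the system or at build time:

```julia
# Hypothetical sketch of a centralised cache-size API.
# The defaults below are fallback assumptions, not detected values.
const _DEFAULT_CACHE_SIZES = (32 * 1024, 256 * 1024)  # L1, L2 in bytes

cachesize(level::Integer=1) = _DEFAULT_CACHE_SIZES[level]

# Callers such as transpose! or generic_matmatmul could then derive
# block sizes from one shared source of truth instead of local
# magic numbers like 1<<7:
blockelems(::Type{T}, level::Integer=1) where {T} = cachesize(level) ÷ sizeof(T)
```

For example, `blockelems(Float64)` would report how many Float64 elements fit in the assumed L1 cache.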

@stevengj
Member

I'd rather we just use cache-oblivious algorithms where possible. These generally lead to cleaner code than blocked algorithms, exploit multiple levels of cache, and often have comparable performance to straightforward blocking implementations (assuming they are implemented properly, i.e. you have a large enough base case).
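For concreteness, a minimal sketch of the cache-oblivious approach applied to an out-of-place transpose (this is illustrative only, not Base's implementation): recurse on the longer dimension until the block is small enough that plain loops amortize the recursion overhead, with `BASE` as an assumed tuning constant.

```julia
# Minimal cache-oblivious transpose sketch: B gets the transpose of A.
# BASE is an assumed base-case side length, chosen to amortize recursion
# overhead, not to match any particular cache size.
const BASE = 16

function co_transpose!(B, A, i0=1, i1=size(A, 1), j0=1, j1=size(A, 2))
    m, n = i1 - i0 + 1, j1 - j0 + 1
    if m <= BASE && n <= BASE
        for j in j0:j1, i in i0:i1      # small block: plain loops
            B[j, i] = A[i, j]
        end
    elseif m >= n                       # split the longer dimension
        mid = i0 + m ÷ 2 - 1
        co_transpose!(B, A, i0, mid, j0, j1)
        co_transpose!(B, A, mid + 1, i1, j0, j1)
    else
        mid = j0 + n ÷ 2 - 1
        co_transpose!(B, A, i0, i1, j0, mid)
        co_transpose!(B, A, i0, i1, mid + 1, j1)
    end
    return B
end

A = rand(100, 37)
B = Matrix{Float64}(undef, 37, 100)
co_transpose!(B, A)   # B now equals permutedims(A)
```

Because the recursion halves the working set at every level, the blocks eventually fit in every level of cache without the algorithm knowing any cache size.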

@StefanKarpinski
Member

Yes, I second that. Tuning code to cache sizes tends to be pretty brittle and only beats cache-oblivious algorithms if you truly go to town and optimize for a very particular architecture. It may be worth doing this in a BLAS, for example, but I think that if we can get close without the brittleness, it's a win.

@Jutho
Contributor

Jutho commented Apr 26, 2014

I actually first tried to implement the algorithms in TensorOperations.jl using cache-oblivious algorithms. While I was already able to get some speed-up for my version of e.g. permutedims, there certainly was some overhead.

I used recursive algorithms because I did not see an easy way of implementing this with a normal iterative approach. I also had to tweak things a little to get the recursive approach to work in combination with the Cartesian macros, so the term "properly implemented" probably did not apply to my code. I can give this a new try.

How would you choose the size of the base case? Using the size of the level 1 cache ? 😄

I also had some further questions about how cache-friendly algorithms are currently approached in the Julia Base library. In e.g. Base.transpose! I see the following:

First the value sqrthalfcache = 1<<7 is defined, which assumes that the L1 cache size is 32k (1<<15). But then the block size is defined as ifloor(sqrthalfcache/elsz/1.4), meaning that the actual memory size of one block will be elsz * blocksize^2 = sqrthalfcache^2 / elsz / 2 = cachesize / elsz / 4 (assuming 1.4 stands for sqrt(2)). The total cache taken by this algorithm (if I need to count both the source block and the destination block) is 2 * elsz * blocksize^2 = cachesize / elsz / 2. So there is definitely one division by elsz too many in this approach, and there seems to be an overall safety factor of 2, which seems like a lot (definitely much more than in generic_matmatmul).

In the end, a non-blocking strategy for transpose! is used if the matrices are smaller than 4*blocksize^2, in which case the total data size handled by the algorithm (counting source and destination) is 2*cachesize/elsz, which would be double the size of the cache, were it not for the erroneous extra division by elsz.
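The arithmetic above can be checked numerically. This sketch uses the constants quoted from transpose! (with `floor(Int, ...)` in place of the old `ifloor`) and Float64 elements, and shows that one block's footprint is roughly cachesize/elsz/4, i.e. the extra division by elsz that the paragraph points out:

```julia
# Walk through the block-size arithmetic quoted above, assuming the
# hardwired 32 KiB L1 cache and elsz = sizeof(Float64) = 8 bytes.
cachesize = 1 << 15                  # 32768 bytes, the hardwired assumption
sqrthalfcache = 1 << 7               # 128 = sqrt(cachesize / 2)
elsz = 8                             # sizeof(Float64)
blocksize = floor(Int, sqrthalfcache / elsz / 1.4)   # 1.4 ≈ sqrt(2)

bytes_per_block = elsz * blocksize^2         # footprint of one block
target = cachesize ÷ (4 * elsz)              # cachesize / elsz / 4

println((blocksize, bytes_per_block, target))
```

Here blocksize comes out as 11, so one block occupies 968 bytes against a "target" of 1024 bytes, i.e. cachesize/elsz/4 up to flooring, rather than the presumably intended cachesize/4.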

I guess my main question is: in something like transpose or permutedims, do I only need to count the data that is read (the source array) as occupying the cache, or also the data that is simply written (the dest array)?

@andrewcooke
Author

i think there's a big difference between providing this information (which would be for external use as well as for julia) and mandating that all julia algorithms use it!

provide the best solution you can. if that doesn't use cache size, cool. but saying that this information "should not" be provided so that people are forced to use cache-oblivious algorithms is needlessly authoritarian. if cache-oblivious is best, it should be used whether this is here or not. this api can remain for other cases.

in other words - whether or not you use cache oblivious algorithms is completely orthogonal to this api.

@StefanKarpinski
Member

I guess that's a fair point. Sometimes cache-oblivious algorithms are not known or one can't be bothered to come up with one and just wants to know the cache size. If people are going to want to know that, we really ought to provide one good standard way of finding out, rather than forcing everyone to invent their own half-assed way of figuring out cache sizes.

@ViralBShah
Member

Yes, these are two different issues. We should certainly use cache oblivious algorithms where possible, but also provide a standard way to figure out cache sizes.

@stevengj
Member

@Jutho, the size of the base case is chosen to be roughly the smallest size that is big enough to amortize the recursion overhead. This is usually much smaller than any realistic L1 cache.

(See, for example, how cache-oblivious transpositions are implemented in FFTW.)

@eschnett
Contributor

The hwloc library http://www.open-mpi.org/projects/hwloc/ provides an API to identify cores, caches, and memory sizes, and is very portable. (OpenMPI uses it.)

hwloc generates a tree that describes the hardware, and there is a C API to walk/query it. In Julia, this could be translated directly into a Julia tree structure, which may be easier to examine from Julia code.

As an added bonus, hwloc also identifies PCI devices, accelerators, etc., depending on how it is configured.

@StefanKarpinski
Member

That would be really handy. It might make sense to just include it in Base Julia if we use it – the dynamic library it builds is only 172K on my system, which is pretty small compared to the rest of Julia.

@tkelman

tkelman commented Aug 28, 2014

+1, never heard of hwloc before but looking at it for a few minutes, appears very useful, well-behaved, straightforward to build. They are even nice enough to provide Win32 and Win64 MinGW-compiled binaries (they have a nice nickname for them - winballs) which would make packaging via BinDeps really easy. Wrapping their C API in Julia code would be a great project.

@StefanKarpinski
Member

I really like the term winballs :-)

@eschnett
Contributor

As mentioned above, hwloc is a library that finds out the local cache sizes. See https://github.com/eschnett/hwloc.jl, although the API for accessing cache sizes is still very low-level.

@eschnett
Contributor

This would be how to obtain the L1 data cache size via hwloc:

import hwloc
topology = hwloc.topology_load()
l1cache = first(filter(t->t.type_==:Cache && t.attr.depth==1, topology)).attr
println("L1 cache information: $l1cache")

which prints:

L1 cache information: Cache{size=32768,depth=1,linesize=64,associativity=0,type=Data}


8 participants