An API for Cache Size(s) #108
I'd rather we just use cache-oblivious algorithms where possible. These generally lead to cleaner code than blocked algorithms, exploit multiple levels of cache, and often have comparable performance to straightforward blocking implementations (assuming they are implemented properly, i.e. you have a large enough base case).
Yes, I second that. Tuning code to cache sizes tends to be pretty brittle and only beats cache-oblivious algorithms if you truly go to town and optimize for a very particular architecture. It may be worth doing this in a BLAS, for example, but I think that if we can get close without the brittleness, it's a win.
I actually first tried to implement the algorithms in TensorOperations.jl using cache-oblivious algorithms. While I was already able to create some speed-up for my version of e.g. permutedims, there certainly was some overhead. I used recursive algorithms because I did not see an easy way of implementing this within a normal iterative approach, and I also had to tweak a little to get the recursive approach to work in combination with the Cartesian macros. So the term "properly implemented" probably did not apply to my code. I can give this a new try. How would you choose the size of the base case? Using the size of the level 1 cache? 😄

I also have some further questions about how cache-friendly algorithms are currently approached in the Julia Base library. In e.g. Base.transpose! I see the following: first the value sqrthalfcache = 1<<7 is defined, which assumes that the level 1 cache size is 32k (1<<15). But then the blocksize is defined as ifloor(sqrthalfcache/elsz/1.4), meaning that the actual memory size of one block will be elsz * blocksize^2 = sqrthalfcache^2 / elsz / 2 = cachesize / elsz / 4 (assuming 1.4 ≈ sqrt(2)). The total cache taken by this algorithm (if I need to count both the source block and the destination block) is 2 * elsz * blocksize^2 = cachesize / elsz / 2. So there is definitely one division by elsz too many in this approach, and there seems to be an overall safety factor of 2, which seems like a lot (definitely much more than in generic_matmatmul).

I guess my main question is: in something like transpose or permutedims, do I only need to count the data that is read (the source array) as occupying the cache, or also the data that is simply written (the destination array)?
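For concreteness, the arithmetic above can be written out as follows (a sketch with an assumed 8-byte element size; not the actual Base.transpose! source):

```julia
# Sketch of the blocking arithmetic described above (assumed values; not the
# actual Base.transpose! source).
cachesize = 1 << 15                                  # assumed 32k L1 cache
sqrthalfcache = 1 << 7                               # sqrt(cachesize / 2)
elsz = 8                                             # assumed element size (Float64)
blocksize = floor(Int, sqrthalfcache / elsz / 1.4)   # 1.4 ≈ sqrt(2)
oneblock  = elsz * blocksize^2                       # ≈ cachesize / elsz / 4 bytes
twoblocks = 2 * oneblock                             # src + dest ≈ cachesize / elsz / 2
```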
I think there's a big difference between providing this information (which would be for external use as well as for Julia) and mandating that all Julia algorithms use it! Provide the best solution you can. If that doesn't use cache size, cool. But saying that this information "should not" be provided so that people are forced to use cache-oblivious algorithms is needlessly authoritarian. If cache-oblivious is best, it should be used whether this is here or not. This API can remain for other cases. In other words: whether or not you use cache-oblivious algorithms is completely orthogonal to this API.
I guess that's a fair point. Sometimes cache-oblivious algorithms are not known, or one can't be bothered to come up with one and just wants to know the cache size. If people are going to want to know that, we really ought to provide one good standard way of finding out, rather than forcing everyone to invent their own half-assed way of figuring out cache sizes.
Yes, these are two different issues. We should certainly use cache-oblivious algorithms where possible, but also provide a standard way to figure out cache sizes.
@Jutho, the size of the base case is chosen to be roughly the smallest size that is big enough to amortize the recursion overhead. This is usually much smaller than any realistic L1 cache. (See, for example, how cache-oblivious transpositions are implemented in FFTW.)
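For illustration, a minimal sketch of such a cache-oblivious transpose in Julia (BASE = 16 is a placeholder cutoff chosen to amortize recursion overhead, not a tuned or cache-derived value):

```julia
# Minimal cache-oblivious transpose sketch: recursively split the longer
# dimension until the tile is small enough, then use plain loops.
const BASE = 16   # placeholder base-case cutoff, not derived from any cache size

function cotranspose!(B, A, i0=1, i1=size(A, 1), j0=1, j1=size(A, 2))
    m, n = i1 - i0 + 1, j1 - j0 + 1
    if m <= BASE && n <= BASE
        # Base case: plain loops over a small tile.
        for j in j0:j1, i in i0:i1
            B[j, i] = A[i, j]
        end
    elseif m >= n
        # Split the longer dimension and recurse on both halves.
        im = i0 + m ÷ 2 - 1
        cotranspose!(B, A, i0, im, j0, j1)
        cotranspose!(B, A, im + 1, i1, j0, j1)
    else
        jm = j0 + n ÷ 2 - 1
        cotranspose!(B, A, i0, i1, j0, jm)
        cotranspose!(B, A, i0, i1, jm + 1, j1)
    end
    return B
end
```

A caller would allocate the destination itself, e.g. cotranspose!(similar(A, size(A, 2), size(A, 1)), A).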
The hwloc library (http://www.open-mpi.org/projects/hwloc/) provides an API to identify cores, caches, and memory sizes, and is very portable. (OpenMPI uses it.) hwloc generates a tree that describes the hardware, and there is a C API to walk/query the tree. In Julia, this tree could be translated directly to a Julia tree structure, which may be easier to examine in Julia code. As an added bonus, hwloc also identifies PCI devices, accelerators, etc., depending on how it is configured.
That would be really handy. It might make sense to just include it in Base Julia if we use it – the dynamic library it builds is only 172K on my system, which is pretty small compared to the rest of Julia.
+1. I'd never heard of hwloc before, but after looking at it for a few minutes it appears very useful, well-behaved, and straightforward to build. They are even nice enough to provide Win32 and Win64 MinGW-compiled binaries (they have a nice nickname for them: winballs), which would make packaging via BinDeps really easy. Wrapping their C API in Julia code would be a great project.
I really like the term "winballs".
As mentioned above, hwloc is a library that can determine the local cache sizes. See https://github.com/eschnett/hwloc.jl, although the API for accessing cache sizes is still very low-level.
This would be how to obtain the L1 data cache size via hwloc:
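(A sketch under assumptions: the traversal follows the hwloc object model, but the exact hwloc.jl names used below, such as topology_load, type_, attr, and children, may not match the package's current API.)

```julia
# Walk the hwloc hardware tree depth-first and return the size of the first
# level-1 cache object found. Field names here are assumptions based on the
# hwloc object model, not a verified hwloc.jl API.
using Hwloc

function l1dcache_size(obj)
    if obj.type_ == :Cache && obj.attr.depth == 1   # a level-1 cache object
        return obj.attr.size                        # cache size in bytes
    end
    for child in obj.children                       # recurse into subtree
        sz = l1dcache_size(child)
        sz !== nothing && return sz
    end
    return nothing
end

topology = Hwloc.topology_load()                    # root of the hardware tree
println("L1 data cache: ", l1dcache_size(topology), " bytes")
```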
From JuliaLang/julia#6517 - Currently the cache size information is independently hardwired in several methods throughout Julia Base, e.g. in Base.transpose! and in Base.LinAlg.generic_matmatmul, and now also in my implementation of permutedims! (actually in the auxiliary function blockdims). It would probably make sense to centralise this information and maybe even to try to get it from the system (or to set it at build time) instead of fixing it at 32k. - @Jutho
There isn't a new issue to discuss this, but there probably should be. L1 and L2 cache sizes would be nice to have available, but it may require some effort to get this right across OSes and processor types. - @ViralBShah
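For the sake of discussion, a hypothetical sketch of what such a centralised query point could look like (all names here are invented for illustration; the 32k constant is only a fallback, not a detected value):

```julia
# Hypothetical central query point: callers such as transpose! or
# generic_matmatmul would ask this instead of hardwiring 32k locally.
module CacheSizes

const DEFAULT_L1 = 1 << 15   # 32k fallback when detection is unavailable

# Could be set at build time, or replaced by an hwloc-based query.
l1cachesize() = DEFAULT_L1

end # module

# Example caller: derive a block size from the centralised value.
blocksize(elsz) = floor(Int, sqrt(CacheSizes.l1cachesize() ÷ 2) / elsz)
```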