-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fortran underscores #264
Comments
You could use Fortran 2003's C interoperability features but I suspect that this is just a trivial bug that someone will fix quickly. |
Not really a bug since we haven't made any effort to make anything other than regular BLAS functions available in Fortran. F2003 is one option, or you could:
|
Actually Elk controls how many threads each nesting level gets. When it reaches the deepest nesting levels and there are still idle threads, then it assigns them to BLAS or LAPACK routines. Hence it has to have access to the MKL, OpenBLAS or BLIS threading routines during runtime. I would like to request that these and other BLIS routines be made Fortran-compatible. The other libraries appear to have both conventions included (eg. zgemm and zgemm_). I understand it's a bit annoying to have to do this, but it would certainly be useful for Elk and other Fortran codes to be able to access the BLIS API. |
Interesting. We have always viewed the BLIS native APIs as being for C and C++ users. It is good to see that Fortran users are open to more flexible interfaces as well. I'm not sure where on the priority list this should fall... We'll huddle. |
Thanks for your comments, everyone. @jkd2016 As much as I would like, in principle, to provide you with a ready-to-use solution that suits your needs, we have taken many steps to keep BLIS completely divorced of Fortran dependencies. That is why we shy away from anything that causes BLIS to rely on a Fortran compiler. And while it would be possible to create Fortran-compatible API wrappers in C99 for all of BLIS's native APIs, doing so would be a huge time sink and probably not used by very many people. (I don't say this to imply that your needs are not important, but rather because we have to strategically invest our limited resources for maximum impact.) I think it would be best to wait until the right person volunteered to contribute this kind of code. If someone else wrote it, I might be willing to include it in the BLIS source distribution as long as it was completely optional and as long as most of the ongoing technical support was provided by others.
If you would like us to provide BLAS APIs without an underscore (in addition to with an underscore) I would be happy to add this to my to-do list. However, I don't think that is what you are interested in. (Please correct me if I'm mistaken.) @jeffhammond Thanks for opening this issue. |
Adding the wrapper routine "bli_thread_set_num_threads_" wouldn't suddenly make BLIS depend on Fortran. All the BLAS routines in BLIS already have the underscore. If BLIS and libFLAME are AMD's counterpart to MKL, then there is certainly incentive to be able to link to Fortran codes. We have found that zgemm applied to a 15000x15000 matrix with multi-threaded BLIS on a 32-core Ryzen 2990WX processor is about twice as fast as MKL (43 seconds to 80 seconds). Hence the interest in getting it to work with Elk and our motivation for buying AMD hardware. Where MKL wins (even on AMD CPUs) is multi-threaded LAPACK. The eigenvalue solver zheev scales almost linearly with the number of threads whereas BLIS+libFLAME-LAPACK shows no improvement. For a 5000x5000 matrix zheev with MKL takes 45 seconds on 32 cores and BLIS+libFLAME takes 411 seconds. In short, there are good reasons to make BLIS Fortran-friendly without making it Fortran-dependent. |
Sorry, I was sloppily arguing multiple points simultaneously. I agree that adding a
That is impressive. I'm actually surprised by those results, and now I understand a little better while you are playing the part of squeaky wheel.
libflame has advanced Hermitian eigenvalue decomposition (and general SVD) based on the QR method similar in spirit to
I don't disagree. How much mileage would you get if I provided Fortran wrappers to only the runtime thread-related API functions? (I'm primarily thinking of |
I will discuss the Fortran API for BLIS on #265... |
This would be great: I just need bli_thread_set_num_threads_ for Elk at the moment, but I'll leave it to your judgment to add other routines (Elk uses mkl_set_dynamic(.false.) to switch off dynamic thread allocation). In future it would be interesting to try some of the non-BLAS vector operations in BLIS. Elk relies heavily on complex vector operations some of which are not available in BLAS. One pressing example is to calculate many successive complex dot-products: right now I have to call zdotc multiple times and it would be good to dispense with the associated overhead. In the longer term, multi-threaded LAPACK optimised for AMD hardware would be most welcome, particularly for complex matrices of about 10000x10000 elements. I performed a few benchmarks on our machines. We just bought a 32 core Ryzen 2990WX with 64GB RAM for testing but we also have a cluster with Intel Xeon E5-2680 nodes. Here is how BLIS, OpenBLAS and MKL compare for zgemm with 15000x15000 matrices:
As you can see BLIS and OpenBLAS are virtually identical and are faster than MKL on AMD hardware. MKL is somewhat faster than BLIS on the Xeon processors. The Intel processors are faster overall, probably because the amount of memory being thrown around is quite large (~10GB). |
Could you fill me in on the implications of this? I am not a regular user of non-BLAS MKL functions.
Which operations are you referring to here? Things like
Are all of these dot products the same length? Are any of the vectors reused across dot products? If yes to both, you can cast the collection of operations as one or more matrix multiplications.
Thanks for sharing these results. I'm not surprised by MKL outperforming BLIS on Intel hardware, and frankly, nor should anyone else be surprised. It's not a fair fight--not even close. Intel is a company with essentially limitless resources, in money and talent, and they have intimate knowledge of how their hardware works since, well, they designed it. This puts them in a uniquely advantageous position to produce high-performance software. And us? We're basically the equivalent of a scrappy startup company that sneezed in the direction of a university. Most importantly, BLIS can do plenty of things that MKL or any other BLAS library cannot, so we aren't particularly upset about being 10% slower for many/most operations. One last comment: Measuring in raw time is somewhat foreign to us. We usually measure performance in units of gigaflops per second (gflops), which allows us to characterize performance as a fraction of the theoretical peak rate of computation. (Time in seconds only works as a vehicle for comparison when you have a second implementation against which to compare to.) We also look at a range of problem sizes (e.g. 100 to 2000 in increments of 100). This allows the reader to see how quickly performance asymptotes towards the attained peak. We have standalone test drivers that generate these gflops numbers, so it's relatively easy to get up and running with them. Granted, our method detail-oriented and geared towards those who live and breathe this stuff, so it is not for everyone. |
I haven't kept up with the situation in current compilers, but after many years' experience of that sort of thing, I'd definitely recommend the f2003-ish features. They may not apply here, but there are potential gotchas to assuming you can just use names with different case/underscoring in general. [Glad to see the packaging I'm getting into Fedora will have a consumer for the BLIS interface in the ELK package.] |
I don't know why memory would have anything to do with DGEMM performance. DGEMM is not bandwidth-limited for 15000x15000 matrices. The likely explanation here is that Xeon v3 processors aka Haswell uarch support dual-ported 256-bit FMAs whereas Zen appears to have dual-ported 128-bit FMAs. References
Indeed. I made similar comments on https://scicomp.stackexchange.com/a/20727/150. Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL.
And this is why Intel funds BLIS development. ReferencesDisclaimerI work for Intel. |
100% agree with this. |
This tells MKL not to dynamically adjust the number of threads. Elk chooses the number of threads based on how many are idle at the time of the call to MKL.
For example, there is a heavily used routine in which a complex constant times a real vector is accumulated in a complex vector. Currently this is done by two calls to daxpy but it would be more efficient to have a BLAS-like routine to do this. Another example is a routine which requires an operation like
The BLAS routine zher2 performs only the symmetrised version:
so we had to hand-code (1) instead.
The vectors are all different but of the same length. You could use a matrix multiplication A* B but only the diagonal part would be needed. This is one of the most heavily used parts of Elk (called millions of times in a typical calculation)
Actually I was very impressed with the performance of BLIS, particularly the scaling with the number of threads. If you can produce the same scaling with libFLAME LAPACK + BLIS BLAS we would certainly use it. Right now the optimal library choice on the Ryzen is MKL LAPACK + BLIS BLAS. This gives the highest performance for Elk.
Understood. We typically run Elk on up to 1000 cores with MPI + OpenMP + MKL (or BLIS) hbyrid, nested parallelism on a distributed file system. It's a complex parallel environment to have to write code for and I'm amazed it works as well as it does. We don't really count gflops but just like the code to run as fast as possible so we don't have to wait for results. I'll add a BLIS interface to Elk for the next official release. Thanks for all your help. |
I think you're right (the calculation used zgemm not dgemm). I tried several smaller sizes and the results were the same: the Xeon cores are about 30% faster even with the lower clock speed.
Our Ryzen machine also has a nVidia Titan V graphics card. Using CUDA NVBLAS the same 15000x15000 zgemm matrix multiplication takes 6.6 seconds, which includes the copying between CPU and GPU memory; this is about 5 times faster than 24 Xeon cores with MKL. The Ryzen 2990WX + Titan V combination gives a lot of double-precision compute power per dollar, the tricky bit will be getting them to do useful work simultaneously on a complicated code like Elk. |
Got it, thanks.
That looks like BLIS's Related to this: We just presented mixed-domain/mixed-precision (mixed-datatype) functionality, albeit only for
Sure thing. I'll let you know when I have the Fortran-compatible thread APIs committed. |
We specifically needed
That could be useful. There is a little bit of mixed precision in Elk where memory is an issue, but we're usually quite wary about lowering precision in the code. Probably just paranoia... |
I'm unfamiliar with I'm also still unclear what operation you are trying to perform. You first gave the examples:
And it sounded like what you wanted (1) was close to (2) but not quite, and (2) is zher2. So it led me to believe that you wanted some level-2 operation involving a matrix and two vectors. Maybe it would be clearer to define the dimensions of all the objects in your desired operation. For example, an outer product |
I believe what he means is |
Correct. I could have been a bit clearer. By ^* I mean conjugation and ^t means transpose, this is following LaTeX. |
A := x^* y^T + A means A^T := y x^H + A^T if I am right. This means this can be handled by switching the strides in the BLIS API. |
Maybe so, but absolutely no one should have to do this at the user level. I didn't create BLIS's APIs so that people would continue to think about dorking with strides in order to achieve their goals. I created them to free people of thinking in such terms. @jkd2016 The functionality you want is implemented natively within BLIS's ger operation. BLIS's ger allows you to conjugate x, y, both, or neither independent of the transpose on y. BLIS's ger is more general-purpose than BLAS. (In other words: BLIS is not BLAS. BLIS is much more than BLAS with a BLAS compatibility layer thrown in as an afterthought.) Caveat: |
Hint to @field: optimizing level-1 and 2 have long been on my wish list... You just don't read my poker face very well. |
@rvdg Nice try, but you're not going to convince Mr. Complex that it's his fault that we haven't spent enough time on the complex domain. (BTW, you should use my actual handle, @fgvanzee. You probably just called attention from some poor unsuspecting github user who has no idea why he needs to pay attention to this thread.) |
That would certainly do the job. Here is what is in Elk at the moment (sorry about the Fortran):
The vector x is conjugated external to this routine. This saves two complex conjugations per element over using zgerc. The routine also checks whether the prefactor is complex, real or imaginary. It also invokes parallelism but only if there are any idle threads. This routine is nested four OpenMP levels down with MPI above that. However, there is no possibility of thread over-subscription, which is what often happens when things are left to the compiler. We find it is good practice to keep checking whether optimisations like these, which were an advantage a few years ago, remain so today with better compilers and libraries. |
It is things like this (no conjugation allowed on vector x) that motivated us to formulate the BLIS APIs. The original BLAS treats the complex domain as somewhat of a second-class citizen. Why was "conjugation without transposition" left out of the set of valid values of
I'm glad you continued to keep your ear to the ground on this, and I'm delighted that we might make your code a bit cleaner. |
@jkd2016 Would you care to share your full name so I can acknowledge you in the |
Thanks. It's Kay Dewhurst (Max Planck Institute, Halle, Germany). |
So the above Fortran code would be replaced with
...which is neater and faster to boot. Also, Elk needs a similar operation when x and y are the same vectors. I see that I could use bli_zher for this (there is no BLAS analogue). I'm starting to like this a lot. |
Yes, BLIS
Glad to hear. I always thought that people would like BLIS the more they used it, especially if/when they strayed away from the BLAS interfaces.
The code is ready to try out in the
Here,
Please let us know how it goes. We're counting on you to test these functions out! :) (Nobody here at UT uses Fortran test drivers.) |
I'm closing this in favor of #265. For |
I'm currently adding BLIS to our electronic structure code Elk (elk.sourceforge.net).
There is an issue with calling the BLIS API subroutines, namely they don't have leading underscores as required by Fortran.
For example, to set the number of threads at run-time, I can use
with MKL and OpenBLAS, but
doesn't work because bli_thread_set_num_threads_ is not the name in the library.
Any ideas on how to resolve this in a portable way?
The text was updated successfully, but these errors were encountered: