Merging with JCuda and JOpenCL projects for better quality cuda interfaces #475
Comments
That would be nice, but the problem is that people expect Oracle to come up with a better solution than JavaCPP, even though they are not working on anything at the moment. As far as I can tell, the developers of Project Panama have given up on any generic solution to C++; no one knows how to make something better than JavaCPP. Still, they hope and believe, and wait, mostly. If you could help convince them that nothing better is going to happen, explaining and re-explaining over and over again how JavaCPP could get better, that would be the first thing that needs to be done. |
Just for reference: jcuda/jcuda#12 (comment) |
@saudet Anyway, I registered on the project mailing list, but to be honest, I went through some links on the project site: some repository links are broken, and the blogs of the main creators/devs have not been updated for a long time... I will check more and read about this. |
There is a lot happening in Panama (the project...) right now. Admittedly, although I'm registered to the mailing list, too much to follow it all in detail. However, if they manage to achieve the goals that they stated at the project site, http://openjdk.java.net/projects/panama/ , this would certainly compete with JavaCPP. Of course, development there happens at a different pace. We all know that a "single-developer project" often can be far more agile than a company-driven project, where specifications and sustainability play a completely different role. Panama also approaches topics that go far beyond what can be accomplished by JavaCPP or JNI in general. They are really going down to the guts, and the work there is interwoven with the topics of Value Types, Vectorization and other HotSpot internals. So I agree to saudet that it does not make sense to (inactively) "wait for a better solution". JavaCPP is an existing solution for (many of, but by no means all of) the goals that are addressed in Panama. More generally speaking, the problem of fragmentation (in terms of different JNI bindings for the same library) occurred quite frequently. One of the first "large" ones had been OpenGL, with JOGL which basically competed with LWJGL. For CUDA, there had been some very basic approaches, but none of them (except for JCuda) have really been maintained. When OpenCL popped up, there quickly have been a handful of Java bindings (some of them being listed at jocl.org and in this stackoverflow answer ), but I'm not sure about how actively each of them is still used and maintained. (OT: It has been a bit quiet around OpenCL in general recently. Maybe due to Vulkan, which also supports GPU computations? When Vulkan was published, I registered jvulkan.org, but the statement "Coming soon" is not true any more: There already is a vulkan binding in LWJGL, and the API is too complex to create manual bindings. 
There doesn't seem to be a Vulkan preset for JavaCPP, or did I overlook it?) For me, as the maintainer of jcuda.org and jocl.org, one of the main questions about "merging" projects would be how this can be done "smoothly", without just abandoning one project in favor of the other. I always tried to be backward compatible and "reliable", in that sense. Quite a while ago, I talked to one of the maintainers of Jogamp-JOCL, about merging the Jogamp-JOCL and the jocl.org-JOCL. One basic idea there had been to reshape one of the libraries so that it could be some sort of "layer" that is placed over the other, but this idea has not been pursued any further. I'm curious to hear other thoughts and ideas about how such a "merge" might actually be accomplished, considering that the projects are built on very different infrastructures. |
I am also registered to the list, but I'm not seeing anything happen. Could you point me to where, for example, they demonstrate creating an instance of a class template? I would very much like to see it. Thanks
|
Yes, JCuda, etc could be rebased on JavaCPP, that's the idea IMO. There are no bindings for OpenCL or Vulkan just because I don't have the time to do everything, that's all. |
@jcuda @saudet |
I know about these links for JNR:
http://www.oracle.com/technetwork/java/jvmls2013nutter-2013526.pdf
bytedeco/javacpp#70
|
@saudet thanks buddy, I also suggest moving the discussion about JCuda vs JavaCPP to Marco's thread at, as he requested: NOTE: Apart from the theoretical discussion, since performance is the top priority, I suggest that you, @saudet, create a new GitHub project under JavaCPP where we can develop a real benchmark for JCuda and JavaCPP-based CUDA (as Vulkan and OpenCL are not available at the moment), so we can analyze code syntax differences/similarities as well as performance in a unified way. I also suggest deciding which benchmark framework should be used to build this: |
Sure, but who will take time to do? I keep telling everyone I don't have the time to do everything by myself...
|
I will create the initial project and adapt a few basic CUDA algorithms to be implemented in JCuda and JavaCPP. I hope we can find more users from the other side (JCuda) to participate as well. |
Ok, cool, thanks! Can we name the repo "benchmarks"? or would there be a better name?
|
I think keeping it generic is best, so "benchmarks" sounds good. Beyond this, I would also like to test JavaCPP vs JNR later (if I have time) with some simple dummy getuuid function calls from libc as a kind of template:
|
Again, I'm not so deeply involved there, but their primary goal is (to my understanding) not something that is based on accessing libraries via their definitions in header files. My comment mainly referred to the high-level project goals (i.e. accessing native libraries, basically regardless of which language they have been written in), together with the low-level efforts in the JVM. At least, there are some interesting threads in the mailing list, and the repo at http://hg.openjdk.java.net/panama/panama/jdk/shortlog/d83170db025b seems rather active. Regarding the benchmarks: As I also mentioned in the forum, creating a sensible benchmark may be difficult. Even more so if it is supposed to cover the point that is becoming increasingly important, namely multithreading. But setting up a basic skeleton with basic sample code could certainly help to figure out what can be measured, and how it can be measured sensibly. (As for the topic of merging libraries, the API differences might actually be more important, but this repo would automatically serve this purpose, to some extent - namely, by showing how the same task is accomplished with the different libraries) |
Thanks for your comments. Actually, based on the presentation, it even looks like they have added more processing layers than JNI has :-))), but I will need to investigate the whole story more. Thanks for the link. Regarding the benchmark: That is exactly the point, because I also do not know how big the differences are at the moment, or how big a breakthrough we are talking about. |
@archenroot I created the repository and gave you admin access: |
@saudet good starting point, I will try to do as discussed. In some cases I am also thinking of providing an existing C/C++ implementation, if available, to compare native performance, but I will focus on JCuda vs JavaCPP at first. Thanks again. |
Yes. CUDA offers streams and some synchronization methods that are basically orchestrated from client side. (This may involve stream callbacks, which only have been introduced in JCuda recently, an example is at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/driver/samples/JCudaDriverStreamCallbacks.java ) As for the other "benchmarks": Some simple matrix multiplication could be one that creates a real workload. Others might be more artificial, in order to more easily tune the possible parameters. Just a rough example: One could create a kernel that just operates on a set of vector elements. Then one could create a vector with 1 million entries, and try different configurations - namely, copying
(the kernel itself could then also be "trivial", or create a real workload by throwing in some useless Again, this is just a vague idea. |
FWIW, being able to compile CUDA kernels in Java is something we can do easily with JavaCPP as well. To get a prettier interface, we only need to finish what @cypof has started in bytedeco/javacpp#138. |
@archenroot @jcuda May I add that the actual computation time of the GPU kernels is not that important for the benchmarks. What we need to measure here is the overhead over plain C/C++ CUDA driver calls. So, let's say that enqueuing the "dummy" kernel costs X time. The Java wrapper needs k * X time. We are interested in knowing k1 (JCuda) and k2 (JavaCPP cuda), In my opinion, |
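A minimal sketch of how the factor k described above could be estimated, assuming the baseline X is measured separately in a plain C/C++ program. The class `OverheadFactor`, the empty `dummyLaunch` stand-in, and the 5000 ns baseline are hypothetical placeholders, not part of JCuda or JavaCPP; a real benchmark would more likely use a harness such as JMH.

```java
/** Sketch: estimating the wrapper overhead factor k = wrapperTime / baselineTime. */
final class OverheadFactor {

    /** Average nanoseconds per call of {@code call}, measured after a warm-up phase. */
    static double averageNanos(Runnable call, int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            call.run(); // give the JIT a chance to compile the call path
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            call.run();
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    /** k = time per call through the Java wrapper / time per call from plain C/C++. */
    static double factor(double wrapperNanos, double baselineNanos) {
        if (baselineNanos <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return wrapperNanos / baselineNanos;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for "enqueue a dummy kernel" through a Java binding.
        Runnable dummyLaunch = () -> { };
        double wrapperNanos = averageNanos(dummyLaunch, 10_000, 1_000_000);
        // Assumption: X comes from a separate C/C++ measurement; 5000 ns is a placeholder.
        double baselineNanos = 5_000.0;
        System.out.printf("k = %.3f%n", factor(wrapperNanos, baselineNanos));
    }
}
```

Running the same harness once with a JCuda call and once with a JavaCPP call in place of `dummyLaunch` would give k1 and k2 against the same baseline.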
Compiling CUDA kernels at runtime already is possible with the NVRTC (a runtime compiler). An example is in https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcVectorAdd.java . (Of course one could add some convenience layer around this. But regarding the performance, the compilation of kernels is not relevant in most use cases). I'll have a look at the linked PR, though. |
@jcuda Oh, interesting. It's nice to be able to do this with C++ in general and not only CUDA though. |
In fact, the other sample at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcLoweredNames.java shows that this also supports "true" C++, with namespace, templates etc. (The sample does not really "do" anything, it only shows how the mangled names may be accessed afterwards). The NVRTC was introduced only recently, and before it was introduced, one problem indeed was the lack of proper C++ support for kernels in JCuda: It was possible to compile kernels that contained templates by using the offline CUDA compiler (which is backed by a C++ compiler like that of Visual Studio). The result was a PTX file with one function for each template instance. But of course, with oddly mangled names that had to be accessed directly via strings from Java. With the NVRTC, this problem is at least alleviated. |
But it doesn't help for C++ code running on the host, right? So, if I
understand correctly, NVRTC doesn't help for something like Thrust:
https://github.com/bytedeco/javacpp/wiki/Interface-Thrust-and-CUDA
|
That's right. And the question was already asked occasionally, aiming at something like "JThrust". But I think that the API of Thrust (which on some level is rather template-heavy) does not map sooo well to Java. I think that a library with functionality similar to that of Thrust, but designed in a more Java-idiomatic way, would be the better option. (A while ago I considered to at least create some bindings for https://nvlabs.github.io/cub/ , as asked for in jcuda/jcuda-main#11 , but I'm hesitating to commit to another project - I'm running out of spare time....) |
@jcuda @archenroot @blueberry FYI, wrapper overhead might become more important since kernel launch overhead has apparently been dramatically reduced with CUDA 9.1:
|
They don't give any details/baseline of what they compared. A dedicated benchmark or comparison with CUDA 9.0 and 9.1 might be worthwhile. (I haven't updated to 9.1 yet - currently, the Maven release of 9.0 is on its way...) @archenroot Any updates on the benchmark repo? |
In the meantime, I've released presets for CUDA 9.1 :) |
@jcuda The central statistics for the CUDA presets look like this (numbers for December aren't in yet it seems): In any case, my goal with JavaCPP was never to provide clean APIs for end users, but to provide developers like you with the tools necessary to work on high-level idiomatic APIs. The kind of tools that nearly all Python developers take for granted, but for some reason most Java developers, even those at Oracle, prefer to write JNI manually, such as with the work that @Craigacp has recently been doing for ONNX Runtime. Another case in point, Panama has officially dropped any intentions of offering something like JavaCPP as part of OpenJDK, see http://cr.openjdk.java.net/~mcimadamore/panama/jextract_distilled.html. What they are saying essentially is that since they haven't been able to come up with something that's perfect, that they can confidently support for the next century or so (I'm exaggerating here, but that's not far from the truth), they will leave this dirty work to others like myself and yourself! :) So, please do consider rebasing JCuda and JOCL on JavaCPP. People who really wish to use the crappy parts of the CUDA API will be able to, while you can concentrate on offering some subset of it that makes sense to most Java users. TensorFlow has done it and they even got a speed boost over manually written JNI, see tensorflow/java#18 (comment). MXNet has also dropped their manually written JNI too and may choose to continue either with (slow) JNA or (faster) JavaCPP, see apache/mxnet#17783. In any case, if you still feel strongly against using a tool like JavaCPP, please let me know why! The engineers at NVIDIA certainly haven't been very clear about why they consider tools like Cython, pybind11, setuptools, and pip to be adequate for Python, but not for Java where for some reason everything has to be redone manually with JNI for each new project, see rapidsai/cudf#1995 (comment). /cc @razajafri |
So, what's happening since August...? Maybe people (or at least, one or few "large" users) are moving from JCuda to JavaCPP...
Originally, my goal of JCuda was also to address two layers:
I didn't really tackle the latter. It would be easy to offer some abstraction layer that covers 99% of all use cases (copy memory, run kernel - that's it). But designing, maintaining and extending this properly could be a full-time job. The direct JNI bindings had been manageable... until recently. I have some parsing- and code generation infrastructure (which, in turn, is far away from being publishable). But the general approach of memory/ I talked with some of the Panama guys a while ago. Part of this discussion was also about ~"the right level of abstraction". I'm generally advocating for defining clear, narrow tasks. Creating a tool that does one thing, and does it right. Or as indicated by the two steps mentioned above: Defining a powerful (versatile), stable (!) API, and build the convenience layer based on that. I didn't manage to follow the discussion on the Panama mailing list in all detail. But I can roughly imagine the difficulties that come with designing something that is supposed to be used for literally everything (i.e. each and every C++ library that somebody might write), and doing this in a form that is stable and reliable. (And by the way: I highly appreciate the fact that Oracle puts much emphasis on long-term stability and support. Today, I can take a Java file that was written for a 32bit Linux with Java 1.2 in 1999, and drag-and-drop it into my IDE on Win10 with Java 8, and it just works. Period. No updates. No incompatibility. No hassle. No problems whatsoever. Maybe one only learns to appreciate that after being confronted with the daunting task of updating some crappy JS "web-application" from Angular 4.0.1.23 to 4.0.1.23b and noticing that this may imply a re-write. Stability and reliability are important) I only occasionally read a few mails from the Panama mailing list, and noticed that the discussion is sometimes ... *ehrm*... a bit heated ;-) and this point seems to be very controversial. 
But I cannot say anything technically profound here, unless I invest some time to update and get an overview of the latest state. So ... the following does not make sense (I know that), and may sound stupid, but to roughly convey my line of thought: Could it be that, one day, Panama and JavaCPP work together? E.g. that Panama can generate JavaCPP presets, or JavaCPP presets can be used in Panama? I think that one tool addressing a certain layer, or having a narrower focus than another, does not mean that the tools cannot complement each other... An aside:
I'd really like to do that, for some parts of the CUDA API. It lends itself to an Object-Oriented layer quite naturally.
And even more so for the new "Graph Execution" part of the API that my rant was about (I'm a fan of flow-based programming - that's why I created https://github.com/javagl/Flow , and having "CUDA modules" there would be neat...). But the point is: Nobody wants to use these parts of the CUDA API. People think that they have to use it, for profit, and will use it. They will hate it, but they will use it. And NVIDIA knows that, so they obviously don't give the slightest ... ... care... about many principles of API design.
I don't feel strongly against using a tool like JavaCPP, and already mentioned elsewhere: If JavaCPP had been available 10 years ago, I probably wouldn't have spent countless hours for JCuda (including the parsing and code generation infrastructure). I have to admit that I haven't set up the actual JavaCPP toolchain, for the actual creation of code, because I'd have to allocate some time for https://github.com/bytedeco/javacpp-presets/wiki/Building-on-Windows , but it would certainly be (or have been) less effort in the long run... Regarding rebasing JCuda on JavaCPP: I think we already talked about that, quickly, in the forum. It might be possible to do that to some extent. But I have some doubts. Very roughly speaking:
The last one refers to one point that I'm not sure about in JavaCPP. To my understanding, when creating an
In JCuda, I deliberately tried to allow a " |
Hi, I'm Maurizio and I work on Panama - I think what you suggest is not at all stupid/naive. The new Panama APIs (memory access + foreign linker) provide foundational layers to allow low-level memory access and foreign calls. This is typically enough to bypass what currently needs to be done in JNI/Unsafe - meaning that, at least for interfacing with plain C libraries, no JNI glue code/shared libraries should be required. It is totally feasible, at least on paper, to tweak JavaCPP to emit Panama-oriented bindings instead of JNI-oriented ones (even as an optional mode). While this hasn't happened yet, I don't think there's a fundamental reason as to why it cannot happen. I know of some frameworks (Netty and Lucene to name a few) who have started experimenting a bit with the Panama API, to replace their current usages of JNI/Unsafe, so it is possible. Of course, since we're still at an incubating stage, there might be some hiccups (e.g. some API points might need tweaking, and/or performance numbers might not be there in all cases) - but we're generally trying to improve things and managed to do so over the last year. |
@mcimadamore We talked a bit via mail, and I gave jextract a try in https://mail.openjdk.java.net/pipermail/panama-dev/2019-February/004443.html , but that was quite a while ago, a lot has happened in the meantime, and I'm not really up to date. (There's something paradoxical about the situation that I spend spare time for JCuda, instead of Panama, while the latter could help to spend less time for JCuda ... :-/ )
From a birds-eye perspective (and not being deeply familiar with JavaCPP, I don't have another perspective... yet), my thought was that it might eventually be possible to replace the
In fact, there are some similarities to my code generation project. I tried to establish "sensible defaults", but still make it possible to plug in
This may be over-engineering, but conceptually, it's the attempt to abstract what's currently done in the But again, that's just brainstorming. I know that it's never as easy as it looks on this level... |
Well, not necessarily. None of the current contributors of TensorFlow for Java are being paid to work on it full time, and it seems to be working out alright. I think what's important is figuring out ways to engage multiple people in a project, and then have it grow that way. I was under the impression that you were already spending most of your time on 2, but if not, indeed, maybe JavaCPP could pick up 1 and then you can move on to 2 for most of the time you can spend on this.
Oh sure, I understand very well the benefits. I guess my beef is more with the Java community that hasn't been creating and experimenting with tools for native libraries, which leaves us with very little experimental results for projects like Panama to pick and choose from.
Yes, as @mcimadamore points out, that's pretty much how it's shaping up to be. These days, I consider Panama to be the "new JNI", which approaches things in a different way, but I'm not entirely convinced it's going to be substantially more usable than JNI and sun.misc.Unsafe. In theory, it should have less overhead than JNI, which would make it worth using just for that reason, but it's not currently the case. Also, it's not going to be available in Android for the foreseeable future, so we'll see.
Yeah, whatever. What I'm trying to do with JavaCPP is to be able to at least expose even these APIs to Java so that they are at least as (un)usable as from C/C++, and it's been working out better than I thought it could at first.
It's not that hard! :) JavaCPP itself just needs a C++ compiler, like this: https://github.com/bytedeco/javacpp#getting-started
JavaCPP also supports arrays. We can have overloads like this:
native void someFunction(IntPointer array, int size);
native void someFunction(int[] array, int size);
It doesn't try to interpret that "size" because it leads to issues like you've noticed where it's not always possible to map mechanically. However, it's possible to layer on top of that additional overloads like this:
void someFunction(int[] array) { someFunction(array, array.length); }
native int cuMemcpyHtoD(long dstDevice, byte[] srcHost, long ByteCount);
native int cuMemcpyHtoD(long dstDevice, short[] srcHost, long ByteCount);
native int cuMemcpyHtoD(long dstDevice, int[] srcHost, long ByteCount);
...
int cuMemcpyHtoD(long dstDevice, byte[] srcHost) { return cuMemcpyHtoD(dstDevice, srcHost, srcHost.length); }
...
I suppose that's the kind of thing we could do to make it more like JCuda. Anything else? FWIW, Java arrays are limited to 2^31 elements, so that's why I don't feel it's worth spending too much time supporting all the corner cases. For "big data" applications, the data is in native memory anyway. It's never going to be in Java arrays.
Yup, that's all things that could be worked on for JavaCPP 2.0 along things like using Clang to parse header files, see bytedeco/javacpp#51. (Clang is pretty big though. I'm not sure how Panama plans to justify the cost of adding that to the JDK. It would make sense if they also planned on using LLVM instead of C2 as with https://www.azul.com/products/zing/falcon-jit-compiler/, but they're not planning on doing that, so, I don't know. Panama's roadmap is still way too unclear for me. Like I said, I'm currently considering Panama to be the "new JNI" that might not bring performance improvements and may never be ported to Android...) |
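The convenience overloads discussed in this comment can be sketched without any native code. In the sketch below, `rawMemcpyHtoD` is a hypothetical stand-in for a generated `native` method such as `cuMemcpyHtoD(long, int[], long)`; it only validates its arguments so that the example runs without CUDA. One detail worth making explicit: the last parameter is a byte count, so for element types wider than a byte the convenience layer has to multiply the array length by the element size (for `byte[]`, the length alone is enough).

```java
/** Sketch of layering convenience overloads on top of a raw, mechanically mapped binding. */
final class MemcpyOverloads {

    /**
     * Hypothetical stand-in for a generated native method such as
     *   native int cuMemcpyHtoD(long dstDevice, int[] srcHost, long ByteCount);
     * It only checks the byte count and returns 0 (success), so no GPU is needed.
     */
    static int rawMemcpyHtoD(long dstDevice, int[] srcHost, long byteCount) {
        long available = (long) srcHost.length * Integer.BYTES;
        if (byteCount < 0 || byteCount > available) {
            throw new IllegalArgumentException("byteCount out of range: " + byteCount);
        }
        return 0;
    }

    /** Size of the array contents in bytes, computed in long to avoid int overflow. */
    static long byteCount(int[] array) {
        return (long) array.length * Integer.BYTES;
    }

    /** Convenience overload: derives the byte count from the array itself. */
    static int memcpyHtoD(long dstDevice, int[] srcHost) {
        return rawMemcpyHtoD(dstDevice, srcHost, byteCount(srcHost));
    }

    public static void main(String[] args) {
        int[] host = new int[256];
        // 0L is a placeholder device address; a real one would come from a device allocation.
        int status = memcpyHtoD(0L, host);
        System.out.println("status=" + status + ", bytes=" + byteCount(host));
    }
}
```

The raw method stays a one-to-one mapping of the C signature, and everything that needs interpretation (here, the meaning of the size argument) lives in the hand-written layer.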
That's true. But JCuda always has been a one-man show. There is nothing that could "grow". When there's a new function in CUDA, the JNI stuff is added, and that's it. I probably should have polished+published the code generation part. As an analogy to JavaCPP: The result is just a pile of repetitive JNI code. The process that generates this result is far more relevant.
Not being entirely up to date, I cannot say anything further about Panama. But ... let's be honest: It was good to have something like JNI, because interoperation with C libraries is crucial for all programming languages and ecosystems. And it was... "sufficient" for generating a wrapper for a function like On a very low technical level: I wonder why this page (which is now only available via the wayback machine) was at some point removed from the JNI docs: https://web.archive.org/web/20070112113059/http://java.sun.com/docs/books/jni/html/stubs.html As far as I understand, this looks like a very generic way to handle native calls....
It requires certain tools like msys, mingw, and an installation procedure consisting of 20 steps. When I see something like this, I usually assume that at least 5 of these steps will ~"not work as described". (No offense, this is not specific for JavaCPP, just from my experience...) - so I'd allocate at least a weekend for something like this.
I'd have to take a closer look at the JNI code there. I'll just ask this now, and if you think that the answer is beyond the scope of this issue thread, just say "RTF When passing an array to something like a CUDA function, then these functions may be inherently asynchronous. There is a plethora of technical caveats, particularly for things like
The point being: One passes in a structure (array) of objects that are processed asynchronously, but there is no mechanism that prevents each However, the respective
The goal would not necessarily be to "make it more like JCuda", but to improve things in terms of performance and usability (and we can argue about in how far these goals overlap ;-)). There could be different ways of achieving this for this highly specific case. The
I had seen issue 51 before, but it appears to be quiet there (2018). And admittedly, Clang and LLVM are things that I'd like to have a closer look at, but are also nothing that one could just casually get started with. For my stuff, I just used the parsing functionality from https://www.eclipse.org/cdt/ . It parses the whole C++ code, and generates an AST - that's all I needed until now. Re-implementing a full-fledged C++-parser is out of scope for any project. |
JCuda is used by others, so it's not just you alone. What's mainly missing is money. You could get money to work on JCuda, for example, via the NVIDIA-funded projects at https://github.com/rapidsai. Those do not currently interoperate with JCuda, JavaCPP, or anything that gives Java developers access to CUDA functions. I think that's a big oversight on their part, but at the moment their lead engineers do not understand this seemingly simple fact! Someone like you may be able to convince them that they should make their libraries compatible with JCuda, JavaCPP, etc, at which point you may get NVIDIA engineers working for you on your projects and even get money to get things working. I know of at least @razajafri and https://www.linkedin.com/in/stevemasson/ that have tried to use JavaCPP at NVIDIA and they may be able to offer you some help, but it's going to be a hard battle to get their lead engineers to hear you. It's not an impossible task though.
Probably just copyright issues with the publisher or something? After all, it's an old book...
The only reason it needs MSYS2 is to run the Bash scripts. Now that Windows has WSL, we could port all that to WSL. That's something else you could work on. :) Please put the blame where it belongs. Until recently Microsoft was very unfriendly to anything that wasn't 100% Microsoft, including Linux and Java. The community had no choice but to come up with hacks like MSYS2. These days, the smoother developer experience is on Linux.
Like I said, the data is likely in native memory anyway. It's not worth spending all that time trying to support Java arrays. Panama does not support Java arrays, at all, full stop, period. It's a dead end, let it go: Use native (aka off-heap) memory and forget about Java arrays.
JavaCPP isn't parsing the whole of C++, only the bits needed to parse most header files, but it's quite ad hoc. We talked about that before. Anyway, the only free usable C++ parser that is being actively maintained these days is Clang, so... It could make sense as part of an external library like JavaCPP, or even GraalVM, which is using LLVM as a compiler backend called via JavaCPP: |
I've been messing around with the generated clang bindings, experimenting with a code generator and so on. If we decide to go with clang or similar for parsing I think we're going to have to do a bit of manual text replacement in the header files if we want to keep full compatibility with the current setups we have (not sure how we do line patterns for example). Perhaps we could have the clang backend as an optional generator while keeping the current one? I'm not very experienced with the libclang C API as I've only done minor experiments with it but their documentation mentions that the C API doesn't really provide that much information and that their C++ API has a lot more data available with more in-depth AST traversal. If we end up struggling with the C API, perhaps we could write a tiny C++ wrapper and generate bindings for that with JavaCPP? I would be happy to help with work regarding a Clang based parser and/or generator. Our initial goal doesn't have to be to replace the existing parser or generator. Let me know if this is something of interest. |
Well, it'd probably be an incompatible move, towards JavaCPP 2.0. @wmeddie made me realize the
Yes, I think that would be the best approach. We can easily "extend" the C API with new functions like you and @yukoba have already done here:
It's something of interest for sure, but it is something that would probably take even a good engineer like you maybe half a year to complete! So, for the moment, it's probably going to stay on the back burner for a while still... You could still look into all that and get bits and pieces done here and there, that'd be great, I just don't want to set unrealistic goals here. I also would like to see if Panama ends up being actually useful for this purpose :) You may want to try to work with them on that, but OpenJDK has their own ideas about how to do things, and they generally do things their way regardless of the feedback they receive (unless it comes with a lot of money attached obviously)... |
BTW, here's a good step in the right direction for the generalization of a framework like TVM: |
@jcuda BTW, if your reluctance to use off-heap memory comes from a belief that it is slower to access from Java, you do not need to worry about that. It is just as fast to access off-heap memory as it is to access the memory of Java arrays, and the API can still be pretty enough. See, for example, here what it looks like with "indexers": http://bytedeco.org/news/2014/12/23/third-release/ |
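To illustrate the general idea of an "indexer" over off-heap memory, here is a minimal sketch that uses only `java.nio`. The class `FloatIndexer2D` is hypothetical and is not JavaCPP's actual indexer API; it only shows how multidimensional access can be layered over a direct buffer with row-major index arithmetic. It makes no claim about performance.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Sketch of a 2D accessor over off-heap memory (hypothetical, not JavaCPP's Indexer). */
final class FloatIndexer2D {
    private final FloatBuffer data;
    private final int rows;
    private final int cols;

    FloatIndexer2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        int bytes = Math.multiplyExact(Math.multiplyExact(rows, cols), Float.BYTES);
        // allocateDirect places the storage outside the Java heap, zero-initialized.
        this.data = ByteBuffer.allocateDirect(bytes)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    float get(int row, int col) {
        return data.get(index(row, col));
    }

    void put(int row, int col, float value) {
        data.put(index(row, col), value);
    }

    boolean isOffHeap() {
        return data.isDirect();
    }

    /** Row-major offset of (row, col), with explicit bounds checks per dimension. */
    private int index(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ")");
        }
        return row * cols + col;
    }

    public static void main(String[] args) {
        FloatIndexer2D matrix = new FloatIndexer2D(2, 3);
        matrix.put(1, 2, 4.5f);
        System.out.println(matrix.get(1, 2) + " offHeap=" + matrix.isOffHeap());
    }
}
```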
There's not really a "reluctance" to use off-heap memory. In some way, quite the contrary: If I had to ~"design a raw-data API from scratch", I'd almost certainly design it exclusively around things like
there is hardly a reason to actually use a
is far more flexible in every conceivable way: It also accepts direct buffers, the
The API is just more powerful that way. That being said: My primary motivation for supporting plain
and wanted to offload the "complex processing" to the GPU, this should be possible without the additional overhead of having to create a direct buffer and copying the data first. The memory transfer already is the bottleneck in most cases anyhow.... Things have changed since then. If I had to start over, I'd probably be "reluctant" to do all the contortions that are necessary to support non-direct data... Edit: That "indexers" link looks interesting, I'll try to allocate some time to look into that. It might be related to (or a "better engineered, goal-oriented" version of) what I did with https://github.com/javagl/ND a few years ago... |
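The flexibility argument in this comment can be sketched with plain `java.nio`, assuming the signatures that were lost from the text above were of the usual array-based versus buffer-based kind. The class `BufferApiSketch` and its `sum` method are hypothetical; `sum` merely stands in for any function that consumes raw data. A single `FloatBuffer` parameter accepts a wrapped array, a sub-range of an array, and a direct (off-heap) buffer alike.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Sketch: one buffer-based signature serving heap arrays, sub-ranges and direct buffers. */
final class BufferApiSketch {

    /** Sums the remaining elements without modifying the caller's position or limit. */
    static float sum(FloatBuffer data) {
        FloatBuffer view = data.duplicate(); // shares content, has its own position/limit
        float total = 0f;
        while (view.hasRemaining()) {
            total += view.get();
        }
        return total;
    }

    public static void main(String[] args) {
        float[] array = {1f, 2f, 3f, 4f};

        // A plain array, wrapped without copying.
        System.out.println(sum(FloatBuffer.wrap(array)));

        // A sub-range of the same array: offset 1, length 2.
        System.out.println(sum(FloatBuffer.wrap(array, 1, 2)));

        // A direct buffer, i.e. memory outside the Java heap.
        FloatBuffer direct = ByteBuffer.allocateDirect(array.length * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        direct.put(array);
        direct.flip();
        System.out.println(sum(direct));
    }
}
```

An array-only signature would need an extra offset and length parameter to express the second call, and could not express the third one at all without copying.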
I see, makes sense. BTW, Smile is also pretty "old" w.r.t. that aspect of using float[], double[], and friends as part of the API, but the main author did switch to using JavaCPP over the previously available alternative to access the necessary native libraries (mainly netlib-java). It might be worth investigating what he ended up doing there. It looks like for the kind of BLAS-like interface of these libraries, JavaCPP is able to generate everything with float[]/FloatBuffer and double[]/DoubleBuffer. In any case, I'm not seeing any usage of FloatPointer or DoublePointer...
I haven't been trying to create something like NumPy, but merely just what is needed to offer what we already have in C/C++ to access multidimensional arrays, that we don't have in Java. It creates issues not to have some language feature for that when trying to use C/C++ APIs in Java. As for something like NumPy in Java, the C++ API of PyTorch maps reasonably well to Java, it's pretty cool and it works with GPUs too, check it out: https://github.com/bytedeco/javacpp-presets/tree/master/pytorch I hope more than a handful of people find this interesting enough so that we can develop a high-level API on top of that. |
I'm working on DL inference deployment. Since most data processing and website frameworks are still based on Java, these systems/pipelines nowadays need a DL inference phase (such as CV/NLP-based classification). There are two ways to do this: 1) Cloud native: in K8S clusters, many pods are Java-based services, with some pods based on Python/C++ doing the inference, and they use RPC to talk to each other. Tools like Flask look quite popular here as a Python-based inference service. This is the typical approach when a company has lots of DL models to serve, so it dedicates some pods to the inference service. 2) Just use Java APIs for the inference, such as the TensorFlow Java API.
|
@jackyh what you're doing around triton is great but very vendor specific. Not all DL workloads need GPUs, as they are very expensive. Many of the companies betting on java would generally prefer to start with stock cpus first. GPUs do help in quite a few cases though, especially for training. As for javacpp based tooling, a lot of tools are built on top of it, including tf java and our very own dl4j as well as our tensor library nd4j. I view javacpp as a tool to build tools, very similar to kubernetes. It's too low level for most developers to be touching directly. It's still native code and exposes very raw interfaces. I think of directly using javacpp as like programming with c++ directly. The flexibility is amazing, and what it does for tool developers is also really nice. |
We've done lots of on-site surveys of these "mid-small" e-commerce companies. 90% of these companies still use CPUs for inference: some of them use the TF Java API, some use other tools (some pipe from Python to Java, the ONNX Runtime Java binding, etc.), and some use a Flask microservice. The reasons are (listed by importance):
Here, I do not agree on the "training" topic; here are the reasons:
Agreed, this is something of an ecosystem topic. We need more Java developers to contribute Java-style high-level app demos. JavaCPP is just a wrapper; it's essential, but not enough. |
@jackyh The ONNX Runtime Java API works just fine on GPUs as well as CPUs. The latest release supports TensorRT as well as CUDA on GPU, and OpenVINO & DNNL for CPU alongside the standard CPU backend. If there are CPU or GPU throughput problems with the Java API open an issue on github.com/microsoft/onnxruntime and we'll fix them. Similarly TF SIG-JVM does care about inference throughput, though it's not had as much attention as ORT has, especially on GPUs, due to it being community led and the community being rather fractured. One of my current interests for the Java ML ecosystem is building ONNX export support as that allows Java trained models to be served on the major cloud providers using the services they provide for autoscaling. I've talked to the ONNX steering committee about improving ONNX's support for other languages, and we've built ONNX export support into my group's Java ML library. |
ehh... personally, I rarely hear of people using Java for training. I will try to reach out to more customers here. |
@Craigacp how is the op coverage? We're working on import for uptraining models, supporting the keras h5 format, onnx as well as TF, and have made some fairly good strides there. @jackyh our audience tends to have folks who want uptraining for models. I can confirm what @Craigacp is saying here. There's an audience here, especially for pretrained models. It's not as big as python's though, for sure. What we've found in practice is that vendor specific efforts are generally better than neutral ones. In our version of triton (https://github.com/KonduitAI/konduit-serving), using javacpp as well as our dl4j stack, we tend to implement pipeline steps as a high level abstraction over various javacpp presets, giving one abstraction that allows for higher performance like what you're aiming for. A big focus is on direct in-memory interop by passing pointers around via javacpp. This, plus a big focus on graalvm integration for easily deployed binaries, is allowing us to essentially compile models to whole pipelines while selectively adding support for different vendors (whatever might be superior in performance for a particular use case) |
@Craigacp @agibsonccc |
Our ONNX export op coverage is pretty low but we're trying to export ML models rather than DL ones, so we're only implementing the ops we need for the models we can train (excluding TF-Java models as training is broken in TF-Java at the moment anyway). It's all public, you can see where we are up to. Longer term it might be interesting to have tf2onnx in Java, but that's a much larger effort. I agree there is an audience for fine-tuning models in Java, and so it would be pretty useful to have that support. Though personally I find maintaining Python libraries that can pre-train large transformer models to be deeply frustrating and so I'd like to do that on the JVM too. But it's much much harder, and the market is even smaller because it's only really possible at large companies.
I agree the DL ecosystem isn't as big, but DL is a subset of ML, and there is lower hanging fruit in ML. It would be nice to prevent people from having to use Python & scikit-learn to train logistic regressions, or tree ensembles, as those are in my experience more prevalent in terms of solving business problems. |
@jackyh I don't think either of us thinks that python isn't the incumbent. It's just nice to have alternatives out there in the ecosystem. Java is still table stakes for many deployment use cases. Strong interop with the python ecosystem allows for easier deployment of models and also allows different use cases like desktop tools written in java (of which there are quite a few) as well as things like graalvm. With javacpp, you actually have a nice packaging mechanism as an alternative for running native deps while still having access to an easier to use programming language that's more performant than python. I can also confirm @Craigacp is on to something with transformers. A significant number of our users deploy big NLP models and would like to see some huggingface-like tooling for their enterprise deployments. I don't see the harm in bootstrapping off of the ecosystem while exposing an easier interface to these tools. |
@Craigacp BTW, that kind of thing is now possible with the JavaCPP Presets for PyTorch, see #1075. However, the underlying C++ API barely supports that use case as it is, so it's not only about Java, it's just not the kind of thing that's being invested in for any other language than Python. However, as @agibsonccc points out, a tool like JavaCPP makes it very easy to manage Python packages all from Java. The only thing engineers need to worry about is the language itself, so it's not that bad of a situation IMO. I often think of Python as the "Bash for AI": It's never going to be as fast as Java or C++, but it gets the job done. :) |
I like this one: Python = Bash for AI :-) matching statement |
For the ML part, lots of things are already covered by project Rapids/Spark |
Hello, First of all, wanted to say thank you for your project. As of this writing, Java is now version 22, and with Java 22, there is Project Panama. With the new way to invoke JNI, it would be great to upgrade your project incorporating the latest JNI from Panama to run CUDA code. Thank you |
Project Panama, from what I understand, is still fairly useless. The point of this project is to directly support c++ (read: not just c). Just because it's newer doesn't mean it's better. Each tool has its use cases. Javacpp's flexibility (and the ability to directly support c++) is what makes it a much better solution for cuda and adjacent tooling.
|
@saudet Hi buddy,
it occurred to me over the last few weeks: what about merging the cuda and opencl stuff here with the work of the guys from the JCuda and JOpenCL projects? I understand there are some fundamental differences, but having more quality devs on a single project could improve project quality as well.
The guys from JCuda opened a discussion at my request here:
https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538
So, if you think it could bring more value as well, feel free to join the discussion.