Be aware of the competing fast GPU neural network library CXXNET #382

Closed
kloudkl opened this issue May 3, 2014 · 6 comments

Comments

@kloudkl
Contributor

kloudkl commented May 3, 2014

Since this February, there has been a "(convolutional) neural network toolkit", CXXNET, based on mshadow, the "Lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA". The toolkit can classify 400 images per second, i.e. about 35 million per day, on a GTX 780 GPU. That seems faster than Caffe, which can process 20 million per day on a K20 and 40 million per day on a K40.
Since CXXNET is using the tensor library, its code is also much more concise than Caffe's.
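
To make the units comparable, here is a minimal sketch of the conversion between images per second and images per day. The function names are made up for illustration. At 400 images per second the daily figure is 34.56 million, and the quoted 20 and 40 million per day correspond to roughly 231 and 463 images per second.

```cpp
#include <cstdint>

// Seconds in one day: 24 h * 60 min * 60 s = 86400.
constexpr std::int64_t kSecondsPerDay = 24 * 60 * 60;

// Images classified per day at a sustained rate of `per_second` images/s.
constexpr std::int64_t images_per_day(std::int64_t per_second) {
    return per_second * kSecondsPerDay;
}

// Sustained images/s needed to reach `per_day` images in one day.
constexpr double images_per_second(std::int64_t per_day) {
    return static_cast<double>(per_day) / static_cast<double>(kSecondsPerDay);
}
```

Both helpers assume the rate is sustained for the whole day, which is how the per-day figures in this thread appear to be derived.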

@shelhamer
Member

Actually, Caffe achieves comparable speed by classifying 395 images per second or ~35 million per day on a GTX 780.

If I'm not mistaken, the 780 actually has a higher clock speed (875 MHz) and memory transfer rate (336 GB/s) than the K20 (705 MHz and 208 GB/s), and even the K40 with default settings (745 MHz and 288 GB/s). With the highest boost clock setting the K40 runs at 875 MHz, and this is the setting we choose for our benchmarks, although it was never clear to me whether that is a peak or sustained speed.

@sguada
Contributor

sguada commented May 5, 2014

In fact, CXXNET relies on cuBLAS in the same way Caffe does. Even the convolutions are implemented the same way, using an im2col step followed by matrix multiplications. It seems to me that the authors were inspired by Caffe.

PS: Caffe can classify 500 images per second on a Titan or on a K40 at full speed.
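
For readers unfamiliar with the im2col approach mentioned above, here is a minimal CPU-only sketch. It is not code from Caffe or CXXNET: the function names are hypothetical, stride is fixed at 1 with no padding, and a naive triple loop stands in for the BLAS matrix multiply.

```cpp
#include <cstddef>
#include <vector>

// Unroll a C x H x W image (row-major) into a (C*K*K) x (out_h*out_w) matrix.
// Each column holds the receptive field of one output pixel.
std::vector<float> im2col(const std::vector<float>& image, std::size_t channels,
                          std::size_t height, std::size_t width,
                          std::size_t ksize) {
    const std::size_t out_h = height - ksize + 1;
    const std::size_t out_w = width - ksize + 1;
    const std::size_t cols = out_h * out_w;
    std::vector<float> col(channels * ksize * ksize * cols);
    for (std::size_t c = 0; c < channels; ++c)
        for (std::size_t ky = 0; ky < ksize; ++ky)
            for (std::size_t kx = 0; kx < ksize; ++kx) {
                const std::size_t row = (c * ksize + ky) * ksize + kx;
                for (std::size_t y = 0; y < out_h; ++y)
                    for (std::size_t x = 0; x < out_w; ++x)
                        col[row * cols + y * out_w + x] =
                            image[(c * height + y + ky) * width + x + kx];
            }
    return col;
}

// Naive row-major product: result (m x n) = a (m x k) * b (k x n).
// In a real library this is where the BLAS sgemm call would go.
std::vector<float> matmul(const std::vector<float>& a,
                          const std::vector<float>& b, std::size_t m,
                          std::size_t k, std::size_t n) {
    std::vector<float> result(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                result[i * n + j] += a[i * k + p] * b[p * n + j];
    return result;
}

// Convolution (cross-correlation, no kernel flip) as im2col plus one matrix
// product. `filters` is num_filters x (C*K*K), row-major.
std::vector<float> conv2d_im2col(const std::vector<float>& image,
                                 const std::vector<float>& filters,
                                 std::size_t num_filters, std::size_t channels,
                                 std::size_t height, std::size_t width,
                                 std::size_t ksize) {
    const std::size_t out_h = height - ksize + 1;
    const std::size_t out_w = width - ksize + 1;
    const std::vector<float> col = im2col(image, channels, height, width, ksize);
    return matmul(filters, col, num_filters, channels * ksize * ksize,
                  out_h * out_w);
}
```

The point of the layout is that all the arithmetic ends up in a single large matrix product, so the speed is largely determined by the BLAS library underneath.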

@kloudkl
Contributor Author

kloudkl commented May 5, 2014

Considering the hardware specifications of the GTX 780 and the [Tesla K40](http://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf), there is no big difference in speed.

There is no doubt that the authors borrowed from Caffe. But some parts of CXXNET are indeed good enough to learn from. To name a few examples: the element-wise operations, the all-encompassing main function that corresponds to the separate Caffe tools, the unified model config file, which is still a to-do task here, the data class, which is perhaps what our DataSource should be, and the layers using the concise tensor API. Even if its implementation is not going to be borrowed back, it reminds us that others are quickly catching up.

@sergeyk
Contributor

sergeyk commented May 6, 2014

I like how minimal everything is in cxxnet. We should consider using mshadow or an mshadow-like approach and not have separate cpu/gpu code for all layers.

@Yangqing
Member

Yangqing commented May 7, 2014

I like the idea of cxxnet too. In fact, I sort of wanted to write a tensor interface, but then the quick rewriting back in November led to a (crappy) matrix library: you can see that there are switches everywhere that just call either caffe_gpu_* or caffe_cpu_*. If someone wants to give it a cleaning try, that would be great, but it will mostly be simple code refactoring. (Note that cxxnet just hides the "ugly" cpu and gpu separations deeper in mshadow :)). Speedwise, things won't be much different if one uses the same BLAS library.

Yangqing
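
As a rough illustration of the refactoring described above, the sketch below moves the CPU/GPU switch into one dispatch function so that layer code does not repeat it. All names here are hypothetical and are not the actual caffe_cpu_* or caffe_gpu_* signatures. The "GPU" branch is a host-side placeholder so that the example compiles without CUDA.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical device selector.
enum class Device { kCPU, kGPU };

// Host implementation of y = alpha * x + y.
void cpu_axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); ++i) y[i] += alpha * x[i];
}

// Placeholder for a device implementation. Real code would operate on device
// pointers (for example through a BLAS call); here it simply reuses the host
// loop so the sketch stays runnable.
void gpu_axpy(float alpha, const std::vector<float>& x, std::vector<float>& y) {
    cpu_axpy(alpha, x, y);
}

// Single entry point: the device switch lives here once instead of being
// repeated inside every layer.
void axpy(Device device, float alpha, const std::vector<float>& x,
          std::vector<float>& y) {
    if (x.size() != y.size()) throw std::invalid_argument("size mismatch");
    switch (device) {
        case Device::kCPU: cpu_axpy(alpha, x, y); break;
        case Device::kGPU: gpu_axpy(alpha, x, y); break;
    }
}
```

A layer would then call `axpy(device, alpha, x, y)` and stay free of device-specific branches. As noted above, this only relocates the separation and does not change speed.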


@tqchen

tqchen commented May 12, 2014

I was brought to this thread by @kloudkl. Bing and I are glad that mshadow and cxxnet are being noticed. Indeed, we learned from Caffe when implementing cxxnet, specifically the im2col way of doing convolution, which was new to us.

There should not be a significant speed difference between the two implementations, though cxxnet unpacks and packs multiple images at a time to do the convolution, and I don't know whether the most recent version of Caffe already supports that.

I would like to advertise mshadow a bit :) mshadow itself is also concise, with 3k lines of code and only 4 CUDA kernels so far, thanks to the use of expression templates. It would be great if some parts of Caffe could also use mshadow. Because mshadow can take an existing pointer and run on it, this could easily be done without replacing the blob structure, while allowing expressions to be written in update rules and layers.
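
For context, here is a minimal sketch of the general expression-template technique, written over a non-owning view of existing memory. It is not mshadow's API: every type and function name below is made up for illustration, and it covers only element-wise add and multiply on the CPU.

```cpp
#include <cstddef>

// CRTP base: every expression type derives from Expr<itself>.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Lazy element-wise sum; nothing is computed until eval(i) is called.
template <typename L, typename R>
struct AddExpr : Expr<AddExpr<L, R>> {
    const L& lhs;
    const R& rhs;
    AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    float eval(std::size_t i) const { return lhs.eval(i) + rhs.eval(i); }
};

// Lazy element-wise product.
template <typename L, typename R>
struct MulExpr : Expr<MulExpr<L, R>> {
    const L& lhs;
    const R& rhs;
    MulExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    float eval(std::size_t i) const { return lhs.eval(i) * rhs.eval(i); }
};

// Non-owning view over memory allocated elsewhere (for example a blob's data).
struct VecView : Expr<VecView> {
    float* ptr;
    std::size_t size;
    VecView(float* p, std::size_t n) : ptr(p), size(n) {}
    float eval(std::size_t i) const { return ptr[i]; }

    // Assigning an expression runs one fused loop with no temporaries.
    template <typename E>
    VecView& operator=(const Expr<E>& e) {
        const E& src = e.self();
        for (std::size_t i = 0; i < size; ++i) ptr[i] = src.eval(i);
        return *this;
    }
};

template <typename L, typename R>
AddExpr<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return AddExpr<L, R>(l.self(), r.self());
}

template <typename L, typename R>
MulExpr<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return MulExpr<L, R>(l.self(), r.self());
}
```

With views `a`, `b`, `c`, `d` wrapping existing buffers, `a = b + c * d;` compiles down to a single loop over the elements, which is why one generic evaluation routine can serve many different update rules.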
