A Study in Parallel Algorithms : Stream Compaction

There are two main components of stream compaction: scan and scatter.
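To make the pipeline concrete, here is a minimal serial reference sketch, assuming a keep-if-nonzero predicate over ints (the predicate and element type used in this repo may differ); the function name `compactSerial` is illustrative only.

```cuda
#include <vector>

// Serial reference for stream compaction: exclusive scan of the predicate,
// then scatter each kept element to its scanned index.
std::vector<int> compactSerial(const std::vector<int>& in) {
    // Step 1: exclusive scan over the predicate (1 if the element is kept).
    std::vector<int> indices(in.size());
    int sum = 0;
    for (size_t i = 0; i < in.size(); ++i) {
        indices[i] = sum;           // exclusive prefix sum so far
        sum += (in[i] != 0);        // predicate: keep non-zero elements
    }

    // Step 2: scatter each kept element to its compacted position.
    std::vector<int> out(sum);
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] != 0) out[indices[i]] = in[i];
    }
    return out;
}
```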

Here is a comparison of the various methods I used to scan:

As you can see, the serial version is faster for small arrays, but is quickly outmatched as the array length grows. The global-memory version is consistently a bit slower than the shared-memory version, which makes sense: the only difference between them is the extra latency of repeatedly fetching from global memory. The work-efficient algorithm I've implemented likely has a bug, because it only becomes comparable to the naive shared-memory version once the array exceeds 10 million elements. Further investigation is needed.
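For reference, here is a minimal sketch of the naive (Hillis-Steele) scan in shared memory, limited to a single block and double-buffered to avoid a read/write race. The kernel name is illustrative, and the repo's kernels, including how multi-block arrays are stitched together with block sums, may differ.

```cuda
// Inclusive Hillis-Steele scan of one block, using 2 * blockDim.x ints of
// shared memory. Launch with sharedMem = 2 * blockDim.x * sizeof(int).
__global__ void naiveScanShared(const int* in, int* out, int n) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    int pout = 0, pin = 1;

    temp[tid] = (gid < n) ? in[gid] : 0;     // load input into buffer 0
    __syncthreads();

    for (int d = 1; d < blockDim.x; d <<= 1) {
        pout = 1 - pout;                     // swap read/write buffers
        pin  = 1 - pout;
        if (tid >= d)
            temp[pout * blockDim.x + tid] =
                temp[pin * blockDim.x + tid] + temp[pin * blockDim.x + tid - d];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }
    if (gid < n) out[gid] = temp[pout * blockDim.x + tid];
}
```

The naive global-memory variant follows the same recurrence; the difference is that the intermediate buffers live in device memory and are ping-ponged across kernel launches, which is exactly the extra global traffic the comparison above measures.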

And here is a comparison of my scatter implementation and Thrust's. I suspect I'm calling a slow Thrust path, because I wouldn't expect my basic CUDA version to be as fast as Thrust. That said, I'm not sure how to optimize my scatter implementation any further: it has three global memory reads that are strictly necessary, plus one branch.
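For illustration, a scatter kernel with that shape might look like the following sketch. The array names (`bools` for the predicate results, `indices` for the exclusive-scan output) are assumptions, not necessarily what this repo uses.

```cuda
// Scatter step of compaction: each thread reads its input element, its
// predicate flag, and its scanned index (the three global reads), and the
// single branch decides whether to write the element to its compacted slot.
__global__ void scatter(const int* in, const int* bools,
                        const int* indices, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (bools[i]) {
        out[indices[i]] = in[i];
    }
}
```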

REFERENCES

Harris, M., Sengupta, S., and Owens, J. D. "Parallel Prefix Sum (Scan) with CUDA." GPU Gems 3, Chapter 39.
