
fasttensor

C++ library for tensor arithmetic.

Uses SIMD for CPU acceleration and CUDA for GPU acceleration. Supports multiple GPUs when more than one is available. Kernel fusion via expression templates allows efficient evaluation of long arithmetic expressions.

Usage

fasttensor is header-only; simply add the location of the header files to your include path while compiling.

Example code:

#include <array>
#include <cassert>
#include <cstddef>
// Include the fasttensor header(s) from wherever you placed the
// library in your include path.

using namespace fasttensor;
using std::array;
using std::ptrdiff_t;

int main() {
	int num_rows = 4;
	int num_cols = 2;
	// Create an integer tensor of rank 2
	// Dimensions: 4 rows, 2 columns (4x2)
	Tensor<int, 2> a(array<ptrdiff_t, 2>{num_rows, num_cols});
	Tensor<int, 2> b(array<ptrdiff_t, 2>{num_rows, num_cols});

	for (int i = 0; i < num_rows; ++i) {
		for (int j = 0; j < num_cols; ++j) {
			// This is how you set/get elements
			a(i, j) = j + num_cols * i;
			b(i, j) = j + num_cols * i;
		}
	}

	Tensor<int, 2> results(array<ptrdiff_t, 2>{num_rows, num_cols});

	// Element-wise addition of the two tensors
	// This will auto-magically use GPU/SIMD instructions,
	// provided you compile with the appropriate flags and run
	// on suitable hardware
	results = a + b;

	for (int i = 0; i < num_rows; ++i) {
		for (int j = 0; j < num_cols; ++j) {
			// Just checking that we got the right answer
			assert(results(i, j) == 2 * (j + num_cols * i));
		}
	}

	return 0;
}
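
To build the example above (saved as, say, example.cpp), point the compiler at the fasttensor headers. The standard version and paths below are illustrative, not prescribed by the library:

clang++ -std=c++17 -I<path to fasttensor headers> example.cpp -o example

Enabling SIMD or GPU acceleration additionally requires the appropriate compiler flags and hardware; see Running CMake below for the options the library's own build uses.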

Benchmarks

Eager mode is equivalent to a naive implementation of arithmetic expressions, creating a temporary variable after each operation. This behaviour was simulated with a helper function that forces eager evaluation of a given arithmetic expression.

Lazy mode constructs an expression at compile time using expression templates and only evaluates the expression when it is assigned to a tensor, as the sketch below illustrates.
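
To make the difference concrete, here is a minimal, self-contained sketch of the expression-template technique, using a 1-D Vec type in place of fasttensor's tensors. This is illustrative only, not fasttensor's actual implementation:

#include <cstddef>
#include <iostream>
#include <vector>

// CRTP base class so that operator+ only matches expression types.
template <typename E>
struct Expr {
	const E &self() const { return static_cast<const E &>(*this); }
};

// A node representing a pending element-wise addition. Building one
// is cheap: it just stores references to its operands.
template <typename L, typename R>
struct AddExpr : Expr<AddExpr<L, R>> {
	const L &lhs;
	const R &rhs;
	AddExpr(const L &l, const R &r) : lhs(l), rhs(r) {}
	float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
	std::size_t size() const { return lhs.size(); }
};

template <typename L, typename R>
AddExpr<L, R> operator+(const Expr<L> &l, const Expr<R> &r) {
	return AddExpr<L, R>(l.self(), r.self());
}

struct Vec : Expr<Vec> {
	std::vector<float> data;
	explicit Vec(std::size_t n) : data(n) {}
	float operator[](std::size_t i) const { return data[i]; }
	float &operator[](std::size_t i) { return data[i]; }
	std::size_t size() const { return data.size(); }

	// The only loop in the program: the whole expression tree is
	// evaluated here, so a + b + c fuses into a single pass with
	// no temporary vectors.
	template <typename E>
	Vec &operator=(const Expr<E> &e) {
		for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
		return *this;
	}
};

int main() {
	Vec a(4), b(4), c(4), x(4);
	for (std::size_t i = 0; i < 4; ++i) a[i] = b[i] = c[i] = float(i);
	// Builds AddExpr<AddExpr<Vec, Vec>, Vec>; no arithmetic happens yet.
	x = a + b + c; // Evaluated in one loop at assignment
	std::cout << x[3] << "\n"; // prints 9
}

In eager mode, each '+' would instead allocate and fill a temporary, so a four-operand sum would traverse memory three times instead of once.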

Config:

CPU: Intel Xeon E5-2690 v3 @ 2.60 GHz
GPU: NVIDIA Tesla P4
Compiler: Clang 9.0.1
CUDA Toolkit Version: 10.0

Results:

  • The variables are 3-dimensional float tensors of size 10⁴ × 10² × 10² filled with random values.
  • The results were obtained by running 10 trials.
  • Each trial consisted of evaluating the expression 100 times.
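  • For scale: the benchmarked expression performs 3 additions per element; with 10⁸ elements per tensor, one evaluation is 3 × 10⁸ floating-point operations and each trial roughly 3 × 10¹⁰.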

X = A + B + C + D

Devices          Eager Time      Eager GFlops    Lazy Time       Lazy GFlops
AVX2 on CPU      28.26 ± 0.21 s  0.99            17.73 ± 0.05 s  1.58
1 Tesla P4 GPU    2.65 ± 0.00 s  10.56            1.51 ± 0.12 s  18.52
2 Tesla P4 GPUs   1.56 ± 0.20 s  17.92            0.89 ± 0.08 s  31.25

Development

To run the tests and benchmarks on Linux:

(Dependencies: CMake >= 3.14.6, clang++ >= 8, CUDA >= 9)

  1. Clone this repo

  2. mkdir build && cd build

  3. Run CMake to generate build files (detailed instructions below). Add -DBUILD_TESTS=OFF to skip building the tests and -DBUILD_BENCHMARKS=OFF to skip building the benchmarks.

  4. cmake --build .

  5. ./tests to run the tests and ./bench/bench to run the benchmarks

Running CMake

The build can be configured with various build options. The full command to run is:

CXX=<clang++ location> CC=<clang location> cmake .. \
-DDEVICE_TYPE=<NORMAL|SIMD|GPU> -DCMAKE_BUILD_TYPE=<Release|Debug> \
-DCUDA_PATH=<CUDA toolkit path> -DGPU_ARCH=<GPU arch>
  • Use CXX and CC to set the C++ and C compilers to clang++ and clang, respectively.
  • Set DEVICE_TYPE to NORMAL for normal CPU mode, SIMD to use SIMD vectorized instructions, and GPU to use the GPU.
  • Set CMAKE_BUILD_TYPE to Release or Debug depending on your need.
  • Set CUDA_PATH to the location of the CUDA toolkit, and GPU_ARCH to the GPU's CUDA compute capability with the decimal point removed (e.g., for compute capability 3.7, set it to 37). These options are only required if DEVICE_TYPE is GPU.
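
For example, a Release GPU build on a machine with the CUDA toolkit installed in /usr/local/cuda and a Tesla P4 (compute capability 6.1) could be configured as follows; the compiler and toolkit paths here are illustrative:

CXX=/usr/bin/clang++ CC=/usr/bin/clang cmake .. \
-DDEVICE_TYPE=GPU -DCMAKE_BUILD_TYPE=Release \
-DCUDA_PATH=/usr/local/cuda -DGPU_ARCH=61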
