Reviewed ILGPU documentation. #776

Merged 2 commits on Apr 1, 2022
6 changes: 3 additions & 3 deletions Docs/Debugging-and-Profiling.md
@@ -3,15 +3,15 @@ Debugging with the software emulation layer is very convenient due to the very g
Currently, detailed kernel debugging is only possible with the CPU accelerator.
However, we are currently extending the debugging capabilities to also emulate different GPUs in order to test your algorithms on "virtual GPU devices" without needing direct access to the actual GPU hardware (more information about this feature can be found [here](https://github.com/m4rs-mt/ILGPU/pull/402)).

Assertions on GPU hardware devices can be enabled using the `ContextFlags.EnableAssertions` flag (disabled by default when a `Debugger` is not attached to the application).
Assertions on GPU hardware devices can be enabled using the `Assertions()` method of `Context.Builder` (disabled by default when a `Debugger` is not attached to the application).
Note that enabling assertions this way will cause them to be enabled in `Release` builds as well.
Be sure to disable assertions again if you want the best runtime performance.
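As a minimal sketch (assuming the ILGPU 1.x `Context.Create` overload that accepts a builder callback, and a CUDA device at index 0), enabling assertions via the builder looks like this:

```c#
// Sketch: enable assertions on GPU hardware via the builder API.
// Default() applies the standard configuration before the override.
using var context = Context.Create(builder => builder
    .Default()
    .Assertions());
using var accelerator = context.CreateCudaAccelerator(0);
// Debug.Assert(...) calls inside kernels are now active, even in Release builds.
```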

Source-line based debugging information can be turned on via the flag `ContextFlags.EnableDebugInformation` (disabled by default).
Source-line based debugging information can be turned on via the `DebugSymbols()` method of `Context.Builder` (disabled by default).
Note that only the new portable PDB format is supported.
Enabling debug information is essential for identifying problems and hitting breakpoints on GPU hardware.
It is also very useful for kernel profiling as you can link the profiling insights to your source lines.
You may want to disable inlining via `ContextFlags.NoInlining` to significantly increase the accuracy of your debugging information at the expense cost of runtime performance.
You may want to disable inlining via the `Inlining()` method of `Context.Builder` to significantly increase the accuracy of your debugging information at the expense of runtime performance.
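A combined sketch of both settings (the `InliningMode.Disabled` enum value is an assumption; check the overload your ILGPU version exposes):

```c#
// Sketch: turn on source-line debug information and disable inlining
// for more accurate debugging information.
using var context = Context.Create(builder => builder
    .Default()
    .DebugSymbols()
    .Inlining(InliningMode.Disabled)); // assumed enum value
```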

*Note that the inspection of variables, registers, and global memory on GPU hardware is currently not supported.*

4 changes: 2 additions & 2 deletions Docs/Dynamically-Specialized-Kernels.md
@@ -29,8 +29,8 @@ class ...

static void ...(...)
{
using var context = new Context();
using var accl = new CudaAccelerator(context);
using var context = Context.CreateDefault();
using var accl = context.CreateCudaAccelerator(0);

var genericKernel = accl.LoadStreamKernel<ArrayView<int>, int>(GenericKernel);
...
103 changes: 4 additions & 99 deletions Docs/Inside-ILGPU.md
@@ -3,11 +3,8 @@
ILGPU features a modern parallel processing, transformation and compilation model.
It allows parallel code generation and transformation phases to reduce compile time and improve overall performance.

However, parallel code generation in the frontend module is disabled by default.
It can be enabled via the enumeration flag `ContextFlags.EnableParallelCodeGenerationInFrontend`.

The global optimization process can be controlled with the enumeration `OptimizationLevel`.
This level can be specified by passing the desired level to the `ILGPU.Context` constructor.
This level can be specified by passing the desired level to the `Optimize` method of `Context.Builder`.
If the optimization level is not explicitly specified, the level is automatically set to `OptimizationLevel.O1`.

The `OptimizationLevel.O2` level uses additional transformations that increase compile time but yield potentially better GPU code.
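A short sketch of selecting the level through the builder (assuming the `Context.Create` builder-callback overload):

```c#
// Sketch: request the O2 optimization level for potentially better GPU code
// at the cost of longer compile times.
using var context = Context.Create(builder => builder
    .Default()
    .Optimize(OptimizationLevel.O2));
```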
@@ -35,32 +32,6 @@ It can be used to manually compile kernels for a specific platform.
Note that **you do not have to create custom backend instances** on your own when using the ILGPU runtime.
Accelerators already carry associated and configured backends that are used for high-level kernel loading.

```c#
class ...
{
static void Main(string[] args)
{
using (var context = new Context())
{
// Creats a user-defined MSIL backend for .Net code generation
using (var cpuBackend = new DefaultILBackend(context))
{
// Use custom backend
}

// Creates a user-defined backend for NVIDIA GPUs using compute capability 5.0
using (var ptxBackend = new PTXBackend(
context,
PTXArchitecture.SM_50,
TargetPlatform.X64))
{
// Use custom backend
}
}
}
}
```

## IRContext

An `IRContext` manages and caches intermediate-representation (IR) code, which can be reused during the compilation process.
@@ -70,19 +41,6 @@ An `IRContext` is not tied to a specific `Backend` instance and can be reused ac
Note that the main ILGPU `Context` already has an associated `IRContext` that is used for all high-level kernel-loading functions.
Consequently, users are not required to manage their own contexts in general.

```c#
class ...
{
static void Main(string[] args)
{
var context = new Context();

var irContext = new IRContext(context);
// ...
}
}
```

## Compiling Kernels

Kernels can be compiled manually by requesting a code-generation operation from the backend yielding a `CompiledKernel` object.
@@ -93,30 +51,6 @@ Alternatively, you can cast a `CompiledKernel` object to its appropriate backend

We recommend that you use the [high-level kernel-loading concepts of ILGPU](ILGPU-Kernels) instead of the low-level interface.

```c#
class ...
{
public static void MyKernel(Index index, ...)
{
// ...
}

static void Main(string[] args)
{
using var context = new Context();
using var b = new PTXBackend(context, ...);
// Compile kernel using no specific KernelSpecialization settings
var compiledKernel = b.Compile(
typeof(...).GetMethod(nameof(MyKernel), BindingFlags.Public | BindingFlags.Static),
default);

// Cast kernel to backend-specific PTXCompiledKernel to access the PTX assembly
var ptxKernel = compiledKernel as PTXCompiledKernel;
System.IO.File.WriteAllBytes("MyKernel.ptx", ptxKernel.PTXAssembly);
}
}
```

## Loading Compiled Kernels

Compiled kernels have to be loaded by an accelerator first before they can be executed.
@@ -131,35 +65,6 @@ An accelerator object offers different functions to load and configure kernels:
* `LoadKernel`
Loads explicitly and implicitly grouped kernels. However, implicitly grouped kernels will be launched with a group size that is equal to the warp size

```c#
class ...
{
static void Main(string[] args)
{
...
var compiledKernel = backend.Compile(...);

// Load implicitly grouped kernel with an automatically determined group size
var k1 = accelerator.LoadAutoGroupedKernel(compiledKernel);

// Load implicitly grouped kernel with custom group size
var k2 = accelerator.LoadImplicitlyGroupedKernel(compiledKernel);

// Load any kernel (explicitly and implicitly grouped kernels).
// However, implicitly grouped kernels will be dispatched with a group size
// that is equal to the warp size of its associated accelerator
var k3 = accelerator.LoadKernel(compiledKernel);

...

k1.Dispose();
k2.Dispose();
// Leave K3 to the GC
// ...
}
}
```

## Direct Kernel Launching

A loaded kernel can be dispatched using the `Launch` method.
@@ -169,7 +74,7 @@ For performance reasons, we strongly recommend the use of typed kernel launchers
```c#
class ...
{
static void MyKernel(Index index, ArrayView<int> data, int c)
static void MyKernel(Index1D index, ArrayView<int> data, int c)
{
data[index] = index + c;
}
Expand Down Expand Up @@ -210,7 +115,7 @@ These loading methods work similarly to the these versions, e.g. `LoadAutoGroupe
```c#
class ...
{
static void MyKernel(Index index, ArrayView<int> data, int c)
static void MyKernel(Index1D index, ArrayView<int> data, int c)
{
data[index] = index + c;
}
@@ -225,7 +130,7 @@ class ...
using (var k = accelerator.LoadAutoGroupedKernel(compiledKernel))
{
var launcherWithCustomAcceleratorStream =
k.CreateLauncherDelegate<AcceleratorStream, Index, ArrayView<int>>();
k.CreateLauncherDelegate<AcceleratorStream, Index1D, ArrayView<int>>();
launcherWithCustomAcceleratorStream(someStream, buffer.Extent, buffer.View, 1);

...
22 changes: 11 additions & 11 deletions Docs/Kernels.md
@@ -28,7 +28,7 @@ Use explicitly grouped kernels for full control over GPU-kernel dispatching.
class ...
{
static void ImplicitlyGrouped_Kernel(
[Index|Index2|Index3] index,
[Index1D|Index2D|Index3D] index,
[Kernel Parameters]...)
{
// Kernel code
@@ -93,24 +93,24 @@ In contrast to older versions of ILGPU, all kernels loaded with these functions
```c#
class ...
{
static void MyKernel(Index index, ArrayView<int> data, int c)
static void MyKernel(Index1D index, ArrayView<int> data, int c)
{
data[index] = index + c;
}

static void Main(string[] args)
{
...
var buffer = accelerator.Allocate<int>(1024);
var buffer = accelerator.Allocate1D<int>(1024);

// Load a sample kernel MyKernel using one of the available overloads
var kernelWithDefaultStream = accelerator.LoadAutoGroupedStreamKernel<
Index, ArrayView<int>, int>(MyKernel);
Index1D, ArrayView<int>, int>(MyKernel);
kernelWithDefaultStream(buffer.Extent, buffer.View, 1);

// Load a sample kernel MyKernel using one of the available overloads
var kernelWithStream = accelerator.LoadAutoGroupedKernel<
Index, ArrayView<int>, int>(MyKernel);
Index1D, ArrayView<int>, int>(MyKernel);
kernelWithStream(someStream, buffer.Extent, buffer.View, 1);

...
@@ -126,7 +126,7 @@ However, if you require custom control over the low-level kernel-compilation pro

Starting with version [v0.10.0](https://github.com/m4rs-mt/ILGPU/releases/tag/v0.10.0), ILGPU offers the ability to immediately compile and launch kernels via the accelerator methods (similar to those provided by other frameworks).
ILGPU exposes direct `Launch` and `LaunchAutoGrouped` methods via the `Accelerator` class using a new strong-reference based kernel cache.
This cache is used for the new launch methods only and can be disabled via the flag `ContextFlags.DisableKernelLaunchCaching`.
This cache is used for the new launch methods only and can be disabled via the `Caching(CachingMode.NoKernelCaching)` method of `Context.Builder`.
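A one-line sketch of disabling the cache through the builder (using the `CachingMode` value named above; assumes the `Context.Create` builder-callback overload):

```c#
// Sketch: disable the strong-reference kernel-launch cache.
using var context = Context.Create(builder => builder
    .Default()
    .Caching(CachingMode.NoKernelCaching));
```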

```c#
class ...
@@ -136,7 +136,7 @@ class ...

}

static void MyImplicitKernel(Index1 index, ...)
static void MyImplicitKernel(Index1D index, ...)
{

}
@@ -152,10 +152,10 @@ class ...
accl.Launch(stream, MyKernel, < MyKernelConfig >, ...);

// Launch implicitly grouped MyKernel using the default stream
accl.LaunchAutoGrouped(MyImplicitKernel, new Index1(...), ...);
accl.LaunchAutoGrouped(MyImplicitKernel, new Index1D(...), ...);

// Launch implicitly grouped MyKernel using the given stream
accl.LaunchAutoGrouped(stream, MyImplicitKernel, new Index1(...), ...);
accl.LaunchAutoGrouped(stream, MyImplicitKernel, new Index1D(...), ...);
}
}
```
@@ -173,9 +173,9 @@ var ptxKernel = launcher.GetCompiledKernel() as PTXCompiledKernel;
System.IO.File.WriteAllText("Kernel.ptx", ptxKernel.PTXAssembly);
```

You can specify the context flag `ContextFlags.EnableKernelStatistics` to query additional information about compiled kernels.
You can use the `DebugSymbols()` method of `Context.Builder` to enable additional information about compiled kernels.
This includes local functions and consumed local and shared memory.
After enabling the flag, you can get the information from a compiled kernel launcher delegate instance via:
After enabling this option, you can get the information from a compiled kernel launcher delegate instance via:
```c#
// Get kernel information from a kernel launcher instance
var information = launcher.GetKernelInfo();
4 changes: 2 additions & 2 deletions Docs/Math-Functions.md
@@ -7,13 +7,13 @@ The algorithms library offers the `XMath` class that has support for all common
Using the 32-bit overloads ensures that the operations are performed on 32-bit floats on the GPU hardware.
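For illustration, a kernel sketch using a 32-bit `XMath` overload (assumes the `ILGPU.Algorithms` package is referenced and `XMath.Sin(float)` is available, as described above):

```c#
// Sketch: prefer 32-bit XMath overloads inside kernels.
static void MyKernel(Index1D index, ArrayView<float> data)
{
    // XMath.Sin(float) stays in 32-bit precision on the GPU,
    // unlike System.Math.Sin, which operates on doubles.
    data[index] = XMath.Sin(data[index]);
}
```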

### Fast Math
Fast-math can be enabled using the `ContextFlags.FastMath` flag and enables the use of fast (and unprecise) math functions.
Fast-math can be enabled using the `Math(MathMode.Fast)` method of `Context.Builder` and enables the use of fast (but less precise) math functions.
Unlike previous versions, the fast-math mode applies to all math instructions, even to default math operations like `x / y`.
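A minimal sketch of enabling it (assuming the `Context.Create` builder-callback overload):

```c#
// Sketch: enable fast (less precise) math for the whole context.
using var context = Context.Create(builder => builder
    .Default()
    .Math(MathMode.Fast));
```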

### Forced 32-bit Math
Your kernels might rely on third-party functions that are not under your control.
These functions typically depend on the default .Net `Math` class and thus work on 64-bit floating-point operations.
You can force the use of 32-bit floating-point operations in all cases using the `ContextFlags.Force32BitMath` flag.
You can force the use of 32-bit floating-point operations in all cases using the `Math(MathMode.Fast32BitOnly)` method of `Context.Builder`.
Caution: all doubles will be treated as floats to circumvent issues with third-party code.
However, this also affects the address computations of array-view elements.
Avoid using this mode unless you know exactly what you are doing.
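The corresponding builder call, using the `MathMode` value named above, would look roughly like this:

```c#
// Sketch: force 32-bit math everywhere; all doubles are demoted to floats.
using var context = Context.Create(builder => builder
    .Default()
    .Math(MathMode.Fast32BitOnly));
```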
20 changes: 10 additions & 10 deletions Docs/Memory-Buffers-and-Views.md
@@ -10,18 +10,18 @@ *Should be* refers to the fact that all memory buffers will be automatically relea
```c#
class ...
{
public static void MyKernel(Index index, ...)
public static void MyKernel(Index1D index, ...)
{
// ...
}

static void Main(string[] args)
{
using var context = new Context();
using var context = Context.CreateDefault();
using var accelerator = ...;

// Allocate a memory buffer on the current accelerator device.
using (var buffer = accelerator.Allocat<int>(1024))
using (var buffer = accelerator.Allocate1D<int>(1024))
{
...
} // Dispose the buffer after performing all operations
@@ -45,7 +45,7 @@ You can even enable bounds checks in `Release` builds by specifying the context
```c#
class ...
{
static void MyKernel(Index index, ArrayView<int> view1, ArrayView<float> view2)
static void MyKernel(Index1D index, ArrayView<int> view1, ArrayView<float> view2)
{
ConvertToFloatSample(
view1.GetSubView(0, view1.Length / 2),
@@ -61,10 +61,10 @@ class ...
static void Main(string[] args)
{
...
using (var buffer = accelerator.Allocat&lt...&gt(...))
using (var buffer = accelerator.Allocate1D<...>(...))
{
var mainView = buffer.View;
var subView = mainView.GetSubView(0, 1024);
var subView = mainView.SubView(0, 1024);
}
}
}
@@ -86,7 +86,7 @@ mad.lo.u64 %rd4, %rd3, 4, %rd1;
```

When accessing views using 32-bit indices, the resulting index operation will be performed on 32-bit offsets for performance reasons.
As a result, this operation can overflow when using a 2D 32-bit based `Index2`, for instance.
As a result, this operation can overflow when using a 2D 32-bit based `Index2D`, for instance.
If you already know that your offsets will not fit into a 32-bit integer, you have to use 64-bit offsets in your kernel.

If you rely on 64-bit offsets, the emitted indexing operation will be slightly more expensive in terms of register usage and computational overhead (at least conceptually). The actual runtime difference depends on your kernel program.
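A kernel sketch using a 64-bit index (assuming `LongIndex1D` is the 64-bit counterpart of `Index1D` and that `ArrayView` exposes a 64-bit indexer, per the discussion above):

```c#
// Sketch: use a 64-bit index type when offsets may exceed 32 bits.
static void MyKernel(LongIndex1D index, ArrayView<int> view)
{
    // The address computation below is performed on 64-bit offsets.
    view[index] = 42;
}
```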
@@ -104,18 +104,18 @@ class ...
public VariableView<int> Variable;
}

static void MyKernel(Index index, DataView view)
static void MyKernel(Index1D index, DataView view)
{
// ...
}

static void Main(string[] args)
{
// ...
using (var buffer = accelerator.Allocat<...>(...))
using (var buffer = accelerator.Allocate1D<...>(...))
{
var mainView = buffer.View;
var firstElementView = mainView.GetVariableView(0);
var firstElementView = mainView.VariableView(0);
}
}
}
4 changes: 2 additions & 2 deletions Docs/Primer_00.md
@@ -9,7 +9,7 @@ CUDA / OpenCL with the ease of use of C#.

This tutorial is a little different now because we are going to be looking at ILGPU 1.0.0.

ILGPU should work on any 64bit platform that .Net supports. I have even used it on the inexpensive nvidia jetson nano with pretty decent cuda performance.
ILGPU should work on any 64-bit platform that .Net supports. I have even used it on the inexpensive Nvidia Jetson Nano with pretty decent CUDA performance.

Technically ILGPU supports F# but I don't use F# enough to really tutorialize it. I will be sticking to C# in these tutorials.

@@ -21,7 +21,7 @@ If enough people care I can record a short video of this process, but I expect t
2. Create a new C# project.
![dotnet new console](Images/newProject.png?raw=true)
3. Add the ILGPU package
![dotnet add ILGPU](Images/beta.png?raw=true)
![dotnet add package ILGPU](Images/beta.png?raw=true)
4. ??????
5. Profit
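The steps above boil down to a few CLI commands (requires the .NET SDK; the project name is just a placeholder):

```shell
# Create a new console project and add the ILGPU package to it.
dotnet new console -o MyILGPUApp
cd MyILGPUApp
dotnet add package ILGPU
dotnet run
```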
