In this short guide, we will do the following:
- clone ZML to work directly within the prepared example folder
- add Zig code to implement our model
- add some Bazel to integrate our code with ZML
No weights files or anything external is required for this example.
We do this exercise in the examples folder because it is specially prepared for new ZML projects. It contains everything needed for ZML development, from Bazel configs to VS Code settings and Neovim LSP support. The examples folder serves as a cookiecutter ZML project, with just a few example models added already.
Note: The examples folder is self-contained. You can make a copy of it to a location outside of the ZML repository. Simply remove all examples you don't need and use it as a template for your own projects.
So, let's get started, shall we?
If you haven't done so already, please install Bazel.
Check out the ZML repository. In the examples directory, create a new folder for your project. Let's call it simple_layer.
git clone https://github.com/zml/zml.git
cd zml/examples
mkdir -p simple_layer
... and add a file main.zig to it, along with a Bazel build file:
touch simple_layer/main.zig
touch simple_layer/BUILD.bazel
By the way, the complete source code of this walkthrough is listed at the end of this guide.
Before firing up our editor, let's quickly talk about a few basic ZML fundamentals.
In ZML, we describe a Module, which represents our AI model, as a Zig struct. That struct can contain Tensor fields that are used for computation, e.g. weights and biases. In the forward function of a Module, we describe the computation by calling tensor operations like mul, add, dotGeneral, conv2D, etc., or even nested Modules.
ZML creates an MLIR representation of the computation when we compile the Module. For compilation, only the Shapes of all tensors must be known. No actual tensor data is needed at this step. This is important for large models: we can compile them while the actual weight data is being fetched from disk.
To accomplish this, ZML uses a BufferStore. The BufferStore knows how to only load shapes and when to load actual tensor data. In our example, we will fake the BufferStore a bit: we won't load from disk; we'll use float arrays instead.
After compilation is done (and the BufferStore has finished loading weights), we can send the weights from the BufferStore to our computation device. That produces an executable module which we can call with different inputs.
In our example, we then copy the result from the computation device to CPU memory and print it.
So the steps for us are:
- describe the computation as a ZML Module, using tensor operations
- create a BufferStore that provides the Shapes and data of the weights and bias (ca. 5 lines of code)
- compile the Module asynchronously
- make the compiled Module send the weights (and bias) to the computation device using the BufferStore, producing an executable module
- prepare an input tensor and call the executable module
- get the result back to CPU memory and print it
If you'd like to read more about the underlying concepts of the above, please see ZML Concepts.
Let's start by writing some Zig code, importing ZML and often-used modules:
const std = @import("std");
const zml = @import("zml");
const asynk = @import("async");
// shortcut to the asyncc function in the asynk module
const asyncc = asynk.asyncc;
You will probably use the lines above in all ZML projects. Also, note that ZML is async and comes with its own async runtime, thanks to zigcoro.
We will start with a very simple "Model", one that resembles a "multiply and add" operation.
/// Model definition
const Layer = struct {
    bias: ?zml.Tensor = null,
    weight: zml.Tensor,

    pub fn forward(self: Layer, x: zml.Tensor) zml.Tensor {
        var y = self.weight.mul(x);
        if (self.bias) |bias| {
            y = y.add(bias);
        }
        return y;
    }
};
You see, in ZML AI models are just structs with a forward function!
There are more things to observe:
- forward functions typically take Tensors as inputs, and return Tensors.
- more advanced use-cases pass in or return structs or tuples, like struct { Tensor, Tensor } as an example for a tuple of two tensors. You can see such use-cases, for example, in the Llama Model.
- in the model, tensors may be optional, as is the case with bias.
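To make the arithmetic concrete, here is a minimal plain-Zig sketch of the same "multiply and add" with an optional bias. It does not use ZML at all: ReferenceLayer is a hypothetical stand-in that works on fixed-size f16 arrays instead of zml.Tensor, so it only illustrates what the forward function computes element-wise, not how ZML executes it.

```zig
const std = @import("std");

/// Hypothetical plain-Zig stand-in for Layer (not part of ZML):
/// fixed-size arrays take the place of zml.Tensor.
const ReferenceLayer = struct {
    bias: ?[3]f16 = null,
    weight: [3]f16,

    /// Element-wise weight * x, plus bias when one is present.
    pub fn forward(self: ReferenceLayer, x: [3]f16) [3]f16 {
        var y: [3]f16 = undefined;
        for (&y, self.weight, x) |*out, w, xi| {
            out.* = w * xi;
        }
        if (self.bias) |bias| {
            for (&y, bias) |*out, b| {
                out.* += b;
            }
        }
        return y;
    }
};

pub fn main() void {
    const layer = ReferenceLayer{
        .weight = .{ 2.0, 2.0, 2.0 },
        .bias = .{ 1.0, 2.0, 3.0 },
    };
    const y = layer.forward(.{ 5.0, 5.0, 5.0 });
    std.debug.print("{d} {d} {d}\n", .{ y[0], y[1], y[2] });
}
```

With the weights, bias, and input used later in this guide, the three results are 11, 12, and 13. Leaving out .bias skips the addition, just like the optional bias field in the ZML model.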
ZML code is async. Hence, we need to provide an async main function. It works like this:
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    try asynk.AsyncThread.main(gpa.allocator(), asyncMain);
}

pub fn asyncMain() !void {
    // ...
}
The above main() function only creates an allocator and an async main thread that executes our asyncMain() function, which takes no arguments.
So, let's start with the async main function:
pub fn asyncMain() !void {
    // Short lived allocations
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Arena allocator for BufferStore etc.
    var arena_state = std.heap.ArenaAllocator.init(allocator);
    defer arena_state.deinit();
    const arena = arena_state.allocator();

    // Create ZML context
    var context = try zml.Context.init();
    defer context.deinit();

    const platform = context.autoPlatform(.{});
    // ...
}
This is boilerplate code that provides a general-purpose allocator and, for convenience, an arena allocator that we will use later. The advantage of arena allocators is that you don't need to deallocate individual allocations; you simply call .deinit() to deinitialize the entire arena instead!
We also initialize the ZML context and let autoPlatform() pick a platform for us automatically; in this example, that is the CPU.
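The arena behaviour described above can be tried in isolation. The following is a small sketch using only the Zig standard library; fillScratch is a hypothetical helper introduced for illustration, and the page allocator backing the arena is an arbitrary choice for the demo.

```zig
const std = @import("std");

/// Hypothetical helper: allocates `count` scratch buffers of growing size
/// from `arena` and returns the total number of bytes requested.
/// Note that nothing is freed here.
fn fillScratch(arena: std.mem.Allocator, count: usize) !usize {
    var total: usize = 0;
    for (0..count) |i| {
        const buf = try arena.alloc(u8, (i + 1) * 16);
        @memset(buf, 0);
        total += buf.len;
    }
    return total;
}

pub fn main() !void {
    var arena_state = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    // One deinit releases every allocation made through the arena.
    defer arena_state.deinit();

    const total = try fillScratch(arena_state.allocator(), 4);
    std.debug.print("allocated {d} bytes, released by a single deinit\n", .{total});
}
```

The four buffers add up to 160 bytes (16 + 32 + 48 + 64), and none of them needs its own free call.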
Next, we need to set up the concrete weight and bias tensors for our model. Typically, we would load them from disk. But since our example works without stored weights, we are going to create a BufferStore manually, containing HostBuffers (buffers on the CPU) for both the weight and the bias tensor.
A BufferStore basically contains a dictionary with string keys that match the names of the struct fields of our Layer struct. So, let's create this dictionary:
// Our weights and bias to use
var weights = [3]f16{ 2.0, 2.0, 2.0 };
var bias = [3]f16{ 1.0, 2.0, 3.0 };
const input_shape = zml.Shape.init(.{3}, .f16);
// We manually produce a BufferStore. You would not normally do that.
// A BufferStore is usually created by loading model data from a file.
var buffers: zml.aio.BufferStore.Buffers = .{};
try buffers.put(arena, "weight", zml.HostBuffer.fromArray(&weights));
try buffers.put(arena, "bias", zml.HostBuffer.fromArray(&bias));
// the actual BufferStore
const bs: zml.aio.BufferStore = .{
    .arena = arena_state,
    .buffers = buffers,
};
Our weights are {2.0, 2.0, 2.0}, and our bias is just {1.0, 2.0, 3.0}. The shape of the weight and bias tensors is {3}, and because of that, the shape of the input tensor is also going to be {3}!
Note that zml.Shape always takes the data type associated with the tensor. In our example, that is f16, expressed as the enum value .f16.
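The idea of a dictionary whose string keys mirror struct field names can be sketched with the standard library alone. This is an illustration under assumptions, not how ZML's BufferStore is implemented: LayerFields and countMissingKeys are hypothetical names, and plain f16 slices stand in for HostBuffers.

```zig
const std = @import("std");

/// Hypothetical stand-in for the Layer struct; only the field names matter here.
const LayerFields = struct {
    bias: ?[3]f16 = null,
    weight: [3]f16,
};

/// Counts the fields of `T` (optional or not) that have no entry
/// with the same name in `map`.
fn countMissingKeys(comptime T: type, map: *const std.StringHashMapUnmanaged([]const f16)) usize {
    var missing: usize = 0;
    inline for (std.meta.fields(T)) |field| {
        if (!map.contains(field.name)) missing += 1;
    }
    return missing;
}

pub fn main() !void {
    var arena_state = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena_state.deinit();
    const arena = arena_state.allocator();

    const weights = [3]f16{ 2.0, 2.0, 2.0 };
    const bias = [3]f16{ 1.0, 2.0, 3.0 };

    // Keys are spelled exactly like the struct fields.
    var map: std.StringHashMapUnmanaged([]const f16) = .{};
    try map.put(arena, "weight", &weights);
    try map.put(arena, "bias", &bias);

    std.debug.print("missing keys: {d}\n", .{countMissingKeys(LayerFields, &map)});
}
```

With both entries present, no key is missing. Misspelling a key, for example "weights" instead of "weight", would leave the corresponding field without data, which is why the key names have to match the field names exactly.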
We're only going to use the CPU for our simple model, but we need to compile the forward() function nonetheless. This compilation is usually done asynchronously. That means we can continue doing other things while the module is compiling:
// A clone of our model, consisting of shapes. We only need shapes for compiling.
// We use the BufferStore to infer the shapes.
const model_shapes = try zml.aio.populateModel(Layer, allocator, bs);
// Start compiling. This uses the inferred shapes from the BufferStore.
// The shape of the input tensor, we have to pass in manually.
var compilation = try asyncc(
    zml.compileModel,
    .{ allocator, Layer.forward, model_shapes, .{input_shape}, platform },
);
// Produce a bufferized weights struct from the fake BufferStore.
// This is like the inferred shapes, but with actual values.
// We will need to send those to the computation device later.
var model_weights = try zml.aio.loadBuffers(Layer, .{}, bs, arena, platform);
defer zml.aio.unloadBuffers(&model_weights); // for good practice
// Wait for compilation to finish
const compiled = try compilation.awaitt();
Compiling happens in the background via the asyncc function. We call asyncc with the zml.compileModel function and its arguments separately. The arguments are the allocator, the function to compile (Layer.forward), the shapes of the weights inferred from the BufferStore, the shape of the input tensor(s), and the platform for which to compile (we used the auto platform).
Now that we have compiled the module utilizing the shapes, we turn it into an executable.
// pass the model weights to the compiled module to create an executable module
// all required memory has been allocated in `compile`.
var executable = compiled.prepare(model_weights);
defer executable.deinit();
The executable can now be invoked with an input of our choice.
To create the input, we directly use zml.Buffer by calling zml.Buffer.from() with a HostBuffer that wraps our input array. It's important to note that Buffers reside in accelerator (or device) memory, which is precisely where the input needs to be for the executable to process it on the device.
For clarity, let's recap the distinction: HostBuffers are located in standard host memory, which is accessible by the CPU. When we initialized the weights, we used HostBuffers to set up the BufferStore. This is because the BufferStore typically loads weights from disk into HostBuffers, and then converts them into Buffers when we call loadBuffers().
However, for inputs, we bypass the BufferStore and create Buffers directly in device memory.
// prepare an input buffer
// Here, we use zml.HostBuffer.fromSlice to show how you would create a
// HostBuffer with a specific shape from an array.
// For situations where e.g. you have a [4]f16 array but need a .{2, 2} input
// shape.
var input = [3]f16{ 5.0, 5.0, 5.0 };
var input_buffer = try zml.Buffer.from(
    platform,
    zml.HostBuffer.fromSlice(input_shape, &input),
);
defer input_buffer.deinit();

// call our executable module
var result: zml.Buffer = executable.call(.{input_buffer});
defer result.deinit();

// fetch the result buffer to CPU memory
const cpu_result = try result.toHostAlloc(arena);
std.debug.print(
    "\n\nThe result of {d} * {d} + {d} = {d}\n",
    .{ &weights, &input, &bias, cpu_result.items(f16) },
);
Note that the result of a computation usually resides in the memory of the computation device, so with .toHostAlloc() we bring it back to CPU memory in the form of a HostBuffer. After that, we can print it.
In order to print it, we need to tell the host buffer how to interpret the memory. We do that by calling .items(f16), making it cast the memory to f16 items.
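To illustrate what "interpreting the memory" means, here is a sketch in plain Zig that views untyped bytes as f16 items without copying. It assumes a buffer that stores its data as raw bytes; itemsAs is a hypothetical helper written for this example and is not ZML's implementation of .items().

```zig
const std = @import("std");

/// Hypothetical helper: reinterprets raw, suitably aligned bytes
/// as a slice of `T` without copying.
fn itemsAs(comptime T: type, bytes: []align(@alignOf(T)) const u8) []const T {
    return std.mem.bytesAsSlice(T, bytes);
}

pub fn main() void {
    const values = [3]f16{ 11.0, 12.0, 13.0 };
    const slice: []const f16 = &values;

    // View the values as untyped bytes, the way a generic buffer might hold them.
    const raw = std.mem.sliceAsBytes(slice);

    // Tell the bytes how to be read again: as f16 items.
    const typed = itemsAs(f16, raw);
    std.debug.print("{d} bytes -> {d} items, first = {d}\n", .{ raw.len, typed.len, typed[0] });
}
```

Three f16 values occupy 6 bytes, and reading them back with the matching element type recovers the original 3 items. Reading the same bytes with a different element type would produce different, usually meaningless, values, which is why the element type passed to .items() has to match the data type of the tensor.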
And that's it! Now, let's have a look at building and actually running this example!
As mentioned already, ZML uses Bazel; so to build our model, we just need to create a simple BUILD.bazel file, next to the main.zig file, like this:
load("@zml//bazel:zig.bzl", "zig_cc_binary")
zig_cc_binary(
    name = "simple_layer",
    main = "main.zig",
    deps = [
        "@zml//async",
        "@zml//zml",
    ],
)
To produce an executable, we import zig_cc_binary from the Zig rules, and pass it a name and the Zig file we just wrote. The dependencies in deps are what's needed for a basic ZML executable and correlate with our imports at the top of the Zig file:
const zml = @import("zml");
const asynk = @import("async");
With everything in place now, running the model is easy:
# run release (-c opt)
cd examples
bazel run -c opt //simple_layer
# compile and run debug version
bazel run //simple_layer
And voilà! Here's the output:
bazel run -c opt //simple_layer
INFO: Analyzed target //simple_layer:simple_layer (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //simple_layer:simple_layer up-to-date:
bazel-bin/simple_layer/simple_layer
INFO: Elapsed time: 0.120s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/simple_layer/simple_layer
info(pjrt): Loaded library: libpjrt_cpu.dylib
info(zml_module): Compiling main.Layer.forward with { Shape({3}, dtype=.f16) }
The result of { 2, 2, 2 } * { 5, 5, 5 } + { 1, 2, 3 } = { 11, 12, 13 }
You can access the complete source code of this walkthrough here:
const std = @import("std");
const zml = @import("zml");
const asynk = @import("async");
const asyncc = asynk.asyncc;
/// Model definition
const Layer = struct {
    bias: ?zml.Tensor = null,
    weight: zml.Tensor,

    pub fn forward(self: Layer, x: zml.Tensor) zml.Tensor {
        var y = self.weight.mul(x);
        if (self.bias) |bias| {
            y = y.add(bias);
        }
        return y;
    }
};
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    try asynk.AsyncThread.main(gpa.allocator(), asyncMain);
}
pub fn asyncMain() !void {
    // Short lived allocations
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Arena allocator for BufferStore etc.
    var arena_state = std.heap.ArenaAllocator.init(allocator);
    defer arena_state.deinit();
    const arena = arena_state.allocator();

    // Create ZML context
    var context = try zml.Context.init();
    defer context.deinit();

    const platform = context.autoPlatform(.{});

    // Our weights and bias to use
    var weights = [3]f16{ 2.0, 2.0, 2.0 };
    var bias = [3]f16{ 1.0, 2.0, 3.0 };
    const input_shape = zml.Shape.init(.{3}, .f16);

    // We manually produce a BufferStore. You would not normally do that.
    // A BufferStore is usually created by loading model data from a file.
    var buffers: zml.aio.BufferStore.Buffers = .{};
    try buffers.put(arena, "weight", zml.HostBuffer.fromArray(&weights));
    try buffers.put(arena, "bias", zml.HostBuffer.fromArray(&bias));

    // the actual BufferStore
    const bs: zml.aio.BufferStore = .{
        .arena = arena_state,
        .buffers = buffers,
    };

    // A clone of our model, consisting of shapes. We only need shapes for
    // compiling. We use the BufferStore to infer the shapes.
    const model_shapes = try zml.aio.populateModel(Layer, allocator, bs);

    // Start compiling. This uses the inferred shapes from the BufferStore.
    // The shape of the input tensor, we have to pass in manually.
    var compilation = try asyncc(zml.compileModel, .{ allocator, Layer.forward, model_shapes, .{input_shape}, platform });

    // Produce a bufferized weights struct from the fake BufferStore.
    // This is like the inferred shapes, but with actual values.
    // We will need to send those to the computation device later.
    var model_weights = try zml.aio.loadBuffers(Layer, .{}, bs, arena, platform);
    defer zml.aio.unloadBuffers(&model_weights); // for good practice

    // Wait for compilation to finish
    const compiled = try compilation.awaitt();

    // pass the model weights to the compiled module to create an executable
    // module
    var executable = compiled.prepare(model_weights);
    defer executable.deinit();

    // prepare an input buffer
    // Here, we use zml.HostBuffer.fromSlice to show how you would create a
    // HostBuffer with a specific shape from an array.
    // For situations where e.g. you have a [4]f16 array but need a .{2, 2}
    // input shape.
    var input = [3]f16{ 5.0, 5.0, 5.0 };
    var input_buffer = try zml.Buffer.from(
        platform,
        zml.HostBuffer.fromSlice(input_shape, &input),
    );
    defer input_buffer.deinit();

    // call our executable module
    var result: zml.Buffer = executable.call(.{input_buffer});
    defer result.deinit();

    // fetch the result to CPU memory
    const cpu_result = try result.toHostAlloc(arena);
    std.debug.print(
        "\n\nThe result of {d} * {d} + {d} = {d}\n",
        .{ &weights, &input, &bias, cpu_result.items(f16) },
    );
}