algorithm_benchmark

Algorithm Benchmark is a set of benchmarks aimed at measuring the performance of realistic workloads.

Test name Description Params L0 OCL
Heat3D A 3D heat-equation solving benchmark that overlaps IPC data transfers (nearest-neighbor halo exchange) and GPU compute kernels. Measures multi-process concurrent kernel execution and IPC memory transfer performance on a single device. Linux-only.
  • --meshLength Number of mesh points along each of the X-Y-Z directions
  • --subDomainX Number of sub-domains in the X-direction
  • --subDomainY Number of sub-domains in the Y-direction
  • --subDomainZ Number of sub-domains in the Z-direction
  • --timesteps Number of simulation timesteps
✔️
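
For orientation, the sketch below shows the kind of per-timestep overlap Heat3D measures. It is a minimal illustration rather than the benchmark's actual source: all handles, pointer names, and the synchronization scheme are assumed, and a real run would target IPC-mapped buffers owned by neighbouring processes.

```c
#include <level_zero/ze_api.h>
#include <stdint.h>

/* Hypothetical sketch: advance one sub-domain while exchanging a halo plane.
 * computeList/copyList are assumed to be initialized immediate command lists;
 * neighbourHalo is assumed to be a buffer imported via zeMemOpenIpcHandle. */
void timestep_loop(ze_command_list_handle_t computeList,
                   ze_command_list_handle_t copyList,
                   ze_kernel_handle_t heatKernel, ze_group_count_t groups,
                   void *neighbourHalo, const void *myBoundary,
                   size_t haloBytes, uint32_t timesteps) {
    for (uint32_t step = 0; step < timesteps; ++step) {
        /* compute on the interior points... */
        zeCommandListAppendLaunchKernel(computeList, heatKernel, &groups,
                                        NULL, 0, NULL);
        /* ...while the halo exchange runs on a separate command list */
        zeCommandListAppendMemoryCopy(copyList, neighbourHalo, myBoundary,
                                      haloBytes, NULL, 0, NULL);
        zeCommandListHostSynchronize(computeList, UINT64_MAX);
        zeCommandListHostSynchronize(copyList, UINT64_MAX);
    }
}
```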

api_overhead_benchmark

API Overhead Benchmark is a set of tests aimed at measuring the CPU-side execution duration of compute API calls.

Test name Description Params L0 OCL
AppendLaunchKernel measures time spent in zeCommandListAppendLaunchKernel on CPU.
  • --appendCount Number of appends to run
  • --event Pass output event to the enqueue call (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size, pass 0 to make the driver calculate it during enqueue
✔️
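
As a reference for what is being timed, here is a minimal sketch of the append call with the documented knobs mapped onto the Level Zero API (the helper and its handles are illustrative, not the benchmark's code):

```c
#include <level_zero/ze_api.h>

/* Illustrative only: --wgs maps to zeKernelSetGroupSize, --wgc to the
 * ze_group_count_t dispatch size, --event to a non-NULL signal event. */
void append_launch_kernel(ze_command_list_handle_t cmdList,
                          ze_kernel_handle_t kernel, uint32_t wgs, uint32_t wgc,
                          ze_event_handle_t signalEvent /* NULL if --event=0 */) {
    zeKernelSetGroupSize(kernel, wgs, 1, 1);
    ze_group_count_t groups = {wgc, 1, 1};
    /* The CPU-side duration of this call is what the test reports. */
    zeCommandListAppendLaunchKernel(cmdList, kernel, &groups, signalEvent,
                                    0, NULL);
}
```
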
AppendWaitOnEventsImmediate measures time spent in zeCommandListAppendWaitOnEvents using an immediate command list.
  • --eventSignaled Event is already signaled before zeCommandListAppendWaitOnEvents call (0 or 1)
  • --ioq Use In order queue (0 or 1)
✔️
CommandListHostSynchronize measures CPU time spent in zeCommandListHostSynchronize. Optionally, appends an event-signalling barrier and waits for the event before calling zeCommandListHostSynchronize.
  • --UseBarrierBeforeSync Append an event-signalling barrier before synchronization (0 or 1)
✔️
CreateBuffer measures time spent in clCreateBuffer on CPU.
  • --allocateAll Free buffers at the end of the test, as opposed to freeing between iterations. This should disallow resource reuse (0 or 1)
  • --bufferSize Buffer size
  • --copyHostPtr CL_MEM_COPY_HOST_PTR flag (0 or 1)
  • --forceHostMemoryIntel CL_MEM_FORCE_HOST_MEMORY_INTEL flag (0 or 1)
  • --readOnly Read only buffer (0 or 1)
✔️
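
A minimal sketch of the timed call under the documented flags (illustrative helper; CL_MEM_FORCE_HOST_MEMORY_INTEL, an Intel extension flag, is omitted):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Illustrative only: maps --readOnly and --copyHostPtr onto clCreateBuffer
 * flags; context is assumed to be created elsewhere. */
cl_mem create_buffer(cl_context context, size_t bufferSize,
                     int readOnly, int copyHostPtr, void *hostPtr) {
    cl_mem_flags flags = readOnly ? CL_MEM_READ_ONLY : CL_MEM_READ_WRITE;
    if (copyHostPtr)
        flags |= CL_MEM_COPY_HOST_PTR;  /* requires a valid hostPtr */
    cl_int err = CL_SUCCESS;
    /* The CPU-side duration of this call is what the test reports. */
    cl_mem buffer = clCreateBuffer(context, flags, bufferSize,
                                   copyHostPtr ? hostPtr : NULL, &err);
    return (err == CL_SUCCESS) ? buffer : NULL;
}
```
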
CreateCommandList measures time spent in zeCommandListCreate on CPU.
  • --CmdListCount Number of cmdlists to create
  • --CopyOnly Create copy only cmdlist (0 or 1)
✔️
CreateCommandListImmediate measures time spent in zeCommandListCreateImmediate on CPU.
  • --CmdListCount Number of cmdlists to create
  • --ioq Use In order queue (0 or 1)
✔️
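
A minimal sketch of the timed call; the mapping of --ioq to ZE_COMMAND_QUEUE_FLAG_IN_ORDER is an assumption and requires a recent Level Zero header:

```c
#include <level_zero/ze_api.h>

/* Illustrative only: context/device assumed valid. */
ze_command_list_handle_t create_immediate_cmdlist(ze_context_handle_t context,
                                                  ze_device_handle_t device,
                                                  int ioq) {
    ze_command_queue_desc_t desc = {ZE_STRUCTURE_TYPE_COMMAND_QUEUE_DESC};
    desc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
    if (ioq)
        desc.flags = ZE_COMMAND_QUEUE_FLAG_IN_ORDER; /* assumed --ioq mapping */
    ze_command_list_handle_t cmdList = NULL;
    /* The CPU-side duration of this call is what the test reports. */
    zeCommandListCreateImmediate(context, device, &desc, &cmdList);
    return cmdList;
}
```
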
DestroyCommandList measures time spent in zeCommandListDestroy on CPU.
  • --CmdListCount Number of cmdlists to destroy
✔️
DestroyCommandListImmediate measures time spent in zeCommandListDestroy on CPU for an immediate cmdlist.
  • --CmdListCount Number of immediate cmdlists to destroy
  • --ioq Use In order queue (0 or 1)
✔️
DriverGet measures time spent in driver get call on CPU.
  • --getDriverCount Whether to measure driver count or driver get (0 or 1)
✔️
DriverGetApiVersion measures time spent in the zeDriverGetApiVersion call on CPU.
✔️
DriverGetProperties measures time spent in the zeDriverGetProperties call on CPU.
✔️
EnqueueNdrNullLws measures time spent in clEnqueueNDRangeKernel on CPU. A null LWS is provided, which causes the driver to calculate it.
  • --event Pass output event to the enqueue call (0 or 1)
  • --gws Global work size
  • --ooq Use out of order queue (0 or 1)
  • --profiling Create a profiling queue (0 or 1)
✔️
EnqueueNdrTime measures time spent in clEnqueueNDRangeKernel on CPU.
  • --event Pass output event to the enqueue call (0 or 1)
  • --ooq Use out of order queue (0 or 1)
  • --profiling Create a profiling queue (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size
✔️
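
A minimal sketch of the timed enqueue, assuming the global size is --wgc * --wgs (illustrative helper; passing a NULL local size instead is the EnqueueNdrNullLws variant above):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Illustrative only: a 1-dimensional dispatch. */
void enqueue_ndr(cl_command_queue queue, cl_kernel kernel,
                 size_t wgc, size_t wgs,
                 cl_event *event /* NULL if --event=0 */) {
    size_t gws = wgc * wgs;
    /* The CPU-side duration of this call is what the test reports. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, &wgs, 0, NULL, event);
}
```
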
EventCreation measures time spent creating events.
  • --eventCount Number of events to create
  • --hostVisible Event will set the host-visible flag (0 or 1)
  • --signal Type of signal scope (subdevice or device or host or none)
  • --useProfiling Event will use profiling (0 or 1)
  • --wait Type of wait scope (subdevice or device or host or none)
✔️
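
A minimal sketch of what is being created, with the documented knobs mapped onto the event-pool and event descriptors (illustrative helper, not the benchmark's code):

```c
#include <level_zero/ze_api.h>

/* Illustrative only: one pool backing --eventCount events. */
void create_events(ze_context_handle_t context, ze_device_handle_t device,
                   uint32_t eventCount, ze_event_handle_t *events) {
    ze_event_pool_desc_t poolDesc = {ZE_STRUCTURE_TYPE_EVENT_POOL_DESC};
    poolDesc.flags = ZE_EVENT_POOL_FLAG_HOST_VISIBLE; /* --hostVisible=1 */
    poolDesc.count = eventCount;
    ze_event_pool_handle_t pool = NULL;
    zeEventPoolCreate(context, &poolDesc, 1, &device, &pool);
    for (uint32_t i = 0; i < eventCount; ++i) {
        ze_event_desc_t eventDesc = {ZE_STRUCTURE_TYPE_EVENT_DESC};
        eventDesc.index = i;
        eventDesc.signal = ZE_EVENT_SCOPE_FLAG_HOST; /* --signal=host */
        eventDesc.wait = ZE_EVENT_SCOPE_FLAG_HOST;   /* --wait=host */
        zeEventCreate(pool, &eventDesc, &events[i]); /* the timed call */
    }
}
```
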
EventQueryStatus measures time spent querying event status.
  • --eventSignaled Event will be set as signaled (0 or 1)
✔️
ExecImmediate measures time spent appending a launch kernel to an immediate command list on CPU.
  • --BarrierSynchro Use barrier synchronization instead of waiting for the event from the last kernel (0 or 1)
  • --CallsCount Number of calls being measured
  • --EventSync If true, use events to synchronize with the host. If false, use zeCommandListHostSynchronize (0 or 1)
  • --KernelExecTime How long a single kernel executes, in us
  • --MeasureCompletion Measures time taken to complete the submission (default is to measure only the Immediate call) (0 or 1)
  • --Profiling Pass a profiling ze_event_t to the API call (0 or 1)
  • --ioq Use In order queue (0 or 1)
✔️
ExecImmediateCopyQueue measures time spent appending a memory copy to an immediate command list on CPU with a copy queue.
  • --IsCopyOnly If true, Copy Engine is selected. If false, Compute Engine is selected (0 or 1)
  • --MeasureCompletionTime Measures time taken to complete the submission (default is to measure only the Immediate call) (0 or 1)
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --ioq Use In order queue (0 or 1)
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
ExecImmediateMultiKernel measures time spent executing multiple instances of two different kernels with an immediate command list on CPU.
  • --AfterBarrierCnt Adds the given number of kernels after the barrier. Default is 2
  • --Barrier Add a barrier after a certain number of kernel launches; the number of kernels before the barrier is controlled by BeforeBarrierCnt (0 or 1)
  • --BeforeBarrierCnt Adds the given number of kernels prior to the barrier. Default is 2
  • --CallsCount Number of calls being measured
  • --ExecTime Approximately how long a single kernel executes, in us
  • --ioq Use In order queue (0 or 1)
✔️
ExecuteCommandList measures time spent in zeCommandQueueExecuteCommandLists on CPU.
  • --UseFence Pass a non-null ze_fence_handle_t to the API call (0 or 1)
  • --measureCompletionTime Measures time taken to complete the submission (default is to measure only the Execute call) (0 or 1)
✔️
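
A minimal sketch of the timed submission, including the optional fence and completion wait (illustrative helper):

```c
#include <level_zero/ze_api.h>
#include <stdint.h>

/* Illustrative only: submits one closed command list; fence is non-NULL when
 * --UseFence=1, and the synchronize covers --measureCompletionTime=1. */
void execute_cmdlist(ze_command_queue_handle_t queue,
                     ze_command_list_handle_t cmdList,
                     ze_fence_handle_t fence, int measureCompletion) {
    /* The CPU-side duration of this call is what the test reports. */
    zeCommandQueueExecuteCommandLists(queue, 1, &cmdList, fence);
    if (measureCompletion)
        zeCommandQueueSynchronize(queue, UINT64_MAX);
}
```
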
ExecuteCommandListForCopyEngine measures CPU time spent in zeCommandQueueExecuteCommandLists for the copy-only path.
  • --UseFence Pass a non-null ze_fence_handle_t to the API call (0 or 1)
  • --measureCompletionTime Measures time taken to complete the submission (default is to measure only the Execute call) (0 or 1)
✔️
ExecuteCommandListWithFenceCreate measures time spent in zeFenceCreate on CPU when fences are used.
✔️
ExecuteCommandListWithFenceDestroy measures time spent in zeFenceDestroy on CPU when fences are used.
✔️
ExecuteCommandListWithFenceUsage measures time spent in zeCommandQueueExecuteCommandLists and zeFenceSynchronize on CPU when fences are used.
✔️
ExecuteCommandListWithIndirectAccess measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are accessed.
  • --AllocateMemory If set, a new allocation is made and made resident prior to measurement (0 or 1)
  • --AmountOfIndirectAllocations Number of indirect allocations present in the system
✔️
ExecuteCommandListWithIndirectArguments measures time spent in zeCommandQueueExecuteCommandLists on CPU when indirect allocations are used.
  • --AmountOfIndirectAllocations Number of indirect allocations present in the system
  • --placement Placement of the indirect allocations (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
FlushTime measures time spent in clEnqueueNDRangeKernel on CPU.
  • --event Pass output event to the enqueue call (0 or 1)
  • --flushCount Count of flushes to measure
  • --ooq Use out of order queue (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size, pass 0 to make the driver calculate it during enqueue
✔️
GetMemoryProperties measures time spent in zeMemGetAllocProperties on CPU when the driver is queried for memory properties.
  • --AmountOfUsmAllocations Number of USM allocations present in the system
✔️
GetMemoryPropertiesWithModifiedAllocations measures time spent in zeMemGetAllocProperties on CPU when allocations are modified between each iteration.
  • --AmountOfUsmAllocations Number of USM allocations present in the system
✔️
GetMemoryPropertiesWithOffsetedPointer measures time spent in zeMemGetAllocProperties on CPU when the pointer passed is an offset from the base address.
  • --AmountOfUsmAllocations Number of USM allocations present in the system
✔️
KernelSetArgumentValueImmediate measures time spent in zeKernelSetArgumentValue for immediate arguments on CPU.
  • --argSize Kernel argument size in bytes
  • --differentValues Use different values for arguments each iteration (0 or 1)
✔️
LifecycleCommandList measures time spent in zeCommandListCreate + Close + Execute on CPU.
  • --CmdListCount Number of cmdlists to create
  • --CopyOnly Create copy only cmdlist (0 or 1)
✔️
MemGetIpcHandle measures time spent in zeMemGetIpcHandle on CPU.
  • --AmountOfUsmAllocations Number of USM allocations present in the system
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
MemOpenIpcHandle measures time spent in zeMemOpenIpcHandle on CPU.
  • --AmountOfUsmAllocations Number of USM allocations present in the system
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
MemPutIpcHandle measures time spent in zeMemPutIpcHandle on CPU.
  • --AmountOfUsmAllocations Number of USM allocations present in the system
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
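
The three IPC tests above time the individual steps of a round trip like the following sketch (illustrative; in the real tests the open happens in a second process):

```c
#include <level_zero/ze_api.h>

/* Illustrative only: ptr is assumed to be a USM allocation in context. */
void ipc_roundtrip(ze_context_handle_t context, ze_device_handle_t device,
                   void *ptr) {
    ze_ipc_mem_handle_t ipcHandle;
    zeMemGetIpcHandle(context, ptr, &ipcHandle);       /* MemGetIpcHandle */
    void *imported = NULL;
    zeMemOpenIpcHandle(context, device, ipcHandle, 0,  /* MemOpenIpcHandle */
                       &imported);
    zeMemCloseIpcHandle(context, imported);
    zeMemPutIpcHandle(context, ipcHandle);             /* MemPutIpcHandle */
}
```
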
ModuleCreateSpv measures time spent in zeModuleCreate for a .spv kernel on CPU.
  • --kernelName Path to the kernel .spv file
✔️
MultiArgumentKernelTime measures time spent in clEnqueueNDRangeKernel on CPU for kernels with multiple arguments.
  • --argumentCount Argument count in a kernel; supported values: 1, 4, 8, 16, 32, 64
  • --measureSetArg Controls whether setKernelArgSvmPointer is measured or not (0 or 1)
✔️
PhysicalMemCreate measures time spent in zePhysicalMemCreate on CPU.
  • --reserveSize Size in bytes to be reserved
✔️
PhysicalMemDestroy measures time spent in zePhysicalMemDestroy on CPU.
✔️
ResetCommandList measures time spent in zeCommandListReset on CPU.
  • --CopyOnly Create copy only cmdlist (0 or 1)
  • --size Size of the buffer
  • --sourcePlacement Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
SetKernelArgSvmPointer measures time spent in clSetKernelArgSVMPointer on CPU.
  • --allocationSize Size of SVM allocations, in bytes
  • --allocationsCount Number of allocations
  • --reallocate Allocations will be freed and allocated again between setKernelArg calls (0 or 1)
✔️ ✔️
SetKernelGroupSize measures time spent in zeKernelSetGroupSize on CPU.
  • --asymmetricLocalWorkSize Use asymmetric local work size (0 or 1)
✔️
SubmitKernel measures time spent in submitting a kernel to a SYCL (or SYCL-like) queue on CPU.
  • --DiscardEvents Create the queue with the discard_events property (0 or 1)
  • --Ioq Create the queue with the in_order property (0 or 1)
  • --KernelExecTime Approximately how long a single kernel executes, in us
  • --MeasureCompletion Measures time taken to complete the submission (default is to measure only submit calls) (0 or 1)
  • --NumKernels Number of kernels to submit to the queue
  • --Profiling Create the queue with the enable_profiling property (0 or 1)
✔️ ✔️
UsmMemoryAllocation measures time spent in USM memory allocation APIs.
  • --measureMode Specifies which APIs to measure (Allocate or Free or Both)
  • --size Size to allocate
  • --type Type of memory being allocated (Device or Host or Shared)
✔️
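
A minimal sketch of the --type=Device case (illustrative helper; Host and Shared would use zeMemAllocHost and zeMemAllocShared, and --measureMode picks which call contributes to the reported time):

```c
#include <level_zero/ze_api.h>
#include <stddef.h>

/* Illustrative only: one allocate/free pair on device memory. */
void alloc_and_free(ze_context_handle_t context, ze_device_handle_t device,
                    size_t size) {
    ze_device_mem_alloc_desc_t desc = {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC};
    void *ptr = NULL;
    zeMemAllocDevice(context, &desc, size, 0 /* alignment */, device, &ptr);
    zeMemFree(context, ptr);
}
```
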
VirtualMemFree measures time spent in zeVirtualMemFree on CPU.
  • --freeSize Size in bytes to be freed
✔️
VirtualMemGetAccessAttrib measures time spent in zeVirtualMemGetAccessAttribute on CPU.
  • --size Size in bytes to get the access attribute
✔️
VirtualMemMap measures time spent in zeVirtualMemMap on CPU.
  • --accessType Access type. Either 'ReadWrite' or 'ReadOnly'
  • --reserveSize Size in bytes to be reserved
  • --useOffset Use offset to map into physical memory (0 or 1)
✔️
VirtualMemQueryPageSize measures time spent in zeVirtualMemQueryPageSize on CPU.
✔️
VirtualMemReserve measures time spent in zeVirtualMemReserve on CPU.
  • --reserveSize Size in bytes to be reserved
  • --useNull Flag to decide whether NULL is used for the start of the region (0 or 1)
✔️
VirtualMemSetAccessAttrib measures time spent in zeVirtualMemSetAccessAttribute on CPU.
  • --accessType Access type to set. Either 'ReadWrite', 'ReadOnly' or 'None'
  • --size Size in bytes to set the access attribute
✔️
VirtualMemUnMap measures time spent in zeVirtualMemUnmap on CPU.
  • --reserveSize Size in bytes to be unmapped
✔️
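
For orientation, a sketch of how the individually timed virtual-memory calls fit together (illustrative helper; size is assumed to be a multiple of the queried page size):

```c
#include <level_zero/ze_api.h>
#include <stddef.h>

/* Illustrative only: reserve, back with physical memory, map, tear down. */
void virtual_memory_flow(ze_context_handle_t context,
                         ze_device_handle_t device, size_t size) {
    size_t pageSize = 0;
    zeVirtualMemQueryPageSize(context, device, size, &pageSize);
    void *base = NULL;
    zeVirtualMemReserve(context, NULL /* --useNull=1 */, size, &base);
    ze_physical_mem_desc_t physDesc = {ZE_STRUCTURE_TYPE_PHYSICAL_MEM_DESC};
    physDesc.size = size;
    ze_physical_mem_handle_t physical = NULL;
    zePhysicalMemCreate(context, device, &physDesc, &physical);
    zeVirtualMemMap(context, base, size, physical, 0 /* offset */,
                    ZE_MEMORY_ACCESS_ATTRIBUTE_READWRITE);
    zeVirtualMemUnmap(context, base, size);
    zePhysicalMemDestroy(context, physical);
    zeVirtualMemFree(context, base, size);
}
```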

atomic_benchmark

Atomic Benchmark is a set of tests aimed at measuring the performance of atomic operations inside kernels.

Test name Description Params L0 OCL
OneAtomic enqueues a kernel performing an atomic operation on a single address.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
OneAtomicExplicit enqueues a kernel performing an atomic operation on a single address using OpenCL 2.0 atomics with explicit memory order and scope.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --order Memory order of an atomic operation (relaxed or acquire or release or acq_rel or seq_cst)
  • --scope Memory scope of an atomic operation (Workgroup or Device)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
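
A minimal OpenCL C sketch of the pattern this test exercises; the exact kernel shape is assumed:

```c
/* Illustrative only: every work item performs --op=Add on the same address
 * with the requested order and scope. */
kernel void one_atomic_explicit(volatile global atomic_int *value) {
    atomic_fetch_add_explicit(value, 1,
                              memory_order_relaxed, /* --order=relaxed */
                              memory_scope_device); /* --scope=Device */
}
```
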
OneLocalAtomic enqueues a kernel performing an atomic operation on a single location placed in SLM.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgs Work group size
✔️
OneLocalAtomicExplicit enqueues a kernel performing an atomic operation on a single location placed in SLM using OpenCL 2.0 atomics with explicit memory order and scope.
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --order Memory order of an atomic operation (relaxed or acquire or release or acq_rel or seq_cst)
  • --scope Memory scope of an atomic operation (Workgroup or Device)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgs Work group size
✔️
SeparateAtomics enqueues a kernel performing an atomic operation on different addresses.
  • --atomicsPerCacheline Number of used addresses occupying a single cacheline (this causes operations to be serialized)
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
SeparateAtomicsExplicit enqueues a kernel performing an atomic operation on different addresses using OpenCL 2.0 atomics with explicit memory order and scope.
  • --atomicsPerCacheline Number of used addresses occupying a single cacheline (this causes operations to be serialized)
  • --op Atomic operation to perform (Add or Sub or Xchg or CmpXchg or Inc or Dec or Min or Max or And or Or or Xor)
  • --order Memory order of an atomic operation (relaxed or acquire or release or acq_rel or seq_cst)
  • --scope Memory scope of an atomic operation (Workgroup or Device)
  • --type Data type of the atomic. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️

emu_benchmark

Emulation Benchmark is a set of tests aimed at measuring the performance of emulated math operations performed in kernels.

Test name Description Params L0 OCL
Int64Div enqueues a kernel performing emulated int64 division.
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
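
A minimal OpenCL C sketch of such an emulated operation (kernel shape assumed): 64-bit integer division has no native hardware path on the EUs, so the compiler expands it into an emulation sequence.

```c
/* Illustrative only: the division below is what gets emulated. */
kernel void int64_div(global const long *a, global const long *b,
                      global long *out) {
    size_t i = get_global_id(0);
    out[i] = a[i] / b[i];
}
```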

eu_benchmark

EU Benchmark is a set of tests aimed at measuring the performance of calculations performed in kernels.

Test name Description Params L0 OCL
DoMathOperation enqueues a kernel performing a math operation.
  • --op Math operation to perform (Add or Sub or Div or Modulo or Inc or Dec or Min or Max or And or Or or Xor)
  • --type Data type of the operation. Keep in mind not all operations are supported for floating points (Int32 or Int64 or Float)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgc Work group count
  • --wgs Work group size
✔️
ReadAfterAtomicWrite enqueues a kernel which writes to global memory atomically and then reads it back non-atomically.
  • --atomic If true, the write to global memory will be atomic (0 or 1)
  • --shuffleRead If true, each thread will write and read a different memory cell; otherwise it will be the same one (0 or 1)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --wgs Workgroup size
✔️
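
A minimal OpenCL C sketch of the pattern (kernel shape assumed):

```c
/* Illustrative only: an atomic write to global memory followed by a plain
 * read of the same cell (--shuffleRead=1 would read a different cell). */
kernel void read_after_atomic_write(volatile global int *buf, global int *out) {
    size_t i = get_global_id(0);
    atomic_xchg(&buf[i], (int)i); /* --atomic=1 */
    out[i] = buf[i];              /* non-atomic read back */
}
```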

gpu_cmds_benchmark

GPU Commands Benchmark is a set of tests aimed at measuring GPU-side execution duration of various commands.

Test name Description Params L0 OCL
BarrierBetweenKernels measures the time required to run a barrier command between 2 kernels, including potential cache flush commands.
  • --bytes Number of bytes to flush from L3
  • --memoryType Memory type cached in L3 (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --onlyReads Only reads are cached in L3
  • --remoteAccess Access cached from a remote tile
✔️
CopyWithEvent measures the time required to run a copy kernel with various event configurations.
  • --devWaitEvent Use ZE_EVENT_SCOPE_FLAG_DEVICE for ze_event_desc_t::wait (0 or 1)
  • --hostSignalEvent Use ZE_EVENT_POOL_HOST_VISIBLE for ze_event_pool_desc_t::flags, and use ZE_EVENT_SCOPE_FLAG_HOST for ze_event_desc_t::signal (0 or 1)
  • --measuredCmds Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --timestampEvent Use ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP for ze_event_pool_desc_t::flags (0 or 1)
✔️
EmptyKernel measures the time required to run an empty kernel on GPU.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
EventCtxtSwitchLatency measures the context-switch latency required to switch between various engine types.
  • --firstEngine First engine to measure context switch latency (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --secondEngine Second engine to measure context switch latency (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
✔️
KernelWithEvent measures the time required to run an empty kernel with various event configurations.
  • --devWaitEvent Use ZE_EVENT_SCOPE_FLAG_DEVICE for ze_event_desc_t::wait (0 or 1)
  • --hostSignalEvent Use ZE_EVENT_POOL_HOST_VISIBLE for ze_event_pool_desc_t::flags, and use ZE_EVENT_SCOPE_FLAG_HOST for ze_event_desc_t::signal (0 or 1)
  • --measuredCmds Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --timestampEvent Use ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP for ze_event_pool_desc_t::flags (0 or 1)
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
KernelWithWork measures the time required to run a GPU kernel which assigns values to elements of a buffer.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
  • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
  • --wgc Workgroup count
  • --wgs Workgroup size (aka local work size)
✔️
WaitOnEventCold measures the time required to service a signalled semaphore that has never been waited for.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
WaitOnEventFromWalker measures the time required to service a signalled semaphore coming from a Walker command.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
WaitOnEventHot measures the time required to service a signalled semaphore that was previously used.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
WriteTimestamp measures the time required to write a timestamp on GPU.
  • --measuredCommands Number of commands being measured. The result is later divided by this number to obtain the time of a single command
✔️
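
A minimal sketch of the command WriteTimestamp measures (illustrative helper; the real test appends it --measuredCommands times):

```c
#include <level_zero/ze_api.h>
#include <stdint.h>

/* Illustrative only: dst must be device-visible memory. */
void append_timestamp(ze_command_list_handle_t cmdList, uint64_t *dst) {
    zeCommandListAppendWriteGlobalTimestamp(cmdList, dst, NULL, 0, NULL);
}
```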

memory_benchmark

Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers.

Test name Description Params L0 OCL
CopyBuffer allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available.
  • --compressedDestination Select if the destination buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --compressedSource Select if the source buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --contents Contents of the buffers (Zeros or Random)
  • --size Size of the buffers
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
CopyBufferRect allocates two OpenCL buffers and measures rectangle copy bandwidth between them. Buffers will be placed in device memory, if it's available.
  • --dstCompressed Select if the destination buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --origin Origin of the rectangle
  • --rPitch Row pitch of the rectangle
  • --region Size of the rectangle
  • --sPitch Slice pitch of the rectangle
  • --size Size of the buffer
  • --srcCompressed Select if the source buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
✔️
CopyBufferToImage allocates a buffer and an image and measures copy bandwidth between them, using an immediate command list for Level Zero and a command queue for OpenCL.
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --region Size of the destination image region
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
CopyEntireImage allocates two image objects and measures copy bandwidth between them. Images will be placed in device memory, if it's available.
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --size Size of the image
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
CopyImageRegion allocates two image objects and measures region copy bandwidth between them, using an immediate command list for Level Zero and a command queue for OpenCL.
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --size Size of the image
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
CopyImageToBuffer allocates an image and a buffer and measures copy bandwidth between them, using an immediate command list for Level Zero and a command queue for OpenCL.
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --region Size of the source image region
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
FillBuffer allocates an OpenCL buffer and measures fill bandwidth. Buffer will be placed in device memory, if it's available.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --patternSize Size of the fill pattern
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
FullRemoteAccessMemory uses stream memory in a fashion described by 'type' to measure bandwidth of full remote memory access.
  • --blockAccess Block access (1) or scatter access (0) (0 or 1)
  • --elementSize Size of the single element to read in bytes (1, 2, 4, 8)
  • --size Size of the memory to stream. Must be divisible by element size and a power of 2
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --workItems Number of work items, equal to SIMD size * used hwthreads. Must be a power of 2
✔️
FullRemoteAccessMemoryXeCoresDistributed uses stream memory in a fashion described by 'type' to measure bandwidth of full remote memory access when hwthreads are distributed between XeCores.
  • --blockAccess Block access (1) or scatter access (0) (0 or 1)
  • --elementSize Size of the single element to read in bytes (1, 2, 4, 8)
  • --size Size of the memory to stream. Must be a power of 2
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --workItems Number of work items, equal to SIMD size * used hwthreads
✔️
MapBuffer allocates an OpenCL buffer and measures map bandwidth. A mapping operation means a memory transfer from GPU to CPU or a no-op, depending on map flags.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --mapFlags OpenCL map flags passed during memory mapping (Read or Write or WriteInvalidate)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
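
A minimal sketch of the timed map with --mapFlags=Read and its matching unmap (illustrative helper):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Illustrative only: a blocking map, so the call itself covers any
 * GPU-to-CPU transfer. */
void map_and_unmap(cl_command_queue queue, cl_mem buffer, size_t size) {
    cl_int err = CL_SUCCESS;
    void *ptr = clEnqueueMapBuffer(queue, buffer, CL_TRUE /* blocking */,
                                   CL_MAP_READ, 0, size, 0, NULL, NULL, &err);
    if (err == CL_SUCCESS)
        clEnqueueUnmapMemObject(queue, buffer, ptr, 0, NULL, NULL);
    clFinish(queue);
}
```
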
QueueInOrderMemcpy measures CPU time spent on multiple in-order memcpy operations.
  • --IsCopyOnly If true, Copy Engine is selected. If false, Compute Engine is selected (0 or 1)
  • --count Number of memcpy operations
  • --destinationPlacement Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --size Size of memory allocation
  • --sourcePlacement Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
RandomAccessMemory measures device-memory random access bandwidth for different allocation sizes, alignments and access modes. The benchmark performs 10 million memory accesses.
  • --accessMode Access mode to be used ('Read', 'Write', 'ReadWrite')
  • --alignment Alignment request for the allocated memory
  • --allocationSize Size of device memory to be allocated (maximum supported is 16 GB)
  • --randomAccessRange Percentage of allocation size to be used for random access
✔️
ReadBuffer allocates an OpenCL buffer and measures read bandwidth. A read operation means a transfer from GPU to CPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --reuse How hostptr allocation can be reused due to previous operations (None or Usm or Map)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
ReadBufferMisaligned allocates an OpenCL buffer and measures read bandwidth. A read operation means a transfer from GPU to CPU. The destination pointer passed by the application will be misaligned by the specified number of bytes.
  • --misalignment Number of bytes by which the destination pointer will be misaligned
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
ReadBufferRect allocates an OpenCL buffer and measures rectangle read bandwidth. A rectangle read operation means a transfer from GPU to CPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --origin Origin of the rectangle
  • --rPitch Row pitch of the rectangle
  • --region Size of the rectangle
  • --sPitch Slice pitch of the rectangle
  • --size Size of the buffer
✔️
ReadDeviceMemBuffer allocates two OpenCL buffers and measures source buffer read bandwidth. The source buffer resides in device memory.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --size Size of the buffer
✔️
RemoteAccessMemory uses stream memory in a fashion described by 'type' to measure bandwidth with different percentages of remote memory access. Triad means two buffers are read and one is written to. In read and write, memory is only read or written to.
  • --remoteFraction Fraction of remote memory access, 1 / n
  • --size Size of the memory to stream. Must be divisible by datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --workItemSize Number of work items grouped together for the remote check
✔️
RemoteAccessMemoryMaxSaturation uses stream memory writes to measure max data bus saturation with different percentages of remote memory access.
  • --remoteFraction Fraction of remote memory access, 1 / n
  • --size Size of the memory to stream. Must be divisible by datatype size.
  • --useEvents Perform GPU-side measurements using events (0 or 1)
  • --workItemSize Number of work items grouped together for the remote check
  • --writesPerWorkgroup Number of work items per workgroup that access memory
✔️
SLM_DataAccessLatency generates SLM local-memory transactions inside a thread group to measure latency between reads (uses the Intel-only intel_get_cycle_counter() intrinsic).
  • --direction Write or read mode (0 or 1)
  • --occupancyDiv H/W load divider: by 8, 4, 2, or full occupancy
  • --size SLM size
✔️
SlmSwitchLatency enqueues 2 kernels with different SLM sizes and measures the switch time between them.
  • --firstSlmSize Size of the shared local memory per thread group for the first kernel
  • --secondSlmSize Size of the shared local memory per thread group for the second kernel
  • --wgs Size of the work group
✔️
StreamAfterTransfer measures how stream kernels perform right after a host-to-device transfer populates the data. The test cleans caches, emits the transfers, then follows with a stream kernel and measures its GPU execution time.
  • --size Size of the memory to stream. Must be divisible by datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
StreamMemory streams memory inside a kernel in a fashion described by 'type'. Copy means one memory location is read from and a second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or written to.
  • --contents Buffer contents zeros/random (Zeros or Random)
  • --memoryPlacement Memory type used for stream (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --size Size of the memory to stream. Must be divisible by datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
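
For reference, the classic triad pattern (--type=Triad) looks like the following OpenCL C sketch; two buffers are read and one is written, so the reported bandwidth counts three traversals:

```c
/* Illustrative only: one element per work item. */
kernel void triad(global const float *a, global const float *b,
                  global float *c, float scalar) {
    size_t i = get_global_id(0);
    c[i] = a[i] + scalar * b[i];
}
```
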
StreamMemoryImmediate streams memory inside a kernel in a fashion described by 'type', using an immediate command list. Copy means one memory location is read from and a second one is written to. Triad means two buffers are read and one is written to. In read and write, memory is only read or written to.
  • --size Size of the memory to stream. Must be divisible by datatype size.
  • --type Memory streaming type (Read or Write or Scale or Triad)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UnmapBuffer allocates an OpenCL buffer and measures unmap bandwidth. An unmapping operation means a memory transfer from CPU to GPU or a no-op, depending on map flags.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --mapFlags OpenCL map flags passed during memory mapping (Read or Write or WriteInvalidate)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmConcurrentCopy allocates four unified shared memory buffers, 2 in device memory and 2 in host memory, and measures concurrent copy bandwidth between them.
  • --d2hEngine Engine used for device to host copy (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
  • --h2dEngine Engine used for host to device copy (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
  • --size Size of the buffer
✔️
UsmCopy allocates two unified shared memory buffers and measures copy bandwidth between them.
  • --contents Contents of the buffers (Zeros or Random)
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --reuseCmdList Command list is reused between iterations (0 or 1)
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
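
On the Level Zero path, the copy these USM tests time boils down to a call like the following sketch (illustrative helper, shown with an immediate command list as in UsmCopyImmediate):

```c
#include <level_zero/ze_api.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: dst/src are assumed to be USM allocations. */
void usm_copy(ze_command_list_handle_t immCmdList, void *dst, const void *src,
              size_t size, ze_event_handle_t event /* NULL if --useEvents=0 */) {
    zeCommandListAppendMemoryCopy(immCmdList, dst, src, size, event, 0, NULL);
    zeCommandListHostSynchronize(immCmdList, UINT64_MAX);
}
```
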
UsmCopyConcurrentMultipleBlits measures copy bandwidth while performing concurrent copies between host and device using different copy engines. Engines for host-to-device copies can be selected using h2dBlitters; engines for device-to-host copies can be selected using d2hBlitters.
  • --d2hBlitters A bit mask for selecting copy engines to be used for device to host copy
  • --h2dBlitters A bit mask for selecting copy engines to be used for host to device copy
  • --size Size of the copy to be done for each copy engine
✔️
UsmCopyImmediate allocates two unified shared memory buffers and measures copy bandwidth between them using an immediate command list.
  • --contents Contents of the buffers (Zeros or Random)
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmCopyMultipleBlits allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even - if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even.
  • --blitters A bit mask for selecting copy engines
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --size Size of the operation processed by each engine
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️ ✔️
UsmCopyRegion allocates two unified shared memory buffers and measures region copy bandwidth between them using an immediate command list.
  • --contents Contents of the buffers (Zeros or Random)
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --origin Origin of the region
  • --region Size of the region
  • --size Size of the buffer
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmCopyStagingBuffers measures copy time from device/host to host/device. The host memory is a non-USM allocation; the copy is done through staging USM buffers, so the non-USM host pointer is never passed to the L0 API, only through staging buffers.
  • --chunks Number of chunks the buffer should be split into
  • --dst Memory placement of the destination (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --size Size of the buffer
✔️
UsmFill allocates a unified memory buffer and measures fill bandwidth.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --patternContents Select contents of the fill pattern (Zeros or Random)
  • --patternSize Size of the fill pattern
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmFillImmediate allocates a unified memory buffer and measures fill bandwidth using an immediate command list.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --patternContents Select contents of the fill pattern (Zeros or Random)
  • --patternSize Size of the fill pattern
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmFillMultipleBlits allocates a unified shared memory buffer, divides it into chunks, fills each chunk using a different copy engine and measures bandwidth. Refer to UsmCopyMultipleBlits for more details.
  • --blitters A bit mask for selecting copy engines
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --patternContents Select contents of the fill pattern (Zeros or Random)
  • --patternSize Size of the fill pattern
  • --size Size of the operation processed by each engine
✔️ ✔️
UsmFillSpecificPattern allocates a unified memory buffer and measures fill bandwidth. Allows specifying an arbitrary pattern.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --pattern The fill pattern represented hexadecimally, e.g. 0x91ABCD1254
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️ ✔️
UsmImmediateCopyMultipleBlits allocates two unified shared memory buffers, divides them into chunks, copies each chunk using a different copy engine with an immediate command list and measures bandwidth. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidths are calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even - if the main copy engine is specified (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided between the remaining copy engines. Otherwise the division is even.
  • --blitters A bit mask for selecting copy engines
  • --dst Placement of the destination buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --size Size of the operation processed by each engine
  • --src Placement of the source buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
✔️
UsmMemset allocates a unified memory buffer and measures memset bandwidth.
  • --contents Contents of the buffer (Zeros or Random)
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --memory Placement of the buffer (Device or Host or Shared or non-USM or non-USM-imported or non-USM-mapped or non-usm2MBAligned or non-USM2MBAligned-imported)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
UsmSharedMigrateCpu allocates a unified shared memory buffer and measures bandwidth for a kernel that must migrate the resource from GPU to CPU.
  • --accessAllBytes Select whether the entire resource or only one byte will be accessed on the CPU (0 or 1)
  • --preferredLocation Apply memadvise with a preferred device location (System or Device or None)
  • --size Size of the buffer
✔️ ✔️
UsmSharedMigrateGpu allocates a unified shared memory buffer and measures bandwidth for a kernel that must migrate the resource from CPU to GPU.
  • --preferredLocation Apply memadvise with a preferred device location (System or Device or None)
  • --prefetch Explicitly migrate the shared allocation to the device associated with the command queue (0 or 1)
  • --size Size of the buffer
✔️ ✔️
UsmSharedMigrateGpuForFill allocates a unified shared memory buffer and measures bandwidth for a memory fill operation that must migrate the resource from CPU to GPU.
  • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
  • --preferredLocation Apply memadvise with a preferred device location (System or Device or None)
  • --prefetch Explicitly migrate the shared allocation to the device associated with the command queue (0 or 1)
  • --size Size of the buffer
✔️ ✔️
WriteBuffer allocates an OpenCL buffer and measures write bandwidth. A write operation means a transfer from CPU to GPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --contents Contents of the buffer (Zeros or Random)
  • --reuse How hostptr allocation can be reused due to previous operations (None or Usm or Map)
  • --size Size of the buffer
  • --useEvents Perform GPU-side measurements using events (0 or 1)
✔️
WriteBufferRect allocates an OpenCL buffer and measures rectangle write bandwidth. A rectangle write operation means a transfer from CPU to GPU.
  • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
  • --inOrderQueue If set use IOQ, otherwise OOQ. Applicable only for OCL. (0 or 1)
  • --origin Origin of the rectangle
  • --rPitch Row pitch of the rectangle
  • --region Size of the rectangle
  • --sPitch Slice pitch of the rectangle
  • --size Size of the buffer
✔️

                miscellaneous_benchmark

                Miscellaneous Benchmark is a set of tests measuring different simple compute scenarios.

                Test name Description Params L0 OCL
                IoqKernelSwitchLatency measures time from end of one kernel till start of next kernel for in order queue
                • --kernelCount Count of kernels
                • --useEvents Use events to synchronize between kernels (0 or 1)
                ✔️
                KernelWithWork measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. Benchmark checks the impact of kernel split.
                • --split How many times kernel is split
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
                • --wgc Workgroup count
                • --wgs Workgroup size (aka local work size)
                ✔️
                Reduction Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
                • --numberOfElements Number of elements that will be reduced
                ✔️
                Reduction2 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
                • --numberOfElements Number of elements that will be reduced
                ✔️
                Reduction3 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
                • --numberOfElements Number of elements that will be reduced
                ✔️
                Reduction4 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
                • --numberOfElements Number of elements that will be reduced
                ✔️
                Reduction5 Performs a reduction operation on a buffer. Each thread performs atomic_add on one shared memory location.
                • --numberOfElements Number of elements that will be reduced
                ✔️
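The five Reduction variants above share the same one-line description; presumably they differ in implementation strategy, which the text does not spell out. A minimal OpenCL C sketch of the naive scheme as literally described, where every work item performs an atomic_add into one shared location (illustrative, not the benchmark's source):

```c++
// Hypothetical kernel: numberOfElements work items each atomically add their
// element into a single result location. Contention on `result` is the point.
const char *reductionKernelSource = R"CLC(
__kernel void reduce(__global const int *input, volatile __global int *result) {
    atomic_add(result, input[get_global_id(0)]);
}
)CLC";
```
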
                VectorSum Performs vector addition
                • --numberOfElementsX Number of elements in X dimension
                • --numberOfElementsY Number of elements in Y dimension
                • --numberOfElementsZ Number of elements in Z dimension
                ✔️

                multiprocess_benchmark

Multiprocess Benchmark is a set of tests aimed at measuring how different commands benefit from simultaneous execution.

                Test name Description Params L0 OCL
KernelAndCopy enqueues a kernel and a copy operation, with the ability to perform the two tasks on different command queues (see the sketch after this entry).
                • --runCopy Enqueue buffer to buffer copy during each iteration (0 or 1)
                • --runKernel Enqueue kernel during each iteration (0 or 1)
                • --twoQueues Enables using separate queues for both operations. Must be used with runCopy and runKernel (0 or 1)
                • --useCopyQueue Use a specialized copy queue for the copy operation. Must be used with runCopy (0 or 1)
                ✔️
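A minimal sketch of the two-queue overlap that --twoQueues=1 refers to, assuming the kernel and the buffers were created elsewhere (names are illustrative; error checking omitted):

```c++
// One measured iteration: the kernel and the buffer-to-buffer copy go to
// separate queues so the GPU can execute them concurrently.
#include <CL/cl.h>

void runIteration(cl_command_queue kernelQueue, cl_command_queue copyQueue,
                  cl_kernel kernel, cl_mem src, cl_mem dst,
                  size_t gws, size_t bytes) {
    clEnqueueNDRangeKernel(kernelQueue, kernel, 1, nullptr, &gws, nullptr,
                           0, nullptr, nullptr);
    clEnqueueCopyBuffer(copyQueue, src, dst, 0, 0, bytes, 0, nullptr, nullptr);
    clFinish(kernelQueue);   // wait for both operations to complete
    clFinish(copyQueue);
}
```
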
MultiProcessCompute Creates a number of separate processes for each specified tile, each performing a compute workload, and measures the average time to complete all of them. Processes use an affinity mask to select specific sub-devices for execution.
                • --opsPerKernel Operations performed in kernel, used to steer its execution time
                • --processesPerTile Number of processes that will be started on each of the tiles specified
                • --synchronize Synchronize all processes before each iteration (0 or 1)
                • --tiles Tiles for execution (Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --workgroupsPerProcess Number of workgroups that each process will start
                ✔️
MultiProcessComputeSharedBuffer Creates a number of separate processes for each specified tile, each performing a compute workload, and measures the average time to complete all of them. Processes use an affinity mask to select specific sub-devices for execution. The parent process creates a single buffer for each tile; all processes executing on a given tile share it via IPC calls.
                • --processesPerTile Number of processes that will be started on each of the tiles specified
                • --synchronize Synchronize all processes before each iteration (0 or 1)
                • --tiles Tiles for execution (Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --workgroupsPerProcess Number of workgroups that each process will start
                ✔️
MultiProcessImmediateCmdlistCompletion measures completion latency of AppendMemoryCopy issued from multiple processes to immediate command lists. Engines used for submission are selected by the enabled bits of engineMask; bits are indexed from right to left, so the rightmost bit represents the first engine and the leftmost the last. 'processesPerEngine' processes submit commands to each selected engine. If 'numberOfProcesses' is greater than 'processesPerEngine' times the selected engine count, the excess processes are assigned to the selected engines one each, round-robin (a sketch of this assignment follows this entry). If only one engine is selected, all processes are assigned to that engine.
                • --copySize copy size in bytes
                • --engineGroup engine group to be used
                • --engineMask bit mask for selecting engines to be used for submission
                • --numberOfProcesses total number of processes
                • --processesPerEngine number of processes submitting commands to each engine
                ✔️
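A sketch of the process-to-engine assignment rule spelled out above, written as plain C++ (the function name and shape are illustrative, not the benchmark's code):

```c++
#include <cstdint>
#include <vector>

// Returns, for each process, the index of the engine it submits to.
std::vector<int> assignProcesses(uint32_t engineMask, int engineCount,
                                 int numberOfProcesses, int processesPerEngine) {
    std::vector<int> selected;                  // engines enabled in the mask
    for (int e = 0; e < engineCount; ++e)
        if (engineMask & (1u << e))             // bit 0 = first engine
            selected.push_back(e);

    std::vector<int> assignment(numberOfProcesses);
    if (selected.empty()) return assignment;
    if (selected.size() == 1) {                 // one engine takes everything
        for (int &a : assignment) a = selected[0];
        return assignment;
    }
    int p = 0;
    for (int e : selected)                      // processesPerEngine per engine
        for (int i = 0; i < processesPerEngine && p < numberOfProcesses; ++i)
            assignment[p++] = e;
    for (int r = 0; p < numberOfProcesses; ++p, ++r)   // excess: round-robin
        assignment[p] = selected[r % selected.size()];
    return assignment;
}
```

The same rule, with threads in place of processes, applies to the ImmediateCommandListCompletion and ImmediateCommandListSubmission tests in the multithread_benchmark section below.
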
MultiProcessImmediateCmdlistSubmission measures submission latency of a walker command issued from multiple processes to immediate command lists. 'processesPerEngine' processes submit commands to each engine. If 'numberOfProcesses' is greater than 'processesPerEngine' times the engine count, the excess processes are assigned to engines one each, round-robin. If only one engine is available, all processes are assigned to that engine.
• --numberOfProcesses total number of processes
                • --processesPerEngine number of processes submitting commands to each engine
                ✔️
MultiProcessInit Measures the initialization overhead in a multi-process application. For Level Zero, only the first invocation of zeInit() per process execution is measured (see the sketch after this entry).
                • --initFlag Initialization flag. For Level Zero: 0 - default, 1 - ZE_INIT_FLAG_GPU_ONLY, 2 - ZE_INIT_FLAG_VPU_ONLY
                • --numberOfProcesses Total number of processes
                ✔️
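A minimal sketch of the per-process measurement described above: time a single zeInit call in a freshly spawned process (flag mapping per --initFlag; output format is illustrative):

```c++
#include <chrono>
#include <cstdio>
#include <level_zero/ze_api.h>

int main() {
    auto start = std::chrono::steady_clock::now();
    ze_result_t status = zeInit(ZE_INIT_FLAG_GPU_ONLY);   // --initFlag=1
    auto end = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    std::printf("zeInit took %lld us (status %d)\n",
                static_cast<long long>(us.count()), static_cast<int>(status));
    return 0;
}
```
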

                multithread_benchmark

                Multithread Benchmark is a set of tests aimed at measuring how different commands benefit from multithreaded execution.

                Test name Description Params L0 OCL
ImmediateCommandListCompletion measures completion latency of AppendMemoryCopy issued from multiple threads to immediate command lists. Engines used for submission are selected by the enabled bits of engineMask. 'threadsPerEngine' threads submit commands to each selected engine. If 'numberOfThreads' is greater than 'threadsPerEngine' times the selected engine count, the excess threads are assigned to the selected engines one each, round-robin. If only one engine is selected, all threads are assigned to that engine.
                • --copySize copy size in bytes
                • --engineGroup engine group to be used
                • --engineMask bit mask for selecting engines to be used for submission
                • --numberOfThreads total number of threads
                • --threadsPerEngine number of threads submitting commands to each engine
                ✔️
ImmediateCommandListSubmission measures submission latency of AppendLaunchKernel issued from multiple threads to immediate command lists. 'threadsPerEngine' threads submit commands to each engine. If 'numberOfThreads' is greater than 'threadsPerEngine' times the engine count, the excess threads are assigned to engines one each, round-robin. If only one engine is available, all threads are assigned to that engine.
                • --numberOfThreads total number of threads
                • --threadsPerEngine number of threads submitting commands to each engine
                ✔️
SvmCopy enqueues multiple SVM copies on multiple threads concurrently.
                • --numberOfThreads Number of threads that will run concurrently
                ✔️ ✔️

                multitile_memory_benchmark

                Multi-tile Memory Benchmark is a set of tests aimed at measuring bandwidth of memory transfers performed on a multi-tile device.

                Test name Description Params L0 OCL
                CopyBuffer allocates two OpenCL buffers and measures copy bandwidth between them. Buffers will be placed in device memory, if it's available.
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --dst Placement of memory for the destination buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --dstCompressed Select if the destination buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffers
                • --src Placement of memory for the source buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --srcCompressed Select if the source buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️
                FillBuffer allocates an OpenCL buffer and measures fill bandwidth. Buffer will be placed in device memory, if it's available.
                • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
                • --memory Placement of memory for the buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --patternSize Size of the fill pattern
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffer
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️
                ReadBuffer allocates an OpenCL buffer and measures read bandwidth. Read operation means transfer from GPU to CPU.
                • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --memory Placement of memory for the buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffer
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️
UsmBidirectionalCopy allocates two unified device memory buffers, each on a different tile, and measures copy bandwidth between them. The test measures copies in both directions, controlled with the --write parameter. With --write=1, each tile performs a write: the queue on tile 0 copies from a buffer on tile 0 to a buffer on tile 1, and the queue on tile 1 copies from tile 1 to tile 0. With --write=0, source and destination are flipped: the queue on tile 0 copies from tile 1 to tile 0, and the queue on tile 1 copies from tile 0 to tile 1.
                • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
                • --size Size of the buffers
                • --write Which operation is used, whether write or read (0 or 1)
                ✔️
                UsmCopy allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function.
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --dst Placement of memory for the destination buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffers
                • --src Placement of memory for the source buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️ ✔️
                UsmCopyImmediate allocates two unified shared memory buffers and measures copy bandwidth between them using a builtin function appended to an immediate list.
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --dst Placement of memory for the destination buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffers
                • --src Placement of memory for the source buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️
                UsmCopyKernel allocates two unified shared memory buffers and measures copy bandwidth between them using a custom kernel.
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --dst Placement of memory for the destination buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffers
                • --src Placement of memory for the source buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️ ✔️
                UsmFill allocates a unified shared memory buffer and measures fill bandwidth.
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --forceBlitter Force blitter engine. Test will be skipped if device does not support blitter. Warning: in OpenCL blitter may still be used even if not forced (0 or 1)
                • --memory Placement of memory for the buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --patternSize Size of the fill pattern
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffer
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️ ✔️
                UsmSharedMigrateCpu allocates a unified shared memory buffer and measures time to migrate it from GPU to CPU.
• --accessAllBytes Select whether the entire resource or only one byte will be accessed on the CPU (0 or 1)
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --memory Placement of memory for the buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --size Size of the buffer
                ✔️ ✔️
                UsmSharedMigrateGpu allocates a unified shared memory buffer and measures time to migrate it from CPU to GPU.
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --memory Placement of memory for the buffer (Host or Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --size Size of the buffer
                ✔️ ✔️
                WriteBuffer allocates an OpenCL buffer and measures write bandwidth. Write operation means transfer from CPU to GPU.
                • --compressed Select if the buffer is to be compressed. Will be skipped, if device does not support compression (0 or 1)
                • --context How context will be created (Root or Tile0 or Tile1 or Tile2 or Tile3 or a list separated with ':')
                • --memory Placement of memory for the buffer (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --queue Which device within the context will perform the operation (Root or Tile0 or Tile1 or Tile2 or Tile3)
                • --size Size of the buffer
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️

                p2p_benchmark

                P2P Benchmark is a set of tests aimed at measuring bandwidth and latency of memory transfers between peer devices.

                Test name Description Params L0 OCL
UsmCopyMultipleBlits allocates two unified device memory buffers on separate devices and performs a copy between sections (chunks) of them, each chunk using a different copy engine, and measures bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidth is calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters is not always even: if the main copy engine is selected (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided among the remaining copy engines; otherwise the division is even (a sketch of this division follows this entry).
                • --blitters A bit mask for selecting copy engines
                • --dstDeviceId Destination device
                • --size Size of the operation processed by each engine
                • --srcDeviceId Source device
                ✔️
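A sketch of the work-division rule just described, as plain C++ (function shape illustrative; the benchmark's actual chunking code may differ):

```c++
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the number of bytes each selected blitter copies.
std::vector<size_t> divideWork(uint32_t blitters, size_t totalSize) {
    std::vector<int> engines;                   // bit 0 = main copy engine
    for (int b = 0; b < 32; ++b)
        if (blitters & (1u << b)) engines.push_back(b);
    if (engines.empty()) return {};

    std::vector<size_t> chunk(engines.size());
    if (engines[0] == 0 && engines.size() > 1) {
        chunk[0] = totalSize / 2;               // main engine gets half
        for (size_t i = 1; i < engines.size(); ++i)
            chunk[i] = (totalSize / 2) / (engines.size() - 1);
    } else {
        for (size_t &c : chunk)                 // even split otherwise
            c = totalSize / engines.size();
    }
    return chunk;
}
```

Total bandwidth then follows as totalSize divided by the slowest engine's measured time.
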
                UsmEUCopy allocates two unified device memory buffers on separate devices, performs a copy between them using a compute engine, and reports bandwidth. Test first checks for P2P capabilities in the target platform before submitting the copy.
                • --contents Contents of the buffers (Zeros or Random)
                • --dstDeviceId Destination device
                • --reuseCmdList Command list is reused between iterations (0 or 1)
                • --size Size of the buffer
                • --srcDeviceId Source device
                • --useEvents Perform GPU-side measurements using events (0 or 1)
                ✔️
UsmImmediateCopyMultipleBlits allocates two unified device memory buffers on separate devices and performs a copy between sections (chunks) of them, each chunk using a different copy engine with an immediate command list, and measures bandwidth. The test first checks for P2P capabilities on the target platform before submitting the copy. Results for each individual blitter engine are measured using GPU-based timings and reported separately. Total bandwidth is calculated by dividing the total buffer size by the worst result from all engines. Division of work among blitters follows the same rule as in UsmCopyMultipleBlits: if the main copy engine is selected (rightmost bit in the --blitters argument), it gets half of the buffer and the rest is divided among the remaining copy engines; otherwise the division is even.
                • --blitters A bit mask for selecting copy engines
                • --dstDeviceId Destination device
                • --size Size of the operation processed by each engine
                • --srcDeviceId Source device
                ✔️

                ulls_benchmark

                Ulls Benchmark is a set of tests aimed at measuring Ultra Low Latency Submission (ULLS) performance impact.

                Test name Description Params L0 OCL
BestSubmission enqueues a system memory write via PIPE_CONTROL and measures when the update becomes visible on the CPU.
                  ✔️
BestWalkerNthCommandListSubmission enqueues a single kernel on each of n command lists; each kernel updates a system memory location, and the CPU busy-loops until the update from the nth command list's kernel becomes visible. This is an L0-only test.
                  • --CmdListCount Command list count
                  ✔️
BestWalkerNthSubmission enqueues n kernels, each of which updates a system memory location, and busy-loops on the CPU until the update of the nth kernel becomes visible.
                  • --KernelCount Kernel count
                  ✔️
BestWalkerNthSubmissionImmediate enqueues n kernels, each of which updates a system memory location, and busy-loops on the CPU until the update of the nth kernel becomes visible. Kernels are enqueued using a low-latency immediate command list, so the test is Level Zero-specific.
                  • --KernelCount Kernel count
                  ✔️
BestWalkerSubmission enqueues a kernel that updates a system memory location, then busy-loops on the CPU until the update becomes visible (see the sketch after this entry).
                    ✔️ ✔️
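The core of the busy-loop pattern shared by the BestWalker* tests, as a Level Zero sketch. Kernel and command-list setup are omitted; `flag` is assumed to be zero-initialized host USM memory that the kernel writes 1 into (all names illustrative):

```c++
#include <chrono>
#include <cstdint>
#include <level_zero/ze_api.h>

// Returns microseconds from ExecuteCommandLists until the GPU store is
// visible on the CPU; error checking omitted for brevity.
double walkerLatencyUs(ze_command_queue_handle_t queue,
                       ze_command_list_handle_t cmdList,
                       ze_kernel_handle_t kernel,
                       volatile uint32_t *flag) {
    *flag = 0;
    ze_group_count_t groups{1, 1, 1};
    zeCommandListAppendLaunchKernel(cmdList, kernel, &groups, nullptr, 0, nullptr);
    zeCommandListClose(cmdList);

    auto start = std::chrono::steady_clock::now();
    zeCommandQueueExecuteCommandLists(queue, 1, &cmdList, nullptr);
    while (*flag != 1) { /* busy-loop until the kernel's write lands */ }
    auto end = std::chrono::steady_clock::now();

    zeCommandQueueSynchronize(queue, UINT64_MAX);   // clean up the submission
    return std::chrono::duration<double, std::micro>(end - start).count();
}
```
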
BestWalkerSubmissionImmediate enqueues a kernel that updates a system memory location, then busy-loops on the CPU until the update becomes visible. The kernel is enqueued using a low-latency immediate command list, so the test is Level Zero-specific.
                      ✔️
BestWalkerSubmissionImmediateMultiCmdlists Appends N kernels on N cmdlists, each of which updates a system memory location, then busy-loops on the CPU until the updates become visible. Kernels are appended using immediate command lists. The number of command lists is specified by cmdlistCount.
                      • --cmdlistCount Count of command lists
                      ✔️
CompletionLatency enqueues a system memory write and measures the time between the moment the update is visible on the CPU and the moment the synchronizing call returns.
                        ✔️
CopySubmissionEvents enqueues a 4-byte copy on the copy engine and returns the submission delta, i.e. the time between the host API call and the copy engine start.
                        • --engine Engine used for copying (RCS or CCS0 or CCS1 or CCS2 or CCS3 or BCS or BCS1 or BCS2 or BCS3 or BCS4 or BCS5 or BCS6 or BCS7 or BCS8)
                        ✔️ ✔️
EmptyKernel enqueues an empty kernel and measures the time to launch it and wait for it on the CPU, thus measuring walker spawn time.
                        • --wgc Workgroup count
                        • --wgs Workgroup size (aka local work size)
                        ✔️ ✔️
EmptyKernelImmediate enqueues an empty kernel using an immediate command list and measures the time to launch it and wait for it on the CPU, thus measuring walker spawn time.
• --UseEventForHostSync If true, use events to synchronize with the host. If false, use zeCommandListHostSynchronize (0 or 1)
                        • --wgc Workgroup count
                        • --wgs Workgroup size (aka local work size)
                        ✔️
EnqueueBarrierWithEmptyWaitlist enqueues kernels with empty-waitlist barriers in between, waiting on the last barrier's event
                        • --enqueueCount Number of enqueues
                        • --outOfOrderQueue Use out of order queue (0 or 1)
                        ✔️
KernelSwitchLatency measures the time from the end of one kernel to the start of the next kernel (see the timestamp sketch after this entry)
• --barrier Synchronize with a barrier instead of events (0 or 1)
• --count Count of kernels
• --counterBasedEvents Use counter-based events for in-order mode (0 or 1)
• --execTime Approximately how long a single kernel executes, in us
• --flush Flush between kernels (0 or 1)
• --hostVisible Create events with the host-visible flag (0 or 1)
• --inOrder Use an in-order queue/command list (0 or 1)
                        ✔️ ✔️
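A sketch of how the switch time can be derived from kernel timestamp events, which is one plausible reading of the description: the gap is the next kernel's start minus the previous kernel's end. Events are assumed to be created with the timestamp flag and signaled by consecutive kernels; conversion from ticks to nanoseconds is left out:

```c++
#include <cstdint>
#include <level_zero/ze_api.h>

// Raw tick delta between two consecutive kernels; multiply by the device's
// timerResolution (ze_device_properties_t) to convert to time.
uint64_t switchTicks(ze_event_handle_t prevKernelEvent,
                     ze_event_handle_t nextKernelEvent) {
    ze_kernel_timestamp_result_t tPrev{}, tNext{};
    zeEventQueryKernelTimestamp(prevKernelEvent, &tPrev);
    zeEventQueryKernelTimestamp(nextKernelEvent, &tNext);
    return tNext.global.kernelStart - tPrev.global.kernelEnd;
}
```
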
KernelSwitchLatencyImmediate measures the time from the end of one kernel to the start of the next kernel using immediate command lists
• --barrier Synchronize with a barrier instead of events (0 or 1)
• --count Count of kernels
• --counterBasedEvents Use counter-based events for in-order mode (0 or 1)
• --execTime Approximately how long a single kernel executes, in us
• --hostVisible Create events with the host-visible flag (0 or 1)
• --inOrder Use an in-order queue/command list (0 or 1)
• --useProfiling Use profiling to obtain the switch time (0 or 1)
                        ✔️
                        KernelWithWork measures time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value.
                        • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
                        • --wgc Workgroup count
                        • --wgs Workgroup size (aka local work size)
                        ✔️ ✔️
                        KernelWithWorkImmediate measures time required to run a GPU kernel which assigns constant values to elements of a buffer using immediate command list. Each thread assigns one value.
• --UseEventForHostSync If true, use events to synchronize with the host. If false, use zeCommandListHostSynchronize (0 or 1)
                        • --usedIds Which of the get_global_id() and get_local_id() calls will be used in the kernel (None or Global or Local or AtomicPerWkg)
                        • --wgc Workgroup count
                        • --wgs Workgroup size (aka local work size)
                        ✔️
                        KernelWithWorkPeriodic measures average time required to run a GPU kernel which assigns constant values to elements of a buffer. Each thread assigns one value. Kernel is run multiple times with a set delay between submissions.
                        • --numSubmissions Number of kernel enqueues to run
                        • --timeBetweenSubmissions Delay between kernel enqueues in microseconds
                        ✔️
MultiKernelExecution submits multiple kernels in a single command list and measures the total time of their execution
• --count Count of kernels within the command list
• --delay Delay between atomic reads
• --inOrderOverOOO Use an out-of-order queue to implement in-order behavior (0 or 1)
• --wkgCount Workgroup count of each kernel
• --wkgSizes Workgroup size of each kernel
                        ✔️
MultiQueueSubmission enqueues a kernel on multiple command queues
                        • --queueCount Number of command queues created
                        • --wgc Workgroup count
                        • --wgs Workgroup size
                        ✔️ ✔️
MultipleImmediateWithDependencies Creates N immediate command lists and submits kernels in order to each of them. Each kernel depends on the previous one. Submissions are small to allow concurrent execution. Measures the time from the start of scheduling until all command lists are completed (see the sketch after this entry).
                        • --UseEventForHostSync If true, use events to synchronize with host. If false, use zeCommandListHostSynchronize (0 or 1)
                        • --cmdlistCount Count of command lists
                        ✔️
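A sketch of the dependency chain described above: each immediate command list's kernel waits on the event signaled by the previous one, so the submissions serialize through events while still being issued to N lists. Creation of the lists, kernel, and event pool is omitted (names illustrative):

```c++
#include <level_zero/ze_api.h>
#include <vector>

void submitChain(const std::vector<ze_command_list_handle_t> &immCmdLists,
                 ze_kernel_handle_t kernel,
                 const std::vector<ze_event_handle_t> &events) {
    ze_group_count_t groups{1, 1, 1};           // small submissions, as described
    for (size_t i = 0; i < immCmdLists.size(); ++i) {
        ze_event_handle_t wait = (i > 0) ? events[i - 1] : nullptr;
        zeCommandListAppendLaunchKernel(immCmdLists[i], kernel, &groups,
                                        events[i],        // signal on completion
                                        wait ? 1u : 0u,   // depend on previous
                                        wait ? &wait : nullptr);
    }
    // when the last event signals, every list in the chain has completed
    zeEventHostSynchronize(events.back(), UINT64_MAX);
}
```
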
                        NewResourcesSubmissionDevice enqueues kernel that uses a buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation.
                        • --size Size of the buffer
                        ✔️ ✔️
                        NewResourcesSubmissionHost enqueues kernel that uses a buffer placed in host memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation.
                        • --size Size of the buffer
                        ✔️ ✔️
                        NewResourcesWithGpuAccess enqueues kernel that accesses an entire buffer placed in device memory to measure resource preparation time. The resource is destroyed and recreated for each iteration to ensure it is a different memory allocation.
                        • --size Size of the buffer
                        ✔️ ✔️
QueueConcurrency Submits multiple kernels to an out-of-order queue, returning events. Each sequence consists of a long kernel, a short kernel, and a wait for the short kernel; the test then calls synchronization and measures performance.
                        • --kernelCount How many kernels are submitted
                        • --kernelTime How long each work item is in kernel
                        • --workgroupCount Workgroup Count of each kernel
                        ✔️
QueuePriorities Uses queues with different priorities to measure submission and context-switch latencies
                        • --highTime How long each work item is in high priority kernel
                        • --lowTime How long each work item is in low priority kernel
• --priorities Use the low-priority command queue property (0 or 1)
• --sleep Sleep time in us after the low-priority kernel is flushed
                        • --wgc Workgroup count of high priority kernel
                        ✔️
ResourceReassign Enqueues a stress kernel which utilizes the majority of the GPU's execution units, then enqueues the next kernel, measuring its execution time. Shows the overhead related to releasing and reassigning GPU resources.
• --queueCount Number of different command queues to submit to after the stress kernel
                        ✔️
RoundTripSubmission enqueues a kernel which updates a system memory location and waits for it with a synchronizing API.
                          ✔️ ✔️
                          UsmSharedFirstCpuAccess allocates a unified shared memory buffer and measures time to access it on CPU after creation.
                          • --initialPlacement Hint for initial placement of the resource passed to the driver (Any or Host or Device)
                          • --size Size of the buffer
                          ✔️ ✔️
                          UsmSharedFirstGpuAccess allocates a unified shared memory buffer and measures time to access it on GPU after creation.
                          • --initialPlacement Hint for initial placement of the resource passed to the driver (Any or Host or Device)
                          • --size Size of the buffer
                          ✔️ ✔️
WalkerCompletionLatency enqueues a kernel writing to system memory and measures the time between the moment the update is visible on the CPU and the moment the synchronizing call returns
                          • --inOrderQueue If set use IOQ, otherwise OOQ. Applicable only for OCL. (0 or 1)
                          • --useFence Use fence during submission and for further completion. (0 or 1)
                          ✔️ ✔️
                          WalkerSubmissionEvents enqueues an empty kernel with GPU-side profiling and checks delta between queue time and start time.
                            ✔️ ✔️
WriteLatency unblocks an event on the GPU, then waits for the timestamp to be written.
                              ✔️