
Notes on fio.

NUMA and CPU pinning

It is advisable to run the fio threads on the same NUMA node as the storage devices under test. This can be done with numactl -N <x> -m <x> -C <x> fio ..., with x the node(s). However, while monitoring fio we noticed that this is not always respected: threads were sometimes spawned outside of the specified nodes. Therefore, it is recommended to set the NUMA nodes from within fio as well. This can be done with:

  • numa_cpu_nodes: set the allowed NUMA nodes that fio can use for forks/threads.
  • numa_mem_policy: set the NUMA memory policy (-m in numactl). You can use the default policy for a node, prefer the memory of one node, interleave nodes, or bind to nodes. In general bind is the easiest to reason about, because it limits fio to a few (or one) NUMA nodes. For example, say the NVMe device you use is on NUMA node 0. Then you can try --numa_cpu_nodes=0 --numa_mem_policy=bind:0, as in the example below.
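
As a concrete example, a minimal invocation that pins both fio and its memory to node 0 from the outside and from within fio (the device path and workload parameters are placeholders, adjust to your setup):

numactl -N 0 -m 0 fio --name=numa-test --filename=/dev/nvme0n1 \
    --ioengine=io_uring --direct=1 --rw=randread --bs=4k --iodepth=32 \
    --numa_cpu_nodes=0 --numa_mem_policy=bind:0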

It is also possible to pin specific CPUs. This can be useful to measure single-core performance. In fio you can pin a core with:

  • cpumask: set the CPU mask to use, e.g. 1 (binary 00000001) for the first core.

However, setting it through fio might not be fully reliable. Also try numactl -C <x>, with x the cores, or taskset --cpu-list <x>, with x the cores separated by commas. This forces fio to run on only those cores. If even this does not work, you need to alter the kernel boot parameters.
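
For example, to force fio onto a single core both from the outside and from within fio (the core number and device path are placeholders):

# pin fio to core 0 with taskset; --cpumask=1 additionally pins it from within fio
taskset --cpu-list 0 fio --name=single-core --filename=/dev/nvme0n1 \
    --ioengine=io_uring --direct=1 --rw=randread --bs=4k --cpumask=1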

io_uring

io_uring comes with a number of options that can make a significant difference:

  • hipri: use polling mode. In general, enable for fast devices.
  • sqthread_poll: use a separate kernel thread for polling. This is easy to spot in e.g. htop. Each thread will require one extra kernel thread, so be careful with the number of available cores.
  • sqthread_poll_cpu: define what cpu(s) to use for the kernel polling thread. This gives more fine-grained control over the kernel thread behaviour, but you need to be careful that you set it properly.
  • fixedbufs: reuse buffers for direct I/O, avoiding a free/malloc for every request. It is not clear why this is not the default.
  • registerfiles: register files beforehand, removing some per-request overhead.
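
A sketch of how these options combine on one command line (the device path and job sizes are placeholders; sqthread_poll_cpu assumes core 1 is free for the kernel polling thread):

fio --name=uring-poll --filename=/dev/nvme0n1 --ioengine=io_uring \
    --direct=1 --rw=randread --bs=4k --iodepth=32 \
    --hipri --sqthread_poll --sqthread_poll_cpu=1 \
    --fixedbufs --registerfiles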

ZNS

Unwritten contract

ZNS is different from other storage devices. Therefore, there are different things to look out for. We list a few idiosyncrasies to remember when benchmarking:

  • The LBA format (lbaf) of the ZNS namespace can have a big impact on performance. For example, we saw a throughput increase of up to 100% in certain workloads. Therefore, be sure to test with different formats, either by creating a namespace with a different format or by reformatting the namespace in use.
  • ZNS SSDs can contain multiple types of zones. Conventional zones can be treated like ordinary block storage and do not need special ZNS-specific treatment in fio. Sequential zones do, but fio might not prevent you from running invalid workloads, which can result in freezing the device.
  • There is not a lot of sanity checking for ZNS devices. Always set --direct=1 with ZNS and prefer to use zonemode=zbd. If you do not, prepare to reboot the machine or wait very long (the request will hang, and the process and kernel along with it).
  • You need to reset all zones before running a fio test. Otherwise, fio can use too many active zones and threads will start to fail (silently).
  • Random write workloads are not recommended. Random writes are not possible on ZNS: they can be requested, but the zbd logic will still force the requests to go sequentially within a zone. This makes random write tests very hard to reason about, and impractical. Instead, try to emulate the behaviour with multiple threads, each starting at a different offset; see the sketch after this list.
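
Concretely, the reformat and reset can be done with nvme-cli and util-linux, and the multi-threaded emulation with fio's offset_increment. A sketch, assuming a namespace at /dev/nvme0n2, four jobs, and ten zones per job (all placeholders):

# optional: switch to another LBA format (list the supported ones with nvme id-ns first)
sudo nvme format /dev/nvme0n2 --lbaf=0
# reset all zones before the test
sudo blkzone reset /dev/nvme0n2
# emulate "random" writes: four sequential writers, each in its own range of zones
sudo fio --name=pseudo-rand --filename=/dev/nvme0n2 --zonemode=zbd --direct=1 \
    --ioengine=io_uring --rw=write --bs=64k --iodepth=1 \
    --numjobs=4 --offset_increment=10z --size=10z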

How to use ZNS within Fio

Fio supports ZNS by setting --zonemode=zbd. Be careful: if you do not set this mode, or if you use --direct=0, the device will hang and you will have to wait very long for it to become usable again, or restart the machine. So always specify --zonemode=zbd --direct=1 if you want to be safe. The device will then be treated as ZNS. Other zone modes like "strided" do exist, but are probably not what you want. There are a few things to know about:

  • You can NOT use a high queue depth for writes unless you use a scheduler. When using QD>1, ensure that the scheduler is mq-deadline and not none.
  • You MUST reset zones manually before starting the test. Fio does not do this, and it does not keep track of zones that are already open. It will therefore think fewer zones are open/active than there really are. The result is that the device's limit can be reached and threads will fail (sometimes silently).
  • You can specify the maximum number of open zones manually with max_open_zones. This constraint is only enforced from WITHIN fio, so it is arbitrary, but it should generally not exceed what the device allows. You can use it to force fio to use fewer zones. You can further limit this per job with job_max_open_zones, so that some jobs have access to fewer/more zones. You can also try to break the device with --ignore_zone_limits=1.
  • You can treat an ordinary block device like a ZNS device. Then you must set max_open_zones, as well as the zone size and capacity with zonesize and zonecapacity.
  • Reads are always a bit special. It is recommended to write before you read, because devices can be smart and ignore the request (if the range is empty). This can be partially circumvented by setting read_beyond_wp=1, which allows reading the empty parts of zones as well.
  • You can create GC-like patterns. By default fio only resets zones sporadically. To mimic applications that use GC, resets can be forced: set zone_reset_threshold to indicate the minimum threshold (ratio of written bytes to the data set size) before resets kick in, and zone_reset_frequency to control how often they happen (one reset per 1/zone_reset_frequency write requests).
  • There is a dedicated size unit for zones: z (e.g. --size=2z for two zones), in addition to the usual % and K/M/G. Use it, it makes your life easier.
  • Fio can set the scheduler for you; there is no need to write to /sys/block/<dev>/queue/scheduler manually. Use --ioscheduler=none or --ioscheduler=mq-deadline instead.
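
Putting the points above together, a reasonably safe sequential write test could look as follows (a sketch; the device path, block size, queue depth, and zone counts are placeholders):

# assumes all zones were reset beforehand, as described above
sudo fio --name=zns-write --filename=/dev/nvme0n2 \
    --zonemode=zbd --direct=1 --ioscheduler=mq-deadline \
    --rw=write --bs=128k --iodepth=4 --max_open_zones=12 --size=8z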

If fio is set to provide results in JSON output (--output-format=json), it does not report the number of zone resets by default, although it does count them. A simple addition to the fio code adds the zone reset count to the JSON data:

user@stosys:~/src/fio$ git diff
diff --git a/stat.c b/stat.c
index 949af5ed..8e1bcba8 100644
--- a/stat.c
+++ b/stat.c
@@ -1705,6 +1705,8 @@ static struct json_object *show_thread_status_json(struct thread_stat *ts,
        if (opt_list)
                json_add_job_opts(root, "job options", opt_list);

+       json_object_add_value_int(root, "zone_resets", ts->nr_zone_resets);
+
        add_ddir_status_json(ts, rs, DDIR_READ, root);
        add_ddir_status_json(ts, rs, DDIR_WRITE, root);
        add_ddir_status_json(ts, rs, DDIR_TRIM, root);
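
After rebuilding fio with this patch, the counter appears in each job's JSON object and can be extracted with e.g. jq (the job below is a hypothetical example):

sudo ./fio --name=zns-job --filename=/dev/nvme0n2 --zonemode=zbd --direct=1 \
    --rw=write --size=4z --output-format=json | jq '.jobs[0].zone_resets'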

Fio, ZNS and SPDK

When the SPDK plugin is used, ZNS can be used as well. SPDK comes with more options for zoned devices, but there are some things to think about:

  • You cannot set a scheduler, so you cannot write with QD > 1 (unless you use appends, see below).
  • You can use appends. SPDK is currently the only io engine that supports them. With appends, you can set QD > 1.
  • You do not need to reset the zones of the device manually beforehand: the option initial_zone_reset resets all zones before the test starts, outside of the timed part of the run (so it does not interfere with the measurements). See the job file sketch after this list.
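
A sketch of a job file for SPDK's NVMe fio plugin with ZNS. The plugin path, PCIe address, and namespace are placeholders, and the plugin-specific options (initial_zone_reset, zone_append) should be checked against your SPDK version:

; run with: LD_PRELOAD=<spdk>/build/fio/spdk_nvme fio zns-spdk.fio
[global]
ioengine=spdk
thread=1
direct=1
zonemode=zbd

[zns-append]
filename=trtype=PCIe traddr=0000.00.04.0 ns=2
rw=write
bs=64k
iodepth=8
initial_zone_reset=1
zone_append=1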

How does Fio use ZNS

Fio has something known as "zonemode". One of these modes is zbd, which can be confusing: it is an in-house wrapper inside fio rather than a pass-through to the device. Internally it uses blkzoned.h from the kernel, and many parts of ZNS appear to be emulated; we could only spot a few true ZNS-like actions. Fio itself uses a sort of sub-engine that makes sure the I/O is valid for ZNS. This allows using "ZBD" also from, say, libaio or io_uring, because fio does the actual zone management and decides whether a command is valid or invalid.

On a write or read, fio itself makes sure that the request is valid for ZNS. If a read crosses a zone boundary, it is cut (yes, really cut, not split!); the same holds for a write. I/O is then aligned to the block size, and internally fio maintains state for each zone: the write pointer, its condition, the capacity, etc. This allows storage engines to not care about ZNS at all.

Still, some zone management is needed. Fio does this in the background and wraps many calls with extra operations. For example, on file deletion fio might decide to reset the zone as well, and on start-up fio needs to get the state from the device (report zones). This is how fio's zbd mode performs each of the raw zone operations:

  • Reset zone(s): this is done through blkzoned.h. It simply resets the zone.
  • Open and close zones: this is very important; fio does not actually do this explicitly! Internally it maintains a list of zone states, and zones are never truly closed or opened on the device. Instead, a zone is implicitly opened by e.g. writing to it. Therefore, you will not be able to measure close or open performance.
  • Report zone: this is also done through blkzoned.h, at start-up, to learn the state and write pointer of each zone.
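
The same raw operations are available from the command line through blkzone (util-linux), which is handy to verify what fio actually did to the device (the namespace path is a placeholder):

# show each zone's type, condition, and write pointer, like fio's report-zones call
sudo blkzone report /dev/nvme0n2 | head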

In short, Fio has its own custom ZNS behaviour. That is something to be aware of.

fio with write hints

With file systems such as F2FS, files can be given hints about their expected lifetime. Based on these, the file system can make better data placement decisions, such as separating hot and cold data, as F2FS does. Hints are set on the inode in the VFS, from which the file system picks up the information. Fio supports the following hints:

#ifdef FIO_HAVE_WRITE_HINT
        {
                .name   = "write_hint",
                .lname  = "Write hint",
                .type   = FIO_OPT_STR,
                .off1   = offsetof(struct thread_options, write_hint),
                .help   = "Set expected write life time",
                .category = FIO_OPT_C_ENGINE,
                .group  = FIO_OPT_G_INVALID,
                .posval = {
                          { .ival = "none",
                            .oval = RWH_WRITE_LIFE_NONE,
                          },
                          { .ival = "short",
                            .oval = RWH_WRITE_LIFE_SHORT,
                          },
                          { .ival = "medium",
                            .oval = RWH_WRITE_LIFE_MEDIUM,
                          },
                          { .ival = "long",
                            .oval = RWH_WRITE_LIFE_LONG,
                          },
                          { .ival = "extreme",
                            .oval = RWH_WRITE_LIFE_EXTREME,
                          },
                },
        },
#endif

Setting write hints on files relies on F_SET_RW_HINT in fcntl() to pass the hint. There is an additional F_SET_FILE_RW_HINT, which sets the write hint on the file descriptor; however, this flag was removed in kernel 5.17+. Fio still uses it for direct I/O, i.e. when --direct=1 is passed; only buffered commands (without direct I/O) use F_SET_RW_HINT. Thankfully, we can easily modify the fio code to always use F_SET_RW_HINT, regardless of buffered or direct I/O. This is achieved with the following change (it only deletes 3 lines of code):

user@stosys:~/src/fio$ git diff ioengines.c
diff --git a/ioengines.c b/ioengines.c
index e2316ee4..525cbcd1 100644
--- a/ioengines.c
+++ b/ioengines.c
@@ -587,9 +587,6 @@ int td_io_open_file(struct thread_data *td, struct fio_file *f)
                 * the file descriptor. For buffered IO, we need to set
                 * it on the inode.
                 */
-               if (td->o.odirect)
-                       cmd = F_SET_FILE_RW_HINT;
-               else
                cmd = F_SET_RW_HINT;

                if (fcntl(f->fd, cmd, &hint) < 0) {

Now we can pass file hints with buffered or direct I/O alike, giving the file system more information for better data placement. This is for instance beneficial when benchmarking file system performance under heavy workloads that only write hot data: the file data is grouped better, causing less GC in file systems like F2FS.
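
For example, a write marked as short-lived on an F2FS mount (the mount point and file name are placeholders):

fio --name=hot-data --filename=/mnt/f2fs/hotfile --size=1G \
    --rw=write --bs=128k --write_hint=short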

How to read fio's performance results

At the end of a run fio reports various metrics, typically including a latency overview. This overview is not stable between io engines and requires careful consideration. The following three latency metrics can be reported:

  • slat: submission latency, the latency of submitting an I/O. Not present in synchronous engines.
  • clat: completion latency, the latency from submission to completion. Depending on the engine it can be close to 0; for most engines it is the latency of one operation, but with io_uring and sqthread_poll it appeared to cover the time of the entire run (this is unclear).
  • lat: the total latency of one I/O request. In most cases this should equal slat+clat. In general, before copying latency results blindly, investigate the io engine and options used.

NVMe

Volatile write cache

The volatile write cache corresponds to NVMe feature ID 6 and can be toggled with nvme-cli:

  • off: sudo nvme set-feature /dev/... -f 6 -v 0
  • on: sudo nvme set-feature /dev/... -f 6 -v 1
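
To verify the current state, feature 6 can be read back (output formatting differs between nvme-cli versions):

sudo nvme get-feature /dev/... -f 6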