-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tcmulib_get_next_command is broken on aarch64 #688
Comments
This was referenced Oct 4, 2022
geordieintheshellcode
pushed a commit
to ondat/tcmu-runner
that referenced
this issue
Feb 20, 2023
Use the appropriate atomic load instruction to ensure we synchronize the read of the command ring head index with any previous stores. Not doing this causes command ring corruption on aarch64: [Tue Oct 4 15:45:50 2022] cmd_id 0 not found, ring is broken [Tue Oct 4 15:45:50 2022] ring broken, not handling completions See issue open-iscsi#688. PR open-iscsi#693. Signed-off-by: Xiubo Li <xiubli@redhat.com>
The macro TCMU_UPDATE_RB_TAIL to update cmd_tail should also use atomic store ? |
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Repro steps
master
with some scripts and a fix toconsumer.c
to make it not when receiving IO.consumer
example.consumer
handles any TCMU backstores with the "foo" handler. For WRITE SCSI commands we'll simply drop them on the floor and return TCMU_STS_OK.foo
handler and hook it up to a SCSI HBA/LUN.sudo FILE=/dev/<dev_name> BS=4k QD=128 fio randwrite.fio
Almost immediately you'll see the "cmd_id X not found, ring is broken" message in dmesg, and a bunch of writes will fail:
The same test happily runs for hours on x86.
Hypothesis
I think the issue is caused by us not reading the command ring
head
index with the correct memory barrier semantics. On the kernel side, intarget_core_user.c
, we can see that every time a new command is enqueued we first fill theentry
and then update the ring buffer head index using theUPDATE_HEAD
macro (https://elixir.bootlin.com/linux/v4.18/source/drivers/target/target_core_user.c#L1008).UPDATE_HEAD
does ansmp_store_release()
to the head index. From my understanding, a store release barrier should be paired with a load acquire barrier. i.e. the read of the cmd head index intcmulib_get_next_command
should use acquire semantics. Running with the following patch causes the issue to go away, but I don't know if it's purely fortuitous or actually fixes the issue. The idea is to read the head index using an atomic load - which should (I think) be equivalent to a load acquire barrier:The patch also appears to work with a less punitive
__ATOMIC_ACQUIRE
load.I'd appreciate someone checking my working here as I'm not massively clued up on this area! BTW I found this document useful in describing the expected memory barrier semantics for ring buffers: https://www.kernel.org/doc/Documentation/core-api/circular-buffers.rst.
The text was updated successfully, but these errors were encountered: