Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement volatile barrier APIs #107843

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Implement volatile barrier APIs #107843

wants to merge 6 commits into from

Conversation

hamarb123
Copy link
Contributor

@hamarb123 hamarb123 commented Sep 15, 2024

Fixes #98837

This implements the proposed Read-ReadWrite and ReadWrite-Write barriers. Note: I haven't implemented any tests yet.

/cc @jkotas @VSadov @kouvel

Now that I'm on the correct branch
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Sep 15, 2024
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

1 similar comment
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Sep 15, 2024
@VSadov
Copy link
Member

VSadov commented Sep 16, 2024

my understanding of the x86 memory model is that it allows reads to be reordered after writes (when different addresses) (which is not supposed to be allowed across a Volatile.WriteBarrier with the ReadWrite-Write model if I'm understanding correctly)

reads are not reordered after writes in x86.
reads can happen earlier than preceding writes (i.e. prefetch), and to prevent that you'd indeed need a fill fence, but that is not something that WriteBarrier needs to guarantee.

In short - WriteBarrier needs to wait for reads/writes in progress to complete before allowing more writes.

  • on x86/x64 Volatile.WriteBarrier is just a compiler fence, similar to Volatile.ReadBarrier.
  • on arm it is a full fence (dmb[ish]). Sadly, dmb.st does not wait for reads.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 16, 2024

Hmm, I still don't understand. What's the difference between:

...x
Volatile.ReadBarrier(); //emits nothing
Volatile.WriteBarrier(); //emits nothing
...y

And

...x
Interlocked.MemoryBarrier(); //emits lock ...
...y

@VSadov can you give me an example of some x & y where the behaviour is allowed to be different so I can see an example of what I'm not understanding?

Edit: I think I understand now (leaving this here for my future reference)

read a
write b
Volatile.ReadBarrier(); Volatile.WriteBarrier(); //or swapped
read c
write d

could be re-ordered to: read a, read c, write b, write d, whereas Interlocked.MemoryBarrier(); would stop this re-ordering obviously.

- And fix missed file from jit-format
@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 16, 2024

on arm it is a full fence (dmb[ish]). Sadly, dmb.st does not wait for reads.

Would dmb ishst + dmb ishld be enough? (idk if this is actually faster anyway, but maybe it is?)
It would give Load-Load, Load-Store, and Store-Store guarantees according to this.

Using my example from earlier: read a, write b, barrier/s, read c, write d (where these represent arbitrary quantities of reads & writes in any order):

  • Volatile.WriteBarrier requires: a,b before d
  • dmb ishst gives b before d
  • dmb ishld gives a before c,d

So it would seem to me as though dmb ishst + dmb ishld (in either order) should theoretically be enough. Whether it's faster than just a dmb ish is another question obviously (one that is only really relevant if it's a valid approach anyway). If there's something wrong with my analysis, please let me know :)

@VSadov
Copy link
Member

VSadov commented Sep 16, 2024

Would dmb ishst + dmb ishld be enough? (idk if this is actually faster anyway, but maybe it is?)

At first glance it seems that the combination is as good as a full barrier.
If the cost of a full barrier could be reduced by doing two half barriers instead, I'd think that is how hardware would do it, so likely it is not faster.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 16, 2024

At first glance it seems that the combination is as good as a full barrier. If the cost of a full barrier could be reduced by doing two half barriers instead, I'd think that is how hardware would do it, so likely it is not faster.

It's actually not as strong as a full barrier, since it doesn't give b before c, which is the same thing that x86 doesn't give by default I think based on what you were saying.

@VSadov
Copy link
Member

VSadov commented Sep 16, 2024

At first glance it seems that the combination is as good as a full barrier. If the cost of a full barrier could be reduced by doing two half barriers instead, I'd think that is how hardware would do it, so likely it is not faster.

It's actually not as strong as a full barrier, since it doesn't give b before c, which is the same thing that x86 doesn't give by default I think based on what you were saying.

Ah, right, it still does not order Store-Load. It could be cheaper then, since it guarantees less.

@hamarb123
Copy link
Contributor Author

hamarb123 commented Sep 19, 2024

Ah, right, it still does not order Store-Load. It could be cheaper then, since it guarantees less.

I did some testing based on code I gave in the use case section of my api proposal issue (converted to C++) on a M-series macbook and got about a 1.4% regression overall (don't interpret that as the pair is exactly 1.4% slower than just dmb ish as there is obviously other code around the dmb instructions, it's just most likely not faster I think) (testing ran for about 25 mins total), so probably no point pursuing this idea further. Not overly surprised, but anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API Proposal]: Volatile barrier APIs
2 participants