Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Inline access to thread statics #79521

Closed
VSadov opened this issue Dec 12, 2022 · 19 comments
Closed

[NativeAOT] Inline access to thread statics #79521

VSadov opened this issue Dec 12, 2022 · 19 comments

Comments

@VSadov
Copy link
Member

VSadov commented Dec 12, 2022

Re: #79519

While implementing thin locks for NativeAOT I noticed that making calls to runtime to figure location of ManagedThreadId is a considerable expense (relatively, as all other parts of the lock implementation are fairly cheap).

Note that in NateviAOT the managed thread ID is dispensed and managed on the managed side. That is because native thread and managed thread object may outlive each other and we need to keep the ID unique as long as either is alive.

The part that the ID is dispensed and set from managed code should not prevent the optimizations. We basically need to somehow intrincify/inline the computation of the managed thread ID location, then the regular implementation that initializes the ID on demand should work just fine.

In fact only reading path is perf critical. We will write to the location not more than once per life time of the thread.

Similar optimization may be applicable to CoreClr as well. CoreClr also makes a method call when accessing managed thread ID.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 12, 2022
@ghost
Copy link

ghost commented Dec 12, 2022

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Re: #79519

While implementing thin locks for NativeAOT I noticed that making calls to runtime to figure location of ManagedThreadId is a considerable expense (relatively as all other parts of the lock implementation are fairly cheap).

Note that in NateviAOT the managed thread ID is dispensed and managed on the managed side. That is because native thread and managed thread object may outlive each other and we need to keep the ID unique as long as either is alive.

That should not prevent the optimizations. We basically need to somehow intrincify/inline the computation of the managed thread ID
location, then the regular implementation that initializes the ID on demand should work just fine.

In fact only reading path is perf critical.
We will write to the location not more than once per life time of the thread.

Similar optimization may be applicable to CoreClr as well. CoreClr also makes a method call when accessing managed thread ID.

Author: VSadov
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -

@MichalStrehovsky
Copy link
Member

In .NET Native, we intrinsified getting the native thread ID and switched things to use that instead of the managed one (see old change by @AntonLapounov: dotnet/corert@7087448).

Not saying that that's what we should do because I'm far from an expert in this area - I just happen to remember random stuff from 5 years ago so throwing this here.

@VSadov
Copy link
Member Author

VSadov commented Dec 12, 2022

We need specifically the managed thread ID here - the one that is a small int32 number: 1, 2, 3, 4... which is the identity of a thread in the managed runtime. In theory a managed thread does not need to map 1-1 to the native thread, and the same native thread may have different IDs in different runtimes, if process loads more than one runtime.
Historically native thread ID is a Windows thing, but there is typically something that is ID-like for threads on other platforms. It could be the stack base address, pthread_t, etc.. Even on Windows the TEB address could act as a faster substitute for a thread ID.

Anyways, the main issue here is that native ID is a pointer-sized value and there is no space for that in the object header, and there is no reason for it to be a pointer, even int32 is too generous - how many threads can you have at the same time?
CoreClr thin locks support IDs in [1..1024) range, which I think is on the low side nowdays. I made NativeAOT thin locks to support IDs up to 65535 (there are few spare bits if we need more, but not a lot). Anything above that can only own a fat lock. Native thread IDs, which are pointers, can easily be outside of these ranges.

As a known quantity under runtime's control, managed thread ID is more predictable and more portable. I think we should use managed IDs in the managed runtime unless scenario really calls for the native ID (hard to think of a scenario outside of interop with native threading). And we should make accessing the managed ID faster.
It is actually quite fast, but could be faster if we eliminate the cost of a call.

@jkotas
Copy link
Member

jkotas commented Dec 12, 2022

We should make all thread-statics fast by inlining them. I do not think it is worth building some special path for just ManagedThreadId.

@jkotas jkotas changed the title [NativeAOT] Intrincify and inline the access to ManagedThreadId location [NativeAOT] Inline access to thread statics Dec 12, 2022
@VSadov
Copy link
Member Author

VSadov commented Dec 12, 2022

We should make all thread-statics fast by inlining them

If that is not too hard, it would be great.

@jkotas
Copy link
Member

jkotas commented Dec 12, 2022

It should not be significantly harder than inlining ManagedThreadId access only. The JIT needs to be able to emit the platform-specific native sequence to access thread statics in either case. The only difference is in where to get the symbols or offsets to use in the sequence.

@EgorBo
Copy link
Member

EgorBo commented Dec 14, 2022

Related: #63619

@JulieLeeMSFT
Copy link
Member

CC @kunalspathak.

@VSadov
Copy link
Member Author

VSadov commented Feb 22, 2023

RhpGetThreadStaticBaseForType call is the top fourth call in BasicMinimalApi traces (in terms of exclusive cost). There are other calls like Monitor.Enter and Monitor.Exit that may get cheaper if threadstatic access is inlined.

I think optimizing thread statics may result in material improvements

image

@kunalspathak
Copy link
Member

Thanks for the data @VSadov . I am currently working on the prototype for this.

@MichalStrehovsky MichalStrehovsky added this to the 8.0.0 milestone Jul 11, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 11, 2023
@MichalStrehovsky MichalStrehovsky modified the milestones: 8.0.0, 9.0.0 Aug 2, 2023
@MichalStrehovsky
Copy link
Member

I'm moving this issue to 9.0 since we don't expect the RyuJIT TLS work to land in 8.0 per #89472 (comment).

@MichalStrehovsky
Copy link
Member

Any codegen TLS work left or can we close?

@VSadov
Copy link
Member Author

VSadov commented Jan 30, 2024

I think arm64 is still to be supported.

@kunalspathak
Copy link
Member

windows/x64 support: #89472
linux/x64 support: #97413
linux/arm64 support: #97910

@agocke
Copy link
Member

agocke commented Jun 7, 2024

@kunalspathak any plans for Windows Arm64?

@kunalspathak
Copy link
Member

@kunalspathak any plans for Windows Arm64?

Hoping to do in .NET 9

@kunalspathak
Copy link
Member

@kunalspathak any plans for Windows Arm64?

Hoping to do in .NET 9

#104282

@kunalspathak
Copy link
Member

#104282 is done, so this issue can be closed now.

@MichalStrehovsky
Copy link
Member

#104282 is done, so this issue can be closed now.

Sounds good to me!

@github-actions github-actions bot locked and limited conversation to collaborators Aug 9, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Archived in project
Development

No branches or pull requests

7 participants