Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] ThreadStatics part 2 #87148

Merged
merged 24 commits into from
Jun 23, 2023
Merged

[NativeAOT] ThreadStatics part 2 #87148

merged 24 commits into from
Jun 23, 2023

Conversation

VSadov
Copy link
Member

@VSadov VSadov commented Jun 5, 2023

The change allows inlining of the access to thread static variables in the accessing module.
The optimization is no longer affected by whether we have a singlemodule or multimodule scenario.

We do only win-x64 and linux-x64 here. That is enough to discuss and test the design.
The other platform combinations may come as a separate change and will simply follow the same plan.

The change requires dotnet/llvm-project#429 , dotnet/llvm-project#438 in order to pass tests.


  • remove GetSingleTypeManager
  • remove RhGetInlinedThreadStaticStorage
  • remove RhpGetInlinedThreadStaticBase
  • remove tls_InlinedThreadStatics from the native runtime (c++)
  • make tls_InlinedThreadStatics emitted by ILC and inline the access code into the thinks
    • Linux x64
    • Linux arm64
    • Windows x64
    • Windows arm64 - similar to platforms above, but harder to test. Will likely be a separate PR.
    • MacOS - there are some issues with convincing ObjWriter to emit appropriate relocs. Will likely be a separate PR.
  • instruct JIT to inline TLS base access into the actual callsites - will be a separate PR.

@ghost
Copy link

ghost commented Jun 5, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

In progress:

  • remove GetSingleTypeManager
  • remove RhGetInlinedThreadStaticStorage
  • move tls_InlinedThreadStatics from c++ to be emitted by ILC
    • ELF
    • COFF - surprisingly nontrivial. (there is a hack that works for unknown reasons, but need to do this right)
    • Mach-O - any docs?
  • remove RhpGetInlinedThreadStaticBase helpers and inline TLS base access into the thunks
    temporarily - just to prototype/proof/test that it works.
    • ELF
    • COFF
  • instruct JIT to inline TLS base access into the actual callsites
Author: VSadov
Assignees: VSadov
Labels:

area-NativeAOT-coreclr

Milestone: -

@filipnavara
Copy link
Member

  • Mach-O - any docs on how to emit TLS variables?

https://www.jakubkonka.com/2021/01/21/llvm-tls-apple-silicon.html

The tricky part is that there's one thing emitted in the .o file and the pattern is then recognized by ld64 linker and rewritten. If done incorrectly you get a garbage. I may have a look at it eventually if you skip over that part.

@VSadov
Copy link
Member Author

VSadov commented Jun 6, 2023

The tricky part is that there's one thing emitted in the .o file and the pattern is then recognized by ld64 linker and rewritten

Yes. TLS variables must have some degree of rewrite. What is stored in the object/executable file will need to be picked up by the linker/loader and become a "template" that is used at thread start (or first access) to initialize the actual TLS storage.

The patterns are wildly different:

  • The ELF is the simplest. Anything in .tdata and .tbss will end up in TLS.
  • Windows seems to require laying out the actual TLS template somewhere in the binary, location does not matter much, just needs to be contiguous. What makes it TLS is referencing via _tls_used, _tls_start, _tls_end, etc..
    However that is done by the linker. A compilation unit just needs to put the variables into .tls$ section.
  • Mach-O - ???
    If it is just placing the variables into __thread_vars and __thread_bss, it would not be a lot different from ELF.

@filipnavara
Copy link
Member

filipnavara commented Jun 6, 2023

  • If it is just placing the variables into __thread_vars and __thread_bss, it would not be a lot different from ELF.

IIRC yes, that's one part of it. The other part is that linker must recognize the code accessing it as code section (should already be the case for anything emitted in the managed code section) and it must use exact code pattern, not just correct relocation type.

{
ISymbolNode getInlinedThreadStaticBaseSlow = factory.HelperEntrypoint(HelperEntrypoint.GetInlinedThreadStaticBaseSlow);
ISymbolNode tlsRoot = factory.TlsRoot;
bool singleFileExe = factory.CompilationModuleGroup.IsSingleFileCompilation;
Copy link
Member Author

@VSadov VSadov Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can emit more compact code when we know it is going to be a single exe. This is particularly useful on Windows, since there are no linker relaxations, but is mildly advantageous on Linux as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this assuming that _tls_index is going to be zero when IsSingleFileCompilation is set? I do not think it is safe assumption to make. IsSingleFileCompilation does not guarantee that the output is .exe.

I am not sure whether this is a safe assumption to make even for .exes - is that documented somewhere or have you seen any Windows C++ compiler doing this optimization?

Copy link
Member Author

@VSadov VSadov Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_tls_index would be 0 in the most common case that we support right now where the TLS variable is used from the same module that defines it and it is known at link time. I am not sure if IsSingleFileCompilation guarantees that, but it would be useful if we could know if we have such case as inlining shorter code is attractive.
Overall - Since in such cases _tls_index will be 0 constant at link time, I think this is a minor optimization since it mostly just makes the code shorter. It does not reduce the number of memory accesses or indirection. It is not a big deal if we can't do this.

C++ compiler does this optimization when /GA switch is used (optimize for application), but it is also known that it may cause trouble in rare cases:

It’s worth noting that in the (extremely rare) case that a .exe exports functions and is imported by another .exe, this optimization will cause random corruption if the imported .exe happens to use __declspec(thread).

Source: http://www.nynaeve.net/?p=185

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know about /GA switch. IsSingleFileCompilation does not have the right semantics for this. We would need a new compiler switch that says we are building an application.

Copy link
Member Author

@VSadov VSadov Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same can be said about Linux case. GD form is the safest to emit and linker will relax it to LS form if that is ok.

Emitting an Initial Exec model is mostly size saving, but it could be attractive, considering it is a very common case for us and having smaller code is nice when inlining.
The applicability of the model is defined slightly differently than on Windows, but I think it means roughly the same - not dynamically loaded and not cross-module accessed.

A more restrictive optimization is usable if the variables accessed are known to be in one of the modules available at program start and if the programmer selects to use the static access model.

It is a good option if we can do this, but emitting GD form is not that much worse to be too worried about.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this optimization can be enabled on both Windows and Unix, via a new compiler switch.

Copy link
Member Author

@VSadov VSadov Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put things in the perspective - for Linux the Initial Exec sequence is the same 16 bytes as GD, but does not need an additional mov rdi, rax after that as there is more flexibility where the result would be if you do not imply a possible method call.

It saves some, but not a huge saving overall.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsSingleFileCompilation does not guarantee that the output is .exe.

Just confirmed that this is correct. IsSingleFileCompilation is not enough and can break easily on Windows in a case of dynamically loaded .dlls. We really need a switch analogous to /GA in order to do Initial Exec optimizations.

It does not seem to break on Linux, but the docs say it must be "modules available at program start", so it could be just luck or some limited tolerance to abuse. We still need a switch.

Copy link
Member Author

@VSadov VSadov Jun 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have short-circuited the "Initial Exec" optimizations, but left the code in place, assuming we will implement a compiler switch similar to /GA.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLS Relocations are more complex on linux-arm64 for the dynamic case while LE case is still quite simple. It makes the support for switch like \GA / -fno-pic -fpie a lot more interesting.

@VSadov
Copy link
Member Author

VSadov commented Jun 14, 2023

I think this is ready for a review.

@MichalStrehovsky
Copy link
Member

Do we still need the global pre-pass to give indices to things? The C++ compiler doesn't need a global optimization pass to access thread locals.

@VSadov
Copy link
Member Author

VSadov commented Jun 14, 2023

Do we still need the global pre-pass to give indices to things?

That was needed because compiler is concurrent and we may need to emit field acceses before we know if they are inlined in the TLS block and in what position. I think this reason is still there.

The C++ compiler doesn't need a global optimization pass to access thread locals

Likely because every c++ threadstatic variable can be referenced symbolically.

We could have such model if we had a TLS root per-type, not per-module. There was a concern that such approach would not scale.
#84566 (comment)

For example we would need to walk the list of such roots when reporting to GC. If you have 100 threadstatics, you might end up with a 100 items linked list to walk per each thread per-GC. (twice if there is a background marking pass, three times if need to update pointers due to compaction)
Reporting one block per thread would be cheaper. In particular if most these threadstatics are not objects, then no further tracing is required on the GC side other than marking the containing block as alive.

@@ -101,7 +101,7 @@ public static GCPointerMap FromStaticLayout(MetadataType type)
return builder.ToGCMap();
}

private static void MapThreadStaticsForType(GCPointerMapBuilder builder, MetadataType type, int baseOffset)
private static void MapThreadStaticsForType(ref GCPointerMapBuilder builder, MetadataType type, int baseOffset)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ref part is the follow up change that somehow missed the previous PR.
#84566 (comment)

@MichalStrehovsky
Copy link
Member

Likely because every c++ threadstatic variable can be referenced symbolically.

I don't think there's a reason to allocate threadstatics that don't have GC references on the GC heap other than ".NET Native did it this way and that's what we inherited". If we wanted threadstatics to be as fast as C++, we'd update this

protected override bool ComputeHasGCStaticBase(FieldDesc field)
{
Debug.Assert(field.IsStatic);
if (field.IsThreadStatic)
return true;
TypeDesc fieldType = field.FieldType;
if (fieldType.IsValueType)
return ((DefType)fieldType).ContainsGCPointers;
else
return fieldType.IsGCPointer;
}
to not place them on the GC static base in the type system and emit them like C++ compiler would.

In particular if most these threadstatics are not objects, then no further tracing is required on the GC side other than marking the containing block as alive.

IIRC this is how threadstatics worked before. We could group the individual roots together using linker magic (emitting things into .foo$i section and emitting something else into .foo$a and .foo$z section emits everything into .foo section, but linker will sort things alphabetically - so if one refers to something in .foo$a, one is guaranteed that it's going to be before everything in .foo$i). I think we could make this group all TLS symbols that refer to things on the GC heap into a single block that we could address as a single block.

@VSadov
Copy link
Member Author

VSadov commented Jun 15, 2023

I don't think there's a reason to allocate threadstatics that don't have GC references on the GC heap

I think we have already touched the idea of placing unmanaged threadstatics on a separate plan from managed ones. It is technically possible and the current scheme does not close the door to that, but it is unlikely we will have time for that before 9.0

For managed threadstatics the current scheme is nearly optimal.
Since roots live in native memory, they are not served by GC barriers/writewatch and cannot participate in incremental marking, thus must be reported during STW pauses. The time is more precious when user's code does not run. Thus lots of roots will at some point hurt and the fewer roots we have to report the better. One root per individually compiled module is the minimum that we can have without introducing indirection and indexing by module on every threadstatic access.

Whether linker tricks can be used to predict/guarantee the location of the fields within the managed storage block and replace the need for pre-scanning - I do not know. I think this can also be left as a future improvement.

@MichalStrehovsky
Copy link
Member

current scheme does not close the door to that, but it is unlikely we will have time for that before 9.0

The question is whether this is the thing we want to teach the JIT to inline. I believe the reason for this work was that we need this to be as efficient as possible before JIT starts inlining this because it will be a lot more difficult to change later. This is still not as efficient as C++. I still think the global prepass is a wrong direction.

For managed threadstatics the current scheme is nearly optimal.

I think managed threadstatics could work the same way as normal GC statics - normal GC statics get a pointer to the base in the .data segment. We dereference this pointer to get to the actual data on the GC heap (this data is per type, so no global prepass). But we do not scan each type's static base individually when GC runs. Instead we have a single spine that we report to the GC as an array that is like any other - we could have the same spine for threadstatics.

I think we could have a similar scheme for threadstatics where the pointer to the type's base is in the .tls region (instead of .data) and gets accessed similarly (an indirection to get to the object on the GC heap that represents this types' base).

@VSadov
Copy link
Member Author

VSadov commented Jun 15, 2023

I think managed threadstatics could work the same way as normal GC statics

Normal statics are much more numerous than thread local statics, so the motivation of allocating storage per type is higher.
On the other hand regular statics are very static. There is only one copy of storage per-type with convenient points when to initialize the storage and the spine with trivial failure modes. Since the lifetime of regular statics is so simple, you can reference indirectly via handles, while not worrying about deallocation/leaks.

The only "convenient" part about threadlocal variables that there are typically a lot less of them in a program than regular statics, but there could be easily hundreds of threads with separate instances. There is no nice point to allocate the support upfront - the thread attaching to runtime can't handle allocation failures. Deallocation could also get tricky.

Once you start adjusting the design to these details (lets avoid handles and point to bases directly, but we can merge the bases in one block to have fewer roots, . . .), it may end up not far from the approach in this PR.

I do not think there is a shortage of ideas of how to deal with TLS.
However, once trying to prototype, I think the tradeoffs and implementation details will likely push most ideas towards the design similar to what we have here.

@VSadov
Copy link
Member Author

VSadov commented Jun 22, 2023

ObjectWriter changes are in. We are passing tests now!!!

@VSadov
Copy link
Member Author

VSadov commented Jun 22, 2023

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@VSadov
Copy link
Member Author

VSadov commented Jun 22, 2023

extra platforms run has bunch of failures like

 Unhandled exception. Internal.TypeSystem.TypeSystemException+FileNotFoundException: Failed to load assembly 'SQLitePCLRaw.core'

It does not look related. (if TLS was broken, we'd not get this far)

Also seems vaguely familiar. I think this happened before.

internal static unsafe object GetInlinedThreadStaticBaseSlow(ref object? threadStorage)
{
Debug.Assert(threadStorage == null);
// Allocate an object that will represent a memory block for all thread static fields
TypeManagerHandle typeManager = RuntimeImports.RhGetSingleTypeManager();
TypeManagerHandle typeManager = (new object()).GetMethodTable()->TypeManager;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
TypeManagerHandle typeManager = (new object()).GetMethodTable()->TypeManager;
TypeManagerHandle typeManager = EETypePtr.EETypePtrOf<object>().ToPointer()->TypeManager;

?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, I have noticed that we have unused broken MethodTableOf<T> - you can delete it while you are on it.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there might be a way to just fetch the method table of the type, but did not want to involve anything reflection-like by accident. This code will run once per thread, so allocating one object seemed both simple and affordable.
I did not know about EETypePtrOf.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

Comment on lines +304 to +305
// mov rax, qword ptr[rdi]
encoder.Builder.EmitBytes(new byte[] { 0x48, 0x8B, 0x07 });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can express a lot of these with the existing EmitMOV methods (not going to comment on all).

Copy link
Member Author

@VSadov VSadov Jun 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These patterns are fragile. We want to be sure we emit exactly these bytes.

When we emit instructions with zero offsets we may have more than one way to encode them. That is especially true on arm64 - if you have a zero offset, there could be a different instruction format that just does not have an offset. I had a bit of trouble with this in regular assembly in #85689 - masm tries to be helpful and picks a nicer instruction, but it does not work, since having a specific encoding is important for the relocation to be applicable.
And some instructions are not expressible with EmitXXX - like on Linux we need to insert some junk prefixes that make no semantical difference, but ensure the size of the code is big enough to be patched.

In many cases I ended up looking at what c++ compiler emits and copying the exact instruction bytes, to be sure.
EmitBytes was an easier API to work with than trying to figure if it is possible and how to do the same using EmitMOV and the like and be sure the emit is exactly as needed.
Perhaps in a few places we can use EmitMOV. I do not think it is worth finding and switching them.

Also this codegen will eventually (and hopefully soon) move into JIT and will be using whatever JIT uses to emit code anyways.

// IsSingleFileCompilation is not enough to guarantee that we can use "Initial Executable" optimizations.
// we need a special compiler flag analogous to /GA. Just assume "false" for now.
// bool singleFileExe = factory.CompilationModuleGroup.IsSingleFileCompilation;
bool singleFileExe = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we already know isSingleFileExe is not what we want, can we call this what this is? (Same for ARM64)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be changed anyways once the "optimize for application" is implemented. I assume we will be doing that.

But I can change this. Would isInitialExecutable look better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would name it after the expected optimization (canAssumeZeroOffset or whatever). singleFileExe is unlikely to be the thing we end up keying on because I don't expect we'd enable it whenever we build a single file exe (I don't see this option enabled by default for EXE projects in VC++ VS projects and I don't even see a UI option to enable it in VS settings in general besides typing /GA manually - both of those feel like red flags we'd need to understand).

Copy link
Member Author

@VSadov VSadov Jun 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about Windows. Perhaps the difference on win-x64 is not big enough and building an ".exe" is not common enough. On win-arm64 there would be more difference, but perhaps they chose to be consistent with x64, or just did not think about the flag.

Both GCC and Clang assume single executable mode by default (-fpie). Dynamic TLS mode is basically a de-optimization that happens when building an .so (i.e. -fpic is used).
The differences are more significant than on Windows. See:

I do not think the flag should describe the difference in the codegen. because what happens is specific to OS and architecture.
The flag should simply specify what user intends to produce - an executable or a dynamic library.

{
public void AppendMangledName(NameMangler nameMangler, Utf8StringBuilder sb)
{
// tls_InlinedThreadStatics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment useful?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not any more. I will remove it.

@VSadov
Copy link
Member Author

VSadov commented Jun 23, 2023

Thanks!!!

@VSadov VSadov merged commit 0f71880 into dotnet:main Jun 23, 2023
@VSadov VSadov deleted the thrSt branch June 23, 2023 16:59
@VSadov
Copy link
Member Author

VSadov commented Jun 28, 2023

CC @kunalspathak - I think this is the state from which JIT work can happen.

@NCLnclNCL
Copy link

Ok

@ghost ghost locked as resolved and limited conversation to collaborators Aug 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants