RFC: Optimize TLS access in generated code on Linux #17178

Merged: 1 commit merged on Jun 30, 2016

yuyichao
Contributor

This optimizes TLS access by emitting the assembly sequence directly when static TLS is detected.

With this, I can't measure any performance difference in JIT code with threading on or off anymore =)

Currently supported and tested on LLVM 3.7+ on Linux x86/x64/aarch64. ARM should be possible too, but emitting the thread pointer read there seems more complicated (we might be able to use compiler intrinsics). FreeBSD and other platforms that use the ELF format probably work too, assuming they provide dl_iterate_phdr, but I can't really test them. This uses dl_iterate_phdr, a glibc extension, but AFAIK musl provides it too (and libunwind requires it).

This could make runtime JIT code harder to share, but

  1. We already hard-code many other pointers in runtime JIT code.
  2. The code is actually more reusable than what we do now for runtime JIT code, since it can be reused when loading from the same julia-ui.
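A minimal sketch of the dl_iterate_phdr API mentioned above, assuming Linux with glibc (or musl) and an ELF target: it walks every loaded object and counts the ones that carry a PT_TLS segment. It only illustrates the API and is not the static-TLS detection logic from this PR; demo_tls_var, count_tls_cb and count_tls_modules are hypothetical names.

```c
#define _GNU_SOURCE /* dl_iterate_phdr is a GNU/ELF extension */
#include <assert.h>
#include <link.h>   /* struct dl_phdr_info, ElfW, PT_TLS */
#include <stddef.h>
#include <stdio.h>

/* Hypothetical thread-local variable so the executable itself has a PT_TLS segment. */
__thread int demo_tls_var = 42;

/* Called once per loaded ELF object (main executable, libc, ...). */
static int count_tls_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    int *count = data;
    for (ElfW(Half) i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if (ph->p_type != PT_TLS)
            continue;
        /* The main executable is reported with an empty name. */
        const char *name = info->dlpi_name[0] ? info->dlpi_name : "(main executable)";
        printf("%s: PT_TLS memsz=%lu align=%lu\n", name,
               (unsigned long)ph->p_memsz, (unsigned long)ph->p_align);
        (*count)++;
    }
    return 0; /* returning 0 keeps the iteration going */
}

/* Returns how many loaded objects have a PT_TLS segment. */
int count_tls_modules(void)
{
    int count = 0;
    int rc = dl_iterate_phdr(count_tls_cb, &count);
    assert(rc == 0); /* dl_iterate_phdr returns the last callback's return value */
    return count;
}
```

In a dynamically linked program this typically reports at least the main executable and libc, each of which defines thread-local variables.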

yuyichao added the multithreading (Base.Threads and related functionality) and performance (Must go faster) labels on Jun 29, 2016
asm("movl %%gs:0, %0" : "=r"(tp));
#elif defined(_CPU_AARCH64_)
asm("mrs %0, tpidr_el0" : "=r"(tp));
#endif
Member

Should this have an #else #error?

Member

Nvm, I see now.

@JeffBezanson
Member

Cool!!

asm_str = "movl %gs:0, $0";
# elif defined(_CPU_AARCH64_)
asm_str = "mrs $0, tpidr_el0";
# endif
Contributor

A comment about what happens on other architectures?

Contributor Author

Added to the assertion.

Member

I think what we want is a description of what it does on other architectures. Something like "For the 3 supported architectures above, load the TLS pointer from its known offset; on all others, fall back to calling jl_get_ptls_states" (I think that's how it works?)

Contributor Author

Yeah. (Although we resolve jl_get_ptls_states to the real getter in codegen.) I didn't think about this because it is already documented in

// For threading, we emit a call to the getter function.
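The fallback described in this thread could be modeled roughly as below. This is a sketch under stated assumptions, not Julia's codegen: arch_t, ptls_access_t, tp_asm_template and choose_ptls_access are hypothetical names, the x86 and AArch64 templates are the LLVM inline-asm strings quoted in the diff, and the x86-64 string is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical target description; the real code selects via Julia's _CPU_* macros. */
typedef enum { ARCH_X86, ARCH_X86_64, ARCH_AARCH64, ARCH_OTHER } arch_t;

typedef struct {
    const char *asm_template; /* LLVM inline asm that loads the thread pointer, or NULL */
    ptrdiff_t offset;         /* offset of the TLS data from the thread pointer */
    int call_getter;          /* 1: emit a call to the getter function instead */
} ptls_access_t;

/* LLVM inline-asm templates ($0 is the output operand). */
static const char *tp_asm_template(arch_t arch)
{
    switch (arch) {
    case ARCH_X86:     return "movl %gs:0, $0";
    case ARCH_X86_64:  return "movq %fs:0, $0"; /* assumed, not shown in the diff */
    case ARCH_AARCH64: return "mrs $0, tpidr_el0";
    default:           return NULL;
    }
}

/* Use thread pointer + known offset only when static TLS was detected and the
 * architecture has a template; otherwise fall back to calling the getter. */
ptls_access_t choose_ptls_access(arch_t arch, int static_tls, ptrdiff_t offset)
{
    assert(arch >= ARCH_X86 && arch <= ARCH_OTHER);
    ptls_access_t access = { NULL, 0, 1 };
    const char *tmpl = tp_asm_template(arch);
    if (static_tls && tmpl != NULL) {
        access.asm_template = tmpl;
        access.offset = offset;
        access.call_getter = 0;
    }
    return access;
}
```

Keeping the getter call as the default means an unsupported architecture or a dynamic TLS model still produces correct code, just without the fast path.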

yuyichao force-pushed the yyc/threads/elf branch 3 times, most recently from 9bc2714 to 133831b, on June 29, 2016 at 05:26
@JeffBezanson
Member

Ready to merge?

@yuyichao
Contributor Author

> Ready to merge?

It should be. The RFC is mainly about how people feel about emitting assembly directly (in cases where LLVM can't do it), so it should be ready to go as long as nobody has a problem with that.

There's (yet) another optimization I kind of got working last night that also optimizes the TLS getter call in C, but I'll probably need to experiment with it some more, and it can go in another PR.

* Detect if we are using a static TLS model
* Emit the assembly directly in codegen to access static TLS variables
@yuyichao
Contributor Author

So, assuming people are fine with inline assembly (and my next one is close to being ready), I'll merge this later today.

3 participants