cpuinfo enhancements #11816
Conversation
Nice to see work starting on this; I had similar plans, as I rely heavily on CPU feature detection. I see two glaring issues at the moment, however.

**Exposing the feature enum**

I think the enum is a bad approach for the public API: when new instructions come out, the enum has to grow, and client code breaks. Worse, you can't switch between old Nim and new Nim, because an exhaustive `case` over the enum compiles on one version but reports "uncovered cases" on the other, while adding an `else` branch makes the other version complain that "all cases are covered". Using procs avoids compatibility issues when switching between Nim versions.

**Repeated checks / no caching**

The detected features are not cached. This is especially costly because looping over a set requires looping over the whole enum range: see how the set iterator is defined, it checks inclusion over the whole enum range (which might grow to 30~50 features):

Lines 33 to 40 in 8c93c69

and the code generated to check set inclusion:

Lines 1754 to 1791 in 8c93c69
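A minimal sketch of the enum-compatibility problem described above (the enum and proc names are invented for illustration, not part of the PR):

```nim
# Suppose version 1 of the stdlib ships exactly these features.
type CpuFeature = enum
  Sse2, Avx

proc describe(f: CpuFeature): string =
  case f          # exhaustive `case`: compiles fine today...
  of Sse2: "SSE2"
  of Avx: "AVX"
  # ...but if a later stdlib release adds `Avx512` to the enum, this
  # `case` stops compiling ("uncovered cases"), while adding an `else`
  # branch makes the older compiler warn that all cases are covered.
  # Plain query procs sidestep this trap entirely.
```

This is why hiding the enum behind procs makes the API forward-compatible.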
**My current feature detection solution**

This is what I'm using at the moment: https://github.com/numforge/laser/blob/master/laser/cpuinfo.nim. It's probably the most comprehensive hardware detection library (just missing TLB cache size). You can test it by cloning the repo and initializing the submodules.
lib/pure/concurrency/cpuinfo.nim (Outdated)

```nim
ArmFeature* {.pure.} = enum
  Neon, Vfp

type CpuFeature* = (when onX86: X86Feature elif onArm: ArmFeature)
```
This shouldn't be public. We should have a "CPU" object with a proc defined for each feature to check, to avoid backward-compatibility issues when support for a new detectable feature is introduced.
Is an object really necessary? Could we just calculate every possible feature at program startup, when cpuinfo is imported, and cache the result globally?
I meant a private object, not exposed to the public, but it's not necessary for it to be an object either.
Procs could also alternatively use the following scheme:
```nim
proc cpuHasAVX(): bool =
  let avxCache {.global.} = determineAVX()
  return avxCache
```
And it will compute that only once. I'm not sure what happens if it's called in a parallel loop, though, so a simple global is probably better:
```nim
let avxCache = determineAVX()

proc cpuHasAVX(): bool =
  return avxCache
```
This way it's called only on module import.
Or alternatively
```nim
type CpuFeatures = object
  hasSSE3: bool
  hasSSE4_1: bool
  hasAVX: bool
  hasAVX2: bool
  hasAVX512: bool

# initialized on module import
let cacheCpuFeatures = determineCpuFeatures()

proc hasAVX*(): bool =
  return cacheCpuFeatures.hasAVX
```
This is how it's done in Facebook's cpuinfo: https://github.com/pytorch/cpuinfo/blob/40c5f3695b053e5c3d642d9bc34113f3baa71ef2/include/cpuinfo.h#L579-L682.
Nim does have sets, but iterating through a native set is costly for enums this large, as I said in the review.
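A minimal sketch of the cost being described (enum names invented): iterating a built-in `set` walks the entire enum range and tests membership for every value, not just the members actually present.

```nim
type Feature = enum
  FeatA, FeatB, FeatC, FeatD

let detected = {FeatA, FeatD}
var seen: seq[Feature]
# The set iterator expands to roughly:
#   for f in FeatA..FeatD:
#     if f in detected: ...
# so the loop body count scales with the enum size, not the set size.
for f in detected:
  seen.add f
doAssert seen == @[FeatA, FeatD]
```

With 30~50 features, every such loop pays for the whole range, which is why caching the answers once matters.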
lib/pure/concurrency/cpuinfo.nim (Outdated)

```nim
leaf1 = cpuidX86(leaf = 1)
leaf7 = cpuidX86(leaf = 7)
leaf8 = cpuidX86(leaf = 0x80000001'i32)
for feature in features:
```
This should be populated and cached on `import cpuinfo`, so that it's done only once for the lifetime of the program. Applications and libraries that benefit from CPU feature detection target high performance, and repeatedly recomputing something that is invariant at runtime is performance left on the table.
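A hedged sketch of the "populate once on import" idea (`determineCpuFeatures` is a stand-in for the PR's detection logic, not an actual cpuinfo proc):

```nim
type CpuFeature = enum
  Sse2, Avx

proc determineCpuFeatures(): set[CpuFeature] =
  # Placeholder: real code would run CPUID here.
  {Sse2}

# A module-level `let` is evaluated exactly once, when the module
# is imported, so detection never repeats at query time.
let cachedFeatures = determineCpuFeatures()

proc hasFeature(f: CpuFeature): bool {.inline.} =
  f in cachedFeatures
```

Every later `hasFeature` call is then a single set-membership test against the cached value.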
Generally I was expecting that the application would hang onto this result (and possibly manipulate it) instead of calling the proc everywhere, but you're right, I'll clean this up.
…ed the CPU side channel feature indicators that no one will ever use
The changes so far look good to me.
In terms of code organization, I would probably put everything under the `when onX86 or defined(nimdoc):` test in a separate `cpuinfo_x86.nim` file, with `include ./cpuinfo_x86` in the main file.
This is because it's a long section, and it's harder to track indentation on GitHub when 800+ lines are indented, i.e. am I in a proc (so this is a local `let`), or at the top level of the `when` branch (so this is a global `let`)?

I'm not sure about the path (should we use a cpuinfo directory, or put cpuinfo_x86 in the same folder as cpuinfo?), so this should be confirmed by someone else.
Looks good so far.
I was thinking about how to design the API further and hit an area of confusion, mainly due to heterogeneous CPU configurations (ARM's big.LITTLE comes to mind; I don't know of any x86 heterogeneous configurations, but it's certainly possible in the future).

In heterogeneous CPU systems, it should be okay to keep the CPU ISA features as variables rather than procedures: if I understand correctly, big.LITTLE enforces that all CPUs speak the same ISA (extensions and all), deferring to the OS to shift resource-intensive threads to heavy-duty cores. IBM's Cell for the PlayStation 3 is a notable exception: the ISA of the SPU cores was not fully compatible with the "guardian" PPU core. But I've seen code for the Cell (it's not pthreads-like at all); it's a complete mess, and while I think heterogeneous systems will become more popular in the future, I don't think history will exactly repeat itself with the messiness of the Cell. So just using global variables saying "all CPUs have XYZ features", rather than thread-local procedures, should be okay.
We can implement the global for now with no argument, but instead of making it public, we have a wrapper proc that just does:

```nim
let cachehasAvx512f = testX86Feature(Avx512f)
proc hasAvx512f*(): bool {.inline.} = cachehasAvx512f
```

Assuming in the future we want to check a specific core ID, we can extend with:

```nim
proc hasAvx512f*(core_id: Natural): bool {.inline.} =
  discard # Logic to check a specific core
```

Assuming we want a switch between "check if any core supports our instruction set" (the current default) and "check if the current core supports our instruction set", we can have a `getCurrentCoreId()` proc that, combined with the previous overload, gives what we need.

Or use it with a special meaning for a certain value (or an Either type):

```nim
proc hasAvx512f*(core_id = -1): bool {.inline.} =
  if core_id == -1:
    return cachehasAvx512f
  else:
    discard # Logic to check a specific core
```

We can modify the original with a default param (at the price of an if/else branch):

```nim
proc hasAvx512f*(test_only_current_core = false): bool {.inline.} =
  if test_only_current_core:
    discard # dispatch to a specific function
  else:
    return cachehasAvx512f
```

Or just create another proc:

```nim
proc currentCoreHasAvx512f*(): bool {.inline.} =
  discard
```

**My take**

I think having different procs for:
I also think that public procs are better than public globals because it's easier to extend them without breakage in the future, and they can be inline for performance. Having different procs (or overloads) for the 3 cases also avoids an extra level of if branches (even if predictable).

```nim
if hasSSE2():
  discard
elif hasSSE41():
  discard
elif hasAVX():
  discard
elif hasAVX2():
  discard
elif hasAVX512f():
  discard
```
Updated API. I could "add tests", but I don't think that's a really good idea, since this is all about variable hardware. That being said, I did test this on my own hardware across two different computers (running an i5-6300U and an i5-4690K) and it does appear to work correctly. Once CI passes, I think this should be okay to merge; I can add ARM support and cache stuff in another PR.
Seems like the kind of thing that could easily live in a library, so that you don't have to lock yourself to a particular Nim version to gain access to new CPU features.
I mean, D found it necessary to put it in the standard library. Compare that to Rust, which opted not to have a CPUID library and yet has three different cpuid crates for some reason, never mind that its compiler makes using these crates mostly unnecessary, because it allows runtime CPU feature branches via a special attribute.

Also, when I eventually add support for cache detection, it will probably be used elsewhere in the stdlib; it's also possible that in the future the stdlib ends up making direct use of ISA features and thereby needs to guard them. The original reason I wrote this was that I saw an example from @mratsim experimenting with a new multithreading system for Nim, and I noticed it was based on an assumed cache-line size of 64 bytes, which I knew was not a completely valid assumption for all hardware.

It seems weird that someone would want to use newer extensions on older versions of Nim, given that new instructions are sometimes in themselves unfortunately not quite so airtight (for instance, Intel TSX and SGX), much as using newer versions of languages opens up potential problems. Also, we have backports. Plus, if someone desperately needed a new extension, it would not be very difficult to look up the extension CPUID bit and use
I actually have a library here: https://github.com/numforge/laser/blob/master/laser/cpuinfo.nim. It's just that it's a wrapper around a C99 library, so I had issues when compiling to C++ ... Anyway, I like having those in the standard library:
The CacheLineSize is a bad example, though: it's used for padding in data structures, so I need it at compile time (but it can be configured with `{.intdefine.}`, and I can add a runtime assert against what cpuinfo returns).
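A sketch of the `{.intdefine.}` scheme mentioned above (the constant and type names are hypothetical): the cache-line size stays a compile-time constant, overridable from the command line, so it can size padding in types.

```nim
# Override at build time with: nim c -d:CacheLineSize=128 ...
const CacheLineSize {.intdefine.} = 64

type PaddedCounter = object
  value: int
  # Pad the object out to exactly one (assumed) cache line,
  # e.g. to avoid false sharing between threads.
  pad: array[CacheLineSize - sizeof(int), byte]

doAssert sizeof(PaddedCounter) == CacheLineSize
# A runtime check could then compare CacheLineSize against whatever
# hardware detection reports, flagging a mismatched build.
```

The runtime assert mentioned above would simply compare this constant against the detected value at startup.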
I welcome the effort, but this should be a Nimble package for now, IMO. There is little reason to tie it to any specific Nim version.
Ping @awr1: this should be a Nimble package.
https://github.com/awr1/cpuwhat. Will add a PR to the Nimble repo later.
Awesome, thanks! @narimiran, add this to "important packages", please.
I went to do it, but there are no tests in that package. Should we just check if
Possibly. That library is difficult to make tests for (for the previously mentioned reasons).
Work in progress. Putting more meat on the bones of `cpuinfo.nim`, because people will find these sorts of things useful.

So far:
- ARM CPU feature support (note to self: for Windows on ARM, see this) (reserve for another PR)
- Cache info (reserve for another PR)