cpuinfo enhancements #11816

Closed
wants to merge 21 commits into from
Conversation

@awr1 (Contributor) commented Jul 22, 2019

Work in progress. Putting more meat on the bones of cpuinfo.nim (because people will find this sort of thing useful).

So far:

  • CPU Feature tester
  • x86 CPU Feature support
  • ARM CPU Feature support (note to self: for Windows on ARM, see this) (reserve for another PR)
  • Cache info (reserve for another PR)
  • Processor name

@mratsim (Collaborator) commented Jul 23, 2019

Nice to see a start on this; I also had such plans, as I rely a lot on CPU feature detection. I see, however, two glaring issues at the moment.

Exposing the feature enum

I think the enum is a bad approach for the public API: when new instructions come out (say bfloat16, which was announced yesterday), the enum changes and all code must be updated for the latest Nim version.

Worse, you can't write code that compiles on both an old and a new Nim: on one version an else branch gives you "all cases are covered", and on the other, omitting it gives you "uncovered cases".

Using procs avoids these compatibility issues when switching between Nim versions.
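As a rough, hypothetical sketch of that point (none of these names are from the PR; the bodies are placeholders), a proc-per-feature API keeps old callers compiling when a new feature proc is added later:

```nim
# Hypothetical proc-per-feature API. A future hasBfloat16() can be added
# without touching existing symbols, so code written against an older
# stdlib keeps compiling unchanged on a newer one.
proc hasAvx*(): bool = true    # placeholder result, not real detection
proc hasAvx2*(): bool = false  # placeholder result, not real detection

if hasAvx():
  echo "take the AVX code path"
else:
  echo "take the scalar code path"
```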

Repeated checks / no caching

The detected features are not cached. This is especially costly because iterating over a set requires scanning the whole enum range:

See how the set iterator is defined: it checks inclusion over the whole enum range (which might get as big as 30~50 features)

iterator items*[T](a: set[T]): T {.inline.} =
  ## Iterates over each element of `a`. `items` iterates only over the
  ## elements that are really in the set (and not over the ones the set is
  ## able to hold).
  var i = low(T).int
  while i <= high(T).int:
    if T(i) in a: yield T(i)
    inc(i)

and the code generated to check set inclusion.

Nim/compiler/ccgexprs.nim

Lines 1754 to 1791 in 8c93c69

proc genInOp(p: BProc, e: PNode, d: var TLoc) =
  var a, b, x, y: TLoc
  if (e.sons[1].kind == nkCurly) and fewCmps(p.config, e.sons[1]):
    # a set constructor but not a constant set:
    # do not emit the set, but generate a bunch of comparisons; and if we do
    # so, we skip the unnecessary range check: This is a semantical extension
    # that code now relies on. :-/ XXX
    let ea = if e.sons[2].kind in {nkChckRange, nkChckRange64}:
               e.sons[2].sons[0]
             else:
               e.sons[2]
    initLocExpr(p, ea, a)
    initLoc(b, locExpr, e, OnUnknown)
    var length = sonsLen(e.sons[1])
    if length > 0:
      b.r = rope("(")
      for i in 0 ..< length:
        let it = e.sons[1].sons[i]
        if it.kind == nkRange:
          initLocExpr(p, it.sons[0], x)
          initLocExpr(p, it.sons[1], y)
          addf(b.r, "$1 >= $2 && $1 <= $3",
               [rdCharLoc(a), rdCharLoc(x), rdCharLoc(y)])
        else:
          initLocExpr(p, it, x)
          addf(b.r, "$1 == $2", [rdCharLoc(a), rdCharLoc(x)])
        if i < length - 1: add(b.r, " || ")
      add(b.r, ")")
    else:
      # handle the case of an empty set
      b.r = rope("0")
    putIntoDest(p, d, e, b.r)
  else:
    assert(e.sons[1].typ != nil)
    assert(e.sons[2].typ != nil)
    initLocExpr(p, e.sons[1], a)
    initLocExpr(p, e.sons[2], b)
    genInExprAux(p, e, a, b, d)

My current feature detection solution

This is what I'm using at the moment: https://github.com/numforge/laser/blob/master/laser/cpuinfo.nim
wrapping https://github.com/pytorch/cpuinfo.

It's probably the most comprehensive hardware detection library (just missing TLB cache size).

You can test it by cloning the repo, initializing the submodules, and running nim c -r examples/ex01_cpuinfo.nim (https://github.com/numforge/laser/blob/master/examples/ex01_cpuinfo.nim)

ArmFeature* {.pure.} = enum
  Neon, Vfp

type CpuFeature* = (when onX86: X86Feature elif onArm: ArmFeature)
@mratsim (Collaborator):

This shouldn't be public; we should have a "CPU" object with a proc defined for each feature to check, to avoid backward-compatibility issues when a new feature to detect is introduced/supported.

@awr1 (Contributor, Author) Jul 23, 2019

Is an object really necessary? Could we just compute every possible feature at program startup when cpuinfo is imported and cache the result globally?

@mratsim (Collaborator) Jul 23, 2019

I meant a private object, not exposed to the public, but it's not necessary for it to be an object either.

Procs could also alternatively use the following scheme:

proc cpuHasAVX(): bool =
  let avxCache {.global.} = determineAVX()
  return avxCache

And it will compute that only once. I'm not sure what happens if it's called in a parallel loop, though, so a simple global is probably better:

let avxCache = determineAVX()

proc cpuHasAVX(): bool =
  return avxCache

This way it's called only on module import.

Or alternatively

type CpuFeatures = object
  hasSSE3: bool
  hasSSE4_1: bool
  hasAVX: bool
  hasAVX2: bool
  hasAVX512: bool

# initialized on module import
let cacheCpuFeatures = determineCpuFeatures()

proc hasAVX*(): bool =
  return cacheCpuFeatures.hasAVX

This is how it's done in Facebook's cpuinfo: https://github.com/pytorch/cpuinfo/blob/40c5f3695b053e5c3d642d9bc34113f3baa71ef2/include/cpuinfo.h#L579-L682.
Nim does have sets, but iterating through a native set is costly for sets this big, as I said in the review.
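Putting the pieces together, a minimal self-contained sketch of this detect-once pattern might look like the following (determineCpuFeatures is a stub here; a real implementation would fill the fields from CPUID):

```nim
type CpuFeatures = object
  hasSSE3, hasAVX: bool

proc determineCpuFeatures(): CpuFeatures =
  ## Stub: a real implementation would decode CPUID leaves here.
  CpuFeatures(hasSSE3: true, hasAVX: true)

# Runs exactly once, on module import; later queries are plain field reads.
let cacheCpuFeatures = determineCpuFeatures()

proc hasSSE3*(): bool {.inline.} = cacheCpuFeatures.hasSSE3
proc hasAVX*(): bool {.inline.} = cacheCpuFeatures.hasAVX

echo hasSSE3()  # no re-detection; reads the cached global
```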

leaf1 = cpuidX86(leaf = 1)
leaf7 = cpuidX86(leaf = 7)
leaf8 = cpuidX86(leaf = 0x80000001'i32)
for feature in features:
@mratsim (Collaborator):

This should be populated and cached on import cpuinfo so that it's done only once for the lifetime of the program.

Applications and libraries that benefit from CPU feature detection target high performance, and repeatedly recomputing something that is invariant at runtime is performance left on the table.

@awr1 (Contributor, Author) Jul 23, 2019

Generally I was expecting that the application would hang onto this result (and possibly manipulate it) instead of calling the proc everywhere, but you're right; I'll clean this up.

@mratsim (Collaborator) left a comment

The changes so far look good to me.

In terms of code organization, I would probably put everything under the when onX86 or defined(nimdoc): test in a separate cpuinfo_x86.nim file, with include ./cpuinfo_x86 in the main file.

This is because it's a long section, and it's harder to track indentation on GitHub when 800+ lines are indented, i.e. "am I in a proc, and so is this a local let, or at the top level of the when branch, and so a global let?"

I'm not sure about the path (should we use a cpuinfo directory, or put cpuinfo_x86 in the same folder as cpuinfo?), so that should be confirmed by someone else.

@mratsim (Collaborator) commented Jul 27, 2019

Looks good so far.

@awr1 (Contributor, Author) commented Jul 27, 2019

I was thinking about how to design the API further and hit an area of confusion, mainly due to heterogeneous CPU configurations (ARM's big.LITTLE comes to mind; I don't know of any x86 heterogeneous configurations, but it's certainly possible in the future). cpuName, etc. should probably be procs with an index argument; in the cpuinfo C library you showed me, cpuinfo_get_package reflects this.

In heterogeneous CPU systems, it should be okay to keep the CPU ISA features as variables rather than procedures: if I understand correctly, big.LITTLE requires that all CPUs speak the same ISA (extensions and all), deferring to the OS to shift resource-intensive threads to the heavy-duty cores.

IBM's Cell for the PlayStation 3 is a notable exception: the ISA of the SPU cores was not fully compatible with the "guardian" PPU core. But I've seen code for the Cell (it's not pthreads-like at all), and it's a complete mess. While I think heterogeneous systems will become more popular in the future, I don't think history will exactly repeat the messiness of the Cell, so using global variables saying "all CPUs have XYZ features" rather than thread-local procedures should be okay.

@mratsim (Collaborator) commented Jul 29, 2019

We can implement the global for now with no argument, but instead of making it public, we have a wrapper proc that just does:

let cachehasAvx512f = testX86Feature(Avx512f)
proc hasAvx512f*(): bool {.inline.} = cachehasAvx512f

Assuming in the future we want to check a specific core ID, we can extend it with:

proc hasAvx512f*(core_id: Natural): bool {.inline.} =
  discard # Logic to check a specific core

Assuming we want a switch between "check if any core supports our instruction set" (current default) and "check if the current core supports our instruction set", we can have a getCurrentCoreId() proc that, combined with the previous overload, gives what we need:

var instrAvailable {.threadvar.}: bool  # set per thread to hasAvx512f(getCurrentCoreId())

Or use a special meaning for a certain value (or an Either type):

proc hasAvx512f*(core_id = -1): bool {.inline.} =
  if core_id == -1:
    return cachehasAvx512f
  else:
    discard # Logic to check a specific core

We can modify the original with a default param (at the price of an if/else branch):

proc hasAvx512f*(test_only_current_core = false): bool {.inline.} =
  if test_only_current_core:
    discard # dispatch to a specific function
  else:
    return cachehasAvx512f

Or just create another proc:

proc currentCoreHasAvx512f*(): bool {.inline.} =
  discard

My take

I think having different procs for:

  • check all cores, return true if at least one core supports the target feature
  • check a target core, return true if it supports the target feature
  • check the running core, return true if it supports the target feature

is fine. So we can focus on the first usage for now.

I also think that public procs are better than public globals because it's easier to extend them without breakage in the future, and they can be inline for performance.

Having different procs (or overloads) for the 3 cases also avoids an extra level of if branches (even if predictable).
I say extra because there is almost always a:

if hasAVX512f():   # most capable first, so the later branches are reachable
  discard
elif hasAVX2():
  discard
elif hasAVX():
  discard
elif hasSSE41():
  discard
elif hasSSE2():
  discard

@awr1 (Contributor, Author) commented Aug 27, 2019

Updated the API. I could "add tests", but I don't think that's really a good idea, since this is all about variable hardware. That said, I did test this on my own hardware across two different computers (an i5-6300U and an i5-4690K) and it appears to work correctly.

Once CI passes, I think this should be okay to merge. I can add ARM support and cache info in another PR.

@awr1 awr1 changed the title [WIP] cpuinfo enhancements cpuinfo enhancements Aug 28, 2019
@mratsim (Collaborator) commented Aug 29, 2019

@arnetheduck (Contributor):
Seems like the kind of thing that could easily live in a library, so that you don't have to lock yourself to a particular nim version to gain access to new cpu features.

@awr1 (Contributor, Author) commented Sep 3, 2019

I mean, D found it necessary to put this in the standard library. Compare that to Rust, which opted not to have a CPUID library and yet has three different cpuid crates for some reason; never mind that its compiler makes those crates mostly unnecessary, because it allows runtime CPU feature branches via a special attribute. Also, when I eventually add support for cache detection, it will probably be used elsewhere in the stdlib; it's also possible that the stdlib ends up making direct use of ISA features and would thereby need to guard them.

The original reason I wrote this was because I saw an example from @mratsim experimenting with a new multithreading system for Nim and I noticed it was based on an assumed cache-line size of 64 bytes, which I knew was not a completely valid assumption for all hardware.

It seems weird that someone would want to use newer extensions on older versions of Nim, given that new instructions are sometimes not quite so airtight themselves (for instance, Intel TSX and SGX), much as using newer versions of languages opens up potential problems.

Also, we have backports. Plus, if someone desperately needed a new extension, it would not be very difficult to look up the extension's CPUID bit and use patchFiles() or something.
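For what it's worth, once you have the raw register values, the "look up the CPUID bit" workaround is mostly bit twiddling; a hypothetical sketch (the register value below is invented, not a real CPUID reading):

```nim
# Per Intel's documentation, AVX support is reported in ECX bit 28 of
# CPUID leaf 1. The value below is made up for illustration only.
proc hasBit(reg: uint32, bit: range[0..31]): bool =
  ((reg shr bit) and 1'u32) == 1

let leaf1Ecx = 0x1C000000'u32   # invented sample with bits 26..28 set
echo hasBit(leaf1Ecx, 28)       # on real data this would indicate AVX
```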

@mratsim (Collaborator) commented Sep 3, 2019

I actually have a library here: https://github.com/numforge/laser/blob/master/laser/cpuinfo.nim; it's just that it's a wrapper around a C99 library, so I had issues when compiling to C++ ...

Anyway, I like having those in the standard library.

CacheLineSize is a bad example, though: it's used for padding in data structures, so I need it at compile time (but it can be configured with {.intdefine.}, and I can add a runtime assert against what cpuinfo returns).

@Araq (Member) commented Sep 21, 2019

I welcome the effort but this should be a Nimble package for now, IMO. There is little reason to tie it to any specific Nim version.

@Araq (Member) commented Apr 20, 2020

Ping @awr1 this should be a Nimble package.

@awr1 (Contributor, Author) commented Apr 22, 2020

https://github.com/awr1/cpuwhat

Will add a PR to the Nimble Repo later.

@Araq (Member) commented Apr 24, 2020

Awesome, thanks! @narimiran add this to "important packages" please.

@Araq Araq closed this Apr 24, 2020
@narimiran (Member):

> @narimiran add this to "important packages" please.

I went to do it, but there are no tests in that package. Should we just check if source/cpuwhat.nim compiles? (cc @awr1)

@awr1 (Contributor, Author) commented Apr 26, 2020

Possibly. That library is difficult to write tests for (for the previously mentioned reasons).
