cpuinfo enhancements #11816

Closed
wants to merge 21 commits into from
Conversation

@awr1 (Contributor) commented Jul 22, 2019

Work in progress. Putting more meat on the bones of cpuinfo.nim (because people will find this sort of thing useful).

So far:

  • CPU Feature tester
  • x86 CPU Feature support
  • ARM CPU Feature support (note to self: for Windows on ARM, see this) (reserve for another PR)
  • Cache info (reserve for another PR)
  • Processor name

@mratsim (Collaborator) commented Jul 23, 2019

Nice to see a start on this; I also had such plans, as I rely a lot on CPU feature detection. I see, however, two glaring issues at the moment.

Exposing the feature enum

I think the enum is a bad approach for the public API: when new instructions come out (say bfloat16, which was announced yesterday), the enum changes and all code must be updated for the latest Nim version.

Worse, you can't write code that compiles on both an old and a new Nim: on one version an else branch gives you "all cases are covered", and on the other, omitting it gives you "uncovered cases".

Using procs avoids these compatibility issues when switching between Nim versions.
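As a rough, hypothetical sketch of that point (none of these names are from the PR; the bodies are placeholders), a proc-per-feature API keeps old callers compiling when a new feature proc is added later:

```nim
# Hypothetical proc-per-feature API. A future hasBfloat16() can be added
# without touching existing symbols, so code written against an older
# stdlib keeps compiling unchanged on a newer one.
proc hasAvx*(): bool = true    # placeholder result, not real detection
proc hasAvx2*(): bool = false  # placeholder result, not real detection

if hasAvx():
  echo "take the AVX code path"
else:
  echo "take the scalar code path"
```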

Repeated checks / no caching

The detected features are not cached. This is especially costly because iterating over a set requires scanning the whole enum range:

See how the set iterator is defined: it checks inclusion over the whole enum range (which might get as big as 30~50 features)

iterator items*[T](a: set[T]): T {.inline.} =
  ## Iterates over each element of `a`. `items` iterates only over the
  ## elements that are really in the set (and not over the ones the set is
  ## able to hold).
  var i = low(T).int
  while i <= high(T).int:
    if T(i) in a: yield T(i)
    inc(i)

and the code generated to check set inclusion.

Nim/compiler/ccgexprs.nim

Lines 1754 to 1791 in 8c93c69

proc genInOp(p: BProc, e: PNode, d: var TLoc) =
  var a, b, x, y: TLoc
  if (e.sons[1].kind == nkCurly) and fewCmps(p.config, e.sons[1]):
    # a set constructor but not a constant set:
    # do not emit the set, but generate a bunch of comparisons; and if we do
    # so, we skip the unnecessary range check: This is a semantical extension
    # that code now relies on. :-/ XXX
    let ea = if e.sons[2].kind in {nkChckRange, nkChckRange64}:
               e.sons[2].sons[0]
             else:
               e.sons[2]
    initLocExpr(p, ea, a)
    initLoc(b, locExpr, e, OnUnknown)
    var length = sonsLen(e.sons[1])
    if length > 0:
      b.r = rope("(")
      for i in 0 ..< length:
        let it = e.sons[1].sons[i]
        if it.kind == nkRange:
          initLocExpr(p, it.sons[0], x)
          initLocExpr(p, it.sons[1], y)
          addf(b.r, "$1 >= $2 && $1 <= $3",
               [rdCharLoc(a), rdCharLoc(x), rdCharLoc(y)])
        else:
          initLocExpr(p, it, x)
          addf(b.r, "$1 == $2", [rdCharLoc(a), rdCharLoc(x)])
        if i < length - 1: add(b.r, " || ")
      add(b.r, ")")
    else:
      # handle the case of an empty set
      b.r = rope("0")
    putIntoDest(p, d, e, b.r)
  else:
    assert(e.sons[1].typ != nil)
    assert(e.sons[2].typ != nil)
    initLocExpr(p, e.sons[1], a)
    initLocExpr(p, e.sons[2], b)
    genInExprAux(p, e, a, b, d)

My current feature detection solution

This is what I'm using at the moment: https://github.com/numforge/laser/blob/master/laser/cpuinfo.nim
wrapping https://github.com/pytorch/cpuinfo.

It's probably the most comprehensive hardware detection library (just missing TLB cache size).

You can test it by cloning the repo, initializing the submodules, and running nim c -r examples/ex01_cpuinfo.nim (https://github.com/numforge/laser/blob/master/examples/ex01_cpuinfo.nim)

ArmFeature* {.pure.} = enum
  Neon, Vfp

type CpuFeature* = (when onX86: X86Feature elif onArm: ArmFeature)
@mratsim (Collaborator):

This shouldn't be public; we should have a "CPU" object with a proc defined for each feature to check, to avoid backward-compatibility issues when a new feature to detect is introduced/supported.

@awr1 (Contributor, Author) Jul 23, 2019

Is an object really necessary? Could we just compute every possible feature at program startup when cpuinfo is imported and cache the result globally?

@mratsim (Collaborator) Jul 23, 2019

I meant a private object, not exposed to the public, but it's not necessary for it to be an object either.

Procs could also alternatively use the following scheme:

proc cpuHasAVX(): bool =
  let avxCache {.global.} = determineAVX()
  return avxCache

And it will compute that only once. I'm not sure what happens if it's called in a parallel loop, though, so a simple global is probably better:

let avxCache = determineAVX()

proc cpuHasAVX(): bool =
  return avxCache

This way it's called only on module import.

Or alternatively

type CpuFeatures = object
  hasSSE3: bool
  hasSSE4_1: bool
  hasAVX: bool
  hasAVX2: bool
  hasAVX512: bool

# initialized on module import
let cacheCpuFeatures = determineCpuFeatures()

proc hasAVX*(): bool =
  return cacheCpuFeatures.hasAVX

This is how it's done in Facebook's cpuinfo: https://github.com/pytorch/cpuinfo/blob/40c5f3695b053e5c3d642d9bc34113f3baa71ef2/include/cpuinfo.h#L579-L682.
Nim does have sets, but iterating through a native set is costly for sets this big, as I said in the review.
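Putting the pieces together, a minimal self-contained sketch of this detect-once pattern might look like the following (determineCpuFeatures is a stub here; a real implementation would fill the fields from CPUID):

```nim
type CpuFeatures = object
  hasSSE3, hasAVX: bool

proc determineCpuFeatures(): CpuFeatures =
  ## Stub: a real implementation would decode CPUID leaves here.
  CpuFeatures(hasSSE3: true, hasAVX: true)

# Runs exactly once, on module import; later queries are plain field reads.
let cacheCpuFeatures = determineCpuFeatures()

proc hasSSE3*(): bool {.inline.} = cacheCpuFeatures.hasSSE3
proc hasAVX*(): bool {.inline.} = cacheCpuFeatures.hasAVX

echo hasSSE3()  # no re-detection; reads the cached global
```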

leaf1 = cpuidX86(leaf = 1)
leaf7 = cpuidX86(leaf = 7)
leaf8 = cpuidX86(leaf = 0x80000001'i32)
for feature in features:
@mratsim (Collaborator):

This should be populated and cached on import cpuinfo so that it's done only once for the lifetime of the program.

Applications and libraries that benefit from CPU feature detection target high performance, and repeatedly recomputing something that is invariant at runtime is performance left on the table.

@awr1 (Contributor, Author) Jul 23, 2019

Generally I was expecting that the application would hang onto this result (and possibly manipulate it) instead of calling the proc everywhere, but you're right; I'll clean this up.

@mratsim (Collaborator) left a comment

The changes so far look good to me.

In terms of code organization, I would probably put everything under the when onX86 or defined(nimdoc): test in a separate cpuinfo_x86.nim file, with include ./cpuinfo_x86 in the main file.

This is because it's a long section, and it's harder to track indentation on GitHub when 800+ lines are indented, i.e. "am I in a proc, and so is this a local let, or at the top level of the when branch, and so a global let?"

I'm not sure about the path (should we use a cpuinfo directory, or put cpuinfo_x86 in the same folder as cpuinfo?), so that should be confirmed by someone else.

@mratsim (Collaborator) commented Jul 27, 2019

Looks good so far.

@awr1 (Contributor, Author) commented Jul 27, 2019

I was thinking about how to design the API further and hit an area of confusion, mainly due to heterogeneous CPU configurations (ARM's big.LITTLE comes to mind; I don't know of any x86 heterogeneous configurations, but it's certainly possible in the future). cpuName, etc. should probably be procs with an index argument; in the cpuinfo C library you showed me, cpuinfo_get_package reflects this.

In heterogeneous CPU systems, it should be okay to keep the CPU ISA features as variables rather than procedures: if I understand correctly, big.LITTLE requires that all CPUs speak the same ISA (extensions and all), deferring to the OS to shift resource-intensive threads to the heavy-duty cores.

IBM's Cell for the PlayStation 3 is a notable exception: the ISA of the SPU cores was not fully compatible with the "guardian" PPU core. But I've seen code for the Cell (it's not pthreads-like at all), and it's a complete mess. While I think heterogeneous systems will become more popular in the future, I don't think history will exactly repeat the messiness of the Cell, so using global variables saying "all CPUs have XYZ features" rather than thread-local procedures should be okay.

@mratsim (Collaborator) commented Jul 29, 2019

We can implement the global for now with no argument, but instead of making it public, we have a wrapper proc that just does:

let cachehasAvx512f = testX86Feature(Avx512f)
proc hasAvx512f*(): bool {.inline.} = cachehasAvx512f

Assuming in the future we want to check a specific core ID, we can extend it with:

proc hasAvx512f*(core_id: Natural): bool {.inline.} =
  discard # Logic to check a specific core

Assuming we want a switch between "check if any core supports our instruction set" (current default) and "check if the current core supports our instruction set", we can have a getCurrentCoreId() proc that, combined with the previous overload, gives what we need:

var instrAvailable {.threadvar.}: bool  # set per thread to hasAvx512f(getCurrentCoreId())

Or use a special meaning for a certain value (or an Either type):

proc hasAvx512f*(core_id = -1): bool {.inline.} =
  if core_id == -1:
    return cachehasAvx512f
  else:
    discard # Logic to check a specific core

We can modify the original with a default param (at the price of an if/else branch):

proc hasAvx512f*(test_only_current_core = false): bool {.inline.} =
  if test_only_current_core:
    discard # dispatch to a specific function
  else:
    return cachehasAvx512f

Or just create another proc:

proc currentCoreHasAvx512f*(): bool {.inline.} =
  discard

My take

I think having different procs for:

  • check all cores, return true if at least one core supports the target feature
  • check a target core, return true if it supports the target feature
  • check the running core, return true if it supports the target feature

is fine. So we can focus on the first usage for now.

I also think that public procs are better than public globals because it's easier to extend them without breakage in the future, and they can be inline for performance.

Having different procs (or overloads) for the 3 cases also avoids an extra level of if branches (even if predictable).
I say extra because there is almost always a:

if hasAVX512f():   # most capable first, so the later branches are reachable
  discard
elif hasAVX2():
  discard
elif hasAVX():
  discard
elif hasSSE41():
  discard
elif hasSSE2():
  discard

@awr1 (Contributor, Author) commented Aug 27, 2019

Updated the API. I could "add tests", but I don't think that's really a good idea, since this is all about variable hardware. That said, I did test this on my own hardware across two different computers (an i5-6300U and an i5-4690K) and it appears to work correctly.

Once CI passes, I think this should be okay to merge. I can add ARM support and cache info in another PR.

@awr1 awr1 changed the title [WIP] cpuinfo enhancements cpuinfo enhancements Aug 28, 2019
@mratsim (Collaborator) commented Aug 29, 2019

@arnetheduck (Contributor):
Seems like the kind of thing that could easily live in a library, so that you don't have to lock yourself to a particular nim version to gain access to new cpu features.

@awr1 (Contributor, Author) commented Sep 3, 2019

I mean, D found it necessary to put this in the standard library. Compare that to Rust, which opted not to have a CPUID library and yet has three different cpuid crates for some reason; never mind that its compiler makes those crates mostly unnecessary, because it allows runtime CPU feature branches via a special attribute. Also, when I eventually add support for cache detection, it will probably be used elsewhere in the stdlib; it's also possible that the stdlib ends up making direct use of ISA features and would thereby need to guard them.

The original reason I wrote this was because I saw an example from @mratsim experimenting with a new multithreading system for Nim and I noticed it was based on an assumed cache-line size of 64 bytes, which I knew was not a completely valid assumption for all hardware.

It seems weird that someone would want to use newer extensions on older versions of Nim, given that new instructions are sometimes not quite so airtight themselves (for instance, Intel TSX and SGX), much as using newer versions of languages opens up potential problems.

Also, we have backports. Plus, if someone desperately needed a new extension, it would not be very difficult to look up the extension's CPUID bit and use patchFiles() or something.
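For what it's worth, once you have the raw register values, the "look up the CPUID bit" workaround is mostly bit twiddling; a hypothetical sketch (the register value below is invented, not a real CPUID reading):

```nim
# Per Intel's documentation, AVX support is reported in ECX bit 28 of
# CPUID leaf 1. The value below is made up for illustration only.
proc hasBit(reg: uint32, bit: range[0..31]): bool =
  ((reg shr bit) and 1'u32) == 1

let leaf1Ecx = 0x1C000000'u32   # invented sample with bits 26..28 set
echo hasBit(leaf1Ecx, 28)       # on real data this would indicate AVX
```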

@mratsim (Collaborator) commented Sep 3, 2019

I actually have a library here: https://github.com/numforge/laser/blob/master/laser/cpuinfo.nim; it's just that it's a wrapper around a C99 library, so I had issues when compiling to C++ ...

Anyway, I like having those in the standard library.

CacheLineSize is a bad example, though: it's used for padding in data structures, so I need it at compile time (but it can be configured with {.intdefine.}, and I can add a runtime assert against what cpuinfo returns).

@Araq (Member) commented Sep 21, 2019

I welcome the effort but this should be a Nimble package for now, IMO. There is little reason to tie it to any specific Nim version.

@Araq (Member) commented Apr 20, 2020

Ping @awr1 this should be a Nimble package.

@awr1 (Contributor, Author) commented Apr 22, 2020

https://github.com/awr1/cpuwhat

Will add a PR to the Nimble Repo later.

@Araq (Member) commented Apr 24, 2020

Awesome, thanks! @narimiran add this to "important packages" please.

@Araq Araq closed this Apr 24, 2020
@narimiran (Member):

> @narimiran add this to "important packages" please.

I went to do it, but there are no tests in that package. Should we just check if source/cpuwhat.nim compiles? (cc @awr1)

@awr1 (Contributor, Author) commented Apr 26, 2020

Possibly. That library is difficult to write tests for (for the previously mentioned reasons).
