Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617

Closed
wants to merge 2 commits into from

Conversation

timotheecour
Copy link
Member

@timotheecour timotheecour commented Mar 10, 2020

This PR turns system/timers include into a new stdlib module std/timers.

cpuTime can be misleading for benchmarking because it adds a lot of overhead (for various reasons, including FP operations); so you get flawed conclusions unless the workload is significantly more expensive than the cost of cpuTime, which isn't always possible without affecting the thing you're measuring.
getTicks is more adequate, giving the highest timer precision and least overhead; in practice, this is what people use outside of nim (eg QueryPerformanceCounter / mach_absolute_time etc, which is wrapped by timers)

example

on local OSX, this shows getTicks has an overhead of 16ns, vs 450ns for cpuTime, ie 27X less overhead.
getTicks also has the highest available precision on a given machine using platform specific API's (the benchmark below doesn't measure this)

# include system/timers
# import t10333b
import std/timers
include times
proc toSeconds(a: Nanos): float = a.float*1e-9
proc toSeconds(a: float): float = a
template test(fun)=
  block:
    proc main()=
      let n = 10_000_000
      type T = type(fun()-fun())
      var dt: T
      for i in 0..<n:
        let t = fun()
        # code to benchmark here (intentionally empty)
        let t2 = fun()
        dt += t2-t
      echo "\n" & astToStr(fun) & ":"
      echo dt.toSeconds
      echo (dt.toSeconds * 1e9 / n.float).formatEng
      echo (n.float / dt.toSeconds).formatEng
    main()
test(getTicks)
test(cpuTime)

prints:

getTicks:
0.16410692
16.410692
60.9358825332e6

cpuTime:
4.509458000001094
450.9458000001
2.2175614009e6

note

I added a dummy system/timers.nim with a deprecation msg; the only tested package that required it was https://krux02@bitbucket.org/krux02/tensordslnim.git
code the reliied on system/timers can avoid the warning by:

template imp = import std/timers
when compiles(imp()): imp() else: include system/timers

or (after 1.2 release) by checking on nim version

@timotheecour timotheecour changed the title new std/timers module for high performance / low overhead timers (formerly system/timers include) new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) Mar 10, 2020
@Araq
Copy link
Member

Araq commented Mar 10, 2020

We have std / monotimes now. Are you aware?

@timotheecour
Copy link
Member Author

timotheecour commented Mar 10, 2020

I do remember now, it was added fairly recently, however monotimes still adds too much overhead; what's provided in this PR offers the least possible overhead on a platform, which matters a lot for profiling/benchmarking etc

here's updated benchmark, adding in monotimes: getMonoTime has 2X the overhead of getTicks, that means any small measurement will be impacted. Depending on what you're trying to measure, that makes all the difference (it's not always possible to "increase the payload" of what you're measuring, this can yield flawed conclusions).

Looking at the implementation of getMonoTime, this is not something that can be fixed, the overhead is inherent in what getMonoTime does.

cpuTime:
("total secs", 3.924649999999624)
("ns/iter", "392.465")
("iters per sec", "2.5479979106e6")

getMonoTime:
("total secs", 0.282378408)
("ns/iter", "28.2378408")
("iters per sec", "35.4134725485e6")

getTicks:
("total secs", 0.137324019)
("ns/iter", "13.7324019")
("iters per sec", "72.8204728701e6")
# include system/timers
# import t10333b
import std/timers
import std/monotimes
import times
import strutils
proc toSeconds(a: Nanos): float = a.float*1e-9
proc toSeconds(a: float): float = a
proc toSeconds(a: Duration): float = a.nanoseconds.float*1e-9
template test(fun)=
  block:
    proc main()=
      let n = 10_000_000
      type T = type(fun()-fun())
      var dt: T
      for i in 0..<n:
        let t = fun()
        # code to benchmark here (intentionally empty)
        let t2 = fun()
        dt += t2-t
      echo "\n" & astToStr(fun) & ":"
      echo ("total secs", dt.toSeconds)
      echo ("ns/iter", (dt.toSeconds * 1e9 / n.float).formatEng)
      echo ("iters per sec", (n.float / dt.toSeconds).formatEng)
    main()
test(cpuTime)
test(getMonoTime)
test(getTicks)

TODO for subsequent PR's

  • I can, however, cross-reference properly monotimes, std/timers and cpuTime in the docs and explain pros/cons of each approach
  • as well as refactor monotimes so it imports std/timers to avoid code duplication it does

@timotheecour
Copy link
Member Author

timotheecour commented Mar 11, 2020

reviewer note: to view the diff (github and git dont' do well when file is renamed and old file is replaced), best to view locally as follows:

git diff parent_hash:./lib/system/timers.nim ./lib/std/timers.nim

@GULPF
Copy link
Member

GULPF commented Mar 11, 2020

On my machine (Ubuntu, latest develop version of Nim):

cpuTime:
("total secs", 4.451975478000229)
("ns/iter", "445.1975478")
("iters per sec", "2.2461938637e6")

getMonoTime:
("total secs", 0.18464677)
("ns/iter", "18.464677")
("iters per sec", "54.1574596729e6")

getTicks:
("total secs", 0.20437)
("ns/iter", "20.437")
("iters per sec", "48.9308606938e6")

I cannot see why monotimes would have a meaningful overhead over timers. The only overhead is the use of Duration in -, but if you really want to avoid that you can do getMonoTime().ticks and use a raw integer instead.

Looking at the implementation of getMonoTime, this is not something that can be fixed, the overhead is inherent in what getMonoTime does.

Please explain.

as well as refactor monotimes so it imports std/timers to avoid code duplication it does

Note that timers falls back to gettimeofday which is not guaranteed to be monotonic and so cannot be used in monotimes.

@mratsim
Copy link
Collaborator

mratsim commented Mar 11, 2020

I would expect getTicks() to be compiled to a memory barrier + RDTSC (or RDTSCP directly) on x86 if we want as accurate as possible clock cycles report. With the caveat of TurboBoost, CPU frequency scaling and overclocking not reporting the correct cycle count.

See what I use in Weave https://github.com/mratsim/weave/blob/v0.3.0/weave/instrumentation/timers.nim

when defined(i386) or defined(amd64):
  # From Linux
  #
  # The RDTSC instruction is not ordered relative to memory
  # access.  The Intel SDM and the AMD APM are both vague on this
  # point, but empirically an RDTSC instruction can be
  # speculatively executed before prior loads.  An RDTSC
  # immediately after an appropriate barrier appears to be
  # ordered as a normal load, that is, it provides the same
  # ordering guarantees as reading from a global memory location
  # that some other imaginary CPU is updating continuously with a
  # time stamp.
  #
  # From Intel SDM
  # https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  when not defined(vcc):
    when defined(amd64):
      proc getticks(): int64 {.inline.} =
        var lo, hi: int64
        # TODO: Provide a compile-time flag for RDTSCP support
        #       and use it instead of lfence + RDTSC
        {.emit: """asm volatile(
          "lfence\n"
          "rdtsc\n"
          : "=a"(`lo`), "=d"(`hi`)
          :
          : "memory"
        );""".}
        return (hi shl 32) or lo
    else:
      proc getticks(): int64 {.inline.} =
        # TODO: Provide a compile-time flag for RDTSCP support
        #       and use it instead of lfence + RDTSC
        {.emit: """asm volatile(
          "lfence\n"
          "rdtsc\n"
          : "=a"(`result`)
          :
          : "memory"
        );""".}
  else:
    proc rdtsc(): int64 {.sideeffect, importc: "__rdtsc", header: "<intrin.h>".}
    proc lfence() {.importc: "__mm_lfence", header: "<intrin.h>".}

    proc getticks(): int64 {.inline.} =
      lfence()
      return rdtsc()
else:
  when defined(WV_profile):
    {.error: "getticks is not supported on this CPU architecture".}

@timotheecour
Copy link
Member Author

I cannot see why monotimes would have a meaningful overhead over timers

it looks like this is platform specific; on OSX, after measuring, the 2X slowdown is caused by overhead in 2 places:

    mach_timebase_info(machAbsoluteTimeFreq)

and

result = MonoTime(ticks: ticks * machAbsoluteTimeFreq.numer div machAbsoluteTimeFreq.denom)

div is expensive.

On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.

  • avoiding 64 bit overflow can be done using 128bit arithmetics as suggested for that mach_timebase_info issue in the accepted answer (https://stackoverflow.com/a/23378064/1426932); however that's even more expensive and would definitely add overhead, especially if that's done inside getMonoTime as opposed to out of line as I'm doing in this PR
  • but, according to some other sources, on intel macs, the output of mach_timebase_info could be (1,1), meaning that the overhead of ticks: ticks * machAbsoluteTimeFreq.numer div machAbsoluteTimeFreq.denom becomes 0 in this case....

Note that timers falls back to gettimeofday which is not guaranteed to be monotonic and so cannot be used in monotimes.

some parts can be reused though (especially ugly platform specific parts that shouldn't be written twice)

I'll dig into @mratsim 's answer later

@GULPF
Copy link
Member

GULPF commented Mar 11, 2020

div is expensive

That's true, but there's no reason why monotimes cannot do like timers and delay calling div until it's necessary. Note that the ticks field in MonoTime is not exported, so it's possible to change the meaning of it to be a platform dependent value (for some platforms) without breaking anything.

The monotimes module is intended to serve the same purpose as the timers module (as it's currently written, @mratsim suggestion would be something different). It's possible that timers does something better than monotimes, but in that case I think it's better to improve the monotimes module. We shouldn't need two modules in the stdlib that are this similar.

I'm trying to figure whether mach_timebase_info is constant for a given machine or can vary during lifetime of a process, however apple docs are as uninformative as possible (see https://developer.apple.com/documentation/driverkit/3433733-mach_timebase_info?language=objc)

monotimes did previously assume that, but it was changed recently (e24443f#diff-a50dac6a52159869336dde0e09e9139b).

@mratsim
Copy link
Collaborator

mratsim commented Mar 11, 2020

  • avoiding 64 bit overflow can be done using 128bit arithmetics as suggested for that mach_timebase_info issue in the accepted answer (https://stackoverflow.com/a/23378064/1426932); however that's even more expensive and would definitely add overhead, especially if that's done inside getMonoTime as opposed to out of line as I'm doing in this PR

For this case 128-bit multiplications and divisions are actually free on x86-64, (64 x 64 / 64 => 128 / 64 => 64)
64-bit division and multiplication are implemented in hardware via div128by64 (the div and idiv instruction) and mul64x64to128 (the mul and imul instruction)

They can harnessed via the following code:

@timotheecour timotheecour changed the title new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) Mar 12, 2020
@GULPF GULPF added the Times label Mar 27, 2020
@Araq
Copy link
Member

Araq commented Apr 24, 2020

Please fix/update/patch std/monotimes instead.

@Araq Araq closed this Apr 24, 2020
@timotheecour timotheecour added the stale Staled PR/issues; remove the label after fixing them label Jun 28, 2020
@timotheecour timotheecour changed the title [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) [TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) Jul 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Staled PR/issues; remove the label after fixing them Times
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants