[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617

timotheecour · 2020-03-10T05:28:59Z

This PR turns system/timers include into a new stdlib module std/timers.

cpuTime can be misleading for benchmarking because it adds a lot of overhead (for various reasons, including FP operations); so you get flawed conclusions unless the workload is significantly more expensive than the cost of cpuTime, which isn't always possible without affecting the thing you're measuring.
getTicks is more adequate, giving the highest timer precision and least overhead; in practice, this is what people use outside of nim (eg QueryPerformanceCounter / mach_absolute_time etc, which is wrapped by timers)

example

on local OSX, this shows getTicks has an overhead of 16ns, vs 450ns for cpuTime, ie 27X less overhead.
getTicks also has the highest available precision on a given machine using platform specific API's (the benchmark below doesn't measure this)

# include system/timers
# import t10333b
import std/timers
include times
proc toSeconds(a: Nanos): float = a.float*1e-9
proc toSeconds(a: float): float = a
template test(fun)=
  block:
    proc main()=
      let n = 10_000_000
      type T = type(fun()-fun())
      var dt: T
      for i in 0..<n:
        let t = fun()
        # code to benchmark here (intentionally empty)
        let t2 = fun()
        dt += t2-t
      echo "\n" & astToStr(fun) & ":"
      echo dt.toSeconds
      echo (dt.toSeconds * 1e9 / n.float).formatEng
      echo (n.float / dt.toSeconds).formatEng
    main()
test(getTicks)
test(cpuTime)

prints:

getTicks:
0.16410692
16.410692
60.9358825332e6

cpuTime:
4.509458000001094
450.9458000001
2.2175614009e6

note

I added a dummy system/timers.nim with a deprecation msg; the only tested package that required it was https://krux02@bitbucket.org/krux02/tensordslnim.git
code the reliied on system/timers can avoid the warning by:

template imp = import std/timers
when compiles(imp()): imp() else: include system/timers

or (after 1.2 release) by checking on nim version

Araq · 2020-03-10T15:10:05Z

We have std / monotimes now. Are you aware?

timotheecour · 2020-03-10T21:10:08Z

I do remember now, it was added fairly recently, however monotimes still adds too much overhead; what's provided in this PR offers the least possible overhead on a platform, which matters a lot for profiling/benchmarking etc

here's updated benchmark, adding in monotimes: getMonoTime has 2X the overhead of getTicks, that means any small measurement will be impacted. Depending on what you're trying to measure, that makes all the difference (it's not always possible to "increase the payload" of what you're measuring, this can yield flawed conclusions).

Looking at the implementation of getMonoTime, this is not something that can be fixed, the overhead is inherent in what getMonoTime does.

cpuTime:
("total secs", 3.924649999999624)
("ns/iter", "392.465")
("iters per sec", "2.5479979106e6")

getMonoTime:
("total secs", 0.282378408)
("ns/iter", "28.2378408")
("iters per sec", "35.4134725485e6")

getTicks:
("total secs", 0.137324019)
("ns/iter", "13.7324019")
("iters per sec", "72.8204728701e6")

# include system/timers
# import t10333b
import std/timers
import std/monotimes
import times
import strutils
proc toSeconds(a: Nanos): float = a.float*1e-9
proc toSeconds(a: float): float = a
proc toSeconds(a: Duration): float = a.nanoseconds.float*1e-9
template test(fun)=
  block:
    proc main()=
      let n = 10_000_000
      type T = type(fun()-fun())
      var dt: T
      for i in 0..<n:
        let t = fun()
        # code to benchmark here (intentionally empty)
        let t2 = fun()
        dt += t2-t
      echo "\n" & astToStr(fun) & ":"
      echo ("total secs", dt.toSeconds)
      echo ("ns/iter", (dt.toSeconds * 1e9 / n.float).formatEng)
      echo ("iters per sec", (n.float / dt.toSeconds).formatEng)
    main()
test(cpuTime)
test(getMonoTime)
test(getTicks)

TODO for subsequent PR's

I can, however, cross-reference properly monotimes, std/timers and cpuTime in the docs and explain pros/cons of each approach
as well as refactor monotimes so it imports std/timers to avoid code duplication it does

timotheecour · 2020-03-11T03:48:02Z

reviewer note: to view the diff (github and git dont' do well when file is renamed and old file is replaced), best to view locally as follows:

git diff parent_hash:./lib/system/timers.nim ./lib/std/timers.nim

GULPF · 2020-03-11T06:23:54Z

On my machine (Ubuntu, latest develop version of Nim):

cpuTime:
("total secs", 4.451975478000229)
("ns/iter", "445.1975478")
("iters per sec", "2.2461938637e6")

getMonoTime:
("total secs", 0.18464677)
("ns/iter", "18.464677")
("iters per sec", "54.1574596729e6")

getTicks:
("total secs", 0.20437)
("ns/iter", "20.437")
("iters per sec", "48.9308606938e6")

I cannot see why monotimes would have a meaningful overhead over timers. The only overhead is the use of Duration in -, but if you really want to avoid that you can do getMonoTime().ticks and use a raw integer instead.

Looking at the implementation of getMonoTime, this is not something that can be fixed, the overhead is inherent in what getMonoTime does.

Please explain.

as well as refactor monotimes so it imports std/timers to avoid code duplication it does

Note that timers falls back to gettimeofday which is not guaranteed to be monotonic and so cannot be used in monotimes.

…merly system/timers include)

mratsim · 2020-03-11T10:42:42Z

I would expect getTicks() to be compiled to a memory barrier + RDTSC (or RDTSCP directly) on x86 if we want as accurate as possible clock cycles report. With the caveat of TurboBoost, CPU frequency scaling and overclocking not reporting the correct cycle count.

See what I use in Weave https://github.com/mratsim/weave/blob/v0.3.0/weave/instrumentation/timers.nim

when defined(i386) or defined(amd64):
  # From Linux
  #
  # The RDTSC instruction is not ordered relative to memory
  # access.  The Intel SDM and the AMD APM are both vague on this
  # point, but empirically an RDTSC instruction can be
  # speculatively executed before prior loads.  An RDTSC
  # immediately after an appropriate barrier appears to be
  # ordered as a normal load, that is, it provides the same
  # ordering guarantees as reading from a global memory location
  # that some other imaginary CPU is updating continuously with a
  # time stamp.
  #
  # From Intel SDM
  # https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  when not defined(vcc):
    when defined(amd64):
      proc getticks(): int64 {.inline.} =
        var lo, hi: int64
        # TODO: Provide a compile-time flag for RDTSCP support
        #       and use it instead of lfence + RDTSC
        {.emit: """asm volatile(
          "lfence\n"
          "rdtsc\n"
          : "=a"(`lo`), "=d"(`hi`)
          :
          : "memory"
        );""".}
        return (hi shl 32) or lo
    else:
      proc getticks(): int64 {.inline.} =
        # TODO: Provide a compile-time flag for RDTSCP support
        #       and use it instead of lfence + RDTSC
        {.emit: """asm volatile(
          "lfence\n"
          "rdtsc\n"
          : "=a"(`result`)
          :
          : "memory"
        );""".}
  else:
    proc rdtsc(): int64 {.sideeffect, importc: "__rdtsc", header: "<intrin.h>".}
    proc lfence() {.importc: "__mm_lfence", header: "<intrin.h>".}

    proc getticks(): int64 {.inline.} =
      lfence()
      return rdtsc()
else:
  when defined(WV_profile):
    {.error: "getticks is not supported on this CPU architecture".}

timotheecour · 2020-03-11T12:46:50Z

I cannot see why monotimes would have a meaningful overhead over timers

it looks like this is platform specific; on OSX, after measuring, the 2X slowdown is caused by overhead in 2 places:

    mach_timebase_info(machAbsoluteTimeFreq)

and

result = MonoTime(ticks: ticks * machAbsoluteTimeFreq.numer div machAbsoluteTimeFreq.denom)

div is expensive.

I'm trying to figure whether mach_timebase_info is constant for a given machine or can vary during lifetime of a process, however apple docs are as uninformative as possible (see https://developer.apple.com/documentation/driverkit/3433733-mach_timebase_info?language=objc)
according to https://stackoverflow.com/questions/23378063/how-can-i-use-mach-absolute-time-without-overflowing it gets even worse: on some machines, ticks * machAbsoluteTimeFreq.numer div machAbsoluteTimeFreq.denom will overflow on 64bit integer arithmetics, which could give wrong results every few minutes! (which would imply the times are off or maybe way off for those machines, at some random times where it overflow).

On PowerPC Macs, mach_timebase_info returns either 1000000000/33333335 or 1000000000/25000000, so Apple's provided code definitely overflows every few minutes. Oops.

avoiding 64 bit overflow can be done using 128bit arithmetics as suggested for that mach_timebase_info issue in the accepted answer (https://stackoverflow.com/a/23378064/1426932); however that's even more expensive and would definitely add overhead, especially if that's done inside getMonoTime as opposed to out of line as I'm doing in this PR
but, according to some other sources, on intel macs, the output of mach_timebase_info could be (1,1), meaning that the overhead of ticks: ticks * machAbsoluteTimeFreq.numer div machAbsoluteTimeFreq.denom becomes 0 in this case....

Note that timers falls back to gettimeofday which is not guaranteed to be monotonic and so cannot be used in monotimes.

some parts can be reused though (especially ugly platform specific parts that shouldn't be written twice)

I'll dig into @mratsim 's answer later

GULPF · 2020-03-11T13:53:45Z

div is expensive

That's true, but there's no reason why monotimes cannot do like timers and delay calling div until it's necessary. Note that the ticks field in MonoTime is not exported, so it's possible to change the meaning of it to be a platform dependent value (for some platforms) without breaking anything.

The monotimes module is intended to serve the same purpose as the timers module (as it's currently written, @mratsim suggestion would be something different). It's possible that timers does something better than monotimes, but in that case I think it's better to improve the monotimes module. We shouldn't need two modules in the stdlib that are this similar.

I'm trying to figure whether mach_timebase_info is constant for a given machine or can vary during lifetime of a process, however apple docs are as uninformative as possible (see https://developer.apple.com/documentation/driverkit/3433733-mach_timebase_info?language=objc)

monotimes did previously assume that, but it was changed recently (e24443f#diff-a50dac6a52159869336dde0e09e9139b).

mratsim · 2020-03-11T15:47:42Z

avoiding 64 bit overflow can be done using 128bit arithmetics as suggested for that mach_timebase_info issue in the accepted answer (https://stackoverflow.com/a/23378064/1426932); however that's even more expensive and would definitely add overhead, especially if that's done inside getMonoTime as opposed to out of line as I'm doing in this PR

For this case 128-bit multiplications and divisions are actually free on x86-64, (64 x 64 / 64 => 128 / 64 => 64)
64-bit division and multiplication are implemented in hardware via div128by64 (the div and idiv instruction) and mul64x64to128 (the mul and imul instruction)

They can harnessed via the following code:

GCC/Clang inline ASM and 128-bit fallback https://github.com/mratsim/constantine/blob/dc0c1c181c10d3c1059d414f4301337d471d1295/constantine/primitives/extended_precision.nim#L78-L135
VCC udiv128 and umul128 https://github.com/mratsim/constantine/blob/dc0c1c181c10d3c1059d414f4301337d471d1295/constantine/primitives/extended_precision.nim#L187-L212

Araq · 2020-04-24T12:51:40Z

Please fix/update/patch std/monotimes instead.

timotheecour changed the title ~~new std/timers module for high performance / low overhead timers (formerly system/timers include)~~ new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) Mar 10, 2020

timotheecour mentioned this pull request Mar 10, 2020

Add strutils in-place #13610

Closed

timotheecour added 2 commits March 11, 2020 02:17

new std/timers module for high performance / low overhead timers (for…

e71f538

…merly system/timers include)

fix for external packages that included system/timers

2954e50

timotheecour force-pushed the pr_getTicks branch from f2bf024 to 2954e50 Compare March 11, 2020 09:18

timotheecour changed the title ~~new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include)~~ [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) Mar 12, 2020

GULPF added the Times label Mar 27, 2020

Araq closed this Apr 24, 2020

timotheecour added the stale Staled PR/issues; remove the label after fixing them label Jun 28, 2020

timotheecour changed the title ~~[WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include)~~ [TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) Jul 28, 2020

timotheecour mentioned this pull request Feb 6, 2021

system/timers and std/monotimes are mostly duplicated, should be refactored timotheecour/Nim#572

Open

timotheecour mentioned this pull request Mar 17, 2021

doc times: do not use now (and also epochTime) for benchmarking #17405

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617

[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617

timotheecour commented Mar 10, 2020 •

edited

Loading

Araq commented Mar 10, 2020

timotheecour commented Mar 10, 2020 •

edited

Loading

timotheecour commented Mar 11, 2020 •

edited

Loading

GULPF commented Mar 11, 2020

mratsim commented Mar 11, 2020

timotheecour commented Mar 11, 2020

GULPF commented Mar 11, 2020

mratsim commented Mar 11, 2020

Araq commented Apr 24, 2020

[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617

[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617

Conversation

timotheecour commented Mar 10, 2020 • edited Loading

example

note

Araq commented Mar 10, 2020

timotheecour commented Mar 10, 2020 • edited Loading

TODO for subsequent PR's

timotheecour commented Mar 11, 2020 • edited Loading

GULPF commented Mar 11, 2020

mratsim commented Mar 11, 2020

timotheecour commented Mar 11, 2020

GULPF commented Mar 11, 2020

mratsim commented Mar 11, 2020

Araq commented Apr 24, 2020

timotheecour commented Mar 10, 2020 •

edited

Loading

timotheecour commented Mar 10, 2020 •

edited

Loading

timotheecour commented Mar 11, 2020 •

edited

Loading