[TODO] [WIP] new std/timers module for high performance / low overhead timers and benchmarking (formerly system/timers include) #13617
We have |
I do remember now; it was added fairly recently. However, monotimes still adds too much overhead; what's provided in this PR offers the least possible overhead on a platform, which matters a lot for profiling/benchmarking etc. Here's an updated benchmark, adding in monotimes: Looking at the implementation of
# include system/timers
# import t10333b
import std/timers
import std/monotimes
import times
import strutils

proc toSeconds(a: Nanos): float = a.float*1e-9
proc toSeconds(a: float): float = a
proc toSeconds(a: Duration): float = a.nanoseconds.float*1e-9

template test(fun)=
  block:
    proc main()=
      let n = 10_000_000
      type T = type(fun()-fun())
      var dt: T
      for i in 0..<n:
        let t = fun()
        # code to benchmark here (intentionally empty)
        let t2 = fun()
        dt += t2-t
      echo "\n" & astToStr(fun) & ":"
      echo ("total secs", dt.toSeconds)
      echo ("ns/iter", (dt.toSeconds * 1e9 / n.float).formatEng)
      echo ("iters per sec", (n.float / dt.toSeconds).formatEng)
    main()

test(cpuTime)
test(getMonoTime)
test(getTicks)

TODO for subsequent PRs
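The benchmark above times an empty loop body, so what it reports is the cost of the timer call itself. A minimal sketch of how that figure could be used to correct a real measurement follows. It uses only `getMonoTime` from std/monotimes and `inNanoseconds` from std/times; the names `timerOverheadNs` and `measureNs` are hypothetical helpers invented for this illustration and are not part of this PR or the stdlib.

```nim
import std/[monotimes, times]

proc timerOverheadNs(samples = 100_000): float =
  ## Hypothetical helper: average cost, in nanoseconds, of one
  ## back-to-back pair of getMonoTime calls.
  var total: int64 = 0
  for _ in 0 ..< samples:
    let t0 = getMonoTime()
    let t1 = getMonoTime()
    total += (t1 - t0).inNanoseconds
  result = total.float / samples.float

template measureNs(overheadNs: float, body: untyped): float =
  ## Hypothetical helper: runs `body` once and returns the elapsed
  ## nanoseconds minus the estimated timer cost, clamped at zero.
  block:
    let t0 = getMonoTime()
    body
    let t1 = getMonoTime()
    max(0.0, (t1 - t0).inNanoseconds.float - overheadNs)

when isMainModule:
  let overhead = timerOverheadNs()
  var acc = 0
  let elapsed = measureNs(overhead):
    for i in 0 ..< 1_000:
      acc += i
  echo "timer overhead (ns): ", overhead
  echo "corrected elapsed (ns): ", elapsed, " (acc = ", acc, ")"
```

Subtracting an average only helps when the workload is not much cheaper than the timer call; for very short workloads the correction is dominated by noise, which is why a lower-overhead timer matters.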
|
reviewer note: to view the diff (GitHub and git don't do well when a file is renamed and the old file is replaced), it's best to view it locally as follows:
|
On my machine (Ubuntu, latest develop version of Nim):
cpuTime:
("total secs", 4.451975478000229)
("ns/iter", "445.1975478")
("iters per sec", "2.2461938637e6")
getMonoTime:
("total secs", 0.18464677)
("ns/iter", "18.464677")
("iters per sec", "54.1574596729e6")
getTicks:
("total secs", 0.20437)
("ns/iter", "20.437")
("iters per sec", "48.9308606938e6")
I cannot see why
Please explain.
Note that |
I would expect
See what I use in Weave: https://github.com/mratsim/weave/blob/v0.3.0/weave/instrumentation/timers.nim

when defined(i386) or defined(amd64):
  # From Linux
  #
  # The RDTSC instruction is not ordered relative to memory
  # access. The Intel SDM and the AMD APM are both vague on this
  # point, but empirically an RDTSC instruction can be
  # speculatively executed before prior loads. An RDTSC
  # immediately after an appropriate barrier appears to be
  # ordered as a normal load, that is, it provides the same
  # ordering guarantees as reading from a global memory location
  # that some other imaginary CPU is updating continuously with a
  # time stamp.
  #
  # From Intel SDM
  # https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
  when not defined(vcc):
    when defined(amd64):
      proc getticks(): int64 {.inline.} =
        var lo, hi: int64
        # TODO: Provide a compile-time flag for RDTSCP support
        # and use it instead of lfence + RDTSC
        {.emit: """asm volatile(
          "lfence\n"
          "rdtsc\n"
          : "=a"(`lo`), "=d"(`hi`)
          :
          : "memory"
        );""".}
        return (hi shl 32) or lo
    else:
      proc getticks(): int64 {.inline.} =
        # TODO: Provide a compile-time flag for RDTSCP support
        # and use it instead of lfence + RDTSC
        {.emit: """asm volatile(
          "lfence\n"
          "rdtsc\n"
          : "=a"(`result`)
          :
          : "memory"
        );""".}
  else:
    proc rdtsc(): int64 {.sideeffect, importc: "__rdtsc", header: "<intrin.h>".}
    proc lfence() {.importc: "__mm_lfence", header: "<intrin.h>".}

    proc getticks(): int64 {.inline.} =
      lfence()
      return rdtsc()
else:
  when defined(WV_profile):
    {.error: "getticks is not supported on this CPU architecture".}
|
it looks like this is platform specific; on OSX, after measuring, the 2X slowdown is caused by overhead in 2 places:
and
some parts can be reused though (especially ugly platform specific parts that shouldn't be written twice). I'll dig into @mratsim's answer later |
That's true, but there's no reason why The
|
For this case, 128-bit multiplications and divisions are actually free on x86-64 (64 x 64 / 64 => 128 / 64 => 64). They can be harnessed via the following code:
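A minimal sketch of the idea (not the commenter's original snippet), assuming a 64-bit target and a GCC- or Clang-compatible C backend, since `unsigned __int128` is a compiler extension rather than standard C. The `mulDiv` name is hypothetical and chosen only for this illustration:

```nim
# Assumes a 64-bit target and a GCC- or Clang-compatible C backend:
# `unsigned __int128` is a compiler extension, not standard C.
proc mulDiv(a, b, c: uint64): uint64 =
  ## Hypothetical helper: computes (a * b) div c with a 128-bit
  ## intermediate product, so a * b itself cannot overflow.
  ## `c` must be non-zero, and the quotient must fit in 64 bits,
  ## otherwise the result is truncated.
  {.emit: """`result` = (unsigned long long)(((unsigned __int128)`a` * `b`) / `c`);""".}

when isMainModule:
  # For example, scaling a tick count by a numer/denom timebase
  # where ticks * numer would overflow 64 bits.
  let ticks = 0xFFFF_FFFF_FFFF_FFFF'u64
  echo mulDiv(ticks, 10'u64, 20'u64)
```

This is the shape of computation needed when converting a raw tick count to nanoseconds with a rational timebase, where the intermediate product can exceed 64 bits even though the final result does not.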
|
Please fix/update/patch std/monotimes instead. |
This PR turns the system/timers include into a new stdlib module, std/timers.
cpuTime can be misleading for benchmarking because it adds a lot of overhead (for various reasons, including FP operations), so you get flawed conclusions unless the workload is significantly more expensive than the cost of cpuTime, which isn't always possible without affecting the thing you're measuring.
getTicks is more adequate, giving the highest timer precision and least overhead; in practice, this is what people use outside of Nim (e.g. QueryPerformanceCounter / mach_absolute_time etc, which is wrapped by timers).
example
on local OSX, this shows getTicks has an overhead of 16ns, vs 450ns for cpuTime, i.e. 27X less overhead.
getTicks also has the highest available precision on a given machine using platform specific APIs (the benchmark below doesn't measure this).
prints:
note
I added a dummy system/timers.nim with a deprecation msg; the only tested package that required it was https://krux02@bitbucket.org/krux02/tensordslnim.git
code that relied on system/timers can avoid the warning by:
or (after 1.2 release) by checking on nim version