some metrics for monitoring futures #85

stefantalpalaru · 2020-03-13T02:06:21Z

You can see these metrics in action inside a nim-beacon-chain repo, by running make NIMFLAGS="-d:insecure" eth2_network_simulation and, a minute later in another terminal, cd tests/simulation/prometheus; prometheus.

Now you can go to this address in your browser and see something like this:

arnetheduck · 2020-03-17T12:06:07Z

chronos/asyncfutures2.nim

@@ -284,6 +287,9 @@ proc addCallback*(future: FutureBase, cb: CallbackFunc, udata: pointer = nil) =
  ## Adds the callbacks proc to be called when the future completes.
  ##
  ## If future has already completed then ``cb`` will be called immediately.
+  {.gcsafe.}:


not sure if chronos is thread safe, but this is certainly not

also, it seems wrong to be creating strings here and counting the callbacks like this - normally what's done is that if a particular counter is expensive, you count for example 1% of the instances and scale the result accordingly, getting a statistical approximation - something that the metrics library should take care of - if metrics require you to pre-aggregate stuff, something is wrong with the library.

finally, when metrics since it's not necessarily part of every codebase to be running this stuff

I don't want all future types ending up in the metrics. Just the most frequently used ones, hence the aggregation, sorting and truncating (further down, in the other file). I know it's ugly, but dragging around hundreds of labels and trying to sort them with Prometheus queries is uglier.

If I add when defined(metrics): all over the code, it will get less readable. Is the performance gain worth it?

No, it's not thread-safe. When we start using multiple event loops in parallel and we want combined metrics, we'll need locks.

It's not just about runtime performance but also compile time and dependency hygiene..

also, like this, there's a significant chunk of metrics code intertwined with chronos' callback / future logic - code-wise, the metrics stuff should be factored out into its own functions at least so that it takes up less code-space where the important things are going on - it's already a readability issue that needs to be fixed

arnetheduck · 2020-03-17T12:07:33Z

pretty graphs, looks useful as a feature

zah · 2020-03-17T15:29:00Z

chronos/asyncloop.nim

@@ -729,6 +747,28 @@ elif unixPlatform:
    # poll() call.
    loop.processCallbacks()

+    # Wait until we have a decent amount of data and pick the most frequently
+    # seen futures.
+    const


I must agree this feature needs more compile-time configuration and perhaps it should be disabled by default for a library such as Chronos.

cheatfate · 2020-03-17T17:53:38Z

First of all chronos already has trackers which can be considered as metrics replacement. And this feature is cross-platform and used already.

Currently only to track transport or server leaks. But it can be easily extended to track any values inside of chronos.

Also i dont think metrics are valuable source of information, at least in proposed manner, because even if you have number of callbacks processed and number of timers processed, this do not solve any problems. And actually this problems are already known. Maybe in future we can introduce some glue layer to nim-metrics or even migrate nim-metrics to chronos (not asyncdispatch). But right now i dont think its a valuable addition.

Please consider that metrics will work only on *nix platforms.

stefantalpalaru · 2020-03-17T21:15:45Z

added "--skipUserCfg --skipParentCfg" to a test case, which is useful when testing from inside nim-beacon-chain
made the "metrics" module an optional dependency
protected the global count table with a lock, when threading is enabled, making the code thread-safe
moved the "callbacks by future" processing code in its own proc and used that also on Windows
added another metric to the Windows poll()

cheatfate

I understand that you do not care, but all this metrics can be and should be implemented more safely using internal chronos mechanisms.

cheatfate · 2020-03-17T21:28:36Z

chronos/asyncloop.nim

+when defined(metrics):
+  var callbacksByFuture* = initCountTable[string]()
+  when defined(threads):
+    var callbacksByFutureLock*: Lock


callbacksByFuture must not be global, because async dispatcher will be single-threaded even in multi-threaded environment, so for N threads there will be present N dispatchers.

Also because dispatcher will be single-threaded locks are not needed.

Also because dispatcher will be single-threaded locks are not needed.

You can have multiple event loops running in different threads and collect metrics from all of them.

cheatfate · 2020-03-17T21:29:06Z

chronos/asyncloop.nim

-declareCounter chronos_loop_timers, "loop timers"
-declareGauge chronos_loop_timers_queue, "loop timers queue"
+when defined(metrics):
+  declareCounter chronos_loop_timers, "loop timers"


all this metrics variables must not be globals.

cheatfate · 2020-03-17T21:32:36Z

chronos/asyncloop.nim

+  declareCounter chronos_processed_callbacks, "total number of processed callbacks"
+
+
+  proc processCallbacksByFuture() =


What the reason to count callbacks per Future?
Almost every possible Future[T] will have only 1 or 2 callbacks, in rare cases it will have more then 2 (allFutures, one), but i really do not understand why this code ever needed?

What the reason to count callbacks per Future?

Per Future type, not instance.

but i really do not understand why this code ever needed

So I can see which Future types are used more often. In NBC's case, it's those produced by readExactly() and write(), peaking at 140-180 per second, every few seconds.

It shows inefficient internal reads and writes, because they correspond to peaks of 20 OS read events and 50 write events per second.

stefantalpalaru · 2020-03-17T21:47:41Z

all this metrics can be and should be implemented more safely using internal chronos mechanisms

You mean these objects you keep in a table inside the global dispatcher threadvar?

nim-chronos/chronos/asyncloop.nim

Line 198 in 7ed9f14

TrackerBase* = ref object of RootRef

Why resort to that ugly hack when I have proper metrics available, complete with atomic operations (those that confused you when you started commenting that metrics should not be global).

zah · 2020-03-17T23:27:58Z

I understand that you do not care, but all this metrics can be and should be implemented more safely using internal Chronos mechanisms.

Our experience as C/C++ programmers has thought us that a library without dependencies is something to strive for, a virtue. But why is this the case? One reason is that unfocused dependencies can introduce code bloat (e.g. you need a portable tray icon and you end up adding Qt to your project). But the more probable reason is that in C and C++, integrating dependencies in your build can be quite painful.

In Nim, both of these problems can be avoided if we strive to have small and granular dependencies, especially when we are building all of the libraries ourselves. The alternative to having dependencies is to re-implement the same functionality in multiple places. These end up linked in the same binary, leading to an unnecessary increase of the final code size (both for humans to review and for machines to load in caches).

Reusing the same components also changes the economics of development. It's unlikely that we'll ever have the time to work on the Chronos internal metrics, but we might invest in creating real-time visualisation and analysis tools for nim-metrics if it's used everywhere in our codebases. The smaller and more focused our libraries are, the better.

With all this said, there will be a problem with a dependency to nim-metrics here. To avoid the cyclic graph, we'll need to have a separate package called metrics-server that will depend on metrics(-core) and chronos in the future.

cheatfate · 2020-03-18T07:56:10Z

@stefantalpalaru because dispatchers will run in per-thread mode, there will be no global dispatcher which will cover all the threads, so there no reason to use atomics and globals.

@zah because its dependency which perform inc(value) only. And this dependency even not supports custom trackers of some other library, so you need to integrate "globals" and other unsafe constructs to library. So from my point of view if chronos are able to count metrics by its own, nim-metrics should be able to obtain this values without applying changes to chronos scheduler which can seriously affect performance.

cheatfate · 2020-03-18T08:06:41Z

So in future when chronos will start support creating multi-threaded servers there will be N threads and N dispatchers which should work in parallel for every connected peer. But all this dispatchers will be required to sync with single Lock at every possible poll() call. And i'm not talking here about applying sort operation on CountTable.

stefantalpalaru · 2020-03-18T11:23:12Z

So in future when chronos will start support creating multi-threaded servers there will be N threads and N dispatchers which should work in parallel

You support that right now.

all this dispatchers will be required to sync with single Lock at every possible poll() call

Every 50 calls, because of if chronos_poll_ticks.value.int64 mod ticksBetweenChecks == 0:

there no reason to use atomics and globals

There's no harm either.

if chronos are able to count metrics by its own, nim-metrics should be able to obtain this values

How?

zah · 2020-03-19T12:43:20Z

@zah because its dependency which perform inc(value) only. And this dependency even not supports custom trackers of some other library, so you need to integrate "globals" and other unsafe constructs to library. So from my point of view if chronos are able to count metrics by its own, nim-metrics should be able to obtain this values without applying changes to chronos scheduler which can seriously affect performance.

nim-metrics allows you to instantiate the counters as fields within objects as well. Saying that it's just about inc(value) misses the point a bit - a metrics library exists so we can have tools that can monitor the data in real-time, archive it, analyse it, etc.

When the responsibilities are cleanly divided between different libraries, we can invest the time to perfect them. If you have some ideas how nim-metrics can be made less intrusive, go ahead and suggest changes to the API - let's work towards creating the best possible metrics library.

cheatfate · 2020-03-19T16:10:01Z

@zah i dont think discussing nim-metrics features in nim-chronos repository is a right place.
nim-metrics it self its just a inc(value) and nothing more, and if we going to make it less intrusive it should stay with just inc(value). All other dependencies (web servers, udp broadcasters) should be optional.

Right now, if we going to accept this PR we are including low-quiality web server into low-level networking framework. So question is does "BSD sockets" library must have kernel space web server just to allow you to monitor some metrics? I think approach should be different.

Chronos itself is able to perform inc(value) and nim-metrics integration should be able to pass or read this values, without any changes to chronos scheduler.

Second problem is that all nim-metrics values are global variables, and chronos are per thread engine, in such way if we going to use chronos in many threads this variables will be not totally correct or totally incorrect. This also creates many problems with gcsafe.

stefantalpalaru · 2020-03-19T16:21:34Z

nim-metrics it self its just a inc(value) and nothing more

You should read the documentation: https://github.com/status-im/nim-metrics/#nim-metrics

Right now, if we going to accept this PR we are including low-quiality web server into low-level networking framework.

Here's the diff, show me the web server that gets included into Chronos: https://github.com/status-im/nim-chronos/pull/85/files

Second problem is that all nim-metrics values are global variables, and chronos are per thread engine, in such way if we going to use chronos in many threads this variables will be not totally correct or totally incorrect. This also creates many problems with gcsafe.

No, of course it doesn't create any problem. GC safety is already guaranteed by nim-metrics.

Chronos itself is able to perform inc(value) and nim-metrics integration should be able to pass or read this values, without any changes to chronos scheduler.

How? How do I get the rate of Future callbacks executed for a specific Future type out of Chronos right now? What's the API for that? Is that data available now, or a couple of years in the future?

zah · 2020-03-19T18:16:40Z

The web server component in nim-metrics is already optional. Merely using metrics in some project doesn't create any security risks. Furthermore, the spirit of the library is that all monitoring should be optional and I suggested that we should disable it by default here.

cheatfate · 2020-03-25T08:30:37Z

@zah right now i see changes in scheduler and this is not acceptable. It seriously affects performance and create single lock sync problems, if chronos will be used in multi threaded environment. Also because of global counters it produce incorrect numbers which will be sum of all working dispatchers.

zah · 2020-03-25T09:42:17Z

@stefantalpalaru's argument is that the performance impact is not severe, because there is a notion of sampling (you run the measuring code every N-th poll, where N can be controlled).

Furthermore, the performance impact will be zero if the feature is disabled by default which is something I've suggested.

cheatfate · 2020-03-25T11:11:13Z

@zah and every N-th poll you will perform sort operation of thousands Future[T] identifiers just to get metrics information which is not always useful, and after that second thread will perform sort operation of equal dataset. So we are going to freeze all our dispatchers for some time controlled by N?

cheatfate · 2020-03-25T11:12:27Z

As i said this metrics should be carefully applied using internal chronos mechanisms, after that we can make a method to transfer this values to any other compatible metrics framework.

stefantalpalaru · 2020-08-06T02:22:17Z

All the new metrics, showing a Medalla node (make NIMFLAGS="-d:insecure -d:chronosFutureTracking" medalla) in the initial block syncing phase. No attestation processing yet.

zah · 2020-08-11T10:23:14Z

While having metrics from Chronos is an extremely valuable addition to NBC, this PR has been merged without establishing a consensus within the team which is something we should strive to avoid. Even more relaxed practices such as rough consensus would have been violated, so the best way forward is to revert the change as soon as possible and continue the discussion from there.

The concerns of both sides are valid:

We need metrics from Chronos
The integration was not done in an acceptable way (due to the use of global variables and locking)

The right course of action would have been improving the PR until there are no more blocking objections. Through the use of instance-bound counters (either coming from nim-metrics or through regular variables), the undesired locking would have been avoided and this was a clear direction for how the PR could have been improved.

That said, I would also point out that progress requires both sides to cooperate. This PR has been lingering for too long and we've been deprived from the benefits it brings for no good reason (forcing our hand a little bit to merge it in the current emergency). The metrics library is envisioned to become a very comprehensive solution with instance-bound counters offering fine-grained compile-time configuration mechanisms that are unpractical to develop internally in Chronos and thus it can make Chronos a better library in the long-term. The presented objections towards having it as dependency has been largely addressed (e.g. It was already pointed out that the server components there are optional) and I wish @cheatfate would have more good will in general towards the nim-metrics vision shared by the team.

cheatfate · 2020-08-11T12:12:21Z

@zah

Sorry, but i can't see how i can cooperate if people totally ignoring my concerns.

You said this metrics are very valuable asset, but i never seen that something got improved using metrics. So the only way when metrics can be used as a tool to find and fix something is when you put metric "file_xxx_line_yyy_metric_zzz" at every line of your code. With current approach the only way of using metrics is to find correlation between one even and another event... What i can say if somebody wants to see such correlations it means there no understanding on how code is working...

For example i can't understand what number of pending timers or pending callbacks can give to the developer at some moment of time... Without diving deep into this callbacks and timers this counters are absolutely useless.

Poor understanding of how async is working generates poor PRs.

zah · 2020-08-11T12:28:37Z

@cheatfate, it's very hard to argue that having metrics is not useful. Would you claim that you were never surprised by a result we obtained from metrics? As scientists often say, many discoveries don't come from deductions, but rather from observations that makes you say "Hmm, that's strange".

cheatfate · 2020-08-11T12:38:22Z

@zah sorry but i have never been surprised by metrics, usually i have mentioned problems highlighted by metrics many months before it was highlighted by metrics.

Also nim-metrics has pretty poor API and design which forces everybody to declare metrics in global scope, as a consequences you are required to put gcsafe pragma everywhere where you working with this metric variables... So you turning off pretty useful compiler feature just because you want to use metrics. I dont think metrics are worth it.

zah · 2020-08-11T12:56:44Z

Also nim-metrics has pretty poor API and design which forces everybody to declare metrics in global scope, as a consequences you are required to put gcsafe pragma everywhere where you working with this metric variables... So you turning off pretty useful compiler feature just because you want to use metrics. I dont think metrics are worth it.

This is not true and it's an example for the lack of cooperation that I'm highlighting. The API already allows you to use the counters as fields inside objects and in the future this will get even better because it will be possible to use a compile-time switch that automatically removes the field of a particular metric and all the corresponding updates to it. This is the kind of capability that takes some effort to perfect in a library like nim-metrics and that doesn't make sense for other libraries to re-invent.

Give the other side the benefit of the doubt, work with them towards improving the API instead of claiming it's just not good and treating it as if it was developed by a third party team that you cannot influence.

cheatfate · 2020-08-11T13:05:04Z

@zah if its true why every possible interaction with metrics requires to set {.gcsafe.}? I think we are all confused here because of metrics here.

stefantalpalaru · 2020-08-11T13:13:09Z

this PR has been merged without establishing a consensus within the team

Libp2p developers jumping through hoops to use the Chronos branch with metrics counts as a consensus where it matters.

the best way forward is to revert the change as soon as possible and continue the discussion from there

I disagree. I see a very useful feature being blocked for almost 5 months based on non-technical reasons. If anything, we need to stop driving with this handbrake on.

If you hold some fringe beliefs about real programmers debugging without instrumentation, you don't get to veto features useful to us - lesser programmers doing the actual debugging. The feature gets merged sooner, rather than later and whenever you manage to reimplement those metrics you say you understand a lot better, you can make another PR.

This policy is based on common sense and it would allow us to iterate faster, which is very important in this alpha stage.

This reverts commit e45ef32. Metrics implemented this way, with a lock inside the otherwise tight event loop are not consistent with the chronos architecture that for good or bad uses thread local variables to avoid them - the solution does not have rough consensus behind it, and other avenues should be explored for this generally useful functionality.

stefantalpalaru force-pushed the metrics branch from bdaa004 to 1c1e0e3 Compare March 13, 2020 02:11

arnetheduck reviewed Mar 17, 2020

View reviewed changes

zah reviewed Mar 17, 2020

View reviewed changes

cheatfate reviewed Mar 17, 2020

View reviewed changes

stefantalpalaru force-pushed the metrics branch 2 times, most recently from 01a0328 to 81fd5cc Compare April 2, 2020 15:42

stefantalpalaru force-pushed the metrics branch from 6ccd8ef to 325f308 Compare April 9, 2020 22:06

stefantalpalaru mentioned this pull request Jun 5, 2020

100% cpu usage synced node status-im/nimbus-eth2#1122

Closed

some metrics for monitoring futures

4a1be99

stefantalpalaru force-pushed the metrics branch from 325f308 to 4a1be99 Compare August 6, 2020 02:18

stefantalpalaru merged commit e45ef32 into master Aug 6, 2020

stefantalpalaru deleted the metrics branch August 6, 2020 17:30

mratsim mentioned this pull request Sep 23, 2020

-d:useMalloc for the default GC nim-lang/Nim#15394

Open

bung87 pushed a commit to bung87/nim-chronos that referenced this pull request Nov 17, 2020

some metrics for monitoring futures (status-im#85)

64247fd

stefantalpalaru mentioned this pull request Nov 16, 2021

Nim GC metrics for the main thread status-im/nimbus-eth2#3108

Merged

		declareCounter chronos_processed_callbacks, "total number of processed callbacks"


		proc processCallbacksByFuture() =

some metrics for monitoring futures #85

some metrics for monitoring futures #85

Conversation

stefantalpalaru commented Mar 13, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arnetheduck commented Mar 17, 2020

Choose a reason for hiding this comment

cheatfate commented Mar 17, 2020

stefantalpalaru commented Mar 17, 2020

cheatfate left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

stefantalpalaru commented Mar 17, 2020

zah commented Mar 17, 2020 • edited Loading

cheatfate commented Mar 18, 2020

cheatfate commented Mar 18, 2020 • edited Loading

stefantalpalaru commented Mar 18, 2020

zah commented Mar 19, 2020

cheatfate commented Mar 19, 2020

stefantalpalaru commented Mar 19, 2020

zah commented Mar 19, 2020

cheatfate commented Mar 25, 2020

zah commented Mar 25, 2020

cheatfate commented Mar 25, 2020

cheatfate commented Mar 25, 2020 • edited Loading

stefantalpalaru commented Aug 6, 2020

zah commented Aug 11, 2020

cheatfate commented Aug 11, 2020

zah commented Aug 11, 2020

cheatfate commented Aug 11, 2020

zah commented Aug 11, 2020 • edited Loading

cheatfate commented Aug 11, 2020

stefantalpalaru commented Aug 11, 2020

zah commented Mar 17, 2020 •

edited

Loading

cheatfate commented Mar 18, 2020 •

edited

Loading

cheatfate commented Mar 25, 2020 •

edited

Loading

zah commented Aug 11, 2020 •

edited

Loading