Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

some metrics for monitoring futures #85

Merged
merged 1 commit into from
Aug 6, 2020
Merged

some metrics for monitoring futures #85

merged 1 commit into from
Aug 6, 2020

Conversation

stefantalpalaru
Copy link
Contributor

You can see these metrics in action inside a nim-beacon-chain repo, by running make NIMFLAGS="-d:insecure" eth2_network_simulation and, a minute later in another terminal, cd tests/simulation/prometheus; prometheus.

Now you can go to this address in your browser and see something like this: graph

@@ -284,6 +287,9 @@ proc addCallback*(future: FutureBase, cb: CallbackFunc, udata: pointer = nil) =
## Adds the callbacks proc to be called when the future completes.
##
## If future has already completed then ``cb`` will be called immediately.
{.gcsafe.}:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if chronos is thread safe, but this is certainly not

also, it seems wrong to be creating strings here and counting the callbacks like this - normally what's done is that if a particular counter is expensive, you count for example 1% of the instances and scale the result accordingly, getting a statistical approximation - something that the metrics library should take care of - if metrics require you to pre-aggregate stuff, something is wrong with the library.

finally, when metrics since it's not necessarily part of every codebase to be running this stuff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want all future types ending up in the metrics. Just the most frequently used ones, hence the aggregation, sorting and truncating (further down, in the other file). I know it's ugly, but dragging around hundreds of labels and trying to sort them with Prometheus queries is uglier.

If I add when defined(metrics): all over the code, it will get less readable. Is the performance gain worth it?

No, it's not thread-safe. When we start using multiple event loops in parallel and we want combined metrics, we'll need locks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just about runtime performance but also compile time and dependency hygiene..

also, like this, there's a significant chunk of metrics code intertwined with chronos' callback / future logic - code-wise, the metrics stuff should be factored out into its own functions at least so that it takes up less code-space where the important things are going on - it's already a readability issue that needs to be fixed

@arnetheduck
Copy link
Member

pretty graphs, looks useful as a feature

@@ -729,6 +747,28 @@ elif unixPlatform:
# poll() call.
loop.processCallbacks()

# Wait until we have a decent amount of data and pick the most frequently
# seen futures.
const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I must agree this feature needs more compile-time configuration and perhaps it should be disabled by default for a library such as Chronos.

@cheatfate
Copy link
Collaborator

First of all chronos already has trackers which can be considered as metrics replacement. And this feature is cross-platform and used already.

Currently only to track transport or server leaks. But it can be easily extended to track any values inside of chronos.

Also i dont think metrics are valuable source of information, at least in proposed manner, because even if you have number of callbacks processed and number of timers processed, this do not solve any problems. And actually this problems are already known. Maybe in future we can introduce some glue layer to nim-metrics or even migrate nim-metrics to chronos (not asyncdispatch). But right now i dont think its a valuable addition.

Please consider that metrics will work only on *nix platforms.

@stefantalpalaru
Copy link
Contributor Author

  • added "--skipUserCfg --skipParentCfg" to a test case, which is useful when testing from inside nim-beacon-chain
  • made the "metrics" module an optional dependency
  • protected the global count table with a lock, when threading is enabled, making the code thread-safe
  • moved the "callbacks by future" processing code in its own proc and used that also on Windows
  • added another metric to the Windows poll()

Copy link
Collaborator

@cheatfate cheatfate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that you do not care, but all this metrics can be and should be implemented more safely using internal chronos mechanisms.

when defined(metrics):
var callbacksByFuture* = initCountTable[string]()
when defined(threads):
var callbacksByFutureLock*: Lock
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

callbacksByFuture must not be global, because async dispatcher will be single-threaded even in multi-threaded environment, so for N threads there will be present N dispatchers.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also because dispatcher will be single-threaded locks are not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also because dispatcher will be single-threaded locks are not needed.

You can have multiple event loops running in different threads and collect metrics from all of them.

declareCounter chronos_loop_timers, "loop timers"
declareGauge chronos_loop_timers_queue, "loop timers queue"
when defined(metrics):
declareCounter chronos_loop_timers, "loop timers"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this metrics variables must not be globals.

declareCounter chronos_processed_callbacks, "total number of processed callbacks"


proc processCallbacksByFuture() =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the reason to count callbacks per Future?
Almost every possible Future[T] will have only 1 or 2 callbacks, in rare cases it will have more then 2 (allFutures, one), but i really do not understand why this code ever needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the reason to count callbacks per Future?

Per Future type, not instance.

but i really do not understand why this code ever needed

So I can see which Future types are used more often. In NBC's case, it's those produced by readExactly() and write(), peaking at 140-180 per second, every few seconds.

It shows inefficient internal reads and writes, because they correspond to peaks of 20 OS read events and 50 write events per second.

@stefantalpalaru
Copy link
Contributor Author

all this metrics can be and should be implemented more safely using internal chronos mechanisms

You mean these objects you keep in a table inside the global dispatcher threadvar?

TrackerBase* = ref object of RootRef

Why resort to that ugly hack when I have proper metrics available, complete with atomic operations (those that confused you when you started commenting that metrics should not be global).

@zah
Copy link
Contributor

zah commented Mar 17, 2020

I understand that you do not care, but all this metrics can be and should be implemented more safely using internal Chronos mechanisms.

Our experience as C/C++ programmers has thought us that a library without dependencies is something to strive for, a virtue. But why is this the case? One reason is that unfocused dependencies can introduce code bloat (e.g. you need a portable tray icon and you end up adding Qt to your project). But the more probable reason is that in C and C++, integrating dependencies in your build can be quite painful.

In Nim, both of these problems can be avoided if we strive to have small and granular dependencies, especially when we are building all of the libraries ourselves. The alternative to having dependencies is to re-implement the same functionality in multiple places. These end up linked in the same binary, leading to an unnecessary increase of the final code size (both for humans to review and for machines to load in caches).

Reusing the same components also changes the economics of development. It's unlikely that we'll ever have the time to work on the Chronos internal metrics, but we might invest in creating real-time visualisation and analysis tools for nim-metrics if it's used everywhere in our codebases. The smaller and more focused our libraries are, the better.

With all this said, there will be a problem with a dependency to nim-metrics here. To avoid the cyclic graph, we'll need to have a separate package called metrics-server that will depend on metrics(-core) and chronos in the future.

@cheatfate
Copy link
Collaborator

@stefantalpalaru because dispatchers will run in per-thread mode, there will be no global dispatcher which will cover all the threads, so there no reason to use atomics and globals.

@zah because its dependency which perform inc(value) only. And this dependency even not supports custom trackers of some other library, so you need to integrate "globals" and other unsafe constructs to library. So from my point of view if chronos are able to count metrics by its own, nim-metrics should be able to obtain this values without applying changes to chronos scheduler which can seriously affect performance.

@cheatfate
Copy link
Collaborator

cheatfate commented Mar 18, 2020

So in future when chronos will start support creating multi-threaded servers there will be N threads and N dispatchers which should work in parallel for every connected peer. But all this dispatchers will be required to sync with single Lock at every possible poll() call. And i'm not talking here about applying sort operation on CountTable.

@stefantalpalaru
Copy link
Contributor Author

So in future when chronos will start support creating multi-threaded servers there will be N threads and N dispatchers which should work in parallel

You support that right now.

all this dispatchers will be required to sync with single Lock at every possible poll() call

Every 50 calls, because of if chronos_poll_ticks.value.int64 mod ticksBetweenChecks == 0:

there no reason to use atomics and globals

There's no harm either.

if chronos are able to count metrics by its own, nim-metrics should be able to obtain this values

How?

@zah
Copy link
Contributor

zah commented Mar 19, 2020

@zah because its dependency which perform inc(value) only. And this dependency even not supports custom trackers of some other library, so you need to integrate "globals" and other unsafe constructs to library. So from my point of view if chronos are able to count metrics by its own, nim-metrics should be able to obtain this values without applying changes to chronos scheduler which can seriously affect performance.

nim-metrics allows you to instantiate the counters as fields within objects as well. Saying that it's just about inc(value) misses the point a bit - a metrics library exists so we can have tools that can monitor the data in real-time, archive it, analyse it, etc.

When the responsibilities are cleanly divided between different libraries, we can invest the time to perfect them. If you have some ideas how nim-metrics can be made less intrusive, go ahead and suggest changes to the API - let's work towards creating the best possible metrics library.

@cheatfate
Copy link
Collaborator

@zah i dont think discussing nim-metrics features in nim-chronos repository is a right place.
nim-metrics it self its just a inc(value) and nothing more, and if we going to make it less intrusive it should stay with just inc(value). All other dependencies (web servers, udp broadcasters) should be optional.

Right now, if we going to accept this PR we are including low-quiality web server into low-level networking framework. So question is does "BSD sockets" library must have kernel space web server just to allow you to monitor some metrics? I think approach should be different.

Chronos itself is able to perform inc(value) and nim-metrics integration should be able to pass or read this values, without any changes to chronos scheduler.

Second problem is that all nim-metrics values are global variables, and chronos are per thread engine, in such way if we going to use chronos in many threads this variables will be not totally correct or totally incorrect. This also creates many problems with gcsafe.

@stefantalpalaru
Copy link
Contributor Author

nim-metrics it self its just a inc(value) and nothing more

You should read the documentation: https://github.com/status-im/nim-metrics/#nim-metrics

Right now, if we going to accept this PR we are including low-quiality web server into low-level networking framework.

Here's the diff, show me the web server that gets included into Chronos: https://github.com/status-im/nim-chronos/pull/85/files

Second problem is that all nim-metrics values are global variables, and chronos are per thread engine, in such way if we going to use chronos in many threads this variables will be not totally correct or totally incorrect. This also creates many problems with gcsafe.

No, of course it doesn't create any problem. GC safety is already guaranteed by nim-metrics.

Chronos itself is able to perform inc(value) and nim-metrics integration should be able to pass or read this values, without any changes to chronos scheduler.

How? How do I get the rate of Future callbacks executed for a specific Future type out of Chronos right now? What's the API for that? Is that data available now, or a couple of years in the future?

@zah
Copy link
Contributor

zah commented Mar 19, 2020

The web server component in nim-metrics is already optional. Merely using metrics in some project doesn't create any security risks. Furthermore, the spirit of the library is that all monitoring should be optional and I suggested that we should disable it by default here.

@cheatfate
Copy link
Collaborator

@zah right now i see changes in scheduler and this is not acceptable. It seriously affects performance and create single lock sync problems, if chronos will be used in multi threaded environment. Also because of global counters it produce incorrect numbers which will be sum of all working dispatchers.

@zah
Copy link
Contributor

zah commented Mar 25, 2020

@stefantalpalaru's argument is that the performance impact is not severe, because there is a notion of sampling (you run the measuring code every N-th poll, where N can be controlled).

Furthermore, the performance impact will be zero if the feature is disabled by default which is something I've suggested.

@cheatfate
Copy link
Collaborator

@zah and every N-th poll you will perform sort operation of thousands Future[T] identifiers just to get metrics information which is not always useful, and after that second thread will perform sort operation of equal dataset. So we are going to freeze all our dispatchers for some time controlled by N?

@cheatfate
Copy link
Collaborator

cheatfate commented Mar 25, 2020

As i said this metrics should be carefully applied using internal chronos mechanisms, after that we can make a method to transfer this values to any other compatible metrics framework.

@stefantalpalaru
Copy link
Contributor Author

All the new metrics, showing a Medalla node (make NIMFLAGS="-d:insecure -d:chronosFutureTracking" medalla) in the initial block syncing phase. No attestation processing yet.

img

@stefantalpalaru stefantalpalaru merged commit e45ef32 into master Aug 6, 2020
@stefantalpalaru stefantalpalaru deleted the metrics branch August 6, 2020 17:30
@zah
Copy link
Contributor

zah commented Aug 11, 2020

While having metrics from Chronos is an extremely valuable addition to NBC, this PR has been merged without establishing a consensus within the team which is something we should strive to avoid. Even more relaxed practices such as rough consensus would have been violated, so the best way forward is to revert the change as soon as possible and continue the discussion from there.

The concerns of both sides are valid:

  1. We need metrics from Chronos
  2. The integration was not done in an acceptable way (due to the use of global variables and locking)

The right course of action would have been improving the PR until there are no more blocking objections. Through the use of instance-bound counters (either coming from nim-metrics or through regular variables), the undesired locking would have been avoided and this was a clear direction for how the PR could have been improved.

That said, I would also point out that progress requires both sides to cooperate. This PR has been lingering for too long and we've been deprived from the benefits it brings for no good reason (forcing our hand a little bit to merge it in the current emergency). The metrics library is envisioned to become a very comprehensive solution with instance-bound counters offering fine-grained compile-time configuration mechanisms that are unpractical to develop internally in Chronos and thus it can make Chronos a better library in the long-term. The presented objections towards having it as dependency has been largely addressed (e.g. It was already pointed out that the server components there are optional) and I wish @cheatfate would have more good will in general towards the nim-metrics vision shared by the team.

@cheatfate
Copy link
Collaborator

@zah

Sorry, but i can't see how i can cooperate if people totally ignoring my concerns.

You said this metrics are very valuable asset, but i never seen that something got improved using metrics. So the only way when metrics can be used as a tool to find and fix something is when you put metric "file_xxx_line_yyy_metric_zzz" at every line of your code. With current approach the only way of using metrics is to find correlation between one even and another event... What i can say if somebody wants to see such correlations it means there no understanding on how code is working...

For example i can't understand what number of pending timers or pending callbacks can give to the developer at some moment of time... Without diving deep into this callbacks and timers this counters are absolutely useless.

Poor understanding of how async is working generates poor PRs.

@zah
Copy link
Contributor

zah commented Aug 11, 2020

@cheatfate, it's very hard to argue that having metrics is not useful. Would you claim that you were never surprised by a result we obtained from metrics? As scientists often say, many discoveries don't come from deductions, but rather from observations that makes you say "Hmm, that's strange".

@cheatfate
Copy link
Collaborator

@zah sorry but i have never been surprised by metrics, usually i have mentioned problems highlighted by metrics many months before it was highlighted by metrics.

Also nim-metrics has pretty poor API and design which forces everybody to declare metrics in global scope, as a consequences you are required to put gcsafe pragma everywhere where you working with this metric variables... So you turning off pretty useful compiler feature just because you want to use metrics. I dont think metrics are worth it.

@zah
Copy link
Contributor

zah commented Aug 11, 2020

Also nim-metrics has pretty poor API and design which forces everybody to declare metrics in global scope, as a consequences you are required to put gcsafe pragma everywhere where you working with this metric variables... So you turning off pretty useful compiler feature just because you want to use metrics. I dont think metrics are worth it.

This is not true and it's an example for the lack of cooperation that I'm highlighting. The API already allows you to use the counters as fields inside objects and in the future this will get even better because it will be possible to use a compile-time switch that automatically removes the field of a particular metric and all the corresponding updates to it. This is the kind of capability that takes some effort to perfect in a library like nim-metrics and that doesn't make sense for other libraries to re-invent.

Give the other side the benefit of the doubt, work with them towards improving the API instead of claiming it's just not good and treating it as if it was developed by a third party team that you cannot influence.

@cheatfate
Copy link
Collaborator

@zah if its true why every possible interaction with metrics requires to set {.gcsafe.}? I think we are all confused here because of metrics here.

@stefantalpalaru
Copy link
Contributor Author

this PR has been merged without establishing a consensus within the team

Libp2p developers jumping through hoops to use the Chronos branch with metrics counts as a consensus where it matters.

the best way forward is to revert the change as soon as possible and continue the discussion from there

I disagree. I see a very useful feature being blocked for almost 5 months based on non-technical reasons. If anything, we need to stop driving with this handbrake on.

If you hold some fringe beliefs about real programmers debugging without instrumentation, you don't get to veto features useful to us - lesser programmers doing the actual debugging. The feature gets merged sooner, rather than later and whenever you manage to reimplement those metrics you say you understand a lot better, you can make another PR.

This policy is based on common sense and it would allow us to iterate faster, which is very important in this alpha stage.

zah added a commit that referenced this pull request Aug 15, 2020
This reverts commit e45ef32.

Metrics implemented this way, with a lock inside the otherwise tight
event loop are not consistent with the chronos architecture that for
good or bad uses thread local variables to avoid them - the solution
does not have rough consensus behind it, and other avenues should be
explored for this generally useful functionality.
zah added a commit that referenced this pull request Aug 15, 2020
This reverts commit e45ef32.

Metrics implemented this way, with a lock inside the otherwise tight
event loop are not consistent with the chronos architecture that for
good or bad uses thread local variables to avoid them - the solution
does not have rough consensus behind it, and other avenues should be
explored for this generally useful functionality.
bung87 pushed a commit to bung87/nim-chronos that referenced this pull request Nov 17, 2020
bung87 pushed a commit to bung87/nim-chronos that referenced this pull request Nov 17, 2020
This reverts commit e45ef32.

Metrics implemented this way, with a lock inside the otherwise tight
event loop are not consistent with the chronos architecture that for
good or bad uses thread local variables to avoid them - the solution
does not have rough consensus behind it, and other avenues should be
explored for this generally useful functionality.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants