-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
some metrics for monitoring futures #85
Conversation
bdaa004
to
1c1e0e3
Compare
chronos/asyncfutures2.nim
Outdated
@@ -284,6 +287,9 @@ proc addCallback*(future: FutureBase, cb: CallbackFunc, udata: pointer = nil) = | |||
## Adds the callbacks proc to be called when the future completes. | |||
## | |||
## If future has already completed then ``cb`` will be called immediately. | |||
{.gcsafe.}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if chronos is thread safe, but this is certainly not
also, it seems wrong to be creating strings here and counting the callbacks like this - normally what's done is that if a particular counter is expensive, you count for example 1% of the instances and scale the result accordingly, getting a statistical approximation - something that the metrics library should take care of - if metrics require you to pre-aggregate stuff, something is wrong with the library.
finally, when metrics
since it's not necessarily part of every codebase to be running this stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want all future types ending up in the metrics. Just the most frequently used ones, hence the aggregation, sorting and truncating (further down, in the other file). I know it's ugly, but dragging around hundreds of labels and trying to sort them with Prometheus queries is uglier.
If I add when defined(metrics):
all over the code, it will get less readable. Is the performance gain worth it?
No, it's not thread-safe. When we start using multiple event loops in parallel and we want combined metrics, we'll need locks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not just about runtime performance but also compile time and dependency hygiene..
also, like this, there's a significant chunk of metrics code intertwined with chronos' callback / future logic - code-wise, the metrics stuff should be factored out into its own functions at least so that it takes up less code-space where the important things are going on - it's already a readability issue that needs to be fixed
pretty graphs, looks useful as a feature |
chronos/asyncloop.nim
Outdated
@@ -729,6 +747,28 @@ elif unixPlatform: | |||
# poll() call. | |||
loop.processCallbacks() | |||
|
|||
# Wait until we have a decent amount of data and pick the most frequently | |||
# seen futures. | |||
const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must agree this feature needs more compile-time configuration and perhaps it should be disabled by default for a library such as Chronos.
First of all Currently only to track transport or server leaks. But it can be easily extended to track any values inside of Also i dont think Please consider that metrics will work only on *nix platforms. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that you do not care, but all this metrics can be and should be implemented more safely using internal chronos mechanisms.
chronos/asyncloop.nim
Outdated
when defined(metrics): | ||
var callbacksByFuture* = initCountTable[string]() | ||
when defined(threads): | ||
var callbacksByFutureLock*: Lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
callbacksByFuture
must not be global, because async dispatcher will be single-threaded even in multi-threaded environment, so for N threads there will be present N dispatchers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also because dispatcher will be single-threaded locks are not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also because dispatcher will be single-threaded locks are not needed.
You can have multiple event loops running in different threads and collect metrics from all of them.
declareCounter chronos_loop_timers, "loop timers" | ||
declareGauge chronos_loop_timers_queue, "loop timers queue" | ||
when defined(metrics): | ||
declareCounter chronos_loop_timers, "loop timers" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all this metrics variables must not be globals.
chronos/asyncloop.nim
Outdated
declareCounter chronos_processed_callbacks, "total number of processed callbacks" | ||
|
||
|
||
proc processCallbacksByFuture() = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the reason to count callbacks per Future?
Almost every possible Future[T] will have only 1 or 2 callbacks, in rare cases it will have more then 2 (allFutures, one), but i really do not understand why this code ever needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the reason to count callbacks per Future?
Per Future type, not instance.
but i really do not understand why this code ever needed
So I can see which Future types are used more often. In NBC's case, it's those produced by readExactly() and write(), peaking at 140-180 per second, every few seconds.
It shows inefficient internal reads and writes, because they correspond to peaks of 20 OS read events and 50 write events per second.
You mean these objects you keep in a table inside the global dispatcher threadvar? nim-chronos/chronos/asyncloop.nim Line 198 in 7ed9f14
Why resort to that ugly hack when I have proper metrics available, complete with atomic operations (those that confused you when you started commenting that metrics should not be global). |
Our experience as C/C++ programmers has thought us that a library without dependencies is something to strive for, a virtue. But why is this the case? One reason is that unfocused dependencies can introduce code bloat (e.g. you need a portable tray icon and you end up adding Qt to your project). But the more probable reason is that in C and C++, integrating dependencies in your build can be quite painful. In Nim, both of these problems can be avoided if we strive to have small and granular dependencies, especially when we are building all of the libraries ourselves. The alternative to having dependencies is to re-implement the same functionality in multiple places. These end up linked in the same binary, leading to an unnecessary increase of the final code size (both for humans to review and for machines to load in caches). Reusing the same components also changes the economics of development. It's unlikely that we'll ever have the time to work on the Chronos internal metrics, but we might invest in creating real-time visualisation and analysis tools for nim-metrics if it's used everywhere in our codebases. The smaller and more focused our libraries are, the better. With all this said, there will be a problem with a dependency to nim-metrics here. To avoid the cyclic graph, we'll need to have a separate package called |
@stefantalpalaru because dispatchers will run in per-thread mode, there will be no global dispatcher which will cover all the threads, so there no reason to use atomics and globals. @zah because its dependency which perform |
So in future when |
You support that right now.
Every 50 calls, because of
There's no harm either.
How? |
nim-metrics allows you to instantiate the counters as fields within objects as well. Saying that it's just about When the responsibilities are cleanly divided between different libraries, we can invest the time to perfect them. If you have some ideas how nim-metrics can be made less intrusive, go ahead and suggest changes to the API - let's work towards creating the best possible metrics library. |
@zah i dont think discussing Right now, if we going to accept this PR we are including low-quiality web server into low-level networking framework. So question is does "BSD sockets" library must have kernel space web server just to allow you to monitor some metrics? I think approach should be different. Chronos itself is able to perform Second problem is that all |
You should read the documentation: https://github.com/status-im/nim-metrics/#nim-metrics
Here's the diff, show me the web server that gets included into Chronos: https://github.com/status-im/nim-chronos/pull/85/files
No, of course it doesn't create any problem. GC safety is already guaranteed by nim-metrics.
How? How do I get the rate of Future callbacks executed for a specific Future type out of Chronos right now? What's the API for that? Is that data available now, or a couple of years in the future? |
The web server component in nim-metrics is already optional. Merely using metrics in some project doesn't create any security risks. Furthermore, the spirit of the library is that all monitoring should be optional and I suggested that we should disable it by default here. |
@zah right now i see changes in scheduler and this is not acceptable. It seriously affects performance and create single lock sync problems, if |
@stefantalpalaru's argument is that the performance impact is not severe, because there is a notion of sampling (you run the measuring code every N-th poll, where Furthermore, the performance impact will be zero if the feature is disabled by default which is something I've suggested. |
@zah and every N-th poll you will perform sort operation of thousands Future[T] identifiers just to get metrics information which is not always useful, and after that second thread will perform sort operation of equal dataset. So we are going to freeze all our dispatchers for some time controlled by |
As i said this metrics should be carefully applied using internal chronos mechanisms, after that we can make a method to transfer this values to any other compatible metrics framework. |
01a0328
to
81fd5cc
Compare
325f308
to
4a1be99
Compare
While having metrics from Chronos is an extremely valuable addition to NBC, this PR has been merged without establishing a consensus within the team which is something we should strive to avoid. Even more relaxed practices such as rough consensus would have been violated, so the best way forward is to revert the change as soon as possible and continue the discussion from there. The concerns of both sides are valid:
The right course of action would have been improving the PR until there are no more blocking objections. Through the use of instance-bound counters (either coming from nim-metrics or through regular variables), the undesired locking would have been avoided and this was a clear direction for how the PR could have been improved. That said, I would also point out that progress requires both sides to cooperate. This PR has been lingering for too long and we've been deprived from the benefits it brings for no good reason (forcing our hand a little bit to merge it in the current emergency). The metrics library is envisioned to become a very comprehensive solution with instance-bound counters offering fine-grained compile-time configuration mechanisms that are unpractical to develop internally in Chronos and thus it can make Chronos a better library in the long-term. The presented objections towards having it as dependency has been largely addressed (e.g. It was already pointed out that the server components there are optional) and I wish @cheatfate would have more good will in general towards the nim-metrics vision shared by the team. |
Sorry, but i can't see how i can cooperate if people totally ignoring my concerns. You said this metrics are very valuable asset, but i never seen that something got improved using metrics. So the only way when metrics can be used as a tool to find and fix something is when you put metric "file_xxx_line_yyy_metric_zzz" at every line of your code. With current approach the only way of using metrics is to find correlation between one even and another event... What i can say if somebody wants to see such correlations it means there no understanding on how code is working... For example i can't understand what number of Poor understanding of how |
@cheatfate, it's very hard to argue that having metrics is not useful. Would you claim that you were never surprised by a result we obtained from metrics? As scientists often say, many discoveries don't come from deductions, but rather from observations that makes you say "Hmm, that's strange". |
@zah sorry but i have never been surprised by metrics, usually i have mentioned problems highlighted by metrics many months before it was highlighted by metrics. Also |
This is not true and it's an example for the lack of cooperation that I'm highlighting. The API already allows you to use the counters as fields inside objects and in the future this will get even better because it will be possible to use a compile-time switch that automatically removes the field of a particular metric and all the corresponding updates to it. This is the kind of capability that takes some effort to perfect in a library like nim-metrics and that doesn't make sense for other libraries to re-invent. Give the other side the benefit of the doubt, work with them towards improving the API instead of claiming it's just not good and treating it as if it was developed by a third party team that you cannot influence. |
@zah if its true why every possible interaction with |
Libp2p developers jumping through hoops to use the Chronos branch with metrics counts as a consensus where it matters.
I disagree. I see a very useful feature being blocked for almost 5 months based on non-technical reasons. If anything, we need to stop driving with this handbrake on. If you hold some fringe beliefs about real programmers debugging without instrumentation, you don't get to veto features useful to us - lesser programmers doing the actual debugging. The feature gets merged sooner, rather than later and whenever you manage to reimplement those metrics you say you understand a lot better, you can make another PR. This policy is based on common sense and it would allow us to iterate faster, which is very important in this alpha stage. |
This reverts commit e45ef32. Metrics implemented this way, with a lock inside the otherwise tight event loop are not consistent with the chronos architecture that for good or bad uses thread local variables to avoid them - the solution does not have rough consensus behind it, and other avenues should be explored for this generally useful functionality.
This reverts commit e45ef32. Metrics implemented this way, with a lock inside the otherwise tight event loop are not consistent with the chronos architecture that for good or bad uses thread local variables to avoid them - the solution does not have rough consensus behind it, and other avenues should be explored for this generally useful functionality.
This reverts commit e45ef32. Metrics implemented this way, with a lock inside the otherwise tight event loop are not consistent with the chronos architecture that for good or bad uses thread local variables to avoid them - the solution does not have rough consensus behind it, and other avenues should be explored for this generally useful functionality.
You can see these metrics in action inside a nim-beacon-chain repo, by running
make NIMFLAGS="-d:insecure" eth2_network_simulation
and, a minute later in another terminal,cd tests/simulation/prometheus; prometheus
.Now you can go to this address in your browser and see something like this: