-
Notifications
You must be signed in to change notification settings - Fork 250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Precise per-component ETH-denominated rewards tracking in ncli_db validatorDb
#3107
Conversation
…lidatorDb` Also adds a Jupyter notebook for producing validator performance reports
@@ -76,7 +76,7 @@ type | |||
## database - this may have a number of "natural" causes such as switching | |||
## between different versions of the client and accidentally using an old | |||
## database. | |||
db: SqStoreRef | |||
db*: SqStoreRef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should not be exposed - expose a readOnly
function instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and add a test - the db has fairly extensive coverage and should have coverage for this as well
@@ -76,7 +76,7 @@ type | |||
## database - this may have a number of "natural" causes such as switching | |||
## between different versions of the client and accidentally using an old | |||
## database. | |||
db: SqStoreRef | |||
db*: SqStoreRef |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and add a test - the db has fairly extensive coverage and should have coverage for this as well
@@ -606,7 +640,9 @@ type SomeMergeBlock = | |||
proc process_block*( | |||
cfg: RuntimeConfig, | |||
state: var merge.BeaconState, blck: SomeMergeBlock, flags: UpdateFlags, | |||
cache: var StateCache): Result[void, cstring] {.nbench.}= | |||
cache: var StateCache, | |||
outSlotRewards: SlotRewards | type DontTrackRewards = DontTrackRewards): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this design puts too much noise and responsibilities into the state transition function - what should be done instead is to follow the info
model and extract a minimal set of information that's computed anyway, then do the math outside (in ncli_db
in this case).
Like this, it becomes hard to audit the state transition for its main purpose - ncli_db is not a main purpose so we design accordingly and don't invade the state transition for it.
Also, many of the same computations are needed for the val-mon branch - the aim is thus to find an architecture that allows both these features extract information with minimal impact to "normal" state transitioning and without magic compile-time modes that are only used sometimes.
Finally, like this, upstream test suite that verifies that computations are done correctly doesn't cover the numbers - in EpochInfo
, the idea is that the cache that's built up to add up the rewards and penalties internally in the state transition is also used outside - this ensures that the exported information is as accurate as the one used inside.
To summarize: what I see should happen here is the introduction of a BlockInfo
(it's not a SlotInfo
- slots don't carry rewards) object that not only gets exported but also used as part of the "normal" flow of the function. Further, where possible, the function should not export computed numbers - instead, it should export the minimal and orthogonal information needed to recreate the computations, because the context in which the numbers get used may vary - that means for example not summing rewards and penalities, and where possible, exporting flags and base rewards instead of final numbers. Ideally we would get rid of the tract/don't track difference because the impact should, be minimal anyway - the way we test any change to the spec is using the db benchmarking tool for the last 50k mainnet blocks, like so: #3089 - the same should be done here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried briefly to redo the calculations in ncli_db
, but this lead to significant duplication of the logic in the spec that would have been quite fragile and difficult to maintain in the future (not to mention that it was not trivial to get right in the first place). Having "too much noise" is a much smaller problem in my opinion.
Comparing the new features to the existing EpochInfo
functionality is not quite right, because in ncli_db
we replace EpochInfo
with DetailedEpochInfo
(which carries more fields that have to be updated in more places and summed up at the end). Are you suggesting that DetailedEpochInfo
should be the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running the spec benchmark should be unnecessary right now, because the new code is not compiled in the normal spec execution flow. The added DontTrackRewards
type parameter is erased away form the final run-time code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm suggesting to replace it with a single implementation that conveys all necessary information and benchmark that.
Further down, I also argue that separating the concerns and minimising the information transfer will give design advantages because it becomes more clear what information is simple (to someone that understands the reward code) to compute, and what information indeed needs complex data to represent, which in turn will inform a better database design.
I'm not at all comfortable with the idea of having parts of ncli_db in here - the minor duplication of logic is the lesser evil because the spec code doesn't change much except when the spec changes, while ncli_db should be quick and easy to modify - from a higher level, the spec code is not that difficult either - there are a few rewards, and you either get them or not - the most complex piece in here is the base reward - once you have that, you simply split it up between the various categories of rewards in a weighted fashion.
When debugging and reasoning about what the spec code does, this kind of code is a major distraction and obstacle, in a moment when fast and accurate understanding is critical and time matters.
On the other hand, updating ncli_db is never critical and the way we use it changes over time - we should not introduce barriers to making changes in ncli_db, ergo we put the logic there, separately, to do exactly what ncli_db needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer as here:
#3107 (comment)
@@ -483,6 +483,54 @@ type | |||
|
|||
flags*: set[RewardFlags] | |||
|
|||
DetailedRewardsAndPenalties* = object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from what I can tell, most of this information is already present in EpochInfo
, albeit in compressed form - same as a comment below, we should instead extract raw, minimal and orthogonal information from here as part of the "normal" flow for the state transition, then do any computations that ncli_db needs in ncli_db instead of pollution the state transition with pure ncli_db concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compression in EpochInfo
is a lossy one though, which is the reason it needs the unified delta
value for each validator. I didn't want to increase the memory usage of the normal state transition by introducing multiple per-component deltas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can fix both problems by exporting building blocks instead of computed data ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion is that this is too complicated and can be explored in future PRs. I can perhaps put online the alternative code that I abandoned as a start of such an effort.
A defining feature of the current approach is that we currently have assertions that the component-wise balances match the actual balance changes that end up in the BeaconState
. This really requires you to replicate the logic of every single source of balance change in ncli_db
which will results in quite a lot of code that is much harder to maintain. Only the fact that the spec is not really expected to change provides some justification for such a heavy duplication of delicate logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can perhaps put online the alternative code that I abandoned as a start of such an effort.
let's start there - an actual example will make it easier to judge.
That said, from the point of view of ncli_db, of course it makes sense to put the code here - the sticking point is that ncli_db is not a priority here, and that as far as possible, code for it should be kept out of here. This was a distinctive and clear priority in the architecture of the original solution and division of labour, both when writing the spec code and when creating ncli_db and its validator performance analysis features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alternative code is not finished though and I'm personally not convinced that spending more time on it is beneficial. If you strongly disagree with this and you are willing to take over the maintenance of the more complicated version of ncli_db
, then fishing this code on your own in a separate effort makes most sense to me.
@@ -567,8 +583,11 @@ func get_inactivity_penalty_delta*(validator: RewardStatus, | |||
|
|||
delta | |||
|
|||
func get_outcome(delta: RewardDelta): int64 = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functions like this, which introduce significant risk of overflow Defect
:s, should be kept out of the state transition codebase entirely - from what I can tell, it's not needed or used for any "ordinary" state transition flow - in the spec, we use either saturating, explicitly failing, or in rare cases, overflowing math, but never Defect, as far as is humanely possible, with nim injecting defects left and right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can be removed since it is trivial and indeed used only in the reward tracking branches of the code, but I don't think it can overflow considering the maximum rewards and penalties are bounded and far below the int64
capacity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but when it's here, it's easy to call by accident - it also is distracting because during review, it's one more thing to consider from a security pov - we've had enough bugs of this nature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's limited to the RewardDelta
type. How can you construct overflowing instances of this type by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by copy-pasting the poor programming style that converts uint64 to int64 without range-checking, on the assumption that the caller has done it already - a comment would perhaps solve it, but nevertheless, the original argument stands: the code in spec
implements the python spec as closely as possible, using the conventions of the spec, as closely as possible - for better or worse, the spec uses saturating and/or overflowing logic and if we can stick to it, it's easier to reason about.
target_attester INTEGER NOT NULL, | ||
head_attester INTEGER NOT NULL, | ||
source_outcome INTEGER NOT NULL, | ||
max_source_reward INTEGER NOT NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the database is already quite heavy - this looks like it's storing redundant information that trivially could be computed instead - it would probably net a significant payoff in terms of performance and size to not store these redundant records
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify which information you consider trivially computable or redundant? The values that end up here could be unique per validator and and they depend on a number of epoch properties.
Where should the computation happen? The python code in the Jupyter notebook should replicate the reward computation logic of the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine storing the base reward per epoch + validator flags (one int) per validator is quite a lot more efficient. the reason I care is that I run this analysis for all validators to get baselines and other statistic. I initially ran it for all "perfect" records as well but this completely blew up the database.
the computation can be done either in the database with a view or indeed in jupyter (potentially by importing the pyspec) - the functions are mostly trivial.
that said, this doesn't really matter for this iteration of the PR - it can be adjusted later - I do think that it's important to remove the computations from the state transition itself and redo them in ncli_db directly to whatever format and information ncli_db needs/wants - when designing the entire functionality it that way, it will also become immediately clear what the minimal information set in the database is, and how far we should take the normalization of the data therein.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer as here:
#3107 (comment)
Indeed, the produced database is huge, but why is this a problem? It's easy to get enough storage on a machine where you intend to run the tool. Considering that updates to the database are incremental, speed is not much of a problem either.
## `RewardStatus` augmented with detailed `RewardInfo`. | ||
detailed_info*: DetailedRewardsAndPenalties | ||
|
||
SlotRewards* = ref object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this ref?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nim has an issue with var
types that appear in the branches of or
types. When you have a parameter of a function with such an or
type, the parameter symbol ends up being treated as a non-var within the body of the function even when the var
branch is instantiated. I should have added a TODO comment to explain this indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like the lesser evil still is to do var (SR | DT)
then - the conventions help significantly when deciphering the state transition - what is safe and what is not safe to do with the various helper functions - in fact, I wish nim had call-site annotations for mutability, so you'd have to call with f(a, var b)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var (A | B)
is not possible here because B
is a type parameter:
outSlotRewards: SlotRewards | type DontTrackRewards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the kind of trickery that I very much like to avoid in spec code.
the feature / corner case was too complicated for the person who introduced it in the compiler to implement that they couldn't do it well - it's not a good sign in general that people that read the code will understand what's going on, without relying on complex and rarely used language details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's complex to understand precisely? You can claim that it's unfamiliar, sure, but claiming that it is complex is wildly exaggerated.
Also, in your accusations, you often confuse "complicated for the person who introduced it in the compiler to implement" with "the person who implemented in the compiler had very limited development time and wasn't able to follow the perfect Q/A process".
By paying the price of using a less familiar trick, you get to follow the zero overhead principle - you pay nothing for a feature that you don't use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the point I'm trying to make: it's not zero-overhead - there's a significant cognitive overhead of sorting out the cleverness and filtering out the noise when reading the code, which is something we do a lot when reverse engineering problems.
The approach of making DontTrackRewards
a var and not break convention is also zero-overhead in terms of execution time - compilers automatically remove unused arguments from the function signature (under the right conditions)
The approach of not making these clever things at all and simply check the performance with a benchmark goes even further: it makes the argument that if the overhead cannot be detected in real life, it's not there, and the complexity is actually unwarranted, no matter what theory you had in mind when writing the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
significant cognitive overhead
I find this to be an exaggerated claim. Benchmarking the code, measuring the effects on compilation time, making sure that there won't be any regression in your results in the future (e.g. due to changes in the used compilers), etc is more development overhead and technical debt than learning a new Nim trick that everybody can benefit from in other situations.
@@ -255,7 +259,9 @@ proc state_transition_block_aux( | |||
altair.SigVerifiedSignedBeaconBlock | altair.TrustedSignedBeaconBlock | | |||
merge.TrustedSignedBeaconBlock | merge.SigVerifiedSignedBeaconBlock | | |||
merge.SignedBeaconBlock, | |||
cache: var StateCache, flags: UpdateFlags): bool {.nbench.} = | |||
cache: var StateCache, flags: UpdateFlags, | |||
outSlotRewards: SlotRewards | type DontTrackRewards = DontTrackRewards): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throughout, we use var
to signify mutability instead of hungarian naming - we don't use ref
unless the function takes a copy of the reference for some reason, which these functions don't do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explained here:
#3107 (comment)
# https://github.com/ethereum/consensus-specs/blob/v1.1.5/specs/altair/beacon-chain.md#modified-slash_validator | ||
# https://github.com/ethereum/consensus-specs/blob/v1.1.4/specs/merge/beacon-chain.md#modified-slash_validator | ||
# Extracted from `slash_validator` in order to reuse in ncli_db for detailed rewards tracking | ||
proc get_slashing_penalty*( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could simply use overloading, which is what most of the state transition uses - it's easier to recognise, generally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code where this was extracted from was not using overloading. In general, I disagree that overloading is superior to a series of when
statements unless you are designing a library aiming to take advantage of the open-closed principle (which is not the case here). Overloading leads to inferior error messages, it has more compilation failure modes and it's more expensive for the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my point was mostly about conventions, and how we can make this new function stand out less - the conventions help guide the reader with less effort than a custom-written when jungle that breaks the convention (for example because it's easier for tooling to understand it) - in this particular case, it's trivial either way - in the larger functions, making sense of a when jungle with the extra indentation it brings is difficult, so overloading was chosen as the dominant strategy, and it has worked pretty well so far as far as I can tell, including when deciphering error messages.
the error messages are perhaps more relevant when you've created a complex and deep type machinery with many layers of generics and significant risk of having the wrong functions being called accidentally, like happens with nim-serialization - this code however is as simple as it gets, and it pays off, error messages included.
these style questions are a bit like colors, everyone has a favorite - mixing them all in one place however rarely ends up well (unless you like brown).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the convention is wrong, let's fix the convention.
Also, the convention you are describing is hardly followed in the spec code. After all, I've just copy-pasted the exact body of this function which was inlined in another spec function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, yes, you copied an exception - I mentioned in the review how it can be brought in line with the rest, but also noted it's a minor point.
if you want to change the convention, the way to go about it would be through a separate PR that changes the convention across the board - I suspect it will be ugly mainly because it's not enforced by a language rule but rather by the whims of whoever writes the code, which is hard to maintain at a good level.
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've come up with few more arguments against unnecessary use of overloading in general.
My primary way of navigating the project is through the "Jump to function name" functionality provided by Vim's CTRL+P plugin and its ctags integration. It looks like this:
... but you get to jump directly to function names. Needless to say, this workflow works much better when no overloading is present.
Your assertion that the error messages are just fine also doesn't match my experience. During the development of this feature I've made some trivial typos and I had to face some pretty bad error messages due to the heavy overloading of process_block
.
In general, if we follow the general philosophy that the least powerful tool should be used to solve any particular problem, I think there is a clear gradient of power/complexity that goes like this:
- Function are the most simple tool
- Templates are needed sometimes to produce better code or abstract away control flow constructs
- Closed sets of alternatives are better handled with case objects or a series of when statements
- Open sets of alternatives need to be handled with overloading or methods (dynamic dispatch is considered more complex than static dispatch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, Github doesn't seem to like uploading gifs. Here is the Ctrl+P animation I wanted to share:
https://miro.medium.com/max/1400/1*-DMqWWZ_btpEiyNgKpc8NQ.gif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jump to function name
well, with a when jungle, you're jumping to the beginning of a potentially long tree of complex statements that you manually have to disambiguate - first, you have to remember what type you were jumping from (which likely was deduced as well), then find the right when clause by executing the type comparison code in your head and so on, and there is no hope that a tool will ever understand this.
With overloading, a tool like nimsuggest can understand that it's supposed to make a contextual jump and gets you to the right spot without the extra processing that you, the developer, have to do.
when statements are free-form - they provide no constraining structure that you can learn - at best, you can try to write a style guide with conventions, but that's generally not good enough for tooling, or reading by experience.
For anything larger than a few lines of code, the extra level of indent also makes a significant dent in readability - you don't know where the phase0 logic vs the altair logic begins and ends. Soon you end up with a fairly convoluted structure: a manually written when jungle that forwards calls to an actual implementation that has non-overloaded names.
All that said, more important than the particular style is the fact that having two or more styles for this problem brings out all the negatives of either approach, but none of the benefits - you end up having to deal with both styles and nobody is happy - this is what my comment primarily was about: most of the spec uses overloading to solve this problem and this new function represents an opportunity to align the code with that convention simply based on a majority decision. It's a completely different argument whether the convention should be changed or not, but no matter what your opinion is on this subject, I think we can agree that this PR is not the time and place for making that change.
The function previously used a when jungle because it had a "shared" section - the new function has no shared code - this difference gives the overload argument more weight. If we really want to make this piece of code clean, we'd simply select the constant based on type using a much smaller helper function, then make a generic function with the division logic.
Ah, the joy of arguing code style philosophy over morning ☕ :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the Ctrl+P animation I wanted to share:
very ambitious - however, I'd recommend picking a language that has overloading - free choice makes for a much more compelling argument 😛
This is an alternative take on #3107 that aims for more minimal interventions in the spec modules at the expense of duplicating more of the spec logic in ncli_db.
This is an alternative take on #3107 that aims for more minimal interventions in the spec modules at the expense of duplicating more of the spec logic in ncli_db.
657f9d5
to
a4667d1
Compare
This is an alternative take on #3107 that aims for more minimal interventions in the spec modules at the expense of duplicating more of the spec logic in ncli_db.
This is an alternative take on #3107 that aims for more minimal interventions in the spec modules at the expense of duplicating more of the spec logic in ncli_db.
Also adds a Jupyter notebook for producing validator performance reports