Safe Cron #493

ZenGround0 · 2022-10-19T19:09:26Z

ZenGround0
Oct 19, 2022
Collaborator

This is a sketch, not a complete design.

Cron Today

Cron is a convenient pattern for system actors to schedule work at a particular epoch. Cron performs system critical functions, and in some cases these must execute without error or the system will halt. Today cron does three major things:
A. ProvingDeadline part 1. Charge faults if a post is not submitted during a proving deadline
B. ProvingDeadline part 2. Remove state for sectors that are either expiring naturally, terminated explicitly or have been faulty for too long
C. Batch up ProveCommitSector PoReps, actually run the proofs and add the sectors to miner state

There are other things but they are either not critical for not breaking the network (precommit expiry, worker key changing) or present no scaling problems (updates to singleton power and reward actors), or are market cron which I'm not investigating here.

Cron cost model

Apart from C there is no gas accounting tied up with cron messages. It is an assumption that all required system crons can run to completion in a small enough time to not violate block time guarantees. We use a syscall to charge for C during the actual ProveCommit message.

For A and B, cron looks a lot like an unbounded queue of work, the sort of thing that's a fundamental problem. However, the work for these cron jobs scales with the number of proven sectors on the network, and on closer inspection that number is bounded by a constant: the steady state carrying capacity reached when sectors are added as fast as possible and are constantly expiring. Steady state carrying capacity is influenced by sector residence time and maximal onboarding rates. Current estimates for Filecoin are ~700 EiB to ~1500 EiB.
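One rough way to read the steady-state argument is through Little's law: the amount of storage resident on the network at steady state is the onboarding rate times the residence time. The symbols below are introduced here for illustration and do not appear in the original post.

```latex
C_{\text{steady}} = r_{\max} \cdot T_{\text{res}}
```

Here $r_{\max}$ is the maximal onboarding rate (for example in EiB per day) and $T_{\text{res}}$ is the mean sector residence time (in days). Under this reading, cron work for A and B is bounded by a constant proportional to $C_{\text{steady}}$, and lowering either factor lowers the bound.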

Problems lurking today

Thankfully there are no acute problems.

User defined contracts want to make use of the cron pattern. Today that would break the network.
As the network grows it is unclear that we have enough block time to support A and B for all sectors. There is no safeguard to prevent the network from growing past the point where the 30s blocktime is violated by long cron jobs. We have no bounds on cron execution time.
There's likely a significant difference between worst case and average case total cron time for an SP of a given storage power. Inefficient partition packing (i.e. no compaction) leads to more cron work for fewer sectors. This needs investigation and actual data. There is no incentive for individual operators to optimize their deadline costs because these costs are incurred on the whole system, not on the operator.
Proving deadline programming errors render miner actors useless, a catastrophic and essentially unrecoverable (needs network upgrade) error. This was encountered during space race; thankfully it has not been observed on mainnet. The system is not robust to errors here, so changes to this code carry a high risk.

Just remove cron?

C is easy to move away from the cron model; this is essentially the point of hyperdrive. There is some verification/proving cost increase for batches of 2-4 sectors, where the current model is cheaper. It seems trivial to modify prove commit methods to take batches explicitly and avoid cron altogether.
A is difficult to move from the cron model. To maintain network security guarantees, the system needs to check that PoSts were submitted, and deduct penalties for PoSts not submitted, for every active deadline.
B is less difficult but not easy to move away from cron. There is an incentive for SPs to explicitly expire their sectors to unlock pledge. But this gets tricky when comparing termination fee losses to pledge gains when the sector is terminated. There is also the termination signal to the market to consider, though this could probably be worked around with some effort.

Safe Cron

Summary

The cron actor can account for subcall gas, catch out of gas errors and limit cron job execution just like the vm.
All system actor cron jobs use this model. The "free gas" given to system actors today is made explicit as a protocol parameter.
If system cron jobs exceed gas they fail (safely) and miner operators must fix their state with explicit calls to the network costing gas. This incentivizes operators to keep their cron costs down.

Optionally this system can also be used by user actors. Cron would prioritize all system jobs with free gas and then with any remaining room do message scheduling just like block producers to select cron jobs. User actors would only have the guarantee "executes after this epoch" never "executes on this epoch".
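The summary above can be sketched roughly as follows. This is a toy model, not FVM code: `GasMeter`, `CronJob`, `run_cron` and the fee-based ordering of user jobs are all hypothetical names and choices made for illustration, and state rollback on failure is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class OutOfGas(Exception):
    """Raised when a job tries to use more gas than its limit."""


class GasMeter:
    """Hypothetical per-job gas meter, standing in for VM gas accounting."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, amount: int) -> None:
        if self.used + amount > self.limit:
            self.used = self.limit  # the whole limit is consumed on failure
            raise OutOfGas()
        self.used += amount


@dataclass
class CronJob:
    name: str
    gas_limit: int                   # free gas for system jobs, paid gas for user jobs
    run: Callable[[GasMeter], None]
    is_system: bool = True
    not_before: int = 0              # user jobs: "executes after this epoch"
    fee_per_gas: int = 0             # user jobs: used only to order selection


def run_cron(epoch: int, jobs: List[CronJob], cron_gas_budget: int) -> Dict[str, str]:
    """Run all system jobs first, then fit eligible user jobs into what is left."""
    results: Dict[str, str] = {}
    remaining = cron_gas_budget
    system = [j for j in jobs if j.is_system]
    user = sorted(
        (j for j in jobs if not j.is_system and j.not_before <= epoch),
        key=lambda j: j.fee_per_gas,
        reverse=True,
    )
    for job in system + user:
        if not job.is_system and job.gas_limit > remaining:
            results[job.name] = "deferred"  # stays queued for a later epoch
            continue
        meter = GasMeter(job.gas_limit)
        try:
            job.run(meter)
            results[job.name] = "ok"
        except OutOfGas:
            results[job.name] = "out_of_gas"  # fails safely; cron keeps going
        remaining = max(0, remaining - meter.used)
    return results
```

In this sketch a system job that exceeds its free gas is marked `out_of_gas` without halting the loop, and a user job that does not fit is deferred rather than failed, which matches the "executes after this epoch" guarantee.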

Gas handling within cron

I suspect there is a way to do this with new syscalls, but I do not understand fvm gas charging well enough to know whether it is feasible, in the current or future iterations of fvm, for an actor to do vm style execution that stops when gas is exceeded.

Free gas protocol parameter

Set so that cron of all data at carrying capacity does not exceed the allocated chunk of protocol blocktime for cron
Set based on number of sectors or partitions. A deadline with more work to do gets more free gas.
Set so that a "reasonably efficiently compacted" deadline can execute with only free gas in "the worst case"

Setting these parameters is likely to be quite difficult and need extensive analysis. Just getting some empirical averages won't be enough.
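As a minimal sketch of how the three criteria could fit together, assume free gas is granted per sector count as if the deadline were perfectly compacted. All function names, parameters and numbers here are hypothetical; real values would have to come from the analysis described above.

```python
import math
from typing import List


def free_gas_for_deadline(
    live_sectors: int,
    sectors_per_partition: int,
    gas_per_partition: int,
    base_gas: int,
) -> int:
    """Free gas for one deadline, sized for an ideally compacted layout.

    A deadline with more sectors gets more free gas, but a deadline that
    spreads the same sectors over extra partitions gets no extra allowance.
    """
    if live_sectors == 0:
        return 0
    ideal_partitions = math.ceil(live_sectors / sectors_per_partition)
    return base_gas + ideal_partitions * gas_per_partition


def within_cron_budget(
    deadline_sector_counts: List[int],
    cron_gas_budget: int,
    sectors_per_partition: int,
    gas_per_partition: int,
    base_gas: int,
) -> bool:
    """Check that all deadlines due in one epoch fit the gas set aside for cron."""
    total = sum(
        free_gas_for_deadline(n, sectors_per_partition, gas_per_partition, base_gas)
        for n in deadline_sector_counts
    )
    return total <= cron_gas_budget
```

Running `within_cron_budget` against sector counts at carrying capacity would be one way to check the first criterion; basing the allowance on sectors rather than actual partitions is one possible choice for keeping the compaction incentive.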

Safe HandleProvingDeadline

Probably the most challenging part of this proposal is to redesign miner deadlines so that they can freeze and be resurrected while keeping the system secure and fair to storage providers. This is a sketch of one direction. A lot of issues are not worked out.

Keep alive deadline power

One idea from @Stebalien a while ago is for the power table to require keep alives from cron. A rough sketch along these lines:

On deadline cron power actor queries miner for deadline power and will deduct this power from power table after miner crons run
Successful miner cron finishes by calling power actor and zeroing out the amount of power to deduct
Failed miner cron never calls power actor so this power is deducted

The deduction function could be more permissive removing more and more power for each uncronned deadline, adding a grace period etc.
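The keep-alive flow above can be sketched as a toy model. `PowerTable`, `run_deadline_crons` and their methods are hypothetical names, not the power actor's interface, and the deduction here is the simplest possible one (full deadline power, no grace period).

```python
from typing import Callable, Dict, List, Tuple


class PowerTable:
    """Toy power actor: claimed power plus pending keep-alive deductions."""

    def __init__(self, power: Dict[str, int]) -> None:
        self.power = dict(power)
        self.pending: Dict[str, int] = {}

    def begin_deadline(self, miner: str, deadline_power: int) -> None:
        # Before miner crons run, record the power that will be deducted
        # unless the miner's cron checks in.
        self.pending[miner] = deadline_power

    def keep_alive(self, miner: str) -> None:
        # Called at the end of a successful miner cron.
        self.pending[miner] = 0

    def settle(self) -> None:
        # After miner crons run, deduct whatever was not zeroed out.
        for miner, amount in self.pending.items():
            self.power[miner] = max(0, self.power[miner] - amount)
        self.pending.clear()


def run_deadline_crons(
    table: PowerTable, jobs: List[Tuple[str, int, Callable[[], None]]]
) -> None:
    """jobs holds (miner, deadline_power, cron_fn) for deadlines due this epoch."""
    for miner, deadline_power, _ in jobs:
        table.begin_deadline(miner, deadline_power)
    for miner, _, cron_fn in jobs:
        try:
            cron_fn()
        except Exception:
            continue  # a failed cron never reaches keep_alive
        table.keep_alive(miner)
    table.settle()
```

A more permissive deduction function, as suggested above, would replace the body of `settle` while leaving the keep-alive handshake unchanged.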

Deadline freeze and resurrect

Deadlines that have failed to run HandleProvingDeadline on time are recognized (probably by an epoch flag being set to a value too far in the past). Such deadlines are rendered immutable so that new sector assignment and snap deals updates can no longer happen within them.

To resurrect a frozen deadline, a new method "ResurrectFrozenDeadline" can be run explicitly on any frozen deadline that has not successfully cronned. It runs the regular proving deadline functionality, expiring sectors and deducting faults. The method reactivates the deadline power, unfreezes the deadline, and makes the deadline mutable again. It probably should also actually verify any PoSt proofs that were submitted, to make composing with disputes simple (forbid disputes while the deadline is frozen, check all proofs on resurrection). It likely also requires paying an extra fee as an incentive not to fail cron, though maybe the power deduction and gas fee are enough.

If a deadline remains frozen for many deadlines nothing happens to it. Its sectors and their pledge stay locked up and unproven. Since handle proving deadline is a system function, ResurrectFrozenDeadline should be safe to open up publicly. This way operators who avoid running ResurrectFrozenDeadline for cost reasons don't lock sector infos in the state indefinitely, and market actors waiting on termination signals can be made whole.

One important consequence of these changes is that deadline cron must be decoupled deadline to deadline. The current rescheduling logic would fail the whole miner actor on one cron failure. This design requires adopting the approach of scheduling proving deadlines for each deadline with sectors.
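A minimal sketch of the freeze and resurrect lifecycle, assuming freezing is detected from a "last successful cron" epoch and a hypothetical `max_gap` parameter. The `Deadline` class and its fields are illustrative only; fault penalties, proof verification and fees are left out.

```python
from dataclasses import dataclass, field
from typing import Dict


class FrozenDeadlineError(Exception):
    """Raised when a mutation is attempted on a frozen deadline."""


@dataclass
class Deadline:
    last_cron_epoch: int                                     # last successful HandleProvingDeadline
    sectors: Dict[int, int] = field(default_factory=dict)    # sector id -> expiration epoch

    def is_frozen(self, now: int, max_gap: int) -> bool:
        # Frozen when the epoch flag is too far in the past.
        return now - self.last_cron_epoch > max_gap

    def assign_sector(self, sector_id: int, expiration: int, now: int, max_gap: int) -> None:
        # Frozen deadlines are immutable: no new sector assignment.
        if self.is_frozen(now, max_gap):
            raise FrozenDeadlineError("deadline is frozen")
        self.sectors[sector_id] = expiration

    def handle_proving_deadline(self, now: int) -> None:
        # Stand-in for the regular work: here only expired sectors are dropped.
        self.sectors = {s: exp for s, exp in self.sectors.items() if exp > now}
        self.last_cron_epoch = now

    def resurrect(self, now: int, max_gap: int) -> None:
        # Explicit, gas-paying call that runs the missed work and unfreezes.
        if not self.is_frozen(now, max_gap):
            raise ValueError("deadline is not frozen")
        self.handle_proving_deadline(now)
```

Because each `Deadline` carries its own epoch flag, one deadline freezing does not affect the others, which is the decoupling described above.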

Cron for user actors

There are obviously many possible use cases. One that has been under discussion for a long time by @nikkolasg is the ability to execute messages that are time lock encrypted with DRAND randomness taken from a syscall.

Assumptions

This idea is written under the assumption that cron work is small enough that efficiently compacted miner actors can all schedule ~worst case proving deadline gas for every sector and not exceed the blocktime. It is not obvious that this is true. Learning that it is not true would be quite valuable for network safety under load. Mitigations include changes to proving deadline and sector expiration to reduce cron work, or rate limiting onboarding to reduce carrying capacity.
I'm under the impression that handle proving deadline's timeliness requirement is about getting post proofs proven on time. I'm assuming differences in total storage power and reward are not significant between deadline freeze and resurrection. I'm not aware of other pieces of HandleProvingDeadline that must run at the precise deadline end instead of at a future epoch, but all such instances need to be evaluated.

Just Remove cron?

The sketch of how to make miner actor cron methods work without cron ends up using cron mostly as a convenience. The hard work of making HandleProvingDeadline failures acceptable for system security and also recoverable would also let us take it out of cron entirely. The biggest argument against this is operator cost, both in terms of gas and in terms of having a new "window-post" style message to keep track of. There are also technical difficulties in removing deadline cron. As the design above stands, there is no valid block in which to put a user initiated HandleProvingDeadline message, because a post message could be processed after it, and once the deadline is over the deadline is frozen.

We could make this better by creating some grace period before freeze. We could potentially use the following deadline as the window to call HandleProvingDeadline before your previous deadline gets frozen. We could even tack this onto window post calls if there are sectors in your deadline.

Stebalien · 2022-10-19T20:33:01Z

Stebalien
Oct 19, 2022
Collaborator

The FVM team has discussed a few interesting options here:

Gas futures: One could buy gas from a future block on-chain (with lots of limits and a carefully designed price function). This is a good option for operations that must execute at a specific epoch and have a very predictable gas limit.
On-chain message pool: Literally, a message pool with gas specs etc., but on-chain (where messages pay to enter, and probably expire after some time). This is a good option for operations that don't have to execute at a specific epoch but need to execute "soon".

Both of these solutions are complementary and I'd expect gas futures might be used to trigger more complex operations via an on-chain message pool.

But neither of these approaches is likely right for miner cron. A "safe" HandleProvingDeadline is still my favorite there.

3 replies

anorth Oct 20, 2022
Maintainer

Yes I think there are two separate discussions here. I have also explored the gas futures idea as a path to supplying cron as a service to user actors. This is in contrast to the general theme of "make cron go away" and I think quite promising. But I also think it's less important than the system cron discussion that the OP is mostly focussed on. A possible overlap is that, with a cron service that costs tokens to use, we could possibly rework the miner cron to use that service and pay for it from SP funds.

Thanks for opening this discussion Zen. I suggest focussing this one on the "safe cron" aspect, and deferring "cron for users" to elsewhere.

Note I have a slightly related discussion about fault notifications in #265, in the case of missed Window PoSt. I think this may be necessary for #298, and fault notifications generally a necessary primitive to avoid very expensive querying to determine if things have changed.

ZenGround0 Oct 20, 2022
Collaborator Author

> A possible overlap is that, with a cron service that costs tokens to use, we could possibly rework the miner cron to use that service

Yeah this is why I included user cron discussion here. A nice way to bound cron execution is to meter all cron executions and set an explicit protocol gas limit.

But I agree the first place to focus is making miner cron safe, then we can work on metered cron and run miner cron with explicit bounds.

anorth Oct 20, 2022
Maintainer

Oh we should also reference #242 which is explicitly about user cron

anorth · 2022-10-20T02:11:08Z

anorth
Oct 20, 2022
Maintainer

> feasible in the current or future iterations of fvm for an actor to do vm style execution that stops when gas is exceeded

I would expect this to be quite reasonable to request of the VM a capability to add a gas limit parameter to an internal send (and to know the current gas limit and approximate consumption).

0 replies

jennijuju · 2022-10-20T15:23:05Z

jennijuju
Oct 20, 2022
Maintainer

lotus 🤔 follow ups
@arajasek suggested maybe we can drop miner cron and just use a distributed mechanism..

@magik6k: while designing filecoin back in the day, there was no cron and things were also functional. Specifically, DeclareFaults was introduced into cron only very close to the filecoin mainnet launch. That being said, we might want to consider moving the miner cron jobs to user space:

user declare faults
user properly expires sectors and sets termination
users have the incentive to do given they want the power table to get updated from faults and such

We still need to think more about how these user actions would look, but wanna leave the thoughts here first

0 replies
