Safe Cron #493
-
The FVM team has discussed a few interesting options here:
Both of these solutions are complementary, and I'd expect gas futures might be used to trigger more complex operations via an on-chain message pool. But neither of these approaches is likely right for miner cron. A "safe" HandleProvingDeadline is still my favorite there.
-
I would expect it to be quite reasonable to request from the VM a capability to add a gas limit parameter to an internal send (and to know the current gas limit and approximate consumption).
-
lotus 🤔 follow ups @magik6k. While designing Filecoin back in the day, there was no cron and things were also functional. Specifically, DeclareFaults was introduced into cron only very close to the Filecoin mainnet launch. That being said, we might want to consider moving the miner cron jobs to user space:
We still need to think more about what these user actions would look like, but I want to leave the thoughts here first.
-
This is a sketch not a complete design
Cron Today
Cron is a convenient pattern for system actors to schedule work at a particular epoch. Cron performs system-critical functions, and in some cases these must execute without an error or the system will halt. Today cron does three major things:
A. ProvingDeadline part 1. Charge faults if a post is not submitted during a proving deadline
B. ProvingDeadline part 2. Remove state for sectors that are either expiring naturally, terminated explicitly or have been faulty for too long
C. Batch up ProveCommitSector PoReps, actually run the proofs and add the sectors to miner state
There are other things but they are either not critical for not breaking the network (precommit expiry, worker key changing) or present no scaling problems (updates to singleton power and reward actors), or are market cron which I'm not investigating here.
Cron cost model
Apart from C there is no gas accounting tied up with cron messages. It is an assumption that all required system crons can execute to completion in a small enough time to not violate block time guarantees. We use a syscall to charge for C during the actual ProveCommit message.
For A and B, cron looks a lot like an unbounded queue of work, the sort of thing that's a fundamental problem. However, the work for these cron jobs scales with the number of proven sectors on the network. On closer inspection, that number is bounded by a constant: the steady state carrying capacity reached when sectors are added as fast as possible and are constantly expiring. Steady state carrying capacity is influenced by sector residence time and maximal onboarding rates. Current estimates for Filecoin are ~700 EiB to ~1500 EiB.
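The steady state argument can be checked with a back-of-the-envelope model: if sectors are onboarded at a fixed maximal rate and each one lives for a fixed residence time, total proven storage converges to rate times residence time. The sketch below is only an illustration of that reasoning. The function names, the fixed-lifetime assumption, and the units are all hypothetical, not network parameters.

```python
from collections import deque


def carrying_capacity(onboarding_rate: float, residence_time: int) -> float:
    """Closed form: at steady state inflow equals outflow, so
    total storage = onboarding rate * residence time."""
    return onboarding_rate * residence_time


def simulate(onboarding_rate: float, residence_time: int, steps: int) -> float:
    """Step-by-step check: each step adds one cohort of storage and
    expires the cohort that has lived for `residence_time` steps."""
    cohorts = deque()
    total = 0.0
    for _ in range(steps):
        cohorts.append(onboarding_rate)
        total += onboarding_rate
        if len(cohorts) > residence_time:
            total -= cohorts.popleft()
    return total
```

Before the first cohort expires the total grows linearly; after that it stays flat at the closed-form value no matter how long the simulation runs, which is why the cron workload for A and B is bounded.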
Problems lurking today
Thankfully there are no acute problems.
Just remove cron?
C is easy to move away from the cron model; this is essentially the point of hyperdrive. There is some verification/proving cost increase for batches of 2-4 sectors, where the current model is cheaper. It seems trivial to modify prove commit methods to take batches explicitly and avoid cron altogether.
A is difficult to move from the cron model. To maintain network security guarantees, the system needs to check that posts were submitted, and deduct penalties for posts not submitted, for every active deadline.
B is less difficult but not easy to move away from cron. There is an incentive for SPs to explicitly expire their sectors to unlock pledge. But this gets tricky when comparing termination fee losses to pledge gains when the sector is terminated. There is also the termination signal to the market to consider, though this could probably be worked around with some effort.
Safe Cron
Summary
The cron actor can account for subcall gas, catch out of gas errors and limit cron job execution just like the vm.
All system actor cron jobs use this model. The "free gas" given to system actors today is made explicit as a protocol parameter.
If system cron jobs exceed gas they fail (safely) and miner operators must fix their state with explicit calls to the network costing gas. This incentivizes operators to keep their cron costs down.
Optionally this system can also be used by user actors. Cron would prioritize all system jobs with free gas and then with any remaining room do message scheduling just like block producers to select cron jobs. User actors would only have the guarantee "executes after this epoch" never "executes on this epoch".
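To make the summary concrete, here is a minimal sketch of a cron loop that bounds each job by a gas limit, catches out of gas failures instead of halting, and fills leftover room with user jobs. Everything in it is an assumption for illustration: `CronJob`, `OutOfGas`, `fixed_cost`, `run_cron`, and the policy of reserving each job's full limit against the budget are hypothetical, and state rollback on failure is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class OutOfGas(Exception):
    """Raised by a job that would exceed the gas limit it was given."""


@dataclass
class CronJob:
    name: str
    run: Callable[[int], int]  # takes a gas limit, returns gas used
    is_system: bool
    gas_limit: int             # free gas for system jobs, prepaid for user jobs


def fixed_cost(cost: int) -> Callable[[int], int]:
    """Hypothetical job body that needs exactly `cost` gas."""
    def run(gas_limit: int) -> int:
        if cost > gas_limit:
            raise OutOfGas(f"needs {cost}, limit {gas_limit}")
        return cost
    return run


def run_bounded(job: CronJob) -> str:
    # A failing job is contained: cron records the failure and moves on.
    try:
        job.run(job.gas_limit)
        return "ok"
    except OutOfGas:
        return "failed"


def run_cron(jobs: List[CronJob], total_gas_budget: int) -> Dict[str, str]:
    results: Dict[str, str] = {}
    remaining = total_gas_budget
    # System jobs always run first, each bounded by its own allowance.
    for job in jobs:
        if job.is_system:
            results[job.name] = run_bounded(job)
            remaining -= job.gas_limit
    # User jobs only run while room is left; the rest wait for a later epoch.
    for job in jobs:
        if not job.is_system:
            if job.gas_limit > remaining:
                results[job.name] = "deferred"
            else:
                results[job.name] = run_bounded(job)
                remaining -= job.gas_limit
    return results
```

A deferred user job matches the "executes after this epoch" guarantee: it is never dropped, it just waits for an epoch with spare room. A failed system job is the case the operator must later repair with an explicit, gas-paying message.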
Gas handling within cron
Though I suspect there is a way to do this with new syscalls, I do not understand fvm gas charging well enough to know whether it is feasible, in the current or future iterations of fvm, for an actor to do vm style execution that stops when gas is exceeded.
Free gas protocol parameter
Set so that cron of all data at carrying capacity does not exceed the allocated chunk of protocol blocktime for cron
Set based on number of sectors or partitions. A deadline with more work to do gets more free gas.
Set so that a "reasonably efficiently compacted" deadline can execute with only free gas in "the worst case"
Setting these parameters is likely to be quite difficult and need extensive analysis. Just getting some empirical averages won't be enough.
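The three constraints above pull against each other, which a small sketch can show. It assumes, hypothetically, that free gas is linear in the number of partitions and that we know how many partitions come due per epoch at carrying capacity. All names and numbers here are illustrative assumptions, not proposed values.

```python
def free_gas_for_deadline(partitions: int, gas_per_partition: int) -> int:
    """A deadline with more partitions gets proportionally more free gas."""
    if partitions < 0:
        raise ValueError("partitions must be non-negative")
    return partitions * gas_per_partition


def max_gas_per_partition(cron_gas_budget: int, partitions_due_per_epoch: int) -> int:
    """Largest per-partition allowance such that every deadline due in one
    epoch at carrying capacity still fits in the cron share of the block."""
    if partitions_due_per_epoch <= 0:
        raise ValueError("partitions_due_per_epoch must be positive")
    return cron_gas_budget // partitions_due_per_epoch


def parameters_feasible(cron_gas_budget: int,
                        partitions_due_per_epoch: int,
                        worst_case_gas_per_partition: int) -> bool:
    """True when a reasonably compacted deadline can finish its worst case
    on free gas alone without cron exceeding its budget at capacity."""
    allowance = max_gas_per_partition(cron_gas_budget, partitions_due_per_epoch)
    return allowance >= worst_case_gas_per_partition
```

If `parameters_feasible` is false for realistic inputs, the constraints cannot all hold at once, and either cron work per partition or carrying capacity has to come down.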
Safe HandleProvingDeadline
Probably the most challenging part of this proposal is to redesign miner deadlines so that they can freeze and be resurrected while keeping the system secure and fair to storage providers. This is a sketch of one direction. A lot of issues are not worked out.
Keep alive deadline power
One idea from @Stebalien a while ago is for the power table to require keep alives from cron. A rough sketch along these lines:
The deduction function could be more permissive removing more and more power for each uncronned deadline, adding a grace period etc.
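One possible shape for such a permissive deduction function is sketched below: nothing is removed during a grace period, then a fixed share of power is removed for each further uncronned deadline until none is left. The function name, the linear schedule, and the default values are assumptions made for illustration, not part of the proposal.

```python
def effective_power(raw_power: int,
                    missed_deadlines: int,
                    grace: int = 1,
                    penalty_percent_per_miss: int = 25) -> int:
    """Power counted in the power table for a deadline that has gone
    `missed_deadlines` consecutive periods without a cron keep alive."""
    if missed_deadlines < 0:
        raise ValueError("missed_deadlines must be non-negative")
    # Misses inside the grace period cost nothing.
    penalised = max(0, missed_deadlines - grace)
    # Each further miss removes another fixed share, floored at zero.
    remaining_percent = max(0, 100 - penalty_percent_per_miss * penalised)
    return raw_power * remaining_percent // 100
```

With the defaults, a deadline keeps full power after one miss, drops to 75% after two, and reaches zero after five. A stricter design would set `grace=0` and `penalty_percent_per_miss=100`.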
Deadline freeze and resurrect
Deadlines that have failed to run HandleProvingDeadline on time are recognized (probably by an epoch flag being set to a value too far in the past). Such deadlines are rendered immutable so that new sector assignment and snap deals updates can no longer happen within them.
To resurrect the frozen deadline, a new method "ResurrectFrozenDeadline" can now be run explicitly on a frozen deadline that has not successfully cronned. It runs the regular proving deadline functionality, expiring sectors and deducting faults. The method reactivates the deadline power, unfreezes the deadline, and makes the deadline mutable again. It probably should actually run any post proofs that were submitted, to make composing with disputing simple (forbid dispute when deadline is frozen, check all proofs on resurrection). It likely also pays out an extra fee to act as an incentive to not fail cron, though maybe the power deduction and gas fee are enough.
If a deadline remains frozen for many deadlines, nothing happens to it. Its sectors and their pledge stay locked up and unproven. Since handle proving deadline is a system function, ResurrectProvingDeadlines should be safe to open up publicly. This way operators who avoid running ResurrectProvingDeadlines for cost reasons don't lock sector infos in the state indefinitely, and market actors waiting on termination signals can be made whole.
One important consequence of these changes is that deadline cron must be decoupled deadline to deadline. The current rescheduling logic would fail the whole miner actor on one cron failure. This design requires adopting the approach of scheduling proving deadlines for each deadline with sectors.
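The freeze and resurrect lifecycle can be sketched as a small per-deadline state machine. This is a toy model under stated assumptions: the `Deadline` class, its fields, and the rule "frozen once more than one period has passed since the last successful cron" are hypothetical, and fault deduction, expiration, and proof checking are reduced to a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


class DeadlineFrozen(Exception):
    """Raised when a mutation is attempted on a frozen deadline."""


@dataclass
class Deadline:
    last_cron_epoch: int  # epoch flag set by the last successful cron run
    period: int           # epochs between scheduled runs for this deadline
    sectors: List[int] = field(default_factory=list)

    def is_frozen(self, now: int) -> bool:
        # Frozen when the epoch flag is too far in the past.
        return now - self.last_cron_epoch > self.period

    def assign_sector(self, sector: int, now: int) -> None:
        # Frozen deadlines are immutable: no new sector assignment.
        if self.is_frozen(now):
            raise DeadlineFrozen("deadline must be resurrected first")
        self.sectors.append(sector)

    def handle_proving_deadline(self, now: int) -> None:
        # Placeholder for charging faults and expiring sectors.
        self.last_cron_epoch = now

    def resurrect(self, now: int) -> None:
        # Explicit, gas-paying call that anyone may make on a frozen deadline.
        if not self.is_frozen(now):
            raise ValueError("deadline is not frozen")
        self.handle_proving_deadline(now)
```

Because each `Deadline` carries its own epoch flag, one deadline failing cron freezes only itself, which is the decoupling the last paragraph calls for.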
Cron for user actors
There are obviously many possible use cases. One that has been under discussion for a long time by @nikkolasg is the ability to execute messages that are time lock encrypted with DRAND randomness taken from a syscall.
Assumptions
This idea is written under the assumption that cron work is small enough that efficiently compacted miner actors can all schedule ~worst case proving deadline gas for every sector and not exceed the blocktime. It is not obvious that this is true. Learning that it is not true would be quite valuable for network safety under load. Mitigations include changes to proving deadline and sector expiration to reduce cron work, or rate limiting onboarding to reduce carrying capacity.
I'm under the impression that handle proving deadline's timeliness requirement is about getting post proofs proven on time. I'm assuming differences in total storage power and reward are not significant between deadline freeze and resurrection. I'm not aware of other pieces of HandleProvingDeadline that must run at the precise deadline end instead of at a future epoch, but all such instances need to be evaluated.
Just Remove cron?
The sketch of how to make miner actor cron methods work without cron ends up using cron mostly as a convenience. The hard work of making HandleProvingDeadline failures acceptable for system security and also recoverable would also let us take it out of cron entirely. The biggest argument against this is operator cost both in terms of gas but also in terms of having a new "window-post" style message to keep track of. There are technical difficulties removing deadline cron. As the design above stands there is no valid block to put a user initiated HandleProvingDeadline message because a post message could be processed after it and once the deadline is over the deadline is frozen.
We could make this better by creating some grace period before freeze. We could potentially use the following deadline as the window to call HandleProvingDeadline before your previous deadline gets frozen. We could even tack this onto window post calls if there are sectors in your deadline.