Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor Hasher lookup handling in the chiplets bus #360

Closed
grjte opened this issue Aug 8, 2022 · 7 comments
Closed

refactor Hasher lookup handling in the chiplets bus #360

grjte opened this issue Aug 8, 2022 · 7 comments
Assignees
Labels
processor Related to Miden VM processor

Comments

@grjte
Copy link
Contributor

grjte commented Aug 8, 2022

As discussed in #348, the first version of the hasher handling in the chiplets bus was straightforward and left room for several future optimizations.

In particular, the following should be refactored and optimized in the future:

  • how hasher lookup values are computed
  • how lookups are requested from the decoder
  • how lookups are stored for "providing" them to the bus from the hash chiplet module.

Each item is described below. In all cases, there are also inline TODOs in the code with comments.

how hasher lookup values are computed

The hasher state is currently stored in the HasherLookup struct, which makes it heavy and means that other things which contain it become correspondingly much heavier (e.g. ChipletsLookupRow). Similarly, the "next" hasher state is held in the Absorb variant of the HasherLookupContext enum, which has the same knock-on effects. The aforementioned struct & enum are here.

Instead, we could get the state from the trace when the lookup values are included in the $b_{chip}$ column. However, because the trace is in column-major form, this might not be more performant, so any change should be benchmarked.

Edit (tohrnii):
Once we remove the hasher state from HasherLookup, we should refactor the hash_span_block method to avoid doing is_memoized checks multiple times.

how lookups are requested from the decoder

Currently, the entire hash computation is done when the decoder makes its initialization request and all intermediate lookups required for the correctness of $b_{chip}$ are queued. When the decoder needs subsequent lookups (e.g. as it absorbs new operations during RESPAN or when the completes code blocks and needs the return hash), it sends a request, and they are dequeued and sent to the $b_{chip}$ bus.

Instead, it might be better to compute the lookups at the time they are needed. This requires refactoring the decoder and the functions in the hasher more extensively.

See the relevant discussions in the previous PR:

how lookups are stored for "providing" them to the bus from the hash chiplet module.

Currently, all lookups are saved as they are computed during hash computations. At the end, during fill_trace, the hash chiplet iterates through and provides each lookup to the bus $b_{chip}$ one by one.

There are a few different options here:

  1. provide the lookups to the bus as soon as they are computed, instead of saving them and sending them later. This would require refactoring the request/response handling a bit in the chiplet bus module since there is currently an assumption that all requests come first. It's a simple refactor, but it does also mean that the Hash chiplet would work differently from the Bitwise and Memory chiplets
  2. during the fill_trace function, add the lookup "responses" to the chiplet bus in bulk instead of individually.

There are a few relevant comments about this:

@grjte grjte added processor Related to Miden VM processor on hold This issue is blocked or we don't want to start it yet v0.3 labels Aug 8, 2022
@tohrnii tohrnii removed the on hold This issue is blocked or we don't want to start it yet label Oct 4, 2022
@tohrnii tohrnii mentioned this issue Oct 7, 2022
2 tasks
@tohrnii
Copy link
Contributor

tohrnii commented Oct 20, 2022

Regarding computing lookups as the requests are made by the decoder, this is being worked on in PR #436 (the code is WIP and not very readable right now). The issue with this approach is that the context about the blocks is lost for intermediate requests while respan is called for span blocks, and when read_hash_result is called for control blocks and the only place to get this context from is the decoder. There are a few things we need for correct lookups computation when the decoder sends the request.

  1. For span blocks, while absorbing span batches, we need information about whether the batch is the last batch in the span block or not to determine appropriate selectors for the hasher trace.
  2. We also need information about the block hash to know if the batch is part of a memoized span block. This needs to be sent by the decoder along with the lookup request during absorb_span_batch calls. The decoder also needs to send start row address of the trace for this block as well as the index of the batch in the span block to compute the lookup address for memoized blocks since we copy the complete memoized block's trace when init_span_block is called. This can be refactored to avoid needing batch index by copying the memoized trace when the subsequent lookup request is sent by the decoder instead.
  3. Lastly, for control blocks we need information about the response cycle when the request is made during read_hash_result since this information is lost by the hasher. This can be done by returning the response cycle info at the time of hash_control_block is called back to the decoder and then the decoder can send this back again at the time read_hash_result is called.

However, all of this requires the decoder to keep track of details that it ideally shouldn't. The approach we already use is much cleaner in comparison.

@bobbinth What do you think? Should we close this issue without this change or is there a better way to handle this?

@grjte
Copy link
Contributor Author

grjte commented Oct 21, 2022

Currently, the first and third items from the issue description have been completed and we've gotten improvements from these. The final change (how lookups are requested) wouldn't reduce the amount of data we're storing, it would just change the interface so we wouldn't manage these intermediary queued_requests. Given the complications introduced by control block memoization that @tohrnii has pointed out, I'm not sure it's worth it.

@bobbinth
Copy link
Contributor

Sorry for the delay on this (I've been trying to think it through)!

First, I should say that I'm very happy with the results we've gotten so far, and I'm fine with closing this issue as remaining work will give us only marginal benefit. So, we can drop the work done in #436.

Having said that, I think we should keep an open issue for the remaining work (i.e., figuring out how to get rid of queued_requests). While it may not be worth our time now, we may want to come back to this in the future (as I think the current structure is not as intuitive as it could be). And I do think that we could come with with a good solution over time.

I also want to clarify that I don't think we need to change how the hasher actually builds the execution trace - only how lookups are computed. Specifically:

  1. For span blocks, while absorbing span batches, we need information about whether the batch is the last batch in the span block or not to determine appropriate selectors for the hasher trace.

Unless I'm missing something, I don't think computing lookup values is related to setting selectors. The entire trace for a span block would still be computed as a part of the Hasher::hash_span_block(). But this method would not create any lookups. Instead, the HasherLookupContext::Start lookup would be created in Chiplets::hash_span_block(), and all other lookups would be created by Chiplets::absorb_span_batch() and Chiplets::read_hash_result() (though, we'd probably need to have a specialized version of this method for SPAN blocks).

There is a bit more complexity here as Decoder::respan() doesn't have access to the address of the batch being executed but I think it shouldn't be too complicated to provide this info to this method, and once it is there, other parts should be relatively straight-forward (though, it is very possible I'm not taking something into account).

  1. We also need information about the block hash to know if the batch is part of a memoized span block.

Again, I think memoization code could stay as it is now. The only thing that needs to change (in my opinion) is that Hasher would not longer be responsible for computing lookups, and that responsibility would move into hasher-related Chiplets methods.

For control blocks, the solution should be relatively simple:

  1. At the time when control block execution ends, the decoder knows the address of the block (which it received from the hasher at the time when block hash is computed). We would need to update some method signatures in the decoder to propagate this info back to end_join_block(), end_split_block() etc. methods.
  2. We can then pass this address into Chiplets::read_hash_result() method, and that should be enough to compute a lookup for the return value (addr for the lookup would be addr of the block + 7).

So unless I missed something (which is very possible), as far as control blocks go, the changes should be pretty light.

@grjte
Copy link
Contributor Author

grjte commented Oct 24, 2022

If I understand correctly, you're suggesting adding new methods in the chiplets to add the hasher lookups rather than having them be part of the methods control block and span block methods in the hasher module. The decoder would call these methods at the appropriate cycle, and they would be immediately added.

I agree that this should be simpler than the current approach. The only complexity is in providing the needed address for lookups during span blocks. Other control blocks are straightforward, as you said.

I think it makes sense to open a new issue for getting rid of the queued_requests with a sketch of the above method. Then we can close this one and update the tracking PR.

@grjte
Copy link
Contributor Author

grjte commented Oct 24, 2022

@tohrnii is there anything either of us are missing here?

@tohrnii
Copy link
Contributor

tohrnii commented Oct 25, 2022

Sorry for a late response.
@bobbinth Thanks for the clarification. It makes a lot of sense to build the trace first at once and compute the lookups later for span blocks.

Unless I'm missing something, I don't think computing lookup values is related to setting selectors

You are right, it was required with earlier structure as we were building the trace and computing the lookups together as the decoder sends the request.

For control blocks, the solution should be relatively simple.

Thanks. Yeah makes sense. I made a mistake in thinking through this.

I'll close this issue and the draft PR for now and add this as a separate issue for later.

@tohrnii
Copy link
Contributor

tohrnii commented Oct 25, 2022

1st and 3rd items in this issue have been completed by #424 and #430.
2nd item in this issue is being tracked in issue #446.

@tohrnii tohrnii closed this as completed Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor Related to Miden VM processor
Projects
None yet
Development

No branches or pull requests

3 participants