add filecoin commitment merkle root codecs #172
flipped the tag to |
I think I understood the linked Multihash additions and how you could think of the Merkle Tree they use as a "powerful hash function". Though I don't understand this PR. Currently the codec in a CID is used to know how to parse/decode the data that was used to create the hash. It is not for interpreting the data (e.g. GeoJSON and JSON are the same codec). If I apply this to e.g. CommP: there we have some input data, which is just a list of bytes. To create the multihash value, you pad the data, add some zeros, build a merkle tree. You can generate the hash from the initial list of bytes. Now I would say that the underlying data has the |
@vmx my understanding is that the hash they produce isn’t of the original data, it’s of this novel merkle tree built from that data, and the codec describes the block format of that novel merkle tree. They aren’t storing and decoding those blocks by CID, but it is theoretically possible to do so, which is why it makes sense to have a codec. Another way to look at it is this: you could change the algorithm generating the merkle tree (adjust the padding perhaps) and keep the hashing function the same, which would mean that the merkle tree format is now sufficiently different that it would require a different codec but not a different hashing function. If you just use |
I'm dealing with this confusion for BTC at the moment and am trying to draw some graphs to describe what this is actually trying to do for documentation, cause it is confusing. I'll try and get something in here asap that illustrates it. I think it's justifiable, and if it isn't then we have a big problem with all the coins because they do this too. At the base, you have So in terms of utility, you could receive a CID identified with these codecs and know that:
So it works, and it wouldn't be a stretch to tag this If I'm doing my calculations right, then a ~<1Gb CAR file in Filecoin could be describable by 67,108,863 CID:Block pairs--from those 32-byte chunks all the way up to the root CommP. That's the part that makes me question how reasonable this is if "but you can do it" is the best argument. |
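The node count quoted above can be checked with a little arithmetic. This is a minimal sketch, assuming a full binary tree over 32-byte leaves and a payload of exactly 1 GiB (2^30 bytes); `merkle_node_count` is a hypothetical helper, not part of any Filecoin library.

```python
def merkle_node_count(data_size: int, leaf_size: int = 32) -> int:
    """Count every node (leaves included) of a full binary merkle tree.

    Assumes data_size is a power-of-two multiple of leaf_size, which is
    what padding to a power-of-two size would give you.
    """
    if data_size <= 0 or data_size % leaf_size:
        raise ValueError("data_size must be a positive multiple of leaf_size")
    leaves = data_size // leaf_size
    if leaves & (leaves - 1):
        raise ValueError("leaf count must be a power of two")
    # A full binary tree with n leaves has n - 1 interior nodes.
    return 2 * leaves - 1


print(merkle_node_count(2**30))  # 67108863
```

So 2^25 leaves plus 2^25 - 1 interior nodes gives the 67,108,863 figure, one CID:Block pair per node.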
Thanks @rvagg for that great explanation. That really cleared up things for me, now it makes sense to me and I think this justifies being a codec. Though, I think it should be kind of a general |
Wellll that is an option, then it'd force differentiation down to the bottom layer leaves of commp, you just wouldn't be able to receive a CID and know it's a Filecoin thing until you did that digging, but how would you know how or where to get a loader from to navigate? This is a bit of a vexed problem and comes up in BTC and others too - same problem with loaders unless we publish all of these nodes onto the IPFS network and they just magically show up. Maybe this is stretching the purpose of CIDs a bit too much since we need an additional piece of information. For now I think that we're forced to embed that information in the codec so it serves two purposes (1) this is a binary merkle node and (2) it's from within filecoin. Also worth noting here that the binary merkle node won't work for sealed sectors since they have a novel structure, not simply binary. It also may not quite be sufficient for BTC because it uses a hack to make its merkle trees binary, doubling up the last element if there are an odd number of elements at any layer of the merkle. You'd still get a "normal" looking merkle, it's just that some nodes would end up with two of the same links, side by side. It's kind of nice to put BTC's merkle off to the side and say "binary, with an odd-numbered hack", whereas FIL uses "binary, and we make sure it's binary by padding the input data, so we have a lot of zeros." |
I've put some more thought into this and came to a similar conclusion as
Here's my idea on how to justify having a custom codec, while (kind of) keeping the notion of a The So the codec would be something like "do some specified padding in between bytes and at the end of the file and use it as leaves for a merkle tree". We might call that then e.g. The codec and the hashing algorithm would be independent of each other and could independently be swapped with something else. |
I'm not sure we need to go that far even to justify it, because we can say that |
But this way you cannot identify when you hit the leaf nodes. You cannot tell the difference between whether something is raw bytes (a leaf) or a hash (merkle tree node). Or would you say it's fine that the leaf nodes have the same codec as all this is more of a theoretical exercise? Why I want a proper justification is, that I think it is important that we agree on how to stretch the multiformats, so if similar requests from third parties come in, we can more easily accept or reject it. |
go-ipld-btc does it by length, either 64 bytes == node or > 64 == leaf. In the case of CommP and CommD (maybe CommR), it'd be 64 bytes == node or 32 == leaf. A node would return 2 CIDs with the same codec & multihash, a leaf would return the same codec but The reason I'm fine with using length as a determinant is that it's a serialization format and serialization formats use all sorts of signals to determine what they contain, you just hand off to the codec and say "give me what you've got". In the case of go-ipld-btc you hand it your data and it comes back with either a 2-link node or a full transaction. The alternative is to be really specific which could get explosive. I'd probably have to put up the following codecs for the btc stuff I'm working on (I may not be able to avoid some of these).
|
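The length-based signalling described above could look roughly like this. A sketch only: `classify_block` and its return shape are made up for illustration and are not the go-ipld-btc API; the 64/32 rule is the one proposed in this comment for CommP and CommD.

```python
def classify_block(block: bytes):
    """Decide what a block holds purely from its length.

    64 bytes -> an interior node: two 32-byte child digests.
    32 bytes -> a leaf: one 32-byte chunk of base data.
    """
    if len(block) == 64:
        return ("node", block[:32], block[32:])
    if len(block) == 32:
        return ("leaf", block)
    raise ValueError(f"unexpected block length {len(block)}")


kind, left, right = classify_block(bytes(32) + bytes([1]) * 32)
print(kind, left.hex()[:8], right.hex()[:8])
```

The codec never inspects the bytes themselves; the length is the only signal it has.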
I guess that should be Using the size as signalling would be fine with me. Though things won't really work as you cannot retrieve the leaf nodes while traversing the merkle tree. For inner nodes, you can construct a CID to keep traversing. You won't be able to retrieve the leaf nodes as they use a different hash function ( So I don't think the "each merkle tree node is a block"-paradigm would actually be possible. I wonder if my idea of "the whole merkle tree is part of codec" (#172 (comment)) would work. You could even address slices of the input data. |
@rvagg explained to me how this works. Conceptually there will be another layer of blocks between the leaves and the layer above. This layer will contain nodes which contain the hash of a single leaf node. A traversal starting at the merkle root would look like this. The hash algorithm of the merkle tree nodes is always the same, hence I won't mention it for brevity. You know the hash of the root of the merkle tree and that it is a Filecoin Binary Merkle Tree (that is its codec, let's call it Do that recursively. At one point (the level before the leaves) you will retrieve data that is only 32 bytes long, it's a single hash. Now you know that the next level are the leaves. Hence you use the |
Due to feedback about CommP being strictly a subset of the CommD merkle, following on from Jeromy's comment in the original thread, I've trimmed it down to two codecs. |
So I do have one concern here.
I just want to make sure it's clear that these multicodecs do NOT define a traversable data structure you can shoehorn into working like IPLD
@Stebalien discussed this and tried to work out what you can do, but it doesn't work, at least as long as the intermediate hashes of the binary merkle tree that filecoin constructs are not themselves CIDs
The issue is this: you could imagine this CID identifies a block that is just a 64 byte concatenation of two raw filecoin hashes.
You could grab that block, and then "deserialize" it by splitting it in two and adding the same multihash parameters from the root CID, and use that to request the next block.
The problem is depth -- at what point are you looking at two hashes concatenated together, and at what point are you looking at the lowest level of leaf data -- actual content of the piece being stored? There's no way to know from the CID alone (you can also figure it out if you know the underlying piece size, theoretically)
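To make the "split it in two and reuse the root's multihash parameters" step concrete, here is a rough sketch. `Ref` is a stand-in for a CID (codec name, hash name, digest) and the names `example-codec` and `example-hash` are placeholders, not the codes proposed in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a CID: codec + multihash name + digest."""
    codec: str
    hash_name: str
    digest: bytes


def split_node(parent: Ref, block: bytes):
    """Turn a 64-byte block into two child refs that copy the parent's parameters."""
    if len(block) != 64:
        raise ValueError("expected a 64-byte concatenation of two hashes")
    return tuple(
        Ref(parent.codec, parent.hash_name, block[i:i + 32]) for i in (0, 32)
    )


root = Ref("example-codec", "example-hash", bytes(32))
left, right = split_node(root, bytes([1]) * 32 + bytes([2]) * 32)
print(left.digest.hex()[:8], right.digest.hex()[:8])
```

Nothing in the two returned refs records whether their 32 bytes are a hash of a further block or the piece data itself, which is the depth problem described here.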
I only bring this up because I know @jbenet has a desire for Filecoin miners to actually transfer data in a way that is incrementally verifiable by requesting the PieceCID, and this definitely does not get us there.
You could imagine defining a separate IPLD format for transferring that has the same incremental verifiability properties as the PieceCID, whose root CID could be calculated from PieceCID & PieceSize in a filecoin deal, but as long as the nodes of the binary merkle tree Filecoin constructs are hashes without any CID information, this codec alone does not get us to IPLD data.
That's why @Stebalien and I originally just went with Raw.
BTW, the optimal way to transfer that is both incrementally verifiable and mostly efficient would involve transmitting only the top levels of the filecoin merkle tree to the miner down to the smallest chunk you wanted to verify incrementally (say maybe 1mb) and then transmitting the remaining raw data in larger chunks and reconstructing the bottom levels of the tree on the miner side to make sure it matched the top. (otherwise the tree itself is at least as big as the underlying piece) That's gonna have to be some kind of different format with a different CID anyway (theoretically derivable from the PieceCID & PieceSize), so maybe this makes the above comment less relevant. |
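The "send the top of the tree, rebuild the bottom from raw chunks" idea can be sketched like this. Assumptions, loudly: plain SHA-256 stands in for the truncated hash from #171, leaves are 32-byte slices of already padded data, and the chunk size is a power-of-two multiple of 32. All function names are illustrative.

```python
import hashlib


def merkle_root(row):
    """Fold a power-of-two list of 32-byte values up to a single root."""
    while len(row) > 1:
        row = [hashlib.sha256(row[i] + row[i + 1]).digest()
               for i in range(0, len(row), 2)]
    return row[0]


def leaves_of(data, leaf_size=32):
    return [data[i:i + leaf_size] for i in range(0, len(data), leaf_size)]


def chunk_roots(data, chunk_size):
    """Receiver side: rebuild the subtree root of each raw chunk."""
    return [merkle_root(leaves_of(data[i:i + chunk_size]))
            for i in range(0, len(data), chunk_size)]


data = bytes(range(256)) * 4          # 1024 bytes, i.e. 32 leaves
full_root = merkle_root(leaves_of(data))
top_row = chunk_roots(data, 256)      # 4 subtree roots, one per chunk
assert merkle_root(top_row) == full_root
```

Each chunk can be checked against its entry in the top row as soon as it arrives, and only the top row has to be transmitted alongside the raw data.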
oops sorry didn't mean to close the PR. |
Ahh but you can with CommP and CommD since at the base you are hashing 32-byte chunks of the raw data, not 64, so you have a differentiator. (It's the same as this differentiator). So if you were to construct some magical loader that is able to take one of these CIDs and return a data chunk, then an IPLD codec might do this:
(The loader is a whole other issue, how would such a loader work? would it index all of these chunks?? That's out of scope here but does pose an interesting challenge for the utility of these CIDs). I don't know about CommR, however, that's probably quite different and maybe this isn't even possible? @porcuquine? |
I wondered the same and had a call with @rvagg who explained it to me. I had hoped that my explanation at #172 (comment) would explain how this would work. For me the missing piece was the intermediate level right before the leaves which contains the 32 byte hashes. @hannahhoward I'm happy to try to explain it again, draw it or have a call with you. |
@rvagg oh! wow I work on Filecoin fulltime and I didn't know the lowest layer is hashes of 32 bytes not two 32 byte leaves. you're right then! |
Hold on:
This is not true. The original data is not independently hashed. Rather, the original data are used as-is and form the leaves of a binary merkle tree. @rvagg we talked about this once before, and you remembered that you had implemented this correctly for dumb drop, I believe.
CommR is more complicated and also likely to change as constructions change. It's true that the current complications might simplify this aspect (though I have not tried to completely follow) in that the bottom layer involves a differently-shaped hash (11-ary) than the tree above it (8-ary). However, that tree itself does not directly yield the root. There are layers above it. And everything I said applies only to one half of the final 'binary tree' joining two intermediate commitments. (I wrote this out not as a specification but to wave my hands at the shape and complexity of CommR. We can discuss it more if actually useful, but given all which has come before, I think it's best kept simple.) |
For the purpose of "stretching the multicodec definition and thinking about how things could theoretically work", I thought it wouldn't matter how it actually works; we could just add a theoretical layer above the leaf nodes. But we can't, as it would then change the hashes :-/ This means that we do not know when we hit leaf nodes when we traverse the merkle tree. I can see that we might still want to merge this. Though it violates my current view of what multicodecs are, hence I withdraw my approval. |
@vmx I don’t share the view that blocks must be constructed this way in order to satisfy the definition of a multicodec. I’ve been thinking a lot about the broader topic of “shadow graphs” (any parallel graph of additionally computed state of another graph) and regardless of how you build that graph, you’ll find yourself in a position where the original data has no parallel reference point inside the shadow graph and can only be found by following a traversal somewhere else. In other words, not having in-block references between nodes inside of shadow graph branches and leaves is fine as long as you can theoretically build a linking structure over all the relevant data that could represent a complete graph. We aren’t storing that right now but it can be built, and it could not be built if we don’t allow these to be proper multicodecs. We just need to keep in mind that, because the blocks themselves don’t have the necessary references, we cannot rely on the traversal and pinning systems we’ve built until you do build this larger “meta graph” that links all of them together. But that’s not enough to hold up adding the codec. |
@rvagg so then it looks like my comment stands. But I don't know if that's a blocker. Just to be clear then this is not even theoretically a block/IPLD format that could be used with say Graphsync. |
Again, if our goal is to transfer piece data as opposed to payload in a way that is incrementally verifiable over Graphsync/Bitswap, I believe the most optimized way to do that is through a separate IPLD format with a different CID (possibly derivable from PieceCid + PieceSize) which wraps piece data. |
also I confirmed that absent a length of the underlying data (i.e. height of the tree), the commP does not uniquely identify a piece of filecoin data. So one option to consider is whether the codec itself should specify size of piece (probably in power of 2) -- which I think goes beyond the purpose of a CID, but would be the way to make it a unique identity. |
@mikeal's comment (#172 (comment)) wasn't clear to me. After a sync conversation the outcome for me was: you can traverse the tree, but you need out-of-band information to properly do that. That out-of-band information is the input data size (i.e. tree height). |
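A small sketch of what "traversable, but only with out-of-band height" means in practice. Assumptions: plain SHA-256 stands in for the real node hash, and `store` is a hypothetical in-memory loader mapping a digest to its 64-byte block; no such loader exists today.

```python
import hashlib


def build(leaves, store):
    """Build the tree, recording each interior block under its digest."""
    row, height = list(leaves), 0
    while len(row) > 1:
        nxt = []
        for i in range(0, len(row), 2):
            block = row[i] + row[i + 1]
            digest = hashlib.sha256(block).digest()
            store[digest] = block
            nxt.append(digest)
        row, height = nxt, height + 1
    return row[0], height


def read_leaves(root, height, store):
    """Walk down exactly `height` levels; the height must come from outside."""
    row = [root]
    for _ in range(height):
        row = [half for d in row for half in (store[d][:32], store[d][32:])]
    return row


leaves = [bytes([i]) * 32 for i in range(8)]
store = {}
root, height = build(leaves, store)
assert read_leaves(root, height, store) == leaves
```

Calling `read_leaves` with `height - 1` returns four perfectly valid-looking 32-byte values too, which is why the size has to be supplied alongside the CID.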
(aside: what we discussed was the 254-bit truncation being consistent throughout the graph, hence the new multihash in #171). You're right, I misspoke about them being hashed but they are still differentiated by size, so the base layer comes back as 32-byte chunks while every other layer is 64-byte. But that only changes my claim that the lowest level CIDs could be In IPLD Schema language an
BUT I still think this is entirely academic for the purpose of FIL though, the graphs are impractically large to be identifying them in this way and the base data is impossible to reconstruct without a If we wanted to be more free with our definitions, we could potentially flip in the future and say something like "these multicodecs identify the entirety of the Comm{X} merkle trees and the multihash identifies the hashing algorithm used within the merkle process, with the Data associated with the CID being the base, padded data". That would be a bit of a shift for multicodec, although @vmx noted yesterday that we already have |
IMO the only thing I'm seeing as a potential blocker to moving forward here is the combining of CommP and CommD, but I haven't seen any objections to that yet. |
Bah, yet again I wrote that too quickly and should have let it simmer in my head a bit longer. OK, my picture of how this works is bunk. If they're not hashed you can't turn them into CIDs like every other layer, |
OK, there's two ways of framing that work for the current proposals:
So the second option pushes this further than some may be comfortable with, but maybe using multicodec as a generic identifier type catalogue is not a terrible thing. We're still dealing with content-addressed data at least, it's not like a mime type.
I think at this stage my preference is to go for option 1, it still works, a codec could still be theoretically written that would do a traversal, it just wouldn't get you anywhere helpful. But as I keep saying, I think the limits of practicality rule out such an implementation regardless, this is simply an identification and differentiation exercise. |
I have mostly avoided this part of the discussion, as I don't have a strong opinion and also am not fully indoctrinated into the context of the decision. Still, based on the above, let me take one shot at a position. Please bear in mind, that I know very little about CIDs, Codecs, IPLD, or how any of these are used in detail. My comments below may only be useful as abstract ideas about a system similar-to-but-different from the one we have. I'm writing them out partially just to think this through for myself, and partially in case this perspective is useful as we make any minor adjustments to what we have that prove useful or necessary. Based on the conversation above, it sounds like our definitions may not yet have stabilized in a way that deals with all needs yet anyway. Some of what I write below may overlap ideas expressed in the preceding discussion. I am not claiming anything here is novel, just trying to work through the ideas somewhat independently. The process of producing a merkle root from base data, is special. It does create a duality in which the distinction between the 'original base data', and the 'immediate precursor hash inputs' as sources is made ambiguous. This ambiguity is not accidental. Rather, it's a direct byproduct of the function of merkle trees as structures from which proofs can be generated. The nature of these structures is such that they are receptive to — but do not directly require — extra annotation which elaborates their intended meaning. Specifically, if a merkle root is annotated with a height, this can be interpreted as a declaration that there must exist leaves at a certain depth, and that merkle inclusion proofs using that root (along with that annotation) must have the corresponding number of elements. However, the presence of such an annotation doesn't eliminate the possibility of shorter proofs used to demonstrate knowledge of some set of interior nodes. 
In that sense, every root actually corresponds to a family of merkle trees. At the very least, every tree of height less than any specified height is also implicitly encoded by the same root. In fact, it is also the case that there are an infinite number of potentially larger trees of greater height associated with the same root. If my leaf data happens to have been an interior row of such a larger tree, I can even produce proofs that this root is definitely also the root of a known tree with height greater than it claims as its own. Part of the ambiguity seems to come from the question of what we consider the 'hash function' to be. Let's consider SHA2, which uses a merkle-damgard construction to hash arbitrarily long sequences of input data. A SHA2 hash also has the property that — without explicit specification of input length — we cannot know whether a given input is the 'real' original data or a collision. To accomplish this, there is some internal function which is repeatedly applied in order to produce the final result. What happens if we consider 'construction of a binary merkle tree using H as an internal function' to be a hash function on its own? This is conceptually like applying a merkle-damgard construction (minus padding considerations) to any binary hash function. Filecoin used to do exactly this to define our pedersen hash over arbitrarily-sized input, for example. What distinguishes the use of merkle-tree construction as the method of combining an inner hash is that it is intentionally prone to collisions according to a useful structure. In other words, assuming no height-based 'personalization', ambiguity of the original data's length allows incremental revelation of information. Throughout this note, I am only considering uniform trees which use an identical hash function at each row, with no extra height-based padding — although the idea could be extended to deal with some forms of such tagging also. 
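The "family of trees" point is easy to see in a few lines. A sketch under stated assumptions: plain SHA-256 as the inner hash, a uniform binary tree, and no height-based tagging, matching the simplifications made in this note.

```python
import hashlib


def next_row(row):
    """Hash adjacent pairs to produce the row above."""
    return [hashlib.sha256(row[i] + row[i + 1]).digest()
            for i in range(0, len(row), 2)]


def merkle_root(row):
    while len(row) > 1:
        row = next_row(row)
    return row[0]


leaves = [bytes([i]) * 32 for i in range(8)]   # 256 bytes of base data
interior = next_row(leaves)                    # 128 bytes: the row above

# Two different inputs of different lengths, one root.
assert b"".join(leaves) != b"".join(interior)
assert merkle_root(leaves) == merkle_root(interior)
```

Treated as base data, the interior row yields a shorter tree with the same root, so the root alone cannot say which of the two was "the" input.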
Consider the following alternate interpretation of a merkle inclusion proof. Instead of proving knowledge/possession of a specific leaf, the goal of the prover is to prove knowledge or existence of a tree with a given depth. Every root trivially corresponds to a tree of depth 0, and the proof is a zero-length path of complementary hashes, yielding the root (which is also the leaf, and to which zero hashes are applied). Given an arbitrary root, a verifier has no idea whether it was produced from a tree of a given height. In theory, even imaginary trees which were not used to construct the root can be invented, but this is assumed to be impossible given an adequate (for this purpose) hash function. A prover's goal is to demonstrate knowledge of the existence of a tree of some height. For example, to prove the existence of a (binary) tree of height 1, a proof presents the entire tree: the leaf, its complementary hash, and ordering information. This convinces a verifier that there exists a known tree of height 1 with the given root. Likewise, any verified inclusion proof of length n-1 proves the existence of a tree of height n with the given root. In this game, the uncertainty as to whether ever longer proofs can or will be provided is a feature, not a bug. The point is that in the absence of length data internal to the hash function, length ambiguity is always possible. In the specific case of CommP and CommD, this ambiguity is a feature, not a bug. It means a storage client can compute CommP of his own piece and later verify that it has been packed into a sector with commitment CommD — but without needing to know in advance either the size of the eventual sector or the content of any other data which might be packed in the same sector. This is not only a matter of convenience, but also of capability. If size of the eventual tree must be encoded in CommP, then the same piece cannot be packed into sectors of different sizes without changing its CommP. 
Of course, the reverse is possible: CommP could force knowledge of the length of its own base data (by encoding height information into the hash function), as could CommD. But this is not strictly necessary, as long as inclusion proofs are checked in a context in which explicit knowledge of expected size is provided. All this is to say: I think there is a consistent point of view from which generation of an N-ary merkle tree can:
So, to be concrete, a The problem is that if a CID needs to uniquely map to a specific base result, we are out of luck. I don't know enough about IPLD to know how traversal of recursive structures works (if at all), but it seems to me that this structure does not need to be inconsistent with realistic uses. (Again, note that I'm ignorant of how things currently work, so this may not make sense in that context.) Is there a hard requirement that CIDs must correspond to collision-resistant hashes? If so, how is this future-proofed? That is, do we invalidate all past CIDs of a given type if a collision is ever detected? If not, then we must have a mechanism for dealing with potential collisions (even if expected to be very rare in the short-term future). What I am proposing here is that we accept a type of CID with many expected collisions of an expected structure. If we can do that, everything becomes simple. For example, if I do have the context of a known content length, I can search for content of that length or any shorter length. If the final expansion is not discoverable, I can repeat the process recursively. This encompasses both extremes: if cached and available, I can find the specific data I need (if I know content-length) in one hop. Or, in the worst case, I have to traverse the entire tree, decoding each row's hashes in turn. (Although Filecoin is not one, there are imaginable use cases for this behavior.) In between, recovery of the original data might take any number of hops, depending on which cached expansions are available to me. But if I search exhaustively, I will eventually reach the original data if it exists, and if this is done cleverly, I can hope to do so efficiently. Interestingly, this also provides a mechanism for formalizing merkle inclusion proofs as first-class citizens of the ecosystem. 
A merkle-inclusion proof of a given leaf + index within base data of a specified size can then be specified as the minimal set of verifiable CID -> content mappings required to retrieve the leaf from the root via the path. In this context, I define a verifiable mapping as one in which the CID can be directly generated from the input data, and the target (leaf or intermediate node) is verified to exist at the specified position within the input data. Having a uniform and universal specification of the information content (as opposed to byte structure) of merkle inclusion proofs would be valuable. It would allow for more general tooling for specification, implementation, and verification of the correctness of implementation based on specification. If current structure doesn't already provide this directly, then this is in itself a good motivation for adopting something like what I've described. It might be the case that this is best accomplished by adding some new entity with the described properties, if the existing types (CID, Codec) cannot be made consistent with the usage I propose. I also might be missing something. I've thought quite a bit about this last topic (structure, specification, implementation, verification of specification-implementation) of inclusion proofs — but almost not at all (previously) about how this relates to content-addressable data as such. |
This was kind of vague. What I meant by 'directly generated' was 'using one application of the inner/atomic hash function'. That said, since the CID can always be generated by building a complete tree, you can also think of presentation of the entire original data (or an entire subtree row at any point of the path) as also being a valid merkle proof — just not a minimal one. This is probably actually a useful distinction. It would be good to have a canonical minimal inclusion proof, as well as recognition that non-minimal proofs are possible and verifiable. A minimal proof can always be constructed from a non-minimal one, and the existence of the minimal form also allows for a (not necessarily unique — but deterministically specifiable) canonical binary form for proofs (which I consider highly desirable for other reasons). |
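For readers less familiar with inclusion proofs, a minimal one might look like the sketch below. It assumes plain SHA-256 as the inner hash and a uniform binary tree; `prove` and `verify` are illustrative names, and the proof layout (sibling plus a left/right flag per level) is one possible encoding, not a proposed canonical form.

```python
import hashlib


def h(left, right):
    return hashlib.sha256(left + right).digest()


def build_rows(leaves):
    """Return every row of the tree, leaves first, root row last."""
    rows = [list(leaves)]
    while len(rows[-1]) > 1:
        row = rows[-1]
        rows.append([h(row[i], row[i + 1]) for i in range(0, len(row), 2)])
    return rows


def prove(rows, index):
    """Collect (sibling, node_is_right) pairs from rows[0] up to the root."""
    path = []
    for row in rows[:-1]:
        path.append((row[index ^ 1], index & 1))
        index >>= 1
    return path


def verify(root, node, path):
    for sibling, node_is_right in path:
        node = h(sibling, node) if node_is_right else h(node, sibling)
    return node == root


leaves = [bytes([i]) * 32 for i in range(8)]
rows = build_rows(leaves)
root = rows[-1][0]
assert verify(root, leaves[5], prove(rows, 5))
# A shorter proof for an interior node verifies against the same root.
assert verify(root, rows[1][2], prove(rows[1:], 2))
```

The second assertion is the shorter-proof case: a path that starts at an interior node is just as verifiable as one that starts at a leaf.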
After reading @porcuquine's excellent comment, I come to the conclusion that we should (intentionally) not distinguish whether we hash the base data or some inner merkle tree node. To me this means that the codec for the nodes of such a merkle tree should be This would (kind-of) downgrade the CID into a multihash for the purpose of CommD, CommP and CommR. |
@vmx and I will have a chat about this 1:1 to try and move this forward. I'm recording my thoughts as they are now, maybe this will change:
Yes, but:
I don't know if this is the next logical step. It may be the right step if we were treating this as strictly an exercise in defining a mapping to an IPLD model of the world, which I've been trying hard to do but apparently failing. It doesn't seem like that's appropriate if this is all about inclusion proofs. We're back to a simple (?) need for classification of an identifier, something CIDs should be good at. Using these 2 new codecs and not using the Combined with the new multihashes, we'd get CIDs that say:
Then you get differentiation along two dimensions:
(Granted that you could also get that same differentiation with a single multihash that packed everything into it - In an IPLD world, you might also be able to get a corresponding byte array that could be used to verify the CID is correct, but that classic IPLD usage is not quite how this will be used and we should probably stop trying to stretch it there. What you don't get is the classic IPLD "codec" sense of "I know how to decode this byte array into something useful". But that's probably OK if we don't pretend this is IPLD. |
Edited the comment text to be clearer:
From our chat yesterday, I think we're ready to move forward. I'll comment further over in #161. Thanks @porcuquine for your patience and excellent description. Both @vmx and I enjoyed your last write-up and learned a bunch from it. |
I'm glad to have been able to participate in the discussion. You all are thinking this through carefully, which I appreciate. Thank you for not letting my partially-formed ideas distract from resolving the immediate issues. |
As @rvagg mentions over in #161 (comment), I still don't understand why we need a different codec for CommD/C and CommR (and not just a single one). Though I trust the Filecoin team that this is needed for some reason I haven't heard/understood yet, hence I'm approving this PR.
These describe roots & nodes of a merkle tree, not the underlying data. In the case of CommP and CommD they are binary merkle trees using sha2-256-trunc2. For CommR they are novel structure merkle trees using poseidon-bls12_381-a2-fc1. All nodes of the respective merkle trees could also be described using this codec if required, all the way to base data. It is anticipated that the primary use will be restricted to the roots. Ref: #161 Closes: #161 Closes: #167
These only describe the roots of a merkle tree, not the underlying data. In the case of CommP and CommD they are binary merkle trees using sha2-256-trunc2. For CommR they are novel structure merkle trees using poseidon-bls12_381-a2-fc1.
All nodes of the respective merkle trees could also be described using this codec if required, all the way to base data. It is anticipated that the primary use will be restricted to the roots.
This PR partly assumes #170 and #171, although it would be possible to use identity multihash with these to form a CID. It does depend on agreeing that our approach to merkle tree "hashing" wrt multihash is that we identify individual nodes, rather than treating the entire merkle process as a "hash function". So in the case of CommP, the CID we'd generate corresponds to a "Block" that's 64-bytes long which is the concatenation of two sha2-256-trunc2 hashes. You could theoretically generate CIDs for every node of the merkle tree down to the base data. Although you're not guaranteed to find useful data at the base (in the case of CommP it's fr32 padded to insert 2 bit spaces for every 254 bits and zero padded to fit a base2 size, but the original non-padded size would have to be provided by other means).
Ref: #161
Closes: #161
Closes: #167
R= @vmx @mikeal @dignifiedquire @porcuquine @Stebalien
Also @whyrusleeping had a comment about CommP and CommD possibly being redundant in #161? I'm not sure about the ultimate anticipated use of each of these values so can't speak to that.
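For a feel of what the padding mentioned in the description does to sizes, here is a back-of-the-envelope sketch. It only models the two steps as worded above (2 spare bits per 254 bits of input, then zero padding up to a power-of-two size); `fr32_padded_size` is a hypothetical helper, and the actual Filecoin implementation may order or bound these steps differently.

```python
def fr32_padded_size(raw_size: int) -> int:
    """Estimate the padded size in bytes for raw_size bytes of input."""
    if raw_size <= 0:
        raise ValueError("raw_size must be positive")
    # Every 254 bits of input occupy a 256-bit (32-byte) slot.
    groups = -(-(raw_size * 8) // 254)   # ceiling division
    expanded = groups * 32
    # Zero pad up to the next power of two, with one 32-byte leaf as minimum.
    size = 32
    while size < expanded:
        size *= 2
    return size


for n in (1, 127, 128, 254):
    print(n, fr32_padded_size(n))
```

Many raw sizes map to the same padded size, which is why the original non-padded size has to be provided by other means.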