add filecoin commitment merkle root codecs #172

Merged
merged 1 commit into from
May 12, 2020

rvagg
Member

@rvagg rvagg commented Apr 16, 2020

These only describe the roots of a merkle tree, not the underlying data. In the case of CommP and CommD they are binary merkle trees using sha2-256-trunc2. For CommR they are novel structure merkle trees using poseidon-bls12_381-a2-fc1.

All nodes of the respective merkle trees could also be described using this codec if required, all the way to base data. It is anticipated that the primary use will be restricted to the roots.

This PR partly assumes #170 and #171, although it would be possible to use identity multihash with these to form a CID. It does depend on agreeing that our approach to merkle tree "hashing" wrt multihash is that we identify individual nodes, rather than treating the entire merkle process as a "hash function". So in the case of CommP, the CID we'd generate corresponds to a "Block" that's 64-bytes long which is the concatenation of two sha2-256-trunc2 hashes. You could theoretically generate CIDs for every node of the merkle tree down to the base data. Although you're not guaranteed to find useful data at the base (in the case of CommP it's fr32 padded to insert 2 bit spaces for every 254 bits and zero padded to fit a base2 size, but the original non-padded size would have to be provided by other means).
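To make the "64-byte block" concrete, here is a minimal Python sketch of one interior node. It assumes that the truncated hash is SHA2-256 with the two most significant bits of the last byte zeroed, which is how the multicodec table describes `sha2-256-trunc254-padded`; the function names `sha256_trunc254` and `node_block` are hypothetical and the child inputs are placeholders, not real Filecoin data.

```python
import hashlib


def sha256_trunc254(data: bytes) -> bytes:
    """SHA2-256 with the two most significant bits of the last byte zeroed.

    Assumption: this matches the truncation discussed in this PR.
    """
    digest = bytearray(hashlib.sha256(data).digest())
    digest[31] &= 0b00111111
    return bytes(digest)


def node_block(left: bytes, right: bytes) -> bytes:
    """An interior "Block": two 32-byte child digests concatenated."""
    if len(left) != 32 or len(right) != 32:
        raise ValueError("child digests must be 32 bytes each")
    return left + right


# Placeholder children; in a real tree these come from the layer below.
left = sha256_trunc254(b"left child")
right = sha256_trunc254(b"right child")

block = node_block(left, right)          # the 64-byte block a CID would point at
parent_digest = sha256_trunc254(block)   # the digest that would go in that CID
```

The digest of the 64-byte block is what would be wrapped in a multihash and, together with the codec, form the CID for that node.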

Ref: #161
Closes: #161
Closes: #167

R= @vmx @mikeal @dignifiedquire @porcuquine @Stebalien

Also @whyrusleeping had a comment about CommP and CommD possibly being redundant in #161? I'm not sure about the ultimate anticipated use of each of these values so can't speak to that.

@rvagg
Member Author

rvagg commented Apr 16, 2020

flipped the tag to filecoin from ipld in the second (fixup) commit here, but that's open for discussion of course.

@vmx
Member

vmx commented Apr 16, 2020

I think I understood the linked Multihash additions and how you could think of the Merkle Tree they use as a "powerful hash function". Though I don't understand this PR.

Currently the codec in a CID is used to know how to parse/decode the data that was used to create the hash. It is not for interpreting the data (e.g. GeoJSON and JSON is the same codec).

If I apply this to e.g. CommP. There we have some input data, which is just a list of bytes. To create the multihash value, you pad the data, add some zeros, build a merkle tree. You can generate the hash from the initial list of bytes. Now I would say that the underlying data has the raw codec. CommP is just how we interpret the data and is nothing that (IMHO) should be part of the CID.

@mikeal
Contributor

mikeal commented Apr 16, 2020

@vmx my understanding is that the hash they produce isn’t of the original data, it’s of this novel merkle tree built from that data, and the codec describes the block format of that novel merkle tree. They aren’t storing and decoding those blocks by CID, but it is theoretically possible to do so, which is why it makes sense to have a codec.

Another way to look at it is this: you could change the algorithm generating the merkle tree (adjust the padding perhaps) and keep the hashing function the same, which would mean that the merkle tree format is now sufficiently different that it would require a different codec but not a different hashing function. If you just use raw you would lose this point of differentiation.

@rvagg
Member Author

rvagg commented Apr 17, 2020

I'm dealing with this confusion for BTC at the moment and am trying to draw some graphs to describe what this is actually trying to do for documentation, cause it is confusing. I'll try and get something in here asap that illustrates it. I think it's justifiable, and if it isn't then we have a big problem with all the coins because they do this too.

At the base, you have raw data. For CommP, we'd fr32 bit pad it, zero pad it to a base2 size and then break all of that up into 32-byte "blocks", each of which would still be raw, and as an array, would join together to create that fr32+zero-padded data still. Then you make a merkle tree out of it and call each node in that merkle tree a "block", the first layer of which is a sha2-256-trunc2 hash of the 32-byte raw block and then every layer above that is a sha2-256-trunc2 hash of two concatenated blocks from the layer below, all the way up to the root. It's all of these nodes in the merkle tree that we could make a CID for and identify with these codecs (plus the sha2-256-trunc2 hash).

So in terms of utility, you could receive a CID identified with these codecs and know that:

  1. If the block it loads (from some mystical "loader") is 64-bytes, then it's the concatenation of two hashes, and you could "decode" that block (in the IPLD sense!) into an array of two more CIDs, and you could further load each of those.
  2. If the block it loads is 32-bytes, then it's going to be the base raw data, that you'd want to concatenate to blocks around it to form ... a thing ... that I don't know what you'd do with, but you could do it at least.
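The two cases above amount to a length-discriminated decode. A minimal sketch in Python, where `decode_block` and `DIGEST_SIZE` are hypothetical names and the blocks are placeholder bytes rather than real tree nodes:

```python
from typing import Tuple, Union

DIGEST_SIZE = 32  # bytes per hash, and per raw chunk at the base


def decode_block(block: bytes) -> Union[Tuple[bytes, bytes], bytes]:
    """Decode a block purely by its length, following the two cases above."""
    if len(block) == 2 * DIGEST_SIZE:
        # Interior node: two child digests. Each could be wrapped in a CID
        # with the same codec and multihash as the parent.
        return block[:DIGEST_SIZE], block[DIGEST_SIZE:]
    if len(block) == DIGEST_SIZE:
        # Base layer: a raw 32-byte chunk.
        return block
    raise ValueError(f"unexpected block length: {len(block)}")


children = decode_block(bytes(32) + bytes([1]) * 32)  # 64-byte placeholder node
leaf = decode_block(bytes([7]) * 32)                  # 32-byte placeholder leaf
```

Where the blocks come from (the "loader") is left out entirely; this only shows what a codec could do once it has the bytes.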

So it works, and it wouldn't be a stretch to tag this ipld for the above reasons either. The practicality is the questionable thing.

If I'm doing my calculations right, then a ~<1Gb CAR file in Filecoin could be describable by 67,108,863 CID:Block pairs--from those 32-byte chunks all the way up to the root CommP. That's the part that makes me question how reasonable this is if "but you can do it" is the best argument.
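The figure can be checked with a few lines of Python, assuming a piece of exactly 1 GiB (2^30 bytes) split into 32-byte leaves and a full binary tree on top:

```python
piece_size = 2**30   # assumed: exactly 1 GiB of padded data
leaf_size = 32       # bytes per leaf chunk

leaves = piece_size // leaf_size   # 2**25 leaves
total_nodes = 2 * leaves - 1       # a full binary tree has 2n - 1 nodes

print(f"{leaves:,} leaves, {total_nodes:,} nodes")
# prints "33,554,432 leaves, 67,108,863 nodes"
```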

@vmx
Member

vmx commented Apr 17, 2020

Thanks @rvagg for that great explanation. That really cleared up things for me, now it makes sense to me and I think this justifies being a codec. Though, I think it should be kind of a general binary merkle tree codec or so, not specific to CommP.

@rvagg
Member Author

rvagg commented Apr 20, 2020

Wellll that is an option, then it'd force differentiation down to the bottom layer leaves of commp, you just wouldn't be able to receive a CID and know it's a Filecoin thing until you did that digging, but how would you know how or where to get a loader from to navigate? This is a bit of a vexed problem and comes up in BTC and others too - same problem with loaders unless we publish all of these nodes onto the IPFS network and they just magically show up. Maybe this is stretching the purpose of CIDs a bit too much since we need an additional piece of information. For now I think that we're forced to embed that information in the codec so it serves two purposes (1) this is a binary merkle node and (2) it's from within filecoin.

Also worth noting here that the binary merkle node won't work for sealed sectors since they have a novel structure, not simply binary. It also may not quite be sufficient for BTC because it uses a hack to make its merkle trees binary, doubling up the last element if there are an odd number of elements at any layer of the merkle. You'd still get a "normal" looking merkle, it's just that some nodes would end up with two of the same links, side by side. It's kind of nice to put BTC's merkle off to the side and say "binary, with an odd-numbered hack", whereas FIL uses "binary, and we make sure it's binary by padding the input data, so we have a lot of zeros."

@vmx
Member

vmx commented Apr 20, 2020

I've put some more thought into this and came to a similar conclusion as

For now I think that we're forced to embed that information in the codec so it serves two purposes (1) this is a binary merkle node and (2) it's from within filecoin.

Here's my idea on how to justify having a custom codec, while (kind of) keeping the notion of a Block, which is CID + Data. With the codec and hash algorithm information you can recreate the hash that is stored in the CID, from the Data.

The Data in our case are some raw bytes (any input you like) and (here's the stretch) does not get hashed directly, but pre-processed first. With the codec information we know how to do the processing, e.g. we must pad the data in a certain way and that it is a merkle tree. The hash algorithm information is used to build up the merkle tree.

So the codec would be something like "do some specified padding in between bytes and at the end of the file and use it as leaves for a merkle tree". We might call that then e.g. fil-piece-unsealed. Though if e.g. CommP and CommD use the exact same padding algorithm, it shouldn't be a separate codec, but the same one.

The codec and the hashing algorithm would be independent of each other and could independently be swapped with something else.

@rvagg
Member Author

rvagg commented Apr 21, 2020

I'm not sure we need to go that far even to justify it, because we can say that fil-piece-unsealed is a serial format, specifically that we expect the block data to be 64-bytes long and we can split that in half and it'll yield two 32-byte hashes which can be turned into CIDs themselves using the same codec. It's as clean as an ipld codec and basically the same as what I'm having to do with the BTC merkle trees, a bitcoin-tx CID points to a chunk that, if 64-bytes long, yields 2 more CIDs when decoded properly according to the bitcoin-tx format. Changing the tags of these to ipld might make more sense technically, but given their likely use (which mostly won't involve decoding the block), I think their own filecoin tag might be cleaner.

@vmx
Member

vmx commented Apr 21, 2020

I'm not sure we need to go that far even to justify it, because we can say that fil-piece-unsealed is a serial format, specifically that we expect the block data to be 64-bytes long and we can split that in half and it'll yield two 32-byte hashes which can be turned into CIDs themselves using the same codec.

But this way you cannot identify when you hit the leaf nodes. You cannot tell the difference between whether something is raw bytes (a leaf) or a hash (merkle tree node). Or would you say it's fine that the leaf nodes have the same codec as all this is more of a theoretical exercise?

Why I want a proper justification is, that I think it is important that we agree on how to stretch the multiformats, so if similar requests from third parties come in, we can more easily accept or reject it.

@rvagg
Member Author

rvagg commented Apr 21, 2020

https://github.com/ipld/go-ipld-btc/blob/5fe5af640eda869dc1236673de9fd321ba14062b/parsing.go#L133-L138

go-ipld-btc does it by length, either 64 bytes == node or > 64 == leaf. In the case of CommP and CommD (maybe CommR), it'd be 64 bytes == node or 32 == leaf. A node would return 2 CIDs with the same codec & multihash, a leaf would return the same codec but raw. What you do with the raw data is another matter.

The reason I'm fine with using length as a determinant is that it's a serialization format and serialization formats use all sorts of signals to determine what they contain, you just hand off to the codec and say "give me what you've got". In the case of go-ipld-btc you hand it your data and it comes back with either a 2-link node or a full transaction.

The alternative is to be really specific which could get explosive. I'd probably have to put up the following codecs for the btc stuff I'm working on (I may not be able to avoid some of these).

  • bitcoin-block
  • bitcoin-tx-merkle
  • bitcoin-tx-nowitness
  • bitcoin-witness-commitment
  • bitcoin-witness-nonce
  • bitcoin-witness-merkle
  • bitcoin-tx-witness

@vmx
Member

vmx commented Apr 21, 2020

the same codec but raw

I guess that should be identity as it's about the hash.

Using the size as signalling would be fine with me. Though things won't really work as you cannot retrieve the leaf nodes while traversing the merkle tree. For inner nodes, you can construct a CID to keep traversing. You won't be able to retrieve the leaf nodes as they use a different hash function (identity) hence a different multihash than the inner nodes. If you are within a node, you cannot tell if the next level are leaves or not as you would need to retrieve them to check their size.

So I don't think the "each merkle tree node is a block"-paradigm would actually be possible.

I wonder if my idea of "the whole merkle tree is part of codec" (#172 (comment)) would work. You could even address slices of the input data.

@vmx
Member

vmx commented Apr 22, 2020

Though things won't really work as you cannot retrieve the leaf nodes while traversing the merkle tree. […] If you are within a node, you cannot tell if the next level are leaves or not as you would need to retrieve them to check their size.

@rvagg explained to me how this works. Conceptually there will be another layer of blocks between the leaves and the layer above. This layer will contain nodes which contain the hash of a single leaf node.

A traversal starting at the merkle root would look like this. The hash algorithm of the merkle tree nodes is always the same, hence I won't mention it for brevity. You know the hash of the root of the merkle tree and that it is a Filecoin Binary Merkle Tree (that is its codec, let's call it fil-merkle). You create a CID out of this information and retrieve its data. It's 64 bytes (two 32-byte hashes). Hence you know that the children have the fil-merkle codec. Again create CIDs out of all that information and retrieve the data.

Do that recursively. At one point (the level before the leaves) you will retrieve data that is only 32 bytes long, it's a single hash. Now you know that the next level are the leaves. Hence you use the raw codec instead to construct a CID out of it. Now you can retrieve the raw data.

@rvagg
Member Author

rvagg commented Apr 23, 2020

Due to feedback about CommP being strictly a subset of the CommD merkle, following on from Jeromy's comment in the original thread, I've trimmed it down to two codecs. fil-commitment-unsealed and fil-commitment-sealed. I hope the names are appropriate 🤞 . Will post a wrap-up summary in #161.

@hannahhoward hannahhoward left a comment

So I do have one concern here.

I just want to make sure it's clear that these multicodecs do NOT define a traversable data structure you can shoehorn into working like IPLD

@Stebalien discussed this and tried to work out what you can do, but it doesn't work, at least as long as the intermediate hashes of the binary merkle tree that filecoin constructs are not themselves CIDs

The issue is this: you could imagine this CID identifies a block that is just a 64 byte concatenation of two raw filecoin hashes.

You could grab that block, and then "deserialize" it by splitting it in two and adding the same multihash parameters from the root CID, and use that to request the next block.

The problem is depth -- at what point are you looking at two hashes concatenated together, and at what point are you looking at the lowest level of leaf data -- actual content of the piece being stored? There's no way to know from the CID alone (you can also figure it out if you know the underlying piece size, theoretically)

I only bring this up because I know @jbenet has a desire for Filecoin miners to actually transfer data in a way that is incrementally verifiable by requesting the PieceCID, and this definitely does not get us there.

You could imagine defining a separate IPLD format for transferring that has the same incremental verifiability properties as the PieceCID, whose root CID could be calculated from PieceCID & PieceSize in a filecoin deal, but as long as the nodes of the binary merkle tree Filecoin constructs are hashes without any CID information, this codec alone does not get us to IPLD data.

That's why @Stebalien and I originally just went with Raw.

@hannahhoward

BTW, the optimal way to transfer that is both incrementally verifiable and mostly efficient would involve transmitting only the top levels of the filecoin merkle tree to the miner down to the smallest chunk you wanted to verify incrementally (say maybe 1mb) and then transmitting the remaining raw data in larger chunks and reconstructing the bottom levels of the tree on the miner side to make sure it matched the top. (otherwise the tree itself is at least as big as the underlying piece) That's gonna have to be some kind of different format with a different CID anyway (theoretically derivable from the PieceCID & PieceSize), so maybe this makes the above comment less relevant.
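A toy sketch of that transfer idea in Python, under loud assumptions: the data is already padded to a power-of-two size, leaves are 32-byte chunks used as-is, fr32 padding is ignored, and the sizes (1024-byte piece, 256-byte chunks) are tiny stand-ins for real ones. All function names are hypothetical.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Assumed node hash: SHA2-256 with the top two bits of the last byte zeroed."""
    digest = bytearray(hashlib.sha256(data).digest())
    digest[31] &= 0b00111111
    return bytes(digest)


def split(data: bytes, size: int) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def merkle_root(nodes: List[bytes]) -> bytes:
    """Root of a binary tree over a power-of-two number of 32-byte nodes."""
    layer = list(nodes)
    while len(layer) > 1:
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]


def chunk_root(chunk: bytes) -> bytes:
    """Rebuild the bottom levels of the tree for one transferred chunk."""
    return merkle_root(split(chunk, 32))


piece = bytes(range(256)) * 4                # toy 1024-byte piece (32 leaves)
piece_root = merkle_root(split(piece, 32))   # what the receiver already trusts

# Sender: only the tree layer at chunk granularity, then the raw chunks.
chunks = split(piece, 256)
chunk_roots = [chunk_root(c) for c in chunks]

# Receiver: first check the top of the tree against the trusted root,
# then verify each chunk as it arrives.
top_ok = merkle_root(chunk_roots) == piece_root
chunks_ok = [chunk_root(c) == chunk_roots[i] for i, c in enumerate(chunks)]
```

The receiver never needs the interior nodes below chunk granularity over the wire; it recomputes them from the raw chunk.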

@hannahhoward

oops sorry didn't mean to close the PR.

@rvagg
Member Author

rvagg commented Apr 24, 2020

The problem is depth -- at what point are you looking at two hashes concatenated together, and at what point are you looking at the lowest level of leaf data -- actual content of the piece being stored? There's no way to know from the CID alone (you can also figure it out if you know the underlying piece size, theoretically)

Ahh but you can with CommP and CommD since at the base you are hashing 32-byte chunks of the raw data, not 64, so you have a differentiator. (It's the same as this differentiator). So if you were to construct some magical loader that is able to take one of these CIDs and return a data chunk, then an IPLD codec might do this:

  • Is the chunk 64-bytes long? Then decode it into a tuple of CIDs with the same codec and multihash as the original.
  • Is the chunk 32-bytes long? Then decode it as a raw data chunk.

(The loader is a whole other issue, how would such a loader work? would it index all of these chunks?? That's out of scope here but does pose an interesting challenge for the utility of these CIDs).

I don't know about CommR, however, that's probably quite different and maybe this isn't even possible? @porcuquine?

@vmx
Member

vmx commented Apr 24, 2020

at what point are you looking at the lowest level of leaf data -- actual content of the piece being stored? There's no way to know from the CID alone (you can also figure it out if you know the underlying piece size, theoretically)

I wondered the same and had a call with @rvagg who explained it to me. I had hoped that my explanation at #172 (comment) would explain how this would work. For me the missing piece was the intermediate level right before the leaves which contains the 32 byte hashes. @hannahhoward I'm happy to try to explain it again, draw it or have a call with you.

@hannahhoward

@rvagg oh! wow I work on Filecoin fulltime and I didn't know the lowest layer is hashes of 32 bytes not two 32 byte leaves. you're right then!

@porcuquine

Hold on:

The problem is depth -- at what point are you looking at two hashes concatenated together, and at what point are you looking at the lowest level of leaf data -- actual content of the piece being stored? There's no way to know from the CID alone (you can also figure it out if you know the underlying piece size, theoretically)

Ahh but you can with CommP and CommD since at the base you are hashing 32-byte chunks of the raw data, not 64, so you have a differentiator. (It's the same as this differentiator).

This is not true. The original data is not independently hashed. Rather, the original data are used as-is and form the leaves of a binary merkle tree. @rvagg we talked about this once before, and you remembered that you had implemented this correctly for dumb drop, I believe.

I don't know about CommR, however, that's probably quite different and maybe this isn't even possible? @porcuquine?

CommR is more complicated and also likely to change as constructions change. It's true that the current complications might simplify this aspect (though I have not tried to completely follow) in that the bottom layer involves a differently-shaped hash (11-ary) than the tree above it (8-ary). However, that tree itself does not directly yield the root. There are layers above it. And everything I said applies only to one half of the final 'binary tree' joining two intermediate commitments. (I wrote this out not as a specification but to wave my hands at the shape and complexity of CommR. We can discuss it more if actually useful, but given all which has come before, I think it's best kept simple.)

@vmx
Member

vmx commented Apr 24, 2020

This is not true. The original data is not independently hashed. Rather, the original data are used as-is and form the leaves of a binary merkle tree.

For the purpose of "stretching the multicodec definition and thinking about how things could theoretically work", I thought it wouldn't matter how it actually works; we could just add a theoretical layer above the leaf nodes. But we can't, as it would then change the hashes :-/ This means that we do not know when we hit leaf nodes when we traverse the merkle tree.

I can see that we might still want to merge this. Though it violates my current view of what multicodecs are, hence I withdraw my approval.

@vmx vmx self-requested a review April 24, 2020 17:17
@mikeal
Contributor

mikeal commented Apr 24, 2020

@vmx I don’t share the view that blocks must be constructed this way in order to satisfy the definition of a multicodec.

I’ve been thinking a lot about the broader topic of “shadow graphs” (any parallel graph of additionally computed state of another graph) and regardless of how you build that graph, you’ll find yourself in a position where the original data has no parallel reference point inside the shadow graph and can only be found by following a traversal somewhere else.

In other words, not having in-block references between nodes inside of shadow graph branches and leaves is fine as long as you can theoretically build a linking structure over all the relevant data that could represent a complete graph. We aren’t storing that right now but it can be built, and it could not be built if we don’t allow these to be proper multicodecs.

We just need to keep in mind that, because the blocks themselves don’t have the necessary references, we cannot rely on the traversal and pinning systems we’ve built until you do build this larger “meta graph” that links all of them together. But that’s not enough to hold up adding the codec.

@hannahhoward

@rvagg so then it looks like my comment stands. But I don't know if that's a blocker. Just to be clear then this is not even theoretically a block/IPLD format that could be used with say Graphsync.

@hannahhoward

hannahhoward commented Apr 24, 2020

Again, if our goal is to transfer piece data as opposed to payload in a way that is incrementally verifiable over Graphsync/Bitswap, I believe the most optimized way to do that is through a separate IPLD format with a different CID (possibly derivable from PieceCid + PieceSize) which wraps piece data.

@hannahhoward

also I confirmed that absent a length of the underlying data (i.e. height of the tree), the commP does not uniquely identify a piece of filecoin data. So one option to consider is whether the codec itself should specify size of piece (probably in power of 2) -- which I think goes beyond the purpose of a CID, but would be the way to make it a unique identity.
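A small Python sketch of why the root alone is ambiguous, assuming leaves are 32-byte chunks used as-is and the node hash is the truncated SHA2-256 discussed in this thread. The leaf values are arbitrary placeholders and the helper names are hypothetical.

```python
import hashlib
from typing import List


def h(data: bytes) -> bytes:
    """Assumed node hash: SHA2-256 with the top two bits of the last byte zeroed."""
    digest = bytearray(hashlib.sha256(data).digest())
    digest[31] &= 0b00111111
    return bytes(digest)


def merkle_root(leaves: List[bytes]) -> bytes:
    layer = list(leaves)
    while len(layer) > 1:
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]


# A 128-byte "piece": four 32-byte leaves.
big_piece = [bytes([i]) * 32 for i in range(4)]

# A different, 64-byte "piece" whose two leaves happen to be the
# interior nodes of the tree above.
small_piece = [h(big_piece[0] + big_piece[1]), h(big_piece[2] + big_piece[3])]

root_big = merkle_root(big_piece)
root_small = merkle_root(small_piece)   # same root, different data and height
```

Both pieces yield the same root, so only the piece size (equivalently the tree height) tells them apart.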

@vmx
Member

vmx commented Apr 24, 2020

@mikeal's comment (#172 (comment)) wasn't clear to me. After a sync conversation the outcome for me was: you can traverse the tree, but you need out-of-band information to properly do that. That out-of-band information is the input data size (i.e. tree height).

@rvagg
Member Author

rvagg commented Apr 25, 2020

@porcuquine

This is not true. The original data is not independently hashed. Rather, the original data are used as-is and form the leaves of a binary merkle tree. @rvagg we talked about this once before, and you remembered that you had implemented this correctly for dumb drop, I believe.

(aside: what we discussed was the 254-bit truncation being consistent throughout the graph, hence the new multihash in #171).

You're right, I misspoke about them being hashed but they are still differentiated by size, so the base layer comes back as 32-byte chunks while every other layer is 64-byte. But that only changes my claim that the lowest level CIDs could be raw, you don't have that ability but it doesn't really matter, they get to be fil-commitment-unsealed but may be raw bytes or two CIDs.

In IPLD Schema language a fil-commitment-unsealed block can be defined roughly as:

type UnsealedCommitment union {
  | UnsealedCommitmentNode 64
  | UnsealedCommitmentLeaf 32
} representation length # I'm making this up, but it's a length-discriminated union

type UnsealedCommitmentNode struct {
  left &UnsealedCommitment
  right &UnsealedCommitment
}

type UnsealedCommitmentLeaf bytes

BUT I still think this is entirely academic for the purpose of FIL though, the graphs are impractically large to be identifying them in this way and the base data is impossible to reconstruct without a content-length context anyway. I find it difficult to imagine anyone writing a Loader or a codec that would implement these things as the reverse of what's done now. What we're doing here is an exercise in attempting to fit it into our current mental schema for multicodecs and CIDs.

If we wanted to be more free with our definitions, we could potentially flip in the future and say something like "these multicodecs identify the entirety of the Comm{X} merkle trees and the multihash identifies the hashing algorithm used within the merkle process, with the Data associated with the CID being the base, padded data". That would be a bit of a shift for multicodec, although @vmx noted yesterday that we already have 0xd6 being a binary merkle, so maybe not as big as we fear.

@rvagg
Member Author

rvagg commented Apr 25, 2020

IMO the only thing I'm seeing as a potential blocker to moving forward here is the combining of CommP and CommD, but I haven't seen any objections to that yet.

@rvagg
Member Author

rvagg commented Apr 25, 2020

Bah, yet again I wrote that too quickly and should have let it simmer in my head a bit longer.

OK, my picture of how this works is bunk.

If they're not hashed you can't turn them into CIDs like every other layer, fil-commitment-unsealed + sha2-256-trunc254-padded doesn't make sense when you get to the base layer's pieces because sha2-256-trunc254-padded isn't in the picture. You could make a loader for these pieces but it wouldn't be following standard loader rules of being able to check that the hash matches the data, they have to be identity for that to work but you don't have enough context to correctly make such CIDs ahead of time. Or you could have a loader that says "nope, doesn't load as a CID, maybe the data is in the CID itself", but that wouldn't be a very IPLD-friendly loader. Maybe that's OK? Or maybe we're just forced into a corner of having to say that these are identifying the entirety of a merkle tree with the multihash being the hash used within the nodes. This stretches multicodec & CID pretty far though, because you're not using the multihash to perform the hashing; this is purely an activity in identification.

@rvagg
Member Author

rvagg commented Apr 25, 2020

OK, there's two ways of framing that work for the current proposals:

  1. These CIDs represent the tip of a merkle tree with the associated hash function. You're not going to use it to retrieve an actual block and if you did then it probably wouldn't be useful or it'd be impractical to navigate to any base data (again, the base data is obscured anyway)
  2. These CIDs represent the entirety of a merkle tree, the notes column will probably need to be adjusted and using this framing we're kind of abusing both the multicodec and multihash - we're saying that the multicodec is the "hash function" applied to the base data and the multihash is the hash function used in each node.

So the second option pushes this further than some may be comfortable with, but maybe using multicodec as a generic identifier type catalogue is not a terrible thing. We're still dealing with content-addressed data at least, it's not like a mime type.

  3. Another option might be to rework these proposals and do something like fil-sha2-256-trunc254-padded-merkle and fil-poseidon....-merkle multihashes and say that the merkles are the hashing functions, and then use fil-commitment-sealed and fil-commitment-unsealed as (somewhat redundant) multicodecs just to make CIDs.

I think at this stage my preference is to go for option 1, it still works, a codec could still be theoretically written that would do a traversal, it just wouldn't get you anywhere helpful. But as I keep saying, I think the limits of practicality rule out such an implementation regardless, this is simply an identification and differentiation exercise.

@porcuquine

I have mostly avoided this part of the discussion, as I don't have a strong opinion and also am not fully indoctrinated into the context of the decision. Still, based on the above, let me take one shot at a position. Please bear in mind, that I know very little about CIDs, Codecs, IPLD, or how any of these are used in detail. My comments below may only be useful as abstract ideas about a system similar-to-but-different from the one we have. I'm writing them out partially just to think this through for myself, and partially in case this perspective is useful as we make any minor adjustments to what we have that prove useful or necessary. Based on the conversation above, it sounds like our definitions may not yet have stabilized in a way that deals with all needs yet anyway. Some of what I write below may overlap ideas expressed in the preceding discussion. I am not claiming anything here is novel, just trying to work through the ideas somewhat independently.

The process of producing a merkle root from base data, is special. It does create a duality in which the distinction between the 'original base data', and the 'immediate precursor hash inputs' as sources is made ambiguous. This ambiguity is not accidental. Rather, it's a direct byproduct of the function of merkle trees as structures from which proofs can be generated. The nature of these structures is such that they are receptive to — but do not directly require — extra annotation which elaborates their intended meaning. Specifically, if a merkle root is annotated with a height, this can be interpreted as a declaration that there must exist leaves at a certain depth, and that merkle inclusion proofs using that root (along with that annotation) must have the corresponding number of elements. However, the presence of such an annotation doesn't eliminate the possibility of shorter proofs used to demonstrate knowledge of some set of interior nodes.

In that sense, every root actually corresponds to a family of merkle trees. At the very least, every tree of height less than any specified height is also implicitly encoded by the same root. In fact, it is also the case that there are an infinite number of potentially larger trees of greater height associated with the same root. If my leaf data happens to have been an interior row of such a larger tree, I can even produce proofs that this root is definitely also the root of a known tree with height greater than it claims as its own.

Part of the ambiguity seems to come from the question of what we consider the 'hash function' to be. Let's consider SHA2, which uses a merkle-damgard construction to hash arbitrarily long sequences of input data. A SHA2 hash also has the property that — without explicit specification of input length — we cannot know whether a given input is the 'real' original data or a collision. To accomplish this, there is some internal function which is repeatedly applied in order to produce the final result.

What happens if we consider 'construction of a binary merkle tree using H as an internal function' to be a hash function on its own? This is conceptually like applying a merkle-damgard construction (minus padding considerations) to any binary hash function. Filecoin used to do exactly this to define our pedersen hash over arbitrarily-sized input, for example. What distinguishes the use of merkle-tree construction as the method of combining an inner hash is that it is intentionally prone to collisions according to a useful structure. In other words, assuming no height-based 'personalization', ambiguity of the original data's length allows incremental revelation of information. Throughout this note, I am only considering uniform trees which use an identical hash function at each row, with no extra height-based padding — although the idea could be extended to deal with some forms of such tagging also.

Consider the following alternate interpretation of a merkle inclusion proof. Instead of proving knowledge/possession of a specific leaf, the goal of the prover is to prove knowledge or existence of a tree with a given depth. Every root trivially corresponds to a tree of depth 0, and the proof is a zero-length path of complementary hashes, yielding the root (which is also the leaf, and to which zero hashes are applied). Given an arbitrary root, a verifier has no idea whether it was produced from a tree of a given height. In theory, even imaginary trees which were not used to construct the root can be invented, but this is assumed to be impossible given an adequate (for this purpose) hash function. A prover's goal is to demonstrate knowledge of the existence of a tree of some height. For example, to prove the existence of a (binary) tree of height 1, a proof presents the entire tree: the leaf, its complementary hash, and ordering information. This convinces a verifier that there exists a known tree of height 1 with the given root. Likewise, any verified inclusion proof of length n-1 proves the existence of a tree of height n with the given root. In this game, the uncertainty as to whether ever longer proofs can or will be provided is a feature, not a bug.
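
The game described above can be sketched as follows. This assumes plain SHA-256 as a stand-in node hash, and `verify` with its `(sibling, sibling_is_left)` path encoding is a hypothetical shape chosen for illustration:

```python
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    # Stand-in node hash (assumption): SHA-256 of the two children concatenated.
    return hashlib.sha256(left + right).digest()

def verify(root: bytes, leaf: bytes, path: list[tuple[bytes, bool]]) -> bool:
    # path holds (sibling, sibling_is_left) pairs, from the leaf row upward.
    node = leaf
    for sibling, sibling_is_left in path:
        node = h(sibling, node) if sibling_is_left else h(node, sibling)
    return node == root

a, b, c, d = (bytes([i]) * 32 for i in range(4))
ab, cd = h(a, b), h(c, d)
root = h(ab, cd)

# Zero-length path: the root is its own leaf.
assert verify(root, root, [])
# One complementary hash: the verifier learns the root has two known children.
assert verify(root, ab, [(cd, False)])
# Two complementary hashes: two rows are known to exist beneath the root.
assert verify(root, c, [(d, False), (ab, True)])
```

Each successful verification only raises the lower bound on the height of a tree known to exist under the root. It never rules out a longer proof arriving later.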

The point is that in the absence of length data internal to the hash function, length ambiguity is always possible. In the specific case of CommP and CommD, this ambiguity is a feature, not a bug. It means a storage client can compute CommP of his own piece and later verify that it has been packed into a sector with commitment CommD, without needing to know in advance either the size of the eventual sector or the content of any other data which might be packed in the same sector. This is not only a matter of convenience, but also of capability. If the size of the eventual tree must be encoded in CommP, then the same piece cannot be packed into sectors of different sizes without changing its CommP.
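
As a toy illustration of the packing argument, assuming plain SHA-256 in place of the real hash and ignoring fr32 padding entirely (`comm_p`, `comm_d_small` and `comm_d_large` are hypothetical names):

```python
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    # Stand-in node hash (assumption): SHA-256 of the two children concatenated.
    return hashlib.sha256(left + right).digest()

def merkle_root(row: list[bytes]) -> bytes:
    # Assumes len(row) is a power of two.
    while len(row) > 1:
        row = [h(row[i], row[i + 1]) for i in range(0, len(row), 2)]
    return row[0]

piece = [bytes([0xA0 + i]) * 32 for i in range(4)]  # the client's data
other = [bytes([0xB0 + i]) * 32 for i in range(4)]  # data the client never sees
filler = [bytes(32)] * 8                            # more unrelated data

comm_p = merkle_root(piece)  # computed once, with no knowledge of any sector

# Packed into an 8-leaf sector: one sibling hash links comm_p to the sector root.
comm_d_small = merkle_root(piece + other)
assert h(comm_p, merkle_root(other)) == comm_d_small

# The same comm_p packed into a 16-leaf sector: two sibling hashes suffice.
comm_d_large = merkle_root(piece + other + filler)
assert h(h(comm_p, merkle_root(other)), merkle_root(filler)) == comm_d_large
```

In both cases the client only needs the sibling hashes along the path, never the neighbouring data or the sector size in advance.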

Of course, the reverse is possible: CommP could force knowledge of the length of its own base data (by encoding height information into the hash function), as could CommD. But this is not strictly necessary, as long as inclusion proofs are checked in a context in which explicit knowledge of expected size is provided.

All this is to say: I think there is a consistent point of view from which generation of an N-ary merkle tree can:

  • be viewed as a (kind of) hash in its own right, just one lacking some properties desirable of cryptographic hashes generally, but having others desirable for use as a proof vehicle;
  • include the case of a single N-ary hash using the specified atomic hash function; and
  • consistently correspond to a potentially infinite set of base data (each larger than the last by a factor of N).

So, to be concrete, an fc-poseidon-treehash-2 (made-up name) root can be decoded to a text of length 64 bytes, or one of length 128, 256, 512, etc. Whether the length needs to be specified depends on the context, just as with any other hash function used to summarize data.
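
A small sketch of that decoding ambiguity for a binary tree with 32-byte nodes. SHA-256 stands in for the real hash, and `collapses_to` is a hypothetical helper:

```python
import hashlib

NODE = 32  # bytes per node

def h(left: bytes, right: bytes) -> bytes:
    # Stand-in node hash (assumption): SHA-256 of the two children concatenated.
    return hashlib.sha256(left + right).digest()

def collapses_to(root: bytes, data: bytes) -> bool:
    # Split data into 32-byte nodes and hash pairwise up to a single node.
    row = [data[i:i + NODE] for i in range(0, len(data), NODE)]
    n = len(row)
    if len(data) % NODE or n < 2 or n & (n - 1):
        return False  # need exactly 2, 4, 8, ... whole nodes
    while len(row) > 1:
        row = [h(row[i], row[i + 1]) for i in range(0, len(row), 2)]
    return row[0] == root

base = bytes(range(256))  # 256 bytes of made-up base data
row128 = b"".join(h(base[i:i + 32], base[i + 32:i + 64]) for i in range(0, 256, 64))
row64 = b"".join(h(row128[i:i + 32], row128[i + 32:i + 64]) for i in range(0, 128, 64))
root = h(row64[:32], row64[32:])

# Texts of 64, 128 and 256 bytes are all valid decodings of the same root.
assert all(collapses_to(root, text) for text in (row64, row128, base))
```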

The problem is that if a CID needs to uniquely map to a specific base result, we are out of luck. I don't know enough about IPLD to know how traversal of recursive structures works (if at all), but it seems to me that this structure does not need to be inconsistent with realistic uses. (Again, note that I'm ignorant of how things currently work, so this may not make sense in that context.)

Is there a hard requirement that CIDs must correspond to collision-resistant hashes? If so, how is this future-proofed? That is, do we invalidate all past CIDs of a given type if a collision is ever detected? If not, then we must have a mechanism for dealing with potential collisions (even if expected to be very rare in the short-term future).

What I am proposing here is that we accept a type of CID with many expected collisions of an expected structure. If we can do that, everything becomes simple. For example, if I do have the context of a known content length, I can search for content of that length or any shorter length. If the final expansion is not discoverable, I can repeat the process recursively. This encompasses both extremes: if cached and available, I can find the specific data I need (if I know content-length) in one hop. Or, in the worst case, I have to traverse the entire tree, decoding each row's hashes in turn. (Although Filecoin is not one, there are imaginable use cases for this behavior.) In between, recovery of the original data might take any number of hops, depending on which cached expansions are available to me. But if I search exhaustively, I will eventually reach the original data if it exists, and if this is done cleverly, I can hope to do so efficiently.
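
A rough sketch of that search, under several assumptions: SHA-256 stands in for the node hash, cached expansions live in a plain dict keyed by `(digest, length)`, and blocks from the cache are trusted rather than re-verified. `resolve` and the store layout are hypothetical, not an existing IPLD API:

```python
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    # Stand-in node hash (assumption): SHA-256 of the two children concatenated.
    return hashlib.sha256(left + right).digest()

def resolve(digest: bytes, want_len: int, store: dict):
    # Return (data, hops) where data has want_len bytes, or None if unreachable.
    # want_len is assumed to be 32 * 2**k.
    if want_len == 32:
        return digest, 0
    length = want_len
    while length >= 64:  # try the longest cached expansion first
        block = store.get((digest, length))
        if block is not None:
            nodes = [block[i:i + 32] for i in range(0, length, 32)]
            per_node = want_len // len(nodes)
            parts, hops = [], 1
            for node in nodes:
                sub = resolve(node, per_node, store)
                if sub is None:
                    break
                parts.append(sub[0])
                hops += sub[1]
            else:
                return b"".join(parts), hops
        length //= 2
    return None

base = bytes(range(128))
lo, hi = base[:64], base[64:]
row64 = h(lo[:32], lo[32:]) + h(hi[:32], hi[32:])
root = h(row64[:32], row64[32:])

cached_fully = {(root, 128): base}
cached_by_row = {(root, 64): row64, (row64[:32], 64): lo, (row64[32:], 64): hi}

assert resolve(root, 128, cached_fully) == (base, 1)   # one hop
assert resolve(root, 128, cached_by_row) == (base, 3)  # row-by-row traversal
```

A real implementation would also check that each retrieved block actually hashes back to the digest it was looked up under.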

Interestingly, this also provides a mechanism for formalizing merkle inclusion proofs as first-class citizens of the ecosystem. A merkle-inclusion proof of a given leaf + index within base data of a specified size can then be specified as the minimal set of verifiable CID -> content mappings required to retrieve the leaf from the root via the path.

In this context, I define a verifiable mapping as one in which the CID can be directly generated from the input data, and the target (leaf or intermediate node) is verified to exist at the specified position within the input data.
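
One way to picture a proof as a chain of verifiable mappings, assuming a binary tree, SHA-256 as the stand-in atomic hash, and a hypothetical `verify_mappings` helper that consumes index bits from the most significant end:

```python
import hashlib

def digest(block: bytes) -> bytes:
    # One application of the stand-in atomic hash (assumption) to a 64-byte block.
    return hashlib.sha256(block).digest()

def verify_mappings(root: bytes, blocks: list[bytes], index: int, leaf: bytes) -> bool:
    # blocks[0] is the root's 64-byte expansion; each later block expands the
    # child selected by the next bit of index (most significant bit first).
    node = root
    depth = len(blocks)
    for level, block in enumerate(blocks):
        if len(block) != 64 or digest(block) != node:
            return False  # the mapping node -> block is not verifiable
        bit = (index >> (depth - 1 - level)) & 1
        node = block[32 * bit:32 * bit + 32]
    return node == leaf

leaves = [bytes([i]) * 32 for i in range(4)]
left, right = leaves[0] + leaves[1], leaves[2] + leaves[3]
top = digest(left) + digest(right)
root = digest(top)

# Leaf 2 is reached via two mappings: root -> top, then digest(right) -> right.
assert verify_mappings(root, [top, right], 2, leaves[2])
```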

Having a uniform and universal specification of the information content (as opposed to byte structure) of merkle inclusion proofs would be valuable. It would allow for more general tooling for specification, implementation, and verification of the correctness of implementation based on specification. If current structure doesn't already provide this directly, then this is in itself a good motivation for adopting something like what I've described. It might be the case that this is best accomplished by adding some new entity with the described properties, if the existing types (CID, Codec) cannot be made consistent with the usage I propose. I also might be missing something. I've thought quite a bit about this last topic (structure, specification, implementation, verification of specification-implementation) of inclusion proofs — but almost not at all (previously) about how this relates to content-addressable data as such.

@porcuquine

porcuquine commented Apr 25, 2020

> In this context, I define a verifiable mapping as one in which the CID can be directly generated from the input data, and the target (leaf or intermediate node) is verified to exist at the specified position within the input data.

This was kind of vague. What I meant by 'directly generated' was 'using one application of the inner/atomic hash function'.

That said, since the CID can always be generated by building a complete tree, you can also think of presentation of the entire original data (or an entire subtree row at any point of the path) as also being a valid merkle proof — just not a minimal one. This is probably actually a useful distinction. It would be good to have a canonical minimal inclusion proof, as well as recognition that non-minimal proofs are possible and verifiable. A minimal proof can always be constructed from a non-minimal one, and the existence of the minimal form also allows for a (not necessarily unique — but deterministically specifiable) canonical binary form for proofs (which I consider highly desirable for other reasons).
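
A sketch of deriving the minimal form from a non-minimal one (here, the entire leaf row). SHA-256 is a stand-in hash, and `minimal_proof` and `verify` are hypothetical helpers:

```python
import hashlib

def h(left: bytes, right: bytes) -> bytes:
    # Stand-in node hash (assumption): SHA-256 of the two children concatenated.
    return hashlib.sha256(left + right).digest()

def minimal_proof(leaves: list[bytes], index: int):
    # Walk up from the full leaf row, keeping only the sibling at each row.
    # Assumes len(leaves) is a power of two.
    path, row = [], list(leaves)
    while len(row) > 1:
        path.append(row[index ^ 1])
        row = [h(row[i], row[i + 1]) for i in range(0, len(row), 2)]
        index //= 2
    return path, row[0]

def verify(root: bytes, leaf: bytes, index: int, path: list[bytes]) -> bool:
    # The low bit of index says whether the current node is a right child.
    node = leaf
    for sibling in path:
        node = h(sibling, node) if index & 1 else h(node, sibling)
        index //= 2
    return node == root

leaves = [bytes([i]) * 32 for i in range(8)]
path, root = minimal_proof(leaves, 5)

# 8 leaves shrink to 3 sibling hashes, and the proof still verifies.
assert len(path) == 3 and verify(root, leaves[5], 5, path)
```

Concatenating the index with the sibling hashes in path order would be one deterministic candidate for a canonical byte encoding, though that choice is only an assumption here.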

@vmx
Member

vmx commented Apr 27, 2020

After reading @porcuquine's excellent comment, I come to the conclusion that we should (intentionally) not distinguish whether we hash the base data or some inner merkle tree node. To me this means that the codec for the nodes of such a merkle tree should be raw, hence no new codec.

This would (kind-of) downgrade the CID into a multihash for the purpose of CommD, CommP and CommR.

@rvagg
Member Author

rvagg commented Apr 28, 2020

@vmx and I will have a chat about this 1:1 to try and move this forward. I'm recording my thoughts as they are now, maybe this will change:

> I come to the conclusion that we should (intentionally) not distinguish whether we hash the base data or some inner merkle tree node

Yes, but:

> To me this means that the codec for the nodes of such a merkle tree should be raw, hence no new codec.

I don't know if this is the next logical step. It may be the right step if we were treating this as strictly an exercise in defining a mapping to an IPLD model of the world, which I've been trying hard to do but apparently failing. It doesn't seem like that's appropriate if this is all about inclusion proofs.

We're back to a simple (?) need for classification of an identifier, something CIDs should be good at. Using these 2 new codecs, and tagging their entries as filecoin rather than ipld, might give us the differentiation from classic IPLD usage that we need to handle the ambiguity here.

Combined with the new multihashes, we'd get CIDs that say:

  • This is a point in a merkle tree for unsealed filecoin sectors, sha2-256-trunc254-padded was used to get here; or:
  • This is a point in a merkle tree for sealed filecoin sectors, poseidon-bls12_381-a2-fc1 was used to get here

Then you get differentiation along two dimensions:

  1. What is this thing I'm holding and what might it yield if I were to give it to the right API? (Differentiation between sealed and unsealed seems important.)
  2. What hash function was involved in getting to here from some starting point? (Being able to differentiate when the hash function changes at a future date seems important.)

(Granted that you could also get that same differentiation with a single multihash that packed everything into it - filecoin-unsealed-merkle-1, but I don't know that we need that level of indirection if we can pack more information into the CID up front.)
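
To make the two dimensions concrete, here is a sketch of how a CIDv1 byte string carries both a codec and a multihash code. The numeric codes and the codec constant names below are assumptions for illustration and should be checked against the multicodec table. The digest is a placeholder, not a real commitment:

```python
def varint(n: int) -> bytes:
    # Unsigned LEB128 varint, as used by multiformats.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# Assumed table entries (illustrative; verify against the multicodec table).
FIL_COMMITMENT_UNSEALED = 0xF101    # dimension 1: what this thing is
FIL_COMMITMENT_SEALED = 0xF102
SHA2_256_TRUNC254_PADDED = 0x1012   # dimension 2: which hash got us here
POSEIDON_BLS12_381_A2_FC1 = 0xB401

def cid_v1(codec: int, hash_code: int, digest: bytes) -> bytes:
    # <version><codec><multihash>, multihash = <hash code><digest length><digest>
    multihash = varint(hash_code) + varint(len(digest)) + digest
    return varint(1) + varint(codec) + multihash

commitment = bytes(32)  # placeholder 32-byte commitment
unsealed = cid_v1(FIL_COMMITMENT_UNSEALED, SHA2_256_TRUNC254_PADDED, commitment)
sealed = cid_v1(FIL_COMMITMENT_SEALED, POSEIDON_BLS12_381_A2_FC1, commitment)

# Same digest bytes, but the prefixes keep the two kinds of commitment apart.
assert unsealed != sealed
```

Changing the hash function at a later date would only change the multihash code, leaving the codec (and so the sealed/unsealed classification) untouched.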

In an IPLD world, you might also be able to get a corresponding byte array that could be used to verify the CID is correct, but that classic IPLD usage is not quite how this will be used and we should probably stop trying to stretch it there.

What you don't get is the classic IPLD "codec" sense of "I know how to decode this byte array into something useful". But that's probably OK if we don't pretend this is IPLD.

@rvagg
Member Author

rvagg commented Apr 29, 2020

Edited the comment text to be clearer:

Filecoin piece or sector data commitment merkle node/root (CommP & CommD)
&
Filecoin sector data commitment merkle node/root - sealed and replicated (CommR)

From our chat yesterday, I think we're ready to move forward. I'll comment further over in #161.

Thanks @porcuquine for your patience and excellent description. Both @vmx and I enjoyed your last write-up and learned a bunch from it.

@porcuquine

I'm glad to have been able to participate in the discussion. You all are thinking this through carefully, which I appreciate. Thank you for not letting my partially-formed ideas distract from resolving the immediate issues.

Member

@vmx vmx left a comment


As @rvagg mentions over in #161 (comment), I still don't understand why we need a different codec for CommD/C and CommR (and not just a single one). That said, I trust the Filecoin team that this is needed for some reason I haven't heard or understood yet, hence I'm approving this PR.

These describe roots & nodes of a merkle tree, not the underlying
data. In the case of CommP and CommD they are binary merkle trees
using sha2-256-trunc2. For CommR they are novel structure merkle
trees using poseidon-bls12_381-a2-fc1.

All nodes of the respective merkle trees could also be described
using this codec if required, all the way to base data. It is
anticipated that the primary use will be restricted to the roots.

Ref: #161
Closes: #161
Closes: #167