RFC|BB|L2-09/10 Handling arbitrarily large blocks and Treating files as large blocks #29
base: master
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
# RFC|BB|L2-08: Handle Arbitrary Block Sizes | ||
|
||
* Status: `Brainstorm` | ||
|
||
## Abstract | ||
|
||
This RFC proposes adding a new type of data exchange to Bitswap for handling blocks of data arbitrarily larger than the 1MiB limit by using the features of common hash functions that allow for pausing and then resuming the hashes of large objects. | ||
|
||
## Shortcomings | ||
|
||
Bitswap has a maximum block size of 1MiB which means that it cannot transfer all forms of content addressed data. A prominent example of this is Git repos which even though they can be represented as a content addressed IPLD graph cannot necessarily be transferred over Bitswap if any of the objects in the repo exceed 1MiB. | ||
|
||
## Description | ||
|
||
Most widely used hash functions work by splitting the input data `D` into `n` pieces `P_0...P_n-1` and then updating an internal state `S` as each piece is loaded into the hash function in turn. This means that there are points during hashing where we can pause processing and capture the state of the hash function so far. Bitswap can use this state to effectively break up large blocks into smaller ones.
|
||
### Example: Merkle–Damgård constructions like SHA-1 or SHA-2 | ||
|
||
MD pseudo-code looks roughly like: | ||
|
||
```golang
// getChunks, process, and finalize stand in for the hash function's
// input splitting, compression step, and output transformation.
func Hash(D []byte) []byte {
	pieces := getChunks(D)

	var S state
	for _, p := range pieces {
		S = process(S, p) // Call this result S_i
	}

	return finalize(S) // Call this H, the final hash
}
```
|
||
From the above we can see that: | ||
|
||
1. At any point in the process of hashing D we could stop, say after piece `j`, save the state `S_j` and then resume later | ||
2. We can always calculate the final hash `H` given only `S_j` and all the pieces `P_j+1..P_n-1` | ||
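
The two observations above can be tried out in Go, where the hasher returned by `crypto/sha256` implements `encoding.BinaryMarshaler` and `encoding.BinaryUnmarshaler`, so its internal state can be saved and restored. This is only a sketch of the pause/resume idea: the helper names `saveState`, `resumeAndFinish`, and `fullHash` and the split point are illustrative assumptions, not part of the proposal.

```go
package main

import (
	"crypto/sha256"
	"encoding"
	"encoding/hex"
	"errors"
	"fmt"
)

// saveState hashes prefix and returns the serialized internal state (S_j).
func saveState(prefix []byte) ([]byte, error) {
	h := sha256.New()
	h.Write(prefix)
	m, ok := h.(encoding.BinaryMarshaler)
	if !ok {
		return nil, errors.New("hash state cannot be serialized")
	}
	return m.MarshalBinary()
}

// resumeAndFinish restores a saved state, absorbs the remaining data and
// returns the final digest H as a hex string.
func resumeAndFinish(state, rest []byte) (string, error) {
	h := sha256.New()
	u, ok := h.(encoding.BinaryUnmarshaler)
	if !ok {
		return "", errors.New("hash state cannot be restored")
	}
	if err := u.UnmarshalBinary(state); err != nil {
		return "", err
	}
	h.Write(rest)
	return hex.EncodeToString(h.Sum(nil)), nil
}

// fullHash hashes data in one pass, for comparison.
func fullHash(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	data := []byte("some data that is hashed in two separate steps")
	state, err := saveState(data[:20]) // pause after the first 20 bytes
	if err != nil {
		panic(err)
	}
	resumed, err := resumeAndFinish(state, data[20:]) // resume later
	if err != nil {
		panic(err)
	}
	fmt.Println("resumed digest matches:", resumed == fullHash(data))
}
```

Go's serialized state also carries any partially filled input buffer, which is why the split point in this sketch does not need to fall on a 64-byte boundary.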
|
||
The implication for Bitswap is that if each piece is no larger than 1MiB then we can send the file **backwards** in increments of at most 1MiB. In particular, a server can send `(S_n-2, P_n-1)` and the client can use that to verify that `P_n-1` is in fact the last part of the data associated with the final hash `H`. The server can then send `(S_n-3, P_n-2)` and the client can verify that `P_n-2` is the last piece absorbed into `S_n-2` and therefore also the second to last piece of `H`, and so on.
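
A minimal sketch of this backward verification, assuming SHA-256 and using the serialized state that Go's `crypto/sha256` exposes through `MarshalBinary`/`UnmarshalBinary` as a stand-in for `S_i`. The names `serverStates`, `verifyBackward`, `absorb`, and `digestOf` are hypothetical, and a real exchange would verify each `(S_i-1, P_i)` pair as it arrives rather than from in-memory slices.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding"
	"fmt"
)

// resumableHash is the behaviour needed to save and restore hash state.
type resumableHash interface {
	Write(p []byte) (int, error)
	Sum(b []byte) []byte
	encoding.BinaryMarshaler
	encoding.BinaryUnmarshaler
}

// absorb restores state (nil means the initial state) and feeds in piece.
func absorb(state, piece []byte) (resumableHash, error) {
	h := sha256.New().(resumableHash)
	if state != nil {
		if err := h.UnmarshalBinary(state); err != nil {
			return nil, err
		}
	}
	h.Write(piece)
	return h, nil
}

// serverStates returns S_0..S_n-2, where S_i is the state after piece P_i.
func serverStates(pieces [][]byte) ([][]byte, error) {
	if len(pieces) == 0 {
		return nil, nil
	}
	h := sha256.New().(resumableHash)
	states := make([][]byte, 0, len(pieces)-1)
	for _, p := range pieces[:len(pieces)-1] {
		h.Write(p)
		s, err := h.MarshalBinary()
		if err != nil {
			return nil, err
		}
		states = append(states, s)
	}
	return states, nil
}

// verifyBackward checks pieces from last to first. The first target is the
// digest H; each later target is the state that the previous step started from.
func verifyBackward(want [32]byte, pieces, states [][]byte) (bool, error) {
	n := len(pieces)
	if n == 0 || len(states) != n-1 {
		return false, nil
	}
	target := want[:]
	for i := n - 1; i >= 0; i-- {
		var prev []byte // nil: P_0 must start from the real initial state
		if i > 0 {
			prev = states[i-1]
		}
		h, err := absorb(prev, pieces[i])
		if err != nil {
			return false, err
		}
		got := h.Sum(nil)
		if i < n-1 {
			if got, err = h.MarshalBinary(); err != nil {
				return false, err
			}
		}
		if !bytes.Equal(got, target) {
			return false, nil
		}
		target = prev
	}
	return true, nil
}

// digestOf hashes the concatenation of all pieces in one pass.
func digestOf(pieces [][]byte) [32]byte {
	return sha256.Sum256(bytes.Join(pieces, nil))
}

func main() {
	pieces := [][]byte{[]byte("first piece "), []byte("second piece "), []byte("last piece")}
	states, err := serverStates(pieces)
	if err != nil {
		panic(err)
	}
	ok, err := verifyBackward(digestOf(pieces), pieces, states)
	fmt.Println("chain verified:", ok, err)
}
```

Note that the final step insists on the hash function's real initial state rather than a state supplied by the peer; without that check the chain would never be anchored to the start of the data.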
|
||
#### Extension | ||
|
||
This scheme requires downloading a file linearly, which is quite slow with even modest latencies. However, utilizing a scheme like [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) (i.e. downloading metadata manifests up front) we can make this fast/parallelizable.
|
||
#### Security | ||
|
||
In order for this scheme to be secure it must be true that there is only a single pair `(S_i-1, P_i)` that can be produced to match with `S_i`. If the pair must be of the form `(S_i-1, P_malicious)` then this is certainly true since otherwise one could create a collision on the overall hash function. However, given that there are two parameters to vary it seems possible this could be computationally easier than finding a collision on the overall hash function. | ||
I'd like some 👀 on this ideally from people who are more practiced with this type of cryptanalysis than I am.

I'm pretty sure that finding any kind of

After some poking around (thanks @Stebalien for pointers) it seems that as long as the underlying compression function is not subject to freestart collisions then we should be fine and if not then things become trickier. My understanding from this paper and its corresponding thesis is that SHA-256 is not yet subject to freestart collisions.

Even if we were subject to freestart collisions things may not necessarily be so bad since the attacker would also need to be the creator of the file and would be able to selectively give some people the data and other people would not get different data, but instead just waste some bandwidth and other resources which on its face doesn't seem like a super worthwhile attack. If so then what we're really trying to avoid here is approximately a pseudo-second-preimage attack on the compressor function (close to the Definition 7 here). My understanding is that this would be even harder for an attacker to pull off and might even be reasonably safe for functions like SHA-1 which are no longer collision resistant (although pseudo-preimage attacks may of course be easier to pull off than full preimage attacks).

@dbaarda thanks for the feedback, it does seem like this is probably ok. However, I do think it's a little more subtle than there are no collisions on SHA-2 implying there are no issues in this scheme.

My understanding is that this indicates a collision on the compressor function, but not on the overall hash function since a hash collision is that given some starting state IV

Ah, I did not know "freestart collisions" had a name; I'm glad it does!
||
|
||
#### SHA-3 | ||
|
||
While SHA-3 is not a Merkle–Damgård construction, it follows the same pseudocode structure as above.
|
||
### Example: Tree constructions like Blake3, Kangaroo-Twelve, or ParallelHash | ||
|
||
In tree constructions we are not restricted to downloading the file backwards and can instead download just the parts of the file that we are looking for, which includes downloading the file forwards for sequential streaming.
|
||
There is detail about how to do this for Blake3 in the [Blake3 paper](https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf) section 6.4, Verified Streaming | ||
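
To illustrate why a tree construction allows verifying pieces in any order, here is a deliberately simplified sketch: a plain binary Merkle tree over chunks, built with SHA-256. It is not BLAKE3's actual tree mode or chunk layout; the domain-separation bytes, the power-of-two chunk count, and all function names are assumptions made for the example.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// leafHash and nodeHash use different prefixes so a leaf can never be
// confused with an interior node.
func leafHash(chunk []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0})
	h.Write(chunk)
	return h.Sum(nil)
}

func nodeHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{1})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

// buildLevels returns every level of the tree, leaves first and root last.
// Assumption: len(chunks) is a power of two.
func buildLevels(chunks [][]byte) [][][]byte {
	level := make([][]byte, len(chunks))
	for i, c := range chunks {
		level[i] = leafHash(c)
	}
	levels := [][][]byte{level}
	for len(level) > 1 {
		next := make([][]byte, len(level)/2)
		for i := range next {
			next[i] = nodeHash(level[2*i], level[2*i+1])
		}
		levels = append(levels, next)
		level = next
	}
	return levels
}

// proofFor collects the sibling hashes on the path from a leaf to the root.
func proofFor(levels [][][]byte, index int) [][]byte {
	var proof [][]byte
	for _, level := range levels[:len(levels)-1] {
		proof = append(proof, level[index^1])
		index /= 2
	}
	return proof
}

// verifyChunk recomputes the root from one chunk and its sibling hashes.
func verifyChunk(root, chunk []byte, index int, proof [][]byte) bool {
	h := leafHash(chunk)
	for _, sibling := range proof {
		if index%2 == 0 {
			h = nodeHash(h, sibling)
		} else {
			h = nodeHash(sibling, h)
		}
		index /= 2
	}
	return bytes.Equal(h, root)
}

func main() {
	chunks := [][]byte{[]byte("c0"), []byte("c1"), []byte("c2"), []byte("c3")}
	levels := buildLevels(chunks)
	root := levels[len(levels)-1][0]
	// Chunk 2 is verified on its own, without chunks 0, 1 or 3.
	fmt.Println(verifyChunk(root, chunks[2], 2, proofFor(levels, 2)))
}
```

Because each chunk is checked against the root using only a logarithmic number of sibling hashes, a client can request chunks forwards, backwards, or out of order.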
|
||
### Implementation Plan | ||
|
||
#### Bitswap changes | ||
|
||
* When a server responds to a request for a block if the block is too large then instead send a traversal order list of the block as defined by the particular hash function used (e.g. linear and backwards for SHA-1,2,3) | ||
* Large Manifests | ||
* If the list is more than 1MiB long then only send the first 1MiB along with an indicator that the manifest is not complete | ||
* When the client is ready to process more of the manifest then it can send a request WANT_LARGE_BLOCK_MANIFEST containing the multihash of the entire large block and the last hash in the manifest | ||
How about: Instead of special-casing the manifest file (and having to deal with large vs small manifests), recursively treat the manifest as a downloadable artifact: If the manifest is small (<1MB), send the whole manifest in the response, otherwise send the manifest of the manifest.
||
* When requesting subblocks send requests as `(full block multihash, start index, end index)` | ||
* process subblock responses separately from full block responses verifying the results as they come in | ||
* As in [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) specify how much trust goes into a given manifest, examples include | ||
* download at most 20 unverified blocks at a time from a given manifest | ||
* grow trust geometrically (e.g. 10 blocks, then if those are good 20, 40, ...) | ||
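
The geometric trust policy and the sub-block request tuple could be sketched as follows. The struct fields, the half-open index range, the cap on the window, and the function `planRequests` are all assumptions for illustration; the sketch also assumes every batch verifies successfully and does not model the backward traversal order that SHA-1/2/3 would require.

```go
package main

import "fmt"

// SubblockRequest mirrors the (full block multihash, start index, end index)
// tuple. Field names and the half-open range [Start, End) are hypothetical.
type SubblockRequest struct {
	FullBlockMultihash []byte
	Start, End         int
}

// planRequests splits total sub-blocks into successive batches. The number of
// unverified sub-blocks allowed per batch starts at initial and doubles after
// each batch, up to maxWindow.
func planRequests(mh []byte, total, initial, maxWindow int) []SubblockRequest {
	if initial < 1 {
		initial = 1
	}
	window := initial
	var reqs []SubblockRequest
	for start := 0; start < total; {
		end := start + window
		if end > total {
			end = total
		}
		reqs = append(reqs, SubblockRequest{FullBlockMultihash: mh, Start: start, End: end})
		start = end
		// Grow trust geometrically once the batch has verified.
		window *= 2
		if window > maxWindow {
			window = maxWindow
		}
	}
	return reqs
}

func main() {
	for _, r := range planRequests([]byte("example-multihash"), 100, 10, 1000) {
		fmt.Printf("request sub-blocks [%d, %d)\n", r.Start, r.End)
	}
}
```

A client that receives a sub-block failing verification would presumably stop trusting the manifest rather than continue growing the window; that failure path is left out here.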
|
||
#### Datastore | ||
|
||
* Servers should cache/store a particular chunking for the traversal that is defined by the implementation for the particular hash function (e.g. 256 KiB segments for SHA-2) | ||
* Once clients receive the full block they should process it and store the chunking, reusing the work from validating the block | ||
* Clients and servers should have a way of aliasing large blocks as a concatenated set of smaller blocks | ||
* Need to quarantine subblocks until the full block is verified as in [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) | ||
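
One way to picture the aliasing of a large block as a concatenated set of smaller blocks is the in-memory sketch below. The `blockstore` map, the `alias` type, and hex-encoded SHA-256 keys are stand-ins chosen for the example; they are not the go-ipfs datastore API, and quarantining of unverified sub-blocks is not modeled.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// blockstore is an in-memory stand-in, keyed by hex SHA-256 of each block.
type blockstore map[string][]byte

// alias records a large block as an ordered list of small-block keys.
type alias struct {
	Parts []string
}

func key(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// addLarge stores data as small blocks and registers an alias under the
// hash of the whole block. It returns the large block's key.
func addLarge(bs blockstore, aliases map[string]alias, data []byte, chunkSize int) string {
	if chunkSize < 1 {
		chunkSize = 1
	}
	var a alias
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		k := key(data[start:end])
		bs[k] = data[start:end]
		a.Parts = append(a.Parts, k)
	}
	largeKey := key(data)
	aliases[largeKey] = a
	return largeKey
}

// readLarge reassembles the large block from its parts and checks that the
// concatenation really hashes to the key it was requested under.
func readLarge(bs blockstore, aliases map[string]alias, largeKey string) ([]byte, error) {
	a, ok := aliases[largeKey]
	if !ok {
		return nil, fmt.Errorf("no alias for %s", largeKey)
	}
	readers := make([]io.Reader, 0, len(a.Parts))
	for _, k := range a.Parts {
		part, ok := bs[k]
		if !ok {
			return nil, fmt.Errorf("missing sub-block %s", k)
		}
		readers = append(readers, bytes.NewReader(part))
	}
	data, err := io.ReadAll(io.MultiReader(readers...))
	if err != nil {
		return nil, err
	}
	if key(data) != largeKey {
		return nil, fmt.Errorf("alias does not hash to %s", largeKey)
	}
	return data, nil
}

func main() {
	bs, aliases := blockstore{}, map[string]alias{}
	k := addLarge(bs, aliases, []byte("0123456789"), 4)
	data, err := readLarge(bs, aliases, k)
	fmt.Println(string(data), err, len(bs))
}
```

Keying the small blocks by their own hash means identical chunks shared between large blocks are stored once, while the alias only adds a list of keys.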
|
||
#### Hash function support | ||
|
||
* Add support for SHA-1/2 (should be very close to the same) | ||
sha 1 is deprecated / not-recommended at this point. It seems unclear it's valuable or safe to support it. why do we want to?

"git"

Mostly for Git support, however with Git eventually moving to SHA-2 if it turned out SHA-1 was unworkable we could probably deal with it.

I would strongly prefer if support for this for sha-1 is opt-in, if not on by default. See https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html which I propose adding in multiformats/multicodec#203 for why.

By the time this is usable, SHA-256 support in Git will likely be stabilized anyway, given it's already implemented AFAICT, so I don't see the point in making it opt-out.
||
* Make it possible for people to register new hash functions locally, but some should be built into the protocol | ||
|
||
## Evaluation Plan | ||
|
||
* IPFS file transfer benchmarks as in [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) | ||
|
||
## Prior Work | ||
|
||
* This proposal is almost identical to the one @Stebalien proposed [here](https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/6) | ||
* Utilizes overlapping principles with [RFC|BB|L2 - Speed up Bitswap with GraphSync CID only query](https://github.com/protocol/beyond-bitswap/issues/25) | ||
|
||
### Alternatives | ||
|
||
An alternative way to deal with this problem would be a succinct and efficient cryptographic proof showing the equivalence of two different DAG structures under some constraints. For example, showing that a single large block with a SHA-2 hash is equivalent to a tree whose concatenated leaf nodes give the single large block.
|
||
### References | ||
|
||
This was largely taken from [this draft](https://hackmd.io/@adin/sha256-dag) | ||
|
||
## Results | ||
|
||
## Future Work |
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,45 @@ | ||||||||||||
# RFC|BB|L2-10: UnixFS files identified using hash of the full content | ||||||||||||
cc @ribasushi I feel like you may have some thoughts on this 😄
||||||||||||
|
||||||||||||
* Status: `Brainstorm` | ||||||||||||
|
||||||||||||
## Abstract | ||||||||||||
|
||||||||||||
This RFC proposes that for UnixFS files we allow for downloading data using a CID corresponding to the hash of the entire file instead of just the CID of the particular UnixFS DAG (tree width, chunking, internal node hash function, etc.). | ||||||||||||
|
||||||||||||
Note: This is really more about IPFS than Bitswap, but it's close by and dependent on another RFC. | ||||||||||||
|
||||||||||||
## Shortcomings | ||||||||||||
|
||||||||||||
There exists a large quantity of content on the internet that is already content addressable and yet not downloadable via IPFS and Bitswap. For example, many binaries, videos, archives, etc. that are distributed today have their SHA-256 listed alongside them so that users can run `sha256sum file` and compare the output with what they were expecting. When these files are added to IPFS they can be added as either a) an application-specific DAG format for files (such as UnixFSv1), which is identified by a DAG root CID that differs from a CID of the multihash of the file data itself, or b) a single large raw block, which cannot be processed by Bitswap.
|
||||||||||||
Additionally, for users using application specific DAGs with some degree of flexibility to them (e.g. UnixFS where there are multiple chunking strategies) two users who import the same data could end up with different CIDs for that data. | ||||||||||||
|
||||||||||||
## Description | ||||||||||||
|
||||||||||||
Utilizing the results of [RFCBBL209](../rfcBBL209/README.md) we can download arbitrarily sized raw blocks. We allow UnixFS files that have raw leaves to be stored internally as they are now but also aliased as a single virtual block. | ||||||||||||
|
||||||||||||
## Implementation plan | ||||||||||||
|
||||||||||||
* Implement [RFCBBL209](../rfcBBL209/README.md) | ||||||||||||
* Add an option when doing `ipfs add` that creates a second aliased block in a segregated blockstore | ||||||||||||
* Add the second blockstore to the provider queue | ||||||||||||
|
||||||||||||
## Impact | ||||||||||||
|
||||||||||||
This scheme allows a given version of IPFS to have a canonical hash for files (e.g. SHA256 of the file data itself), which allows for independent chunking schemes. By supporting the advertising/referencing of one or more common file hash schemes, it also allows people to find some hash on a random website and check whether it's discoverable in IPFS.
|
||||||||||||
There are also some larger ecosystem wide impacts to consider here, including: | ||||||||||||
|
||||||||||||
1. There's a lot of confusion around UnixFS CIDs not being derivable from the SHA256 of a file. This approach may either tremendously help or cause even more confusion (especially as we move people from UnixFS to IPLD). See this example [thread](https://discuss.ipfs.io/t/cid-concept-is-broken/9733).
2. Storage overhead for multiple "views" on the same data and extra checking + advertising of the data | ||||||||||||
3. Are there any deduplication use case issues we could run into here, given that users would download data not as the data creator chunked it but based on how they want to chunk it (or, more likely, with the default chunker)?
Decent deduplication requires using a consistent hash and chunking algorithm, and to deduplicate arbitrarily aligned data, the chunking algorithm must be content based with variably sized blocks of any byte-length between a min and max value. Every different way of chunking/hashing the data (and creating the merkle-tree) results in another copy of the data that cannot be deduplicated.

To do deduplication efficiently you want IPFS to have a fixed chunking/hashing algorithm and merkle-tree under the hood, and then support alternative "views" on top of this, that can present the underlying data as if it was chunked/hashed differently.

I don't know how much demand there is for alternative tree views, but certainly a common use-case is the "one big file with this hash" view. This could be implemented as just an alternative DHT entry similar to an IPNS entry that is keyed by the whole-file hash and points to a list of CIDs (each different hash/chunker/tree option results in a different CID; ideally there is only one) for that file. These could be signed by the node that did the upload for verification purposes, but you would still need to download the whole file to verify the whole-file hash.

I don't know how much demand there is for alternative tree views of the data, but this could be implemented using an alternative merkle tree containing the desired hash-type for each node, and where the "raw" leaves are actually ranged references into the underlying native merkle tree nodes. I'm not sure exactly how validation of these alternative-view merkle nodes would work, but you would probably have to download the data-segment (by downloading the underlying merkle-tree-fragment) for each node to validate the hash.

There might be ways to include signatures by the uploading peer-node, but you probably want to do this in a way that the same alternative-view uploaded by different peers can share the same data. Perhaps an IPNS entry pointing at the alternative-view-root-merkle node is the best way that peers can sign that they've uploaded/verified that alternative-view.

I think @dbaarda has a great idea, that mapping a full-file hash (SHA-256 for ex) to a CID right at the DHT layer seems like a clean way to add this functionality with no redesign of anything existing (other than a new DHT type), and just a small amount of new code. Also it means any existing data already stored on IPFS doesn't need to be re-stored (re-added) but anyone at any time could put its canonical hash (SHA-256) in the DHT and immediately it would be findable by everyone else.

If you extend this idea a tiny bit and make the alias value in the DHT a CID+range (offset + length) then you can add aliases to any piece of arbitrary data regardless of the underlying chunking/merkle tree structure. This would allow you to e.g. add a sha256 alias to a file that had been added inside an uncompressed tar file.

One of the tricks that @aschmahmann is trying to balance in this proposal, as I understand it, is being able to take a legacy hash, like the sha256 of a large file, and have some confidence that you're getting 'correct' data while you're downloading it.
If the DHT just holds a mapping to a CID, you don't know until you fully retrieve the file that it will hash to the original sha256 value you were interested in.

At least you only have to trust the peer that gave you the CID and then you can still pull data from all the other untrusted peers in the normal trustless way, where they can only fool you "1MB at a time (bitswap)" so to speak. Also if you get a DHT answer from at least 3 peers, you can go with whatever the consensus is about what the correct CID is, before you try it, but I'm not sure if DHT is designed to try to get 3 or more answers.
Agree, but maybe that's the price you pay for "larger blocks"; you can't validate them until you've downloaded them, just like small blocks. Right now people can't have content-addressable "blocks" of data larger than 2M (with 1M recommended) at all. You can't have content-addressable "blocks" larger than that, unless you're OK with the key/hash not being of the raw data, but of a merkle-tree node where the hash depends on the chunking algorithm chosen. People might want to build applications on top of IPFS with larger blocks, and this would facilitate that. Adding a simple CID Alias to the DHT suddenly means you can have blocks of any size keyed by the block's content hash. Under the hood IPFS is chunking things into pieces optimized for its deduping/network/storage/etc requirements, but you now optionally can have an abstract "block" on top of that with fewer restrictions.
I would have said that the big advantage of the DHT is you can find things with it. Any solution that doesn't put the hash/key/cid that you want to find the data by in a DHT is not going to be findable, at least not by that key. You need some kind of mapping from the key/hash you have to the CID the data is actually published under.
Note this is true with current non-IPFS publishing of ISO images and their hash; you need to download the whole thing before you can verify it against the published hash. I agree it would be good to have some way to validate that the CID alias for a large block actually does hash to the key it's indexed by, but I haven't really got a good solution. Signing by the publishing peer might help, but I haven't thought it through. Perhaps CID aliases should be published via IPNS to be trusted? Note you don't have to prove that each individual raw block (or maybe block fragment) is a valid part of the whole large block, just that whole data referred to by the CID reference has that hash, since the IPFS fetching of that CID will validate the individual raw blocks are part of that CID as they are fetched.
It allows you to create a virtual "block" keyed by its hash using any multi-hash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like;
I think 1. would be the main use-case, but once you have that capability people would figure out creative ways to use it. Note 1. and 3. only require CID-aliases that point at an existing CID, but 2. and 4. require CID-aliases that include ranges (offset+length).
Ah... that's interesting... maybe the mechanisms for adding provider records could be used for adding CIDs to a CID-alias entry?

To clarify, I was making the assumption that any DHT entry that points a SHA-256 to a wrong CID (for whatever reason) would have a digital signature so the peer responsible for the "mistake" could be identified and blamed and then somehow 'down-ranked' (as a potential hostile actor) by the peer that discovered the mistake. Like you guys, I can't see a way to avoid having to trust that a CID given in the DHT is correct, until after the full download is complete.

Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, and so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then definitely we don't want to introduce the first weak link in the chain. It may be the case that true content-addressing (based on a file hash) is simply impractical in a trustless-system, once you consider that it's synonymous with a big attack vector. If this is true (and I can't prove it isn't) then my apologies for even being one of proponents of the idea.

This is much longer than I wanted (sorry about that). Separating this into two sections, one about how the DHT works and the other about whether I think this approach of storing CID mappings in the DHT is something we should do.

DHT Background Clarification
You're right I was being sloppy with my wording/explanation here. The point I was trying to make is that if you wanted to put some custom key-value pair
This strategy is how IPNS over PubSub works and manages to be faster than IPNS over the DHT even for first time lookups (relevant logic largely lives in go-libp2p-pubsub-router and go-libp2p-discovery). The things you lose over having native DHT record support are the ability to publish and then go offline for a few hours and then come back online later. This isn't to say we shouldn't figure out a way to make it easier to upgrade the DHT and support new record types, just that this other path works.
Generally open p2p systems are subject to Sybil attacks, this includes IPFS. There are mitigations available (and we even do some of them), but overall the concept persists. The question to ask is what bad thing occurs/can occur if someone performs a Sybil attack. Thus far the only thing that happens to IPFS under a Sybil attack is basically a denial of service attack on the resource being attacked.

Thoughts on unverified relationships between two CIDs

Continuing from above this proposal allows for different types of attacks than just a DoS on the Sybil attacked resource, and without some reputation backing isn't too difficult to pull off.
I think proposal 09 covers most of these use cases at the network layer if clients are willing to make custom block stores locally. Zooming out if we have a primitive that lets us download arbitrarily sized blocks and we want to be able to download these blocks in parts from multiple peers who are transforming the data in various ways (chunking, compression, etc.) that's ok as long as peers present a virtual blockstore that presents the data in its canonical form. This might end up requiring computation/storage tradeoffs (e.g. for decompression), but it gives us verifiability which IMO is key.
This proposal does exactly that, you just advertise a provider record in the DHT for the full-file SHA-2 and can then download without worrying about chunking schemes, etc. and it does it verifiably.
This proposal doesn't cover this use case, but the idea of working with compression representations as if they're the data itself seems like a bit of a mine field with people wanting various different things out of compression. One ask is "given that Bob has a compressed version of the file F how can Alice download it knowing only SHA2(F)?" and if you want to be able to download from multiple people who could have compressed the data differently then either way you'll need to be downloading based on the raw bytes in the file. If so, then Bob can have a virtual datastore where if someone asks him for F he'll decompress it before sending. Another ask is to try and minimize download bandwidth, perhaps by using compression. That compression might come at the data layer, or could come at the transport layer (there's an RFC in this repo about compression in the Bitswap transport). If the question is "I want to download data from peers who have gzip(F) and all I know is SHA2(F)" then we run into verifiability problems. There might be value here, especially if I'm building a friend-to-friend network on IPFS and I know who I trust, but given the verifiability issues IMO this doesn't seem great. If downloading gzip(F) is really so important to the user then they could always pass around references to it as SHA2(gzip(F)).
Even assuming that by merkle-tree DAG you're referring only to UnixFS files I'm still confused. I'm not sure I see the immediate value here, but couldn't this proposal mostly let you do this? If an application wanted to treat a subset of a file as a new file then they could just do another add and this reduces to case 1.
This seems like it's really about clients having multiple virtual blocks backed by the same underlying bytes. As long as there is a canonical representation that people can advertise/search under and there is a canonical download representation (e.g. the raw bytes of the large block/file) then how you internally represent that single block is up to you.

@aschmahmann I agree with everything you said, and you outlined the attack vector very well too. I also agree there's not really any value to identifying individual merkle nodes by their SHA2 just because the entire file is identified by SHA2, with all due respect to the person bringing that up. He is thinking "the right way" however, because normally in a tree-architecture when you have every node being "handled" identically to every other node (including the root itself) that's usually indicative of a good design, so it made sense to bring that up.

Apologies for the late and waaay too long post. I started writing this some time ago and RL kept interfering before I could post it, so each time I came back to it I added more, and now I'm struggling to edit it down.

I guess the TLDR is; I'm not against this proposal. The reverse-download-block-incremental hash idea is sound, and a good way to validate the download incrementally so it can be rejected early if it's bad. It's probably the only way for non-homomorphic hash functions, which are the ones people currently use and care about. I just wanted to point out;
A lot of the rest of my words are me thinking aloud and getting a bit off-topic... I'm sorry about that.
You are right that you can do this, but it's very inefficient and more unreliable when 'v' is smaller than a list of the peers hosting a bitswapped block containing 'v' would be. Putting 'v' directly in the DHT instead of in an IPFS block means you avoid more network transfers and round-trips (bitswap is avoided entirely), avoid storing as much data in the DHT (v is smaller than the peer records would be) and storage in peer's blockstores (you don't need peers to store 'v' in a block at all), and avoid relying on available peers providing the block. I was proposing a 'v' record that contains only a CID, offset, length, and maybe a publishing-peer-signature, which would be smaller than a list of peers. This would be like a sym-link to any arbitrary piece of data in IPFS, referenced by a hash of that data. This would be a low-level efficient piece that other things could be built on top of. How you safely use this feature is probably something for upper layers using it to figure out. But you are right; this feature doesn't need to be added to the DHT to build proof-of-concept implementations using this idea. Adding it to the DHT would be an optimization that could be done later. I guess my main point is that it's worth designing things with this optimization in the back of your mind, so that you don't end up with something that's hard to migrate to use it later.
I don't know much about pubsub (yet) but I'm betting pubsub has its own push notification/lookup system that bypasses the DHT and propagates changes faster than DHT writes propagate. I suspect it's this publishing path that makes it faster, not that pubsub puts its 'v' values in bitswapped blocks. I'm willing to bet that a DHT lookup of a tiny 'v' record directly in the DHT will nearly always be faster than looking up a 'v' record in a block. The only exception might be when the 'v' record is already cached in the local blockstore, but even then caching in the DHT client should be just as fast. If it's not, then there is something wrong.
That's the obvious functional difference, but there's a big performance difference too.
Note bad clients can "add" CID entries that don't match the blocks they serve for them too. The only protection is these "bad blocks" are small and thus identified as bad before much is downloaded, and peers providing these bad blocks (presumably) get ignored eventually. There are several different levels to what a CID alias could do, with increasing features and risks;
I agree that CID aliases are more vulnerable to attacks than CIDs and IPNS records, but that's largely because IPFS delegates and denies responsibility for the big risk part: that the CID or IPNS record points at the data that the person publishing says it does. A person can "add" a compromised binary and publicize the CID or even an IPNS record as the place to get it, and IPFS will happily verify that yep, you've downloaded the compromised binary exactly as it was "added", and will not tell you it's been compromised. Verifying the downloaded binary against an officially published hash is something that (currently) has to be done outside IPFS.
The bit about "if clients are willing to make custom block stores locally" worries me. I don't think this is necessary, and implies that a whole bunch of de-duplication at the blockstore and (more importantly) network layer will be broken. I was thinking the blockstore and network layer would always use the underlying IPFS "as-added" native merkle-dag using the existing fetching/storing/caching mechanisms, and "large blocks" would be a higher-level abstraction for "reading" arbitrarily offset/sized "virtual blocks" from a CID. Under the hood it would just fetch/walk the original merkle tree and download the relevant "leaf" raw data blocks that encapsulate that larger virtual block. The addition of the DHT CID alias idea would give you a way to reference/search for these "virtual blocks". This would mean the local data store and network layers would be unchanged, and all the raw data for any large blocks would deduplicate/reuse the native merkle-tree data.
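A rough sketch of the "virtual block" read described above, assuming a simplified in-memory merkle-tree in which every node can report the total size of the data beneath it. `Node` and `read_range` are hypothetical names used only for illustration; they are not part of any IPFS API, and a real implementation would fetch nodes by CID rather than hold them in memory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DAG node: a leaf holds raw bytes, an inner node holds children."""
    data: bytes = b""
    children: List["Node"] = field(default_factory=list)

    @property
    def size(self) -> int:
        # Total number of raw bytes reachable from this node.
        if self.children:
            return sum(child.size for child in self.children)
        return len(self.data)


def read_range(node: Node, offset: int, length: int) -> bytes:
    """Return bytes [offset, offset + length) by visiting only overlapping leaves."""
    if length <= 0 or offset >= node.size:
        return b""
    if not node.children:
        return node.data[offset:offset + length]
    out = bytearray()
    start = 0  # offset of the current child within this node's data
    for child in node.children:
        end = start + child.size
        if end > offset and start < offset + length:
            # Translate the requested range into the child's own coordinates.
            child_offset = max(offset - start, 0)
            child_length = min(offset + length, end) - (start + child_offset)
            out += read_range(child, child_offset, child_length)
        start = end
    return bytes(out)


# A small two-level tree holding b"abcdefghijkl" in four leaves.
leaves = [Node(data=b"abc"), Node(data=b"def"), Node(data=b"ghi"), Node(data=b"jkl")]
root = Node(children=[Node(children=leaves[:2]), Node(children=leaves[2:])])

# A "virtual block" of 5 bytes starting at offset 4; only leaves 2 and 3 are touched.
virtual_block = read_range(root, offset=4, length=5)
```

The blockstore and network layers never see the virtual block itself in this sketch; they only ever handle the original leaves.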
Note this would not work with compression, only concatenation, as is done by eg tar (NOT tar.gz). You could write a custom tar file uploader that not only gave you a CID for the whole tar file, but a CID alias for every file inside the tar file. This would be more efficient than doing "add" of the tar and each file individually UNLESS you had/used a tar-file aware chunker/dag-builder that could achieve the same thing by building the merkle-dag to reflect the tar-file contents.
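To make the tar idea concrete, here is a minimal sketch using Python's standard `tarfile` module. It records an `(offset, length, hash)` triple for every regular file inside an uncompressed tar. The plain SHA-256 hex digest stands in for a CID, and `tar_member_aliases` and `build_example_tar` are hypothetical helper names, so treat this as an illustration of the bookkeeping rather than a proposed uploader.

```python
import hashlib
import io
import tarfile


def tar_member_aliases(tar_bytes: bytes) -> dict:
    """Map each regular file in an uncompressed tar to (offset, length, sha256 hex)."""
    aliases = {}
    # Mode "r:" refuses compressed archives, matching the "NOT tar.gz" caveat.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # offset_data is where the member's raw bytes start inside the tar.
            start = member.offset_data
            payload = tar_bytes[start:start + member.size]
            aliases[member.name] = (start, member.size, hashlib.sha256(payload).hexdigest())
    return aliases


def build_example_tar(files: dict) -> bytes:
    """Build a small uncompressed tar in memory for the demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


files = {"a.txt": b"hello", "b.bin": bytes(range(256)) * 3}
tar_bytes = build_example_tar(files)
aliases = tar_member_aliases(tar_bytes)
```

Each alias hashes to the same value as the standalone file would, which is what lets a lookup by file hash resolve to a byte range inside the tar.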
This idea was just presented as an example that someone might see a use for.
This is actually about creating multiple virtual blocks you can advertise/search for, that clients can then fetch/store using the "as-added" merkle-tree native representation. It's not purely internal, because the virtual blocks are advertised/searched for by other clients. That the network/storage layers use the native merkle-tree representation means all the data is transmitted/deduplicated at that layer, and clients assemble them into the "large blocks" themselves as needed.
I believe this would make public gateways way more useful by removing MITM risk (cc @Gozala @autonome)
There are still issues with directories, but yes for files this can help, good idea 💡!
## Evaluation Plan

TBD

## Prior Work

## Results

## Future Work
Note that breaking blocks on hash function chunk boundaries means breaking rabin (or any other content based) chunking, which is essential for efficient content deduplication. For this to support arbitrary content boundaries for IPFS blocks, the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.
Can you elaborate more on what you were thinking here? What do you mean by accumulated hash value (my guess is the internal State) and the trailing partial hash (not sure what this means)?
I'm thinking there are three different chunkers involved here:
`S` along the way, which has the restriction of every state being on a 64 byte boundary (e.g. we could use a fixed 256KiB, or a more complex function as long as it always ended on a 64B boundary)
IIUC the thing that you're getting at here is that it's unfortunate that if Rabin would've saved us bandwidth here that it's only saving us disk storage because of the restrictions at the exchange layer.
I think there's potentially a way around this (which has tradeoffs) by allowing more data into the manifest. For example, when B responds to A giving them a manifest of blocks to download in addition to giving them a list of the intermediate States B could also send more information about the data corresponding to those states.
For example, B could send `(State, large block start index, large block end index, []ConsecutiveSubBlock)` where `ConsecutiveSubBlock = (SubBlockMultiHash, subblock start index, subblock end index)`, and in this way A could decide whether to get the blocks corresponding to the State transition by either asking about bytes in the full block or by asking for the ConsecutiveSubBlocks and then using some of the bytes. This would allow B to tell A about a deduplicating chunking scheme they could use, but A 1) wouldn't be obligated to use it when downloading data 2) wouldn't be obligated to use it in their datastore.
WDYT?
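One way the manifest entry above could look in code, as a hedged sketch only. The field names, the use of exclusive end indices, the plain SHA-256 hex digest standing in for `SubBlockMultiHash`, and the `bytes_from_sub_blocks` helper are all assumptions made for illustration; the thread does not specify any of them.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ConsecutiveSubBlock:
    multihash: str  # stand-in for SubBlockMultiHash (SHA-256 hex in this sketch)
    start: int      # first byte of the sub-block to use (inclusive)
    end: int        # end of the bytes to use (exclusive, an assumption)


@dataclass
class ManifestEntry:
    state: bytes       # opaque intermediate hash state for this transition
    large_start: int   # start of the range within the large block (inclusive)
    large_end: int     # end of the range within the large block (exclusive)
    sub_blocks: List[ConsecutiveSubBlock]


def bytes_from_sub_blocks(entry: ManifestEntry, blockstore: Dict[str, bytes]) -> bytes:
    """Assemble the bytes for one state transition from data-layer sub-blocks."""
    out = bytearray()
    for sub in entry.sub_blocks:
        block = blockstore[sub.multihash]
        # Each sub-block is small enough to verify before any of it is used.
        if hashlib.sha256(block).hexdigest() != sub.multihash:
            raise ValueError("sub-block does not match its hash")
        out += block[sub.start:sub.end]
    if len(out) != entry.large_end - entry.large_start:
        raise ValueError("sub-block ranges do not cover the advertised range")
    return bytes(out)


# A 100-byte "large block" that B happens to store as three uneven chunks.
large_block = bytes(range(100))
chunks = [large_block[0:37], large_block[37:70], large_block[70:100]]
blockstore = {hashlib.sha256(c).hexdigest(): c for c in chunks}
mh = list(blockstore)  # insertion order matches the chunk order

# The transition covering bytes [32, 96) of the large block (64-byte aligned).
entry = ManifestEntry(
    state=b"",  # placeholder; the real value would be the hash state
    large_start=32,
    large_end=96,
    sub_blocks=[
        ConsecutiveSubBlock(mh[0], 32, 37),
        ConsecutiveSubBlock(mh[1], 0, 33),
        ConsecutiveSubBlock(mh[2], 0, 26),
    ],
)
assembled = bytes_from_sub_blocks(entry, blockstore)
```

A is free to ignore `sub_blocks` and request bytes `[large_start, large_end)` of the full block directly; both paths yield the same bytes to feed into the state transition check.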
If you chunk on arbitrary byte boundaries, then the full "state" needed to resume hashing needs to include the tail data past the last 64B boundary that has not yet been included into the hash-chunker. It means the full state needed is a little larger.
Note this can vary depending on the hash function used. I think some have >64 byte hash-chunkers.
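To illustrate the "tail data" point for SHA-256 specifically, here is a small sketch. Python's `hashlib` does not expose intermediate state, so the compression function is reimplemented from FIPS 180-4 purely for demonstration. The `(state, tail, total_len)` tuple layout and the `absorb` and `finalize` names are assumptions, not anything proposed in the RFC. The total length is carried along because the final padding needs it.

```python
import hashlib
import struct

# SHA-256 round constants and initial state (FIPS 180-4).
K = [
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
]
H0 = (0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
      0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19)
M = 0xFFFFFFFF


def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M


def compress(state, chunk):
    """Apply the SHA-256 compression function to exactly one 64-byte chunk."""
    w = list(struct.unpack(">16I", chunk))
    for i in range(16, 64):
        s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & M)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + K[i] + w[i]) & M
        s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & M
        a, b, c, d, e, f, g, h = (t1 + t2) & M, a, b, c, (d + t1) & M, e, f, g
    return tuple((x + y) & M for x, y in zip(state, (a, b, c, d, e, f, g, h)))


def absorb(resume, data):
    """Advance (state, tail, total_len) by data that may end on any byte boundary."""
    state, tail, total = resume
    buf = tail + data
    full = len(buf) - len(buf) % 64
    for i in range(0, full, 64):
        state = compress(state, buf[i:i + 64])
    # Bytes past the last 64-byte boundary are not hashed yet and must travel along.
    return state, buf[full:], total + len(data)


def finalize(resume):
    """Apply the standard SHA-256 padding and return the 32-byte digest."""
    state, tail, total = resume
    padded = tail + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack(">Q", total * 8)
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return struct.pack(">8I", *state)


message = bytes(range(256)) * 5  # 1280 bytes
cut = 1000                       # deliberately not a multiple of 64
paused = absorb((H0, b"", 0), message[:cut])
digest = finalize(absorb(paused, message[cut:]))
```

Here `paused` carries a 40-byte tail (1000 mod 64) alongside the eight state words, which is the "a little larger" state mentioned above.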
This is the blocker; rabin chunkers must chunk on arbitrary byte boundaries and have variable sized chunks to work properly. The classic example is the large file modified by adding a single byte at the front; despite everything except the first byte being duplicated, chunkers that can't re-align on the 1-byte offset will fail to de-duplicate anything. Any chunker constrained to 64 byte boundaries will fail to find duplicates after any insert/delete that is not a nice multiple of 64 bytes.
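The one-byte-insert example can be checked with a toy experiment. This sketch compares a fixed-size chunker with a very simple content-defined chunker. The content-defined one uses a windowed CRC-32 as a stand-in for a real Rabin fingerprint and has no minimum or maximum chunk size, so it is only an illustration of the alignment behaviour, not of any chunker IPFS ships. `WINDOW`, `MASK` and the function names are assumptions.

```python
import random
import zlib

WINDOW = 16   # bytes of trailing context that decide a boundary
MASK = 0x3FF  # cut when the low 10 bits are zero (about 1 KiB average chunks)


def fixed_chunks(data: bytes, size: int = 1024):
    """Cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes):
    """Cut after any position whose trailing WINDOW bytes hash to a marker value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(old, new):
    """Number of distinct chunks of `old` that also appear in `new`."""
    return len(set(old) & set(new))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"!" + original  # the classic one-byte insert at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
cdc_old = content_defined_chunks(original)
cdc_shared = shared(cdc_old, content_defined_chunks(shifted))
print(f"fixed-size: {fixed_shared} shared chunks")
print(f"content-defined: {cdc_shared} of {len(set(cdc_old))} shared chunks")
```

Because the content-defined boundaries depend only on the bytes in the window, they move with the data after the insert, so every chunk after the first is found again. The fixed-size chunker never re-aligns.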
This bit is internal client level and doesn't really matter from an API point of view. The important bit is 2.
Rabin can't be used at all if the chunks are constrained to 64 Byte boundaries. Without using some kind of content-based chunker like rabin, you get no de-duplication at all between slightly mutated data within files unless the data mutations are constrained to multiples of your block size; so no shifting of data except by multiples of the block size.
I don't think I fully understand this. Are you effectively creating virtual-large-blocks for each state transition out of a list of ranged sub-blocks in your manifest? And this is so that the sub-blocks can be created using a de-duplicating chunker, while the virtual-large-blocks can be 64byte aligned?
If yes, this sounds very similar to the "virtual DAG" idea I was proposing you could do using CID-aliases in the DHT. Note that you would not need to have a list of ConsecutiveSubBlocks, you could just have a CID+offset+length where the CID could be a reference to the merkle-dag node that contains the whole range. You can walk the DAG to find the leaf nodes needed from that.
If you want an overview of how chunking/encoding/etc affect data deduplication, I think I've summarized everything about it here:
https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda
Sorry this is a little long, especially since I think we're basically on the same page but figured it'd help to be more explicit.
So I think what we're describing is just two ways of getting at the same idea. We're in agreement that the exchange layer chunker, which is verifiable, cannot just send over data-layer chunker information (e.g. rabin) since they won't line up on the same boundaries.
Overall we need something that a client who has `State_i` can receive `(State_i-1, bytes)` and can verifiably check that the given state transition is valid.
IIUC the difference is:
Your approach
I was having some difficulty describing what you were getting at here, but I think you were leaning towards:
Define transitions from `(State_i-1, bytes) -> State_i` as `(SHA2_State_with_extra_data_i-1, CIDs for data layer chunking blocks (e.g. rabin))` to `SHA2_State_with_extra_data_i`, where the extra data might include some bytes left over in between rounds of SHA2 chunking.
My approach
Define transitions from `(State_i-1, bytes) -> State_i` as `(SHA2_State_i-1, []WayOfDescribingBytes) -> SHA2_State_i`, where I've listed two implementations of `WayOfDescribingBytes`.
Below we have three peers: Alice, the client; Bob, the server that sends the manifest and/or data; and Charlie, another server that can send a manifest and/or data.
`[(MH1, start at byte 55), MH2, (MH3, end at byte 12)]`. Note: In one of your other comments I think you allude to allowing for graphs here instead of just a list of blocks. IMO that seems like an excessive level of indirection since we already know what we want.
Both of these approaches describe a set of bytes and have their own advantages/disadvantages:
2. Advantage: If people are using the same chunker I get deduplication benefits. Disadvantage: Fails completely if someone has the same large block, but chunked up differently.
Yes, although the "large" in this case isn't very big since the sum of the sizes should be less than the recommended Bitswap block size (i.e. 1MiB). We're just accounting for if Rabin chunking gives us 10 chunks that contain that 1MiB of data.
This isn't perfect, for example if the data is a 1TiB file that's expressible using a rabin chunker in 1MiB we'll send metadata corresponding to a 1TiB file since we want to account for someone using a different chunker. However, as the manifests should be very small compared to the data and most real data likely won't have such extreme differences in chunking vs non-chunking size I suspect this is ok.
Yes, and also handle the case where people have used different chunkers for the same data.
That's a great post 👍
This is a long comment thread, but I just want to add that we get common-prefix deduplication of stuff with the linear hash for free.