RFC|BB|L2-09/10 Handling arbitrarily large blocks and Treating files as large blocks #29
#### Security
In order for this scheme to be secure, it must be true that only a single pair `(S_i-1, P_i)` can be produced that matches `S_i`. If the pair must be of the form `(S_i-1, P_malicious)`, then this is certainly true, since otherwise one could create a collision on the overall hash function. However, given that there are two parameters to vary, it seems possible that finding a matching pair could be computationally easier than finding a collision on the overall hash function.
I'd like some 👀 on this ideally from people who are more practiced with this type of cryptanalysis than I am.
I'm pretty sure that finding any kind of `(S_i-1, P_i)` pair that matches another `(S_i-1, P_other)` is at least as hard as finding any arbitrary matching `(S_foo, P_foo)`, `(S_bar, P_bar)` pair, which is the same as finding a hash collision between any two blocks. It is a birthday attack, but secure hash functions have enough bits to be safe from birthday attacks.
After some poking around (thanks @Stebalien for pointers), it seems that as long as the underlying compression function is not subject to freestart collisions we should be fine; if it is, things become trickier.
My understanding from this paper and its corresponding thesis is that SHA-256 is not yet subject to freestart collisions.
Even if we were subject to freestart collisions, things may not necessarily be so bad, since the attacker would also need to be the creator of the file. They could selectively give some people the data, while other people would not get different data but would instead just waste some bandwidth and other resources, which on its face doesn't seem like a super worthwhile attack.
If so, then what we're really trying to avoid here is approximately a pseudo-second-preimage attack on the compression function (close to Definition 7 here). My understanding is that this would be even harder for an attacker to pull off, and might even be reasonably safe for functions like SHA-1 which are no longer collision resistant (although pseudo-preimage attacks may of course be easier to pull off than full preimage attacks).
@dbaarda thanks for the feedback, it does seem like this is probably ok. However, I do think it's a little more subtle than "there are no collisions on SHA-2, so there are no issues in this scheme".
> `(S_foo, P_foo)`, `(S_bar, P_bar)` pair, which is the same as finding a hash collision between any two blocks

My understanding is that this indicates a collision on the compression function, but not on the overall hash function, since a hash collision means that, given some starting state `IV`, `H(IV, P_good) = H(IV, P_bad)` (Definition 1 in that paper linked above). This means that unless you can chain `S_foo` and `S_bar` back to some common state `IV`, there isn't a full hash collision.
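To make the distinction concrete, the notions discussed above can be written side by side, taking $f$ to be the compression function and $H$ the full hash with a fixed initial state $IV$. This is a paraphrase in the thread's notation, not the exact wording of the definitions in the linked paper.

```latex
\begin{align*}
\text{Collision on } H:\quad
  & P \neq P', \quad H(IV, P) = H(IV, P') \\
\text{Freestart collision on } f:\quad
  & (S, P) \neq (S', P'), \quad f(S, P) = f(S', P') \\
\text{Forgery against this scheme:}\quad
  & \text{given honest } (S_{i-1}, P_i) \text{ with } f(S_{i-1}, P_i) = S_i, \\
  & \text{find } (S', P') \neq (S_{i-1}, P_i) \text{ with } f(S', P') = S_i
\end{align*}
```

In the first line both messages start from the same fixed $IV$; in the second the attacker is free to choose both chaining values; in the third one side of the equation is fixed by the honest file creator.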
Ah, I did not know "freestart collisions" had a name; I'm glad it does!
@@ -0,0 +1,45 @@
# RFC|BB|L2-10: UnixFS files identified using hash of the full content
cc @ribasushi I feel like you may have some thoughts on this 😄
#### Hash function support

* Add support for SHA-1/2 (should be very close to the same)
SHA-1 is deprecated / not recommended at this point. It seems unclear that it's valuable or safe to support it. Why do we want to?
"git"
Mostly for Git support. However, with Git eventually moving to SHA-2, if it turned out SHA-1 was unworkable we could probably deal with it.
I would strongly prefer that SHA-1 support for this be available at least as opt-in, if not on by default. See https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html, which I propose adding in multiformats/multicodec#203, for why.
By the time this is usable, SHA-256 support in Git will likely be stabilized anyway, given it's already implemented AFAICT, so I don't see the point in making it opt-out.
## Description

The major hash functions work by taking some data `D`, chunking it up into `n` pieces `P_0...P_n-1`, and then modifying an internal state `S` by loading the pieces into the hash function one after another. This means that there are points in the hash function where we can pause processing and get the state of the hash function so far. Bitswap can utilize this state to effectively break up large blocks into smaller ones.
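As a rough illustration of "pausing" a hash, the sketch below implements the SHA-256 compression function in pure Python so that the intermediate states `S_i` are visible, then checks a sequence of `(S_i-1, P_i)` transitions backwards starting from the final digest. The helper names (`compress`, `absorb`, `finalize`, `verify_backward`) and the 256-byte piece size are assumptions made for this example, the data length is assumed to be a multiple of 64 bytes, and none of this is the RFC's wire format.

```python
import hashlib
import struct
from math import isqrt

MASK = 0xFFFFFFFF

def _primes(n):
    found, candidate = [], 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def _icbrt(n):
    """Exact integer cube root via binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# SHA-256 constants: fractional bits of square/cube roots of the first primes.
IV = tuple(isqrt(p << 64) & MASK for p in _primes(8))
K = [_icbrt(p << 96) & MASK for p in _primes(64)]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """One SHA-256 compression step: (S_i-1, 64-byte block) -> S_i."""
    w = list(struct.unpack(">16I", block))
    for t in range(16, 64):
        s0 = _rotr(w[t - 15], 7) ^ _rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = _rotr(w[t - 2], 17) ^ _rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        t1 = (h + (_rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((_rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return tuple((x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h)))

def absorb(state, piece):
    """Advance the state over a piece whose length is a multiple of 64 bytes."""
    assert len(piece) % 64 == 0
    for off in range(0, len(piece), 64):
        state = compress(state, piece[off:off + 64])
    return state

def finalize(state, total_len):
    """Apply the standard SHA-256 padding for a total_len-byte message."""
    padding = b"\x80" + b"\x00" * ((55 - total_len) % 64) + struct.pack(">Q", total_len * 8)
    return struct.pack(">8I", *absorb(state, padding))

def verify_backward(digest, total_len, transitions):
    """transitions: (previous_state, piece) pairs, last piece first."""
    expected = None
    for idx, (prev, piece) in enumerate(transitions):
        nxt = absorb(prev, piece)
        ok = finalize(nxt, total_len) == digest if idx == 0 else nxt == expected
        if not ok:
            return False
        expected = prev          # this state is now trusted
    return expected == IV        # the chain must end at the fixed initial state

data = bytes(range(256)) * 4
pieces = [data[i:i + 256] for i in range(0, len(data), 256)]
states = [IV]
for piece in pieces:
    states.append(absorb(states[-1], piece))
digest = finalize(states[-1], len(data))
transitions = list(zip(states[:-1], pieces))[::-1]
print(digest == hashlib.sha256(data).digest(), verify_backward(digest, len(data), transitions))
```

Each transition is checked against a state that has already been tied to the trusted digest, which is why the walk starts from the end of the data rather than the beginning.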
Note that breaking blocks on hash function chunk boundaries means breaking rabin (or any other content based) chunking, which is essential for efficient content deduplication. For this to support arbitrary content boundaries for IPFS blocks, the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.
> the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.

Can you elaborate more on what you were thinking here? What do you mean by accumulated hash value (my guess is the internal State) and the trailing partial hash (not sure what this means)?
I'm thinking there are three different chunkers involved here:
- The SHA-256 hash chunker, which uses 64-byte boundaries
- The exchange layer (i.e. Bitswap) chunker, which emits the intermediate state `S` along the way and has the restriction of every state being on a 64-byte boundary (e.g. we could use a fixed 256KiB, or a more complex function as long as it always ended on a 64B boundary)
- The data storage layer, where the data could be stored in any chunking fashion we want, although we should store the exchange layer mappings for reuse when people download from us in the future

IIUC the thing that you're getting at here is that it's unfortunate that if Rabin would've saved us bandwidth here, it's only saving us disk storage because of the restrictions at the exchange layer.
I think there's potentially a way around this (which has tradeoffs) by allowing more data into the manifest. For example, when B responds to A with a manifest of blocks to download, in addition to giving A a list of the intermediate States, B could also send more information about the data corresponding to those states.
For example, B could send `(State, large block start index, large block end index, []ConsecutiveSubBlock)` where `ConsecutiveSubBlock = (SubBlockMultiHash, subblock start index, subblock end index)`, and in this way A could decide whether to get the blocks corresponding to the State transition by either asking about bytes in the full block or by asking for the ConsecutiveSubBlocks and then using some of the bytes. This would allow B to tell A about a deduplicating chunking scheme they could use, but A 1) wouldn't be obligated to use it when downloading data 2) wouldn't be obligated to use it in their datastore.
WDYT?
> > the hash state will need to include more than just the accumulated hash value; it will also need the trailing partial hash chunk data.
>
> Can you elaborate more on what you were thinking here? What do you mean by accumulated hash value (my guess is the internal State) and the trailing partial hash (not sure what this means).

If you chunk on arbitrary byte boundaries, then the full "state" needed to resume hashing needs to include the tail data past the last 64B boundary that has not yet been fed into the hash-chunker. It means the full state needed is a little larger.

> I'm thinking there are three different chunkers involved here:
> - The SHA-256 hash chunker which uses 64 byte boundaries

Note this can vary depending on the hash function used. I think some have >64 byte hash-chunkers.

> - The exchange layer (i.e. Bitswap) chunker which emits the intermediate state `S` along the way, which has the restriction of every state being on a 64 byte boundary (e.g. we could use a fixed 256KiB, or a more complex function as long as it always ended on a 64B boundary)

This is the blocker; rabin chunkers must chunk on arbitrary byte boundaries and have variable sized chunks to work properly. The classic example is the large file modified by adding a single byte at the front; despite everything except the first byte being duplicated, chunkers that can't re-align on the 1-byte offset will fail to de-duplicate anything. Any chunker constrained to 64 byte boundaries will fail to find duplicates after any insert/delete that is not a nice multiple of 64 bytes.
> - The data storage layer where the data could be stored in any chunking fashion we want although we should store the exchange layer mappings for reuse when people download from us in the future

This bit is internal client level and doesn't really matter from an API point of view. The important bit is 2.

> IIUC the thing that you're getting at here is that it's unfortunate that if Rabin would've saved us bandwidth here that it's only saving us disk storage because of the restrictions at the exchange layer

Rabin can't be used at all if the chunks are constrained to 64-byte boundaries. Without using some kind of content-based chunker like Rabin, you get no de-duplication at all between slightly mutated data within files unless the data mutations are constrained to multiples of your block size; so no shifting of data except by multiples of the block size.
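The one-byte-insert example can be reproduced with a small experiment. The sketch below is not a Rabin chunker: it uses `zlib.adler32` over a sliding 16-byte window as a stand-in boundary test, with no minimum or maximum chunk size, and the window size, mask, and test data are all assumptions chosen for illustration. It compares how many chunks survive a one-byte prepend under fixed 64-byte chunking versus content-defined chunking.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes examined when deciding whether a chunk ends here
MASK = 0x3F   # boundary when the low 6 bits are zero (about 64-byte chunks on average)

def fixed_chunks(data, size=64):
    """Chunk on fixed 64-byte boundaries, like a hash-block-aligned chunker."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_chunks(data):
    """Chunk wherever the checksum of the trailing WINDOW bytes hits the mask."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.adler32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared(a, b):
    """Number of distinct chunk hashes that appear in both chunk lists."""
    digests_a = {hashlib.sha256(c).digest() for c in a}
    digests_b = {hashlib.sha256(c).digest() for c in b}
    return len(digests_a & digests_b)

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"\x00" + original          # same data with one byte added at the front

fixed_shared = shared(fixed_chunks(original), fixed_chunks(shifted))
cdc_original = content_chunks(original)
cdc_shared = shared(cdc_original, content_chunks(shifted))
print("fixed-size chunks shared:", fixed_shared)
print("content-defined chunks shared:", cdc_shared, "of", len(set(cdc_original)))
```

Because the boundary decision depends only on the bytes in the window, every boundary after the first few bytes moves along with the data, so all chunks except the first re-align. The fixed-size chunker has no such mechanism and every chunk changes.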
> I think there's potentially a way around this (which has tradeoffs) by allowing more data into the manifest. For example, when B responds to A giving them a manifest of blocks to download in addition to giving them a list of the intermediate States B could also send more information about the data corresponding to those states.
> For example, B could send `(State, large block start index, large block end index, []ConsecutiveSubBlock)` where `ConsecutiveSubBlock = (SubBlockMultiHash, subblock start index, subblock end index)` and in this way A could decide whether to get the blocks corresponding to the State transition by either asking about bytes in the full block or by asking for the ConsecutiveSubBlocks and then using some of the bytes. This would allow B to tell A about a deduplicating chunking scheme they could use, but A 1) wouldn't be obligated to use it when downloading data 2) wouldn't be obligated to use it in their datastore.

I don't think I fully understand this. Are you effectively creating virtual-large-blocks for each state transition out of a list of ranged sub-blocks in your manifest? And this is so that the sub-blocks can be created using a de-duplicating chunker, while the virtual-large-blocks can be 64-byte aligned?
If yes, this sounds very similar to the "virtual DAG" idea I was proposing you could do using CID-aliases in the DHT. Note that you would not need to have a list of ConsecutiveSubBlocks; you could just have a CID+offset+length where the CID could be a reference to the merkle-dag node that contains the whole range. You can walk the DAG to find the leaf nodes needed from that.
If you want an overview of how chunking/encoding/etc. affect data deduplication, I think I've summarized everything about it here:
https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda
Sorry this is a little long, especially since I think we're basically on the same page but figured it'd help to be more explicit.
> If you chunk on arbitrary byte boundaries, then the full "state" needed to resume hashing needs to include the tail data past the last 64B boundary that has not yet been included into the hash-chunker. It means the full state needed is a little larger.
> ...
> I don't think I fully understand this.

So I think what we're describing is just two ways of getting at the same idea. We're in agreement that the exchange layer chunker, which is verifiable, cannot just send over data-layer chunker information (e.g. rabin) since they won't line up on the same boundaries.
Overall we need something such that a client who has `State_i` can receive `(State_i-1, bytes)` and verifiably check that the given state transition is valid.
IIUC the difference is:
Your approach
I was having some difficulty describing what you were getting at here, but I think you were leaning towards:
Define transitions from `(State_i-1, bytes) -> State_i` as `(SHA2_State_with_extra_data_i-1, CIDs for data layer chunking blocks (e.g. rabin))` to `SHA2_State_with_extra_data_i`, where the extra data might include some bytes left over in between rounds of SHA2 chunking.
My approach
Define transitions from `(State_i-1, bytes) -> State_i` as `(SHA2_State_i-1, []WayOfDescribingBytes) -> SHA2_State_i`, where I've listed two implementations of `WayOfDescribingBytes`.
Below we have three peers: Alice, the client. Bob the server that sends the manifest and/or data. Charlie another server that can send a manifest and/or data.
1. Byte offsets within the large block itself (works, but not great for deduplication)
2. Multihashes (since the IPLD codecs aren't required) of the blocks Bob has used locally to chunk the data (e.g. rabin) along with the start offset of the first block and end offset of the last block. For example, `[(MH1, start at byte 55), MH2, (MH3, end at byte 12)]`. Note: In one of your other comments I think you allude to allowing for graphs here instead of just a list of blocks; IMO that seems like an excessive level of indirection since we already know what we want.

Both of these approaches describe a set of bytes and have their own advantages/disadvantages:
1. Advantage: Works even if I ask Charlie for parts of the large block and Charlie has used a different chunker than Bob (e.g. buzzhash or fixed size). Disadvantage: Wastes bandwidth if Charlie and Bob used the same chunker, or if Alice had previously downloaded a large block (e.g. a file) that contains the same chunks Bob produced when chunking the data.
2. Advantage: If people are using the same chunker I get deduplication benefits. Disadvantage: Fails completely if someone has the same large block, but chunked up differently.
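The second way of describing bytes can be sketched as a small data structure plus a reassembly step. Everything named here (`SubBlockRef`, `assemble`, the in-memory `blockstore`, and using a raw SHA-256 digest in place of a real multihash) is a hypothetical stand-in for illustration and not part of any Bitswap message format.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class SubBlockRef:
    """One entry of the list, e.g. (MH1, start at byte 55)."""
    multihash: bytes             # stand-in: raw SHA-256 digest of the sub-block
    start: int = 0               # first byte of the sub-block to use
    end: Optional[int] = None    # one past the last byte to use (None = to the end)

def assemble(refs: List[SubBlockRef], blockstore: Dict[bytes, bytes]) -> bytes:
    """Rebuild the bytes of one state transition from ranged sub-blocks."""
    out = bytearray()
    for ref in refs:
        block = blockstore[ref.multihash]
        # Each sub-block is self-verifying, exactly like a normal small block.
        if hashlib.sha256(block).digest() != ref.multihash:
            raise ValueError("sub-block does not match its hash")
        out += block[ref.start:ref.end]
    return bytes(out)

# Bob chunked 300 bytes on content-defined (unaligned) boundaries.
data = bytes(i % 251 for i in range(300))
bob_chunks = [data[0:100], data[100:180], data[180:300]]
blockstore = {hashlib.sha256(c).digest(): c for c in bob_chunks}
mh1, mh2, mh3 = (hashlib.sha256(c).digest() for c in bob_chunks)

# The exchange-layer transition covers the 64-byte-aligned range [64, 256).
refs = [SubBlockRef(mh1, start=64), SubBlockRef(mh2), SubBlockRef(mh3, end=76)]
transition_bytes = assemble(refs, blockstore)
print(transition_bytes == data[64:256])
```

Alice would then feed `transition_bytes` into the state-transition check for that step. If Charlie chunked the same data differently, these references would not resolve against his blocks, which is the disadvantage noted for this option.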
> Are you effectively creating virtual-large-blocks for each state transition out of a list of ranged sub-blocks in your manifest?

Yes, although the "large" in this case isn't very big since the sum of the sizes should be less than the recommended Bitswap block size (i.e. 1MiB). We're just accounting for the case where Rabin chunking gives us 10 chunks that contain that 1MiB of data.
This isn't perfect: for example, if the data is a 1TiB file that's expressible using a rabin chunker in 1MiB, we'll send metadata corresponding to a 1TiB file since we want to account for someone using a different chunker. However, as the manifests should be very small compared to the data, and most real data likely won't have such extreme differences in chunking vs non-chunking size, I suspect this is ok.

> And this is so that the sub-blocks can be created using a de-duplicating chunker, while the virtual-large-blocks can be 64byte aligned?

Yes, and also handle the case where people have used different chunkers for the same data.

> If you want a overview of how chunking/encoding/etc affect data deduplication

That's a great post 👍
This is a long comment thread, but I just want to add that we get common-prefix deduplication of stuff with the linear hash for free.
1. There's a lot of confusion around UnixFS CIDs not being derivable from SHA256 of a file, this approach may either tremendously help or cause even more confusion (especially as we move people from UnixFS to IPLD). An example [thread](https://discuss.ipfs.io/t/cid-concept-is-broken/9733) about this
2. Storage overhead for multiple "views" on the same data and extra checking + advertising of the data
3. Are there any deduplication use case issues we could run into here based on users not downloading data that was chunked as the data creator did it, but instead based on how they want to chunk it (or likely the default chunker)
Decent deduplication requires using a consistent hash and chunking algorithm, and to deduplicate arbitrarily aligned data, the chunking algorithm must be content based with variably sized blocks of any byte-length between a min and max value. Every different way of chunking/hashing the data (and creating the merkle-tree) results in another copy of the data that cannot be deduplicated.
To do deduplication efficiently you want IPFS to have a fixed chunking/hashing algorithm and merkle-tree under the hood, and then support alternative "views" on top of this that can present the underlying data as if it was chunked/hashed differently. I don't know how much demand there is for alternative tree views, but certainly a common use-case is the "one big file with this hash" view. This could be implemented as just an alternative DHT entry, similar to an IPNS entry, that is keyed by the whole-file hash and points to a list of CIDs for that file (each different hash/chunker/tree option results in a different CID; ideally there is only one). These could be signed by the node that did the upload for verification purposes, but you would still need to download the whole file to verify the whole-file hash.
I don't know how much demand there is for alternative tree views of the data, but this could be implemented using an alternative merkle tree containing the desired hash-type for each node, and where the "raw" leaves are actually ranged references into the underlying native merkle tree nodes. I'm not sure exactly how validation of these alternative-view merkle nodes would work, but you would probably have to download the data-segment (by downloading the underlying merkle-tree-fragment) for each node to validate the hash. There might be ways to include signatures by the uploading peer-node, but you probably want to do this in a way that the same alternative view uploaded by different peers can share the same data. Perhaps an IPNS entry pointing at the alternative-view-root-merkle node is the best way that peers can sign that they've uploaded/verified that alternative view.
I think @dbaarda has a great idea: mapping a full-file hash (SHA-256, for example) to a CID right at the DHT layer seems like a clean way to add this functionality with no redesign of anything existing (other than a new DHT type), and just a small amount of new code.
Also it means any existing data already stored on IPFS doesn't need to be re-stored (re-added), but anyone at any time could put its canonical hash (SHA-256) in the DHT and immediately it would be findable by everyone else.
If you extend this idea a tiny bit and make the alias value in the DHT a CID+range (offset + length), then you can add aliases to any piece of arbitrary data regardless of the underlying chunking/merkle tree structure. This would allow you to, e.g., add a sha256 alias to a file that had been added inside an uncompressed tar file.
One of the tricks that @aschmahmann is trying to balance in this proposal, as I understand it, is being able to take a legacy hash, like the sha256 of a large file, and have some confidence that you're getting 'correct' data while you're downloading it.
If the DHT just holds a mapping to a CID, you don't know until you fully retrieve the file that it will hash to the original sha256 value you were interested in.
At least you only have to trust the peer that gave you the CID and then you can still pull data from all the other untrusted peers in the normal trustless way, where they can only fool you "1MB at a time (bitswap)" so to speak. Also if you get a DHT answer from at least 3 peers, you can go with whatever the consensus is about what the correct CID is, before you try it, but I'm not sure if DHT is designed to try to get 3 or more answers.
> - IMO verifiable data is really important here, DHT's (and open p2p networks in general) are Sybil attackable/spammable so working with unverified data is tough and needs to be really worth the associated problems

Agree, but maybe that's the price you pay for "larger blocks"; you can't validate them until you've downloaded them, just like small blocks.
Right now people can't have content-addressable "blocks" of data larger than 2M (with 1M recommended) at all, unless they're OK with the key/hash not being of the raw data, but of a merkle-tree node where the hash depends on the chunking algorithm chosen. People might want to build applications on top of IPFS with larger blocks, and this would facilitate that.
Adding a simple CID Alias to the DHT suddenly means you can have blocks of any size keyed by the block's content hash. Under the hood IPFS is chunking things into pieces optimized for its deduping/network/storage/etc. requirements, but you now optionally can have an abstract "block" on top of that with fewer restrictions.
> - Almost anything you can do with a custom DHT record type you can do with provider records + a custom protocol. The advantage of using the DHT is generally that someone can publish and then go offline and the record is still there (for a while, e.g. a day), however, by going the custom protocol route you can have things work even if a client doesn't have a DHT implementation (or it's been turned off)

I would have said that the big advantage of the DHT is you can find things with it. Any solution that doesn't put the hash/key/cid that you want to find the data by in a DHT is not going to be findable, at least not by that key. You need some kind of mapping from the key/hash you have to the CID the data is actually published under.
> > At least you only have to trust the peer that gave you the CID and then you can still pull data from all the other untrusted peers in the normal trustless way
>
> Yes, that's true but anyone can just put mappings in which means you could easily be given a bogus CID (may make the "best out of 3" approach not doable). To make things worse this can be used by a malicious actor to do a sort of attack on a third party by getting you to try and download a large file from them that wastes both of your bandwidths.

Note this is true with current non-IPFS publishing of ISO images and their hash; you need to download the whole thing before you can verify it against the published hash.
I agree it would be good to have some way to validate that the CID alias for a large block actually does hash to the key it's indexed by, but I haven't really got a good solution. Signing by the publishing peer might help, but I haven't thought it through. Perhaps CID aliases should be published via IPNS to be trusted? Note you don't have to prove that each individual raw block (or maybe block fragment) is a valid part of the whole large block, just that the whole data referred to by the CID reference has that hash, since the IPFS fetching of that CID will validate that the individual raw blocks are part of that CID as they are fetched.
> > Note the DHT entry containing a CID alias reference can only be validated by downloading all the data pointed at by that reference, but this doesn't have to be the whole file
>
> I think we're on the same page here, but a CID can be validated using the data corresponding to that block. In normal/current usage blocks are relatively small (<1MiB) and large data collections are established by using IPLD to create content-address linked DAGs, this proposal is about the fact that it happens to be that there is a pre-existing "format" for files where the data is just a single large block and it'd be nice to be compatible with that.
> What sorts of use cases are you envisioning where I can lookup the SHA-256 of a large section of a single large block? How is anyone finding the reference to that subsection of a large block and why wouldn't they just break the data into digestible pieces and make an IPLD DAG?

It allows you to create a virtual "block" keyed by its hash using any multi-hash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like:
1. Create a whole-file sha-256 CID-alias that points at a CID containing a single uploaded file. This means you can fetch the file using its whole-file sha-256 hash, instead of a hash that varies depending on the chunking algorithm chosen.
2. Create whole-file sha-256 CID-aliases that point at each file inside a single CID that contains an uncompressed tar file.
3. Add sha-256 CID-aliases for every node in an existing merkle-tree DAG, so that they can be referenced not only by the hash of the node, but by the sha-256 hash of all the data under that node.
4. Create an IPLD DAG using a particular chunking and hash algorithm that is actually a "virtual view" of data already uploaded into IPFS with a completely different chunking and hash algorithm. The "leaf nodes" in this virtual-view DAG will be cid-aliases into the already uploaded data, and would not be limited by IPFS's 2M block size. Note all the data in these different DAGs will be fully de-duplicated.

I think 1. would be the main use-case, but once you have that capability people would figure out creative ways to use it. Note 1. and 3. only require CID-aliases that point at an existing CID, but 2. and 4. require CID-aliases that include ranges (offset+length).
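A ranged CID-alias and the check it allows can be sketched as follows. `CidAlias`, `resolve_alias`, the string CIDs, and the in-memory `content` and `aliases` tables are all hypothetical names used for illustration; a real implementation would fetch the referenced DAG over Bitswap rather than read from a dictionary. The sketch also shows the limitation raised in this thread: the alias can only be validated after the referenced range has been fetched in full.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class CidAlias:
    """Alias value: a CID plus an (offset, length) range into its data."""
    cid: str       # hypothetical CID string of already-published data
    offset: int
    length: int

def resolve_alias(sha256_hex: str,
                  aliases: Dict[str, CidAlias],
                  fetch: Callable[[str], bytes]) -> bytes:
    """Fetch the aliased range, then verify it against the whole-range hash."""
    alias = aliases[sha256_hex]
    data = fetch(alias.cid)[alias.offset:alias.offset + alias.length]
    # Verification is only possible once all of the referenced bytes are here.
    if hashlib.sha256(data).hexdigest() != sha256_hex:
        raise ValueError("alias does not hash to the key it was published under")
    return data

# Stand-in for an uncompressed archive already published under one CID.
file_a = b"contents of file A\n" * 10
file_b = b"contents of file B\n" * 20
archive = b"HEADER-A" + file_a + b"HEADER-B" + file_b
content = {"cid-of-archive": archive}

key_b = hashlib.sha256(file_b).hexdigest()
aliases = {
    key_b: CidAlias("cid-of-archive", archive.index(file_b), len(file_b)),
}
resolved = resolve_alias(key_b, aliases, content.__getitem__)
print(resolved == file_b)
```

Use cases 1 and 3 from the list correspond to `offset=0` and `length` equal to the full size of the referenced data, while use cases 2 and 4 need a real range.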
> > These DHT CID alias entries can be used to build an alternative merkle tree "view" structure that has its own blocks/nodes/sums while referring to the same underlying data in the IPFS preferred format
>
> Related to the above which question/use cases are you hoping to answer with the DHT CID aliases? I think having some concrete examples will make it easier to talk about this.
>
> > DHT ... so it definitely supports multiple answers per key
>
> A bit, but not really: there is support for multiple provider records (i.e. a map of key to a list of peers), but not for anything else. There could be, but it's not a trivial thing (more info here).

Ah... that's interesting... maybe the mechanisms for adding provider records could be used for adding CIDs to a CID-alias entry?
To clarify, I was making the assumption that any DHT entry that points a SHA-256 to a wrong CID (for whatever reason) would have a digital signature so the peer responsible for the "mistake" could be identified and blamed and then somehow 'down-ranked' (as a potential hostile actor) by the peer that discovered the mistake. Like you guys, I can't see a way to avoid having to trust that a CID given in the DHT is correct, until after the full download is complete.
Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then we definitely don't want to introduce the first weak link in the chain. It may be the case that true content-addressing (based on a file hash) is simply impractical in a trustless system, once you consider that it's synonymous with a big attack vector. If this is true (and I can't prove it isn't) then my apologies for even being one of the proponents of the idea.
This is much longer than I wanted (sorry about that).
Separating this into two sections, one about how the DHT works and the other about whether I think this approach of storing CID mappings in the DHT is something we should do.
DHT Background Clarification
> > Almost anything you can do with a custom DHT record type you can do with provider records + a custom protocol. The advantage of using the DHT is generally that someone can publish and then go offline and the record is still there (for a while, e.g. a day), however, by going the custom protocol route you can have things work even if a client doesn't have a DHT implementation (or it's been turned off)
>
> I would have said that the big advantage of the DHT is you can find things with it.

You're right, I was being sloppy with my wording/explanation here. The point I was trying to make is that if you wanted to put some custom key-value pair `(k, v)` in the DHT, generally you can get around the DHT not supporting your custom pair by doing this:
- Make a key `k' = Multihash("MyKeyType:" + k)` using some common hash function (e.g. SHA2)
- Put provider records in the DHT saying that you are a provider of `k'`
- Register a protocol handler that does whatever you want (e.g. returns `v` given `k`)
- Users who want to find the value for `k` calculate `k'`, find the providers of `k'`, connect to them using the custom protocol, and ask them for `v`

This strategy is how IPNS over PubSub works and manages to be faster than IPNS over the DHT even for first time lookups (relevant logic largely lives in go-libp2p-pubsub-router and go-libp2p-discovery).
The things you lose over having native DHT record support are the ability to publish and then go offline for a few hours and then come back online later. This isn't to say we shouldn't figure out a way to make it easier to upgrade the DHT and support new record types, just that this other path works.
Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, and so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then definitely we don't want to introduce the first weak link in the chain.
Generally, open p2p systems are subject to Sybil attacks, and this includes IPFS. There are mitigations available (and we even do some of them), but overall the concept persists. The question to ask is what bad thing occurs/can occur if someone performs a Sybil attack. Thus far the only thing that happens to IPFS under a Sybil attack is basically a denial of service attack on the resource being attacked.
Thoughts on unverified relationships between two CIDs
Continuing from above this proposal allows for different types of attacks than just a DoS on the Sybil attacked resource, and without some reputation backing isn't too difficult to pull off.
- Attack: The adversary can say "Popular file with SHA2 X corresponds to DAG Y" where DAG Y is the wrong data
- Now the user is forced to not just get an "error not found" but actually download a potentially large amount of data before realizing they've been duped
- The adversary doesn't even need to waste resources since DAG Y can be hosted by someone else
- The adversary can cause a DoS on the "someone else" hosting the data by overwhelming them with innocent peers thinking Y is really some popular content
It allows you to create a virtual "block" keyed by its hash using any multi-hash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like;
I think proposal 09 covers most of these use cases at the network layer if clients are willing to make custom block stores locally. Zooming out if we have a primitive that lets us download arbitrarily sized blocks and we want to be able to download these blocks in parts from multiple peers who are transforming the data in various ways (chunking, compression, etc.) that's ok as long as peers present a virtual blockstore that presents the data in its canonical form. This might end up requiring computation/storage tradeoffs (e.g. for decompression), but it gives us verifiability which IMO is key.
- Create a whole-file sha-256 CID-alias that points at a CID containing a single uploaded file. This means you can fetch the file using its whole-file sha-256 hash, instead of a hash that varies depending on the chunking algorithm chosen.
This proposal does exactly that, you just advertise a provider record in the DHT for the full-file SHA-2 and can then download without worrying about chunking schemes, etc. and it does it verifiably.
- Create whole-file sha-256 CID-aliases that point at each file inside a single CID that contains an uncompressed tar file.
This proposal doesn't cover this use case, but the idea of working with compression representations as if they're the data itself seems like a bit of a mine field with people wanting various different things out of compression.
One ask is "given that Bob has a compressed version of the file F how can Alice download it knowing only SHA2(F)?" and if you want to be able to download from multiple people who could have compressed the data differently then either way you'll need to be downloading based on the raw bytes in the file. If so, then Bob can have a virtual datastore where if someone asks him for F he'll decompress it before sending.
Another ask is to try and minimize download bandwidth, perhaps by using compression. That compression might come at the data layer, or could come at the transport layer (there's an RFC in this repo about compression in the Bitswap transport). If the question is "I want to download data from peers who have gzip(F) and all I know is SHA2(F)" then we run into verifiability problems. There might be value here, especially if I'm building a friend-to-friend network on IPFS and I know who I trust, but given the verifiability issues IMO this doesn't seem great. If downloading gzip(F) is really so important to the user then they could always pass around references to it as SHA2(gzip(F)).
- Add sha-256 CID-aliases for every node in an existing merkle-tree DAG, so that they can be referenced not only by the hash of the node, but by the sha-256 hash of all the data under that node.
Even assuming that by merkle-tree DAG you're referring only to UnixFS files I'm still confused. I'm not sure I see the immediate value here, but couldn't this proposal mostly let you do this? If an application wanted to treat a subset of a file as a new file then they could just do another add and this reduces to case 1.
- Create an IPLD DAG using a particular chunking and hash algorithm that is actually a "virtual view" of data already uploaded into IPFS with a completely different chunking and hash algorithm. The "leaf nodes" in this virtual-view DAG will be cid-aliases into the already uploaded data, and would not be limited by IPFS's 2M block size. Note all the data in these different DAGs will be fully de-duplicated.
This seems like it's really about clients having multiple virtual blocks backed by the same underlying bytes. As long as there is a canonical representation that people can advertise/search under and there is a canonical download representation (e.g. the raw bytes of the large block/file) then how you internally represent that single block is up to you.
@aschmahmann I agree with everything you said, and you outlined the attack vector very well too. I also agree there's not really any value to identifying individual merkle nodes by their SHA2 just because the entire file is identified by SHA2, with all due respect to the person bringing that up. He is thinking "the right way" however, because normally in a tree-architecture when you have every node being "handled" identically to every other node (including the root itself) that's usually indicative of a good design, so it made sense to bring that up.
Apologies for the late and waaay too long post. I started writing this some time ago and RL kept interfering before I could post it, so each time I came back to it I added more, and now I'm struggling to edit it down. I guess the TLDR is;
I'm not against this proposal. The reverse-download-block-incremental hash idea is sound, and a good way to validate the download incrementally so it can be rejected early if it's bad. It's probably the only way for non-homomorphic hash functions, which are the ones people currently use and care about. I just wanted to point out;
- You can store hash-state at arbitrary byte boundaries by also including the hash-chunker-tail-fragment in the hash state. This means you don't need to re-chunk the data into hash-chunker-aligned chunks, but can calculate the incremental hash for the existing merkle-dag data blocks with arbitrary alignment. Decent de-duplicating chunkers WILL chunk on arbitrary boundaries.
- If you do want to keep the hash-state a little smaller and/or optimize the chunk size for the incremental hash, a CID alias in the DHT could be used to efficiently build a "virtual merkle-tree" with its own chunking that refers to/reuses the data in an underlying existing "as-added" merkle-tree with different chunking. This feature would also be a useful building block for other uses.
A lot of the rest of my words are me thinking aloud and getting a bit off-topic... I'm sorry about that.
I would have said that the big advantage of the DHT is you can find things with it.
You're right I was being sloppy with my wording/explanation here. The point I was trying to make is that if you wanted to put some custom key-value pair `(k,v)` in the DHT generally you can get around the DHT not supporting your custom pair by doing this:
- Make a key `k' = Multihash("MyKeyType:" + k)` using some common hash function (e.g. SHA2)
- Put provider records in the DHT saying that you are a provider of `k'`
- Register a protocol handler that does whatever you want (e.g. returns `v` given `k`)
- Users who want to find the value for `k` calculate `k'`, find the providers of `k'`, connect to them using the custom protocol, and ask them for `v`
You are right that you can do this, but it's very inefficient and more unreliable when 'v' is smaller than a list of the peers hosting a bitswapped block containing 'v' would be. Putting 'v' directly in the DHT instead of in an IPFS block means you avoid more network transfers and round-trips (bitswap is avoided entirely), avoid storing as much data in the DHT (v is smaller than the peer records would be) and storage in peer's blockstores (you don't need peers to store 'v' in a block at all), and avoid relying on available peers providing the block.
I was proposing a 'v' record that contains only a CID, offset, length, and maybe a publishing-peer-signature, which would be smaller than a list of peers. This would be like a sym-link to any arbitrary piece of data in IPFS, referenced by a hash of that data. This would be a low-level efficient piece that other things could be built on top of. How you safely use this feature is probably something for upper layers using it to figure out.
But you are right; this feature doesn't need to be added to the DHT to build proof-of-concept implementations using this idea. Adding it to the DHT would be an optimization that could be done later. I guess my main point is that it's worth designing things with this optimization in the back of your mind, so that you don't end up with something that's hard to migrate to use it later.
This strategy is how IPNS over PubSub works and manages to be faster than IPNS over the DHT even for first time lookups (relevant logic largely lives in go-libp2p-pubsub-router and go-libp2p-discovery).
I don't know much about pubsub (yet) but I'm betting pubsub has its own push notification/lookup system that bypasses the DHT and propagates changes faster than DHT writes propagate. I suspect it's this publishing path that makes it faster, not that pubsub puts its 'v' values in bitswapped blocks.
I'm willing to bet that a DHT lookup of a tiny 'v' record directly in the DHT will nearly always be faster than looking up a 'v' record in a block. The only exception might be when the 'v' record is already cached in the local blockstore, but even then caching in the DHT client should be just as fast. If it's not, then there is something wrong.
The things you lose over having native DHT record support are the ability to publish and then go offline for a few hours and then come back online later. This isn't to say we shouldn't figure out a way to make it easier to upgrade the DHT and support new record types, just that this other path works.
That's the obvious functional difference, but there's a big performance difference too.
Worst case scenario is a large number of hackers joining the DHT and publishing thousands of incorrect CIDs for popular SHA-256 files, and so designing to fight that scenario is critical and actually may not even be "doable". If IPFS is currently capable of stopping that kind of attack then definitely we don't want to introduce the first weak link in the chain.
Note bad clients can "add" CID entries that don't match the blocks they serve for them too. The only protection is these "bad blocks" are small and thus identified as bad before much is downloaded, and peers providing these bad blocks (presumably) get ignored eventually.
There are several different levels to what a CID alias could do, with increasing features and risks;
1. Just provide an alternative CID using a different hash to an existing block. This could e.g. be used to provide a sha1 alias, as needed by git, to a block added using a sha256 hash. This only requires a CID, no offset+length range, and no peer-signature (it would not add anything). This has no greater risk and exactly the same protections as the existing CIDs, with the exception that obviously sha1 is a weaker hash. Note that this could be done without CID aliases by just re-adding the block using a different hash, but CID aliases mean the blocks and lists of providers are shared/de-duplicated, at the cost of an extra DHT lookup to de-reference the CID alias.
2. Provide an alternative alias CID using a hash of all the "raw data" under that CID. This could be used to provide a sha256 alias to a whole file, or any node in a merkle-tree-DAG, by a hash of its content. This doesn't require an offset+length range, but it probably does require a peer-signature. It can only be fully validated after downloading all the "raw data", but note that each block under the CID is validated normally as part of that CID, and the final validation of the whole data is verifying that the CID alias points at a CID that has the correct overall hash. Before starting the download, the CID alias peer-signature can be used to check that it has been published by a trusted peer, and peers found to be publishing bad CID aliases can be blacklisted.
3. Provide an alias to an arbitrary offset/length "large block" of data under a CID, keyed using the hash of that data. This is the same as 2. except it also requires an offset+length range. Its risks and mitigations are the same as 2., with the extra risk that degenerate CID aliases could point a high-level CID alias with a range on either side of a block-boundary, requiring downloading all the merkle-DAG nodes from the root down to and including the two raw nodes on either side of the boundary just to get a tiny piece of data. This is likely to be a minor inefficiency, but if it does look like a deliberate DOS attempt the signing peer can be blacklisted.
I agree that CID aliases are more vulnerable to attacks than CID's and IPNS records, but that's largely because IPFS delegates and denies responsibility for the big risk part; that the CID or IPNS record points at the data that the person publishing says it does. A person can "add" a compromised binary and publicize the CID or even an IPNS record as the place to get it, and IPFS will happily verify that yep, you've downloaded the compromised binary exactly as it was "added", and will not tell you it's been compromised. Verifying the downloaded binary against an officially published hash is something that (currently) has to be done outside IPFS.
It allows you to create a virtual "block" keyed by its hash using any multi-hash, of any data in IPFS, regardless of how that data is already chunked and signed. This means you can do things like;
I think proposal 09 covers most of these use cases at the network layer if clients are willing to make custom block stores locally. Zooming out if we have a primitive that lets us download arbitrarily sized blocks and we want to be able to download these blocks in parts from multiple peers who are transforming the data in various ways (chunking, compression, etc.) that's ok as long as peers present a virtual blockstore that presents the data in its canonical form. This might end up requiring computation/storage tradeoffs (e.g. for decompression), but it gives us verifiability which IMO is key.
The bit about "if clients are willing to make custom block stores locally" worries me. I don't think this is necessary, and implies that a whole bunch of de-duplication at the blockstore and (more importantly) network layer will be broken.
I was thinking the blockstore and network layer would always use the underlying IPFS "as-added" native merkle-dag using the existing fetching/storing/caching mechanisms, and "large blocks" would be a higher-level abstraction for "reading" arbitrarily offset/sized "virtual blocks" from a CID. Under the hood it would just fetch/walk the original merkle tree and download the relevant "leaf" raw data blocks that encapsulate that larger virtual block. The addition of the DHT CID alias idea would give you a way to reference/search for these "virtual blocks". This would mean the local data store and network layers would be unchanged, and all the raw data for any large blocks would deduplicate/reuse the native merkle-tree data.
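The "fetch only the leaves that cover the virtual block" step could look roughly like the Go sketch below. It assumes the leaf sizes of the as-added DAG are already known in left-to-right order (for example from the link sizes in the intermediate nodes); `leafSpan` and `leavesForRange` are illustrative names only, and walking the actual merkle tree is not shown.

```go
package main

import "fmt"

// leafSpan identifies one leaf block and the part of it that is needed.
type leafSpan struct {
	Index int    // position of the leaf in left-to-right order
	From  uint64 // first needed byte within the leaf (inclusive)
	To    uint64 // end of the needed bytes within the leaf (exclusive)
}

// leavesForRange maps the byte range [offset, offset+length) of the whole
// file onto the leaf blocks that contain it.
func leavesForRange(leafSizes []uint64, offset, length uint64) []leafSpan {
	if length == 0 {
		return nil
	}
	var spans []leafSpan
	end := offset + length
	var start uint64 // file offset at which the current leaf begins
	for i, size := range leafSizes {
		leafEnd := start + size
		if leafEnd > offset && start < end {
			from := uint64(0)
			if offset > start {
				from = offset - start
			}
			to := size
			if end < leafEnd {
				to = end - start
			}
			spans = append(spans, leafSpan{Index: i, From: from, To: to})
		}
		start = leafEnd
		if start >= end {
			break // every later leaf lies past the requested range
		}
	}
	return spans
}

func main() {
	// Three leaves of uneven size, as a content-defined chunker might produce.
	for _, s := range leavesForRange([]uint64{100, 50, 200}, 120, 100) {
		fmt.Printf("leaf %d bytes [%d, %d)\n", s.Index, s.From, s.To)
	}
}
```

Because only the existing leaves are fetched, the blockstore and network layers never see the virtual block itself, which is the de-duplication property being argued for here.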
- Create whole-file sha-256 CID-aliases that point at each file inside a single CID that contains an uncompressed tar file.
This proposal doesn't cover this use case, but the idea of working with compression representations as if they're the data itself seems like a bit of a mine field with people wanting various different things out of compression.
One ask is "given that Bob has a compressed version of the file F how can Alice download it knowing only SHA2(F)?" and if you want to be able to download from multiple people who could have compressed the data differently then either way you'll need to be downloading based on the raw bytes in the file. If so, then Bob can have a virtual datastore where if someone asks him for F he'll decompress it before sending.
Note this would not work with compression, only concatenation, as is done by eg tar (NOT tar.gz). You could write a custom tar file uploader that not only gave you a CID for the whole tar file, but a CID alias for every file inside the tar file. This would be more efficient than doing "add" of the tar and each file individually UNLESS you had/used a tar-file aware chunker/dag-builder that could achieve the same thing by building the merkle-dag to reflect the tar-file contents.
- Add sha-256 CID-aliases for every node in an existing merkle-tree DAG, so that they can be referenced not only by the hash of the node, but by the sha-256 hash of all the data under that node.
Even assuming that by merkle-tree DAG you're referring only to UnixFS files I'm still confused. I'm not sure I see the immediate value here, but couldn't this proposal mostly let you do this? If an application wanted to treat a subset of a file as a new file then they could just do another add and this reduces to case 1.
This idea was just presented as an example that someone might see a use for.
- Create an IPLD DAG using a particular chunking and hash algorithm that is actually a "virtual view" of data already uploaded into IPFS with a completely different chunking and hash algorithm. The "leaf nodes" in this virtual-view DAG will be cid-aliases into the already uploaded data, and would not be limited by IPFS's 2M block size. Note all the data in these different DAGs will be fully de-duplicated.
This seems like it's really about clients having multiple virtual blocks backed by the same underlying bytes. As long as there is a canonical representation that people can advertise/search under and there is a canonical download representation (e.g. the raw bytes of the large block/file) then how you internally represent that single block is up to you.
This is actually about creating multiple virtual blocks you can advertise/search for, that clients can then fetch/store using the "as-added" merkle-tree native representation. It's not purely internal, because the virtual blocks are advertised/searched for by other clients. That the network/storage layers use the native merkle-tree representation means all the data is transmitted/deduplicated at that layer, and clients assemble them into the "large blocks" themselves as needed.
I love these proposals @aschmahmann. One quick note, you've probably seen it already, but we had this RFC in a REALLY early brainstorm stage (we ended up not having the time to discuss it and develop it further) that goes in the line of these two RFCs. It was more focused on allowing the exchange of larger blocks through the wire (without having to change any of the building blocks, i.e. blockstore, CIDs, max block size, etc.). Our proposal was to introduce the concept of piece as an "irreducible aggregation of blocks" uniquely identified through a multihash. Thus, clients would be able to request in the exchange manifest the size of the pieces it wants to receive. This would promote the reception of content from peers that store a minimum number of blocks of the requested content. This requires some additional computation in the seeder's side to build these "pieces", but they can be easily cached (similar to what is done in FIB tables of in-network caches):
Your proposals are way more thoroughly thought out, and I think they deprecate this RFC, but I wanted to reference it here in case it inspires someone, and for the sake of completeness. Edit:
The fact that the client is the one pulling the desired block size prevents the "bogus data" misbehaviour. However, this scheme can still introduce new potential attacks to worry about, such as clients requesting extremely large piece sizes, making seeders do a lot of useless work in building the piece.
@adlrocha yep I took a brief look at the RFCs in the repo, although I may not have fully grasped all of them 😅. My understanding of rfcBBL207 is that it's pretty different from this in that it is basically asking for an IPLD selector version of Bitswap's HAVEs where I say "please tell me if you have this CID and some part of the graph underneath it" as a way of not wasting connections. If this is a concern it might be useful to utilize #25 to get something like a graph manifest from many peers and then use that to inform scheduling of which peers to ask for which blocks without needing many HAVE queries.
The above approach might alleviate your concern here assuming that running the IPLD selector query isn't too expensive or could be limited. PS: If GitHub ever introduces threads outside of commenting on a random line in a PR I'll be a happier person 😄
1. There's a lot of confusion around UnixFS CIDs not being derivable from SHA256 of a file, this approach may either tremendously help or cause even more confusion (especially as we move people from UnixFS to IPLD). An example [thread](https://discuss.ipfs.io/t/cid-concept-is-broken/9733) about this
2. Storage overhead for multiple "views" on the same data and extra checking + advertising of the data
3. Are there any deduplication use case issues we could run into here based on users not downloading data that was chunked as the data creator did it, but instead based on how they want to chunk it (or likely the default chunker)
I believe this would make public gateways way more useful by removing MITM risk (cc @Gozala @autonome)
4. File identified using hash of the full content enables [validation of HTTP gateway responses](https://github.com/ipfs/in-web-browsers/issues/128) without running full IPFS stack, which allows for:
    - user agents such as web browsers: display integrity indicator when HTTP response matched the CID from the request
    - IoT devices: downloading firmware updates over HTTPS without the need for trusting a gateway or a CA
There are still issues with directories, but yes for files this can help, good idea 💡!
…nses Co-authored-by: Marcin Rataj <lidel@lidel.org>
I think that there might be a use for LTHash in here somewhere. It's a homomorphic hash that has the property that it's composable, so the final hash can be computed over the chunk hashes regardless of how it's chunked. I think this would address the issue of CID aliases where a file has multiple aliases because different chunkers were used. With an LTHash the file hash can be computed consistently over multiple chunkings. I think you can also publish a list of chunk hashes and verify that it would produce the final hash as long as the intermediate hashes hashed correctly. It still doesn't solve the problem if someone wanted to look up by some other hash like sha256, but it might provide a way of referring to the complete file hash that might be more resistant to Sybil attacks. LTHash is the first practical composable hash that I know of, as SL2Hash had issues. https://github.com/lukechampine/lthash
I'm a bit confused why this proposal is trying to roll in the ideas from #25 all at once. I think it would be simpler to just start with the naive and trustless reverse stream algo, and then combine it with #25. Basically, the core idea is all hashing induces a "freestart Merkle DAG", and we are taking advantage of that for the first time. I'm not against the #25-like bits, I just want to layer ideas as well as possible and reduce risk.
Basically because without it L2-10 is IMO pretty useless, since instead of being able to download a file in parallel across many peers you now (assuming SHA1/2) can stream data from at most one peer and do so at a rate of at most 1MiB per RTT. Here is how I see it: You can do L2-09 without #25, but its utility (assuming you're using SHA1/2 and not a tree based hash) IIUC is basically restricted to dealing with existing formats like Git that happen to sometimes have >1MiB blocks, and so downloading really slowly is better than nothing. Having something like #25 is what makes this useful enough to even bother proposing L2-10.
I'll admit just getting something to work with Git is my main priority. And what about the layering argument, where you have:
The IPFS ecosystem as a whole is phenomenal at layering. I'm not sure whether the graphsync vs bitswap distinction is supposed to be permanent or is more to track the evolution of a protocol, but if it's the former, the 3 RFC split keeps that distinction. Finally, per "Make it work, make it right, make it fast", it seems like it's good to be able to sign off on the relatively easy parts while the fancier graph sync stuff is worked out? E.g. there are a gazillion variations on the latter two steps (in the 3-part plan) that people might want, as evidenced by the big discussion this PR has seen already. But the first step is such a constrained design space I think there's not too much to disagree about.
I want to put forward what I see as the simplest version of this proposal, which is just a mechanism to support block sizes > 1MB in bitswap. This avoids the question of DAG equivalence, new multicodecs, implicit IPLD DAG relationships, etc.

Currently, bitswap & graphsync have the following format for blocks at the protocol layer:

```go
type Block struct {
	Prefix []byte
	Data   []byte
}
```

I believe if we now modify this to be something like:

```go
type Block struct {
	Prefix       []byte
	Data         []byte
	IsChunked    bool
	ShaSoFar     []byte
	RemainderSha []byte
}
```

and require that the chunks are sent in order, we have everything we need to support arbitrarily large blocks in bitswap and graphsync. We'd maintain a map in bitswap / go-graphsync (actually probably a ministore on disk in reality):

```go
var inProgressBlocks map[cid.Cid]struct {
	bytesSoFar []byte
	shaSoFar   []byte
}
```

If I get back a chunked block and the ShaSoFar is nil, I hash Data + RemainderSha based on the Prefix to get the original CID, and make sure there is no existing entry in the map AND the CID matches a want. I make a new entry.

If I get back a chunked block with ShaSoFar != nil, I hash ShaSoFar + Data + RemainderSha based on the Prefix to get the original CID, and make sure there is an existing entry in the table AND the ShaSoFar matches the one in the Block. I add the data and update the ShaSoFar to include it.

If RemainderSha = nil, I verify the updated ShaSoFar matches the whole CID, remove the entry from the blockstore, and save the assembled block.

Potential Advantages/Disadvantages:
I think all the other larger questions are important ones to address, but if the immediate need is serving Git objects, this strikes me as the fastest and simplest possible path. Edit: Upon further reflection I believe I have simply reinvented @Ericson2314's #30 (though perhaps fleshing out what it looks like at the protocol level is helpful?)
I would like to see us define "DAG equivalence" at an IPLD level more clearly before we move towards trying to support looking up files in IPFS by the sha of all their bytes. I believe what we are essentially talking about, for files at least, is equivalence in the context of IPLD's Flexible Byte Layout ADL: https://github.com/ipld/specs/blob/master/data-structures/flexible-byte-layout.md
* When a server responds to a request for a block if the block is too large then instead send a traversal order list of the block as defined by the particular hash function used (e.g. linear and backwards for SHA-1,2,3)
* Large Manifests
* If the list is more than 1MiB long then only send the first 1MiB along with an indicator that the manifest is not complete
* When the client is ready to process more of the manifest then it can send a request WANT_LARGE_BLOCK_MANIFEST containing the multihash of the entire large block and the last hash in the manifest |
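One possible reading of the large-manifest rules above, sketched in Go. This assumes fixed-size 32-byte manifest entries and that no entry repeats within a manifest; `manifestPage` and `pageAfter` are hypothetical names, and the actual WANT_LARGE_BLOCK_MANIFEST wire format is not specified here.

```go
package main

import "fmt"

const (
	hashSize    = 32      // assumed entry size (SHA-256 sized)
	maxManifest = 1 << 20 // 1MiB cap per response, from the text above
)

// manifestPage is a hypothetical response shape.
type manifestPage struct {
	Hashes   [][hashSize]byte
	Complete bool // false means more of the manifest remains
}

// pageAfter returns the next slice of the manifest that fits in 1MiB.
// A nil last starts from the beginning; otherwise the page starts just
// after the entry equal to *last, mirroring a client that sends the last
// hash it has already processed.
func pageAfter(manifest [][hashSize]byte, last *[hashSize]byte) (manifestPage, error) {
	start := 0
	if last != nil {
		start = -1
		for i, h := range manifest {
			if h == *last {
				start = i + 1
				break
			}
		}
		if start < 0 {
			return manifestPage{}, fmt.Errorf("resume hash not found in manifest")
		}
	}
	end := start + maxManifest/hashSize
	if end > len(manifest) {
		end = len(manifest)
	}
	return manifestPage{Hashes: manifest[start:end], Complete: end == len(manifest)}, nil
}

func main() {
	// Build a sample manifest of 40000 distinct dummy entries.
	manifest := make([][hashSize]byte, 40000)
	for i := range manifest {
		manifest[i][0], manifest[i][1], manifest[i][2] = byte(i), byte(i>>8), byte(i>>16)
	}
	first, _ := pageAfter(manifest, nil)
	lastSeen := first.Hashes[len(first.Hashes)-1]
	second, _ := pageAfter(manifest, &lastSeen)
	fmt.Println(len(first.Hashes), first.Complete)
	fmt.Println(len(second.Hashes), second.Complete)
}
```

A real server would presumably index the manifest rather than scan it linearly for the resume hash; the scan is only there to keep the sketch short.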
How about: Instead of special-casing the manifest file (and having to deal with large vs small manifests), recursively treat the manifest as a downloadable artifact:
If the manifest is small (<1MB), send the whole manifest in the response; otherwise send the manifest of the manifest.
These are two proposals, one is mostly about Bitswap and the other about IPFS. They build off of the concepts from #25.
Areas I think are in need of the biggest confirmation/feedback:
A lot of this was taken from https://hackmd.io/@adin/sha256-dag which has markdown that allows graphviz drawings 😄