IPIP: format for denylists for IPFS Nodes and Gateways #299
# IPIP 0002: Denylists for IPFS Nodes and Gateways

- Start Date: 2022-07-14
- Related Issues:
  - [ipfs/specs/issues/298](https://github.com/ipfs/specs/issues/298)
## Summary

This IPIP proposes a descriptive and maintainable denylist format, leaving open the possibility of allowlists in the future. These lists can be used to help standardize content moderation for IPFS nodes and gateways.
## Motivation

The current [Bad Bits Denylist](https://badbits.dwebops.pub/denylist.json), which is a list of hashed CIDs, has several major drawbacks for implementers. One is the lack of support for content paths: it is hard to block an IPNS path based only on CIDs. Another is the lack of a description and response status code for each anchor. Descriptions matter especially for hashed blocking items: since the hashing is one-way, a description can record why an item was blocked, increasing maintainability.

A well-thought-out denylist schema can ease the implementation of the denylist and bring the community to consensus on how content blocking works for gateway and node operators.
## Detailed design

### Denylist Schema

Here is the proposed denylist; each field is explained in detail below.
```js
{
  action: "block",
  entries: [
    {
      type: "cid",
      content: "bafybeihfqymzmqhbutdd7i4mkq2ltzznzgoshi4r2pnv4hsc2acsojawoe",
      description: "ipfs quick start",
      status_code: 410
    },
    {
      type: "content_path",
      content: "/ipns/example.com",
      description: "example.com",
      status_code: 410
    },
    {
      type: "content_path",
      content: "/ipfs/bafybeihfqymzmqhbutdd7i4mkq2ltzznzgoshi4r2pnv4hsc2acsojawoe",
      description: "ipfs readme",
      status_code: 451
    },
    {
      type: "hashed_cid",
      content: "9056e0f9948c942c16af3564af56d4bb96b6203ad9ccd3425ec628bcd843cc39",
      description: "sensitive cid that needs to be blocked",
      status_code: 451
    },
    {
      type: "hashed_content_path",
      content: "65e60fcaa506ca5b0b49d7aa73df5ba32446bddb4e72a1f8bb5df12eaaaa8745",
      description: "sensitive content path that needs to be blocked",
      status_code: 410
    }
  ]
}
```
#### `action`

Though the format is called a denylist, the only allowed value of the `action` field here is `block`. Other actions such as `allow` can be added in the future to enable allowlists or other types of content lists.
#### Each denylist entry

- `type`: specifies the type of content to be blocked, e.g. `cid`, `hashed_content_path`.
- `content`: stores the content to be blocked, interpreted according to the type. It is suggested that all CIDv0 be converted to CIDv1 for consistency. For `content_path` entries, this field holds the content path to be blocked.
- `description`: a description of the CID or content path.
- `status_code`: the status code to be returned for the blocked content, e.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone), [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons), or `200 OK` for an allowed entry.

> **Review comment:** This should be authoritative. Secondly, I don't fully understand why we are giving out HTTP codes; assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific.

> **Review comment:** A status code of […]
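As a purely illustrative sketch (the function and variable names below are hypothetical, not part of the spec), a consumer of this format could index non-hashed entries by `(type, content)` and answer requests with the entry's `status_code`, falling back to `200 OK` for content that is not listed:

```python
# Hypothetical consumer of the proposed denylist schema. The entry fields
# (type, content, description, status_code) follow the schema above; the
# index structure and helper names are illustrative only.

denylist = {
    "action": "block",
    "entries": [
        {"type": "cid",
         "content": "bafybeihfqymzmqhbutdd7i4mkq2ltzznzgoshi4r2pnv4hsc2acsojawoe",
         "description": "ipfs quick start",
         "status_code": 410},
        {"type": "content_path",
         "content": "/ipns/example.com",
         "description": "example.com",
         "status_code": 410},
    ],
}

# Index entries for O(1) lookups: (type, content) -> entry.
index = {(e["type"], e["content"]): e for e in denylist["entries"]}

def status_for(entry_type: str, content: str) -> int:
    """Return the entry's status_code if blocked, else 200 (serve normally)."""
    entry = index.get((entry_type, content))
    return entry["status_code"] if entry else 200
```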
**Side notes on `hashed_cid` & `hashed_content_path` types**

The main difference between non-hashed entries and hashed ones is that in hashed entries the CIDs or content paths are hashed, so no plaintext appears in the list. Following [bad bits](https://badbits.dwebops.pub/), each CID or content path is hashed with `sha256()`, so the mapping is easy to compute in one direction but not the other. Hashed entries are designed to store sensitive blocked items and prevent the list itself from becoming an easily accessible index of sensitive content.

> **Review comment:** Using sha256 is sensible, but hard-coding a specific hash function in the spec is against the spirit of Multiformats, which we aim to use in the IPFS stack. Perhaps we could future-proof this at a low cost. This will keep the digest string intact, but turn the field into a valid Multihash, allowing list creators to switch the hash function in the future. An alternative is to have the hash function type in a separate field, but this seems less expensive. We could even make it a valid […]

> **Review comment:** I don't see why not to use the multibase prefix; it's cheap to compare, and it allows people to use a more compact base64 or base2048 in the future. We don't need the full CID, so I do like @lidel's idea of just using the multihash.
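For illustration, a hashed-entry lookup might work as follows. This is a hypothetical sketch, assuming the hash is plain `sha256` over the normalized CID or content-path string; the exact normalization rules (e.g. trailing slashes) are left to the spec, and the names below are not part of it:

```python
import hashlib

def hash_entry(content: str) -> str:
    """Hash a CID or content path the way a hashed_* entry would store it (sha256 hex)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

# In a real deployment the operator ships only the precomputed digests;
# we compute one inline here purely for the example.
hashed_entries = {
    hash_entry("/ipns/very-bad.example.com"): 410,
}

def check_blocked(path: str):
    """Return the status_code if the path matches a hashed entry, else None."""
    return hashed_entries.get(hash_entry(path))
```

Because the hash is one-way, an operator holding only `hashed_entries` can enforce the block without being able to enumerate the sensitive paths.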
Before hashing, all CIDv0 in both the `cid` and `content_path` fields are converted to CIDv1 for consistency.

> **Review comment:** I don't support base32; whoever is hashing this already knows IPFS-specific details, so I don't think there is value in using a text-based format. This is a non-forward-compatible change, please be careful.

> **Review comment:** I don't see how we can avoid CIDs-as-strings when arbitrary […]. Are you suggesting this IPIP define a different CID normalization rule for […]
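For illustration, the CIDv0-to-CIDv1 normalization above amounts to: base58btc-decode the CIDv0 (which is a bare sha256 multihash), prepend the CIDv1 version byte (`0x01`) and dag-pb codec byte (`0x70`), and re-encode as multibase base32. This is a minimal dependency-free sketch, not a reference implementation; real implementations would use a multiformats library:

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58decode(s: str) -> bytes:
    """Decode a base58btc string (the encoding used by CIDv0)."""
    n = 0
    for ch in s:
        n = n * 58 + B58_ALPHABET.index(ch)
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    # Leading '1' characters encode leading zero bytes.
    pad = len(s) - len(s.lstrip("1"))
    return b"\x00" * pad + raw

def cidv0_to_v1(cid: str) -> str:
    """Convert a CIDv0 ('Qm...', dag-pb) to its CIDv1 base32 form ('b...')."""
    if not cid.startswith("Qm"):
        raise ValueError("not a CIDv0")
    multihash = b58decode(cid)
    # CIDv1 bytes = <version 0x01><codec 0x70 dag-pb> + multihash,
    # then base32-lower without padding, with the 'b' multibase prefix.
    payload = b"\x01\x70" + multihash
    b32 = base64.b32encode(payload).decode("ascii").lower().rstrip("=")
    return "b" + b32
```

Any sha256 dag-pb CIDv0 normalized this way yields a string starting with `bafybei`, matching the `cid` examples in the schema above.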
## Design rationale

The gist of the rationale is to address the inconveniences of implementing blocking with the [current denylist](https://badbits.dwebops.pub/denylist.json). Adding support for content paths, descriptions, status codes, and actions makes the denylist more maintainable, extensible, and easier to implement. This is especially true when a list is maintained by multiple parties or needs to keep records for auditing.

Other minor design decisions, including CIDv1 normalization and allowing both plaintext and hashed entries in one denylist, were also made to ease implementation. They enforce consistency across multiple denylists, paving the way for wider adoption.

Denylists are important to empower gateways to make their own policy decisions. While a gateway can access any IPFS content, it might decide not to serve all of it. This could be for reputation concerns, for safety, or for internal reasons.
### Operational benefit

The proposed schema eases the implementation of denylists for gateway and node operators. It supports both CIDs and content paths, and each entry has a customizable description and response status code.

Another operational benefit comes with wide adoption of the proposed format: a newly onboarded gateway operator can use a shared denylist to start a gateway right away.
### Compatibility

No existing implementations yet.

*lidel marked this conversation as resolved.*
### Security

The following concern may lie outside the scope of this proposal, but it is worth mentioning: blocking CIDs that are not malicious and are widely used can jeopardize the availability of many sites on that IPFS gateway. Possible examples include common images, JavaScript bundles, or stylesheets.
### Alternatives

> **Review comment:** nit: it would be useful to mention whether https://jsonlines.org was considered instead of JSON. It allows applying big denylists without loading the entire thing into memory. It seems the direction is to split lists into composable chunks instead, but it is worth documenting why we did not choose https://jsonlines.org.

[Bad Bits Denylist](https://badbits.dwebops.pub/) focuses on blocking publicly flagged CIDs for IPFS node operators; the blocking mainly happens between nodes.

[Denylist implementation of NFT.Storage](https://github.com/nftstorage/nft.storage/pull/1721/files) follows the above bad bits denylist format and creates [a separate denylist](https://github.com/nftstorage/nft.storage/pull/1721/files#diff-05dcde18c34b023574f6f073330869c633ee086a5a4917de2016d49e6044a3ee) for specific usage.
### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
> **Review comment:** Another question raised during triage today: what happens when the block list grows to megabytes? This is a real concern, as https://badbits.dwebops.pub/ alone is getting close to 900 KiB, and history shows that even efficient pattern-matching lists like adblock lists reach multiple megabytes in size. This spec should provide a way to represent big, big blocklists. 💡 One idea is to introduce a special entry with `type: "import"` and either `cid` or `content_path` pointing at some other list. This provides a solution for sharding and maintaining big lists AND allows composing denylists using existing ones.
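An `import` entry along the lines of that suggestion (purely hypothetical; `type: "import"` is not part of the current schema) might look like this, with the referenced path resolving to another denylist document in the same format:

```js
{
  action: "block",
  entries: [
    // Hypothetical composition entry: the consumer fetches the referenced
    // list and merges its entries into this one.
    {
      type: "import",
      content: "/ipfs/bafybeihfqymzmqhbutdd7i4mkq2ltzznzgoshi4r2pnv4hsc2acsojawoe",
      description: "shared community denylist shard"
    }
  ]
}
```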
> **Review comment:** For sharded lists, we may need to specify a prefix or something to indicate which list is for which shard.
> **Review comment:** @mathew-cf you want to spec a JSON-based HAMT? 🙂
> **Review comment:** now that you mention it @Jorropo, I think composability is probably enough for now lol
> **Review comment:** Hey folks, we have been operating an internal denylist that is synced with badbits in {nft/web3}.storage. We need to align well on this direction with content paths. There is at least one limitation around this that we have found: […]
> **Review comment:** A workaround for the first case (content path blocked, bypassed via CID) could be to resolve the content path and then block that CID as well. For the second case (CID blocked, bypassed with a content path), we're planning on using `x-ipfs-roots` to ensure that none of the resolved CIDs are blocked.
> **Review comment:** We are currently hitting a limitation with the current badbits implementation, and the ideas here would not solve it either. Currently in nftstorage.link + w3s.link, we rely on anchors fetched from badbits. We put them in a KV store at the edge, and before resolving content (including via cache), we check the KV store; if the anchor is there, we don't serve the content. For cid+path resolution, we also check for the `ETag`'s presence in our denylist before serving. However, we want to move forward with more aggressive caching, and the current approach leads to problems we cannot solve. Consider, for instance, https://blog.cloudflare.com/introducing-cache-reserve/, which allows content to be permanently cached on the edge for up to 30 days (without HITs). This would be a desirable feature for a gateway, given that the costs of using this kind of feature are considerably lower than the bandwidth needed to go to the origin every time. So, if a gateway wants to rely on this kind of feature, it needs to be proactive about purging bad content from the cache. And what do we need to purge the content? Its HTTP URL, aka cid+path. With the above in mind, I think we should attempt to move away from the `hashed_cid` + `hashed_content_path` types. Root CIDs and content paths are likely what we need.
> **Review comment:** Just want to confirm: you are describing your service's use case, and not suggesting the removal of support for hashed entries? These must be part of a spec like this one, due to very bad bits.