-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPIP 298: (allow|deny)lists for IPFS Nodes and Gateways #340
Conversation
Spec for the IPIP-298. It adds a standard reference for content list definition, should these be allow lists or deny lists. It supports CIDv1, sha256 hashed CIDv1, and absolute and partial path matching. It is amied at offering a common schema for node operators to interact with content lists. It should be noted this IPIP was first written by @foreseaz, and this PR is an iteration over their work in #299.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super super excited to move this forward. Navigating the space between moderation and censorship is a tricky one and having a well thought-out spec to allow for collaboration is essential to keeping the IPFS/IPLD/Filecoin space accessible to small parties and/or ideologically motivated initiatives.
I gave it a good read and left some comments/questions. Main observations are:
- We might consider using CID's to do the double hashing (which is apparently standardised now), and gain various other benefits (e.g. CID upgrades, not worrying about codecs).
- Formalising reasons on a per-item basis might increase transparency and granularity in terms of filtering.
> | ||
> The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | ||
> | ||
> Before the hashing, all CIDv0 in `CID_V1` field is converted to CIDv1 for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the current proposal would have future CID's come in at 4-255 (e.g. CIDv2 could be 4). That could be a bit messy. Perhaps we could split the space 0-255 into various sections for semantic reasons?
Something like:
0-63
: CIDv and possibly other content-addressed identifiers.64-127
: Hashed variants of aforementioned.128-192
: Various path matching algorithms.193-155
: 🤷🏼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: the encoding for SHA256 has not been specified, and the current mechanism doesn't allow for changing hashing algorithm later.
Is there any reason not to use CIDv1 (e.g. CID of CID) which deals with encoding and hashing algo out of the box?
As a bonus, we might use multicodec to describe what it is we're hashing (e.g. CIDv1 or something else entirely, think blocking pubkeys or subnets in the future).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, as discussed, it might be good to be aware of/keep track of libp2p's double hashing efforts: ChainSafe/go-libp2p-kad-dht#1
And their implementation.
Notably, we might consider to salt our hashes to reduce rainbow table attacks, like in this comment: https://github.com/ChainSafe/go-libp2p-kad-dht/pull/1/files#r986978182
From there, I found that there's already DBL_SHA2_256
defined in go-multihash
as well as in multihash
.
Using multihash like this might even allow us to do away with the need for distinct types for CID versions or double hashed types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multihash makes sense to be incencitive to CID
for the hashing, it's expected to be fixed. If one wants to introduce a new hash method, it can add a new type.
I feel like adding a separation of the number space at this time is an overkill. What other types do you want the implementation to support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm super late to the party, and this is maybe fixed elsewhere, but I wanted to +1 @dokterbob's suggestion about salting the double hashes (and making sure that the salt doesn't match the one in the double-hashed DHT). Otherwise you're handing a treasure map to bad actors.
There's perhaps a broader discussion about the privacy properties that we need from such a system. These are all some form of private set intersection (PSI), where double hashing is the "naive" case — there are much more sophisticated techniques. (Fancier does not nessesarily mean better, but may be worth exploring)
|
||
```ipldsch | ||
type List struct { | ||
Action Int # Suggested action to be taken |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that some kind of formalisation of the reasons for blocking (either on item or on list level) could be very useful. This would ease the efforts for node maintainers to implement sensible defaults (e.g. copyrighted content we do not serve nor store, but strong violent material does not even enter the DHT). It could also facilitate granularity, transparency and accountability, as the Bad Bits precedent suggests that various kind of blocked content risks ending up in just one list.
This all or nothing type of blocking has a severe creep from going to moderation towards censorship. I imagine a scenario where 1 or 2 parties provide lists which then the majority of node maintainers follow, applying the strongest possible blocking behaviour for everything just to be on the save side.
To bring this down to earth; for ipfs-search.com we want to tell people when content is filtered because of DCMA-complaints (e.g. chilling effects) but in cases of actually illegal content we want to leave it out altogether. Similarly, I can imagine that node maintainers might not want to publicly serve copyrighted content but they might be happy to have it in their DHT and perhaps even their store, whereas they might strongly reject even relaying information about somebody POV'ing a massacre.
} | ||
|
||
type Entry struct { | ||
Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using multihash instead of CID here could help prevent content blocks from being circumvented by changing the base or codec. Additionally, it could remove the need of explicitly specifying SHA256_CID_V1
. Relates to this comment https://github.com/ipfs/specs/pull/340/files#r1018935560
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the multihash.
I feel having the sha256_multihash still make sense, as the hash query is specific to how IPFS handles block query at the moment. There is no reason to tie the list specification to this, or add complexity downstream about the hashing method for the hash.
```ipldsch | ||
type List struct { | ||
Action Int # Suggested action to be taken | ||
StatusCode optional Int # HTTP status code to be returned by IPFS gateways by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as in #299 (comment) this should be authoritatively limited.
I think anything not 451
, 410
or 200
has no place on a badlist list. I don't want people start to use 3xx or whatever.
Secondly, I don't fully understand why giving out HTTP codes, assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific,
I would like a machine readable version enum, containing things like Dos
, Legal
(maybe with a reason like Legal Copyright
, Legal Hatespeech
, Legal other
, ...), ...
Some gateway operators might prefer faking 500
s or just flat out ignore certain reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like a machine readable version enum, containing things like Dos, Legal (maybe with a reason like Legal Copyright, Legal Hatespeech, Legal other, ...), ...
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would remove the word "Legal" since not really a legal opinion or necessarily illegal depending on the jurisdiction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update StatusCode
to Categories Int[]
.
Categories
values would be subdivided in ranges. 1xxx for allow-like, 2xxx for block-like.
Specifying the categories is going to be similar to Actions. Some codes being assigned, some being for internal use, and some for future use.
If you have a list of categories you think should be included, you can list them in the comments.
} | ||
|
||
type Entry struct { | ||
Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a type that has has exactly the same meaning as the current badbits anchor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add the type, as well as its specification in the comment?
{ | ||
Type: 3, // Type.SHA256_CID_V1 | ||
Content: "9056e0f9948c942c16af3564af56d4bb96b6203ad9ccd3425ec628bcd843cc39", | ||
Description: "sensitive cid that needs to be blocked" | ||
} | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that it is very difficult to append anything to this list. You cannot have one process writing to it and another one reading from it at the same time. You need to open, read, deserialize, serialize, write, close.
That means that if you have an external process generating items to be blocked, every update needs to edit the file and then trigger the IPFS implementation to re-read it (possibly in full). Or the IPFS implementation needs to provide API to make the updates itself (and then write the file (again, possibly in full).
I have the feeling this format is not well suited to very large lists, or those updated frequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would defining a max list size as part of this spec help?
Together with Link
field, it allows for composing big lists from smaller ones, and updates would only change part of a bigger DAG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't improve anything if instead of 1 item of 100GB we have 100 items of GB, somehow linked.
I also don't think the list format should include a Link field. Links to multiple lists etc should be a meta-format above that can be implemented later if we need some glue to handle multi-subscriptions (right now not needed).
I personally think lists should be line-based files, as small as possible, editable by hand for simple things like appending a CID. See lists used by UBlock etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inefficiencies of CBOR lists vs prior art from adblockers
Ack. A compact, efficient, line-based syntax similar to static filter lists used by adblockers (example) will be more efficient and easier to consume than having CBOR.
Main problem: duplicated "Type" "Content" and "Description" strings for every entry – having these strings over and over again in CBOR gets very wasteful very fast if the list is big.
Manifest vs Compact list
So.. this PR feels like the meta format for aggregating lists you describe. In the other comment @hsanjuan suggested adding a type for already existing double hashed format from https://badbits.dwebops.pub, we could do the same here and specify a separate type for lists in the "compact syntax".
I think we should decouple and work on these things in parallel (meta-manifest and the compact list formats).
Will make discussion more productive, and allow us to ship things faster (e.g., in Kubo, we need compact format sooner than the meta-manifest one: users should be able to self-service, will take time).
@thibmeu thoughts on limiting traversable DAG-CBOR/JSON to DAG-like meta-lists, and having a compact syntax for "leaves"? Or supporting lists in both, and seeing which approach gets more traction?
Next steps
@hsanjuan i'd like to explore your proposal – would you have time to propose a syntax for compact list?
(gist / hackmd is fine – I can turn it into a separate IPIP if you have limited bandwidth – this way we could flesh it out in separate IPIP PR without blocking this one, and even reference that IPIP from here, if needed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Entries
structure could be transformed to a CSV. That's also the reason why multihash is serialised to a string via multibase, so it's easier to export.
type,content,description,categories
1,/path/to/content,questionable content,1001 1002 1005
...
I agree this would be easier to update. Metadata around it could be passed as a frontmatter possibly
action: 1
categories: 1000 1001
description: an example content list
---
type,content,description,categories
1,/path/to/content,questionable content,1001 1002 1005
The front matter still allows for compostability with CBOR, but makes the list self contained in a text format if need be.
I would not worry about the existing format for bad bits list. It's a single list, and I would avoid requiring the spec and implementation to be constrained by one historical list instance (similar to Cloudflare safemode in a way).
I have no need for CBOR specifically, it's just a format supported by IPFS and to which an IPLD schema can serialise to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd like to explore your proposal – would you have time to propose a syntax for compact list?
yes, next week at some point? I have ideas but I also need to explain them (because they link to other aspects that are not related to the list itself), which takes more time. Format would be in the direction @thibmeu proposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added replies to comments. New feedback still appreciated.
I'll update the document on next week, with the change of StatusCode to Categories.
```ipldsch | ||
type List struct { | ||
Action Int # Suggested action to be taken | ||
StatusCode optional Int # HTTP status code to be returned by IPFS gateways by default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update StatusCode
to Categories Int[]
.
Categories
values would be subdivided in ranges. 1xxx for allow-like, 2xxx for block-like.
Specifying the categories is going to be similar to Actions. Some codes being assigned, some being for internal use, and some for future use.
If you have a list of categories you think should be included, you can list them in the comments.
} | ||
|
||
type Entry struct { | ||
Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with the multihash.
I feel having the sha256_multihash still make sense, as the hash query is specific to how IPFS handles block query at the moment. There is no reason to tie the list specification to this, or add complexity downstream about the hashing method for the hash.
} | ||
|
||
type Entry struct { | ||
Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add the type, as well as its specification in the comment?
> | ||
> The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | ||
> | ||
> Before the hashing, all CIDv0 in `CID_V1` field is converted to CIDv1 for consistency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multihash makes sense to be incencitive to CID
for the hashing, it's expected to be fixed. If one wants to introduce a new hash method, it can add a new type.
I feel like adding a separation of the number space at this time is an overkill. What other types do you want the implementation to support?
Updates structure of the content list: * Use `Categories` instead of `StatusCode`. This allows for a granular definition of content, without enforcing HTTP actions. * Move `Entries` to a linked list by default, named `[EntryChunk]`. This allows to keep the inlining that was mentioned before, but does not add overhead for chaining multiple entries if needed. Content type has been updated to mention multihash instead of CIDv1. To keep encoding simple, the multihash is encoded using multibase. Minor edits: * Remove all references to status code as a possible response. * Reference multicodec dbl-sha2-256. Reviewers should note that the `Categories` definition is left empty, and is to be updated in a future commit.
### User | ||
|
||
User is an actor that can make request to the IPFS Node. It's considered authenticated already. | ||
Access to the node can be made via an HTTP gateway, Bitswap, or another transport method exposed by the IPFS Node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: covering more ground
Access to the node can be made via an HTTP gateway, Bitswap, or another transport method exposed by the IPFS Node. | |
Access to the node can be made via an HTTP gateway, command-line, RPC API, Bitswap, or another transport method exposed by the IPFS Node. |
|
||
type EntryChunk struct { | ||
Entries [Entry] # List of Entries objects | ||
Next optional Link # CID for the next EntryChunk object. Entries referenced this way are not inlined, and possibly require an additional request for discovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a single Link
makes it:
- impossible to compose a list from two preexisting ones (e.g., having "anti-phishing" meta-list composed of multiple third-part lists provided by gateways that received reports and/or run active detection)
- expensive to maintain a big list (instead of having a DAG, and only recomputing changed branch, one needs to change all lists along the single chain).
Would it be ok to change this field into array of links?
Next optional Link # CID for the next EntryChunk object. Entries referenced this way are not inlined, and possibly require an additional request for discovery. | |
Next optional [Link] # List of CIDs for the next EntryChunk objects. Entries referenced this way are not inlined, and possibly require an additional request for discovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about having multiple Entries [EntryChunk]
instead of multiple Next [Link]
? I feel it makes it:
- Easier to read when lists are composed: it's simply a new EntryChunk.
- Does not modify a specific EntryChunk upon the addition of a new list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the topic of having a max size (I'm unable to answer inline to your comment), how do you envision its usage?
I feel this is something that can be done at consumption time, and I'm not sure how it helps to have a max size for a series of linked lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Yes,
[EntryChunk]
also works.- Remaining problem around
Link
is that there is no place to provide description for it– If i am aggregating third-party lists into a single meta-lists, would be nice to note why each link was included. Maybe addDescription optional String
toEntryChunk
?
- Remaining problem around
- re: max size, i only meant max size per document, encouraging people to "shard" at the same cut-off and making it easier for implementers to fail fast when consuming a single document. (total across lists is a separate problem).
{ | ||
Type: 3, // Type.SHA256_CID_V1 | ||
Content: "9056e0f9948c942c16af3564af56d4bb96b6203ad9ccd3425ec628bcd843cc39", | ||
Description: "sensitive cid that needs to be blocked" | ||
} | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would defining a max list size as part of this spec help?
Together with Link
field, it allows for composing big lists from smaller ones, and updates would only change part of a bigger DAG.
@thibmeu @mathew-cf and others: I wanted to make sure you saw the expansion of this topic that is occurring in #383 . Your feedback on whether that is meets Cloudflare's needs is welcome. |
Spec for the IPIP-298.
It adds a standard reference for content list definition, should these
be allow lists or deny lists. It supports CIDv1, sha256 hashed CIDv1,
and absolute and partial path matching.
It is aimed at offering a common schema for node operators to interact
with content lists.
It should be noted this IPIP was first written by @foreseaz, and this PR
is an iteration over their work in #299.
cc @Jorropo @lidel