-
Notifications
You must be signed in to change notification settings - Fork 231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPIP-0383: Compact Denylist Format #383
Conversation
This comment was marked as resolved.
This comment was marked as resolved.
One thing I didnt like about the existing double hash is the concatenation leads to more work (you have to hash each subpath), I think a tuple of hash(CID), [path length, hash(path)] would spare the work so to see if In the end in the denylist automation the path is always I wrote something to reprocess historical ones if we need to |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
|
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
7bb7428
to
77a804b
Compare
I have done one more pass to the spec, explaining a bit better complex things like double-hashing and making a distinction between NameSystem and PathResolver. I think I'm pretty happy with the current form. |
Test fixtures are missing... A list of tests will come next. An implementation of a content blocker with support to this format exists now in https://github.com/hsanjuan/nopfs, and the Badbits list is also available in compact form: https://badbits.dwebops.pub/badbits.deny. |
COMPACT_DENYLIST_FORMAT.md
Outdated
The latter form allows blocking by double-hash using any hashing function of | ||
choice. Implementations will have to hash requests using all the hashing functions | ||
used in the denylist, so we recommend sticking to one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementations will have to hash requests using ALL the hashing functions used in the denylist
Do we need this yet? Can we specify use of a multihash for self-describing future-proof goodness but require that only sha256 is valid for now to simplify things? For context: I'm looking at how i would add support for this spec to e-ipfs and w3s.link.
I acknowledge that we already have to accept providers can only do best effort enforcement of these lists given we already have to deal with path based rules which can't enforced at the BlockService level e.g in e-ipfs which is bitswap only... so i could just ignore entries that are hashed with a fn i don't support yet, but i'd prefer to be able to say "we support all block items that can be applied by the blockservice".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If your denylist only includes one type of hashing function, then you only hash using that one to check the double-hashes. So it is up to who makes the denylist... I would totally discourage using multiple hashing functions for items in the same denylist.
But other than that, since they are multihashes there's no reason to support anything. In a way I'm deferring the discussion of "what is the best hashing function for this" until later, because sha256 is probably not it. Mostly you just want something very very fast, and doesn't necessarily need to be a cryptographic one as it just meant to obfuscate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, the tradeoffs are expressiveness of list-sharing for compute-obligation for conforming processors of the lists. here too deciding layering before committing to a degree of expressiveness (or the core/extension, required/optional corrolary) might be preferable to deciding this in advance of commitments to implement and conform at scale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine prominently recommending a hashing function. Which one should it be though?
What is:
- A cryptographic hash function (we need this in the end as pre-image and collision protection are important).
- Well supported in all languages/platforms
- Very fast
Is it blake3? sha256? Are we ok with md5 (sure there are collisions but need to be carefully crafted). ? I'll ask around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you consider defining it as only sha2-256 for the first version of the spec? That restriction could be loosened later if folks need it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For pragmatic reasons, our options are probably limited to either sha2-256 or blake3 (ipfs impls. will have both).
Between sha2 and blake3, the latter should make a faster default (faster than md5 too). Maybe benchmark?
ps. The purpose behind the explicit list of hashes in the YAML header was to protect the node operator from silently exploding the list of hashes to check when the intention is to only use one (or two) functions. But maybe it is not a real DoS risk, these lists will come from somehow trusted sources.
If we decide that we only use a single hash function per list, it still feels like something that should be explicitly defined in YAML header, and not hard-coded in the spec (future-proofing without having to change the specs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, i strongly agree that the header yaml should define the hash fn / use mutlhashes for future upgradeability, and i am asking that the spec, in it's initial wording, require only sha-256 be used.
we can even add a note in the spec that folks should raise a PR here if that is causing them real perf problems.
adding support for more hash functions would have to take into account that there is no baseline requirement of what hash functions an ipfs impl must support. web3-storage does not verify blake3 today. surprising but true https://github.com/web3-storage/car-block-validator/blob/cb121b780c86724c0266f797b5a4badffad77140/src/index.js#L1-L17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to list hash functions in the header. As you parse the list, you see the multihash function used in the multihashes that you encounter and that is the list of hashing functions you have double-hash with. Putting the list at the top is redundant.
I think the issue here is a matter of compatibility, and sha256 being the best thing that is nicely and universally supported everywhere. I'm ok with the spec requiring sha256 for double-hashing, otherwise leave things open as they are. If people use other hashes (which remains possible), they would be breaking the spec, but that's up to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually disagree that listing which hash functions are used in a list is necessary and always redundant, because I am focused on a use-case where more lists from more unfamiliar authorities get added and subtracted more often, making the list itself worth annotating. If managing and federating many denylists is being optimized for, all kinds of (optional!) frontmatter fields could be quite helpful in deciding which ones to federate/follow before parsing them, particularly for software that builds on CIDs selectively (which might need to know in advance if a denylist includes entries it cannot, or does not need to, parse).
In any case, I would argue that an optional frontmatter field for declaring which hash functions were/are used to populate a list is good, actually. I don't have strong opinions, but some basic questions, in case enough people agree that a section is worth adding to the IPIP itself:
- should the values be
name
entries from themulticodec
table, or the actual tags? mild preference for name, but admittedly this adds fragility and another possible conflict/drift risk. - what should a consumer do/think/feel if an entry uses a tag not announced up top? perhaps the spec need not say anything here and leave that up to use-case-specific considerations, but perhaps not.
- should list hosters ADD entries to the frontmatter declaration over time if they add functions over time, or create a new list at a new URL when these "env var"s change? Considering many subscribers have to unload and reload the list for any changes except EOF appends, this could complicate the spec a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- name for sanity
- With or without hash-function annotations at the top, you're in the same situation if you find a hash that you don't support while parsing. Annotation at the top would only help to reject a list altogether before parsing it. But why would you do that and not try to work with the entries that you can support in any case? Anyways, you can use any optional hints you want. We can standardize an optional hint name for this as long as it's optional.
- I think you're finding the subtle issues. "whoops now my list includes also blake3 double-hashes" is bad to communicate in any way. "List A is done and you can follow up again on list B" is also not considered. It would be easier if we specify what the recommended behavior is when finding an unknown hashing function.
Also, we are talking a lot about hashing functions pertaining to the double-hash entries that are enconded as multihashes. You could also have normal /ipfs/<cid>
entries, where the CID itself is not something that your underlying IPFS implementation understands: maybe you don't know the ipld codec, maybe the hash-function is unknown. In this cases, it would not be the "compact denylist parser component/blocker" the one that needs to be aware, but the underlying "ipfs" implementation. etc. but some implications might be similar.
So my practical approach is, 1) everything goes 2) try to parse what you can 3) strong recommendation or obligation to only use sha256 for double-hashing, everything else is non-spec-compliant (for now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased and applied editorial fixes to publish both spec and IPIP on https://specs.ipfs.tech/ website.
I'll be revisiting this as we iron remaining kinks in reference implementation ipfs/kubo#10161 (+ related PRs), but the content is ready for final reviews from the wider audience of node operators and implementers.
I'll mention this during next IPFS Implementers Call.
|
||
## Detailed design | ||
|
||
See :cite[compact-denylist-format]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ℹ️ This reference will start working once we have the spec published at https://specs.ipfs.tech/compact-denylist-format
improves the flow of the document, hints are described only once now (3x before) in a single place, table of contents is easy to glance at and follows the flow of the file syntax
0ebce5a
to
3eb75a7
Compare
Fixes #8492 This introduces "nopfs" as a preloaded plugin into Kubo with support for denylists from ipfs/specs#383 It automatically makes Kubo watch *.deny files found in: - /etc/ipfs/denylists - $XDG_CONFIG_HOME/ipfs/denylists - $IPFS_PATH/denylists * test: Gateway.NoFetch and GatewayOverLibp2p adds missing tests for "no fetch" gateways one can expose, in both cases the offline mode is done by passing custom blockservice/exchange into path resolver, which means global path resolver that has nopfs intercept is not used, and the content blocking does not happen on these gateways. * fix: use offline path resolvers where appropriate this fixes the problem described in #10161 (comment) by adding explicit offline path resolvers that are backed by offline exchange, and using them in NoFetch gateways instead of the default online ones --------- Co-authored-by: Henrique Dias <hacdias@gmail.com> Co-authored-by: Marcin Rataj <lidel@lidel.org>
Would it be possible to include metadata tags about categories of why a line has been added? This would be useful for displaying in Brave as an enhancement to the messaging that normally gets displayed with safe browsing which seems like the best way to fit this in. |
You can add whatever hints per line. It's just not specified whether some labels have special meanings that require certain actions. Implementations can give meaning to labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been ready for ratification since December, but took bit longer due to holidays and other work. No concerns were raised, merging.
Thank you @hsanjuan for driving this, on both spec and implementation front, and for being patient ❤️
Thank you to everyone who provided feedback and tested the reference implementations.
FYSA we shipped implementation in Kubo 0.24 and is also included in Rainbow.
I've just posted a question on r/IPFS and I think it would be better suited here. I just added a file to IPFS that was a .txt file of the output of The error message strikes me as strange and I have some questions about this feature
On the face of it this feature seems like a hideous idea that undermines the whole point of IPFS. Sure nobody wants to host hateful material, CSAM, [insert buzzword]-phobia, blah, blah, blah, but we all know that this feature will just get abused by governments and the big-tech companies that will end-up being the backbone of IPFS. Like consider a journalist wants to publish WikiLeaks 2.0 on IPFS, how quickly will the admins cave to the FBI and block that content? |
This is not really the best place. Ask on the discuss.ipfs.tech forums if you still wondering about these questions. |
This IPIP introduces a line-based denylist format for content blocking on IPFS
focused on simplicity, scalability and ease-of-use.
Related: