Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP-0383: Compact Denylist Format #383

Merged
merged 35 commits into from
Feb 1, 2024
Merged

IPIP-0383: Compact Denylist Format #383

merged 35 commits into from
Feb 1, 2024

Conversation

hsanjuan
Copy link
Contributor

@hsanjuan hsanjuan commented Mar 9, 2023

This IPIP introduces a line-based denylist format for content blocking on IPFS
focused on simplicity, scalability and ease-of-use.

Related:

@hsanjuan

This comment was marked as resolved.

@hsanjuan hsanjuan self-assigned this Mar 9, 2023
@hsanjuan hsanjuan marked this pull request as ready for review March 10, 2023 16:15
@hsanjuan hsanjuan requested a review from a team as a code owner March 10, 2023 16:15
@thattommyhall
Copy link
Member

thattommyhall commented Mar 14, 2023

One thing I didnt like about the existing double hash is the concatenation leads to more work (you have to hash each subpath), I think a tuple of hash(CID), [path length, hash(path)] would spare the work

so to see if /ipfs/bafyWHATEVER/some/long/path/to/a/thing is blocked you would only have to hash bafyWHATEVER and see if it has any paths associated with it, maybe you see a length 3 one so you hash /some/long/path too and see if its blocked.

In the end in the denylist automation the path is always / and the only URLs were legacy ones

I wrote something to reprocess historical ones if we need to

@BigLep

This comment was marked as resolved.

@hsanjuan

This comment was marked as resolved.

@hsanjuan
Copy link
Contributor Author

One thing I didnt like about the existing double hash is the concatenation leads to more work (you have to hash each subpath), I think a tuple of hash(CID), [path length, hash(path)] would spare the work

so to see if /ipfs/bafyWHATEVER/some/long/path/to/a/thing is blocked you would only have to hash bafyWHATEVER and see if it has any paths associated with it, maybe you see a length 3 one so you hash /some/long/path too and see if its blocked.

In the end in the denylist automation the path is always / and the only URLs were legacy ones

I wrote something to reprocess historical ones if we need to

  • Let's agree of talking always of double-hashing the MULTIHASH.
  • Part of this is implementation specific (how to hash the minimum amount of things). However, it would allow more efficient checking to have a hint like path length (not even component-count, but actual length of the path string), so that no double-hashing needs to even occur if lengths don't match. I'm thinking this could be metadata added to the rule via a hint, and implementations could decide whether to use it for their benefit or not.

@hsanjuan

This comment was marked as resolved.

@hsanjuan

This comment was marked as resolved.

@hsanjuan

This comment was marked as resolved.

@lidel lidel changed the title IPIP: compact denylist format IPIP-383: compact denylist format Mar 27, 2023
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
@hsanjuan hsanjuan force-pushed the denylist-compact branch 3 times, most recently from 7bb7428 to 77a804b Compare March 28, 2023 21:39
@hsanjuan
Copy link
Contributor Author

I have done one more pass to the spec, explaining a bit better complex things like double-hashing and making a distinction between NameSystem and PathResolver. I think I'm pretty happy with the current form.

@hsanjuan
Copy link
Contributor Author

Test fixtures are missing... A list of tests will come next.

An implementation of a content blocker with support to this format exists now in https://github.com/hsanjuan/nopfs, and the Badbits list is also available in compact form: https://badbits.dwebops.pub/badbits.deny.

Comment on lines 330 to 364
The latter form allows blocking by double-hash using any hashing function of
choice. Implementations will have to hash requests using all the hashing functions
used in the denylist, so we recommend sticking to one.
Copy link
Member

@olizilla olizilla Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementations will have to hash requests using ALL the hashing functions used in the denylist

Do we need this yet? Can we specify use of a multihash for self-describing future-proof goodness but require that only sha256 is valid for now to simplify things? For context: I'm looking at how i would add support for this spec to e-ipfs and w3s.link.

I acknowledge that we already have to accept providers can only do best effort enforcement of these lists given we already have to deal with path based rules which can't enforced at the BlockService level e.g in e-ipfs which is bitswap only... so i could just ignore entries that are hashed with a fn i don't support yet, but i'd prefer to be able to say "we support all block items that can be applied by the blockservice".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If your denylist only includes one type of hashing function, then you only hash using that one to check the double-hashes. So it is up to who makes the denylist... I would totally discourage using multiple hashing functions for items in the same denylist.

But other than that, since they are multihashes there's no reason to support anything. In a way I'm deferring the discussion of "what is the best hashing function for this" until later, because sha256 is probably not it. Mostly you just want something very very fast, and doesn't necessarily need to be a cryptographic one as it just meant to obfuscate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above, the tradeoffs are expressiveness of list-sharing for compute-obligation for conforming processors of the lists. here too deciding layering before committing to a degree of expressiveness (or the core/extension, required/optional corrolary) might be preferable to deciding this in advance of commitments to implement and conform at scale.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine prominently recommending a hashing function. Which one should it be though?

What is:

  • A cryptographic hash function (we need this in the end as pre-image and collision protection are important).
  • Well supported in all languages/platforms
  • Very fast

Is it blake3? sha256? Are we ok with md5 (sure there are collisions but need to be carefully crafted). ? I'll ask around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you consider defining it as only sha2-256 for the first version of the spec? That restriction could be loosened later if folks need it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For pragmatic reasons, our options are probably limited to either sha2-256 or blake3 (ipfs impls. will have both).
Between sha2 and blake3, the latter should make a faster default (faster than md5 too). Maybe benchmark?

ps. The purpose behind the explicit list of hashes in the YAML header was to protect the node operator from silently exploding the list of hashes to check when the intention is to only use one (or two) functions. But maybe it is not a real DoS risk, these lists will come from somehow trusted sources.
If we decide that we only use a single hash function per list, it still feels like something that should be explicitly defined in YAML header, and not hard-coded in the spec (future-proofing without having to change the specs).

Copy link
Member

@olizilla olizilla Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i strongly agree that the header yaml should define the hash fn / use mutlhashes for future upgradeability, and i am asking that the spec, in it's initial wording, require only sha-256 be used.

we can even add a note in the spec that folks should raise a PR here if that is causing them real perf problems.

adding support for more hash functions would have to take into account that there is no baseline requirement of what hash functions an ipfs impl must support. web3-storage does not verify blake3 today. surprising but true https://github.com/web3-storage/car-block-validator/blob/cb121b780c86724c0266f797b5a4badffad77140/src/index.js#L1-L17

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to list hash functions in the header. As you parse the list, you see the multihash function used in the multihashes that you encounter and that is the list of hashing functions you have double-hash with. Putting the list at the top is redundant.

I think the issue here is a matter of compatibility, and sha256 being the best thing that is nicely and universally supported everywhere. I'm ok with the spec requiring sha256 for double-hashing, otherwise leave things open as they are. If people use other hashes (which remains possible), they would be breaking the spec, but that's up to them.

Copy link
Collaborator

@bumblefudge bumblefudge Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually disagree that listing which hash functions are used in a list is necessary and always redundant, because I am focused on a use-case where more lists from more unfamiliar authorities get added and subtracted more often, making the list itself worth annotating. If managing and federating many denylists is being optimized for, all kinds of (optional!) frontmatter fields could be quite helpful in deciding which ones to federate/follow before parsing them, particularly for software that builds on CIDs selectively (which might need to know in advance if a denylist includes entries it cannot, or does not need to, parse).

In any case, I would argue that an optional frontmatter field for declaring which hash functions were/are used to populate a list is good, actually. I don't have strong opinions, but some basic questions, in case enough people agree that a section is worth adding to the IPIP itself:

  1. should the values be name entries from the multicodec table, or the actual tags? mild preference for name, but admittedly this adds fragility and another possible conflict/drift risk.
  2. what should a consumer do/think/feel if an entry uses a tag not announced up top? perhaps the spec need not say anything here and leave that up to use-case-specific considerations, but perhaps not.
  3. should list hosters ADD entries to the frontmatter declaration over time if they add functions over time, or create a new list at a new URL when these "env var"s change? Considering many subscribers have to unload and reload the list for any changes except EOF appends, this could complicate the spec a lot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. name for sanity
  2. With or without hash-function annotations at the top, you're in the same situation if you find a hash that you don't support while parsing. Annotation at the top would only help to reject a list altogether before parsing it. But why would you do that and not try to work with the entries that you can support in any case? Anyways, you can use any optional hints you want. We can standardize an optional hint name for this as long as it's optional.
  3. I think you're finding the subtle issues. "whoops now my list includes also blake3 double-hashes" is bad to communicate in any way. "List A is done and you can follow up again on list B" is also not considered. It would be easier if we specify what the recommended behavior is when finding an unknown hashing function.

Also, we are talking a lot about hashing functions pertaining to the double-hash entries that are enconded as multihashes. You could also have normal /ipfs/<cid> entries, where the CID itself is not something that your underlying IPFS implementation understands: maybe you don't know the ipld codec, maybe the hash-function is unknown. In this cases, it would not be the "compact denylist parser component/blocker" the one that needs to be aware, but the underlying "ipfs" implementation. etc. but some implications might be similar.

So my practical approach is, 1) everything goes 2) try to parse what you can 3) strong recommendation or obligation to only use sha256 for double-hashing, everything else is non-spec-compliant (for now).

COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
@lidel lidel changed the title IPIP-383: compact denylist format IPIP-0383: Compact Denylist Format Oct 25, 2023
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased and applied editorial fixes to publish both spec and IPIP on https://specs.ipfs.tech/ website.
I'll be revisiting this as we iron remaining kinks in reference implementation ipfs/kubo#10161 (+ related PRs), but the content is ready for final reviews from the wider audience of node operators and implementers.
I'll mention this during next IPFS Implementers Call.

src/ipips/ipip-0383.md Show resolved Hide resolved

## Detailed design

See :cite[compact-denylist-format].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ℹ️ This reference will start working once we have the spec published at https://specs.ipfs.tech/compact-denylist-format

src/compact-denylist-format.md Outdated Show resolved Hide resolved
src/compact-denylist-format.md Outdated Show resolved Hide resolved
COMPACT_DENYLIST_FORMAT.md Outdated Show resolved Hide resolved
src/compact-denylist-format.md Outdated Show resolved Hide resolved
lidel added 2 commits October 26, 2023 23:39
improves the flow of the document, hints are described only once now (3x
before) in a single place, table of contents is easy to glance at
and follows the flow of the file syntax
lidel added a commit to ipfs/kubo that referenced this pull request Oct 28, 2023
Fixes #8492

This introduces "nopfs" as a preloaded plugin into Kubo
with support for denylists from ipfs/specs#383

It automatically makes Kubo watch *.deny files found in:

- /etc/ipfs/denylists
- $XDG_CONFIG_HOME/ipfs/denylists
- $IPFS_PATH/denylists

* test: Gateway.NoFetch and GatewayOverLibp2p

adds missing tests for "no fetch" gateways one can expose,
in both cases the offline mode is done by passing custom
blockservice/exchange into path resolver, which means
global path resolver that has nopfs intercept is not used,
and the content blocking does not happen on these gateways.

* fix: use offline path resolvers where appropriate

this fixes the problem described in
#10161 (comment)
by adding explicit offline path resolvers that are backed
by offline exchange, and using them in NoFetch gateways
instead of the default online ones

---------

Co-authored-by: Henrique Dias <hacdias@gmail.com>
Co-authored-by: Marcin Rataj <lidel@lidel.org>
@kdenhartog
Copy link

Would it be possible to include metadata tags about categories of why a line has been added? This would be useful for displaying in Brave as an enhancement to the messaging that normally gets displayed with safe browsing which seems like the best way to fit this in.

@hsanjuan
Copy link
Contributor Author

Would it be possible to include metadata tags about categories of why a line has been added? This would be useful for displaying in Brave as an enhancement to the messaging that normally gets displayed with safe browsing which seems like the best way to fit this in.

You can add whatever hints per line. It's just not specified whether some labels have special meanings that require certain actions. Implementations can give meaning to labels.

Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been ready for ratification since December, but took bit longer due to holidays and other work. No concerns were raised, merging.

Thank you @hsanjuan for driving this, on both spec and implementation front, and for being patient ❤️
Thank you to everyone who provided feedback and tested the reference implementations.

FYSA we shipped implementation in Kubo 0.24 and is also included in Rainbow.

src/compact-denylist-format.md Outdated Show resolved Hide resolved
src/compact-denylist-format.md Outdated Show resolved Hide resolved
src/compact-denylist-format.md Outdated Show resolved Hide resolved
@lidel lidel merged commit 7b2bbeb into main Feb 1, 2024
4 checks passed
@lidel lidel deleted the denylist-compact branch February 1, 2024 15:14
@anthonybudd
Copy link

anthonybudd commented May 28, 2024

I've just posted a question on r/IPFS and I think it would be better suited here.


I just added a file to IPFS that was a .txt file of the output of printenv from inside a docker container, when I went to the file’s address I was greeted by this 401 screen, I have been out of the IPFS community for a while and I have not seen this feature before.

The error message strikes me as strange and I have some questions about this feature

  • Who controls the list of blocked content? Is this list centralized?
  • How/Why was file found and blocked so quickly? ipns link to a plain text file that is also blocked
  • Who decides what qualifies as “legal, abuse, malware or security reasons”? Given the interplanetary nature of IPFS, what nation's laws are being used? For example I could share an image of Jon Venables in the USA but this would be illegal in the UK. If the UK court ruled this was illegal but the server was in the US, what would happen to my content in this case?
  • What safeguards are in place to prevent this feature being abused? For example, say a whistleblower published documents on IPFS, would it be possible for a government to leverage this feature to prevent it from being shared, "national security reasons"?
  • How does this feature work technically? If IPFS is a peer-to-peer protocol how is this (presumably) centralized list of banned content shared between nodes?

On the face of it this feature seems like a hideous idea that undermines the whole point of IPFS. Sure nobody wants to host hateful material, CSAM, [insert buzzword]-phobia, blah, blah, blah, but we all know that this feature will just get abused by governments and the big-tech companies that will end-up being the backbone of IPFS. Like consider a journalist wants to publish WikiLeaks 2.0 on IPFS, how quickly will the admins cave to the FBI and block that content?

Screen Shot 2024-05-28 at 15 38 03

@hsanjuan
Copy link
Contributor Author

hsanjuan commented Sep 3, 2024

I've just posted a question on r/IPFS and I think it would be better suited here.

This is not really the best place. Ask on the discuss.ipfs.tech forums if you still wondering about these questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ratified
Development

Successfully merging this pull request may close these issues.

9 participants