Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wishlist: Add the ability to tell if something needed to be "cleaned" or not. #109

Closed
jvanasco opened this issue Oct 3, 2013 · 10 comments
Closed
Labels

Comments

@jvanasco
Copy link
Contributor

jvanasco commented Oct 3, 2013

It doesn't seem possible to test whether or not text needed to be "cleaned"

As part of the tokenization process, bleach returns text with the attributes alphabetized, so a string comparison is not possible.

For example:

>>> input = """<a href="http://example.com" title="Example Link" class="foo">Example</a>"""
>>> output = bleach.clean(input)
>>> print output
u"""<a class="foo" href="http://example.com" title="Example Link">Example</a>"""

so , obviously...

>>> input == output
False

This makes it impossible to track metrics regarding unsafe content submission -- or alert users if their content is unsafe.

@Kos
Copy link

Kos commented Jan 16, 2014

+1, would prefer to reject non-comformant input rather than try to fix it. (Also would be useful for unit-testing specific whitelists.)

@marfire
Copy link

marfire commented Jan 30, 2014

Agreed. What I'd like to see is clean taking an argument—say, reject=True—that tells it to raise an exception if the input is "unclean".

@jvanasco: At the moment I'm doing this by first cleaning the input with a universal whitelist (e.g. this list); then cleaning it with the actual whitelist; and finally, comparing the two. Obviously a less-than-ideal hack...

@jsocol
Copy link
Contributor

jsocol commented Jan 31, 2014

I understand where this is coming from but I don't think it's going to be very high priority and I would question putting into the project at all, because it is a very different goal and serves very different use cases.

The reason attribute order (and quote style, and several other things that can be different) doesn't matter to Bleach is that the intent is to filter user input, and publish it safely. And that matters in terms of design and implementation decisions.

This would philosophically move from "content filter" to "content validator", and there are almost certainly better ways to design and build a content validator.

On a practical level, the assumption that silently changing content is the the only correct behavior has filtered into many places. There are Python idioms, like list comprehensions with if x not in y, that are fine if you want to filter and escape content, but would have to be completely reworked if the goal was to validate. There are also regexes that stomp on content that would need to be unwound.

Even the existence linkify is part of the filter assumption. Had bleach been built to validate, I doubt the leap to "filter another way" would have happened.

All of that said, I do see a path to this: the "cleaning pipeline" idea from #58, which would make it easy to put something on top of the pipe to raise an exception. But that's a big, hairy yak. (The more I think about it the more I think a small sprint at PyCon would be a great opportunity to tackle some of this.)

@jvanasco
Copy link
Contributor Author

I understand the concern with validation vs filtering , but there is a thin line here. I should clarify...

Because of the way bleach operates ( in that it does filter and adjust virtually all input ) it's not possible to track if Bleach is doing anything substantive to my input ( ie, anything other than minor reformatting ).

In some cases this is admittedly similar to validating, but in others it's not. It is similar to validating input if we're looking at the behavior of stripping tags and cleaning. But It's not similar to validating if we're looking at the behavior of linkify with everything whitelisted.

As a user, this makes Bleach a bit of a black-box to me -- I'm always using it, but never able to tell if I actually needed to or not. When profiling bottlenecks or adjusting the product to use computing resources better, I can't do anything about the stuff that interfaces with Bleach.

@jsocol
Copy link
Contributor

jsocol commented Jan 31, 2014

Because of the way bleach operates ( in that it does filter and adjust virtually all input ) it's not possible to track if Bleach is doing anything substantive to my input ( ie, anything other than minor reformatting ).

Rejecting input wouldn't really answer that, though, unless you wanted to answer the question by making your users deal with validation errors and rewriting their content. Tracking whether anything was changed is a different question and almost harder, because even if it's just a boolean (and thus not very interesting: which rules were violated) there's state we have to track through the whole parsing process and find a way to return.

As a user, this makes Bleach a bit of a black-box to me -- I'm always using it, but never able to tell if I actually needed to or not. When profiling bottlenecks or adjusting the product to use computing resources better, I can't do anything about the stuff that interfaces with Bleach.

Let's take a step back, because this is a different question. Thanks for clarifying it.

First, I want to point out that the best-case scenario (there are tags or links but they're all fine) has the worst-case performance, whether this is "here's the filtered output and yes I had to change it" or "raise InvalidInput". In the case where nothing needs to change, answering that question requires parsing the whole string, anyway. So even asking the question is nearly the same amount of work as just cleaning the whole thing in the first place.

If you're really looking at this from a performance perspective (and I get that, Bleach is not fast, and allocating compute is important) faster heuristics—that err on the side of "filter"—are almost certainly the way to go.

For example, you could check if any of the interesting <>"'& are in the plain text, and only clean if so (this is not, strictly speaking, a recommendation so much as a directional starting point). If you want to know if you should linkify, you could import the URL regex and test the input, and look for <a\b as another fast heuristic. Some level of heuristics like those may make sense to put into Bleach, itself, in the interest of performance.

But if they're outside of Bleach, you can record the parse/don't-parse decision on your own, with whatever logging tools you have in the stack. If you're making decisions about your product, like whether users are even using HTML or if you could drop Bleach and just escape things, you don't want Bleach making that call and hiding it from you.

@jsocol
Copy link
Contributor

jsocol commented Jan 31, 2014

Sorry, I know I'm getting wordy here. I swear I even tried to edit that down.

@jvanasco
Copy link
Contributor Author

no need to apologize.

my concern was always for performance logging and inspection. the +1s were for rejecting input.

and yeah -- on the current build, it would definitely be a performance hit to track things. that's why i wishlisted and brought this up, instead of providing a patch. i wanted to get this scenario on your radar if you ever change internals. i've seen some custom parsers that make stuff like this easy (however none are as fast as bleach is right now ).

the performance product decisions we'd make wouldn't be as drastic as yours. they'd have to do more with content caching/pre-compiling or pulling things out of various 'immediate' views and pushing them into subsequent views ( ajax on demand, click for more, etc )

i think i might try the clean-with-whitelists trick for comparison on input.

@marfire
Copy link

marfire commented Jan 31, 2014

It's clear that we're talking about a couple of different things in this thread. I just want to expand on the validation question.

Although it's certainly true that validation is different from filtering in important ways, I don't find the use cases to be all that different. For example, if I'm accepting input from users I might want in some cases to filter the input automatically and in other cases to alert the user about the invalid input. In fact, html5lib itself works this way: you can tell it to validate by setting strict=True or you can have it process the input normally.

I don't know enough about the bleach implementation to offer an opinion on the practical issues. But I would note that having the option to validate the input needn't prevent you from doing what you're doing now and changing the output when filtering.

Rejecting input wouldn't really answer that, though, unless you wanted to answer the question by making your users deal with validation errors and rewriting their content.

Just FYI, that is exactly what I want to do. Perhaps you're right that I'm really looking for a different tool, but to me the validating and filtering seem like two sides of the same coin.

@jsocol
Copy link
Contributor

jsocol commented Dec 12, 2014

With #121 / #137, attribute order is deterministic, but not preserved from the original. That should support the clean-twice workaround.

@willkg
Copy link
Member

willkg commented Sep 19, 2017

If someone wants to tackle this, I suggest doing it in another library and using bleach if that helps and depending on how that effort turns out, maybe we can merge the two at some future date.

Given the above comments, I think we're at a point where it's clear this is impossible to do effectively and thus it's not something I want to add to this library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants