Wishlist: Add the ability to tell if something needed to be "cleaned" or not. #109
Comments
+1, would prefer to reject non-conformant input rather than try to fix it. (Also would be useful for unit-testing specific whitelists.)
Agreed. What I'd like to see is …

@jvanasco: At the moment I'm doing this by first …
I understand where this is coming from, but I don't think it's going to be very high priority, and I would question putting it into the project at all, because it is a very different goal and serves very different use cases.

The reason attribute order (and quote style, and several other things that can be different) doesn't matter to Bleach is that the intent is to filter user input and publish it safely. And that matters in terms of design and implementation decisions. This would philosophically move from "content filter" to "content validator", and there are almost certainly better ways to design and build a content validator.

On a practical level, the assumption that silently changing content is the only correct behavior has filtered into many places. There are Python idioms, like list comprehensions with …. Even the existence of ….

All of that said, I do see a path to this: the "cleaning pipeline" idea from #58, which would make it easy to put something on top of the pipe to raise an exception. But that's a big, hairy yak. (The more I think about it, the more I think a small sprint at PyCon would be a great opportunity to tackle some of this.)
I understand the concern with validation vs. filtering, but there is a thin line here. I should clarify...

Because of the way Bleach operates (in that it does filter and adjust virtually all input) it's not possible to track if Bleach is doing anything substantive to my input (i.e., anything other than minor reformatting). In some cases this is admittedly similar to validating, but in others it's not. It is similar to validating input if we're looking at the behavior of stripping tags and cleaning. But it's not similar to validating if we're looking at the behavior of ….

As a user, this makes Bleach a bit of a black box to me -- I'm always using it, but never able to tell if I actually needed to or not. When profiling bottlenecks or adjusting the product to use computing resources better, I can't do anything about the stuff that interfaces with Bleach.
Rejecting input wouldn't really answer that, though, unless you wanted to answer the question by making your users deal with validation errors and rewrite their content. Tracking whether anything was changed is a different question, and arguably harder: even if the answer is just a boolean (and thus not very interesting -- it doesn't tell you which rules were violated), there's state we have to track through the whole parsing process and find a way to return.
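Threading a single `changed` flag through a parse can be sketched with the stdlib `html.parser` module. This is a toy illustration of the idea, not Bleach's internals; the `allowed` tag list is made up:

```python
from html.parser import HTMLParser


class TrackingCleaner(HTMLParser):
    """Toy cleaner: drops tags not in `allowed` and records whether it did.

    Caveats (it's a sketch): tag *contents* are kept, attributes are not
    filtered, and character references are unescaped by HTMLParser.
    """

    def __init__(self, allowed=("b", "i", "em", "strong")):
        super().__init__()
        self.allowed = set(allowed)
        self.out = []
        self.changed = False  # flips as soon as anything is dropped

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            attr_text = "".join(f' {k}="{v}"' for k, v in attrs)
            self.out.append(f"<{tag}{attr_text}>")
        else:
            self.changed = True  # a disallowed tag was removed

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append(f"</{tag}>")
        else:
            self.changed = True

    def handle_data(self, data):
        self.out.append(data)


def clean_and_report(text):
    """Return (cleaned_text, did_anything_change)."""
    parser = TrackingCleaner()
    parser.feed(text)
    parser.close()
    return "".join(parser.out), parser.changed
```

The point is only that the flag has to be carried through the whole tokenization, which is exactly the state-tracking cost described above.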
Let's take a step back, because this is a different question. Thanks for clarifying it.

First, I want to point out that the best-case scenario (there are tags or links but they're all fine) has the worst-case performance, whether the result is "here's the filtered output and yes, I had to change it" or "raise InvalidInput". In the case where nothing needs to change, answering that question requires parsing the whole string anyway. So even asking the question is nearly the same amount of work as just cleaning the whole thing in the first place.

If you're really looking at this from a performance perspective (and I get that; Bleach is not fast, and allocating compute is important), faster heuristics -- ones that err on the side of "filter" -- are almost certainly the way to go. For example, you could check whether any of the interesting characters appear in the string at all. And if those heuristics live outside of Bleach, you can record the parse/don't-parse decision on your own, with whatever logging tools you have in the stack. If you're making decisions about your product, like whether users are even using HTML or whether you could drop Bleach and just escape things, you don't want Bleach making that call and hiding it from you.
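A heuristic like that can be as cheap as a substring check. A minimal sketch, assuming the "interesting" characters are the ones that can open markup or entities (the function name and character set are illustrative, not anything from Bleach):

```python
def might_need_cleaning(text):
    """Cheap pre-check that errs on the side of running the full cleaner.

    If none of the characters that can introduce markup or character
    references are present, there is nothing for a tag-filtering cleaner
    to change, so the expensive parse can be skipped entirely.
    """
    return "<" in text or ">" in text or "&" in text


# Only pay for the full parse when the fast check fires, e.g.:
#     cleaned = bleach.clean(text) if might_need_cleaning(text) else text
```

False positives (a stray `&` in plain prose) just fall through to the normal cleaning path, which is the "err on the side of filter" behavior described above.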
Sorry, I know I'm getting wordy here. I swear I even tried to edit that down.
No need to apologize. My concern was always for performance logging and inspection; the +1s were for rejecting input.

And yeah -- on the current build, it would definitely be a performance hit to track things. That's why I wishlisted and brought this up instead of providing a patch: I wanted to get this scenario on your radar if you ever change internals. I've seen some custom parsers that make stuff like this easy (however, none are as fast as Bleach is right now).

The performance product decisions we'd make wouldn't be as drastic as yours. They'd have more to do with content caching/pre-compiling, or with pulling things out of various 'immediate' views and pushing them into subsequent views (AJAX on demand, click for more, etc.).

I think I might try the clean-with-whitelists trick for comparison on input.
It's clear that we're talking about a couple of different things in this thread. I just want to expand on the validation question. Although it's certainly true that validation is different from filtering in important ways, I don't find the use cases to be all that different. For example, if I'm accepting input from users, I might want in some cases to filter the input automatically and in other cases to alert the user about the invalid input. In fact, I don't know enough about the …
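One way to see the "two sides of the same coin" point: both behaviors can sit on a single core that reports violations. The sketch below is hypothetical and the regex stand-in is emphatically NOT a safe HTML sanitizer; it exists only so the two front-ends are runnable:

```python
import re

# Toy stand-in for a real cleaner. Regexes are not a safe way to
# sanitize HTML; a real implementation would use a proper parser.
_FORBIDDEN = re.compile(r"</?(script|iframe)[^>]*>", re.IGNORECASE)


def _clean_core(text):
    """Shared core: return (cleaned_text, list_of_violations)."""
    violations = _FORBIDDEN.findall(text)
    return _FORBIDDEN.sub("", text), violations


def filter_input(text):
    """Filtering front-end: silently return the cleaned text."""
    cleaned, _ = _clean_core(text)
    return cleaned


def validate_input(text):
    """Validating front-end: reject instead of fixing."""
    _, violations = _clean_core(text)
    if violations:
        raise ValueError(f"disallowed markup: {violations}")
    return text
```

The same application could call `filter_input` on one form and `validate_input` on another, which is the mixed use case described in the comment above.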
Just FYI, that is exactly what I want to do. Perhaps you're right that I'm really looking for a different tool, but to me the validating and filtering seem like two sides of the same coin. |
If someone wants to tackle this, I suggest doing it in another library, using Bleach if that helps; depending on how that effort turns out, maybe we can merge the two at some future date. Given the above comments, I think we're at a point where it's clear this is impossible to do effectively, and thus it's not something I want to add to this library.
It doesn't seem possible to test whether or not text needed to be "cleaned". As part of the tokenization process, bleach returns text with the attributes alphabetized, so a string comparison is not possible. For example: … so, obviously …

This makes it impossible to track metrics regarding unsafe content submission -- or alert users if their content is unsafe.
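The example that belonged here didn't survive, but the alphabetization problem -- and one workaround -- can be sketched with the stdlib: canonicalize both strings (re-serializing with attributes sorted) before comparing. This is an illustration of the issue, not Bleach's API or serializer:

```python
from html.parser import HTMLParser


class Canonicalizer(HTMLParser):
    """Re-serialize markup with attributes sorted alphabetically, so two
    strings that differ only in attribute order compare equal."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Sorting the (name, value) pairs normalizes attribute order.
        attr_text = "".join(f' {k}="{v}"' for k, v in sorted(attrs))
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)


def canonical(text):
    parser = Canonicalizer()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


# Attribute order differs, so a raw string comparison fails...
a = '<a href="http://x" title="t">x</a>'
b = '<a title="t" href="http://x">x</a>'
assert a != b
# ...but the canonical forms match:
assert canonical(a) == canonical(b)
```

If the cleaner alphabetizes attributes on output, comparing `canonical(clean(text))` against `canonical(text)` sidesteps ordering differences, though quote style and entity normalization would still need the same treatment.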