Support Cleaner.validate (verify, etc.) #537

Closed
DylanYoung opened this issue May 28, 2020 · 3 comments

@DylanYoung

Enhancement:

We want to reject dirty input rather than clean it. Accomplishing this by subclassing Cleaner seems painful.

Could this either be built in, or supported through an "on_content_change" hook that is called whenever bleach's rules change the content from its original value? A per-change hook probably isn't the best approach, since it would slow down the other uses of bleach. An "on_first_change" hook might work, though I think it would be better to build this in.

Happy to look into a patch if you're amenable :)

@DylanYoung
Author

DylanYoung commented Jun 19, 2020

This was a duplicate and already rejected. I still think it's a very good idea and don't really understand the comments in the original issue.

Proposed solution:
Split the Sanitizing Filter into two:

  1. Normalizing Filter
  2. Sanitizing Filter

When verify mode is active, the Sanitizing Filter simply raises an exception the first time it would make a change. The validate method catches that exception and returns False, or alternatively (False, normalized_html).
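A minimal sketch of the idea, independent of bleach's actual internals. Everything here is hypothetical and for illustration only: the `ContentChanged` exception, the two filter functions, the `verify` flag, and the simplified token dicts (loosely modeled on html5lib tree-walker tokens) are assumptions, not existing bleach API.

```python
class ContentChanged(Exception):
    """Hypothetical: raised in verify mode when the sanitizer would alter a token."""


def normalizing_filter(tokens):
    # Stand-in normalization: lowercase tag names, leave text untouched.
    for tok in tokens:
        if "name" in tok:
            yield dict(tok, name=tok["name"].lower())
        else:
            yield tok


def sanitizing_filter(tokens, allowed_tags, verify=False):
    for tok in tokens:
        is_tag = tok["type"] in ("StartTag", "EndTag")
        if is_tag and tok["name"] not in allowed_tags:
            if verify:
                # Verify mode: report the first change instead of making it.
                raise ContentChanged(tok["name"])
            # Clean mode (simplified): drop the disallowed tag token.
            continue
        yield tok


def validate(tokens, allowed_tags):
    """Return (is_valid, normalized_tokens)."""
    normalized = list(normalizing_filter(tokens))
    try:
        for _ in sanitizing_filter(normalized, allowed_tags, verify=True):
            pass
    except ContentChanged:
        return False, normalized
    return True, normalized


ok_tokens = [
    {"type": "StartTag", "name": "B"},
    {"type": "Characters", "data": "hi"},
    {"type": "EndTag", "name": "B"},
]
bad_tokens = [
    {"type": "StartTag", "name": "script"},
    {"type": "Characters", "data": "x"},
    {"type": "EndTag", "name": "script"},
]

print(validate(ok_tokens, {"b"})[0])   # True
print(validate(bad_tokens, {"b"})[0])  # False
```

Because normalization runs before the sanitizing step, harmless differences such as tag-name case do not count as a change; only what the sanitizer itself would alter causes rejection.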

@DylanYoung
Author

Duplicate of #109

@DylanYoung
Author

Currently, we use something like the following as a partial implementation (inefficient: double parsing) of the above:

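from bleach import html5lib_shim

# std_serializer_kwargs (a dict of serializer options) and
# NormalizingBleachFilter come from our own code and are not shown here.


class BleachNormalizer(object):
    """Parse, normalize, and re-serialize HTML without sanitizing it."""

    def __init__(self):
        # Parser configured with no tag allowlist and no stripping.
        self.parser = html5lib_shim.BleachHTMLParser(
            tags=None,
            strip=False,
            consume_entities=True,
            namespaceHTMLElements=False,
        )
        self.walker = html5lib_shim.getTreeWalker('etree')
        serializer_kwargs = dict(
            std_serializer_kwargs,
            # The normalizer doesn't use bleach's alphabetizer; maybe we should?
            alphabetical_attributes=True,
        )
        self.serializer = html5lib_shim.BleachHTMLSerializer(
            **serializer_kwargs
        )
        self.filters = [NormalizingBleachFilter]

    def normalize(self, html):
        # Parse the fragment and walk it as a token stream.
        dom = self.walker(
            self.parser.parseFragment(html)
        )
        # Each filter wraps the token stream produced by the previous one.
        for f in self.filters:
            dom = f(dom)
        return self.serializer.render(dom)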

If I get more time to work on this, I'll be happy to propose a patch or a forked project (as suggested in #109).
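For illustration, the double-parsing check could be sketched roughly as below: normalize the input, clean the normalized result, and accept the input only if cleaning changed nothing. The helper `validate_by_double_pass` and the two toy stand-ins are hypothetical names for this sketch. With bleach, `normalize` would presumably be `BleachNormalizer().normalize` and `clean` a cleaner's `clean` method configured to serialize the same way; that pairing is an assumption, not something confirmed in this thread.

```python
def validate_by_double_pass(html, normalize, clean):
    """Return (is_valid, normalized_html).

    normalize and clean are callables taking and returning a string.
    The input is parsed twice (once per callable), hence "double parsing".
    """
    normalized = normalize(html)
    return clean(normalized) == normalized, normalized


# Toy stand-ins so the sketch runs without bleach installed.
def toy_normalize(html):
    # Collapse runs of whitespace into single spaces.
    return " ".join(html.split())


def toy_clean(html):
    # Escape script tags only; everything else passes through.
    return (
        html.replace("<script>", "&lt;script&gt;")
            .replace("</script>", "&lt;/script&gt;")
    )


print(validate_by_double_pass("<b>hi</b>   there", toy_normalize, toy_clean))
# (True, '<b>hi</b> there')
print(validate_by_double_pass("<script>x</script>", toy_normalize, toy_clean))
# (False, '<script>x</script>')
```

The comparison only works if both passes serialize identically (attribute order, quoting, entity handling); otherwise clean input would be rejected for purely cosmetic differences.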
