Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump sanitize gem #127

Merged
merged 3 commits into from
Mar 23, 2018
Merged

Bump sanitize gem #127

merged 3 commits into from
Mar 23, 2018

Conversation

h-lame
Copy link
Contributor

@h-lame h-lame commented Mar 21, 2018

For: https://trello.com/c/X6CWuGOX/134-rails-security-vulnerability-sanitize-463-libxml-292

This mitigates against rgrove/sanitize#176. While we're at it we tidy up some config that we no longer need to set because it's handled by default by the gem now. Check individual commits for details.

h-lame added 3 commits March 21, 2018 17:04
This fixes CVE-2018-2740 (See: rgrove/sanitize#176)

We also have to fix some tests around table tags, because as of sanitize
3.x it uses a parser more like a browser which means it will strip invalid
HTML and correct it when it's less-broken.  Tables are one of the things
it does this for.
Sanitize::Config objects are now frozen by default so we can't manipulate
them in place.  It does now come with a `merge` method that lets you
combine them though, and this means we no longer need our `deep_merge`
method.

While we're manipulating the config we can also remove some of the
settings that we've added because they have been rolled into the default
RELAXED settings.  Specifically: `div`, `aside`, `span` are now allowed
elements, `id`, `class` are allowed attributes, and `rel` is allowed on
an `a` tag.
@h-lame h-lame force-pushed the bump-sanitize-gem branch from 8e3a400 to 95a9a02 Compare March 21, 2018 17:17
@suzannehamilton suzannehamilton merged commit 1c46d03 into master Mar 23, 2018
@suzannehamilton suzannehamilton deleted the bump-sanitize-gem branch March 23, 2018 09:23
kevindew added a commit that referenced this pull request Jul 17, 2019
This is to resolve a blocking issue that prevents Whitehall from
upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of
Whitehall content have unusual unicode characters that the Santize gem
strips out, this then makes the content fail HTML Validation and was
introduced as part of a Santize update in
#127.

The approach I've taken to resolve this is to remove forbidden unicode
characters from the source markdown that is converted to HTML. Since
these characters are invalid in a HTML output it seems reasonable to
consider them as invalid as part of govspeak input.

Looking through Whitehalls database I found there to be 765
Edition::Translation models and 248 GovspeakContent models that have at
last one of these characters in their body fields.

Initially I intended to resolve this with a data migration on Whitehall,
however that would have caused a problem if anyone used these characters
again as it would be very difficult to determine why content was failing
govspeak validation. I figure it's actually more user friendly to strip
the character out and then if they notice a space or similar is missing
they can correct that with their keyboard.

I noticed that in 2019 there had been 74 of these items added so this
doesn't appear to be a legacy issue. My current theory is that perhaps
these are characters that are present in content that is being pasted
from such as a PDF document.

Anyway, with this merged in Whitehall can proceed to be upgraded to
Govspeak 6 as with this and
#159 all GovspeakContent models
are valid govspeak and only 14 Edition::Translation bodies are invalid,
all of which are also invalid under Govspeak 5.5.
kevindew added a commit that referenced this pull request Jul 17, 2019
This is to resolve a blocking issue that prevents Whitehall from
upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of
Whitehall content have unusual unicode characters that the Santize gem
strips out, this then makes the content fail HTML Validation and was
introduced as part of a Santize update in
#127.

The approach I've taken to resolve this is to remove forbidden unicode
characters from the source markdown that is converted to HTML. Since
these characters are invalid in a HTML output it seems reasonable to
consider them as invalid as part of govspeak input.

Looking through Whitehalls database I found there to be 765
Edition::Translation models and 248 GovspeakContent models that have at
last one of these characters in their body fields.

Initially I intended to resolve this with a data migration on Whitehall,
however that would have caused a problem if anyone used these characters
again as it would be very difficult to determine why content was failing
govspeak validation. I figure it's actually more user friendly to strip
the character out and then if they notice a space or similar is missing
they can correct that with their keyboard.

I noticed that in 2019 there had been 74 of these items added so this
doesn't appear to be a legacy issue. My current theory is that perhaps
these are characters that are present in content that is being pasted
from such as a PDF document.

Anyway, with this merged in Whitehall can proceed to be upgraded to
Govspeak 6 as with this and
#159 all GovspeakContent models
are valid govspeak and only 14 Edition::Translation bodies are invalid,
all of which are also invalid under Govspeak 5.5.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants