Bump sanitize gem #127

h-lame · 2018-03-21T17:13:59Z

For: https://trello.com/c/X6CWuGOX/134-rails-security-vulnerability-sanitize-463-libxml-292

This mitigates against rgrove/sanitize#176. While we're at it we tidy up some config that we no longer need to set because it's handled by default by the gem now. Check individual commits for details.

This fixes CVE-2018-2740 (See: rgrove/sanitize#176) We also have to fix some tests around table tags, because as of sanitize 3.x it uses a parser more like a browser which means it will strip invalid HTML and correct it when it's less-broken. Tables are one of the things it does this for.

Sanitize::Config objects are now frozen by default so we can't manipulate them in place. It does now come with a `merge` method that lets you combine them though, and this means we no longer need our `deep_merge` method. While we're manipulating the config we can also remove some of the settings that we've added because they have been rolled into the default RELAXED settings. Specifically: `div`, `aside`, `span` are now allowed elements, `id`, `class` are allowed attributes, and `rel` is allowed on an `a` tag.

This is to resolve a blocking issue that prevents Whitehall from upgrading to Govspeak > 5.5. A problem we have is that a number of pieces of Whitehall content have unusual unicode characters that the Santize gem strips out, this then makes the content fail HTML Validation and was introduced as part of a Santize update in #127. The approach I've taken to resolve this is to remove forbidden unicode characters from the source markdown that is converted to HTML. Since these characters are invalid in a HTML output it seems reasonable to consider them as invalid as part of govspeak input. Looking through Whitehalls database I found there to be 765 Edition::Translation models and 248 GovspeakContent models that have at last one of these characters in their body fields. Initially I intended to resolve this with a data migration on Whitehall, however that would have caused a problem if anyone used these characters again as it would be very difficult to determine why content was failing govspeak validation. I figure it's actually more user friendly to strip the character out and then if they notice a space or similar is missing they can correct that with their keyboard. I noticed that in 2019 there had been 74 of these items added so this doesn't appear to be a legacy issue. My current theory is that perhaps these are characters that are present in content that is being pasted from such as a PDF document. Anyway, with this merged in Whitehall can proceed to be upgraded to Govspeak 6 as with this and #159 all GovspeakContent models are valid govspeak and only 14 Edition::Translation bodies are invalid, all of which are also invalid under Govspeak 5.5.

h-lame added 3 commits March 21, 2018 17:04

Update changelog

95a9a02

h-lame force-pushed the bump-sanitize-gem branch from 8e3a400 to 95a9a02 Compare March 21, 2018 17:17

thomasleese approved these changes Mar 21, 2018

View reviewed changes

suzannehamilton approved these changes Mar 22, 2018

View reviewed changes

suzannehamilton merged commit 1c46d03 into master Mar 23, 2018

suzannehamilton deleted the bump-sanitize-gem branch March 23, 2018 09:23

kevindew mentioned this pull request Jul 17, 2019

Remove forbidden unicode characters from output #164

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump sanitize gem #127

Bump sanitize gem #127

h-lame commented Mar 21, 2018 •

edited

Loading

Bump sanitize gem #127

Bump sanitize gem #127

Conversation

h-lame commented Mar 21, 2018 • edited Loading

h-lame commented Mar 21, 2018 •

edited

Loading