Whitespace
Treatment of whitespace in HTML is determined by its rendering in browsers. This is called whitespace collapsing. When Turndown processes HTML, the rule of thumb is:
- The whitespace SHOULD be collapsed the HTML way if the generated Markdown would render differently than the original HTML.
- The whitespace MIGHT be collapsed the HTML way, as long as it does not cause rendering differences.
The second principle allows Turndown to simplify several things, and its correct operation actually depends on it.
There is not much special about whitespace inside text. The situation is trickier for whitespace at the edges of text nodes.
See CommonMark spec for more information.
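As a rough illustration of the collapsing mentioned above, the following sketch replaces each run of ASCII whitespace with a single space and leaves the non-breaking space alone. The helper name `collapseAsciiWhitespace` is hypothetical and not part of Turndown, and real browser collapsing also depends on element boundaries and block edges, which this sketch ignores.

```javascript
// ASCII whitespace as defined for HTML: space, tab, LF, FF, CR.
const ASCII_WS_RUN = /[ \t\n\f\r]+/g

// Hypothetical helper (not Turndown's API): collapse every run of
// ASCII whitespace into one space. U+00A0 is not in the character
// class, so non-breaking spaces are never merged or removed.
function collapseAsciiWhitespace (text) {
  return text.replace(ASCII_WS_RUN, ' ')
}

const collapsed = collapseAsciiWhitespace('foo \n\t bar\u00A0 \u00A0baz')
// Show code points so the non-breaking spaces are visible.
console.log(Array.from(collapsed, c => c.codePointAt(0).toString(16)).join(' '))
```

Only the ASCII runs shrink; the two non-breaking spaces and the single space between them survive unchanged.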
The situation in Turndown 6.0.0 is as follows. Consider an element containing a text node, followed by another text node: <b>foo(End)</b>(Start)Next. The non-breaking space (&nbsp; = \u00A0) is alternatively written as · to improve readability.
# | End | Next Start | HTML collapse | Turndown operation | Example of Processing |
---|---|---|---|---|---|
1 | ASCII WS | ASCII WS | eaten | no-op | <i>foo </i> bar → <i>foo</i> bar → _foo_ bar |
2 | nonWS | ASCII WS | no-op | no-op | <i>foo</i> bar → <i>foo</i> bar → _foo_ bar |
3 | ASCII WS | nonWS | no-op | move End outside | <i>foo </i>bar → <i>foo </i>bar → _foo_ bar |
4 | nonASCII WS | nonWS | no-op | • change End to 0x20 • move End outside | <i>foo·</i>bar → <i>foo·</i>bar → _foo_ bar |
5 | nonASCII WS | nonASCII WS | no-op | • change End to 0x20 • move End outside | <i>foo·</i>·bar → <i>foo·</i>·bar → _foo_ ·bar |
6 | ASCII WS | nonASCII WS | no-op | move End outside | <i>foo </i>·bar → <i>foo </i>·bar → _foo_ ·bar |
7 | nonASCII WS | ASCII WS | no-op | • output End as is • change End to 0x20 move End outside | <i>foo·</i> bar → <i>foo·</i> bar → _foo·_ bar |
Cases 1 and 2 exactly match the rule of thumb. Let's discuss the other ones.
Case 3 is a small change to the HTML behavior, but an acceptable one:
- The text content still matches.
- It is not really unexpected; normal whitespace should be treated as fragile.
- Such input likely results from unintended input artefacts, e.g. mouse-selecting text and pressing the I button.
- A strictly matching encoding, _foo _bar, is just too ugly given the above reasons.
On the other hand, this is also applied to inlines that don't need it, specifically to <code>. E.g. (` foo `) in Markdown renders as (<code> foo </code>). But such HTML would now convert back to ( `foo` ).
This might be unintended even in the current code, as rules.code in commonmark-rules.js actually invokes trim() when testing for emptiness, which either just follows the letter of the CommonMark spec, or also suggests that untrimmed content is expected. [DO-NOT-COLLAPSE-CODE-WS]
The technical issue behind the current behavior lies in the CommonMark spec. CommonMark requires some of the delimiters not to be adjacent to Unicode whitespace. But HTML whitespace collapsing works only with ASCII whitespace.
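The mismatch between the two whitespace definitions can be sketched as a small classifier. The function name `classifyEdgeChar` is hypothetical; the labels mirror the End and Next Start columns used in this document. The character classes follow the HTML definition of ASCII whitespace and the CommonMark definition of Unicode whitespace (the Zs category plus tab, LF, FF, and CR), as far as I understand those specs.

```javascript
// HTML "ASCII whitespace": space, tab, LF, FF, CR.
const ASCII_WS = /^[ \t\n\f\r]$/
// CommonMark "Unicode whitespace": Unicode category Zs plus tab, LF, FF, CR.
const UNICODE_WS = /^[\p{Zs}\t\n\f\r]$/u

// Hypothetical helper: classify one character at the edge of a text node.
function classifyEdgeChar (ch) {
  if (ASCII_WS.test(ch)) return 'ASCII WS'
  // Whitespace for CommonMark, but invisible to HTML collapsing.
  if (UNICODE_WS.test(ch)) return 'nonASCII WS'
  return 'nonWS'
}

console.log(classifyEdgeChar(' '))      // regular space
console.log(classifyEdgeChar('\u00A0')) // non-breaking space
console.log(classifyEdgeChar('x'))      // ordinary letter
```

Characters in the middle class are exactly the ones that cause trouble: browsers keep them, while CommonMark treats them as whitespace next to emphasis delimiters.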
Suppose ~ means a non-breaking space (HTML &nbsp;, Unicode \u00A0).
The current behavior has the following issues:
- Replacing Unicode whitespace with ASCII is not expected by users, e.g. Law §~<b>1782</b> should not break after §. [RESPECT-ORIGINAL-WS]
- Without extra escaping, replacing Unicode whitespace can produce false formatting. E.g. <p>~1. foo</p> and <p>1.~<b>foo</b></p> both produce ordered lists, which were not in the input. [RESPECT-ORIGINAL-WS]
- Users do not expect ASCII and non-ASCII whitespace to be merged, e.g. the whitespace in always add <b>~km</b> as the distance unit should not collapse into a single space after add. [DO-NOT-COLLAPSE-MIXED-WS]
- Some users might expect Unicode whitespace to be kept within emphasis elements, e.g. Law §<b>~1782</b>, which is achievable by using HTML entities in Markdown. But this can be considered similar to the situation with normal whitespace, which is actually moved. So we prefer moving it over introducing conversion to HTML entities. [DO-NOT-MOVE-UNICODE-WS]
Case 7 adds the extra issue of broken formatting on top of the previous issues. This is partially due to an implementation detail of how it is decided when the content should be trim()med. [TRIM-REGARDLESS-OF-WS-DETECTION]
The issue would also occur if HTML whitespace were not collapsed. But it is actually collapsed, and the enabler of this issue is the aforementioned [DO-NOT-COLLAPSE-MIXED-WS].
Successful completion of [RESPECT-ORIGINAL-WS] and [DO-NOT-COLLAPSE-MIXED-WS] leads to the following results. Same as above, · represents \u00A0 and &nbsp;.
# | Name | Input | Output |
---|---|---|---|
4 | element with trailing nonASCII WS followed by nonWS | <i>foo·</i>bar | _foo_·bar |
5 | element with trailing nonASCII WS followed by nonASCII WS | <i>foo·</i>·bar | _foo_··bar |
6 | element with trailing ASCII WS followed by nonASCII WS | <i>foo </i>·bar | _foo_ ·bar |
7 | element with trailing nonASCII WS followed by ASCII WS | <i>foo·</i> bar | _foo_· bar |
4 mirrored | nonWS followed by element with leading nonASCII WS | foo<i>·bar</i> | foo·_bar_ |
5 mirrored | nonASCII WS followed by element with leading nonASCII WS | foo·<i>·bar</i> | foo··_bar_ |
6 mirrored | nonASCII WS followed by element with leading ASCII WS | foo·<i> bar</i> | foo· _bar_ |
7 mirrored | ASCII WS followed by element with leading nonASCII WS | foo <i>·bar</i> | foo ·_bar_ |
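The outputs in the table above can be reproduced with a minimal sketch that moves edge whitespace outside the emphasis delimiters while keeping every whitespace character exactly as it was. The helper `emphasize` is hypothetical and is not Turndown's actual implementation. It assumes that HTML whitespace collapsing has already been applied to its input and handles only a single text run.

```javascript
// Unicode whitespace per CommonMark: category Zs plus tab, LF, FF, CR.
const LEADING_WS = /^[\p{Zs}\t\n\f\r]+/u
const TRAILING_WS = /[\p{Zs}\t\n\f\r]+$/u

// Hypothetical helper: wrap the non-whitespace core of `text` in the
// delimiter and put the original edge whitespace outside, unchanged.
function emphasize (text, delimiter = '_') {
  const leading = (text.match(LEADING_WS) || [''])[0]
  const rest = text.slice(leading.length)
  const trailing = (rest.match(TRAILING_WS) || [''])[0]
  const content = rest.slice(0, rest.length - trailing.length)
  // Whitespace-only content gets no delimiters at all.
  if (!content) return leading + trailing
  return leading + delimiter + content + delimiter + trailing
}

// Case 4: <i>foo·</i>bar, with · written as \u00A0.
const case4 = emphasize('foo\u00A0') + 'bar'
console.log(case4 === '_foo_\u00A0bar')
```

Because the whitespace is moved rather than rewritten, a non-breaking space stays non-breaking, and an adjacent ASCII space is not merged with it.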
[DO-NOT-COLLAPSE-CODE-WS] is slightly trickier to describe, as it has a few preconditions:
- It is only meaningful when the <code> element is treated as a preformatted inline element (like in GitLab). See this issue at the collapse-whitespace project and its fix.
- Although it is harmless to always treat code as inline-preformatted, there might still be users expecting the old behavior, so making this configurable makes sense.
- flankingWhitespace() has to match this setting.
- And rules.code in commonmark-rules.js contains a minor bug, which has to be fixed.
This might sound complicated, but the code is actually very small and leads to the following results when the preformattedCode setting is enabled:
Input | Output |
---|---|
An <code> indented code line</code> | An ` indented code line` |
(<code> foo </code>) | (` foo `) |
(<i> <code> bar </code> </i>) | ( _` bar `_ ) |
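The effect of the setting can be sketched as follows. Only the option name preformattedCode comes from the text above. The helpers `edgeWhitespace` and `renderCode` are hypothetical, and the sketch deliberately ignores backticks inside the content and empty code elements.

```javascript
// Hypothetical helper: split text into leading whitespace, content and
// trailing whitespace. For <code> with preformattedCode enabled, nothing
// is split off, so the whitespace stays inside the code span.
function edgeWhitespace (tagName, text, options = {}) {
  if (tagName === 'code' && options.preformattedCode) {
    return { leading: '', content: text, trailing: '' }
  }
  const match = text.match(/^(\s*)([\s\S]*?)(\s*)$/)
  return { leading: match[1], content: match[2], trailing: match[3] }
}

// Hypothetical helper: render the text of a <code> element as a code span.
function renderCode (text, options) {
  const { leading, content, trailing } = edgeWhitespace('code', text, options)
  return leading + '`' + content + '`' + trailing
}

console.log('(' + renderCode(' foo ', { preformattedCode: true }) + ')')
console.log('(' + renderCode(' foo ', {}) + ')')
```

With the option enabled the spaces stay between the backticks; without it they are moved outside, which is the older behavior described earlier.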
See the behavior in GitLab: