HTML API: Introduce HTML templating #12

dmsnell · 2024-01-11T03:30:29Z

Trac ticket: Core-60229

🚧👷‍♂️🏗️ This feature is currently proposed for the WordPress 6.6 release cycle. It was pushed back from 6.5 to allow more time for the design work to proceed.

To be recreated in the WordPress repo once WordPress#5683 merges.

Todo

embed replacement inside the Tag Processor
don't allow replacement of escaped funky comment syntax (currently occurs inside attribute value based on how this is built at the moment)
figure out what rules need to apply for nested HTML
figure out how to differentiate nested HTML without requiring sigils or other extra syntax

Description

Currently only renders text data:

does not render nested HTML (escapes everything)
does not escape URLs differently than other attributes
relies on esc_attr() and esc_html() which today are the same thing in WordPress, but which could be greatly expanded to improve the overall escaping situation

echo WP_HTML_Template::render(
	<<<HTML
<a href="</%url>">
	<img src="</%url>">
	</%url>
</a>
HTML,
	array( 'url' => 'https://s.wp.com/i/atat.png?w=640&h=480&alt="atat>atst"' ),
);

outputs

<a href="https://s.wp.com/i/atat.png?w=640&amp;h=480&amp;alt=&quot;atat&gt;atst&quot;">
<img src="https://s.wp.com/i/atat.png?w=640&amp;h=480&amp;alt=&quot;atat&gt;atst&quot;">
https://s.wp.com/i/atat.png?w=640&amp;h=480&amp;alt=&quot;atat&gt;atst&quot;
</a>

This proposed templating syntax uses closing tags with invalid tag names, so-called "funky comments," as placeholders. They were chosen because browsers convert them to HTML comments in the DOM, because nearly all browsers already support them, and because the syntax cannot be nested. The leading % indicates that the placeholder's value comes from the args array, under the key named by whatever follows the %.
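As a rough illustration of the placeholder rule described above, here is a minimal sketch in plain PHP. It is not the `WP_HTML_Template` implementation: the function name `render_placeholders()` is hypothetical, it uses a regular expression rather than the Tag Processor, it substitutes `htmlspecialchars()` for `esc_attr()` / `esc_html()`, and it assumes that a placeholder with no matching key is left untouched.

```php
<?php
/**
 * Hypothetical helper for illustration only; not part of WordPress.
 *
 * Replaces every `</%name>` placeholder with the escaped value of
 * `$args['name']`. Placeholders without a matching key are left as-is
 * (an assumption; the proposal does not specify this behavior).
 */
function render_placeholders( string $template, array $args ): string {
	$rendered = preg_replace_callback(
		// `</%`, then the key name, ending at the first `>`.
		'~</%([^>]+)>~',
		static function ( array $match ) use ( $args ): string {
			$name = $match[1];

			if ( ! array_key_exists( $name, $args ) ) {
				return $match[0];
			}

			// Everything is escaped as text, so nested HTML is never rendered.
			return htmlspecialchars( (string) $args[ $name ], ENT_QUOTES | ENT_HTML5, 'UTF-8' );
		},
		$template
	);

	// preg_replace_callback() returns null on a regex error; fall back to the input.
	return $rendered ?? $template;
}
```

Called with the template `<a href="</%url>"></%url></a>` and the `url` value from the example above, this sketch escapes the value the same way in both positions, which mirrors the "escapes everything" limitation listed in the description.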

dmsnell · 2024-01-11T03:30:40Z

cc: @westonruter @ockham

westonruter · 2024-01-11T18:05:44Z

funky comment syntax

What is the reasoning behind the current syntax?

dmsnell · 2024-01-11T20:03:02Z

Sorry @westonruter - the reasoning was in the Trac ticket, but I've copied it here.

I also wrote about it in the HTML API Progress Report a while back. The idea is that this happens to be a convenient syntax that checks off basically everything I've been looking for in order to replace dynamic content - shortcodes 2.0.

In short (very short):

funky comments cannot be nested, by construction
once inside a funky comment, all bytes are allowed until the first ASCII >
the first symbol provides a very convenient sigil form to differentiate multiple kinds of bits of content: a placeholder, a shortcode, a translation, etc…
these are pure HTML and not a new superset syntax of HTML, meaning that when editing in an HTML editor they should stand out properly and not break the flow.
being existing HTML syntax, they cannot break syntax boundaries and cause further parsing problems down the line
they are concise and easily hand-written
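The first three points can be sketched with a small scanner. This is only an approximation under stated assumptions: `find_funky_comments()` is a hypothetical name, and the regular expression ignores HTML context (attribute values, SCRIPT contents, real comments), which the Tag Processor does track. It treats `</` followed by any character other than an ASCII letter or `>` as the start of a funky comment and ends it at the first `>`.

```php
<?php
/**
 * Hypothetical helper for illustration only; not part of WordPress.
 *
 * Finds funky comments in a string and splits each into its leading
 * sigil and the remaining content.
 */
function find_funky_comments( string $html ): array {
	$found = array();

	// `</`, one non-letter sigil character, then all bytes up to the first `>`.
	preg_match_all( '~</([^a-zA-Z>])([^>]*)>~', $html, $matches, PREG_SET_ORDER );

	foreach ( $matches as $match ) {
		$found[] = array(
			'sigil'   => $match[1],
			'content' => $match[2],
		);
	}

	return $found;
}
```

Because the first `>` always ends the match, an input such as `</%outer </%inner> tail>` yields a single funky comment whose content is `outer </%inner`; the trailing ` tail>` is ordinary text. A normal closing tag such as `</p>` is skipped because its name starts with a letter.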

Since its introduction in WordPress 6.2 the HTML Tag Processor has provided a way to scan through all of the HTML tags in a document and then read and modify their attributes. In order to reliably do this, it also needed to be aware of other kinds of HTML syntax, but it didn't expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a few helper methods to read information about or from each token. Most significantly, this introduces the ability to read `#text` nodes in the document.

What's new in the Tag Processor?
================================

 - `next_token()` visits every distinct syntax token in a document.
 - `get_token_type()` indicates what kind of token it is.
 - `get_token_name()` returns something akin to `DOMNode.nodeName`.
 - `get_modifiable_text()` returns the text associated with a token.
 - `get_comment_type()` indicates why a token represents an HTML comment.

Example usage.
==============

{{{
<?php
function strip_all_tags( $html ) {
	$text_content = '';
	$processor    = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_token() ) {
		if ( '#text' !== $processor->get_token_type() ) {
			continue;
		}

		$text_content .= $processor->get_modifiable_text();
	}

	return $text_content;
}
}}}

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of every HTML element separately. Now, however, there are special tags which it only visits once, as if those elements were void tags without a closer. These are special tags because their content contains no other HTML or markup, only non-HTML content.

 - SCRIPT elements contain raw text which is isolated from the rest of the HTML document and fed separately into a JavaScript engine. There are complicated rules to avoid escaping the script context in the HTML. The contents are left verbatim, and character references are not decoded.

 - TEXTAREA and TITLE elements contain plain text which is decoded before display, e.g. transforming `&amp;` into `&`. Any markup which resembles tags is treated as verbatim text and not a tag.

 - IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the TEXTAREA and TITLE elements, but no character references are decoded. For example, `&amp;` inside a STYLE element is passed to the CSS engine as the literal string `&amp;` and _not_ as `&`.

Because it's important not to treat this inner content separately from the elements containing it, the Tag Processor combines them when scanning into a single match and makes their content available as modifiable text (see below). This means that the Tag Processor will no longer visit a closing tag for any of these elements unless that tag is unexpected.

{{{
<title>There is only a single token in this line</title>
<title>There are two tokens in this line></title></title>
</title><title>There are still two tokens in this line></title>
}}}

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in HTML. There are only a few kinds of tokens in HTML:

 - a tag has a name, attributes, and a closing or self-closing flag.
 - a text node, or `#text` node, contains plain text which is displayed in a browser and which is decoded before display.
 - a DOCTYPE declaration indicates how to parse the document.
 - a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will recognize, some of which don't exist as concepts in HTML. These mostly comprise XML syntax elements that aren't part of HTML (such as CDATA and processing instructions) and invalid HTML syntax that transforms into comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way. A closing tag with an invalid name is considered a "funky comment." In the browser these become HTML comments just like any other, but their syntax is convenient for representing a variety of bits of information in a well-defined way and which cannot be nested or recursive, given the parsing rules handling this invalid syntax.

 - `</1>`
 - `</%avatar_url>`
 - `</{"wp_bit": {"type": "post-author"}}>`
 - `</[post-author]>`
 - `</__( 'Save Post' );>`

All of these examples become HTML comments in the browser. The content inside the funky comment is easily parsable, whereby the only rule is that it starts at the `<` and continues until the nearest `>`. There can be no funky comment inside another, because that would imply having a `>` inside of one, which would actually terminate the first one.

What is modifiable text?
========================

Modifiable text is similar to the `innerText` property of a DOM node. It represents the span of text for a given token which may be modified without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where text nodes are found. Only the special tags mentioned above have modifiable text.

{{{
<div class="post">Another day in HTML</div>
└─ tag ──────────┘└─ text node ─────┘└────┴─ tag
}}}

{{{
<title>Is <img> &gt; <image>?</title>
│      └ modifiable text ───┘       │ "Is <img> > <image>?"
└─ tag ─────────────────────────────┘
}}}

Text nodes
==========

Text nodes are entirely modifiable text.

{{{
This HTML document has no tags.
└─ modifiable text ───────────┘
}}}

Comments
========

The modifiable text inside a comment is the portion of the comment that doesn't form its syntax. This applies for a number of invalid comments.

{{{
<!-- this is inside a comment -->
│   └─ modifiable text ──────┘  │
└─ comment token ───────────────┘
}}}

{{{
<? this is an invalid comment -->
│ └─ modifiable text ────────┘  │
└─ comment token ───────────────┘
}}}

{{{
<[CDATA[this is an invalid comment]]>
│       └─ modifiable text ───────┘ │
└─ comment token ───────────────────┘
}}}

Other token types also have modifiable text. Consult the code or tests for further information.

Developed in WordPress#5683
Discussed in https://core.trac.wordpress.org/ticket/60170

Follows [57575]

Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
Fixes #60170

git-svn-id: https://develop.svn.wordpress.org/trunk@57348 602fd350-edb4-49c9-b593-d223f7449a82

Theme.json stylesheets attempting to use a custom root selector are generated with incorrect styles. Props aaronrobertshaw, get_dave, mukesh27. Fixes #60343. git-svn-id: https://develop.svn.wordpress.org/trunk@57352 602fd350-edb4-49c9-b593-d223f7449a82

More categories, better organization for patterns as they grow and power more WordPress websites. Props aaronrobertshaw, get_dave. Fixes #60342. git-svn-id: https://develop.svn.wordpress.org/trunk@57353 602fd350-edb4-49c9-b593-d223f7449a82

Add a new `hooked_block_{$block_type}` filter that allows modifying a hooked block (in parsed block format) prior to insertion, while providing read access to its anchor block (in the same format). This allows block authors to e.g. set a hooked block's attributes, or its inner blocks; the filter can peruse information about the anchor block when doing so. As such, this filter provides a solution to both #59572 and #60126. The new filter is designed to strike a good balance and separation of concerns with regard to the existing [https://developer.wordpress.org/reference/hooks/hooked_block_types/ `hooked_block_types` filter], which allows addition or removal of a block to the list of hooked blocks for a given anchor block -- all of which are identified only by their block ''types''. This new filter, on the other hand, only applies to ''one'' hooked block at a time, and allows modifying the entire (parsed) hooked block; it also gives (read) access to the parsed anchor block. Props gziolo, tomjcafferkey, andrewserong, isabel_brison, timbroddin, yansern. Fixes #59572, #60126. git-svn-id: https://develop.svn.wordpress.org/trunk@57354 602fd350-edb4-49c9-b593-d223f7449a82

…ter. Add missing explanation of the dynamic part of the hook name. Follow-up [57354]. Props swissspidy. See #59572, #60126. git-svn-id: https://develop.svn.wordpress.org/trunk@57355 602fd350-edb4-49c9-b593-d223f7449a82

Ensure logged out users are redirected to the media file when attachment pages are inactive. This removes the read_post capability check from the canonical redirects as anonymous users lack the permission. This was previously committed in [57310] before being reverted in [57318]. This update includes a fix to cover instances where revealing a URL could be considered a data leak and greatly expands the unit tests to ensure that this is covered along with many other instances. Follow-up to [56657], [56658], [56711], [57310], [57318]. Props peterwilsoncc, jorbin, afercia, aristath, chesio, joppuyo, jorbin, lakshmananphp, poena, sergeybiryukov, swissspidy, johnbillion. Fixes #59866. See #57913. git-svn-id: https://develop.svn.wordpress.org/trunk@57357 602fd350-edb4-49c9-b593-d223f7449a82

dmsnell · 2024-01-25T22:28:33Z

Closing to recreate in WordPress/wordpress-develop

dmsnell · 2024-01-26T00:26:47Z

Replaced by WordPress#5949

dmsnell requested a review from sirreal January 11, 2024 03:30

dmsnell force-pushed the html-api/render-function branch 3 times, most recently from 2051d38 to 57ac540 Compare January 12, 2024 12:54

dmsnell force-pushed the html-api/scan-all-tokens branch from 103a556 to 14eeb07 Compare January 12, 2024 12:54

dmsnell force-pushed the html-api/render-function branch 3 times, most recently from c6f5881 to bb33652 Compare January 12, 2024 18:51

dmsnell force-pushed the html-api/scan-all-tokens branch from 9f29920 to 1098c19 Compare January 15, 2024 17:22

dmsnell force-pushed the html-api/render-function branch 4 times, most recently from 2301aaf to 2a964f4 Compare January 15, 2024 22:01

dmsnell force-pushed the html-api/scan-all-tokens branch from 0ca080a to f502153 Compare January 16, 2024 15:36

dmsnell mentioned this pull request Jan 17, 2024

HTML API: Introduce WP_HTML::tag() for safely creating HTML. WordPress/wordpress-develop#5884

Draft

dmsnell force-pushed the html-api/scan-all-tokens branch 6 times, most recently from 9d01322 to 30991d7 Compare January 24, 2024 21:47

dmsnell and others added 6 commits January 24, 2024 23:35

Docs: Fix typo in _get_block_template_file() DocBlock.

fa441af

Follow-up to [55744]. See #59651. git-svn-id: https://develop.svn.wordpress.org/trunk@57349 602fd350-edb4-49c9-b593-d223f7449a82

I18N: Rename WP_Translation_Controller::instance() method to `get_i…

a615250

…nstance()`. This improves consistency as `get_instance()` is more commonly used in core. See #59656. git-svn-id: https://develop.svn.wordpress.org/trunk@57350 602fd350-edb4-49c9-b593-d223f7449a82

Twenty Twenty-Four: Change font family slug to lowercase.

b617b5c

Ensures referencing the correct CSS custom property. Props RavanH, poena, onemaggie, huzaifaalmesbah, mukesh27. Fixes #60325. git-svn-id: https://develop.svn.wordpress.org/trunk@57351 602fd350-edb4-49c9-b593-d223f7449a82

ockham and others added 10 commits January 25, 2024 13:46

Docs: Fix a few typos in wp-includes/pomo/po.php.

bb14399

Props shailu25. Fixes #60346. git-svn-id: https://develop.svn.wordpress.org/trunk@57356 602fd350-edb4-49c9-b593-d223f7449a82

HTML API: Introduce HTML Template Renderer

ba8701b

Currently only renders text data: - does not render nested HTML (escapes everything) - does not escape URLs

Test the handling of malicious attribute names and values

d4ce9f4

Explore a refactor of image lightbox code

fdfd521

Test refactor of wp_video_shortcut

97139af

Explore refactors in post-template

2b2b29a

Explore refactor of wp-editor code, adds set_modifiable_text()

ba8ff86

dmsnell force-pushed the html-api/render-function branch from b6e4c11 to ba8ff86 Compare January 25, 2024 22:26

dmsnell closed this Jan 25, 2024
