WP_HTML_Tag_Processor overview issue #44410

adamziel · 2022-09-23T05:39:23Z

#42485) Introduce WP_HTML_Tag_Processor for reliably modifying HTML attributes.

Dynamic blocks often need to inject a CSS class name or set <img src /> in the rendered block HTML markup but lack the means to do so. WP_HTML_Tag_Processor solves this problem. It scans through an HTML document to find specific tags, then transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener).

Importantly, it does not fully parse HTML or _recurse_ into the HTML structure. Instead WP_HTML_Tag_Processor scans linearly through a document and only parses the HTML tag openers.

Example:

```
$p = new WP_HTML_Tag_Processor('<div id="first"><img /></div>');
$p->next_tag('img')->set_attribute('src', '/wp-content/logo.png');
echo $p;
// <div id="first"><img src="/wp-content/logo.png" /></div>
```

For more details and context, see the original GitHub Pull Request at #42485 and the overview issue at #44410.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Co-authored-by: Dennis Snell <dennis.snell@automattic.com>
Co-authored-by: Grzegorz Ziółkowski <grzegorz.ziolkowski@automattic.com>
Co-authored-by: Sören Wrede <soerenwrede@gmail.com>
Co-authored-by: Colin Stewart <79332690+costdev@users.noreply.github.com>

azaozz · 2022-09-23T18:56:49Z

Frankly, I think #42485 was not ready for merging. It seems to lack any security features, references, docs, how-tos, etc. It doesn't even mention what security and escaping requirements it has. It seems open to exploits in its current state.

This issue mentions some of that, but misses a few things, for example the validity of new attribute names. Also, it doesn't mention the needed security-related inline docs and examples, which are crucial for new APIs.

As I commented on #42485 (comment), the security side of this API still awaits a decision. Imho it needs to implement some restrictions that will make it much more robust and problem-free in the future, and now is the time for these restrictions. If this is not done (in time), there is a risk that WP will end up with another exploit-prone API, just like the shortcodes. If you don't believe me, please see the huge, slow, cumbersome regex-based shortcode sanitization functions that had to be added in order to try to secure badly written, exploitable code in plugins. Please consider restricting the illegal chars in attribute names and html-entity encoding the htmlspecialchars in attribute values!
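The two restrictions requested above could look roughly like the following. This is only a sketch in plain PHP, not the code that was later merged: the helper name `example_build_attribute()` and the exact set of rejected characters are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper (not part of WP_HTML_Tag_Processor): serialize one
 * attribute, rejecting unsafe names and encoding the value.
 */
function example_build_attribute( string $name, string $value ): string {
	// Reject empty names and names containing characters that could end the
	// attribute name early: quotes, ">", "/", "=", whitespace, control chars.
	if ( '' === $name || 1 === preg_match( '~["\'>/=\x00-\x20\x7F]~', $name ) ) {
		throw new InvalidArgumentException( 'Invalid attribute name.' );
	}

	// Encode &, <, >, and both quote characters so the value cannot
	// break out of the double-quoted attribute.
	$escaped = htmlspecialchars( $value, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );

	return $name . '="' . $escaped . '"';
}

echo example_build_attribute( 'title', 'a "b" <c>' ), "\n";
// title="a &quot;b&quot; &lt;c&gt;"
```

A name such as `x"onclick` throws instead of being written into the markup, which is the kind of restriction asked for here.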

Also see #42485 (comment).

adamziel · 2022-09-30T04:24:41Z

@azaozz – #42485 was merged to continue iterating in smaller PRs, not because every kink was worked out. I agree with your points about html-entity encoding and restricting illegal chars. The former is now merged, the latter is in motion. The list of todos also reflects updating the docs.

About this point:

Decide whether on* attributes should be banned in set_attribute( ) to make adding inline JavaScript harder (see @peterwilsoncc's comment).

I don't think there's an effective way to do that. Say we restrict onclick. The developer can just work around it like this:

$p->set_attribute( '__onclick', 'alert()' );
$markup = str_replace( '__onclick', 'onclick', $p );

Notably, the processor does not allow injecting new tags, so there's no risk of injecting a rogue <script> tag. I'll mark that todo as completed for now – if anyone disagrees please speak out.

adamziel · 2022-09-30T04:27:49Z

Find a canonical way of stringifying the processor. One that's different from (string) $w.

A getter like $p->getUpdatedHTML() or a magic property like $p->updatedHTML could do the trick here. I mean that in addition to having __toString(), not as a replacement.
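To make the idea concrete, here is a minimal sketch of an explicit getter living alongside `__toString()`. `Example_Markup_Holder` is a hypothetical stand-in class, not the real Tag Processor, and the method name simply follows the suggestion above.

```php
<?php
// Hypothetical stand-in class used only to illustrate the API shape.
class Example_Markup_Holder {
	private $html;

	public function __construct( string $html ) {
		$this->html = $html;
	}

	// Explicit, discoverable way to read the markup.
	public function getUpdatedHTML(): string {
		return $this->html;
	}

	// Kept for convenience; delegates to the getter so both always agree.
	public function __toString(): string {
		return $this->getUpdatedHTML();
	}
}

$p = new Example_Markup_Holder( '<a></a>' );
echo $p->getUpdatedHTML(), "\n"; // <a></a>
echo (string) $p, "\n";          // <a></a>
```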

adamziel · 2022-12-01T23:23:50Z

Wild idea: Tag Processor could use seekable PHP streams instead of strings and internal pointers:

php > $fp = fopen('php://memory', 'w+');
php > fwrite($fp, '<a href="');
php > fwrite($fp, 'http://wordpress.org/');
php > fwrite($fp, '">Link</a>');
php > fseek($fp, 0);
php > echo fread($fp, 1024);
<a href="http://wordpress.org/">Link</a>

Potential benefits: Simpler, faster code that uses less memory (if we can get rid of $updated_html). This needs more investigation.
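As a rough illustration of what a stream-based update might involve, the sketch below replaces a byte range by copying the untouched prefix and suffix into a second in-memory stream. The helper `example_splice_stream()` is hypothetical, and whether this approach would actually be simpler or faster is exactly the part that needs investigation.

```php
<?php
/**
 * Hypothetical helper: copy $in to a new in-memory stream, replacing
 * $length bytes starting at byte offset $start with $replacement.
 */
function example_splice_stream( $in, int $start, int $length, string $replacement ) {
	$out = fopen( 'php://memory', 'w+' );

	// Copy the untouched prefix.
	rewind( $in );
	if ( $start > 0 ) {
		stream_copy_to_stream( $in, $out, $start );
	}

	// Write the replacement, then skip the replaced bytes in the source.
	fwrite( $out, $replacement );
	fseek( $in, $start + $length );

	// Copy the untouched suffix (everything up to the end of the stream).
	stream_copy_to_stream( $in, $out );

	rewind( $out );
	return $out;
}

$fp = fopen( 'php://memory', 'w+' );
fwrite( $fp, '<a href="#">Link</a>' );

// The "#" sits at byte offset 9 and is 1 byte long.
$updated = example_splice_stream( $fp, 9, 1, 'http://wordpress.org/' );
echo stream_get_contents( $updated ), "\n";
// <a href="http://wordpress.org/">Link</a>
```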

adamziel · 2023-02-07T13:37:12Z

6.2 dev note draft below. @dmsnell @gziolo @ockham feel free to edit this comment and adjust as you see fit:

WordPress 6.2 ships with WP_HTML_Tag_Processor – a tool to adjust HTML tag attributes in the block markup:

$p = new WP_HTML_Tag_Processor( '<a href="#"></a>' );
$p->next_tag( 'a' );
$p->remove_attribute( 'href' );
echo $p->get_updated_html();
// <a></a>

Before WordPress 6.2, the block markup was typically updated using regular expressions or the DOMDocument class. Both have downsides. The former is tedious and prone to security issues, while the latter uses libxml2 which does not support HTML5.

WP_HTML_Tag_Processor is safe and efficient. Unlike full-fledged HTML parsers, the processor avoids handling malformed markup, semantic problems, and building a document tree. Any problems that are present on the input are passed on to the browser. The processor doesn’t fix HTML just as it won’t break HTML.

The tradeoff is that it only offers a simplified API to modify HTML attributes. If you want to replace an img tag with a full-fledged figure layout, this API won’t offer that functionality. Similarly, the processor won’t help you replace all the child nodes of a particular div with a completely new markup. This system is focused on finding specific HTML tags and adding, removing, or updating the attributes on those tags.

Here's how to use WP_HTML_Tag_Processor:

Remove the href attribute from an anchor tag:

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag( 'a' );
$p->remove_attribute( 'href' );

Add a style attribute to the first tag in the document:

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag();
$p->set_attribute( 'style', 'display: none' );

Add a CSS class to the first tag having the wp-block-media-text__content class:

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag( array(
    'class_name' => 'wp-block-media-text__content'
) );
$p->add_class( 'wp-foo-bar' );

Add the srcset attribute to all image tags:

$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag( 'img' ) ) {
    // get_attribute() returns null when the attribute is missing.
    if (
        null !== $p->get_attribute( 'src' ) &&
        null === $p->get_attribute( 'srcset' )
    ) {
        // build_srcset() stands in for your own helper.
        $srcset = build_srcset( $p->get_attribute( 'src' ) );
        $p->set_attribute( 'srcset', $srcset );
    }
}

For additional documentation, refer to the GitHub Pull Request. If you want to learn more about the motivation for this new API, check the post that proposed "A new system for simply and reliably updating HTML attributes".

gziolo · 2023-02-09T07:48:08Z

I reviewed the listed todo items, and I have the impression that we can close this issue after landing the HTML Tag Processor in WordPress core. If there are any remaining issues, we can open issues that target individual use cases.

Find a canonical way of stringifying the processor. One that's different from (string) $w (#44410 (comment)).

There is a get_updated_html method that fulfills the requirement.

Review the documentation blocks.

This happened as part of the WP core merge.

Improve description for keys in test data providers that use numeric values (example).

Fixed in https://github.com/WordPress/wordpress-develop/blob/39bfc2580d9b0ea7e6ff45b8d904408d925cbc4b/tests/phpunit/tests/html/wpHtmlTagProcessor.php

Explore adding a has_next_tag method (#44600 (comment))

Is that still necessary? Could we use the bookmarking system instead?

adamziel · 2023-02-09T21:08:44Z

@gziolo I refreshed the list of related issues – there's still some work related to the character decoding, the documentation, and the bookmarks API.

gziolo · 2023-02-10T08:47:59Z

@gziolo I refreshed the list of related issues – there's still some work related to the character decoding, the documentation, the bookmarks API.

Thank you for updating the description with the recently started tasks.

A handbook page

Documentation is going to be automatically generated from PHPDoc comments for all classes and methods. It will be exposed in the Code Reference section of the official WordPress Developer Resources, similar to, for example, the WP_Block_Type class. What type of handbook page would you like to include in addition to that? Is that in the Block Editor Handbook?

In the codebase, there is the following comment included:

### Possible future direction for this module
 *
 *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
 *    This would increase the size of the changes for some operations but leave more
 *    natural-looking output HTML.
 *  - Decode HTML character references within class names when matching. E.g. match having
 *    class `1<"2` needs to recognize `class="1&lt;&quot;2"`. Currently the Tag Processor
 *    will fail to find the right tag if the class name is encoded as such.
 *  - Properly decode HTML character references in `get_attribute()`. PHP's
 *    `html_entity_decode()` is wrong in a couple ways: it doesn't account for the
 *    no-ambiguous-ampersand rule, and it improperly handles the way semicolons may
 *    or may not terminate a character reference.

Does #47040 from the list cover points 2 and 3? It looks like we still need an issue for point 1 about pruning whitespace.
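For readers unfamiliar with points 1 and 2, here is a small plain-PHP sketch of both behaviors. The `example_*` helpers are hypothetical and are not how the Tag Processor is implemented. The second helper leans on `html_entity_decode()` for brevity, even though point 3 notes that function's shortcomings.

```php
<?php
// Split a class attribute value on HTML whitespace, dropping empty pieces.
function example_split_classes( string $class_attr ): array {
	return preg_split( '/[ \t\f\r\n]+/', $class_attr, -1, PREG_SPLIT_NO_EMPTY );
}

// Point 1: remove classes and rebuild the value without stray spaces.
function example_remove_classes( string $class_attr, array $to_remove ): string {
	$kept = array_diff( example_split_classes( $class_attr ), $to_remove );
	return implode( ' ', $kept );
}

// Point 2: decode character references before comparing class names.
function example_has_class( string $raw_class_attr, string $class_name ): bool {
	$decoded = html_entity_decode( $raw_class_attr, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	return in_array( $class_name, example_split_classes( $decoded ), true );
}

var_dump( example_remove_classes( 'a b c', array( 'a', 'b' ) ) ); // string(1) "c"
var_dump( example_has_class( '1&lt;&quot;2', '1<"2' ) );           // bool(true)
```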

There is also an unlisted PR that introduces a new class for sourcing block attributes that gets referenced in the PR from the section with new APIs:

#46345

I see that some listed tasks reference or extend that work, they explore opening modifications to chunks of HTML:

What do you all see as a top priority and how people can get involved to help with efforts?

adamziel · 2023-02-10T09:06:49Z

Documentation is going to be automatically generated from PHPDoc comments for all classes and methods. It will be exposed in the Code Reference section of the official WordPress Developer Resources, similar to, for example, the WP_Block_Type class. What type of handbook page would you like to include in addition to that? Is that in the Block Editor Handbook?

I'm thinking of a proper guide on how to use WP_HTML_Tag_Processor. API docs don't quite cut it.

Does #47040 from the list cover points 2 and 3?

Point 3 – yes.
Point 2 – potentially? I'm not sure – it requires a deeper look at how CSS classes are read from the attributes. @dmsnell may know off the top of his head.

There is also an unlisted PR that introduces a new class for sourcing block attributes that gets referenced in the PR from the section with new APIs:

Good spot 👍 Added to the list

What do you all see as a top priority and how people can get involved to help with efforts?

I think the two open PRs may be blockers for the draft PR. Can you confirm @ockham ? If yes, then those would be the priority to me. Then the set_content one and the CSS selector one.

dmsnell · 2023-02-10T20:05:28Z

Does #47040 from the list cover points 2 and 3?

Yes, completely. The decoder class notes a possible enhancement: adding a method like "decoded_strpos" or "decoded_has_substring". When I was testing, though, performance wasn't impacted by calling decode on every call, since decoding only does work when character references are present, and in those (uncommon) cases we have to start decoding anyway.

ockham · 2023-02-13T11:38:28Z

I see that some listed tasks reference or extend that work, they explore opening modifications to chunks of HTML:

WP_HTML_Processor: Add set_content_inside_balanced_tags #47036

WP_HTML_Tag_Processor: Allow non-attribute lexical updates #47068

Tag Processor: Add bookmark invalidation logic #47559

What do you all see as a top priority and how people can get involved to help with efforts?

I think the two open PRs may be blockers for the draft PR. Can you confirm @ockham ? If yes, then those would be the priority to me. Then the set_content one and the CSS selector one.

Yes, that's correct. Plus the non-attribute updates one depends on the bookmark invalidation one. So currently, it's a cascade of dependencies, including one more PR that I filed only recently. They are, in order:

adamziel · 2023-02-23T16:36:48Z

Surfacing my explorations of parsing HTML into the correct tree structure, e.g. <p><p>Lorem -> <p></p><p>Lorem</p>

adamziel/wordpress-develop#1

gziolo · 2023-09-05T08:39:02Z

We can close this issue as done. The work continues in WordPress Trac:

HTML API: Introduce HTML Processor, a higher-level partner to the Tag Processor

The list of open tickets is grouped under HTML API component.

There still are two open PRs:

Add/html character reference decoder #47040 – should we create a ticket in Trac for this one?
WIP: Introduce class for sourcing block attributes from HTML #46345 – I believe this one requires the full HTML API, does it deserve a Trac ticket, too?

Explore adding a has_next_tag method

This item is pretty vague, and it seems to be a nice-to-have at this stage.

New HTML Tag Processor - An in-depth tutorial developer-blog-content#75 is tracked elsewhere.

dmsnell · 2023-09-05T16:43:13Z

Thanks @gziolo.

The two open PRs are good to stay open here. I don't know if we need a Trac ticket or not because I expect them to remain open for a while. #46345 will likely never merge but only serve as a guide to what we end up merging (which is one of the reasons I keep these open).

has_next_tag doesn't make sense to me at the moment. I'm not even sure where that came from. With bookmarks we don't need it, and has_next_tag implies peeking ahead of the parser, which we don't currently do.

strarsis · 2023-09-16T14:06:56Z

Can the existing WP_HTML_Tag_Processor API be used to replace an HTML element with another one?
Can an additional WP_HTML_Tag_Processor instance be created from additional HTML and then be used to replace a tag from another WP_HTML_Tag_Processor instance?

dmsnell · 2023-09-17T02:51:47Z

@strarsis the Tag Processor doesn't support replacing any tags, but the HTML Processor (which is being built on top of it) will support that. Because of the complexities of the semantic rules involved in replacing an element in HTML, though, this is coming at a slower pace.

adamziel mentioned this issue Sep 23, 2022

WP_HTML_Tag_Processor: Inject dynamic data to block HTML markup in PHP #42485

Merged

gziolo added the [Feature] Block API and [Type] Enhancement labels Sep 23, 2022

This was referenced Sep 23, 2022

Tag processor: update @since version tags to 6.2.0 #44432

Merged

Tag Processor: throw when supplied unacceptible attribute names. #44431

Merged

This was referenced Sep 26, 2022

Tag Processor: Document and test XSS prevention in set_attribute #44447

Merged

[Tag Processor] Merge the test files into a single file #44593

Closed

This was referenced Sep 30, 2022

Tag Processor: Remove the shorthand next_tag( $tag_name ) syntax #44595

Closed

Tag Processor: Add get_updated_html as a non-toString method of stringifying the markup #44597

Merged

adamziel mentioned this issue Oct 18, 2022

Tag Processor: Remove the shorthand next_tag( $tag_name ) syntax #45082

Closed

cbirdsong mentioned this issue Oct 25, 2022

Bugs with autogenerated heading anchors #36365

Open

This was referenced Nov 30, 2022

Add a CSS class to all static blocks with className supports #42269

Open

Add a has_class method to the public WP_HTML_Tag_Processor API #46232

Closed

hellofromtonya mentioned this issue Jan 25, 2023

Plugin: Backport PHP changes for WordPress 6.2 release #47187

Closed

85 tasks

bph mentioned this issue Feb 6, 2023

6.2 Dev Note Tracking Issue #47771

Closed

47 tasks

adamziel added the has dev note label Feb 7, 2023

gziolo mentioned this issue Feb 10, 2023

Block API #41236

Open

69 tasks

gziolo mentioned this issue Feb 13, 2023

WIP: Introduce class for sourcing block attributes from HTML #46345

Draft

ockham mentioned this issue Feb 14, 2023

HTML Tag Processor: Add WP 6.3 compat layer #47933

Merged

This was referenced Feb 21, 2023

Tag Processor: Add bookmark invalidation logic #47559

Merged

WP_HTML_Processor: Add set_content_inside_balanced_tags #47036

Closed

WP_HTML_Tag_Processor: Allow non-attribute lexical updates #47068

Closed

Add WP_HTML_Processor #47573

Closed

gziolo added the [Type] Tracking Issue label Feb 27, 2023

bph mentioned this issue Mar 2, 2023

New HTML Tag Processor - An in-depth tutorial WordPress/developer-blog-content#83

Closed

gziolo closed this as completed Sep 5, 2023
