Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WP_HTML_Tag_Processor overview issue #44410

Closed
19 of 26 tasks
adamziel opened this issue Sep 23, 2022 · 16 comments
Closed
19 of 26 tasks

WP_HTML_Tag_Processor overview issue #44410

adamziel opened this issue Sep 23, 2022 · 16 comments
Labels
[Feature] Block API API that allows to express the block paradigm. has dev note when dev note is done (for upcoming WordPress release) [Type] Enhancement A suggestion for improvement. [Type] Tracking Issue Tactical breakdown of efforts across the codebase and/or tied to Overview issues.

Comments

@adamziel
Copy link
Contributor

adamziel commented Sep 23, 2022

The follow-up work once the #42485 is merged:

Code is located in two folders:

cc @gziolo @dmsnell @georgeh

adamziel added a commit that referenced this issue Sep 23, 2022
#42485)

Introduce WP_HTML_Tag_Processor for reliably modifying HTML attributes.

Dynamic blocks often need to inject a CSS class name or set <img src /> in the rendered block HTML markup but lack the means to do so. WP_HTML_Tag_Processor solves this problem.

It scans through an HTML document to find specific tags, then transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener).

Importantly, it does not fully parse HTML or _recurse_ into the HTML structure. Instead WP_HTML_Tag_Processor scans linearly through a document and only parses the HTML tag openers.

Example:

```
$p = new WP_HTML_Tag_Processor('<div id="first"><img /></div>');
$p->next_tag('img')->set_attribute('src', '/wp-content/logo.png');
echo $p;
// <div id="first"><img src="/wp-content/logo.png" /></div>
```

For more details and context, see the original GitHub Pull Request at #42485 and the overview issue at #44410.

Co-authored-by: Adam Zieliński <adam@adamziel.com>
Co-authored-by: Dennis Snell <dennis.snell@automattic.com>
Co-authored-by: Grzegorz Ziółkowski <grzegorz.ziolkowski@automattic.com>
Co-authored-by: Sören Wrede <soerenwrede@gmail.com>
Co-authored-by: Colin Stewart <79332690+costdev@users.noreply.github.com>
@gziolo gziolo added [Feature] Block API API that allows to express the block paradigm. [Type] Enhancement A suggestion for improvement. labels Sep 23, 2022
@azaozz
Copy link
Contributor

azaozz commented Sep 23, 2022

Frankly I think #42485 was not ready for merging. Seems it lacks any security features, references, docs, how-to's, etc. Even doesn't mention what security and escaping requirements it has. It seems open to exploits in its current state.

This issue mentions some of that, but misses few things. For example validity of new attribute names. Also, it doesn't mention the needed security related inline docs and examples which are crucial for new APIs.

As I commented on #42485 (comment) the security side of this API still awaits a decision. Imho it needs to implement some restrictions that will make is much more robust and problems-free in the future, and now is the time for these restrictions. If this is not done (in time) there is a risk that WP will end up with another exploits-prone API just like the shortcodes. If you don't believe me please see the huge, slow, cumbersome regex-based shortcodes sanitization functions that had to be added in order to try to secure badly written, exploitable code in plugins. Please consider restricting the illegal chars in attribute names and html-entity encoding the htmlspecialchars in attribute values!

Also see #42485 (comment).

@adamziel
Copy link
Contributor Author

@azaozz#42485 was merged to continue iterating in smaller PRs, not because every kink was worked out. I agree with your points about html-entity encoding and restricting illegal chars. The former is now merged, the latter is in motion. The list of todos also reflects updating the docs.

About this point:

Decide whether on* attributes should be banned in set_attribute( ) to make adding inline JavaScript harder (see @peterwilsoncc's comment).

I don't think there's an effective way to do that. Say we restrict onclick. The developer can just work around it like this:

$p->set_attribute( '__onclick', 'alert()' );
$markup = str_replace( '__onclick', 'onclick', $p );

Notably, the processor does not allow injecting new tags, so there's no risk of injecting a rogue <script> tag. I'll mark that todo as completed for now – if anyone disagrees please speak out.

@adamziel
Copy link
Contributor Author

adamziel commented Sep 30, 2022

Find a canonical way of stringifying the processor. One that's different from (string) $w.

A getter like $p->getUpdatedHTML() or a magic property like $p->updatedHTML could do the trick here. I mean that in addition to having __toString(), not as a replacement.

@adamziel
Copy link
Contributor Author

adamziel commented Dec 1, 2022

Wild idea: Tag Processor could use seekable PHP streams instead of strings and internal pointers:

php > $fp = fopen('php://memory', 'w+');
php > fwrite($fp, '<a href="');
php > fwrite($fp, 'http://wordpress.org/');
php > fwrite($fp, '">Link</a>');
php > fseek($fp, 0);
php > echo fread($fp, 1024);
<a href="http://wordpress.org/">Link</a>

Potential benefits: Simpler, faster code that uses less memory (if we can get rid of $updated_html). This needs more investigation.

@adamziel
Copy link
Contributor Author

adamziel commented Feb 7, 2023

6.2 dev note draft below. @dmsnell @gziolo @ockham feel free to edit this comment and adjust as you see fit:


WordPress 6.2 ships with WP_HTML_Tag_Processor – a tool to adjust HTML tag attributes in the block markup:

$p = new WP_HTML_Tag_Processor( '<a href="#"></a>' );
$p->next_tag( 'a' );
$p->remove_attribute( 'href' );
echo $p->get_updated_html();
// <a></a>

Before WordPress 6.2, the block markup was typically updated using regular expressions or the DOMDocument class. Both have downsides. The former is tedious and prone to security issues, while the latter uses libxml2 which does not support HTML5.

WP_HTML_Tag_Processor is safe and efficient. Unlike full-fledged HTML parsers, the processor avoids handling malformed markup, semantic problems, and building a document tree. Any problems that are present on the input are passed on to the browser. The processor doesn’t fix HTML just as it won’t break HTML.

The tradeoff is that it only offers a simplified API to modify HTML attributes. If you want to replace an img tag with a full-fledged figure layout, this API won’t offer that functionality. Similarly, the processor won’t help you replace all the child nodes of a particular div with a completely new markup. This system is focused on finding specific HTML tags and adding, removing, or updating the attributes on those tags.

Here's how to use WP_HTML_Tag_Processor:

Remove the href attribute from an anchor tag:

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag( 'a' );
$p->remove_attribute( 'href' );

Add a style attribute to the first tag in the document:

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag();
$p->set_attribute( 'style', 'display: none' );

Add a CSS class to the first tag having the wp-block-media-text__content class:

$p = new WP_HTML_Tag_Processor( $html );
$p->next_tag( array(
    'class_name' => 'wp-block-media-text__content'
) );
$p->add_class( 'wp-foo-bar' );

Add the srcset attribute to all image tags:

$p = new WP_HTML_Tag_Processor( $html );
while ( $p->next_tag() ) {
    if (
        isset( $p->get_attribute( 'src' ) ) &&
        ! isset( $p->get_attribute( 'srcset' )
    ) {
        $srcset = build_srcset( $p->get_attribute( 'src' ) );
        $p->set_attribute( 'srcset', $srcset );
    }
}

For additional documentation, refer to GitHub Pull Request. If you want to learn more about the motivation for this new API, check the post that proposed A new system for simply and reliably updating HTML attributes.

@adamziel adamziel added the has dev note when dev note is done (for upcoming WordPress release) label Feb 7, 2023
@gziolo
Copy link
Member

gziolo commented Feb 9, 2023

I reviewed the listed todo items, and I have an impression that we can close this issue after landing HTML Tag Process in WordPress core. If there are any remaining issues, we can open issues that target individual use cases.

Find a canonical way of stringifying the processor. One that's different from (string) $w (#44410 (comment)).

There is get_updated_html method that fulfills the requirement.

Review the documentation blocks.

This happened as part of the WP core merge.

Improve description for keys in test data providers that use numeric values (example).

Fixed in https://github.com/WordPress/wordpress-develop/blob/39bfc2580d9b0ea7e6ff45b8d904408d925cbc4b/tests/phpunit/tests/html/wpHtmlTagProcessor.php

Explore adding a has_next_tag method (#44600 (comment))

Is that still necessary? Could we use the bookmarking system instead?

@adamziel
Copy link
Contributor Author

adamziel commented Feb 9, 2023

@gziolo I refreshed the list of related issues – there's still some work related to the character decoding, the documentation, and the bookmarks API.

@gziolo
Copy link
Member

gziolo commented Feb 10, 2023

@gziolo I refreshed the list of related issues – there's still some work related to the character decoding, the documentation, the bookmarks API.

Thank you for updating the description with the recently started tasks.

A handbook page

Documentation is going to be automatically generated from PHPDoc comments for all classes and methods. It will be exposed in the Code Reference section of Official WordPress Developer Resources, similar to for example WP_Block_Type class. What type of handbook page you would like to include in addition to that? Is that in the Block Editor Handbook?

In the codebase, there is the following comment included:

### Possible future direction for this module
 *
 *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
 *    This would increase the size of the changes for some operations but leave more
 *    natural-looking output HTML.
 *  - Decode HTML character references within class names when matching. E.g. match having
 *    class `1<"2` needs to recognize `class="1&lt;&quot;2"`. Currently the Tag Processor
 *    will fail to find the right tag if the class name is encoded as such.
 *  - Properly decode HTML character references in `get_attribute()`. PHP's
 *    `html_entity_decode()` is wrong in a couple ways: it doesn't account for the
 *    no-ambiguous-ampersand rule, and it improperly handles the way semicolons may
 *    or may not terminate a character reference.

Does #47040 from the list cover points 2 and 3? It looks like we still need an issue for point 1 about removing whitespaces.


There is also an unlisted PR that introduces a new class for sourcing block attributes that gets referenced in the PR from the section with new APIs:

#46345

I see that some listed tasks reference or extend that work, they explore opening modifications to chunks of HTML:

What do you all see as a top priority and how people can get involved to help with efforts?

@adamziel
Copy link
Contributor Author

Documentation is going to be automatically generated from PHPDoc comments for all classes and methods. It will be exposed in the Code Reference section of Official WordPress Developer Resources, similar to for example WP_Block_Type class. What type of handbook page you would like to include in addition to that? Is that in the Block Editor Handbook?

I'm thinking of a proper guide on how to use WP_HTML_Tag_Processor. API docs don't quite cut it.

Does #47040 from the list cover points 2 and 3?

Point 3 – yes.
Point 2 – potentially? I'm not sure – it requires a deeper look on how CSS classes are read from the attributes. @dmsnell may know off the top of his head.

There is also an unlisted PR that introduces a new class for sourcing block attributes that gets referenced in the PR from the section with new APIs:

Good spot 👍 Added to the list

What do you all see as a top priority and how people can get involved to help with efforts?

I think the two open PRs may be blockers for the draft PR. Can you confirm @ockham ? If yes, then those would be the priority to me. Then the set_content one and the CSS selector one.

@gziolo gziolo mentioned this issue Feb 10, 2023
69 tasks
@dmsnell
Copy link
Member

dmsnell commented Feb 10, 2023

Does #47040 from the list cover points 2 and 3?

Yes, completely. There's an enhancement possible noted in the decoder class which is adding a method like "decoded_strpos" or "decoded_has_substring" but when I was testing the performance wasn't impacted by calling decode on every call, since this only does something when character references are present, and in those cases (which are uncommon) we have to start decoding anyway.

@ockham
Copy link
Contributor

ockham commented Feb 13, 2023

I see that some listed tasks reference or extend that work, they explore opening modifications to chunks of HTML:

What do you all see as a top priority and how people can get involved to help with efforts?

I think the two open PRs may be blockers for the draft PR. Can you confirm @ockham ? If yes, then those would be the priority to me. Then the set_content one and the CSS selector one.

Yes, that's correct. Plus the non-attribute updates one depends on the bookmark invalidation one. So currently, it's a cascade of dependencies, including one more PR that I filed only recently. They are, in order:

  1. HTML Tag Processor: Add WP 6.3 compat layer #47933
  2. Tag Processor: Add bookmark invalidation logic #47559
  3. WP_HTML_Tag_Processor: Allow non-attribute lexical updates #47068
  4. WP_HTML_Processor: Add set_content_inside_balanced_tags #47036

@adamziel
Copy link
Contributor Author

Surfacing my explorations of parsing HTML into the correct tree structure, e.g. <p><p>Lorem -> <p></p><p>Lorem</p>

adamziel/wordpress-develop#1

@gziolo gziolo added the [Type] Tracking Issue Tactical breakdown of efforts across the codebase and/or tied to Overview issues. label Feb 27, 2023
@gziolo
Copy link
Member

gziolo commented Sep 5, 2023

We can close this issue as done. The work continues in WordPress Trac:

HTML API: Introduce HTML Processor, a higher-level partner to the Tag Processor

The list of open tickets is grouped under HTML API component.

There still are two open PRs:

Explore adding a has_next_tag method

This item is pretty vague, and it seems to be nice to have at this stage.

@gziolo gziolo closed this as completed Sep 5, 2023
@dmsnell
Copy link
Member

dmsnell commented Sep 5, 2023

Thanks @gziolo.

The two open PRs are good to stay open here. I don't know if we need a Trac ticket or not because I expect them to remain open for a while. #46345 will likely never merge but only serve as a guide to what we end up merging (which is one of the reasons I keep these open).

has_next_tag doesn't make sense to me at the moment. I'm not even sure where that came from. With bookmarks we don't need it, and has_next_tag implies peeking ahead of the parser, which we don't currently do.

@strarsis
Copy link
Contributor

Can the existing WP_HTML_Tag_Processor API be used to replace an HTML element with another one?
Can an additional WP_HTML_Tag_Processor instance be created from additional HTML and then be used to replace a tag from another WP_HTML_Tag_Processor instance?

@dmsnell
Copy link
Member

dmsnell commented Sep 17, 2023

@strarsis the Tag Processor doesn't support replacing any tags, but the HTML Processor (which is being built on top of it) will support that. because of the complexities of the semantic rules involved in replacing an element in HTML though, this is coming at a slower pace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Block API API that allows to express the block paradigm. has dev note when dev note is done (for upcoming WordPress release) [Type] Enhancement A suggestion for improvement. [Type] Tracking Issue Tactical breakdown of efforts across the codebase and/or tied to Overview issues.
Projects
None yet
Development

No branches or pull requests

6 participants