Add WP_XML_Tag_Processor and WP_XML_Processor #6713

Open. Wants to merge 62 commits into base: trunk

62 commits
9e69192
XML Processor: First stab
adamziel Jun 3, 2024
00a63ae
Don't update state in parse_name();
adamziel Jun 3, 2024
c9f3450
Use NameCharacters to parse tag names
adamziel Jun 3, 2024
3f645db
Support CData and processing instructions
adamziel Jun 3, 2024
323d065
Support stack of open elements
adamziel Jun 3, 2024
63ae575
Support breadcrumbs
adamziel Jun 3, 2024
9558db1
implement 2.11 End-of-Line Handling
adamziel Jun 3, 2024
40df7a6
Clean up comments
adamziel Jun 3, 2024
6bbe94e
Only support UTF-8
adamziel Jun 3, 2024
087dc96
Uncomment more tests
adamziel Jun 3, 2024
91310c5
Uncomment more tests
adamziel Jun 3, 2024
ecb0347
Restrict what's accepted after the root element is closed
adamziel Jun 3, 2024
7ae34bf
formalize parsing prolog, element, and misc
adamziel Jun 3, 2024
28e91d8
Remove the concept of a comment type
adamziel Jun 3, 2024
a3b5e93
Remove COMMENT_AS_XML_COMMENT
adamziel Jun 3, 2024
7285cd0
Validate whether the root closer was found
adamziel Jun 3, 2024
a6958c9
Pause when root isn't closed
adamziel Jun 3, 2024
43149aa
Remove references to html spec
adamziel Jun 3, 2024
3df4a19
Class-level comment
adamziel Jun 3, 2024
6c15968
Document parsing_stage
adamziel Jun 3, 2024
13ba001
Support PCData
adamziel Jun 3, 2024
ae87c69
Decode attributes values
adamziel Jun 3, 2024
c96bc74
Document the class a bit more
adamziel Jun 3, 2024
00448b0
Support entity decoding
adamziel Jun 3, 2024
aef3baa
Document more and replace HTML tags with XML-lookalikes
adamziel Jun 3, 2024
e5f3d35
Add a step() function to decouple the high-level structure from low-l…
adamziel Jun 3, 2024
be70441
Refactor to separate the higher-order state even more
adamziel Jun 3, 2024
b6261f3
Port matches_breadcrumbs
adamziel Jun 3, 2024
4d43efc
Separate WP_XML_Processor from WP_XML_Tag_Processor
adamziel Jun 3, 2024
6274923
Unit test breadcrumbs query
adamziel Jun 3, 2024
c702221
Add get_inner_text() method
adamziel Jun 4, 2024
9baff6e
Test get_inner_text for cdata
adamziel Jun 4, 2024
5b723e6
Further document get_inner_text
adamziel Jun 4, 2024
e0219cf
Remove test.php
adamziel Jun 4, 2024
dc717e8
Update class-wp-html-decoder.php
adamziel Jun 4, 2024
88261c8
Update class-wp-html-decoder.php
adamziel Jun 4, 2024
e2716bf
Make inner text an integral part of PCData elements to enable streaming
adamziel Jun 4, 2024
2f42035
PHPCS
adamziel Jun 4, 2024
b420b8b
Clerikal updates
adamziel Jun 4, 2024
1d4026b
Only decode the five mandatory entities plus numeric references
adamziel Jun 4, 2024
d614bfd
Streaming improvements
adamziel Jun 4, 2024
dfe8489
Include WP_XML_Processor
adamziel Jun 4, 2024
04d167e
Remove ?> from HEREDOC as it derailed PHP7
adamziel Jun 4, 2024
c2d46a0
Stop using <<<XML syntax entirely
adamziel Jun 4, 2024
5a65ecc
Pause on incomplete attribute
adamziel Jun 4, 2024
8a96804
PHPCBF
adamziel Jun 4, 2024
dd9f580
Expect error notices
adamziel Jun 4, 2024
436c3cd
Declare setExpectedIncorrectUsage at the beginning of eadh test
adamziel Jun 4, 2024
5401798
Use @expectedDeprecated annotation
adamziel Jun 4, 2024
5b2f879
PHP CBF
adamziel Jun 4, 2024
91b2f57
Use `@expectedIncorrectUsage`
adamziel Jun 4, 2024
e5b7cac
Try to calm down phpunit _doing_it_wrong detection
adamziel Jun 4, 2024
faeac14
Move expectedIncorrectUsage up
adamziel Jun 4, 2024
d792a6a
Tweak PHPUnit setup
adamziel Jun 4, 2024
1274876
Don't short-circuit matching breadcrumbs in text nodes
adamziel Jun 4, 2024
9ad4b9a
PHPCS
adamziel Jun 4, 2024
8c7888c
Remove get_inner_text
adamziel Jun 4, 2024
f06b8f8
Make text nodes incomplete until the next tag is available, except in…
adamziel Jun 6, 2024
9922153
Move into separate XML directories.
dmsnell Jun 8, 2024
9917ef3
Much around with the XML decoder.
dmsnell Jun 8, 2024
a905009
Add link to XML spec in class package docs.
dmsnell Jun 8, 2024
3e7b817
More stewing on the decoder and name/attribute value parsers
dmsnell Jun 10, 2024
222 changes: 222 additions & 0 deletions src/wp-includes/xml-api/class-wp-xml-decoder.php
@@ -0,0 +1,222 @@
<?php

/**
* XML API: WP_XML_Decoder class
*
* Decodes spans of raw text found inside XML content,
* whether found in an attribute or in a text node.
*
* Do not use this class on the contents of a CDATA section,
* as those sections are not encoded with the XML rules unless
* they are embedded XML content.
*
* @package WordPress
* @subpackage XML-API
* @since WP_VERSION
*/
class WP_XML_Decoder {
/**
* Decodes a span of XML text.
*
* Example:
*
* '&' = WP_XML_Decoder::decode( '&amp;' );
* '…' = WP_XML_Decoder::decode( '&#x2026;' );
*
* @todo Add examples of parse failures, and decide if it should fail or not.
*
* @since WP_VERSION
*
* @access private
*
* @param string $text Text document containing span of text to decode.
* @return string Decoded UTF-8 string.
*/
public static function decode( $text ) {
$decoded = '';
$end = strlen( $text );
$at = 0;
$was_at = 0;

while ( $at < $end ) {
$next_character_reference_at = strpos( $text, '&', $at );
if ( false === $next_character_reference_at || $next_character_reference_at >= $end ) {
break;
}

$start_of_potential_reference_at = $next_character_reference_at + 1;
if ( $start_of_potential_reference_at >= $end ) {
// @todo This is an error. The document ended too early; consume the rest as plaintext, which is wrong.
break;
}

/**
* First character after the opening `&`.
*/
$start_of_potential_reference = $text[ $start_of_potential_reference_at ];

/*
* If it's a named character reference, it will be one of the five mandated references.
*
* - `&amp;`
* - `&apos;`
* - `&gt;`
* - `&lt;`
* - `&quot;`
*
* The shortest of these, `&lt;` and `&gt;`, span four bytes including the `&`;
* the longest, `&apos;` and `&quot;`, span six.
*
* Example:
*
*          ╭ ampersand at 9 = $end - 6
* &apos;XML&apos; ($end = 15)
*          ╰────┴─ a named reference needs at least four bytes here,
*                  so the ampersand must sit at or before $end - 4.
*/
if (
$next_character_reference_at <= $end - 4 &&
(
'a' === $start_of_potential_reference ||
'g' === $start_of_potential_reference ||
'l' === $start_of_potential_reference ||
'q' === $start_of_potential_reference
)
) {
foreach ( array(
'amp;' => '&',
'apos;' => "'",
'lt;' => '<',
'gt;' => '>',
'quot;' => '"',
) as $name => $substitution ) {
// Compare from the byte after the `&`, since the names above do not include it.
if ( 0 === substr_compare( $text, $name, $start_of_potential_reference_at, strlen( $name ) ) ) {
$decoded .= substr( $text, $was_at, $next_character_reference_at - $was_at ) . $substitution;
$at = $start_of_potential_reference_at + strlen( $name );
$was_at = $at;
continue 2;
}
}

// @todo This is an invalid document. It should be communicated. Treat as plaintext and continue.
++$at;
continue;
}

/*
* The shortest numerical character reference is four characters.
*
* Example:
*
* &#9;
*/
if ( '#' !== $start_of_potential_reference || $next_character_reference_at + 4 > $end ) {
// @todo This is an error. This ampersand _must_ be encoded. Treat as plaintext and move on.
++$at;
continue;
}

$is_hex = 'x' === $text[ $start_of_potential_reference_at + 1 ];
if ( $is_hex ) {
$zeros_at = $start_of_potential_reference_at + 2;
$base = 16;
$digit_chars = '0123456789abcdefABCDEF';
$max_digits = 6; // `&#x10FFFF;`
} else {
$zeros_at = $start_of_potential_reference_at + 1;
$base = 10;
$digit_chars = '0123456789';
$max_digits = 7; // `&#1114111;`
}

$zero_count = strspn( $text, '0', $zeros_at );
$digits_at = $zeros_at + $zero_count;
$digit_count = strspn( $text, $digit_chars, $digits_at, $max_digits );
$semi_at = $digits_at + $digit_count;

if ( 0 === $digit_count || $semi_at >= $end || ';' !== $text[ $semi_at ] ) {
// @todo This is an error. Treat as plaintext and move on.
++$at;
continue;
}

$code_point = intval( substr( $text, $digits_at, $digit_count ), $base );
$character_reference = WP_HTML_Decoder::code_point_to_utf8_bytes( $code_point );
if ( '�' === $character_reference && 0xFFFD !== $code_point ) {
/*
* Stop processing if we got an invalid character AND the reference does not
* specifically refer code point FFFD (�).
*
* > It is a fatal error when an XML processor encounters an entity with an
* > encoding that it is unable to process. It is a fatal error if an XML entity
* > is determined (via default, encoding declaration, or higher-level protocol)
* > to be in a certain encoding but contains byte sequences that are not legal
* > in that encoding. Specifically, it is a fatal error if an entity encoded in
* > UTF-8 contains any ill-formed code unit sequences, as defined in section
* > 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level
* > protocol, it is also a fatal error if an XML entity contains no encoding
* > declaration and its content is not legal UTF-8 or UTF-16.
*
* See https://www.w3.org/TR/xml/#charencoding
*/
// @todo This is an error. Treat as plaintext and continue, which is wrong.
++$at;
continue;
}

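// Copy the plaintext between the last decoded reference and this ampersand.
$decoded .= substr( $text, $was_at, $next_character_reference_at - $was_at );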
$decoded .= $character_reference;
$at = $semi_at + 1;
$was_at = $at;
}

if ( 0 === $was_at ) {
return $text;
}

if ( $was_at < $end ) {
$decoded .= substr( $text, $was_at, $end - $was_at );
}

return $decoded;
}

/**
* Finds and parses the next entity in a given text starting after the
* given byte offset, and being entirely found within the given max length.
*
* @since WP_VERSION
*
* @todo Implement this function.
*
* @param string $text Text in which to search for an XML entity.
* @param int $starting_byte_offset Start looking after this byte offset.
* @param int $ending_byte_offset Stop looking if entity is not fully contained before this byte offset.
* @param int|null $entity_at Optional. If provided, will be set to byte offset where entity was
* found, if found. Otherwise, will not be set.
*
* @return string|null Parsed entity, if parsed, otherwise `null`.
*/
public static function next_entity( string $text, int $starting_byte_offset, int $ending_byte_offset, ?int &$entity_at = null ): ?string {
$at = $starting_byte_offset;
$end = $ending_byte_offset;

while ( $at < $end ) {
$remaining = $end - $at;
$amp_after = strcspn( $text, '&', $at, $remaining );

// There are no more possible entities.
if ( $amp_after === $remaining ) {
return null;
}

/*
* @todo Move the decoding logic from `decode()` above into here,
* then call this function in a loop from `decode()`.
*/

++$at;
}

return null;
}
}
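
For readers following the decoding rules described in the comments above, here is a minimal standalone sketch of the same idea: decode the five mandated named references plus decimal and hexadecimal numeric references, and leave anything else untouched. It is not the code in this patch. The function names `sketch_xml_decode()` and `sketch_code_point_to_utf8()` are hypothetical, the sketch avoids `WP_HTML_Decoder` so it can run on its own, and it assumes a lenient policy in which a malformed reference is kept as plain text (the patch still marks that decision as a `@todo`). Numeric references are accepted only when they name a code point allowed by the XML 1.0 `Char` production.

```php
<?php
/**
 * Hypothetical helper: encodes a Unicode code point as UTF-8 bytes.
 * Assumes the caller has already validated the code point.
 */
function sketch_code_point_to_utf8( int $cp ): string {
	if ( $cp < 0x80 ) {
		return chr( $cp );
	}
	if ( $cp < 0x800 ) {
		return chr( 0xC0 | ( $cp >> 6 ) ) . chr( 0x80 | ( $cp & 0x3F ) );
	}
	if ( $cp < 0x10000 ) {
		return chr( 0xE0 | ( $cp >> 12 ) )
			. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $cp & 0x3F ) );
	}
	return chr( 0xF0 | ( $cp >> 18 ) )
		. chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $cp & 0x3F ) );
}

/**
 * Hypothetical sketch of a lenient XML text decoder (not the patch's API).
 */
function sketch_xml_decode( string $text ): string {
	$named = array(
		'amp;'  => '&',
		'apos;' => "'",
		'lt;'   => '<',
		'gt;'   => '>',
		'quot;' => '"',
	);

	$decoded = '';
	$end     = strlen( $text );
	$at      = 0;

	while ( $at < $end ) {
		$amp_at = strpos( $text, '&', $at );
		if ( false === $amp_at ) {
			break;
		}

		// Copy the plain text before the ampersand, then step past the `&`.
		$decoded .= substr( $text, $at, $amp_at - $at );
		$at       = $amp_at + 1;

		// Named references: the names are compared from the byte after the `&`.
		foreach ( $named as $name => $substitution ) {
			if ( substr( $text, $at, strlen( $name ) ) === $name ) {
				$decoded .= $substitution;
				$at      += strlen( $name );
				continue 2;
			}
		}

		// Numeric references: `&#` digits `;` or `&#x` hex digits `;`, leading zeros allowed.
		if ( 1 === preg_match( '/\G#(?:x0*([0-9a-fA-F]{1,6})|0*([0-9]{1,7}));/', $text, $m, 0, $at ) ) {
			$cp = '' !== $m[1] ? hexdec( $m[1] ) : intval( $m[2], 10 );

			// XML 1.0 Char production: tab, LF, CR, and the ranges below.
			$is_xml_char = 0x9 === $cp || 0xA === $cp || 0xD === $cp
				|| ( $cp >= 0x20 && $cp <= 0xD7FF )
				|| ( $cp >= 0xE000 && $cp <= 0xFFFD )
				|| ( $cp >= 0x10000 && $cp <= 0x10FFFF );

			if ( $is_xml_char ) {
				$decoded .= sketch_code_point_to_utf8( $cp );
				$at      += strlen( $m[0] );
				continue;
			}
		}

		// Assumption: keep a malformed reference as plain text instead of failing.
		$decoded .= '&';
	}

	// Append whatever follows the last ampersand.
	if ( $at < $end ) {
		$decoded .= substr( $text, $at );
	}

	return $decoded;
}
```

Called as `sketch_xml_decode( 'a &lt; b &amp;&amp; c' )`, the sketch yields `a < b && c`. A strict processor would instead report a bare `&` or an out-of-range numeric reference as a well-formedness error, which is the choice the `@todo` notes in `decode()` leave open.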