Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set default encoding in the DOMDocument object #168

Merged
merged 2 commits into from
Jun 15, 2019
Merged

Set default encoding in the DOMDocument object #168

merged 2 commits into from
Jun 15, 2019

Conversation

idimopoulos
Copy link
Contributor

Solves #167 (If I am correct).

@goetas
Copy link
Member

goetas commented Jun 14, 2019

Hi, thanks for the PR. Is there a way to test this?

@idimopoulos
Copy link
Contributor Author

idimopoulos commented Jun 14, 2019

@goetas I am working on it right now. There is a very easy to test this in \Masterminds\HTML5\Tests\Parser\DOMTreeBuilderTest::testDocument by simply checking if the resulting $doc object has the correct (or default) encoding.

What I am worrying about though, is whether the encoding should derive by the document itself i.e. the charset attribute in the head HTML element tag. I will be back shortly with something.

@idimopoulos
Copy link
Contributor Author

I forced pushed to add a failing test first and then I will push the fix.
The thing is that I am not entirely sure about the nature of the parser.
In \Masterminds\HTML5::parse the scanner is instantiated with the options !empty($options['encoding']) ? $options['encoding'] : 'UTF-8'

However, this means only that the scanner is meant to scan using a 'UTF-8' parser (I mean that it will deal with the document considering that it is a UTF-8 document even if it is not.
However, the DOMTreeBuilder, that actually has the $doc property, should have the encoding of the document/input itself. Since it is an HTML5 parser, I can assume that an HTML is expected to be passed. However, the input is not required to build the DOMTreeBuilder.

I see all over around that the UTF-8 is used as the default option but I am not sure whether that should apply for the DOMTreeBuilder (maybe you people can shed some light here).
In any way, I am providing the fix as in our case, as I mentioned in #167, Symfony uses the new \DOMDocument('1.0', $charset); which is passed in the method to build the Document so even for the legacy cases, the charset is not detected through the head metatag. Thus, at least to be consistent to begin with, we could have this set to UTF-8 by default.

Do you think I need to open an issue to automatically detect the charset from the HTML if available?

@idimopoulos idimopoulos changed the title Set default encoding in the DOMDocument object. Test that the encoding is there by default Jun 14, 2019
@idimopoulos idimopoulos changed the title Test that the encoding is there by default Set default encoding in the DOMDocument object Jun 14, 2019
@idimopoulos
Copy link
Contributor Author

Sorry I messed up a bit with the commit messages.

@goetas
Copy link
Member

goetas commented Jun 15, 2019

Looks good, thanks!

@goetas goetas merged commit ad8802c into Masterminds:master Jun 15, 2019
@idimopoulos idimopoulos deleted the html5_encoding branch July 25, 2019 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants