Use correct encoding when fetching non-UTF-8 site metadata #2015
When site metadata is fetched, the page is assumed to be encoded in UTF-8, but this is not always the case. When it is not, the metadata is displayed in the frontend as garbled characters. This PR adds a check on the `charset` property of the fetched page, if present, and re-decodes the fetched bytes with the specified encoding if possible. If an unknown encoding is specified, it falls back to the original UTF-8 data.

Fixes #1858
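The re-decoding step with its UTF-8 fallback can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from this PR: the function name `decode_with_charset` is hypothetical, and only ISO-8859-1 is handled by hand as a stand-in for a full encoding library. It also treats `iso-8859-1` as strict Latin-1, which is a simplification of how browsers interpret that label.

```rust
/// Hypothetical helper (not taken from the PR): decode fetched bytes using
/// the charset declared by the page, falling back to UTF-8 when the charset
/// is missing or unknown.
fn decode_with_charset(bytes: &[u8], charset: Option<&str>) -> String {
    // Default assumption: the page is UTF-8. Invalid sequences become U+FFFD.
    let utf8 = String::from_utf8_lossy(bytes).into_owned();

    // No declared charset: keep the UTF-8 result.
    let label = match charset {
        Some(c) => c.trim().to_ascii_lowercase(),
        None => return utf8,
    };

    match label.as_str() {
        "utf-8" | "utf8" => utf8,
        // Assumption for this sketch: strict Latin-1, where every byte maps
        // to the Unicode code point with the same value.
        "iso-8859-1" | "latin1" => bytes.iter().map(|&b| b as char).collect(),
        // Unknown encoding: fall back to the original UTF-8 data.
        _ => utf8,
    }
}
```

Calling `decode_with_charset(&[0x63, 0x61, 0x66, 0xE9], Some("ISO-8859-1"))` yields `café`, whereas decoding the same bytes with no charset leaves a replacement character in place of the `é`.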
NOTE: An unrelated fix is also included, because the website in the original issue started its response with a blank line before the DOCTYPE declaration. To handle this, `trim_start()` was added to the HTML parsing.
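As an illustration of why the trimming matters, the sketch below checks for a leading DOCTYPE after stripping leading whitespace. The function `starts_with_doctype` is hypothetical and only demonstrates the effect of `str::trim_start()`; how the actual parser consumes the trimmed input is not shown in this PR.

```rust
/// Hypothetical check (not taken from the PR): does the document begin with
/// a DOCTYPE declaration once leading whitespace is ignored?
fn starts_with_doctype(html: &str) -> bool {
    html.trim_start() // drops leading blank lines, spaces, and tabs
        .get(..9) // "<!doctype" is 9 bytes; None if the input is shorter
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case("<!doctype"))
}
```

With the trim in place, a response such as `"\n<!DOCTYPE html>..."` is recognized as starting with a DOCTYPE; without it, the first character seen would be the newline.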