Use correct encoding when fetching non-UTF-8 site metadata #2015

rgroothuijsen · 2021-12-29T22:59:38Z

When site metadata is fetched, the response is assumed to be encoded in UTF-8, but this is not always the case. Metadata from pages in other encodings is then displayed in the frontend as garbled characters. This PR adds a check on the charset property of the fetched page, if present, and re-decodes the fetched bytes with the specified encoding if possible. If an unknown encoding is specified, it falls back to the original UTF-8 data.
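The fallback logic described above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: the function name `decode_with_charset` is hypothetical, and only ISO-8859-1 (treated here as plain Latin-1, one byte per code point) is handled as a non-UTF-8 example. A real implementation would more likely look the label up in a full encoding table, for example via a crate such as `encoding_rs`.

```rust
/// Hypothetical helper (not from the PR): decode fetched bytes using the
/// charset label declared by the page, falling back to UTF-8.
fn decode_with_charset(bytes: &[u8], charset: Option<&str>) -> String {
    // Default assumption: the page is UTF-8. Invalid sequences become U+FFFD.
    let utf8 = String::from_utf8_lossy(bytes).into_owned();

    // No charset declared: keep the UTF-8 interpretation.
    let label = match charset {
        Some(l) => l.trim().to_ascii_lowercase(),
        None => return utf8,
    };

    match label.as_str() {
        // Assumption for this sketch: plain Latin-1, where each byte maps
        // directly to the Unicode code point with the same value.
        "iso-8859-1" | "latin1" => bytes.iter().map(|&b| b as char).collect(),
        // UTF-8 or an unknown label: fall back to the original UTF-8 data.
        _ => utf8,
    }
}
```

Calling `decode_with_charset(&body, Some("ISO-8859-1"))` re-decodes the body as Latin-1, while an unrecognized label such as `"x-unknown"` returns the same string as the UTF-8 default.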

Fixes #1858

NOTE: An unrelated fix is also included. The website in the original issue started its response with a blank line before the DOCTYPE declaration, so trim_start() was added to the HTML parsing.
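To illustrate why leading whitespace matters, here is a small sketch of a DOCTYPE check that trims first. The function `starts_with_doctype` is hypothetical and only assumes that the parser inspects the start of the document; the actual check in the codebase may differ.

```rust
/// Hypothetical check (not from the PR): does the document begin with a
/// DOCTYPE declaration once leading whitespace is ignored?
fn starts_with_doctype(html: &str) -> bool {
    // trim_start() drops leading whitespace, including blank lines.
    let trimmed = html.trim_start();
    // Compare the first 9 bytes case-insensitively; `get` returns None if
    // the input is too short or the cut is not on a char boundary.
    trimmed
        .get(..9)
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case("<!doctype"))
}
```

Without the trim, a response such as `"\n<!DOCTYPE html>"` would not match a prefix check, because the first character is a newline.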

dessalines

Thanks so much for this!

rgroothuijsen added 2 commits December 29, 2021 21:12

Use correct encoding when fetching non-UTF-8 site metadata

af7d971

Style fixes

40912d4

dessalines requested review from dessalines and Nutomic January 3, 2022 16:26

dessalines approved these changes Jan 3, 2022

dessalines mentioned this pull request Jan 3, 2022

Intermittent server crashes on html5ever tokenizer #1964

Closed

Nutomic merged commit 661f97a into LemmyNet:main Jan 6, 2022

hanubeki mentioned this pull request Jan 9, 2023

Encoding aliases not supported when fetching webpages. #2648

Closed
