Add option to specify how many characters to consider when searching for the charset in the content header #320

benoit74 · 2024-06-17T12:59:06Z

At https://www.marxists.org/espanol/justo/suvida.htm, the charset declared in the HTML header unfortunately sits far into the document (we need 1028 bytes to read it in full, instead of the default 1024 bytes).

Currently, we arbitrarily consider only the first 1024 bytes of the content when looking up the charset. This default makes sense as a compromise between the ability to find all charsets and performance / memory footprint. Still, it would help a lot if we could customize it for rare cases like this one, where the content type is specified, is somewhat unusual (windows-1252 here), and we don't mind exploring more bytes of every document.

I suggest we add an option to customize this "magic number".
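
To illustrate the idea, here is a minimal sketch of a charset lookup with a configurable window. It is not the project's actual code: the names `find_charset`, `header_length` and `DEFAULT_HEADER_LENGTH`, as well as the regular expression, are assumptions made up for this example.

```python
import re
from typing import Optional

# Hypothetical default, mirroring the 1024-byte window described above.
DEFAULT_HEADER_LENGTH = 1024

# Matches a charset declaration inside a <meta> tag, e.g.
# <meta charset="utf-8"> or
# <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">.
CHARSET_RE = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def find_charset(
    content: bytes, header_length: int = DEFAULT_HEADER_LENGTH
) -> Optional[str]:
    """Return the charset declared in the first `header_length` bytes, if any."""
    match = CHARSET_RE.search(content[:header_length])
    return match.group(1).decode("ascii") if match else None


# A page whose declaration starts after 1012 bytes of other header content,
# so it does not fit in the default window.
page = b"<html><head>" + b" " * 1000 + b'<meta charset="windows-1252"></head>'

print(find_charset(page))                      # default window: declaration missed
print(find_charset(page, header_length=2048))  # larger window: declaration found
```

Note that a window ending in the middle of the charset value would make this sketch return a truncated name, which is one more reason to let callers widen the window when they know declarations can sit this far into the content.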

benoit74 added the enhancement New feature or request label Jun 17, 2024

benoit74 added this to the 2.0.2 milestone Jun 17, 2024

benoit74 self-assigned this Jun 17, 2024

benoit74 mentioned this issue Jun 17, 2024

Add option to specify content header length #321

Merged

benoit74 closed this as completed in #321 Jun 18, 2024
