Add option to specify how many characters to consider when searching for the charset in the content header #320

Closed
benoit74 opened this issue Jun 17, 2024 · 0 comments · Fixed by #321

Comments

@benoit74
Collaborator

At https://www.marxists.org/espanol/justo/suvida.htm, the charset specified in the HTML header is unfortunately far into the document: we need 1028 bytes to find it in full, instead of the default 1024 bytes.

Currently, we consider only the first 1024 bytes of the content when looking for the charset, which was an arbitrary choice. This default makes sense as a compromise between the ability to find all charsets and performance / memory footprint, but it would help a lot if we could customize it for rare cases like this one, where the content-type is specified, is somewhat unusual (windows-1252 here), and we don't mind exploring more bytes on all contents.

I suggest we add an option to customize this "magic number".
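To illustrate the idea, here is a minimal sketch of what a configurable limit could look like. It is not the project's actual code: `find_charset`, `max_bytes` and `DEFAULT_CHARSET_SEARCH_BYTES` are hypothetical names, and the regex is a simplified stand-in for whatever detection is really used.

```python
import re
from typing import Optional

# Hypothetical constant: the current hard-coded "magic number".
DEFAULT_CHARSET_SEARCH_BYTES = 1024

# Simplified pattern: a charset name followed by a delimiter, so that a
# declaration cut in half by the byte limit is not reported as a match.
CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)(?=["'\s;/>])""",
    re.IGNORECASE,
)


def find_charset(
    content: bytes, max_bytes: int = DEFAULT_CHARSET_SEARCH_BYTES
) -> Optional[str]:
    """Return the charset declared in the first max_bytes of content, if any."""
    match = CHARSET_RE.search(content[:max_bytes])
    return match.group(1).decode("ascii").lower() if match else None


# A document whose declaration straddles the 1024-byte boundary.
doc = (
    b"<html><head>"
    + b" " * 950
    + b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
)

print(find_charset(doc))                  # None: declaration is cut off
print(find_charset(doc, max_bytes=2048))  # windows-1252
```

Keeping 1024 as the default preserves current behavior, and only callers who hit a case like the page above would pay for scanning more bytes.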

@benoit74 benoit74 added the enhancement New feature or request label Jun 17, 2024
@benoit74 benoit74 added this to the 2.0.2 milestone Jun 17, 2024
@benoit74 benoit74 self-assigned this Jun 17, 2024