Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html file is not reported as UTF8 after conversion #381

Closed
hrvoj3e opened this issue Nov 8, 2023 · 1 comment · Fixed by #533
Closed

html file is not reported as UTF8 after conversion #381

hrvoj3e opened this issue Nov 8, 2023 · 1 comment · Fixed by #533
Labels
bug Something isn't working CLI Anything related to the CLI script (normalizer)
Milestone

Comments

@hrvoj3e
Copy link

hrvoj3e commented Nov 8, 2023

Provide the file
110-original.zip

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and past the result in here.

❯ # rm+unzip

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:42:49,817 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:42:49,821 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:42:49,821 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:42:49,830 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:42:49,830 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250


❯ normalizer -rfnvvv 110-original.htm
2023-11-08 16:39:42,180 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:39:42,183 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:39:42,184 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:39:42,192 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:39:42,192 | DEBUG | Encoding detection: cp1250 is most likely the one.
{
    "path": "/home/adax/code/other/encoding/110-original.htm",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.783,
    "coherence": 66.66,
    "unicode_path": "/home/adax/code/other/encoding/110-original.htm",
    "is_preferred": true
}

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:41:07,958 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:41:07,961 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 1.267000 %
2023-11-08 16:41:07,962 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:41:07,970 | Level 5 | We detected language [('English', 0.7029), ('Indonesian', 0.572), ('Dutch', 0.51), ('Italian', 0.4949), ('Czech', 0.4862), ('Spanish', 0.4806), ('Croatian', 0.4724), ('Norwegian', 0.4692), ('Slovene', 0.4669), ('Romanian', 0.4632), ('Hungarian', 0.4624), ('Slovak', 0.4605), ('Finnish', 0.4565), ('German', 0.4533), ('Swedish', 0.4453), ('French', 0.443), ('Danish', 0.4366), ('Portuguese', 0.4116), ('Polish', 0.4113), ('Lithuanian', 0.3931), ('Estonian', 0.3828), ('Turkish', 0.3828), ('Vietnamese', 0.3795)] using cp1250
2023-11-08 16:41:07,970 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250

enca will however detect UTF-8 as it should

❯ # rm+unzip

❯ enca -L hr 110-original.htm
Unrecognized encoding

❯ normalizer -rfnvvv 110-original.htm

❯ enca -L hr 110-original.htm
Universal transformation format 8 bits; UTF-8
  CRLF line terminators

Expected encoding
Expected normalizer to show UTF-8 encoding after conversion to UTF-8.
Am I wrong here?

Desktop (please complete the following information):

  • OS: Linux
  • Python version 3.11.5
  • Package version charset-normalizer==3.3.2

Additional context
I know. Html is not the same as text.
But I will document this here.

I think that "declarative mark" should not take over like that. But I am new to this encoding world....

@hrvoj3e hrvoj3e added detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed labels Nov 8, 2023
@hrvoj3e hrvoj3e changed the title html file is not UTF8 after conversion html file is not reported as UTF8 after conversion Nov 8, 2023
@Ousret
Copy link
Member

Ousret commented Nov 11, 2023

Yes, you are correct.
What you have is somewhat edge, but problematic nonetheless.

I think that "declarative mark" should not take over like that. But I am new to this encoding world....

Not entirely true, it's more complicated than that.

Fortunately, I know how to fix this. I don't know exactly when, but soon.
The idea is to do a preg replace within the normalizer CLI if there is a declarative mark.

@Ousret Ousret added this to the Next release milestone Sep 21, 2024
@Ousret Ousret added bug Something isn't working CLI Anything related to the CLI script (normalizer) and removed help wanted Extra attention is needed detection Related to the charset detection mechanism, chaos/mess/coherence labels Sep 21, 2024
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 13, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 13, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 13, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 14, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 14, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 14, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 14, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 14, 2024
##### v3.4.0 

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CLI Anything related to the CLI script (normalizer)
Development

Successfully merging a pull request may close this issue.

2 participants