[DETECTION] Finnish in UTF-8 detected as Latin-1 when mistaken html meta element present #537

Closed
jkseppan opened this issue Oct 2, 2024 · 1 comment · Fixed by #538
Labels
detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed

Comments

@jkseppan

jkseppan commented Oct 2, 2024

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.

https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html

(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)

Verbose output

2024-10-02 08:40:59,849 | Level 5 | Detected declarative mark in sequence. Priority +1 given for latin_1.
2024-10-02 08:40:59,852 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.533000 %
2024-10-02 08:40:59,852 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-10-02 08:40:59,857 | Level 5 | We detected language [('English', 0.656), ('Hungarian', 0.5849), ('French', 0.578), ('Spanish', 0.5486), ('Norwegian', 0.5294), ('Dutch', 0.5243), ('Finnish', 0.5221), ('Indonesian', 0.5191), ('Italian', 0.5174), ('Estonian', 0.5152), ('Danish', 0.5047), ('Swedish', 0.4706), ('Slovene', 0.4669), ('Croatian', 0.4662), ('Portuguese', 0.4648), ('Czech', 0.4546), ('Romanian', 0.4492), ('German', 0.4409), ('Slovak', 0.4296), ('Turkish', 0.4224), ('Polish', 0.3995), ('Lithuanian', 0.3933), ('Vietnamese', 0.3714)] using latin_1
2024-10-02 08:40:59,857 | DEBUG | Encoding detection: latin_1 is most likely the one.
{
    "path": "/tmp/finnish-utf-8-latin-1-confusion.html",
    "encoding": "latin_1",
    "encoding_aliases": [
        "8859",
        "cp819",
        "csisolatin1",
        "ibm819",
        "iso8859",
        "iso8859_1",
        "iso_8859_1",
        "iso_8859_1_1987",
        "iso_ir_100",
        "l1",
        "latin",
        "latin1"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.533,
    "coherence": 65.6,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

This should be UTF-8. One clue is that the output includes the word PÃ¤Ã¤tÃ¶sehdotus, which is a mangled version of Päätösehdotus.

Most nontrivial Finnish text will include several instances of the character ä and possibly ö. Upper-case versions Ä and Ö are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, these become

  • ä → \xc3\xa4 → Ã¤
  • ö → \xc3\xb6 → Ã¶
  • Ä → \xc3\x84 → Ã and a control character (Latin-1), or Ã„ (Windows-1252)
  • Ö → \xc3\x96 → Ã and a control character (Latin-1), or Ã– (Windows-1252)
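The mapping above can be reproduced with the standard library alone. This is a minimal sketch; the helper name `mangle` is made up for illustration and is not part of charset_normalizer.

```python
def mangle(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    # errors="replace" because cp1252 leaves a few byte values undefined.
    return text.encode("utf-8").decode(wrong_codec, errors="replace")


if __name__ == "__main__":
    for ch in "äöÄÖ":
        # Columns: character, UTF-8 bytes, Latin-1 reading, Windows-1252 reading.
        print(ch, ch.encode("utf-8"),
              repr(mangle(ch, "latin_1")), repr(mangle(ch, "cp1252")))
```

Under Latin-1 the second byte of Ä and Ö lands in the C1 control range, while Windows-1252 maps the same bytes to a lowered quote and an en dash.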

The characters Ã, ¤, ¶ and „ do not appear in normal Finnish text. Ã could possibly appear in foreign names, but even then it would be very unlikely in the middle of a word. ¤ is the obscure "currency sign" character, whose code point Latin-9 (ISO-8859-15) reassigned to the euro sign; the euro sign does occur in Finnish text but would still be very unlikely in the combination Ã€. (The pilcrow might appear in some typography text and the lowered quote might appear in old-fashioned literature. The en dash is normal.)
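To illustrate how this observation could be turned into a check, here is a rough sketch that counts the telltale two-character pairs in a single-byte decoding. The function names and the "any pair at all" threshold are assumptions for illustration only; this is not how charset_normalizer scores candidates, nor what the fix in #538 does.

```python
import re

# U+00C3 followed by the second UTF-8 byte of ä, ö, Ä or Ö as it appears
# in Latin-1 (U+00A4, U+00B6, C1 controls U+0084/U+0096) or in
# Windows-1252 (U+201E lowered quote, U+2013 en dash).
SUSPICIOUS = re.compile("\u00c3[\u00a4\u00b6\u0084\u0096\u201e\u2013]")


def count_suspicious_pairs(decoded: str) -> int:
    """Count pairs that look like Finnish UTF-8 read as Latin-1/cp1252."""
    return len(SUSPICIOUS.findall(decoded))


def looks_like_mangled_utf8(raw: bytes, codec: str = "latin_1") -> bool:
    """Hypothetical heuristic: the bytes are valid UTF-8 and their
    single-byte decoding contains at least one suspicious pair."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return count_suspicious_pairs(raw.decode(codec, errors="replace")) > 0
```

For example, the UTF-8 bytes of Päätösehdotus contain three such pairs when read as Latin-1, whereas the genuine Latin-1 bytes of the same word are not valid UTF-8 at all.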

Desktop (please complete the following information):

  • OS: macOS 14.7
  • Python version 3.12.6
  • Package version 3.3.2

Additional context

My guess is that this kind of thing happens when someone set up a CMS in the 1990s when Finnish text was commonly encoded in Latin-1 or Windows-1252, and later the data store was changed to use UTF-8 but the meta tags were neglected.
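A stale declaration of this kind can be spotted without a full detector. The sketch below uses only the standard library: it pulls the declared charset out of the first few kilobytes and checks whether the bytes are in fact valid, non-ASCII UTF-8. The regular expression is deliberately simplified, and the function names and the 4096-byte window are assumptions made for this example.

```python
import codecs
import re
from typing import Optional

# Simplified pattern: the first "charset=..." near the top of the document.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE)
UTF8_NAME = codecs.lookup("utf-8").name


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the normalized codec name declared in the markup, if any."""
    match = META_CHARSET.search(raw[:4096])
    if match is None:
        return None
    try:
        return codecs.lookup(match.group(1).decode("ascii")).name
    except LookupError:
        return None  # declared name is unknown to this interpreter


def stale_declaration(raw: bytes) -> bool:
    """True when a non-UTF-8 charset is declared but the bytes are
    valid UTF-8 containing at least one non-ASCII character."""
    declared = declared_charset(raw)
    if declared is None or declared == UTF8_NAME:
        return False
    if raw.isascii():
        return False  # pure ASCII is consistent with either encoding
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Text in a legacy single-byte encoding rarely happens to be valid UTF-8 once it contains accented letters, so a successful strict UTF-8 decode is strong evidence that the declaration, not the data, is out of date.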

@jkseppan jkseppan added detection Related to the charset detection mechanism, chaos/mess/coherence help wanted Extra attention is needed labels Oct 2, 2024
@Ousret
Member

Ousret commented Oct 2, 2024

This case has been fixed in #538.
The fix will be available in the next release.

@Ousret Ousret closed this as completed Oct 2, 2024
zemnmez-renovate-bot added a commit to zemn-me/monorepo that referenced this issue Oct 9, 2024
##### v3.4.0 (`https://github.com/Ousret/charset_normalizer/blob/HEAD/CHANGELOG.md#340-2024-10-08`)

##### Added

-   Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
-   Support for Python 3.13 ([#512](jawah/charset_normalizer#512))

##### Fixed

-   Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
-   Improved the general reliability of the detector based on user feedbacks. ([#520](jawah/charset_normalizer#520)) ([#509](jawah/charset_normalizer#509)) ([#498](jawah/charset_normalizer#498)) ([#407](jawah/charset_normalizer#407)) ([#537](jawah/charset_normalizer#537))
-   Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. ([#381](jawah/charset_normalizer#381))