Add the ability to specify a list of candidate encodings when guessing encoding (#36951) #208550
f3b6ec4 to ff546d5
@yutotnh thanks for this, I have made some changes on top of yours. One thing I would like to discuss outside of this PR is your change of Can you elaborate why you want to change the threshold and should that maybe be another new setting?
Just installed Insiders and have two bugs. One, unrelated: trying to open files by right-click now freezes the file explorer window and the menu never pops up; maybe VS Code Insiders or something related is preventing it from getting the menu items list. Second, the guess DOES NOT WORK AT ALL for windows1250. I tried adding it to the list alone, adding it as a pair with utf8 in the candidate JSON list, and reversing the order; it always opens as utf8. I am adding a file for testing, zipped to avoid modification on saving here. The file always opens as UTF; the detection of windows1250 never passes. Maybe lower the required score in jschardet when this candidate list is used?
@peminator I can see a change in
When in VS Code insiders I configure:
The file opens as As for the change in
bpasero seems like a problem specific to windows1250, maybe...
This opens the file as 1252... but it is not displayed properly, see here: some characters are OK, but some get the unknown symbol and some get misinterpreted, see the last marked letter.
I have entered the correct and expected windows1250 first and it is skipped, I wonder why. Windows-1250 is a Central European encoding, used in Slovak or Czech texts, and also the default for older Windows before it went UTF-8, so I wonder how it gets skipped.
@peminator I do not see
@aadsm any idea why windows1250 is not detected properly in this file?
@peminator
The detection of utf8 seems more reliable than that of the other encodings, so when using utf8 and another encoding it's best to have files.encoding set to the non-utf8 encoding, as that's the fallback encoding that is used.
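The configuration suggested in this comment could be written down roughly as follows. This is only a sketch: the `EncodingSettings` type and the `legacyProjectSettings` name are hypothetical and exist purely for illustration. The setting names `files.encoding` and `files.candidateGuessEncodings` come from this thread, `files.autoGuessEncoding` is the existing VS Code switch that turns guessing on, and the encoding ids follow the spelling used in the comments (`utf8`, `windows1250`).

```typescript
// Hypothetical type for this sketch only; it is not a VS Code API.
interface EncodingSettings {
  "files.autoGuessEncoding": boolean;
  "files.candidateGuessEncodings": string[];
  "files.encoding": string;
}

// Guess between utf8 and windows1250, and fall back to the
// non-utf8 encoding when the guess is inconclusive.
const legacyProjectSettings: EncodingSettings = {
  "files.autoGuessEncoding": true,
  "files.candidateGuessEncodings": ["utf8", "windows1250"],
  "files.encoding": "windows1250",
};
```

With settings like these, a file that is confidently detected as utf8 would open as utf8, and anything else would open with the fallback.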
Yes nfrance709, that seems right, then it opens as expected in windows1250. But I need exactly the opposite: have everything as UTF except old files opened in windows1250, because of two old projects I still have to manage (I may store a workspace setting for it locally, but I also often open files directly on an FTP server).
If you need the ability to have new files created as utf8, for example with the
Seems overcomplicated. All I need now would be a fix in jschardet so that it also properly recognizes windows1250. I don't see any viable reason why it gets no probability listed in the result of the guess; in #208550 (comment) mentioned above there is no mention of 1250... something fishy there.
I have found a particular situation in which Encoding Detection is not working properly. Having VSCode closed, when I open a file with
I'm attaching the file that I'm using to perform the tests. Since I can't upload a file with
♥ Perfect! ♥ Great work everyone contributing to this, both on the vscode part and on improving the jschardet lib to be better at recognizing win1250.
Corresponds to feature request #36951.
This request adds a new setting, files.candidateGuessEncodings. Setting this will narrow down the encodings guessed by the charset auto-guesser (jschardet).
The threshold is also lowered from 0.2 to 0 so that the encoding can be guessed as much as possible.
This is to ensure that the encoding is returned even when the confidence level is small, as in the following added text
In case it cannot be determined, files.encoding is respected.
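The behaviour described above (narrow the guesses to the candidate list, accept low-confidence guesses, fall back to files.encoding otherwise) can be sketched as below. This is not the code from this PR: `Guess`, `pickEncoding`, and their parameters are hypothetical names for illustration, the guesses are passed in as plain data instead of coming from jschardet, and treating an empty candidate list as "no restriction" and using a strict `>` comparison against the threshold are both assumptions.

```typescript
// One result from a charset detector: an encoding id and a confidence in [0, 1].
interface Guess {
  encoding: string;
  confidence: number;
}

// Hypothetical helper, not the implementation in this PR.
function pickEncoding(
  guesses: Guess[],
  candidates: string[],
  fallback: string,
  minimumThreshold = 0
): string {
  // Assumption: an empty candidate list means "do not restrict the guess".
  const allowed = guesses.filter(
    (g) => candidates.length === 0 || candidates.includes(g.encoding)
  );

  // Keep the most confident guess that clears the threshold.
  let best: Guess | undefined;
  for (const g of allowed) {
    const clears = g.confidence > minimumThreshold;
    if (clears && (best === undefined || g.confidence > best.confidence)) {
      best = g;
    }
  }

  // Nothing usable: respect the configured fallback (files.encoding).
  return best !== undefined ? best.encoding : fallback;
}

// Illustrative data only, not real detector output.
const sampleGuesses: Guess[] = [
  { encoding: "windows1252", confidence: 0.6 },
  { encoding: "windows1250", confidence: 0.1 },
];

// Threshold 0: the low-confidence candidate is still returned.
const withZeroThreshold = pickEncoding(sampleGuesses, ["windows1250"], "utf8");
// Threshold 0.2: the same guess is rejected and the fallback is used.
const withOldThreshold = pickEncoding(sampleGuesses, ["windows1250"], "utf8", 0.2);
```

In this sketch, lowering the threshold from 0.2 to 0 is what lets a weak guess for a listed candidate win over the fallback, which matches the motivation given in the description.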