UTF8 Encoding isn't consistent with .Net Framework #1679

replaysMike · 2020-01-11T06:11:32Z

I found a subtle difference that was revealed in a bunch of hashing code I had written a while back for .Net Framework. I wrote a multi-platform test that shows UTF8Encoding is treated slightly different in .Net Standard and I don't really have a good way to solve it yet.

Consider the following - I encoded a string in hex to guarantee the bytes are the same for the test:
(hopefully github doesn't mangle the expected string, it looks correct after previewing)

[Test]
public void ShouldEncodeUTF8()
{
    var netFrameworkExpected = "�\u0004[\u0004�\u0001�\v,�\u001cn]�$«�� )�:�YH̗I5�V���Nl7α��i�g_�ZQW%\u001d�Dy\u001eЕ\u0013w�v+\u0012*��\u000f*��\u0019r��}���8��w��&�\r���\f����?���&�t�M��[�`kzhz9\u0015�\u0012I�\u001ey_`�\u0011\tF��A�Af~��q��%P�����\u0003�x�(g���e\u001fM�32\u0014��";
    var hex = "BC045B0488019F0B2CE61C6E5DFC24C2ABE09BDA2029CC3AE9AD5948CC9749359756B1A2D94E6C37CEB189D269AA675FF75A5157251D8544791ED09513779B762B122A89E10F2A98E91972D7CA7DF9F98038DFDB779FED269A0DE3F8FA0C828993B23F85B5A826B474E84DFECD5B87606B7A687A3915C31249CE1E795F609A11094686DF41E99041667E9DD271A0E22550FDD0C3CEF0039678F328679B8590651F4DBE3332148DBA";
    var bytes = hex.HexToBytes();
    var utf8Encoded = Encoding.UTF8.GetString(bytes);
    Assert.AreEqual(netFrameworkExpected, utf8Encoded);
}

public static byte[] HexToBytes(this string hexString)
{
    return Enumerable.Range(0, hexString.Length)
        .Where(x => x % 2 == 0)
        .Select(x => Convert.ToByte(hexString.Substring(x, 2), 16))
        .ToArray();
}

This test will pass on .Net Framework 4.8, but will fail on .Net Standard 2.0

svick · 2020-01-11T11:37:09Z

The difference is that on .Net Core 3.0+, some invalid byte sequences produce two �, while they produced only one � on older frameworks. This was an intentional breaking change to follow Unicode best practices, see dotnet/docs#13547 for more details.

(Also, this has nothing to do with .Net Standard. If you write a .Net Standard library and then run it on .Net Framework or .Net Core 2.x, you should see the old behavior.)

replaysMike · 2020-01-11T22:24:43Z

Thanks, you would think they would provide an overload to get the old behaviour as there is no way to easily fix this. I can’t just utilize the new encoding, the mechanisms are used for existing hashed password authentication and this makes it difficult to upgrade to .Net Core.

It’s good that they fixed this finally, however breaking backwards compatibility is a problem. The only way for me to fix this is to roll my own UTF8 encoding

svick · 2020-01-11T22:34:58Z

Supporting invalid UTF-8 sequences for passwords sounds like a bad idea to me and something nobody would intentionally use. Can't you change your policy to disallow such passwords (and suggest password reset if anyone actually tries to log in with such a password)?

replaysMike · 2020-01-12T01:03:41Z

we weren’t supporting this in passwords but rather the auto generated salts that are stored as binary data. We can't validate the existing password hashes correctly if we can't decode the existing UTF8 salts. I discovered the system was creating unicode strings and validating using UTF8 which is a bug in our system and now fixed, but there's not much I can do about validating existing salts without the same UTF8 handling as in .Net Framework.

replaysMike · 2020-01-13T02:36:58Z

I'll just leave this here in case anyone else runs into this and its a blocker - I created a Nuget package that provides UTF8 encoding as .Net Framework does it: Text.UTF8.Legacy

GrabYourPitchforks · 2020-04-23T17:11:21Z

I'm glad you found a workaround for your scenario. However, I want to provide a general warning that this is not the right way to go about doing things. The Encoding classes are intended for processing structured data, not random data (as you would get from cryptography). The end result is that much of the entropy of the "salt" data will be lost during the conversion process. If the target application is bound by legal or regulatory requirements, a security audit will find such application out of compliance.

The safest course of action for the scenario mentioned above would be to perform a one-time conversion of the existing data in the database, performing the UTF-8 decoding step, then re-storing this information back in the database (as binary) and marking it as tainted. The next time a user attempts to log in, the salt is read from the database and treated in pure binary form rather than having the application attempt to stringify it. Additionally, if the salt is marked tained, the application should re-generate the salt using the full amount of required entropy, then store this new salt back in the database and remove the tainted flag.

This process should have the effect of automatically upgrading all users' login data the next time they log in. The process should be fully transparent to users and shouldn't result in any "you must reset your password" like interruptions to their workflows.

Hope this helps!

replaysMike · 2020-04-23T18:43:21Z

This is only possible when you have a way to UTF8 decode the original data in order to validate at least once so that an upgrade and re-encoding of the data can be performed. In our case, this means having all users for a particular database tenant being upgraded before the old UTF8 data can be upgraded permanently. This means having access to the old UTF8 encoding is still required and the Microsoft team made no option available for backwards compatibility requiring custom solutions. This is the 6th non-backwards compatible change in .net core 3+ I’ve found and seemingly all introduced by a single developer at Microsoft.

…

On Thu, Apr 23, 2020 at 10:11 AM Levi Broderick ***@***.***> wrote: <security hat on> I'm glad you found a workaround for your scenario. However, I want to provide a general warning that this is not the right way to go about doing things. The Encoding classes are intended for processing structured data, not random data (as you would get from cryptography). The end result is that much of the entropy of the "salt" data will be lost during the conversion process. *If the target application is bound by legal or regulatory requirements, a security audit will find such application out of compliance.* The safest course of action for the scenario mentioned above would be to perform a one-time conversion of the existing data in the database, performing the UTF-8 decoding step, then re-storing this information back in the database (as binary) and marking it as *tainted*. The next time a user attempts to log in, the salt is read from the database and treated in pure binary form rather than having the application attempt to stringify it. Additionally, if the salt is marked *tained*, the application should re-generate the salt using the full amount of required entropy, then store this new salt back in the database and remove the *tainted* flag. This process should have the effect of automatically upgrading all users' login data the next time they log in. The process should be fully transparent to users and shouldn't result in any "you must reset your password" like interruptions to their workflows. Hope this helps! — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub <#1679 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AATJ54XRNJMXSNFX6F4BK5LROBZETANCNFSM4KFQLFQQ> .

GrabYourPitchforks · 2020-04-23T19:21:29Z

.NET Core does not hold itself to the same back-compat bar as .NET Full Framework did. The philosophy behind this is that for .NET Full Framework, updates are in-place and are pushed out over Windows Update. Developers and consumers often have no control over when updates are deployed to a machine. Imagine an application works on Monday, then you go to bed and wake up on Tuesday, and the application stops working. That kind of change is generally unacceptable. So we do try to hold ourselves to a very high bar there. One of the consequences is that we often can't even introduce seemingly innocuous bug fixes. Somebody, somewhere, may have taken a dependency on the buggy behavior.

For .NET Core, we're more free to make such changes. The reason for this is that framework updates are a deliberate action by the developer. By definition .NET Core applications can't run into the "works one day, fails the next day" scenario I mentioned above.

We're not ignorant to the fact that breaking changes are sometimes painful, and we're not in the habit of making such changes blithely. When we do make such changes we weigh the benefit vs. harm to the ecosystem as a whole. For example, we generally see improved interoperability with accepted industry standards as a large overall ecosystem benefit. We also document such changes at https://docs.microsoft.com/en-us/dotnet/core/compatibility/breaking-changes.

You said this is the 6th breaking change you've encountered in .NET Core 3.x. Check the link above (or, specifically, https://docs.microsoft.com/en-us/dotnet/core/compatibility/corefx). If we missed something, then please open issues in the docs repo. Or comment here and I can open issues on your behalf.

John0King · 2020-04-24T01:39:54Z

the mechanisms are used for existing hashed password authentication and this makes it difficult to upgrade to .Net Core

if you store as binary in database, why would you use Utf8Encoding ? all hash algorithm (sha1,sha256,md5 etc.) use byte[] instead of string , if you want to show this data, base64 is a much better solution .

the upgrade process would have 2 steps:

convert current utf8 base to base64 . （if the data in database is utf8 based, you can use .net framework to decode and update to original byte[]）
upgrade to .NET Core 3.x with base64

replaysMike closed this as completed Jan 13, 2020

GrabYourPitchforks mentioned this issue Apr 14, 2021

Strange behavior, Encoding.UTF8.GetString has different results in dotnet core and dotnet framewrok. dotnet/runtime#51204

Closed

javierhimura mentioned this issue Jun 25, 2023

Support for GameBoy NSO Save files kwsch/PKHeX#3921

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UTF8 Encoding isn't consistent with .Net Framework #1679

UTF8 Encoding isn't consistent with .Net Framework #1679

replaysMike commented Jan 11, 2020 •

edited

Loading

svick commented Jan 11, 2020 •

edited

Loading

replaysMike commented Jan 11, 2020

svick commented Jan 11, 2020

replaysMike commented Jan 12, 2020 •

edited

Loading

replaysMike commented Jan 13, 2020

GrabYourPitchforks commented Apr 23, 2020

replaysMike commented Apr 23, 2020 via email

GrabYourPitchforks commented Apr 23, 2020

John0King commented Apr 24, 2020

UTF8 Encoding isn't consistent with .Net Framework #1679

UTF8 Encoding isn't consistent with .Net Framework #1679

Comments

replaysMike commented Jan 11, 2020 • edited Loading

svick commented Jan 11, 2020 • edited Loading

replaysMike commented Jan 11, 2020

svick commented Jan 11, 2020

replaysMike commented Jan 12, 2020 • edited Loading

replaysMike commented Jan 13, 2020

GrabYourPitchforks commented Apr 23, 2020

replaysMike commented Apr 23, 2020 via email

GrabYourPitchforks commented Apr 23, 2020

John0King commented Apr 24, 2020

replaysMike commented Jan 11, 2020 •

edited

Loading

svick commented Jan 11, 2020 •

edited

Loading

replaysMike commented Jan 12, 2020 •

edited

Loading