Skip to content
This repository has been archived by the owner on Sep 13, 2022. It is now read-only.

UTF8 Encoding isn't consistent with .Net Framework #1679

Closed
replaysMike opened this issue Jan 11, 2020 · 9 comments
Closed

UTF8 Encoding isn't consistent with .Net Framework #1679

replaysMike opened this issue Jan 11, 2020 · 9 comments

Comments

@replaysMike
Copy link

replaysMike commented Jan 11, 2020

I found a subtle difference that was revealed in a bunch of hashing code I had written a while back for .Net Framework. I wrote a multi-platform test that shows UTF8Encoding is treated slightly different in .Net Standard and I don't really have a good way to solve it yet.

Consider the following - I encoded a string in hex to guarantee the bytes are the same for the test:
(hopefully github doesn't mangle the expected string, it looks correct after previewing)

[Test]
public void ShouldEncodeUTF8()
{
    var netFrameworkExpected = "\u0004[\u0004\u0001\v,�\u001cn]�$«�� )�:�YH̗I5�V���Nl7α��i�g_�ZQW%\u001d�Dy\u001eЕ\u0013w�v+\u0012*��\u000f*��\u0019r��}���8��w��&�\r���\f����?���&�t�M��[�`kzhz9\u0015\u0012I�\u001ey_`�\u0011\tF��A�Af~��q��%P�����\u0003�x�(g���e\u001fM�32\u0014��";
    var hex = "BC045B0488019F0B2CE61C6E5DFC24C2ABE09BDA2029CC3AE9AD5948CC9749359756B1A2D94E6C37CEB189D269AA675FF75A5157251D8544791ED09513779B762B122A89E10F2A98E91972D7CA7DF9F98038DFDB779FED269A0DE3F8FA0C828993B23F85B5A826B474E84DFECD5B87606B7A687A3915C31249CE1E795F609A11094686DF41E99041667E9DD271A0E22550FDD0C3CEF0039678F328679B8590651F4DBE3332148DBA";
    var bytes = hex.HexToBytes();
    var utf8Encoded = Encoding.UTF8.GetString(bytes);
    Assert.AreEqual(netFrameworkExpected, utf8Encoded);
}

public static byte[] HexToBytes(this string hexString)
{
    return Enumerable.Range(0, hexString.Length)
        .Where(x => x % 2 == 0)
        .Select(x => Convert.ToByte(hexString.Substring(x, 2), 16))
        .ToArray();
}

This test will pass on .Net Framework 4.8, but will fail on .Net Standard 2.0

@svick
Copy link

svick commented Jan 11, 2020

The difference is that on .Net Core 3.0+, some invalid byte sequences produce two �, while they produced only one � on older frameworks. This was an intentional breaking change to follow Unicode best practices, see dotnet/docs#13547 for more details.

(Also, this has nothing to do with .Net Standard. If you write a .Net Standard library and then run it on .Net Framework or .Net Core 2.x, you should see the old behavior.)

@replaysMike
Copy link
Author

Thanks, you would think they would provide an overload to get the old behaviour as there is no way to easily fix this. I can’t just utilize the new encoding, the mechanisms are used for existing hashed password authentication and this makes it difficult to upgrade to .Net Core.

It’s good that they fixed this finally, however breaking backwards compatibility is a problem. The only way for me to fix this is to roll my own UTF8 encoding

@svick
Copy link

svick commented Jan 11, 2020

Supporting invalid UTF-8 sequences for passwords sounds like a bad idea to me and something nobody would intentionally use. Can't you change your policy to disallow such passwords (and suggest password reset if anyone actually tries to log in with such a password)?

@replaysMike
Copy link
Author

replaysMike commented Jan 12, 2020

we weren’t supporting this in passwords but rather the auto generated salts that are stored as binary data. We can't validate the existing password hashes correctly if we can't decode the existing UTF8 salts. I discovered the system was creating unicode strings and validating using UTF8 which is a bug in our system and now fixed, but there's not much I can do about validating existing salts without the same UTF8 handling as in .Net Framework.

@replaysMike
Copy link
Author

I'll just leave this here in case anyone else runs into this and its a blocker - I created a Nuget package that provides UTF8 encoding as .Net Framework does it: Text.UTF8.Legacy

@GrabYourPitchforks
Copy link
Member

<security hat on>

I'm glad you found a workaround for your scenario. However, I want to provide a general warning that this is not the right way to go about doing things. The Encoding classes are intended for processing structured data, not random data (as you would get from cryptography). The end result is that much of the entropy of the "salt" data will be lost during the conversion process. If the target application is bound by legal or regulatory requirements, a security audit will find such application out of compliance.

The safest course of action for the scenario mentioned above would be to perform a one-time conversion of the existing data in the database, performing the UTF-8 decoding step, then re-storing this information back in the database (as binary) and marking it as tainted. The next time a user attempts to log in, the salt is read from the database and treated in pure binary form rather than having the application attempt to stringify it. Additionally, if the salt is marked tained, the application should re-generate the salt using the full amount of required entropy, then store this new salt back in the database and remove the tainted flag.

This process should have the effect of automatically upgrading all users' login data the next time they log in. The process should be fully transparent to users and shouldn't result in any "you must reset your password" like interruptions to their workflows.

Hope this helps!

@replaysMike
Copy link
Author

replaysMike commented Apr 23, 2020 via email

@GrabYourPitchforks
Copy link
Member

.NET Core does not hold itself to the same back-compat bar as .NET Full Framework did. The philosophy behind this is that for .NET Full Framework, updates are in-place and are pushed out over Windows Update. Developers and consumers often have no control over when updates are deployed to a machine. Imagine an application works on Monday, then you go to bed and wake up on Tuesday, and the application stops working. That kind of change is generally unacceptable. So we do try to hold ourselves to a very high bar there. One of the consequences is that we often can't even introduce seemingly innocuous bug fixes. Somebody, somewhere, may have taken a dependency on the buggy behavior.

For .NET Core, we're more free to make such changes. The reason for this is that framework updates are a deliberate action by the developer. By definition .NET Core applications can't run into the "works one day, fails the next day" scenario I mentioned above.

We're not ignorant to the fact that breaking changes are sometimes painful, and we're not in the habit of making such changes blithely. When we do make such changes we weigh the benefit vs. harm to the ecosystem as a whole. For example, we generally see improved interoperability with accepted industry standards as a large overall ecosystem benefit. We also document such changes at https://docs.microsoft.com/en-us/dotnet/core/compatibility/breaking-changes.

You said this is the 6th breaking change you've encountered in .NET Core 3.x. Check the link above (or, specifically, https://docs.microsoft.com/en-us/dotnet/core/compatibility/corefx). If we missed something, then please open issues in the docs repo. Or comment here and I can open issues on your behalf.

@John0King
Copy link

the mechanisms are used for existing hashed password authentication and this makes it difficult to upgrade to .Net Core

if you store as binary in database, why would you use Utf8Encoding ? all hash algorithm (sha1,sha256,md5 etc.) use byte[] instead of string , if you want to show this data, base64 is a much better solution .

the upgrade process would have 2 steps:

  1. convert current utf8 base to base64 . (if the data in database is utf8 based, you can use .net framework to decode and update to original byte[]
  2. upgrade to .NET Core 3.x with base64

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants