This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Re-implemented the ecvt function. #12894

Merged
merged 21 commits into from
Sep 13, 2017

Conversation


@mazong1123 mazong1123 commented Jul 19, 2017

Instead of leveraging snprintf, re-implement the ecvt function according to the paper: https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf

Note:

  1. This commit won't fix any existing bug.
  2. This is a raw implementation of the paper. The performance gain on Linux is only about 10%. We could tune the performance further.

UPDATE:

Now the new implementation is at least 2x faster than the old one.

Fix #10651
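For context, the heart of the paper's approach is exact integer digit generation: keep the value as a fraction num/den and peel decimal digits off with integer multiply and divide. A minimal sketch of that loop (not the PR's actual code — the BigNum arithmetic is replaced by 64-bit integers, so this only works for small inputs):

```cpp
#include <cstdint>
#include <string>

// Generate up to maxDigits decimal digits of the fraction num/den
// (requires 0 <= num < den, and den small enough that num * 10 fits
// in 64 bits). The real implementation does this with BigNum.
std::string GenerateDigits(uint64_t num, uint64_t den, int maxDigits)
{
    std::string digits;
    for (int i = 0; i < maxDigits && num != 0; ++i)
    {
        num *= 10;                                // shift one decimal place
        digits.push_back(char('0' + num / den));  // emit the next digit
        num %= den;                               // keep the remainder
    }
    return digits;
}
```

For example, `GenerateDigits(1, 8, 10)` yields the digits of 1/8 = 0.125.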

@mazong1123 mazong1123 changed the title Re-implemented the ecvt function. [WIP] Re-implemented the ecvt function. Jul 19, 2017
@mazong1123
Author

mazong1123 commented Jul 19, 2017

@tarekgh @jkotas This commit is a raw implementation of the paper https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf . The performance has not improved much yet (only about 10%). I'll tune it further.

I created the PR now so it can go through the CI checks; if there are any issues I can fix them earlier.

Note that I did not change the "15 digits first, then 17 digits precision" round-trip conversion rule. I think that should be done in another PR.

@mazong1123
Author

mazong1123 commented Jul 19, 2017

@tarekgh @jkotas PTAL.
I ran some tests on this PR. Following are the details:

Accuracy

I ran the following code against this PR:

using System;

namespace dtost
{
    class Program
    {
        static void TestDoubleConversion(double number)
        {
            string str = number.ToString("R");
            double back = double.Parse(str);
            if (number != back)
            {
                Console.WriteLine("Test failed: converted str: {0}. expected double {1}", back, number);
            }
            else
            {
                Console.WriteLine("Pass");
            }
        }

        static void Main(string[] args)
        {
            TestDoubleConversion(104234.343);
            TestDoubleConversion(-1.7976931348623157E+308);
            TestDoubleConversion(1.7976931348623157e+308);
            TestDoubleConversion(1.7976931348623157e+308);
            TestDoubleConversion(70.9228162514264339123);
            TestDoubleConversion(3.1415926535897937784612345);
            TestDoubleConversion(29999999999999792458.0);
            TestDoubleConversion(1000.4999999999999999999);
            TestDoubleConversion(1000.9999999999999999999);
            TestDoubleConversion(999.99999999999999999999);
            TestDoubleConversion(0.84551240822557006);
            TestDoubleConversion(1.0);
            TestDoubleConversion(10.0);
            TestDoubleConversion(0);
            TestDoubleConversion(0.0);
            TestDoubleConversion(0.0000000000000199);
            TestDoubleConversion(-0.0000123);
        }
    }
}

The output is:

Pass
Pass
Pass
Pass
Pass
Pass
Pass
Pass
Pass
Pass
Test failed: converted str: 0.84551240822557. expected double 0.84551240822557
Pass
Pass
Pass
Pass
Pass
Pass

Running the code under the current .NET Core release gives exactly the same result. The only failure is caused by a known issue: https://stackoverflow.com/questions/24299692/why-is-a-round-trip-conversion-via-a-string-not-safe-for-a-double

So the accuracy of the new implementation should be the same as the current .NET Core release.

Performance

I ran the following micro-benchmark:

using System;
using System.Diagnostics;
using System.Globalization;

namespace hlold
{
    class Program
    {
        static void Main(string[] args)
        {
            double number = 104234.343;
            CultureInfo cultureInfo = new CultureInfo("fr");
            for (int i = 0; i < 20; i++)
            {
                Stopwatch watch = new Stopwatch();
                watch.Start();
                for (int j = 0; j < 10000; j++)
                {
                    number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
                    number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
                    number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
                }

                Console.Write(watch.ElapsedMilliseconds);
                Console.WriteLine();
            }
        }
    }
}

The new implementation is 3x faster than the old one:

Output of new implementation:

52
52
50
51
50
51
51
51
51
50
50
50
51
50
50
59
50
50
51
51

Output of old implementation:

176
195
182
193
186
190
186
189
184
189
179
193
187
193
181
191
193
185
198
180

If we change the number to double.MinValue / 2, the new implementation is still 2x faster than the old one:

Output of new implementation:

191
189
188
190
192
190
188
189
187
190
192
187
188
187
188
189
192
188
187
187

Output of old implementation:

392
382
381
383
386
400
390
381
381
387
390
382
382
393
386
382
386
386
390
382

So the new implementation is at least 2x faster than the old one. I think this demonstrates that the new algorithm is much more efficient than the old one.

@mazong1123 mazong1123 changed the title [WIP] Re-implemented the ecvt function. Re-implemented the ecvt function. Jul 19, 2017
@tarekgh tarekgh added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 19, 2017
@dotnet dotnet deleted a comment from dotnet-bot Jul 19, 2017
@dotnet dotnet deleted a comment from dotnet-bot Jul 19, 2017
@tarekgh
Member

tarekgh commented Jul 19, 2017

@mazong1123 this is fantastic; the results look promising. I have enabled the whole corefx repo to run against your changes so we know if we get any regressions. Also, I am not seeing any large static data being used, which is good news too. If corefx goes fine, we can move to the next step: checking with legal about using the algorithm and doing an official code review, then we can finalize the work.

Thanks a lot for getting this done. Good work!

@tarekgh
Member

tarekgh commented Jul 19, 2017

@safern could you please take a look at the CI failure in ComponentModel?

Ubuntu x64 Checked Build and Test (Jit - CoreFx)

https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_ubuntu_corefx_baseline_prtest/62/consoleFull#-21281798079fe3b83-f408-404c-b9e7-9207d232e5fc

this is not urgent but it'll be good if you have a look when you have a chance.

@mazong1123 you may ignore the failure in "Ubuntu x64 Checked Build and Test (Jit - CoreFx)" for now as it looks unrelated to your changes.

@safern
Member

safern commented Jul 19, 2017

@safern could you please take a look at the CI failure in ComponentModel?

Will do later :)

@tannergooding
Member

I'm going to give the implementation a review pass sometime later this evening.

const UINT32* pRhsEnd = pRhsCurrent + length;

while (pRhsCurrent != pRhsEnd)
{
Member

can we just memcpy this part instead of looping?


It may be best to measure to see if there's any performance difference.

Member

@tannergooding tannergooding Jul 20, 2017


memcpy should do the right thing, regardless of size. That being said, the current max size is only 35 dwords (140 bytes), so the perf difference is barely measurable.

Member

Why don't we just memcpy using m_len, even if you think the perf difference is barely measurable?
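For illustration, the loop-versus-memcpy point can be sketched as follows (a hedged sketch — the struct here is a stand-in, not the PR's actual BigNum declaration):

```cpp
#include <cstdint>
#include <cstring>

const int BIGSIZE = 35;  // block capacity, per the constants under review

struct BigNumSketch
{
    uint32_t m_len;               // number of blocks in use
    uint32_t m_blocks[BIGSIZE];   // little-endian 32-bit blocks
};

// Copy rhs's blocks into lhs with a single memcpy instead of a
// block-by-block loop, sized by m_len.
void CopyBlocks(BigNumSketch& lhs, const BigNumSketch& rhs)
{
    std::memcpy(lhs.m_blocks, rhs.m_blocks, rhs.m_len * sizeof(uint32_t));
    lhs.m_len = rhs.m_len;
}
```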




while (true)
{
if (*pCurrentLeftBlock > *pCurrentRightBlock)
Member

nit: maybe we can do it as

INT64 diff = (INT64) (*pCurrentLeftBlock) - (INT64) (*pCurrentRightBlock);
if (diff != 0)
return diff;

Member

👍, but we should validate that this doesn't perform any more poorly on 32-bit platforms
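A compilable version of the suggestion, for reference (hedged sketch; the widening subtraction gives the comparison sign in one operation, which is exactly the 32-bit-codegen concern raised above):

```cpp
#include <cstdint>

// Compare two 32-bit blocks by widening to 64 bits: the sign of the
// difference is the comparison result, with no overflow possible.
int64_t CompareBlock(uint32_t left, uint32_t right)
{
    return (int64_t)left - (int64_t)right;
}
```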

const UINT32* pCurrentLeftBlock = lhs.m_blocks + lastIndex;
const UINT32* pCurrentRightBlock = rhs.m_blocks + lastIndex;

while (true)
Member


nit: while (pCurrentLeftBlock <= pLeftStartBlock) is better here and remove if (pCurrentLeftBlock == pLeftStartBlock) from the loop

Author

I think you meant while (pCurrentLeftBlock >= pLeftStartBlock) :). I'll take this optimization.


void BigNum::ShiftLeft(UINT64 input, int shift, BigNum& output)
{
int shiftBlocks = shift / 32;
Member

@tarekgh tarekgh Jul 19, 2017

int shiftBlocks = shift / 32;

shift >> 5

Member

👍, although the compiler should also be doing this optimization for us.

Member

Actually, we should probably just leave this. The compiler will optimize the next statement as well (into either a single idiv or into a couple of shifts and a subtract)


Using / and % is fine; no native compiler will generate actual div instructions for this code. However, I would suggest making shift unsigned, because unsigned division by a constant is cheaper than signed division. The native compiler may figure this one out too, but that requires interprocedural optimizations, so it's somewhat less certain.

Author

@mikedn Good catch. I think making shift unsigned is more reasonable here.
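To make the point concrete, a sketch of the decomposition with an unsigned shift count, where the compiler can lower / and % to a plain shift and mask (variable names are stand-ins):

```cpp
#include <cstdint>

// Split a bit-shift amount into whole 32-bit blocks and leftover bits.
// With an unsigned operand, shift / 32 compiles to shift >> 5 and
// shift % 32 to shift & 31, with no signed-division fixup code.
void SplitShift(uint32_t shift, uint32_t& blocks, uint32_t& bits)
{
    blocks = shift / 32;
    bits   = shift % 32;
}
```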

realExponent = -1074;
mantissaHighBitIdx = BigNum::LogBase2(realMantissa);
}

Member

You can shortcut Zero, Infinity, Indeterminate, and NaN by this point

Author

@mazong1123 mazong1123 Jul 20, 2017

These special cases have already been handled before passing to the method: https://github.com/dotnet/coreclr/blob/master/src/classlibnative/bcltype/number.cpp#L146

I think adding asserts is worthwhile, though.

Member

It looks like zero and negative zero are not handled by the outer code (just NaN/Inf).

Author

@mazong1123 mazong1123 Jul 21, 2017

Yes, we should shortcut zeros here. Thanks. I'd like to borrow the code from https://github.com/dotnet/coreclr/blob/master/src/pal/src/cruntime/misctls.cpp#L185-L193

However, I do not quite understand the comment at https://github.com/dotnet/coreclr/blob/master/src/pal/src/cruntime/misctls.cpp#L184

// we have issue #10290 tracking fixing the sign of 0.0 across the platforms

Did this code introduce issue #10290 ?

Member

@mazong1123, no. #10290 is tracking a future improvement (and possible breaking change) that -0.0 and 0.0 should be treated differently when converting to a string.

The code you referenced is fine and is matching the current expected behavior.

UINT64 realMantissa = 0;
int realExponent = 0;
UINT32 mantissaHighBitIdx = 0;
if (((FPDOUBLE*)&value)->exp > 0)
Member

I would expect you to be special casing denormal numbers, since the implicit bit is zero instead of one.

Author

@mazong1123 mazong1123 Jul 23, 2017

I think I have included the denormal numbers in the else condition. According to https://en.wikipedia.org/wiki/Double-precision_floating-point_format

In the case of subnormals (e=0) the double-precision number is described by:
(-1)^sign * 2^(1 - 1023) * 0.fraction = (-1)^sign * 2^(-1022) * 0.fraction

2^(-1022) * 0.fraction = 2^(-1022) * (mantissa / 2^52) = mantissa * 2^(-1074). So you can see that in the else block, realExponent = -1074 and realMantissa = mantissa.

I will add comments here.
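The normal/subnormal split discussed above can be sketched like this (a hedged illustration, not the PR's exact code — the FPDOUBLE field access is replaced by explicit bit operations):

```cpp
#include <cstdint>
#include <cstring>

// Decode a finite double into (mantissa, exponent) such that
// value == mantissa * 2^exponent. Normal numbers (e > 0) get the
// implicit leading bit and exponent e - 1075 (-1023 bias, -52
// fraction bits); subnormals (e == 0) have no implicit bit and a
// fixed exponent of -1074, matching the derivation above.
void DecodeDouble(double value, uint64_t& mantissa, int& exponent)
{
    uint64_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    uint64_t fraction = bits & ((1ULL << 52) - 1);
    int e = (int)((bits >> 52) & 0x7FF);
    if (e > 0)
    {
        mantissa = fraction | (1ULL << 52);  // implicit 1.fraction bit
        exponent = e - 1075;
    }
    else
    {
        mantissa = fraction;                 // subnormal: 0.fraction
        exponent = -1074;
    }
}
```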

}
} m_initializer;

UINT8 m_len;
Member

nit: Either make this UINT32 or place it after m_blocks (larger fields should come first). The resulting struct is going to be 8-bytes in either case...

Author

I'll place m_len after m_blocks. Thanks.


You really should make it UINT32 rather than placing it after m_blocks. In general, the only advantage of using small integral types is storage size. In this case you can't actually save any storage due to alignment requirements; instead you risk making the code larger, because the compiler may need to insert widening conversions in some cases.
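The alignment point can be checked directly; under typical alignment rules both layouts occupy the same size (a hedged sketch with stand-in fields, not the real BigNum declaration):

```cpp
#include <cstdint>

// Two stand-in layouts for the m_len discussion: a UINT8 length field
// is padded out to the next UINT32 alignment boundary anyway, so the
// overall struct size is identical.
struct LenAsByte { uint8_t  m_len; uint32_t m_blocks[2]; };
struct LenAsWord { uint32_t m_len; uint32_t m_blocks[2]; };
```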

Author

@mazong1123 mazong1123 Jul 23, 2017

@mikedn Sounds reasonable. But that means I need to extend the static data as well; something like BIGSIZE and BIGPOWER10NUM should be extended to UINT32.

Anyway, I'll change m_len and the related static data to UINT32 to see if there's any big impact. I don't want to rely on the compiler for the alignment, but the size of the static data should not increase much either.


Something like BIGSIZE and BIGPOWER10NUM should be extended to UINT32.

Yes, but only for consistency. I would expect the compiler to treat those as constants so the impact on the generated code should be exactly 0. In fact, they should be constexpr, provided that the version(s) of VC++ used to build this code doesn't barf on it.

static const UINT8 BIGSIZE = 35;
static const UINT8 UINT32POWER10NUM = 8;
static const UINT8 BIGPOWER10NUM = 6;
static UINT32 m_power10UInt32Table[UINT32POWER10NUM];
Member

static UINT32 const m_power10UInt32Table[] =
{
// Declare data inline
};

Author

I have to use constexpr to accomplish this. The code compiles on my current Ubuntu 16.04 box, but I'm not sure it will compile across platforms. I'll make the change and see if it passes CI later.

Member

Using constexpr would likely work, but it shouldn't be required.

This should just be a UINT32 constant for the length and then a sequence of bytes representing the data. You likely need a second lookup table that contains the starting index for each entry, but it should just be a cast from UINT32* to BigNum* at that point.


BigNum::BigNum()
:m_len(0)
{
Member

a comment about explicitly not zeroing the memory due to perf reasons might be nice

break;
}

--pCurrentLeftBlock;
Member

nit: Lengths of both are the same, you can use a single index to do the access

return *this;
}

int BigNum::Compare(const BigNum& lhs, UINT32 value)
Member

Is having a compare really better than individual ==, !=, <, <=, >, and >= operators?

Author

I can overload these operators but I think we should do this enhancement in another PR because a simple compare method should be enough to implement the algorithm.

Member

I was mostly asking from a perf perspective. It seems like we are only actually reusing the Compare value in one place, every other place we are just doing a single comparison, which is likely cheaper than the full Compare check.

Author

I'll remove BigNum::Compare(const BigNum& lhs, UINT32 value) since it is not used now.

BigNum::Compare(const BigNum& lhs, const BigNum& rhs) is still useful when comparing numerator and denominator so I'll keep it.

int shiftBlocks = shift / 32;
int remaningToShiftBits = shift % 32;

for (int i = 0; i < shiftBlocks; ++i)
Member

memset(0) would be much faster

Member

(memset over the entire range that is)


Might be, it really depends on how large these bignums usually get. I suspect not very large.

Member

@tannergooding tannergooding Jul 20, 2017

As per the memcpy comment, memset should do the right thing regardless of size. The current max size is 35 dwords (140 bytes).

Author

I'll change the loop to a memset.
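The change agreed on here — zeroing shiftBlocks whole blocks with one memset rather than a per-block loop — looks roughly like this (variable names are stand-ins):

```cpp
#include <cstdint>
#include <cstring>

// Zero the low shiftBlocks 32-bit blocks in a single call instead of
// assigning zero block by block in a loop.
void ZeroLowBlocks(uint32_t* blocks, uint32_t shiftBlocks)
{
    std::memset(blocks, 0, shiftBlocks * sizeof(uint32_t));
}
```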

return true;
}

for (UINT8 i = 0; i < m_len; ++i)
Member

it seems like we could trivially detect all the cases where the value would become zero and only rely on m_len == 0

Author

@mazong1123 mazong1123 Jul 22, 2017

Did you mean we should ensure that whenever m_len > 0, at least one block is != 0, so that IsZero() only needs to check m_len == 0? I think IsZero() should not depend on that assumption, which would tie it to an implementation detail outside of it.

Member

IsZero is part of the BigNum class, so it would only be dependent on a convention set for its own class.

This probably isn't too important so we can investigate later.


void BigNum::SetUInt64(UINT64 value)
{
m_len = 0;
Member

m_blocks[0] = value & 0xFFFFFFFF;
m_blocks[1] = value >> 32;
m_len = (m_blocks[1] == 0) ? 1 : 2;

Author

Nice! Thanks!


void BigNum::ExtendBlock(UINT32 newBlock)
{
m_blocks[m_len] = newBlock;
Member

There are a few cases where an ExtendBlock(UINT32 blockValue, UINT32 numBlocks) would be useful. It can memset and increment m_len the appropriate count all at once.

Author

I tried this change but it slightly slows down the performance. So I'd like to keep this method and use memset when bulk blocks need to be extended.

Member

Did you create a second ExtendBlock method or just replace this one entirely? I was suggesting that we create a second method and call it if the number of blocks being set (we loop the current method in a couple places) is larger than some threshold.

UINT32 lastIndex = lhs.m_len - 1;
INT32 currentIndex = lastIndex;

while (currentIndex >= 0)
Member

for (INT32 currentIndex = lhs.m_len - 1; currentIndex >= 0; --currentIndex) might be more readable? (The index needs to stay signed; with UINT32 the >= 0 check is always true.)

// ========================================================================================================================================
// This implementation is based on the paper: https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf
// Besides the paper, some of the code and ideas are modified from http://www.ryanjuckett.com/programming/printing-floating-point-numbers/
// You must read these two marterials to fully understand the code.
Member

materials

//
// 0.30102999566398119521373889472449 = log10V2
// 0.69 = 1 - log10V2 - epsilon (a small number account for drift of floating point multiplication)
int k = (int)(ceil(double((int)mantissaHighBitIdx + e) * 0.30102999566398119521373889472449 - 0.69));
Member

nit: it would be more readable to declare log10V2 and 0.69 as constants.
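With the nit applied, the estimate might read like this (a hedged sketch; the constant names are assumptions, not the PR's final code):

```cpp
#include <cmath>

const double LOG10_V2    = 0.30102999566398119521373889472449; // log10(2)
const double DRIFT_GUARD = 0.69; // 1 - log10(2) - epsilon for FP drift

// Estimate the decimal exponent k of a value whose highest set
// mantissa bit is at mantissaHighBitIdx with binary exponent e,
// i.e. value ~= 2^(mantissaHighBitIdx + e).
int EstimateDecimalExponent(int mantissaHighBitIdx, int e)
{
    return (int)std::ceil((double)(mantissaHighBitIdx + e) * LOG10_V2
                          - DRIFT_GUARD);
}
```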

Member

@tannergooding tannergooding left a comment

A few minor nits, but overall 👍

@mazong1123
Author

@tannergooding I've updated code according to the feedback. Please take a look. Thanks!

@mazong1123
Author

@dotnet-bot test Ubuntu x64 corefx_baseline
@dotnet-bot test Windows_NT x86 corefx_baseline
@dotnet-bot test Tizen armel Cross Release Build

@mazong1123
Author

Do I need to put the license mentioned in #12894 (comment) anywhere?
At the beginning of number.cpp or DoubleToNumberWorker?
I am asking and will get back to you.

@tarekgh Do we have any update?

@tarekgh
Member

tarekgh commented Aug 15, 2017

Do we have any update?

We are still following up here. It looks like there is some process we need to follow; we'll let you know what needs to be done when we have more details. Thanks for your patience.

@tarekgh
Member

tarekgh commented Aug 25, 2017

We are still following up here. We'll update as soon as we have news.

@tarekgh
Member

tarekgh commented Sep 12, 2017

#13920

@tarekgh
Member

tarekgh commented Sep 13, 2017

@mazong1123 thanks a lot for your effort and getting this ready. finally, I am going to merge it :-)

@tarekgh tarekgh merged commit 602ebe9 into dotnet:master Sep 13, 2017
@mazong1123 mazong1123 deleted the fix-10651 branch September 13, 2017 01:41
@jkotas
Member

jkotas commented Sep 13, 2017

This introduced build break on Windows ARM (this config does not build by default in CI):

src\classlibnative\bcltype\bignum.cpp(595): error C3861: 'BitScanReverse64': identifier not found [D:\j\workspace\arm_cross_che---9fe37e18\bin\obj\Windows_NT.arm.Checked\src\classlibnative\bcltype\bcltype.vcxproj]

@jkotas
Member

jkotas commented Sep 13, 2017

Fix: #13932

@mazong1123
Author

@jkotas how did you find the build break? Is there any way to check arm build in CI?

@jkotas
Member

jkotas commented Sep 13, 2017

No worries. I happened to notice it on a different job. There are custom jobs that can be triggered manually based on the nature of the change.

@dotnet-bot help

prints the list. You will find the Windows ARM build somewhere in there. Obviously, it is not cost effective to run all of them for each job.

Successfully merging this pull request may close these issues.

Perf: double.ToString() 3x slower in some cases on Linux