Conversation
Instead of leveraging snprintf, re-implement the ecvt function according to the paper: https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf Note: 1. This commit won't fix any existing bug. 2. This is a raw implementation of the paper. The performance gain on Linux is only 10%; we could tune the performance further. Fix #10651
@tarekgh @jkotas This commit is a raw implementation of the paper https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf. The performance gain is not large yet (only 10%); I'll tune the performance further. For now I created the PR so I can go through the CI checks and fix any issues early. Note that I did not change the "15-digit first, then 17-digit precision" round-trip conversion rule; I think that should be done in another PR.
@tarekgh @jkotas PTAL.

Accuracy

I ran the following code against this PR:

using System;
namespace dtost
{
class Program
{
static void TestDoubleConversion(double number)
{
string str = number.ToString("R");
double back = double.Parse(str);
if (number != back)
{
Console.WriteLine("Test failed: converted str: {0}. expected double {1}", back, number);
}
else
{
Console.WriteLine("Pass");
}
}
static void Main(string[] args)
{
TestDoubleConversion(104234.343);
TestDoubleConversion(-1.7976931348623157E+308);
TestDoubleConversion(1.7976931348623157e+308);
TestDoubleConversion(1.7976931348623157e+308);
TestDoubleConversion(70.9228162514264339123);
TestDoubleConversion(3.1415926535897937784612345);
TestDoubleConversion(29999999999999792458.0);
TestDoubleConversion(1000.4999999999999999999);
TestDoubleConversion(1000.9999999999999999999);
TestDoubleConversion(999.99999999999999999999);
TestDoubleConversion(0.84551240822557006);
TestDoubleConversion(1.0);
TestDoubleConversion(10.0);
TestDoubleConversion(0);
TestDoubleConversion(0.0);
TestDoubleConversion(0.0000000000000199);
TestDoubleConversion(-0.0000123);
}
}
}

The output is:
Running the code under the current .NET Core release gives exactly the same result. The only failure is caused by a known issue: https://stackoverflow.com/questions/24299692/why-is-a-round-trip-conversion-via-a-string-not-safe-for-a-double So the accuracy of the new implementation should be the same as the current .NET Core release.

Performance

I ran the following micro-benchmark:

using System;
using System.Diagnostics;
using System.Globalization;
namespace hlold
{
class Program
{
static void Main(string[] args)
{
double number = 104234.343;
CultureInfo cultureInfo = new CultureInfo("fr");
for (int i = 0; i < 20; i++)
{
Stopwatch watch = new Stopwatch();
watch.Start();
for (int j = 0; j < 10000; j++)
{
number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
number.ToString(cultureInfo); number.ToString(cultureInfo); number.ToString(cultureInfo);
}
Console.Write(watch.ElapsedMilliseconds);
Console.WriteLine();
}
}
}
}

For this number, the new implementation is 3x faster than the old one.

Output of new implementation:

Output of old implementation:

If we change the number, the results are:

Output of new implementation:

Output of old implementation:

So the new implementation is at least 2x faster than the old one. I think we have proved that the algorithm is much more efficient than the old one.
@mazong1123 this is fantastic. The results look promising. I have enabled the whole corefx repo to run against your changes to see whether we get any regression. Also, I am not seeing any use of big static data, which is good news too. If corefx goes fine, we can move to the next step: check with legal about using the algorithm, do the official code review, and then finalize the work. Thanks a lot for getting this done. Good work!
@safern could you please take a look at the CI failure in ComponentModel under "Ubuntu x64 Checked Build and Test (Jit - CoreFx)"? This is not urgent, but it'd be good if you could have a look when you have a chance. @mazong1123 you may ignore the failure in "Ubuntu x64 Checked Build and Test (Jit - CoreFx)" for now, as it looks unrelated to your changes.
Will do later :)

I'm going to give the implementation a review pass sometime later this evening.
const UINT32* pRhsEnd = pRhsCurrent + length;

while (pRhsCurrent != pRhsEnd)
{
can we just memcpy this part instead of looping?
It may be best to measure to see if there's any performance difference.
memcpy should do the right thing, regardless of size. That being said, the current max size is only 35 dwords (140 bytes), so the perf difference is barely measurable.
Why don't we just memcpy using m_len, even if the perf difference is barely measurable?
while (true)
{
if (*pCurrentLeftBlock > *pCurrentRightBlock)
nit: maybe we can do it as
INT64 diff = (INT64) (*pCurrentLeftBlock) - (INT64) (*pCurrentRightBlock);
if (diff != 0)
return diff;
👍, but we should validate that this doesn't perform any more poorly on 32-bit platforms
const UINT32* pCurrentLeftBlock = lhs.m_blocks + lastIndex;
const UINT32* pCurrentRightBlock = rhs.m_blocks + lastIndex;

while (true)
nit: while (pCurrentLeftBlock <= pLeftStartBlock) is better here and remove if (pCurrentLeftBlock == pLeftStartBlock) from the loop
I think you meant while (pCurrentLeftBlock >= pLeftStartBlock) :). I'll take this optimization.
void BigNum::ShiftLeft(UINT64 input, int shift, BigNum& output)
{
int shiftBlocks = shift / 32;
shift >> 5
👍, although the compiler should also be doing this optimization for us.
Actually, we should probably just leave this. The compiler will optimize the next statement as well (into either a single idiv
or into a couple of shifts and a subtract)
Using / and % is fine; no native compiler will generate actual div instructions for this code. However, I would suggest making shift unsigned, because unsigned division by a constant is cheaper than signed division. The native compiler may figure this out too, but that requires interprocedural optimizations, so it's somewhat less certain.
@mikedn Good catch. I think making shift unsigned is more reasonable here.
realExponent = -1074;
mantissaHighBitIdx = BigNum::LogBase2(realMantissa);
}
You can shortcut Zero, Infinity, Indeterminate, and NaN by this point
These special cases have already been handled before passing to the method: https://github.com/dotnet/coreclr/blob/master/src/classlibnative/bcltype/number.cpp#L146 I think adding asserts is worthwhile, though.
It looks like zero and negative zero are not handled by the outer code (just NaN/Inf).
Yes, we should shortcut zeros here. Thanks. I'd like to borrow the code from https://github.com/dotnet/coreclr/blob/master/src/pal/src/cruntime/misctls.cpp#L185-L193
However, I do not quite understand the comment at https://github.com/dotnet/coreclr/blob/master/src/pal/src/cruntime/misctls.cpp#L184
// we have issue #10290 tracking fixing the sign of 0.0 across the platforms
Did this code introduce issue #10290?
@mazong1123, no. #10290 is tracking a future improvement (and possible breaking change) that -0.0 and 0.0 should be treated differently when converting to a string.
The code you referenced is fine and is matching the current expected behavior.
UINT64 realMantissa = 0;
int realExponent = 0;
UINT32 mantissaHighBitIdx = 0;
if (((FPDOUBLE*)&value)->exp > 0)
I would expect you to be special casing denormal numbers, since the implicit bit is zero instead of one.
I think I have included the denormal numbers in the else condition. According to https://en.wikipedia.org/wiki/Double-precision_floating-point_format in the case of subnormals (e = 0) the double-precision number is described by:

(-1)^sign * 2^(1 - 1023) * 0.fraction = (-1)^sign * 2^(-1022) * 0.fraction

and 2^(-1022) * 0.fraction = 2^(-1022) * mantissa / 2^52 = mantissa * 2^(-1074). So you can see that in the else block, realExponent = -1074 and realMantissa = mantissa.

I will add comments here.
src/classlibnative/bcltype/bignum.h
Outdated
}
} m_initializer;

UINT8 m_len;
nit: Either make this UINT32 or place it after m_blocks (larger fields should come first). The resulting struct is going to be 8-bytes in either case...
I'll place m_len after m_blocks. Thanks.
You really should make it UINT32 rather than placing it after m_blocks. In general, the only advantage of using small integral types is storage size. In this case you can't actually save any storage due to alignment requirements; instead you risk making the code larger, because the compiler may need to insert widening conversions in some cases.
@mikedn Sounds reasonable. But that means I need to extend the static data as well; something like BIGSIZE and BIGPOWER10NUM should be extended to UINT32. Anyway, I'll change m_len and the related static data to UINT32 and see if there's any big impact. I don't want to rely on the compiler for the alignment, but the size of the static data should not increase much either.
Something like BIGSIZE and BIGPOWER10NUM should be extended to UINT32.
Yes, but only for consistency. I would expect the compiler to treat those as constants, so the impact on the generated code should be exactly 0. In fact, they should be constexpr, provided that the version(s) of VC++ used to build this code doesn't barf on it.
src/classlibnative/bcltype/bignum.h
Outdated
static const UINT8 BIGSIZE = 35;
static const UINT8 UINT32POWER10NUM = 8;
static const UINT8 BIGPOWER10NUM = 6;
static UINT32 m_power10UInt32Table[UINT32POWER10NUM];
static UINT32 const m_power10UInt32Table[] =
{
// Declare data inline
};
I have to use constexpr to accomplish this. The code compiles on my current Ubuntu 16.04 box, but I'm not sure whether it compiles on all platforms. I'll make the change and see if it passes the CI later.
Using constexpr would likely work, but it shouldn't be required. This should just be a UINT32 constant for the length and then a sequence of bytes representing the data. You likely need a second lookup table that contains the starting index for each entry, but it should just be a cast from UINT32* to BigNum* at that point.
BigNum::BigNum()
    : m_len(0)
{
a comment about explicitly not zeroing the memory due to perf reasons might be nice
break;
}

--pCurrentLeftBlock;
nit: Lengths of both are the same, you can use a single index to do the access
return *this;
}

int BigNum::Compare(const BigNum& lhs, UINT32 value)
Is having a compare really better than individual ==, !=, <, <=, >, and >= operators?
I can overload these operators but I think we should do this enhancement in another PR because a simple compare method should be enough to implement the algorithm.
I was mostly asking from a perf perspective. It seems like we are only actually reusing the Compare value in one place, every other place we are just doing a single comparison, which is likely cheaper than the full Compare check.
I'll remove BigNum::Compare(const BigNum& lhs, UINT32 value) since it is not used now. BigNum::Compare(const BigNum& lhs, const BigNum& rhs) is still useful when comparing the numerator and denominator, so I'll keep it.
int shiftBlocks = shift / 32;
int remaningToShiftBits = shift % 32;

for (int i = 0; i < shiftBlocks; ++i)
memset(0) would be much faster
(memset over the entire range that is)
Might be, it really depends on how large these bignums usually get. I suspect not very large.
As per the memcpy comment, memset should do the right thing regardless of size. The current max size is only 35 dwords (140 bytes).
I'll change the loop to memset.
return true;
}

for (UINT8 i = 0; i < m_len; ++i)
it seems like we could trivially detect all the cases where the value would become zero and only rely on m_len == 0
Did you mean we should ensure that whenever m_len > 0 we have at least one block != 0, so that IsZero() only needs to check m_len == 0? I think IsZero() should not depend on that assumption, which would couple it to an implementation outside of itself.
IsZero is part of the BigNum class, so it would only be dependent on a convention set for its own class. This probably isn't too important, so we can investigate later.
void BigNum::SetUInt64(UINT64 value)
{
m_len = 0;
m_blocks[0] = value & 0xFFFFFFFF;
m_blocks[1] = value >> 32;
m_len = (m_blocks[1] == 0) ? 1 : 2;
Nice! Thanks!
void BigNum::ExtendBlock(UINT32 newBlock)
{
m_blocks[m_len] = newBlock;
There are a few cases where an ExtendBlock(UINT32 blockValue, UINT32 numBlocks) would be useful. It can memset and increment m_len by the appropriate count all at once.
I tried this change but it slightly slows down the performance. So I'd like to keep this method and use memset when bulk blocks need to be extended.
Did you create a second ExtendBlock method or just replace this one entirely? I was suggesting that we create a second method and call it when the number of blocks being set (we loop the current method in a couple of places) is larger than some threshold.
UINT32 lastIndex = lhs.m_len - 1;
INT32 currentIndex = lastIndex;

while (currentIndex >= 0)
for (INT32 currentIndex = lhs.m_len - 1; currentIndex >= 0; --currentIndex) might be more readable? (The index needs to be signed; with UINT32 the >= 0 check would never terminate.)
// ========================================================================================================================================
// This implementation is based on the paper: https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf
// Besides the paper, some of the code and ideas are modified from http://www.ryanjuckett.com/programming/printing-floating-point-numbers/
// You must read these two marterials to fully understand the code.
materials
//
// 0.30102999566398119521373889472449 = log10V2
// 0.69 = 1 - log10V2 - epsilon (a small number to account for drift of floating point multiplication)
int k = (int)(ceil(double((int)mantissaHighBitIdx + e) * 0.30102999566398119521373889472449 - 0.69));
nit: it would be more readable to declare log10V2 and 0.69 as constants.
A few minor nits, but overall 👍
@tannergooding I've updated the code according to the feedback. Please take a look. Thanks!

@dotnet-bot test Ubuntu x64 corefx_baseline
We are still following up here. It looks like there is some process we need to follow; we'll let you know what needs to be done when we have more details. Thanks for your patience.
We are still following up here. We'll update as soon as we have news.
@mazong1123 thanks a lot for your effort in getting this ready. Finally, I am going to merge it :-)
This introduced a build break on Windows ARM (this config does not build by default in CI):
Fix: #13932
@jkotas how did you find the build break? Is there any way to check the ARM build in CI?
No worries. I happened to notice it on a different job. There are custom jobs that can be triggered manually based on the nature of the change; @dotnet-bot help prints the list, and you will find the Windows ARM build somewhere in there. Obviously, it is not cost effective to run all of them for each job.
This reverts commit 602ebe9.
Instead of leveraging snprintf, re-implement the ecvt function according to the paper: https://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf
Note:
UPDATE:
Now the new implementation is at least 2x faster than the old one.
Fix #10651