long double floating point calculations give inexact results #830
Are you sure you're using the same version of g++ on WSL versus your Linux machine?
Confirmed it happens with a consistent g++ version (6.1.1 20160510). The same compiler generates a byte-for-byte identical ELF64 object on native Linux and WSL, and it doesn't even call any math library functions. Looking at the assembly, everything that calculates
It's been a while since I looked at this sort of thing (pre-x86_64), but IIRC Windows and Linux have slightly different calling conventions regarding saving things like the FPU control registers during/after syscalls? Maybe relevant?
Thanks for reporting the issue. I just looked at this, and it is a difference in the default floating point control register (i.e. FPCSR\NPXCW) between Linux and NT. glibc assumes 0x37f but NT uses 0x27f. To confirm, I added lines to the sample above to set the control register, and it passed:
MxCsr is the same between Linux and NT, but I'll poke around to see if there are other control registers that are mismatched. I'll file a bug on our side to match the default configuration on Linux.
FTHOI, I compiled the sample program in Linux, then copied the resulting executable to WSL. It produced different results under Linux and WSL, 1.00000000000000000e-05 under Linux but 1.00000000000000008e-05 under WSL. It's not g++ that's the problem. Does WSL not support 80-bit floating point? Does WSL not use the FPU but only provide 64-bit software floating point math? FWLIW, awk under Linux produces the same output as Windows.
See the comment above yours. There is a difference between the floating point control register on Windows vs Linux. Steve is working on a fix.
@stehufntdev Thank you for the temporary solution (especially the magic number!!! :-)) and the ongoing work. With that fix, my (some other) computation executable now gives the same results under BashOnWindows as under Linux. @therealkenc Thanks for the clarification, that's exactly what happened. Just for completeness
Bumping this because it's a serious ABI issue and completely breaks any musl libc-linked binaries that use floating point scanf/strtod, among other things. It probably also breaks printf rounding of floating point values. We use long double internally for various reasons, including issues related to excess precision (on archs that have FLT_EVAL_METHOD>0), and the cost is irrelevant since the bulk of the code actually works with integers, but of course it badly breaks when run with an ABI-non-conforming control word setting or an emulator that fails to emulate x87 correctly. |
@richfelker - Thanks for pinging this thread; this issue is still on our backlog. If possible, could you describe a specific scenario that is not working because of this, to help us prioritize the fix?
OK, I'll work out a specific strtod call that returns a wrong result with the x87 control word set incorrectly.
@benhillis, I think several tests in the glibc test suite (e.g., |
With the control word set wrong, strtod("4398046511105",0) returns 4398046511104. This represents the exact boundary I expected to break: values with more than 53-(64-53) = 42 bits in the significand. |
I still think this is going to have a NodeJS repro if someone in the small intersection of coders that do numerical computation stuff and ECMAScript stuff (not me) wants to take a crack at it. The interpreter (JIT compiled or otherwise) is going to make hard assumptions about the floating point mode. Might be motivational.
+1 We just tried to transfer one of our projects onto WSL (https://github.com/Illumina/strelka), and traced failing unit tests down to differences like this:

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    long double n(7.59070505394721872907e+286);
    long double d(2.00440157654530257755e+291);
    // linux n = 3.78701810194652734355e-05
    // WSL   n = 3.78701810194652746001e-05
    n /= d;
    std::cout.precision(21);
    std::cout << "n = " << n << "\n";
}
```
@benhillis, this is about to be a two years old issue. Can we have some updates? Something we can do to help you guys fixing it? |
I just ran into this (or a similar) issue while performance testing two different ways to do a calculation in Node.js. On WSL the calculation started producing an incorrect result after a few thousand iterations, whereas on a RHEL virtual machine it was always correct (at least through 1,000,000 iterations). The specific iteration where the first error occurred was consistent as long as the script was unchanged, but it would vary somewhat if small changes were introduced into the script. Here is one such script:

```javascript
var sqrtfive = Math.sqrt(5);
var phi = (1 + sqrtfive) / 2;
var fibc = function (n) { return Math.round((Math.pow(phi, n)) / sqrtfive); };
var output = 0;
var correct_output = fibc(75);
var num_errors = 0;
for (var x = 0; x < 100000; x++) {
  output = fibc(75);
  if (output != correct_output) {
    num_errors++;
    console.log('error at x = ' + x);
    if (num_errors > 25) {
      break;
    }
  }
}
console.log(correct_output);
console.log(output);
```

Running the above through node v8.11.3 prints, on my machine:
(For the record, 2111485077978050 is the correct answer for Fib(75).)
Ping? Back in October I had indication out-of-band that a fix for this was coming, but I haven't seen anything about it since. |
It looks like a pretty old ticket; is there any plan to resolve it? I imagine the fix is very straightforward: just initialize the right control word in the process control block/context of the first Linux process.
Ok, this problem is more complicated (and more serious) than I thought -- the x87 control word is NOT inherited by a child process across
It is also confirmed that the control word is NOT inherited by a child (POSIX) thread. And the problem is not limited to x87 control word, it equally affects the SSE control and status register. Many manifestations of the same problem were reported earlier in this ticket, let me try to summarize here the full impact, based on what I have gathered so far:
Update:
Utterly untested, that is off the cuff.
@therealkenc Yes, it is possible to initialize the control word in the main process through I didn't make it up about POSIX compliance, whereas there could be some latitude regarding
Here "exact copy" is to be interpreted as all machine-specific/dependent state -- I know this because I am an OS developer. This also means I understand the fix for this problem is non-trivial. The symptom indicates the x87 control word is not maintained as part of the Linux process context; it exists only in the NT process context. What could further complicate this effort is that x87 coprocessor state is usually handled lazily during context switches: the FPU is initially marked as not present, and any FPU instruction triggers an FPU-unavailable trap, at which point the FPU context (including the control word) is restored. It is very likely the current Linux subsystem module won't even see this trap, and some changes to the NT microkernel proper may have to be made.
I'm quite aware of that, but the "no decent OS developer" comment is rather exaggerated. These days many operating systems do not make this optimization for various reasons; it's nowhere near as valuable as it was a couple decades ago. In any case you could make the fix that way for now, and optimize later if it turns out the performance hit is a problem, rather than leaving it broken for the sake of performance. FWIW the issue you cited is exactly why we're not going to workaround the WSL bug in musl; doing so would make all binaries generated with the workaround unconditionally poke the fpu when they shouldn't have to, and the cost would not go away when WSL gets fixed because it would be linked into the binaries rather than in OS-level code that will go away once it's fixed right. |
@richfelker Brother, I feel your pain. I am here not because I'm a kernel developer, but because an application my team developed was hit by this bug. I would very much like to see this problem fixed and am offering any help I can render. At this point I may be desperate enough to insert an |
I suspect, but can't prove, a variation of this argument is being applied by WSL devs too. It could be Which is good. Everyone is in agreement that without a perfect fix all hope is lost, and hopefully this thread can go quiet again. |
@therealkenc, I don't see them as equivalent arguments at all. One is about a suboptimal fix at the level of a component that ships with OS updates and never gets linked into applications, so that, if it's costly, it can go away in a future OS update when a better fix is made. The other is about hard-coding a workaround into application binaries in a way that affects running them on systems not affected by the bug, including real Linux systems and the expected post-fix WSL systems at some point in the future. |
Although you want the same output between native Linux and WSL, I think double/long double do not have an explicit size in the C++ standard, so this is just like a Linux distro running on different hardware and therefore getting a different result.
@Po-wei It's true that the C/C++ language standards don't specify the exact precision of long double. Other platforms are free to implement C/C++ with different ABIs, but WSL is trying to implement the Linux x86/x86-64 ABI, so that's the relevant standard. This is definitely a bug in WSL. Note that several of the people posting here are well-known experts who are involved in making C/C++ work on Linux in the first place. They know what they're talking about :-)
@njsmith Thanks for your explanation! |
@Po-wei In some ways, it does break the C++ standard library (besides the break of the C library demonstrated previously). Specifically, it breaks For example, in WSL:

```cpp
// g++ -std=c++11 test_ld.cpp && ./a.out
#include <iostream>
#include <limits>

int main()
{
    typedef std::numeric_limits<long double> TT;
    std::cout << "   mantissa bits : " << TT::digits << "\n";    // =64
    std::cout << "         epsilon : " << TT::epsilon() << "\n"; // =1.0842e-19
    long double t = 1.0;
    while (1 + t/2 != 1) {
        t = t/2;
    }
    std::cout << "measured epsilon : " << t << "\n"; // WSL: =2.22045e-16
}
```

So even standard-conforming code (or a programmer who didn't assume the accuracy of
Why has this problem never been fixed? Why is it not mentioned in the WSL FAQ? |
I assume this is one of the advantages of running a real Linux kernel under virtualization (which is what WSL2 does) rather than having this aspect of syscalls be handled by the NT kernel directly. |
Unless WSL1 is entirely abandoned in favor of WSL2, I think this still needs to be fixed. Maybe now that WSL2 is out there and performance-sensitive users are going to be switching to it, the simple fix for WSL1 will be less controversial...? |
Ping. People are still hitting this. See the new reference above and the post on the musl list: https://www.openwall.com/lists/musl/2019/09/25/16 |
Ping. Is WSL1 still a thing, and if so, can this please be fixed? |
Ping. A user just hit this again today. So apparently people still are using WSL1. Being that this would literally be a one-line fix, is there any reason it can't be done?? |
Yup, it is a shame it hasn't been fixed yet. |
This issue causes the ns-3 int64x64 test to fail.
WSL2 fails on one of my laptops, so I use WSL1. On another (newer) laptop, WSL2 "works" but I need to use VirtualBox, and I have had no end of difficulty trying to get the two to work together, so WSL2 is not an option on that computer either. So, some of us need WSL1. Please fix it! |
@jlearman they didn't fix this in 8 years because the NT folks refuse to do it... Unless a big company or government asks for it, I doubt they will ever bother. If I may suggest, vagrant and multipass work with VirtualBox. |
This program
outputs (Linux, expected result):
1.00000000000000000e-05
But in BashOnWindows, it outputs:
1.00000000000000008e-05
The long double data type in C/C++ under Linux, using GCC for the x86/x86-64 platform, has an 80-bit format with a 64-bit mantissa.
When a program using long double runs in BashOnWindows, it behaves as if it had a 53-bit mantissa, just like the usual "double" (but with almost the same exponent range as long double).
This harms functions like std::pow(long double, int) and hence boost::lexical_cast<double>(char*) (boost version 1.55), since internally it uses long double as an intermediate result (boost version 1.60 fixes this problem); therefore boost::program_options can read double-typed options inaccurately as well (even for exact input, e.g. 0.03125 = 1/32). Likewise this program:
Should output:
Good.
While in BashOnWindows, it output:
What? t0 = 7.46536864129530799e-4948 != 3.64519953188247460e-4951