This repository was archived by the owner on Nov 17, 2023. It is now read-only.
[Activation] GELU precision mismatch between MXNet and PyTorch in the CPU version #18826
@sxjscience Can you confirm whether the operator is dispatched to its mkldnn version?
Sorry I do not have the bandwidth to confirm that. I think mkldnn should be turned on by default. Are you able to reproduce this?
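One way to check whether the GELU activation actually goes through the mkldnn path is to inspect the build features and enable MKL-DNN's verbose logging. A minimal sketch, assuming an MXNet build of that era exposes `mxnet.runtime.Features` and honors the `MKLDNN_VERBOSE` environment variable:

```python
import os
# Ask MKL-DNN (oneDNN) to log every primitive it executes; set this
# before importing MXNet so the library picks it up.
os.environ['MKLDNN_VERBOSE'] = '1'

import mxnet as mx
from mxnet.runtime import Features

# Was this MXNet build compiled with MKL-DNN support?
print('MKLDNN in build:', Features().is_enabled('MKLDNN'))

mx.npx.set_np()
x = mx.np.random.normal(0, 1, (10000,))
# If GELU is dispatched to MKL-DNN, a verbose line mentioning an
# eltwise/gelu primitive should be printed when this executes.
y = mx.npx.leaky_relu(x, act_type='gelu')
y.asnumpy()  # force execution of the lazily evaluated operator
```

If the build reports MKLDNN as enabled and a gelu eltwise primitive shows up in the verbose output, the operator is being dispatched to MKL-DNN.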
In fact, I cannot run the reproducer as-is. I am trying to fix the precision problem in #18827. Please let me know if it works for you. Thanks.
@TaoLv Sorry, I missed some imports. Here is the full reproducer:

```python
import math

import mxnet as mx
from numpy.testing import assert_allclose

mx.npx.set_np()

a = mx.np.random.normal(0, 1, (10000,))
# MXNet GELU (CPU), exposed through leaky_relu with act_type='gelu'.
b = mx.npx.leaky_relu(a, act_type='gelu')
# Reference: the exact erf-based GELU formula.
c = a * 0.5 * (1.0 + mx.npx.erf(a / math.sqrt(2.0)))

import torch
a_torch = torch.from_numpy(a.asnumpy())
b_torch = torch.nn.functional.gelu(a_torch)

# PyTorch should agree with the erf-based reference ...
assert_allclose(b_torch.cpu().numpy(), c.asnumpy(), 1E-4, 1E-4)
# ... but the MXNet CPU result is the one reported to mismatch.
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)
```

(Compiling MXNet takes some time for me, so it would be helpful if you could check that...)
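As an aside (not part of the original exchange): a common cause of this kind of CPU-only gap is that one library evaluates the exact erf-based GELU while the other uses the tanh approximation; the two formulas disagree by more than 1e-4 for some inputs, which is enough to trip the tolerances above. A small NumPy sketch comparing the two forms:

```python
import numpy as np
from scipy.special import erf

x = np.linspace(-6.0, 6.0, 100001)

# Exact GELU: x * Phi(x), with Phi the standard normal CDF.
gelu_exact = x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

# Widely used tanh approximation of GELU.
gelu_tanh = 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

# The largest gap exceeds 1e-4, so an erf-based and a tanh-based
# implementation will not agree within the reproducer's tolerance.
print('max |exact - tanh| =', np.abs(gelu_exact - gelu_tanh).max())
```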
Does the issue still exist after Tao's PR?
Yes, it's solved.
The CPU version of `mx.npx.leaky_relu(x, act_type='gelu')` has different precision from PyTorch. The minimal reproducible example:
The GPU version has no issue:
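Presumably the GPU check is the same comparison with the arrays placed on a GPU context. A minimal sketch, assuming a CUDA-enabled build and that `mx.np.random.normal` accepts a `ctx` argument in this release:

```python
import mxnet as mx
import torch
from numpy.testing import assert_allclose

mx.npx.set_np()

# Same comparison as the CPU reproducer, but with the data on the GPU.
a = mx.np.random.normal(0, 1, (10000,), ctx=mx.gpu())
b = mx.npx.leaky_relu(a, act_type='gelu')

a_torch = torch.from_numpy(a.asnumpy())
b_torch = torch.nn.functional.gelu(a_torch)

# On GPU the MXNet result is reported to agree with PyTorch within 1e-4.
assert_allclose(b_torch.cpu().numpy(), b.asnumpy(), 1E-4, 1E-4)
```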
@pengzhao-intel @ciyongch
Error: