Is nvFuser changing the global locale? #62

Closed
kevinstephano opened this issue Mar 23, 2023 · 9 comments

@kevinstephano
Collaborator

I thought this was a TorchScript-specific issue rather than an nvFuser one, but then I tried torch.compile with both the Inductor and nvprims_nvfuser backends. Only nvFuser failed.

This is the script I have been playing with:

import locale
import torch
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)
torch._C._jit_override_can_fuse_on_cpu(False)
torch._C._jit_override_can_fuse_on_gpu(False)
torch._C._jit_set_texpr_fuser_enabled(False)
torch._C._jit_set_nvfuser_enabled(False)
@torch.jit.script
def bias_gelu_fused(inp: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Bias-GeLU fused"""
    x = inp + bias
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
class Fusion(torch.nn.Module):
    def __init__(self):
        super(Fusion, self).__init__()
    def forward(self, inp, bias):
        """Bias-GeLU fused"""
        x = inp + bias
        return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
model = torch.compile(Fusion(), backend='nvprims_nvfuser')
H = 768
inp = torch.randn(H, H, device="cuda")
bias = torch.randn(H, device="cuda")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = model(inp, bias)
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = model(inp, bias)
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
@naoyam
Collaborator

naoyam commented Mar 23, 2023

Do you know how to get this encoding in C++? locale.getpreferredencoding()

@kevinstephano
Collaborator Author

From the documentation, it looks like if you use a default constructor on std::locale, it copies the global locale.

std::locale loc;

@kevinstephano
Collaborator Author

This definitely appears to be coming from codegen: I tried asserting around the FusionDefinition, and things were still fine after initially creating the unscheduled Fusion IR. I don't see us setting std::locale::global anywhere in our code. I do see a few places in PyTorch's third_party components that set it, but none of them look like they should impact nvFuser.

import locale
import torch
from nvfuser import FusionDefinition, DataType

H = 768
inputs = [
    torch.randn(H, H, device="cuda"),
    torch.randn(H, device="cuda"),
]

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(symbolic_sizes=[-1], contiguous=[True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.broadcast_in_dim(T1, output_shape=[768, 768], broadcast_dims=[1])
    T3 = fd.ops.add(T0, T2)
    S4 = fd.define_constant(0.500000, dtype=DataType.Double)
    T5 = fd.ops.mul(T3, S4)
    S6 = fd.define_constant(0.797885, dtype=DataType.Double)
    T7 = fd.ops.mul(T3, S6)
    S8 = fd.define_constant(0.0447150, dtype=DataType.Double)
    T9 = fd.ops.mul(T3, S8)
    T10 = fd.ops.mul(T9, T3)
    S11 = fd.define_constant(1.00000, dtype=DataType.Double)
    T12 = fd.ops.add(T10, S11)
    T13 = fd.ops.mul(T7, T12)
    T14 = fd.ops.tanh(T13)
    S15 = fd.define_constant(1.00000, dtype=DataType.Double)
    T16 = fd.ops.add(T14, S15)
    T17 = fd.ops.mul(T5, T16)
    T18 = fd.ops.cast(T17, dtype=DataType.Float)
    fd.add_output(T18)

print("ASSERT 1")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)
print("ASSERT 2")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = fd.execute(inputs)
print("ASSERT 3")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"

@kevinstephano
Collaborator Author

This problem seems specific to Python 3.10 and to how the internal function _get_locale_encoding() is implemented. If I force the code down the emulated (pure-Python) path, I don't see the problem. If it executes the C function, whose implementation I can't find, it fails. Note that the code for Python 3.9 and 3.11 is different!

https://github.com/python/cpython/blob/3.10/Lib/locale.py#L636-L647

    def _get_locale_encoding():
        if hasattr(sys, 'getandroidapilevel'):
            # On Android langinfo.h and CODESET are missing, and UTF-8 is
            # always used in mbstowcs() and wcstombs().
            return 'UTF-8'
        if sys.flags.utf8_mode:
            return 'UTF-8'
        encoding = getdefaultlocale()[1]
        if encoding is None:
            # LANG not set, default conservatively to ASCII
            encoding = 'ascii'
        return encoding

I think the C code is hitting this case but I am not sure why.

        if encoding is None:
            # LANG not set, default conservatively to ASCII
            encoding = 'ascii'
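
To make the difference between the two paths concrete, here is a minimal sketch. It assumes that the emulated path derives the encoding from environment variables (via getdefaultlocale()), while the C path asks the C runtime for its current codeset. The helper names env_encoding and runtime_codeset are hypothetical, the environment parsing is deliberately simplified (no alias table), and locale.nl_langinfo is only available on POSIX platforms.

```python
import locale
import os


def env_encoding(environ=None):
    """Encoding implied by the environment variables alone (hypothetical helper).

    Simplified stand-in for what an environment-based lookup would see:
    the first variable that is set wins, and only the part after the dot
    is treated as the encoding.
    """
    if environ is None:
        environ = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            # "en_US.UTF-8@euro" -> "UTF-8"; "C" has no dot -> None
            _, dot, tail = value.partition(".")
            if dot:
                return tail.split("@", 1)[0]
            return None
    return None


def runtime_codeset():
    """Codeset the C runtime currently reports for LC_CTYPE (POSIX only)."""
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print("environment says:", env_encoding())
    print("C runtime says:  ", runtime_codeset())
```

If something inside the process alters the C runtime's locale state without touching the environment, the first value would stay the same while the second could change, which would be consistent with only one of the two paths failing.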

@kevinstephano
Collaborator Author

This is the test I used, in which I print the possible places to get the locale settings. Only getpreferredencoding() in Python 3.10 shows something different.

import locale
import torch
from nvfuser import FusionDefinition, DataType
import os

H = 768
inputs = [
    torch.randn(H, H, device="cuda"),
    torch.randn(H, device="cuda"),
]

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(symbolic_sizes=[-1], contiguous=[True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.broadcast_in_dim(T1, output_shape=[768, 768], broadcast_dims=[1])
    T3 = fd.ops.add(T0, T2)
    S4 = fd.define_constant(0.500000, dtype=DataType.Double)
    T5 = fd.ops.mul(T3, S4)
    S6 = fd.define_constant(0.797885, dtype=DataType.Double)
    T7 = fd.ops.mul(T3, S6)
    S8 = fd.define_constant(0.0447150, dtype=DataType.Double)
    T9 = fd.ops.mul(T3, S8)
    T10 = fd.ops.mul(T9, T3)
    S11 = fd.define_constant(1.00000, dtype=DataType.Double)
    T12 = fd.ops.add(T10, S11)
    T13 = fd.ops.mul(T7, T12)
    T14 = fd.ops.tanh(T13)
    S15 = fd.define_constant(1.00000, dtype=DataType.Double)
    T16 = fd.ops.add(T14, S15)
    T17 = fd.ops.mul(T5, T16)
    T18 = fd.ops.cast(T17, dtype=DataType.Float)
    fd.add_output(T18)

print("ASSERT 1", locale.getpreferredencoding(), locale.setlocale(locale.LC_CTYPE), locale.getlocale(), os.getenv('LANG'), os.getenv('PYTHONIOENCODING'), locale.getdefaultlocale())
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)
print("ASSERT 2", locale.getpreferredencoding(), locale.setlocale(locale.LC_CTYPE), locale.getlocale(), os.getenv('LANG'), os.getenv('PYTHONIOENCODING'), locale.getdefaultlocale())
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = fd.execute(inputs)
print("ASSERT 3", locale.getpreferredencoding(do_setlocale=True), locale.setlocale(locale.LC_CTYPE), locale.getlocale(), os.getenv('LANG'), os.getenv('PYTHONIOENCODING'), locale.getdefaultlocale())
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"

@kevinstephano
Collaborator Author

I think this is a Python 3.10 problem and not an nvFuser problem, so I am closing!

@kevinstephano kevinstephano self-assigned this Mar 24, 2023
@ksivaman
Member

I am observing this issue with Python 3.8 as well @kevinstephano

@kevinstephano
Collaborator Author

kevinstephano commented Apr 3, 2023

The problem is actually caused by NVRTC. I am not sure what NVRTC is setting, but the result is that nl_langinfo(CODESET) returns ANSI_X3.4-1968. This would explain why both NNC and nvFuser have the issue but TorchInductor does not, as OAI-Triton compiles via LLVM IR and not NVRTC.

I wrapped our specific call to nvrtcCompileProgram in executor_utils.cpp with the following code, and you can see the change:

   char* locstr = setlocale(LC_CTYPE, NULL);
   char* encoding = nl_langinfo(CODESET);
   printf("1. Locale is %s\n", locstr);
   printf("1. Encoding is %s\n", encoding);
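
The same before/after comparison can be sketched from the Python side, around any callable. This is only an illustration: ctype_state and run_and_compare are hypothetical helper names, and locale.nl_langinfo is only available on POSIX platforms.

```python
import locale


def ctype_state():
    """Return (LC_CTYPE locale name, codeset) as the C runtime reports them."""
    # setlocale() without a second argument only queries the current
    # setting; it does not change it.
    return locale.setlocale(locale.LC_CTYPE), locale.nl_langinfo(locale.CODESET)


def run_and_compare(fn, *args, **kwargs):
    """Call fn and return its result with the LC_CTYPE state before and after."""
    before = ctype_state()
    result = fn(*args, **kwargs)
    after = ctype_state()
    return result, before, after
```

For example, `out, before, after = run_and_compare(fd.execute, inputs)` would wrap the execute call from the scripts above, and comparing `before` with `after` would show whether the reported codeset moved during that call.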

I thought the simple fix would be to call setlocale(LC_CTYPE, "C.UTF-8") afterwards, but that does not work. I am not sure what nl_langinfo is looking at when it determines that the encoding is ASCII.

I discovered this while stepping through Python 3.8 with pdb: the library code in Python 3.8 calls locale.nl_langinfo(CODESET), which the emulated code in Python 3.10 does not, although I am guessing the CPython C implementation in Python 3.10 does.

This issue with NVRTC and nvrtcCompileProgram was also noticed on Stack Overflow.

There is also an associated NvBug, 3833924.

@kevinstephano kevinstephano reopened this Apr 3, 2023
@kevinstephano
Collaborator Author

I am going to close this issue, again, as there isn't anything we can do besides monitor the NVRTC bug.
