Is nvFuser changing the global `locale`? #62

kevinstephano · 2023-03-23T20:02:29Z

I thought this was a TorchScript-specific issue and not an nvFuser one, but then I tried torch.compile with both the Inductor and nvprims_nvfuser backends. Only nvFuser failed.

This is the script I have been playing with:

import locale
import torch
torch._C._jit_set_profiling_executor(True)
torch._C._jit_set_profiling_mode(True)
torch._C._jit_override_can_fuse_on_cpu(False)
torch._C._jit_override_can_fuse_on_gpu(False)
torch._C._jit_set_texpr_fuser_enabled(False)
torch._C._jit_set_nvfuser_enabled(False)
@torch.jit.script
def bias_gelu_fused(inp: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """Bias-GeLU fused"""
    x = inp + bias
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
class Fusion(torch.nn.Module):
    def __init__(self):
        super(Fusion, self).__init__()
    def forward(self, inp, bias):
        """Bias-GeLU fused"""
        x = inp + bias
        return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
model = torch.compile(Fusion(), backend='nvprims_nvfuser')
H = 768
inp = torch.randn(H, H, device="cuda")
bias = torch.randn(H, device="cuda")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = model(inp, bias)
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = model(inp, bias)
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"

naoyam · 2023-03-23T20:22:07Z

Do you know how to get this encoding, i.e. the result of locale.getpreferredencoding(), in C++?

kevinstephano · 2023-03-23T20:33:41Z

From the documentation, it looks like if you use a default constructor on std::locale, it copies the global locale.

std::locale loc;

kevinstephano · 2023-03-23T21:14:34Z

This definitely appears to be coming from codegen: I tried asserting around the FusionDefinition, and things were fine after initially creating the unscheduled Fusion IR. I don't see us setting std::locale::global anywhere in our code. I do see a few places in PyTorch's third_party components, but none of them look like they should impact nvFuser.

import locale
import torch
from nvfuser import FusionDefinition, DataType

H = 768
inputs = [
    torch.randn(H, H, device="cuda"),
    torch.randn(H, device="cuda"),
]

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(symbolic_sizes=[-1], contiguous=[True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.broadcast_in_dim(T1, output_shape=[768, 768], broadcast_dims=[1])
    T3 = fd.ops.add(T0, T2)
    S4 = fd.define_constant(0.500000, dtype=DataType.Double)
    T5 = fd.ops.mul(T3, S4)
    S6 = fd.define_constant(0.797885, dtype=DataType.Double)
    T7 = fd.ops.mul(T3, S6)
    S8 = fd.define_constant(0.0447150, dtype=DataType.Double)
    T9 = fd.ops.mul(T3, S8)
    T10 = fd.ops.mul(T9, T3)
    S11 = fd.define_constant(1.00000, dtype=DataType.Double)
    T12 = fd.ops.add(T10, S11)
    T13 = fd.ops.mul(T7, T12)
    T14 = fd.ops.tanh(T13)
    S15 = fd.define_constant(1.00000, dtype=DataType.Double)
    T16 = fd.ops.add(T14, S15)
    T17 = fd.ops.mul(T5, T16)
    T18 = fd.ops.cast(T17, dtype=DataType.Float)
    fd.add_output(T18)

print("ASSERT 1")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)
print("ASSERT 2")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = fd.execute(inputs)
print("ASSERT 3")
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
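
The repeated assertions in the script above can be factored into a small helper. This is only a sketch: `assert_encoding_unchanged` is a hypothetical name that is not part of nvFuser or PyTorch, and the body of the `with` block is a stand-in for the call being checked.

```python
import contextlib
import locale


@contextlib.contextmanager
def assert_encoding_unchanged(label: str):
    """Raise AssertionError if the preferred encoding differs after the block."""
    before = locale.getpreferredencoding()
    yield before
    after = locale.getpreferredencoding()
    assert before == after, (
        f"{label}: preferred encoding changed from {before} to {after}"
    )


# Stand-in body; in the script above it would be `out = fd.execute(inputs)`.
with assert_encoding_unchanged("no-op") as encoding:
    pass
```

Wrapping each step (definition, execution) in its own `with` block narrows down which call changes the encoding, without hard-coding "UTF-8" as the expected value.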

kevinstephano · 2023-03-24T01:45:50Z

This problem seems very specific to Python 3.10 and how the internal function _get_locale_encoding() is implemented. If I force the code down the emulated path, I don't see this problem. If it executes the C function, whose implementation I can't find, it fails. Note that the code for Python 3.9 and 3.11 is different!

https://github.com/python/cpython/blob/3.10/Lib/locale.py#L636-L647

    def _get_locale_encoding():
        if hasattr(sys, 'getandroidapilevel'):
            # On Android langinfo.h and CODESET are missing, and UTF-8 is
            # always used in mbstowcs() and wcstombs().
            return 'UTF-8'
        if sys.flags.utf8_mode:
            return 'UTF-8'
        encoding = getdefaultlocale()[1]
        if encoding is None:
            # LANG not set, default conservatively to ASCII
            encoding = 'ascii'
        return encoding

I think the C code is hitting this case but I am not sure why.

        if encoding is None:
            # LANG not set, default conservatively to ASCII
            encoding = 'ascii'
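
For context, the pure-Python fallback quoted above goes through getdefaultlocale(), which on non-Windows platforms derives its answer from environment variables rather than from the C runtime's current locale. That would be consistent with the emulated path being unaffected. The sketch below is a simplified illustration of that lookup, not CPython's actual code: `encoding_from_env` is a hypothetical name, and it skips the alias normalization that CPython performs.

```python
from typing import Mapping


def encoding_from_env(environ: Mapping[str, str]) -> str:
    """Simplified stand-in for the env-var lookup used by the fallback path."""
    # Same variable order that locale.getdefaultlocale() checks by default.
    for name in ("LC_ALL", "LC_CTYPE", "LANG", "LANGUAGE"):
        value = environ.get(name)
        if value:
            break
    else:
        # Nothing set: default conservatively to ASCII, as in the quoted code.
        return "ascii"
    # Drop an optional "@modifier", then take the codeset after the dot.
    value = value.split("@", 1)[0]
    if "." not in value:
        return "ascii"
    return value.split(".", 1)[1]


print(encoding_from_env({"LANG": "en_US.UTF-8"}))
print(encoding_from_env({}))
```

The result depends only on the mapping passed in, so nothing a native library does to the process locale at run time can change it.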

kevinstephano · 2023-03-24T03:01:05Z

This is the test I used, where I print the possible places to get the locale settings. Only getpreferredencoding() in Python 3.10 shows something different.

import locale
import torch
from nvfuser import FusionDefinition, DataType
import os

H = 768
inputs = [
    torch.randn(H, H, device="cuda"),
    torch.randn(H, device="cuda"),
]

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float, is_cpu=False)
    T1 = fd.define_tensor(symbolic_sizes=[-1], contiguous=[True], dtype=DataType.Float, is_cpu=False)
    T2 = fd.ops.broadcast_in_dim(T1, output_shape=[768, 768], broadcast_dims=[1])
    T3 = fd.ops.add(T0, T2)
    S4 = fd.define_constant(0.500000, dtype=DataType.Double)
    T5 = fd.ops.mul(T3, S4)
    S6 = fd.define_constant(0.797885, dtype=DataType.Double)
    T7 = fd.ops.mul(T3, S6)
    S8 = fd.define_constant(0.0447150, dtype=DataType.Double)
    T9 = fd.ops.mul(T3, S8)
    T10 = fd.ops.mul(T9, T3)
    S11 = fd.define_constant(1.00000, dtype=DataType.Double)
    T12 = fd.ops.add(T10, S11)
    T13 = fd.ops.mul(T7, T12)
    T14 = fd.ops.tanh(T13)
    S15 = fd.define_constant(1.00000, dtype=DataType.Double)
    T16 = fd.ops.add(T14, S15)
    T17 = fd.ops.mul(T5, T16)
    T18 = fd.ops.cast(T17, dtype=DataType.Float)
    fd.add_output(T18)

print("ASSERT 1", locale.getpreferredencoding(), locale.setlocale(locale.LC_CTYPE), locale.getlocale(), os.getenv('LANG'), os.getenv('PYTHONIOENCODING'), locale.getdefaultlocale())
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
with FusionDefinition() as fd:
    nvfuser_fusion_id0(fd)
print("ASSERT 2", locale.getpreferredencoding(), locale.setlocale(locale.LC_CTYPE), locale.getlocale(), os.getenv('LANG'), os.getenv('PYTHONIOENCODING'), locale.getdefaultlocale())
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"
out = fd.execute(inputs)
print("ASSERT 3", locale.getpreferredencoding(do_setlocale=True), locale.setlocale(locale.LC_CTYPE), locale.getlocale(), os.getenv('LANG'), os.getenv('PYTHONIOENCODING'), locale.getdefaultlocale())
assert locale.getpreferredencoding() == "UTF-8", f"Preferred encoding: {locale.getpreferredencoding()}"

kevinstephano · 2023-03-24T03:21:05Z

I think this is a Python 3.10 problem and not an nvFuser problem, so I am closing!

ksivaman · 2023-03-28T19:03:42Z

I am observing this issue with Python 3.8 as well @kevinstephano

kevinstephano · 2023-04-03T01:02:59Z

The problem is actually being caused by NVRTC. I am not sure what NVRTC is setting, but the result is that nl_langinfo(CODESET) returns ANSI_X3.4-1968. This would explain why both NNC and nvFuser have the issue but TorchInductor does not, as OAI-Triton compiles via LLVM-IR and not NVRTC.

I wrapped our specific call to nvrtcCompileProgram in executor_utils.cpp, and you can see the change with the following code placed before and after the call:

   char* locstr = setlocale(LC_CTYPE, NULL);
   char* encoding = nl_langinfo(CODESET);
   printf("1. Locale is %s\n", locstr);
   printf("1. Encoding is %s\n", encoding);
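
The same two probes are available from Python's locale module, which may be handy for checking without rebuilding nvFuser. This is a sketch under assumptions: `probe_c_locale` is a hypothetical helper name, and the commented-out middle line stands for whichever call is under suspicion.

```python
import locale


def probe_c_locale(label: str) -> tuple:
    """Print and return the current LC_CTYPE locale name and codeset."""
    # setlocale() without a second argument only queries the current setting.
    loc = locale.setlocale(locale.LC_CTYPE)
    # nl_langinfo is not available on every platform (for example Windows).
    if hasattr(locale, "nl_langinfo"):
        codeset = locale.nl_langinfo(locale.CODESET)
    else:
        codeset = None
    print(f"{label}. Locale is {loc}")
    print(f"{label}. Encoding is {codeset}")
    return loc, codeset


probe_c_locale("1")
# ... the call under suspicion would go here, e.g. `fd.execute(inputs)` ...
probe_c_locale("2")
```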

I thought the simple fix would be to do setlocale(LC_CTYPE, "C.UTF-8"), but that does not work. I am not sure what nl_langinfo is seeing to determine that the encoding is ASCII.

I discovered this while stepping through Python 3.8 with pdb: the library code in Python 3.8 calls locale.nl_langinfo(CODESET), which the emulated code in Python 3.10 does not. I am guessing the C implementation in CPython 3.10 does.

This issue with NVRTC and nvrtcCompileProgram was also noticed on Stack Overflow.

There is also an associated NvBug 3833924.

kevinstephano · 2023-04-04T20:44:23Z

I am going to close this issue, again, as there isn't anything we can do besides monitor the NVRTC bug.
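
In the meantime, code that runs after an NVRTC compilation can reduce its exposure by not relying on the preferred encoding at all. The sketch below assumes the affected code path is text-mode file I/O; the file name and contents are made up for illustration, and this is a defensive pattern rather than a fix for the NVRTC behavior.

```python
import os
import tempfile

text = "naïve café ✓"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# An explicit encoding means open() does not fall back to the
# locale-dependent preferred encoding.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

assert round_tripped == text
```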

kevinstephano closed this as completed Mar 24, 2023

kevinstephano self-assigned this Mar 24, 2023

kevinstephano reopened this Apr 3, 2023

kevinstephano closed this as completed Apr 4, 2023

ksivaman mentioned this issue Apr 5, 2023

Preferred global encoding format getting changed in TE NVIDIA/TransformerEngine#133

Closed

wujingyue mentioned this issue May 8, 2024

Write a sharded transformer block in nvFuser API. #2199

Closed

This was referenced Jun 6, 2024

Squeezed IterDomain ?S536{1} must concretize to IterType::Broadcast but found ?S536{1}. #2359

Closed

Merging IterDomains requires that their iteration types match. #2317

Closed

zifeitong mentioned this issue Jun 17, 2024

[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py vllm-project/vllm#5606

Merged

wujingyue mentioned this issue Oct 18, 2024

OpInfo has problems testing define_tensor. #3225

Closed

LoicGrobol added a commit to hopsparser/hopsparser that referenced this issue Feb 19, 2025

more patching up for nvrtc encoding bug

0cf0bec

see NVIDIA/Fuser#62
