
specialize undefinedness #1

Open

wants to merge 64 commits into krovatkin/switch4_4 from krovatkin/specialize_undefinedness

Conversation

Krovatkin
Owner

No description provided.

suo and others added 30 commits October 28, 2019 13:47
Summary:
Pull Request resolved: pytorch#28788

Okay, my last fix was wrong because it turns out that the base SHA is
computed at PR time using the actual repo's view of the base ref, not
the user's. So if the user doesn't rebase on top of the latest master
before putting up the PR, the diff is computed against the wrong base anyway.

This PR fixes the issue by not relying on any of these API details and
just getting the merge-base of the base and head refs, which should
guarantee we are diffing against the right thing.

This solution is taken from github/VisualStudio#1008

Test Plan: Imported from OSS

Differential Revision: D18172391

Pulled By: suo

fbshipit-source-id: 491a50119194508b2eefa5bd39fe813ca85f27b1
…lure (pytorch#28807)

Summary:
Pull Request resolved: pytorch#28807

`FAIL: test_numerical_consistency_per_channel (__main__.TestFakeQuantizePerChannel)`

This test is failing consistently on master, and we can't find a clean blame.
ghstack-source-id: 92763176

Test Plan: CI

Differential Revision: D18181496

fbshipit-source-id: 5948af06c4cb7dea9a8db1366deb7c12f6ec1c72
Summary:
Pull Request resolved: pytorch#28766

Add the warning message to explicitly ask the users to upgrade the deprecated `torch.jit.quantized` API to the new `torch.quantization.quantize_dynamic` API.
ghstack-source-id: 92711620

Test Plan: CI

Differential Revision: D18164903

fbshipit-source-id: e6aff2527f335c2d9f362e6856ce8597edb52aaa
Summary:
Fixes pytorch#26333

Fixes the operators missed in pytorch#26507 and includes a test for all operators.
Pull Request resolved: pytorch#27423

Differential Revision: D17835390

Pulled By: ezyang

fbshipit-source-id: 7a1351c7ccc8ad11454dbaa00d3701dcee4f06a8
Summary:
Pull Request resolved: pytorch#28426

Type casting is used in copy, and will also be used in the tensor iterator
in the next stacked diff. I move it to c10 so it can serve as a common
util for different things.

I also add two dynamic casting functions:
- fetch_and_cast
- cast_and_store

fetch_and_cast fetches a value with a dynamic type specified by a ScalarType
from a void pointer and casts it to a static type.

cast_and_store casts a statically typed value into a dynamic type specified
by a ScalarType, and stores it into a void pointer.
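To make the semantics concrete, here is a minimal C++ sketch of the two helpers (an illustration only, reduced to three dtypes with a stand-in `ScalarType` enum; the real c10 versions dispatch over every ScalarType and also run on device):

```cpp
#include <cstdint>
#include <stdexcept>

enum class ScalarType { Float, Double, Long };  // stand-in for c10::ScalarType

// Read a value whose runtime type is `src_type` from `ptr`, cast it to dest_t.
template <typename dest_t>
dest_t fetch_and_cast(ScalarType src_type, const void* ptr) {
  switch (src_type) {
    case ScalarType::Float:  return static_cast<dest_t>(*static_cast<const float*>(ptr));
    case ScalarType::Double: return static_cast<dest_t>(*static_cast<const double*>(ptr));
    case ScalarType::Long:   return static_cast<dest_t>(*static_cast<const int64_t*>(ptr));
  }
  throw std::runtime_error("unsupported dtype");
}

// Cast a statically typed value to the runtime type `dest_type`, store at `ptr`.
template <typename src_t>
void cast_and_store(ScalarType dest_type, void* ptr, src_t value) {
  switch (dest_type) {
    case ScalarType::Float:  *static_cast<float*>(ptr) = static_cast<float>(value); return;
    case ScalarType::Double: *static_cast<double*>(ptr) = static_cast<double>(value); return;
    case ScalarType::Long:   *static_cast<int64_t*>(ptr) = static_cast<int64_t>(value); return;
  }
  throw std::runtime_error("unsupported dtype");
}
```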

Test Plan: Imported from OSS

Differential Revision: D18170996

Pulled By: ezyang

fbshipit-source-id: 41658afd5c0ab58c6b6c510424893d9a2a0c059e
Summary:
Pull Request resolved: pytorch#28427

Fixes: pytorch#26401

This PR fixes the issue by using the newly added dynamic cast inside
`TensorIterator` so that, instead of converting the type at the beginning
(which generates extra kernel launches), the `TensorIterator` does a
load-cast-compute-store for each element while looping. So there is only
one read and one write of memory.
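As a rough illustration of the fused loop (a sketch, not the actual `TensorIterator` code; `fetch_as` is an assumed helper in the spirit of `fetch_and_cast` above):

```cpp
#include <cstdint>

enum class ScalarType { Float, Double };  // stand-in for c10::ScalarType

// Assumed helper: read one element of runtime type `t` from `p`, cast to T.
template <typename T>
T fetch_as(ScalarType t, const void* p) {
  return t == ScalarType::Float
      ? static_cast<T>(*static_cast<const float*>(p))
      : static_cast<T>(*static_cast<const double*>(p));
}

// One fused pass: load + dynamic cast + compute + store per element, so each
// operand is read exactly once and the result written exactly once, with no
// separately materialized converted copy (and no extra kernel launch on GPU).
void add_loop(float* out, const void* a, const void* b,
              ScalarType ta, ScalarType tb,
              int64_t stride_a, int64_t stride_b, int64_t n) {
  const char* pa = static_cast<const char*>(a);
  const char* pb = static_cast<const char*>(b);
  for (int64_t i = 0; i < n; ++i) {
    float x = fetch_as<float>(ta, pa + i * stride_a);
    float y = fetch_as<float>(tb, pb + i * stride_b);
    out[i] = x + y;
  }
}
```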

**nvprof:**
```python
import torch

_100M = 100 * 1024 ** 2
r = torch.randn(_100M, dtype=torch.float32, device='cuda')
d = torch.randn(_100M, dtype=torch.float64, device='cuda')
torch.cuda.synchronize()
torch.cuda.profiler.start()
r.add_(d)
torch.cuda.profiler.stop()
torch.cuda.synchronize()
```

```
==11407== NVPROF is profiling process 11407, command:
/home/xgao/anaconda3/bin/python simple.py
==11407== Profiling application: /home/xgao/anaconda3/bin/python
simple.py
==11407== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  2.0611ms         1  2.0611ms  2.0611ms  2.0611ms  _ZN2at6native18elementwise_kernelILi512ELi1EZNS0_15gpu_kernel_implIZZZNS0_15add_kernel_cudaERNS_14TensorIteratorEN3c106ScalarEENKUlvE_clEvENKUlvE1_clEvEUlddE_EEvS4_RKT_EUliE_EEviT1_
      API calls:  100.00%  1.05006s         1  1.05006s  1.05006s  1.05006s  cudaLaunchKernel
                    0.00%  2.7740us         2  1.3870us     673ns  2.1010us  cudaGetDevice
                    0.00%  2.3730us         1  2.3730us  2.3730us  2.3730us  cudaSetDevice
                    0.00%     830ns         1     830ns     830ns     830ns  cudaGetLastError
```

**benchmark**
```python
import torch
print(torch.__version__)
print(torch.version.git_version)

_100M = 100 * 1024 ** 2
r = torch.randn(_100M, dtype=torch.float32, device='cuda')
d = torch.randn(_100M, dtype=torch.float64, device='cuda')
torch.cuda.synchronize()
%timeit r.add_(d); torch.cuda.synchronize()
```

original
```
1.4.0a0+7d277b0
7d277b0
6.83 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

after
```
1.4.0a0+f0f2f65
f0f2f65
2.08 ms ± 139 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

For more benchmark, see: pytorch#28344

Test Plan: Imported from OSS

Differential Revision: D18170997

Pulled By: ezyang

fbshipit-source-id: 9c82c1c89583f3e6202c5d790b9b73ad9f960fad
Summary:
Pull Request resolved: pytorch#28428

Using the new type promotion and dynamic casting added to
`TensorIterator`, the copy kernels could be greatly simplified.

Benchmark on CUDA:
```python
import torch
import timeit
import pandas
import itertools
from tqdm.notebook import tqdm
import math
print(torch.__version__)
print()

_10M = 10 * 1024 ** 2

d = {}

for from_, to in tqdm(itertools.product(torch.testing.get_all_dtypes(), repeat=2)):
    if from_ not in d:
        d[from_] = {}
    a = torch.empty(_10M, dtype=from_, device='cuda')
    min_ = math.inf
    for i in range(100):
        torch.cuda.synchronize()
        start = timeit.default_timer()
        a.to(to)
        torch.cuda.synchronize()
        end = timeit.default_timer()
        elapsed = end - start
        if elapsed < min_:
            min_ = elapsed
    d[from_][to] = int(min_ * 1000 * 1000)

pandas.DataFrame(d)
```

original:
![image](https://user-images.githubusercontent.com/1032377/67623519-e3e6dd80-f7da-11e9-86ea-9cc9f237123b.png)

new:
![image](https://user-images.githubusercontent.com/1032377/67623527-fc56f800-f7da-11e9-82bd-dc1ff9821b68.png)

Test Plan: Imported from OSS

Differential Revision: D18170995

Pulled By: ezyang

fbshipit-source-id: 461b53641813dc6cfa872a094ae917e750c60759
Summary:
Since we have merged pytorch#27382 (thanks pbelevich!)
Pull Request resolved: pytorch#28804

Differential Revision: D18185714

Pulled By: yf225

fbshipit-source-id: 1148f5837fbf578843b989fc53fd334519943cdd
Summary:
Pull Request resolved: pytorch#28800

Fix up namespaces and emit a friendly error message when a registered class doesn't inherit from the right base

Test Plan: Imported from OSS

Differential Revision: D18175067

Pulled By: jamesr66a

fbshipit-source-id: 5c7cf3a49fb45db502d84eb3f9a69be126ee59fb
Summary:
Pull Request resolved: pytorch#28815

Add the unittest import
ghstack-source-id: 92789329

Test Plan: CI

Differential Revision: D18191989

fbshipit-source-id: c54e0309e21156c33e4fec01bfba17a1c30463c9
Summary:
GitHub commits:

facebook/fbthrift@724e939
facebook/folly@f4fb426
facebook/proxygen@95d4b19
facebook/mvfst@8b81314
facebookarchive/profilo@ac8faa6
pytorch/FBGEMM@5487e2b

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: 9b9f4cccd869638215c17111361a6f6c480c73af
Summary:
build error in internal pt mobile build

```
xplat/caffe2/torch/csrc/autograd/VariableTypeManual.cpp:118:49: error: address of function 'requires_grad' will always evaluate to 'true' [-Werror,-Wpointer-bool-conversion]
      autograd::utils::requires_grad_leaf_error(requires_grad)
      ~~~~~~~~                                  ^~~~~~~~~~~~~
xplat/caffe2/torch/csrc/autograd/VariableTypeManual.cpp:118:49: note: prefix with the address-of operator to silence this warning
```

I think the variable name in requires_grad_leaf_error is wrong.

Test Plan: mobile build works

Reviewed By: pbelevich

Differential Revision: D18192663

fbshipit-source-id: a3d3ebb9039022eb228c1d183a1076f65f9e84e0
Summary:
Adds `interpolate` functional and `Upsample` module support for the C++ API.

**Issue**: pytorch#25883

**Reviewer**: yf225
Pull Request resolved: pytorch#28413

Differential Revision: D18165014

Pulled By: yf225

fbshipit-source-id: ecae2f432a301b1f4afa7c038b2d104cbad139f2
Summary:
Skip the functions which were reverted.
Pull Request resolved: pytorch#28816

Reviewed By: hl475

Differential Revision: D18196628

Pulled By: houseroad

fbshipit-source-id: 30d43fcd57efb21b870c6a630b7ee305604dc603
Summary:
Pull Request resolved: pytorch#28809

### Summary

This PR adds an interactive mode to `bootstrap.sh`. Instead of passing the credential information via command parameters (`-t`, `-p`), we're going to ask the user to enter that information and save it to a config file, so that next time you don't have to enter it again. All you need now is a one-line command:

```shell
./bootstrap
```

### Test Plan

- TestApp.ipa can be installed on any devices
- Don't break CI jobs

Test Plan: Imported from OSS

Differential Revision: D18194032

Pulled By: xta0

fbshipit-source-id: a416ef7f13fa565e2c10bb55f94a8ce994b4e869
Summary:
GitHub commits:

pytorch/FBGEMM@edee492

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: b69770ac1a801b372fba0e112124b25ad1572821
Summary:
Pull Request resolved: pytorch#28768

Add Conv3dInt8

Test Plan: buck test mode/dev-nosan caffe2/test:quantized -- "Conv"

Reviewed By: jianyuh

Differential Revision: D18023661

fbshipit-source-id: 8fc7a4350baf29271dfd6fa3c1c4b10e60e2fdbf
Test Plan: revert-hammer

Differential Revision:
D18170995

Original commit changeset: 461b53641813

fbshipit-source-id: 1ebb119325d746a153982ac3209d3570a7e18d88
Test Plan: revert-hammer

Differential Revision:
D18170997

Original commit changeset: 9c82c1c89583

fbshipit-source-id: 8862d9628864d23a087f2895870386772a634e45
Test Plan: revert-hammer

Differential Revision:
D18170996

Original commit changeset: 41658afd5c0a

fbshipit-source-id: 394e84bbc52bdd708609304261ffa1513a771d57
Summary:
GitHub commits:

pytorch/FBGEMM@214b370

Test Plan: n/a

Reviewed By: zpao

fbshipit-source-id: aa03f9a37d316c232fdf2e4289c32ec68a22b469
Summary: Pull Request resolved: pytorch#28535

Differential Revision: D18197932

Pulled By: Krovatkin

fbshipit-source-id: 2639b205e899f800787ee57c157447d54e4669c3
Summary:
Pull Request resolved: pytorch#28827

When we print the `DynamicLinear` module, we don't want to print the scale and zero point, as they are not needed for dynamic quantization.

Let's take the output of RoBERTa model as an example:

Before this PR:
```
      (19): TransformerEncoderLayer(
        (dropout): Dropout(p=0.1, inplace=False)
        (attention): MultiheadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (input_projection): DynamicQuantizedLinear(in_features=1024, out_features=3072, scale=1.0, zero_point=0)
          (output_projection): DynamicQuantizedLinear(in_features=1024, out_features=1024, scale=1.0, zero_point=0)
        )
        (residual_mlp): ResidualMLP(
          (mlp): Sequential(
            (0): DynamicQuantizedLinear(in_features=1024, out_features=4096, scale=1.0, zero_point=0)
            (1): GeLU()
            (2): Dropout(p=0.1, inplace=False)
            (3): DynamicQuantizedLinear(in_features=4096, out_features=1024, scale=1.0, zero_point=0)
            (4): Dropout(p=0.1, inplace=False)
          )
        )
        (attention_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
      (20): TransformerEncoderLayer(
        (dropout): Dropout(p=0.1, inplace=False)
        (attention): MultiheadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (input_projection): DynamicQuantizedLinear(in_features=1024, out_features=3072, scale=1.0, zero_point=0)
          (output_projection): DynamicQuantizedLinear(in_features=1024, out_features=1024, scale=1.0, zero_point=0)
        )
        (residual_mlp): ResidualMLP(
          (mlp): Sequential(
            (0): DynamicQuantizedLinear(in_features=1024, out_features=4096, scale=1.0, zero_point=0)
            (1): GeLU()
            (2): Dropout(p=0.1, inplace=False)
            (3): DynamicQuantizedLinear(in_features=4096, out_features=1024, scale=1.0, zero_point=0)
            (4): Dropout(p=0.1, inplace=False)
          )
        )
        (attention_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
```

After this PR:
```
      (19): TransformerEncoderLayer(
        (dropout): Dropout(p=0.1, inplace=False)
        (attention): MultiheadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (input_projection): DynamicQuantizedLinear(in_features=1024, out_features=3072)
          (output_projection): DynamicQuantizedLinear(in_features=1024, out_features=1024)
        )
        (residual_mlp): ResidualMLP(
          (mlp): Sequential(
            (0): DynamicQuantizedLinear(in_features=1024, out_features=4096)
            (1): GeLU()
            (2): Dropout(p=0.1, inplace=False)
            (3): DynamicQuantizedLinear(in_features=4096, out_features=1024)
            (4): Dropout(p=0.1, inplace=False)
          )
        )
        (attention_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
      (20): TransformerEncoderLayer(
        (dropout): Dropout(p=0.1, inplace=False)
        (attention): MultiheadAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (input_projection): DynamicQuantizedLinear(in_features=1024, out_features=3072)
          (output_projection): DynamicQuantizedLinear(in_features=1024, out_features=1024)
        )
        (residual_mlp): ResidualMLP(
          (mlp): Sequential(
            (0): DynamicQuantizedLinear(in_features=1024, out_features=4096)
            (1): GeLU()
            (2): Dropout(p=0.1, inplace=False)
            (3): DynamicQuantizedLinear(in_features=4096, out_features=1024)
            (4): Dropout(p=0.1, inplace=False)
          )
        )
        (attention_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
```
ghstack-source-id: 92807317

Test Plan: CI

Differential Revision: D18197022

fbshipit-source-id: e41635330cfdfb008a0468d6a8ff67a06f7e1c59
Summary:
Pull Request resolved: pytorch#28836

as title

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:softmax_test
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
  Changed value project.buck_out='buck-out/opt' (was 'buck-out/dev')
  ... and 56 more. See logs for all changes
Parsing buck files: finished in 6.2 sec
Creating action graph: finished in 8.8 sec
Building: finished in 05:42.6 min (100%) 28336/28336 jobs, 23707 updated
  Total time: 05:57.7 min
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: Softmax
/proc/self/fd/4/softmax_test.py:57: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  """
# Mode: Eager
# Name: Softmax_N4_C3_H256_W256
# Input: N: 4, C: 3, H: 256, W: 256
Forward Execution Time (us) : 18422.487
```

Reviewed By: hl475

Differential Revision: D18202335

fbshipit-source-id: 0bb376cb465d998a49196e148d48d436126ae334
Summary:
This PR adds scripts that could be used for pytorch#26052

Example output:

```
Success: TestTorchDeviceTypeCPU.test_advancedindex_big_cpu
Success: TestTorchDeviceTypeCPU.test_addcmul_cpu
Success: TestTorchDeviceTypeCPU.test_addbmm_cpu_float32
Success: TestTorchDeviceTypeCPU.test_advancedindex_cpu_float16
Success: TestTorchDeviceTypeCPU.test_addmv_cpu
Success: TestTorchDeviceTypeCPU.test_addcdiv_cpu
Success: TestTorchDeviceTypeCPU.test_all_any_empty_cpu
Success: TestTorchDeviceTypeCPU.test_atan2_cpu
Success: TestTorchDeviceTypeCPU.test_advancedindex_cpu_float64
Success: TestTorchDeviceTypeCPU.test_baddbmm_cpu_float32
Success: TestTorchDeviceTypeCPU.test_atan2_edgecases_cpu
Success: TestTorchDeviceTypeCPU.test_add_cpu
Success: TestTorchDeviceTypeCPU.test_addr_cpu_bfloat16
Success: TestTorchDeviceTypeCPU.test_addr_cpu_float32
```
Pull Request resolved: pytorch#28127

Differential Revision: D18184255

Pulled By: mruberry

fbshipit-source-id: 7fd4bd9faf9f8b37b369f631c63f26eb965b16e7
Summary: as title

Test Plan: test in stacked diff

Reviewed By: csummersea

Differential Revision: D18123726

fbshipit-source-id: ce75db1e6f314a822a94ebdfc11988fab50ee836
Summary:
Pull Request resolved: pytorch#28837

The JIT code used in op bench is not compatible with the latest JIT code path. This diff aims to resolve that issue.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:add_test -- --use_jit
Building: finished in 02:29.8 min (100%) 7055/7055 jobs, 1 updated
  Total time: 02:30.3 min
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: add
# Mode: JIT
# Name: add_M64_N64_K64_cpu
# Input: M: 64, N: 64, K: 64, device: cpu
Forward Execution Time (us) : 118.052
```

Reviewed By: hl475

Differential Revision: D18197057

fbshipit-source-id: 92edae8a48abc4115a558a91ba46cc9c3edb2eb8
Summary:
Pull Request resolved: pytorch#28838

as title

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/operator_benchmark/pt:softmax_test -- --ai_pep_format true
  Total time: 02:36.7 min
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short

# Benchmarking PyTorch: Softmax
/proc/self/fd/4/softmax_test.py:57: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  """
PyTorchObserver {"type": "PyTorch_Softmax_N4_C3_H128_W128", "metric": "latency", "unit": "ms", "value": "4.83197245048359"}
PyTorchObserver {"type": "PyTorch_Softmax_N4_C3_H128_W128", "metric": "latency", "unit": "ms", "value": "4.839232977246866"}
PyTorchObserver {"type": "PyTorch_Softmax_N4_C3_H128_W128", "metric": "latency", "unit": "ms", "value": "4.7970924858236685"}
PyTorchObserver {"type": "PyTorch_Softmax_N4_C3_H128_W128", "metric": "latency", "unit": "ms", "value": "4.708389271399938"}
# Benchmarking PyTorch: Softmax
...
```

Reviewed By: hl475

Differential Revision: D18202504

fbshipit-source-id: 4a332763432b3b5886f241bb2ce49d4df481a6f3
Summary: Pull Request resolved: pytorch#28784

Differential Revision: D18178889

Pulled By: anjali411

fbshipit-source-id: 976810bf3f9def3a8f5ca6885b1e049b831f06f3
Summary:
Pull Request resolved: pytorch#28112

att

Test Plan:
reading

Imported from OSS

Differential Revision: D18173102

fbshipit-source-id: d8574758288bfce08eaf0f4f6163284defb56d6e
mrshenli and others added 10 commits October 29, 2019 19:39
…#28630)

Summary:
Pull Request resolved: pytorch#28630

This includes:
1. Respect autograd context in rpc.remote for builtin ops
2. Force setting autograd context in RRef.to_here() even if the
message for to_here() does not contain any tensor.

Test Plan: Imported from OSS

Differential Revision: D18138562

Pulled By: mrshenli

fbshipit-source-id: a39ec83e556d19130f22eb317927241a017000ba
Summary: Pull Request resolved: pytorch#28656

Test Plan: Imported from OSS

Differential Revision: D18138561

Pulled By: mrshenli

fbshipit-source-id: 798e7c00465b5a299f7b4642683bc407895bc7da
Summary:
Pull Request resolved: pytorch#28855

Resubmit:
OfflineTensor will be a shell to just carry the shape and dtype. No data will be stored. This should help us plumb through the onnxifi process.
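A rough sketch of the idea in C++ (field names are assumptions, not the actual class definition):

```cpp
#include <cstdint>
#include <vector>

// A "shell" tensor that carries only metadata, so shape and dtype can flow
// through the onnxifi lowering pass without allocating or copying any data.
struct OfflineTensor {
  std::vector<int64_t> dims;  // shape only
  int32_t data_type{0};       // dtype tag, e.g. a TensorProto_DataType value
  // intentionally no data pointer: contents are never stored
};
```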

Test Plan:
```
buck test caffe2/caffe2/fb/opt:onnxifi_with_offline_tensor_test
```

Reviewed By: ipiszy, ChunliF

Differential Revision: D18212824

fbshipit-source-id: 5c8aaed2ef11d719dfa2a2901875efd66806ea56
Summary:
Pull Request resolved: pytorch#27346

att

Test Plan:
test_jit.py

Imported from OSS

Differential Revision: D18182915

fbshipit-source-id: d646ae76ce44f5d12e974c776a3e92e5e163493c
Summary:
Pull Request resolved: pytorch#28866

While working on the fix for int32 instead of int64, we also need to take care of ClipRangesGatherSigridHash, since this is the operator that actually gets used during inference.

Test Plan: Added unittest to cover for the new case

Reviewed By: ipiszy

Differential Revision: D17147237

fbshipit-source-id: 2b562b72a6ae8f7282e54d822467b8204fb1055e
Summary:
Pull Request resolved: pytorch#28799

When the verbosity is quiet, hypothesis no longer prints the real
error when it finds multiple falsifying examples: it just says
that there are two failures.  This is supremely unuseful. Make
it print more.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D18206936

Pulled By: ezyang

fbshipit-source-id: 03bb60ba24cee28706bb3d1f0858c32b6743a109
…rch#28499)

Summary:
Before, we would only give the key we are looking for (i.e. typically
just "No such serialized tensor 'weight'"), no matter which submodule
we were looking in.
Now we error with "No such serialized tensor '0.conv1.weight'" or
similar.
The analogous information is added to missing-module error messages.

I threw in a test, and it saved me already...
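A minimal C++ sketch of the change (simplified stand-in types; names like `load_tensor` are assumptions, not the actual serialize API):

```cpp
#include <map>
#include <stdexcept>
#include <string>

using Archive = std::map<std::string, int>;  // stand-in for the real archive

// Thread the accumulated submodule path through loading, so a missing tensor
// is reported with its full key ("0.conv1.weight"), not just the leaf name.
void load_tensor(const Archive& archive, const std::string& prefix,
                 const std::string& name) {
  const std::string key = prefix.empty() ? name : prefix + "." + name;
  if (archive.find(key) == archive.end()) {
    throw std::runtime_error("No such serialized tensor '" + key + "'");
  }
  // ... actually deserialize archive.at(key) here ...
}
```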
Pull Request resolved: pytorch#28499

Differential Revision: D18122442

Pulled By: yf225

fbshipit-source-id: a134b6d06ca33de984a11d6fea923244bcd9fb95
Krovatkin force-pushed the krovatkin/specialize_undefinedness branch from cf36f83 to 91c3cdc on October 30, 2019 17:04
Krovatkin pushed a commit that referenced this pull request May 4, 2020
Smaller things to fix before enabling TE by default
Krovatkin pushed a commit that referenced this pull request May 5, 2020
Summary:
Pull Request resolved: pytorch#37101

Fixes pytorch#36954.

The basic concept is to streamline the process of rethrowing
c10::Error with extra error information.  This is in a few
steps:

- I completely remodeled the Error data type and the internal
  invariants.  Instead of manually adding in newlines, the
  message stack formatting process is responsible for inserting
  newlines and spacing as necessary.  Call sites are then
  modified to respect the new API model.
- TORCH_RETHROW macro is added, which adds context to an error
  message and then rethrows it.
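A minimal C++ sketch of the rethrow-with-context idea (not the actual c10::Error or TORCH_RETHROW implementation; the macro name here is illustrative):

```cpp
#include <exception>
#include <string>
#include <vector>

// The error keeps a stack of context strings; the formatter, not the call
// site, is responsible for inserting newlines and indentation.
struct Error : std::exception {
  std::string msg;
  std::vector<std::string> context;
  explicit Error(std::string m) : msg(std::move(m)) {}
  void add_context(std::string c) { context.push_back(std::move(c)); }
};

// Append context to an in-flight error and rethrow the original exception.
#define RETHROW_WITH_CONTEXT(e, ctx) \
  do {                               \
    (e).add_context(ctx);            \
    throw;                           \
  } while (false)

void step() {
  try {
    throw Error("This is an error");
  } catch (Error& e) {
    RETHROW_WITH_CONTEXT(e, "This is context 1");  // caller may add context 2
  }
}
```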

New internal assert failure looks like:

```
0 INTERNAL ASSERT FAILED at ../c10/test/util/exception_test.cpp:64, please report a bug to PyTorch.
Exception raised from TestBody at ../c10/test/util/exception_test.cpp:64 (most recent call first):
frame #0: <unknown function> + 0x6aab9 (0x7ff611d3aab9 in /data/users/ezyang/pytorch-tmp/build/lib/libc10.so)
frame #1: ...
```

Error message with context looks like:

```
This is an error
  This is context 1
  This is context 2
```

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D21202891

Pulled By: ezyang

fbshipit-source-id: 361cadd16bc52e5886dba08e79277771ada76169
Krovatkin pushed a commit that referenced this pull request Nov 3, 2020
Summary:
Pull Request resolved: pytorch#46966

These tests had false positives in TSAN for modifying thread local
variables:

```
WARNING: ThreadSanitizer: data race (pid=5364)
  Write of size 8 at 0x7b2c0004ff70 by thread T2:
    #0 free <null> (libtools_build_sanitizers_tsan-py.so+0xde6ad)
    #1 __GI__dl_deallocate_tls

  Previous write of size 1 at 0x7b2c0004ff71 by thread T3:
    #0 at::GradMode::set_enabled(bool) caffe2/aten/src/ATen/core/grad_mode.cpp:20 (libcaffe2_ATen-core.so+0x40e013)
    #1 torch::autograd::set_grad_enabled(_object*, _object*) caffe2/torch/csrc/autograd/init.cpp:143 (libcaffe2__C_impl_cuda.so+0x115ef0e)
    #2 _PyMethodDef_RawFastCallKeywords

  Thread T3 (tid=5385, finished) created by main thread at:
    #0 pthread_create <null> (libtools_build_sanitizers_tsan-py.so+0xc5a86)
    #1 PyThread_start_new_thread
```
ghstack-source-id: 115330433

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24584411

fbshipit-source-id: e35f704dfcb7b161a13a4902beaf8b1e41ccd596
Krovatkin pushed a commit that referenced this pull request May 25, 2021
Summary: added more statistics for static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
Krovatkin pushed a commit that referenced this pull request Jul 14, 2021
Summary:
Pull Request resolved: pytorch#60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. Following
the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 to
use a leaky singleton instead.
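A minimal C++ sketch of that pattern (`PoolType` stands in for the cuBLAS handle pool):

```cpp
// Construct-on-first-use with a deliberate leak: the pool is heap-allocated
// once and never destroyed, so no destructor can run inside
// __run_exit_handlers() at process exit and the shutdown deadlock above
// cannot occur.
struct PoolType { /* handles, mutexes, ... */ };

PoolType& getPool() {
  static PoolType* pool = new PoolType();  // leaked on purpose
  return *pool;
}
```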
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
Krovatkin pushed a commit that referenced this pull request Jul 22, 2021
Summary:
Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically, Thread 72 holds the GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. On the other hand,
Thread 79 holds the lock on DistAutogradContainer to remove a Tensor, and as
part of the TensorImpl destructor, concrete_decref_fn is called, which waits
for the GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release the GIL when we call `retrieveContext`
and acquire it later when needed.
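A sketch of that fix using pybind11's scoped guards (simplified stand-in types; `retrieveContext` here is a local stub, not the real lookup):

```cpp
#include <cstdint>
#include <pybind11/pybind11.h>

namespace py = pybind11;

struct Context { int64_t id; };                       // stand-in context type
Context retrieveContext(int64_t id) { return {id}; }  // stand-in for the lookup
                                                      // that takes the container lock

py::dict get_context_dict(int64_t context_id) {
  Context ctx{0};
  {
    py::gil_scoped_release release;      // drop the GIL before taking the lock
    ctx = retrieveContext(context_id);
  }
  // The GIL is re-acquired when `release` goes out of scope; only now do we
  // touch Python objects.
  py::dict d;
  d["context_id"] = ctx.id;
  return d;
}
```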
ghstack-source-id: 133493659

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
Krovatkin pushed a commit that referenced this pull request Aug 25, 2021
…ytorch#63339)

Summary:
Pull Request resolved: pytorch#63339

# Context
https://fb.workplace.com/groups/pytorch.dev/permalink/900474523864362/?comment_id=901125403799274&reply_comment_id=905023386742809

##### WHAT IS A STACK TRACE?
A stack trace (also called stack backtrace or stack traceback) is a report of the active stack frames at a certain point in time during the execution of a program.

Typically when an exception is thrown, one would expect to see the code (file:line) that threw the exception, and every intermediate frame up to and including the main function.

We are enabling android stack trace to help debugging on android devices.

Test Plan:
## Steps to test
```
buck build fbsource//xplat/caffe2/mode/aibench_pytorch_android -c pt.enable_qpl=0 -c pt.has_backtraces=1 fbsource//xplat/caffe2/fb/lite_predictor:lite_predictorAndroid#android-x86_64

one_world android emulator android-28

adb push ~/fbsource/buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictorAndroid#android-x86_64 /data/local/tmp

cd /data/local/tmp
./lite_predictorAndroid#android-x86_64

./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
```

## See how model file is not found stack traces is:

### before
```
./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true

Run with 2 threads
Run with 2 threads
Loading model...
terminating with uncaught exception of type c10::Error: open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
(no backtrace available)
Aborted
```

### after
```
134|generic_x86_64:/data/local/tmp $ ./lite_predictorAndroid#android-x86_64 --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
Run with 2 threads
Run with 2 threads
Loading model...
terminating with uncaught exception of type c10::Error: open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
 frame #0       c10::get_backtrace(unsigned long, unsigned long, bool)[0x59494274f10e]
 frame #1       [0x5949427b1eee]
 frame #2       [0x5949427b1eb2]
 frame #3       [0x5949427b1cdc]
 frame #4       std::__ndk1::function<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > ()>::operator()() const[0x5949427afc34]
 frame #5       c10::Error::Error(c10::SourceLocation, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >)[0x5949427b05b1]
 frame #6       c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949427aca5f]
 frame #7       caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949426b37b2]
 frame #8       caffe2::serialize::FileAdapter::FileAdapter(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x5949426b3903]
 frame #9       torch::jit::_load_for_mobile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, c10::optional<c10::Device>, std::__ndk1::unordered_map<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> >, std::__ndk1::hash<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::equal_to<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > >, std::__ndk1::allocator<std::__ndk1::pair<std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > > > >&)[0x5949422737bd]
 frame #10      torch::jit::_load_for_mobile(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, c10::optional<c10::Device>)[0x594942273769]
 frame #11      benchmark(std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, int, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&, bool, int, int, int, bool, int, bool, int, double, bool, bool, bool, std::__ndk1::basic_string<char, std::__ndk1::char_traits<char>, std::__ndk1::allocator<char> > const&)[0x59494189b21d]
 frame #12      main[0x594941882aff]
 frame #13      __libc_init[0x7b699d08578d]
```

### what we get for os:linux
```
(base) [pavithran@devvm1803.vll0 /data/users/pavithran/fbsource] ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor --model ./detect.bc --input_dims "1,3,192,192" --input_type float --warmup 20 --iter 5 --report_pep true
Run with 24 threads
Run with 24 threads
Loading model...
terminate called after throwing an instance of 'c10::Error'
  what():  open file failed, file path: ./detect.bc
Exception raised from RAIIFile at xplat/caffe2/caffe2/serialize/file_adapter.cc:13 (most recent call first):
frame #0: ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor() [0x20cb7fe]
frame #1: ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor() [0x20cb6c6]
frame #2: std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>::operator()() const + 0x54 (0x20ca4e4 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #3: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x57 (0x20ca9a7 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #4: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7a (0x20c823a in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #5: caffe2::serialize::FileAdapter::RAIIFile::RAIIFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x96 (0x206f3d6 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #6: caffe2::serialize::FileAdapter::FileAdapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x42 (0x206f502 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #7: torch::jit::_load_for_mobile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x30 (0x1be826c in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #8: torch::jit::_load_for_mobile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) + 0x35 (0x1be8214 in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #9: benchmark(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, int, int, int, bool, int, bool, int, double, bool, bool, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x16d (0x12093ad in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #10: main + 0x25c (0x11f933c in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)
frame #11: __libc_start_main + 0x105 (0x7fc7b9f2ed95 in /usr/local/fbcode/platform009/lib/libc.so.6)
frame #12: _start + 0x2a (0x11f902a in ./buck-out/gen/xplat/caffe2/fb/lite_predictor/lite_predictor)

Aborted (core dumped)
```

Reviewed By: dhruvbird

Differential Revision: D30135947

fbshipit-source-id: f50c634ef4545843305cad4b4a14a8776b1aec76
Krovatkin pushed a commit that referenced this pull request Sep 20, 2021
…4332)

Summary:
Pull Request resolved: pytorch#64332

With this diff, if a compiler bug occurs (unlikely, I know!) we'll be able to get a c++ stacktrace leading to the exception, rather than just a terse message.  E.g.,
```
RuntimeError: UNSUPPORTED DTYPE
Exception raised from compilation_error at ../torch/csrc/jit/tensorexpr/exceptions.h:32 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f966659b2eb in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x376f099 (0x7f966a195099 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x3763bf5 (0x7f966a189bf5 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: torch::jit::tensorexpr::CudaCodeGen::Initialize() + 0xdd8 (0x7f966a193368 in /fsx/users/bertrand/conda/envs/pytorch/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
```

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D30745610

Pulled By: bertmaher

fbshipit-source-id: a1cfaa7364ef4120de834e9cbe57ced1d082ab4e
Krovatkin pushed a commit that referenced this pull request Oct 6, 2021
Summary:
Pull Request resolved: pytorch#66009

Fixes
```
test_trace_c10_ops (jit.test_tracer.TestTracer) ... third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:374:24: runtime error: applying non-zero offset 4 to null pointer
    #0 0x7f5228f72227 in Eigen::internal::BlockImpl_dense<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false, true>::BlockImpl_dense(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:374
    #1 0x7f5228f7212c in Eigen::BlockImpl<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false, Eigen::Dense>::BlockImpl(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:166
    #2 0x7f5228f720dc in Eigen::Block<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >, -1, -1, false>::Block(Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> >&, long, long, long, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/Block.h:142
    #3 0x7f5229b0e059 in Eigen::DenseBase<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >::FixedBlockXpr<internal::get_fixed_value<int>::value, internal::get_fixed_value<long>::value>::Type Eigen::DenseBase<Eigen::Map<Eigen::Array<float, -1, -1, 1, -1, -1>, 0, Eigen::Stride<0, 0> > >::block<int, long>(long, long, int, long) third-party-buck/platform009/build/eigen/include/Eigen/src/Core/../plugins/BlockMethods.h:98
    #4 0x7f5229b0c5ca in caffe2::GenerateProposalsOp<caffe2::CPUContext>::RunOnDevice() caffe2/caffe2/operators/generate_proposals_op.cc:348
```
Also cleans up some data type and const issues around the area.

Test Plan: Sandcastle

Reviewed By: xush6528

Differential Revision: D31343046

fbshipit-source-id: fd9096c8e47a0aad529c72fd313f64ca98dcb80b
Krovatkin pushed a commit that referenced this pull request Oct 6, 2021
Summary:
Pull Request resolved: pytorch#66060

Fixes
```
testTumHistoryAdditionalLaser (caffe2.caffe2.fb.layers.tests.tum_history_test.TestTumHistory) ... caffe2/caffe2/operators/concat_split_op.h:363:74: runtime error: applying non-zero offset 8 to null pointer
    #0 0x7f8f39d29795 in caffe2::ConcatOp<caffe2::CPUContext>::RunOnDevice() caffe2/caffe2/operators/concat_split_op.h:363
    #1 0x7f8f39c4978d in caffe2::Operator<caffe2::CPUContext>::Run(int) caffe2/caffe2/core/operator.h:987
    #2 0x7f8f381fe9c9 in caffe2::SimpleNet::Run() caffe2/caffe2/core/net_simple.cc:67
    #3 0x7f8f38ee488e in caffe2::Workspace::RunNet(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) caffe2/caffe2/core/workspace.cc:289
```

Test Plan: Sandcastle

Reviewed By: dzhulgakov, xush6528

Differential Revision: D31366205

fbshipit-source-id: 566aa519677c9d371189e4b1f81d595732861efc
Krovatkin pushed a commit that referenced this pull request Jun 1, 2022
…78136)

This prevents `import torch` from accidentally crashing on machines with no Metal devices

Should prevent crashes reported in pytorch#77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:
```
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
 ...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi     ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)          ; (void *)0x00007fff7202be40: objc_msgSend
```

which corresponds to the `[m_device maxBufferLength]` call, where `m_device` is not initialized in
https://github.com/pytorch/pytorch/blob/2ae3c59e4bcb8e6e75b4a942cacc2d338c88e609/aten/src/ATen/mps/MPSAllocator.h#L171

Pull Request resolved: pytorch#78136
Approved by: https://github.com/seemethere
Krovatkin pushed a commit that referenced this pull request Jun 1, 2022
… of libtorch_python (pytorch#78028)

Summary:
This moves torch::class_<WorkerInfo> into `rpc_agent.cpp` so it gets registered in libtorch instead of libtorch_python. This is intermediate work toward getting torch::deploy to load an unmodified copy of libtorch; the current RPC setup is incompatible due to duplicate registrations.
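A sketch of the single-translation-unit registration pattern (the `WorkerInfo` body here is a stand-in, not the real class):

```cpp
#include <torch/custom_class.h>

// Register the custom class from exactly one translation unit inside
// libtorch, so loading libtorch_python on top of an embedded libtorch never
// calls torch::class_ a second time for the same qualified name.
struct WorkerInfo : torch::CustomClassHolder { /* ... */ };

static const auto workerInfoClass =
    torch::class_<WorkerInfo>("dist_rpc", "WorkerInfo");
```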

```
unknown file: Failure
C++ exception with description "Exception Caught inside torch::deploy embedded library:
Custom class with name __torch__.torch.classes.dist_rpc.WorkerInfo is already registered. Ensure that registration with torch::class_ is only called once.
Exception raised from registerCustomClass at ../aten/src/ATen/core/custom_class.cpp:61 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3bd9adb92e in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7f3bd9ab7068 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: torch::registerCustomClass(std::shared_ptr<c10::ClassType>) + 0x110 (0x7f3bc2258980 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::detail::class_base::class_base(std::string const&, std::string const&, std::string, std::type_info const&, std::type_info const&) + 0x3b9 (0x7f3bc225a419 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: [0x7f3ba45cfea1]
frame #5: <unknown function> + 0x1b5334 (0x5652bdab9334 in ./test_deploy)
frame #6: <unknown function> + 0x1b4f3e (0x5652bdab8f3e in ./test_deploy)
frame #7: <unknown function> + 0x1b519b (0x5652bdab919b in ./test_deploy)
frame #8: loadSearchFile(char const*) + 0x23e (0x7f3ba62f37f8 in /tmp/torch_deploy9ATEFg)
frame #9: deploy_set_self + 0x51 (0x7f3ba62f38f9 in /tmp/torch_deploy9ATEFg)
frame #10: torch::deploy::Interpreter::Interpreter(torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>) + 0x274 (0x5652bdaaa790 in ./test_deploy)
frame #11: void __gnu_cxx::new_allocator<torch::deploy::Interpreter>::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x81 (0x5652bdaaf58b in ./test_deploy)
frame #12: void std::allocator_traits<std::allocator<torch::deploy::Interpreter> >::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(std::allocator<torch::deploy::Interpreter>&, torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x4a (0x5652bdaae320 in ./test_deploy)
frame #13: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::_M_realloc_insert<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(__gnu_cxx::__normal_iterator<torch::deploy::Interpreter*, std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> > >, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xee (0x5652bdaae4a0 in ./test_deploy)
frame #14: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::emplace_back<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xb6 (0x5652bdaad258 in ./test_deploy)
frame #15: torch::deploy::InterpreterManager::InterpreterManager(unsigned long, std::shared_ptr<torch::deploy::Environment>) + 0x123 (0x5652bdaa83b1 in ./test_deploy)
frame #16: TorchpyTest_InitTwice_Test::TestBody() + 0x65 (0x5652bda075a9 in ./test_deploy)
frame #17: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x65 (0x5652bda944b7 in ./test_deploy)
frame #18: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x5a (0x5652bda8cfe7 in ./test_deploy)
frame #19: testing::Test::Run() + 0x100 (0x5652bda68622 in ./test_deploy)
frame #20: testing::TestInfo::Run() + 0x10f (0x5652bda68fb3 in ./test_deploy)
frame #21: testing::TestSuite::Run() + 0x121 (0x5652bda6980d in ./test_deploy)
frame #22: testing::internal::UnitTestImpl::RunAllTests() + 0x38e (0x5652bda756e6 in ./test_deploy)
frame #23: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x65 (0x5652bda9586b in ./test_deploy)
frame #24: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x5a (0x5652bda8e0f7 in ./test_deploy)
frame #25: testing::UnitTest::Run() + 0xc9 (0x5652bda73fd1 in ./test_deploy)
frame #26: RUN_ALL_TESTS() + 0x11 (0x5652bda169fa in ./test_deploy)
frame #27: main + 0x27 (0x5652bda10ce2 in ./test_deploy)
frame #28: <unknown function> + 0x2d310 (0x7f3bc0431310 in /usr/lib/libc.so.6)
frame #29: __libc_start_main + 0x81 (0x7f3bc04313c1 in /usr/lib/libc.so.6)
frame #30: _start + 0x25 (0x5652bda063b5 in ./test_deploy)
```

Test Plan: CI

Differential Revision: D36564258

Pull Request resolved: pytorch#78028
Approved by: https://github.com/rohan-varma
Krovatkin pushed a commit that referenced this pull request Jun 1, 2022
…ytorch#78276)

Fixes pytorch#325
**Summary**: Currently, the pytorchbot only allows rebasing onto the master branch. These modifications add a comment flag that also allows rebasing onto the viable/strict branch of pytorch/pytorch.
**Test Plan:** tested manually on a personal fork (swang392#1), and included a test case in test_tryrebase.py that checks that rebasing onto the viable/strict branch succeeds.
Pull Request resolved: pytorch#78276
Approved by: https://github.com/clee2000, https://github.com/janeyx99
Krovatkin pushed a commit that referenced this pull request Jun 7, 2022
… to conform with non-quantized counterpart filenames

Summary:
Names of analogous files in the quantized directory (previously snake case) were inconsistent with their non-quantized filename counterparts (pascal case). This is the first of a series of PRs that renames all files in the quantized directory (and its subdirectories) to pascal case.

`aten/src/ATen/native/quantized/qconv_unpack.cpp` has not been renamed yet because (for reasons currently unknown) renaming it makes `import torch` produce the error below (renaming `qlinear_unpack.cpp` also seems to fail some Phabricator CI tests for similar reasons). We suspect these may be undefined-behavior errors and will revisit renaming these files in a future PR; see the registration sketch after the trace for what this error class usually indicates.

```
terminate called after throwing an instance of 'c10::Error'
  what():  Type c10::intrusive_ptr<ConvPackedParamsBase<2> > could not be converted to any of the known types.
Exception raised from operator() at ../aten/src/ATen/core/jit_type.h:1735 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x55 (0x7f26745c0c65 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xb1 (0x7f26745bdcd1 in /data/users/dzdang/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1494e24 (0x7f2663b14e24 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0xfed0bc (0x7f266366d0bc in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #4: c10::detail::infer_schema::make_function_schema(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&&, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x5a (0x7f266366d71a in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #5: c10::detail::infer_schema::make_function_schema(c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>, c10::ArrayRef<c10::detail::infer_schema::ArgumentDef>) + 0x7b (0x7f266366e06b in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x1493f32 (0x7f2663b13f32 in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe227dd (0x7f26634a27dd in /data/users/dzdang/pytorch/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x14e0a (0x7f268c934e0a in /lib64/ld-linux-x86-64.so.2)
..........................truncated.............
```
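
For context, this error class typically surfaces when an operator schema refers to a custom class whose registration has not yet run, which makes it a static-initialization-order question and may be why a file rename changes the outcome; that diagnosis is unconfirmed here. A minimal registration sketch (the `MyParams` class and `my_ns` namespace are hypothetical stand-ins for the real packed-params type):

```cpp
#include <torch/custom_class.h>
#include <torch/script.h>

// Hypothetical stand-in for a packed-params holder such as
// ConvPackedParamsBase<2> from the error above.
struct MyParams : torch::CustomClassHolder {
  int64_t value;
  explicit MyParams(int64_t v) : value(v) {}
};

// The class must be registered before any operator schema using
// c10::intrusive_ptr<MyParams> is parsed; otherwise schema inference
// fails with "could not be converted to any of the known types".
TORCH_LIBRARY(my_ns, m) {
  m.class_<MyParams>("MyParams")
      .def(torch::init<int64_t>());
}
```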

Test Plan:
```
python test/test_quantization.py
```

Pull Request resolved: pytorch#77037

Approved by: https://github.com/jerryzh168
Krovatkin pushed a commit that referenced this pull request Jul 28, 2022
### Summary:
This PR implements PTQ for APoT FakeQuant. It runs models (a pre-trained ResNet-18 on the ImageNet dataset) to compare accuracy metrics across qconfig settings of uniform vs. APoT quantized activations and weights.

According to the collected accuracy stats, model #2 (uniform activation and APoT weight) shows a slight accuracy improvement over model #1 (uniform activation and uniform weight) at 8-bit and a significant improvement at 4-bit (see the "Accuracy Stats" section below).
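
To make the uniform-vs-APoT comparison concrete, here is a toy sketch; the level construction below (sums of two power-of-two terms, k = 2) is a simplification for illustration, not the exact APoT grid or qconfig used in this PR:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <set>
#include <vector>

// Snap x to the nearest value in a non-empty level set.
float snap(float x, const std::vector<float>& levels) {
  return *std::min_element(levels.begin(), levels.end(),
      [x](float a, float b) { return std::fabs(a - x) < std::fabs(b - x); });
}

int main() {
  // Uniform 3-bit toy grid on [0, 1): 8 evenly spaced levels.
  std::vector<float> uniform;
  for (int i = 0; i < 8; ++i) uniform.push_back(i / 8.0f);

  // APoT-style levels (k = 2): each level is the sum of two terms, where a
  // term is 0 or a power of two. The resulting grid is denser near zero,
  // where weight distributions concentrate.
  std::vector<float> terms = {0.0f};
  for (int e = -1; e >= -4; --e) terms.push_back(std::ldexp(1.0f, e));
  std::set<float> level_set;
  for (float a : terms)
    for (float b : terms) level_set.insert(a + b);
  std::vector<float> apot(level_set.begin(), level_set.end());

  for (float x : {0.03f, 0.2f, 0.7f})
    std::printf("x=%.2f  uniform->%.3f  apot->%.4f\n",
                x, snap(x, uniform), snap(x, apot));
}
```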

### Test Plan:
Run models with: `python test/quantization/core/experimental/fx_graph_mode_apot.py`

### Accuracy Stats:
All models below are evaluated on the test dataset; models #1–#3 are FX Graph Mode quantized.

**8-bit (uniform int8, APoT b = 8, k = 2)**

| Model | Activation / weight | Top-1 | Top-5 |
| --- | --- | --- | --- |
| #1 | Uniform / uniform | 64.43% | 85.62% |
| #2 | Uniform / APoT | 64.51% | 85.78% |
| #3 | APoT / APoT | 64.32% | 85.78% |

**4-bit (uniform int4, APoT b = 4, k = 2)**

| Model | Activation / weight | Top-1 | Top-5 |
| --- | --- | --- | --- |
| #1 | Uniform / uniform | 45.63% | 71.96% |
| #2 | Uniform / APoT | 64.24% | 85.56% |
| #3 | APoT / APoT | 45.40% | 76.21% |

**Reference models**

| Model | Top-1 | Top-5 |
| --- | --- | --- |
| Full precision (FX Graph Mode quantized) | 69.76% | 89.08% |
| Eager mode quantized | 69.49% | 88.90% |
Pull Request resolved: pytorch#81040
Approved by: https://github.com/jerryzh168
Krovatkin pushed a commit that referenced this pull request Aug 11, 2022
Hi!

I was playing with libfuzzer and found a bug when loading a model from a file via the `torch::jit::load` function.
There is an unhandled exception in caffe2/serialize when `stoull` is called on an unsanitized version string.
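
For reference, a minimal standalone sketch of the failure mode (the `parseVersionOrThrow` wrapper below is hypothetical, not the caffe2 code): `std::stoull` throws `std::invalid_argument` on non-numeric input and `std::out_of_range` on overflow, so attacker-controlled input needs a handler around the call:

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

// Wrap std::stoull so malformed input surfaces as a catchable,
// descriptive error instead of an unhandled exception.
uint64_t parseVersionOrThrow(const std::string& version) {
  try {
    return std::stoull(version);
  } catch (const std::invalid_argument&) {
    throw std::runtime_error("malformed version string: " + version);
  } catch (const std::out_of_range&) {
    throw std::runtime_error("version string out of range: " + version);
  }
}

int main() {
  try {
    std::cout << parseVersionOrThrow("ZZ") << '\n';  // "ZZ", as in the crash file
  } catch (const std::runtime_error& e) {
    std::cerr << e.what() << '\n';  // reported instead of reaching std::terminate
  }
}
```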

The bug can be reproduced with the `aot_model_compiler` binary:
```
aot_model_compiler --model=crash-stoull --model_name=name --model_version=1 --input_dims='1,3,224,224;2,2' --input_types='float;float'
```

The crash file is provided in [crash.zip](https://github.com/pytorch/pytorch/files/8701504/crash.zip).

gdb output:
```
Temporary breakpoint 1, main (argc=6, argv=0x7ffcd160f9f8) at /pytorch_master/binaries/aot_model_compiler.cc:87
87	      "Run NNC AOT compiler for pytorch model. Example usage:\n"
(gdb) c
Continuing.
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoull

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fa637f16859 in __GI_abort () at abort.c:79
#2  0x00007fa6381c1911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fa6381cd38c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fa6381cd3f7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fa6381cd6a9 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fa6381c42ce in std::__throw_invalid_argument(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x000000000247d567 in __gnu_cxx::__stoa<unsigned long long, unsigned long long, char, int> (__str=0x7ffcd160f228 "ZZ", __idx=0x0, __base=10, __convf=<optimized out>, __name=<optimized out>)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/ext/string_conversions.h:83
#8  std::__cxx11::stoull (__str="ZZ", __idx=0x0, __base=10) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/basic_string.h:6577
#9  caffe2::serialize::PyTorchStreamReader::init (this=this@entry=0x8c11ce0) at /pytorch_master/caffe2/serialize/inline_container.cc:145
#10 0x000000000247d9c7 in caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader (this=0x8c11ce0, in=std::shared_ptr<class caffe2::serialize::ReadAdapterInterface> (empty) = {...})
    at /pytorch_master/caffe2/serialize/inline_container.cc:88
#11 0x00000000035b7ba4 in __gnu_cxx::new_allocator<caffe2::serialize::PyTorchStreamReader>::construct<caffe2::serialize::PyTorchStreamReader, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (
    __p=0x2, __args=..., this=<optimized out>) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/ext/new_allocator.h:150
#12 std::allocator_traits<std::allocator<caffe2::serialize::PyTorchStreamReader> >::construct<caffe2::serialize::PyTorchStreamReader, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (__a=...,
    __p=0x2, __p@entry=0x8c11ce0, __args=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/alloc_traits.h:512
#13 0x00000000035b1988 in std::_Sp_counted_ptr_inplace<caffe2::serialize::PyTorchStreamReader, std::allocator<caffe2::serialize::PyTorchStreamReader>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x8c11cd0, __a=..., __args=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:551
#14 std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<caffe2::serialize::PyTorchStreamReader, std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x7ffcd160f3a8, __p=@0x7ffcd160f3a0: 0x10, __args=..., __a=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:683
#15 std::__shared_ptr<caffe2::serialize::PyTorchStreamReader, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x7ffcd160f3a0, __args=..., __tag=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1371
#16 std::shared_ptr<caffe2::serialize::PyTorchStreamReader>::shared_ptr<std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (this=0x7ffcd160f3a0,
    __args=..., __tag=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr.h:408
#17 std::allocate_shared<caffe2::serialize::PyTorchStreamReader, std::allocator<caffe2::serialize::PyTorchStreamReader>, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (__args=..., __a=...)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr.h:859
#18 std::make_shared<caffe2::serialize::PyTorchStreamReader, std::shared_ptr<caffe2::serialize::ReadAdapterInterface> > (__args=...)
    at /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr.h:875
#19 torch::jit::load (rai=std::shared_ptr<class caffe2::serialize::ReadAdapterInterface> (empty) = {...}, device=device@entry=..., Python Exception <class 'gdb.error'> No type named std::__detail::_Hash_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, true>.:
extra_files=std::unordered_map with 0 elements)
    at /pytorch_master/torch/csrc/jit/serialization/import.cpp:474
#20 0x00000000035b1ef6 in torch::jit::load (filename="crash-stoull", device=device@entry=..., Python Exception <class 'gdb.error'> No type named std::__detail::_Hash_node<struct std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, true>.:
extra_files=std::unordered_map with 0 elements) at /pytorch_master/torch/csrc/jit/serialization/import.cpp:444
#21 0x00000000035b1d22 in torch::jit::load (filename="", device=device@entry=...) at /pytorch_master/torch/csrc/jit/serialization/import.cpp:424
#22 0x00000000008f9be3 in main (argc=1, argv=0x7ffcd160f9f8) at /pytorch_master/binaries/aot_model_compiler.cc:128
```

Pull Request resolved: pytorch#77557
Approved by: https://github.com/Gamrix
Krovatkin pushed a commit that referenced this pull request Aug 12, 2022
### Summary:
This PR implements QAT for APoT FakeQuant. It runs QAT with FX graph mode quantized models (a pre-trained ResNet-18 on the full ImageNet dataset) to compare accuracy metrics across qconfig settings of uniform vs. APoT quantized activations and weights. It also refactors the APoT PTQ module `apot_fx_graph_mode_ptq.py` (previously `fx_graph_mode_apot.py`) so that helper functions shared between PTQ and QAT live in a separate file, `quantization_util.py`.

Model #2 (uniformly quantized activation, APoT quantized weight) shows accuracy comparable to model #1 (uniformly quantized activation, uniformly quantized weight) at 8-bit and a significant accuracy improvement at 4-bit (see the "Accuracy Stats" section below).

### Test Plan:
Run QAT models with: `python test/quantization/core/experimental/apot_qat.py`
Run PTQ models with: `python test/quantization/core/experimental/apot_ptq.py`

### Accuracy Stats:
Both models are FX Graph Mode quantized and evaluated on the test dataset.

**8-bit (uniform int8, APoT b = 8, k = 2)**

| Model | Activation / weight | Top-1 | Top-5 |
| --- | --- | --- | --- |
| #1 | Uniform / uniform | 69.67% | 89.04% |
| #2 | Uniform / APoT | 69.72% | 89.06% |

**4-bit (uniform int4, APoT b = 4, k = 2)**

| Model | Activation / weight | Top-1 | Top-5 |
| --- | --- | --- | --- |
| #1 | Uniform / uniform | 46.85% | 72.85% |
| #2 | Uniform / APoT | 66.45% | 86.23% |
Pull Request resolved: pytorch#83282
Approved by: https://github.com/jerryzh168
pytorchmergebot pushed a commit that referenced this pull request Dec 21, 2022
Summary:
The deleter of the operator's `unique_ptr` doesn't get called unless the `unique_ptr` is created after the op itself has been created.

This fixes the problem reported in
https://fb.workplace.com/groups/pytorch.edge.users/posts/1210708329799458/
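
A minimal sketch of the bug pattern, assuming a QNNPACK-style C API (the `qnnp_*` names below are stand-ins, not the real symbols): if the `unique_ptr` guard is constructed before the create call populates the raw pointer, it only ever owns `nullptr`, so the deleter never runs and the real operator leaks.

```cpp
#include <cstdio>
#include <memory>

// Stand-ins for a QNNPACK-style C API (hypothetical names).
struct qnnp_operator { int id; };
int qnnp_create_op(qnnp_operator** out) { *out = new qnnp_operator{42}; return 0; }
int qnnp_delete_op(qnnp_operator* op) { std::puts("deleter ran"); delete op; return 0; }

using OpPtr = std::unique_ptr<qnnp_operator, int (*)(qnnp_operator*)>;

int main() {
  {
    // Leaky pattern: the guard is created first, so it owns nullptr and the
    // operator later written into `raw` is never handed to the deleter.
    OpPtr guard(nullptr, qnnp_delete_op);
    qnnp_operator* raw = nullptr;
    qnnp_create_op(&raw);
  }  // nothing printed: unique_ptr skips the deleter for a null pointer

  {
    // Fixed pattern: create the op first, then construct the unique_ptr
    // from the populated pointer so ownership actually transfers.
    qnnp_operator* raw = nullptr;
    qnnp_create_op(&raw);
    OpPtr guard(raw, qnnp_delete_op);
  }  // prints "deleter ran"
}
```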

Test Plan:
# Testing memory leak fix

**With test code added in D41487340:**
```
cd ~/fbsource/xplat
buck run caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test:qsoftmax_test
```

Before this diff:

```
==2060866==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 608 byte(s) in 1 object(s) allocated from:
    #0 0x41bcd27 in calloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcd27)
    #1 0x405b692 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:77

Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x41bcb7f in malloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcb7f)
    #1 0x405b6a8 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:85

SUMMARY: AddressSanitizer: 1632 byte(s) leaked in 2 allocation(s).
```

After this diff:
- No errors
___

# Testing op correctness

```
cd ~/fbsource/fbcode
buck test caffe2/test/quantization:quantization -- test_qsoftmax
```
Passes
- https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749908834332/

Differential Revision: D41487341

Pull Request resolved: pytorch#89544
Approved by: https://github.com/mcr229