Add CPU implementation for torch._int_mm (s8*s8->s32)
#121792
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121792
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit b6c3fb8 with merge base ae983d2: BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
UT failing.
you need to add
#include <ATen/ops/_int_mm_native.h>
#include <ATen/ops/_int_mm_out_native.h>
in LinearAlgebra.cpp
to get rid of the clang build errors.
@@ -5866,6 +5866,34 @@ def _gen_pair(m, k, n):
    r"Expected result.size\(0\) to be 17 but got 16",
    lambda: torch._int_mm(genf_int(17, 8), genf_int(8, 32), out=genf_int(16, 31).int()))

@onlyCPU
We should expand the existing test case test__int_mm instead of creating a new one for CPU.
The test case for CUDA has many restrictions and checks because the CUDA implementation has many limitations on shapes, CUDA versions, etc. However, the CPU implementation does not have those limitations, so it will be much easier to keep the CUDA and CPU tests separate. Do you think it's OK? Thanks!
Can you still extend the CUDA case, and add some CPU-only shapes to the test?
Hi @lezcano, sorry I did not notice this comment. Do I still need to combine the CPU and CUDA test cases?
ideep::tensor::data_type::s32,
result.strides().vec()},
result.data_ptr());
// Create primitive desc
I thought you would go directly with mkldnn_gemm_s8s8s32: https://oneapi-src.github.io/oneDNN/v0/group__c__api__blas.html#gac1869eab851b572350fb450c50c61626
Which one has better performance, or are they the same?
Thanks for the suggestion. I have run benchmarks locally to compare the implementations using the BLAS API and the primitive API. In most cases, the BLAS API showed better performance. However, the BLAS API requires the input buffers to be contiguous. So, the current dispatching rule is: if the input buffers are contiguous, the BLAS API is used; otherwise, the primitive API is used. Do you think it's OK? Thanks.
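For reference, a minimal standalone sketch of that dispatching heuristic, assuming "contiguous" here means stride 1 along at least one dimension; the helper names contiguous_along_one_dim and use_blas_path are hypothetical, and this is an illustration rather than the actual ATen code:

#include <array>
#include <cstdint>
#include <iostream>

// A 2-D buffer qualifies for the oneDNN BLAS path when it has stride 1
// along at least one dimension, i.e. it is contiguous in rows or in columns.
bool contiguous_along_one_dim(const std::array<int64_t, 2>& strides) {
  return strides[0] == 1 || strides[1] == 1;
}

// Dispatch rule described above: prefer the BLAS API when both inputs
// qualify, otherwise fall back to the primitive (matmul) API.
bool use_blas_path(const std::array<int64_t, 2>& a_strides,
                   const std::array<int64_t, 2>& b_strides) {
  return contiguous_along_one_dim(a_strides) && contiguous_along_one_dim(b_strides);
}

int main() {
  std::cout << use_blas_path({8, 1}, {1, 4}) << "\n";  // 1: row-major x column-major
  std::cout << use_blas_path({8, 2}, {1, 4}) << "\n";  // 0: first input sliced with step 2
  return 0;
}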
UT failures are fixed. Thanks.
Thanks. It's added.
Minor points only. Feel free to merge after addressing them.
@@ -3506,5 +3508,63 @@ Tensor _weight_int8pack_mm_cpu(
  return C;
}

Tensor& _int_mm_out_cpu(const Tensor& self, const Tensor& mat2, Tensor& result) {
  TORCH_CHECK(self.dim() == 2, __func__, ": Expected self to be of dimension 2 but got ", self.dim());
__func__ is not standard. Better to define a constexpr at the top. Also, these are user-facing names; they should not use the internal names of functions.
Thanks. I have defined a string "int_mm_out_cpu" without the leading underscore.
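For illustration, a minimal standalone sketch of that pattern; kIntMmOutCpuName and check_is_2d are hypothetical names, and TORCH_CHECK is replaced by a plain exception here, so this is not the actual ATen code:

#include <cstdint>
#include <stdexcept>
#include <string>

// User-facing operator name defined once at the top of the file and used in
// error messages instead of the non-standard __func__ (note: no leading underscore).
constexpr const char* kIntMmOutCpuName = "int_mm_out_cpu";

void check_is_2d(int64_t dim) {
  if (dim != 2) {
    throw std::runtime_error(std::string(kIntMmOutCpuName) +
        ": Expected self to be of dimension 2 but got " + std::to_string(dim));
  }
}

int main() {
  check_is_2d(2);  // OK
  // check_is_2d(3) would throw:
  // "int_mm_out_cpu: Expected self to be of dimension 2 but got 3"
  return 0;
}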
Hi @lezcano, I encountered this CI failure:
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; first few of them are: .github/workflows/trunk.yml / macos-12-py3-arm64-mps / test (mps, 1, 1, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@Xia-Weiwen, which CPU flags are needed to use this function on Intel CPUs? Also, what will happen when using AMD CPUs?
Hi @maktukmak, this function essentially calls the oneDNN BLAS API, so it should support all x86 platforms.
Someone on Hugging Face reported an overflow on AMD Epyc 7R32. What could be the reason?
Example code to reproduce the issue on AMD Epyc 7R32 (typically available on AWS cloud g5 instances):

import pytest
import torch

@pytest.mark.parametrize("device", ['cpu', 'cuda'])
@pytest.mark.parametrize("m", [32, 64])
@pytest.mark.parametrize("k", [32, 64])
@pytest.mark.parametrize("n", [32, 64])
@pytest.mark.parametrize("use_transpose_a", [True, False])
@pytest.mark.parametrize("use_transpose_b", [True, False])
@pytest.mark.parametrize("non_contig_type", [0, 1, 2])
def test__int_mm_cpu(device, m, k, n, use_transpose_a, use_transpose_b, non_contig_type):
    # non_contig_type:
    # 0: the whole data buffer is contiguous (can be transposed)
    # 1: stride of one dimension is 1, but the whole buffer is not contiguous
    # 2: Neither stride is 1
    def genf_int_float(x, y, use_transpose, non_contig_type):
        if use_transpose:
            x, y = y, x
        if non_contig_type != 0:
            y = y * 2
        x_int8 = torch.randint(-128, 128, (x, y), dtype=torch.int8, device=device)
        x_float = x_int8.to(torch.float32)
        if non_contig_type == 1:
            x_int8 = x_int8[:, : y // 2]
            x_float = x_float[:, : y // 2]
        elif non_contig_type == 2:
            x_int8 = x_int8[:, ::2]
            x_float = x_float[:, ::2]
        if use_transpose:
            return x_int8.t(), x_float.t()
        return x_int8, x_float

    if non_contig_type != 0 and (m == 0 or k == 0):
        return
    a_int8, a_float = genf_int_float(m, k, use_transpose_a, non_contig_type)
    b_int8, b_float = genf_int_float(k, n, use_transpose_b, non_contig_type)
    c_int32 = torch._int_mm(a_int8, b_int8)
    assert torch.equal(c_int32.float(), torch.mm(a_float, b_float))
    c_int32_result = c_int32.new_empty(c_int32.size())
    torch._int_mm(a_int8, b_int8, out=c_int32_result)
    assert torch.equal(c_int32_result.float(), torch.mm(a_float, b_float))
If so, it seems to be an issue in oneDNN? @vpirogov
Hi @maktukmak @dacorvo, could you give a pointer to the issue you mentioned on Hugging Face?
Here is the link to the issue: huggingface/optimum-quanto#319
@dacorvo Thanks. Could you or @maktukmak open an issue to track this?
// x:s8 * w:s8 -> y:s32
// both inputs should be 2d
// In most cases, using DNNL blas API is faster but it requires a/b contiguous along one dimentsion
bool a_is_contigous = (mat1.stride(0) == 1 || mat1.stride(1) == 1);
typo: contiguous
Fixes #121647
Description
Currently, the op torch._int_mm only supports the CUDA device. This PR adds a CPU implementation for it. Besides the request from the issue, this op may also be useful for planned CPU implementations of LLM.int8() in Bitsandbytes.
The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.
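For reference, a minimal standalone sketch of the semantics of that nested-loop fallback (s8 * s8 -> s32), assuming row-major contiguous inputs; int_mm_reference is a hypothetical name, and this is an illustration rather than the actual ATen kernel:

#include <cstdint>
#include <iostream>
#include <vector>

// C[m x n] = A[m x k] * B[k x n], multiplying int8 inputs and
// accumulating the products in int32.
std::vector<int32_t> int_mm_reference(const std::vector<int8_t>& a,
                                      const std::vector<int8_t>& b,
                                      int64_t m, int64_t k, int64_t n) {
  std::vector<int32_t> c(m * n, 0);
  for (int64_t i = 0; i < m; ++i) {
    for (int64_t j = 0; j < n; ++j) {
      int32_t acc = 0;
      for (int64_t p = 0; p < k; ++p) {
        acc += static_cast<int32_t>(a[i * k + p]) * static_cast<int32_t>(b[p * n + j]);
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}

int main() {
  // 2x3 * 3x2 example with values at the int8 extremes.
  std::vector<int8_t> a = {127, -128, 1, 0, 5, -7};
  std::vector<int8_t> b = {1, 2, 3, 4, 5, 6};
  for (int32_t v : int_mm_reference(a, b, 2, 3, 2)) {
    std::cout << v << " ";  // -252 -252 -20 -22
  }
  std::cout << "\n";
  return 0;
}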
Test plan
python test/test_linalg.py -k test__int_mm_cpu
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10