Add CPU implementation for torch._int_mm (s8*s8->s32)
#121792
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/121792
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit b6c3fb8 with merge base ae983d2: BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
UT failing.
you need to add
#include <ATen/ops/_int_mm_native.h>
#include <ATen/ops/_int_mm_out_native.h>
in LinearAlgebra.cpp
to get rid of the clang build errors.
@@ -5866,6 +5866,34 @@ def _gen_pair(m, k, n):
    r"Expected result.size\(0\) to be 17 but got 16",
    lambda: torch._int_mm(genf_int(17, 8), genf_int(8, 32), out=genf_int(16, 31).int()))

@onlyCPU
We should expand the existing test case test__int_mm instead of creating a new one for CPU.
The test case for CUDA has many restrictions and checks because the CUDA implementation has many limitations on shapes, CUDA versions, etc. However, the CPU implementation does not have those limitations, so it will be much easier to keep the CUDA and CPU tests separate. Do you think it's OK? Thanks!
Can you still extend the CUDA case, and add some CPU-only shapes to the test?
Hi @lezcano, sorry I did not notice this comment. Do I still need to combine the CPU and CUDA test cases?
ideep::tensor::data_type::s32,
result.strides().vec()},
result.data_ptr());
// Create primitive desc
I thought you would go directly with mkldnn_gemm_s8s8s32: https://oneapi-src.github.io/oneDNN/v0/group__c__api__blas.html#gac1869eab851b572350fb450c50c61626
Which one has better performance, or are they the same?
Thanks for the suggestion. I have run benchmarks locally to compare the implementations using the BLAS API and the primitive API. In most cases, the BLAS API showed better performance. However, the BLAS API requires the input buffers to be contiguous. So, the current dispatching rule is: if the input buffers are contiguous, the BLAS API is used; otherwise, the primitive API is used. Do you think it's OK? Thanks.
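For reference, a minimal standalone sketch of that dispatching heuristic, assuming "contiguous" here means stride 1 along at least one dimension; the helper names contiguous_along_one_dim and use_blas_path are hypothetical, and this is an illustration rather than the actual ATen code:

#include <array>
#include <cstdint>
#include <iostream>

// A 2-D buffer qualifies for the oneDNN BLAS path when it has stride 1
// along at least one dimension, i.e. it is contiguous in rows or in columns.
bool contiguous_along_one_dim(const std::array<int64_t, 2>& strides) {
  return strides[0] == 1 || strides[1] == 1;
}

// Dispatch rule described above: prefer the BLAS API when both inputs
// qualify, otherwise fall back to the primitive (matmul) API.
bool use_blas_path(const std::array<int64_t, 2>& a_strides,
                   const std::array<int64_t, 2>& b_strides) {
  return contiguous_along_one_dim(a_strides) && contiguous_along_one_dim(b_strides);
}

int main() {
  std::cout << use_blas_path({8, 1}, {1, 4}) << "\n";  // 1: row-major x column-major
  std::cout << use_blas_path({8, 2}, {1, 4}) << "\n";  // 0: first input sliced with step 2
  return 0;
}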
UT failures are fixed. Thanks.
Thanks. It's added.
Minor points only. Feel free to merge after addressing them.
@@ -3506,5 +3508,63 @@ Tensor _weight_int8pack_mm_cpu(
  return C;
}

Tensor& _int_mm_out_cpu(const Tensor& self, const Tensor& mat2, Tensor& result) {
  TORCH_CHECK(self.dim() == 2, __func__, ": Expected self to be of dimension 2 but got ", self.dim());
__func__ is not standard. Better to define a constexpr at the top. Also, these are user-facing names; they should not use the internal names of functions.
Thanks. I have defined a string "int_mm_out_cpu" without the leading underscore.
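For illustration, a minimal standalone sketch of that pattern; kIntMmOutCpuName and check_is_2d are hypothetical names, and TORCH_CHECK is replaced by a plain exception here, so this is not the actual ATen code:

#include <cstdint>
#include <stdexcept>
#include <string>

// User-facing operator name defined once at the top of the file and used in
// error messages instead of the non-standard __func__ (note: no leading underscore).
constexpr const char* kIntMmOutCpuName = "int_mm_out_cpu";

void check_is_2d(int64_t dim) {
  if (dim != 2) {
    throw std::runtime_error(std::string(kIntMmOutCpuName) +
        ": Expected self to be of dimension 2 but got " + std::to_string(dim));
  }
}

int main() {
  check_is_2d(2);  // OK
  // check_is_2d(3) would throw:
  // "int_mm_out_cpu: Expected self to be of dimension 2 but got 3"
  return 0;
}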
Hi @lezcano, I encountered this CI failure:
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; first few of them are: .github/workflows/trunk.yml / macos-12-py3-arm64-mps / test (mps, 1, 1, macos-m1-stable). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@Xia-Weiwen, which CPU flags are needed to use this function on Intel CPUs? Also, what will happen when using AMD CPUs?
Hi @maktukmak, this function essentially calls the oneDNN BLAS API, so it should support all x86 platforms.
Someone on Hugging Face reported an overflow on AMD Epyc 7R32. What could be the reason?
Example code to reproduce the issue on AMD Epyc 7R32 (typically available on AWS cloud g5 instances):

import pytest
import torch

@pytest.mark.parametrize("device", ['cpu', 'cuda'])
@pytest.mark.parametrize("m", [32, 64])
@pytest.mark.parametrize("k", [32, 64])
@pytest.mark.parametrize("n", [32, 64])
@pytest.mark.parametrize("use_transpose_a", [True, False])
@pytest.mark.parametrize("use_transpose_b", [True, False])
@pytest.mark.parametrize("non_contig_type", [0, 1, 2])
def test__int_mm_cpu(device, m, k, n, use_transpose_a, use_transpose_b, non_contig_type):
    # non_contig_type:
    # 0: the whole data buffer is contiguous (can be transposed)
    # 1: stride of one dimension is 1, but the whole buffer is not contiguous
    # 2: Neither stride is 1
    def genf_int_float(x, y, use_transpose, non_contig_type):
        if use_transpose:
            x, y = y, x
        if non_contig_type != 0:
            y = y * 2
        x_int8 = torch.randint(-128, 128, (x, y), dtype=torch.int8, device=device)
        x_float = x_int8.to(torch.float32)
        if non_contig_type == 1:
            x_int8 = x_int8[:, : y // 2]
            x_float = x_float[:, : y // 2]
        elif non_contig_type == 2:
            x_int8 = x_int8[:, ::2]
            x_float = x_float[:, ::2]
        if use_transpose:
            return x_int8.t(), x_float.t()
        return x_int8, x_float

    if non_contig_type != 0 and (m == 0 or k == 0):
        return
    a_int8, a_float = genf_int_float(m, k, use_transpose_a, non_contig_type)
    b_int8, b_float = genf_int_float(k, n, use_transpose_b, non_contig_type)
    c_int32 = torch._int_mm(a_int8, b_int8)
    assert torch.equal(c_int32.float(), torch.mm(a_float, b_float))
    c_int32_result = c_int32.new_empty(c_int32.size())
    torch._int_mm(a_int8, b_int8, out=c_int32_result)
    assert torch.equal(c_int32_result.float(), torch.mm(a_float, b_float))
If so, it seems to be an issue in oneDNN? @vpirogov
Hi @maktukmak @dacorvo, could you give a pointer to the issue you mentioned on Hugging Face?
Here is the link to the issue: huggingface/optimum-quanto#319
@dacorvo Thanks. Could you or @maktukmak open an issue to track this?
// x:s8 * w:s8 -> y:s32
// both inputs should be 2d
// In most cases, using DNNL blas API is faster but it requires a/b contiguous along one dimentsion
bool a_is_contigous = (mat1.stride(0) == 1 || mat1.stride(1) == 1);
typo: contiguous
Fixes #121647
Description
Currently, the op torch._int_mm only supports the CUDA device. This PR adds a CPU implementation for it. Besides the request from the issue, this op may also be useful for planned CPU implementations of LLM.int8() in Bitsandbytes.
The implementation prefers mkldnn (oneDNN) kernels. If mkldnn is not available, a reference implementation with nested for loops is used.
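For reference, a minimal standalone sketch of the semantics of that nested-loop fallback (s8 * s8 -> s32), assuming row-major contiguous inputs; int_mm_reference is a hypothetical name, and this is an illustration rather than the actual ATen kernel:

#include <cstdint>
#include <iostream>
#include <vector>

// C[m x n] = A[m x k] * B[k x n], multiplying int8 inputs and
// accumulating the products in int32.
std::vector<int32_t> int_mm_reference(const std::vector<int8_t>& a,
                                      const std::vector<int8_t>& b,
                                      int64_t m, int64_t k, int64_t n) {
  std::vector<int32_t> c(m * n, 0);
  for (int64_t i = 0; i < m; ++i) {
    for (int64_t j = 0; j < n; ++j) {
      int32_t acc = 0;
      for (int64_t p = 0; p < k; ++p) {
        acc += static_cast<int32_t>(a[i * k + p]) * static_cast<int32_t>(b[p * n + j]);
      }
      c[i * n + j] = acc;
    }
  }
  return c;
}

int main() {
  // 2x3 * 3x2 example with values at the int8 extremes.
  std::vector<int8_t> a = {127, -128, 1, 0, 5, -7};
  std::vector<int8_t> b = {1, 2, 3, 4, 5, 6};
  for (int32_t v : int_mm_reference(a, b, 2, 3, 2)) {
    std::cout << v << " ";  // -252 -252 -20 -22
  }
  std::cout << "\n";
  return 0;
}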
Test plan
python test/test_linalg.py -k test__int_mm_cpu
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10