Fix TorchAO related bugs; revert device_map changes #10371

a-r-r-o-w · 2024-12-24T10:07:49Z

Reverts part of #10256

Currently, we support:

Serialization/deserialization for the commonly tested TorchAO dtypes. Not all quantization types are exhaustively tested.
Loading sharded/unsharded state dicts WITHOUT device_map. Using device_map calls into accelerate code, which is problematic for the reasons stated in [Core] refactor model loading #10013, so we error out. This will be tackled in a future release.

This PR:

Errors out if device_map is used
Skips the device_map tests because they didn't work correctly before, and we don't support device_map now either. I didn't realize they weren't working because we were not checking that the weights are AffineQuantizedTensors.
Improves the tests so that both sharded and non-sharded checkpoints are tested in the fast tests. Slow tests only run the sharded checkpoint of Flux.
Adds a slow serialization test to make sure at least one large checkpoint is serialized and its output compared.
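
A minimal sketch of the two checks described in the list above, using stand-in names. `check_loading_args`, `all_weights_quantized`, and `FakeQuantizedTensor` are hypothetical and are not the diffusers implementation; the exception type and message are assumptions as well.

```python
class FakeQuantizedTensor:
    """Hypothetical stand-in for torchao's AffineQuantizedTensor."""


def check_loading_args(quant_method, device_map):
    """Reject device_map when the model is loaded with TorchAO quantization."""
    if quant_method == "torchao" and device_map is not None:
        # Assumed error type and wording, for illustration only.
        raise NotImplementedError(
            "device_map is not supported with TorchAO quantization yet; "
            "load without device_map instead."
        )


def all_weights_quantized(weights, quantized_type):
    """Return True only if every weight is an instance of quantized_type."""
    return all(isinstance(w, quantized_type) for w in weights.values())


# A test that only compares outputs can pass with unquantized weights;
# checking the weight types catches that case.
weights = {"proj.weight": FakeQuantizedTensor(), "norm.weight": 1.0}
print(all_weights_quantized(weights, FakeQuantizedTensor))  # False
```

The type check is the part the earlier device_map tests were missing: without it, a model that silently fell back to unquantized weights would still load and run.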

Context: https://huggingface.slack.com/archives/C065E480NN9/p1735010991364189

Running slow tests now

…nabled (#10256)" This reverts commit 41ba8c0.

HuggingFaceDocBuilderDev · 2024-12-24T10:14:29Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

DN6

Need to also add torchao to nightly test quantization matrix

diffusers/.github/workflows/nightly_tests.yml

Line 358 in 6dfaec3

- backend: "bitsandbytes"

a-r-r-o-w · 2024-12-24T13:08:47Z

Need to also add torchao to nightly test quantization matrix

Just curious how we make sure that the quantization-related python packages are installed? I couldn't find relevant LoC that handles this, nor does the workflow file have relevant install commands to install gguf/bitsandbytes/torchao
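
As a quick local check, something like the sketch below reports which of these backends are importable in the current environment. It only uses the standard library, and it does not show how the workflow installs them; `available_backends` is a hypothetical helper name.

```python
import importlib.util

# Import names of the quantization backends mentioned in this thread.
BACKENDS = ("bitsandbytes", "gguf", "torchao")


def available_backends(names=BACKENDS):
    """Map each package name to True if it can be imported in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in available_backends().items():
    print(f"{name}: {'installed' if found else 'missing'}")
```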

a-r-r-o-w · 2024-12-24T13:11:55Z

I did some more testing of serialization with int4wo, int8wo, int8dq, and uintx. All of them can be serialized, but for uintx, torch.load fails with a uintx-related pickle error. I think the fix needs to come from TorchAO, but otherwise I can verify that the weights are being saved in quantized precision and the directories reflect the expected file sizes.

Ran all the nightly tests and added a few more changes. Everything is passing 🤞 Going to run the bitsandbytes fast/slow tests now

a-r-r-o-w · 2024-12-24T13:32:19Z

For future reference, the error when loading a serialized uint4wo quantized model is:

Traceback (most recent call last):
  File "/home/aryan/work/diffusers/dump5.py", line 340, in <module>
    transformer = FluxTransformer2DModel.from_pretrained("/raid/aryan/flux_uint4wo", torch_dtype=dtype, use_safetensors=False)
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/aryan/work/diffusers/src/diffusers/models/modeling_utils.py", line 823, in from_pretrained
    model_file = _merge_sharded_checkpoints(sharded_ckpt_cached_folder, sharded_metadata)
  File "/home/aryan/work/diffusers/src/diffusers/models/model_loading_utils.py", line 339, in _merge_sharded_checkpoints
    merged_state_dict.update(torch.load(part_file_path, weights_only=True, map_location="cpu"))
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/torch/serialization.py", line 1359, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torchao.dtypes.uintx.uintx_layout.UintxTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([UintxTensor])` to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

cc @jerryzh168
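
The error above comes from an allowlist check on the globals referenced by the pickle. The mechanism can be illustrated with the standard library alone. This is not torch's implementation, only the "restricting globals" pattern from the Python `pickle` documentation; `AllowlistUnpickler` and `restricted_loads` are hypothetical names, and `fractions.Fraction` stands in for a class that is not allowlisted by default.

```python
import io
import pickle
from fractions import Fraction


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves globals found in an explicit allowlist."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        # Called for every global (class or function) the pickle refers to.
        if (module, name) not in self._allowed:
            raise pickle.UnpicklingError(f"Unsupported global: {module}.{name}")
        return super().find_class(module, name)


def restricted_loads(payload, allowed=()):
    """Unpickle bytes, permitting only the (module, name) pairs in allowed."""
    return AllowlistUnpickler(io.BytesIO(payload), allowed).load()


payload = pickle.dumps(Fraction(1, 2))

try:
    restricted_loads(payload)  # nothing allowlisted, so the class is rejected
except pickle.UnpicklingError as err:
    print(err)

# Allowlisting the class lets the same payload load.
print(restricted_loads(payload, allowed={("fractions", "Fraction")}))
```

In the same spirit, the traceback suggests `torch.serialization.add_safe_globals([UintxTensor])` as the way to extend torch's allowlist; whether that alone is enough for these checkpoints has not been verified here.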

a-r-r-o-w · 2024-12-24T13:53:01Z

Something like below works for loading uintx serialized weights and running inference:

import torch
from accelerate import init_empty_weights
from diffusers import FluxPipeline, FluxTransformer2DModel, TorchAoConfig

# Serialize the model
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/Flux.1-Dev",
    subfolder="transformer",
    quantization_config=TorchAoConfig("uint4wo"),
    torch_dtype=torch.bfloat16,
)
transformer.save_pretrained("/path/to/flux_uint4wo", safe_serialization=False, max_shard_size="50GB")
# ...

# Load the model
state_dict = torch.load("/path/to/flux_uint4wo/diffusion_pytorch_model.bin", weights_only=False, map_location="cpu")
with init_empty_weights():
    transformer = FluxTransformer2DModel.from_config("/path/to/flux_uint4wo/config.json")
transformer.load_state_dict(state_dict, strict=True, assign=True)

We can't load it directly in diffusers because we use a hardcoded weights_only=True here:

diffusers/src/diffusers/models/model_loading_utils.py

Line 339 in 6dfaec3

    
           merged_state_dict.update(torch.load(part_file_path, weights_only=True, map_location="cpu"))

a-r-r-o-w · 2024-12-24T14:16:59Z

@sayakpaul I'm seeing two test failures for BnB. I think they are unrelated but could you confirm when free?

_______________________________________________________________________________________________________________________ SlowBnb8bitTests.test_generate_quality_dequantize _______________________________________________________________________________________________________________________

self = <bnb.test_mixed_int8.SlowBnb8bitTests testMethod=test_generate_quality_dequantize>

    def test_generate_quality_dequantize(self):
        r"""
        Test that loading the model and unquantize it produce correct results.
        """
>       self.pipeline_8bit.transformer.dequantize()

tests/quantization/bnb/test_mixed_int8.py:415: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/diffusers/models/modeling_utils.py:482: in dequantize
    return hf_quantizer.dequantize(self)
src/diffusers/quantizers/base.py:205: in dequantize
    model = self._dequantize(model)
src/diffusers/quantizers/bitsandbytes/bnb_quantizer.py:558: in _dequantize
    model = dequantize_and_replace(
src/diffusers/quantizers/bitsandbytes/utils.py:281: in dequantize_and_replace
    model, has_been_replaced = _dequantize_and_replace(
src/diffusers/quantizers/bitsandbytes/utils.py:264: in _dequantize_and_replace
    _, has_been_replaced = _dequantize_and_replace(
src/diffusers/quantizers/bitsandbytes/utils.py:264: in _dequantize_and_replace
    _, has_been_replaced = _dequantize_and_replace(
src/diffusers/quantizers/bitsandbytes/utils.py:247: in _dequantize_and_replace
    new_module.weight = torch.nn.Parameter(dequantize_bnb_weight(module.weight, state))
src/diffusers/quantizers/bitsandbytes/utils.py:185: in dequantize_bnb_weight
    out32, Sout32 = bnb.functional.igemmlt(im, state.CxB, Sim, state.SB)
/opt/venv/lib/python3.10/site-packages/typing_extensions.py:2853: in wrapper
    return arg(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[127,   0,   0,  ...,   0,   0,   0],
        [  0,   0,   0,  ...,   0,   0,   0],
        [  0,   0,   0,  .... 0,   0,  ...,   0,   0,   0],
        [  0,   0,   0,  ...,   0,   0, 127]], device='cuda:0',
       dtype=torch.int8)
B = tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
   ... ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0', dtype=torch.int8), SA = (torch.Size([256, 256]), 'col32')
SB = (torch.Size([1536, 256]), 'row'), out = None, Sout = None, dtype = torch.int32

    @deprecated(
        "igemmlt is deprecated and will be removed in a future release. Please use int8_linear_matmul instead.",
        category=FutureWarning,
    )
    def igemmlt(
        A: torch.Tensor,
        B: torch.Tensor,
        SA: Tuple[torch.Size, str],
        SB: Tuple[torch.Size, str],
        out: Optional[torch.Tensor] = None,
        Sout: Optional[Tuple[torch.Size, str]] = None,
        dtype=torch.int32,
    ):
        if SA is not None and SA[1] != "row":
>           raise NotImplementedError(f"Only row-major format inputs are supported, but got format `{SA[1]}`")
E           NotImplementedError: Only row-major format inputs are supported, but got format `col32`

/opt/venv/lib/python3.10/site-packages/bitsandbytes/functional.py:2268: NotImplementedError
_________________________________________________________________________________________________________________________________ SlowBnb8bitTests.test_quality _________________________________________________________________________________________________________________________________

self = <bnb.test_mixed_int8.SlowBnb8bitTests testMethod=test_quality>

    def test_quality(self):
        output = self.pipeline_8bit(
            prompt=self.prompt,
            num_inference_steps=self.num_inference_steps,
            generator=torch.manual_seed(self.seed),
            output_type="np",
        ).images
        out_slice = output[0, -3:, -3:, -1].flatten()
        expected_slice = np.array([0.0376, 0.0359, 0.0015, 0.0449, 0.0479, 0.0098, 0.0083, 0.0295, 0.0295])
    
        max_diff = numpy_cosine_similarity_distance(expected_slice, out_slice)
>       self.assertTrue(max_diff < 1e-2)
E       AssertionError: False is not true

tests/quantization/bnb/test_mixed_int8.py:378: AssertionError
======================================================================================================================================= warnings summary ========================================================================================================================================
tests/quantization/bnb/test_4bit.py: 12 warnings
tests/quantization/bnb/test_mixed_int8.py: 5 warnings
  /__w/diffusers/diffusers/src/diffusers/utils/testing_utils.py:547: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
    arry = torch.load(BytesIO(response.content))

tests/quantization/bnb/test_mixed_int8.py::BnB8bitBasicTests::test_keep_modules_in_fp32
tests/quantization/bnb/test_mixed_int8.py::BnB8bitTrainingTests::test_training
  /opt/venv/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:315: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
    warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /__w/diffusers/diffusers/src/diffusers/quantizers/bitsandbytes/utils.py:181: FutureWarning: This function is deprecated. Please use `int8_double_quant` instead.
    im, imt, SCim, SCimt, coo_tensorim = bnb.functional.double_quant(im)

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /__w/diffusers/diffusers/src/diffusers/quantizers/bitsandbytes/utils.py:182: FutureWarning: The layout transformation operations will be removed in a future release. Please use row-major layout only.
    im, Sim = bnb.functional.transform(im, "col32")

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /opt/venv/lib/python3.10/site-packages/bitsandbytes/functional.py:2812: FutureWarning: This function is deprecated and will be removed in a future release.
    prev_device = pre_call(A.device)

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /opt/venv/lib/python3.10/site-packages/bitsandbytes/functional.py:2818: FutureWarning: The layout transformation operations will be removed in a future release. Please use row-major layout only.
    out, new_state = get_transform_buffer(state[0], A.dtype, A.device, to_order, state[1], transpose)

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /opt/venv/lib/python3.10/site-packages/bitsandbytes/functional.py:2854: FutureWarning: This function is deprecated and will be removed in a future release.
    post_call(prev_device)

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /__w/diffusers/diffusers/src/diffusers/quantizers/bitsandbytes/utils.py:184: FutureWarning: The layout transformation operations will be removed in a future release. Please use row-major layout only.
    state.CxB, state.SB = bnb.functional.transform(weight.data, to_order=state.formatB)

tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize
  /__w/diffusers/diffusers/src/diffusers/quantizers/bitsandbytes/utils.py:185: FutureWarning: igemmlt is deprecated and will be removed in a future release. Please use int8_linear_matmul instead.
    out32, Sout32 = bnb.functional.igemmlt(im, state.CxB, Sim, state.SB)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================================== short test summary info ====================================================================================================================================
FAILED tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_generate_quality_dequantize - NotImplementedError: Only row-major format inputs are supported, but got format `col32`
FAILED tests/quantization/bnb/test_mixed_int8.py::SlowBnb8bitTests::test_quality - AssertionError: False is not true
==================================================================================================================== 2 failed, 42 passed, 26 warnings in 1020.55s (0:17:00) =====================================================================================================================
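
For context on the `test_quality` assertion: the test compares output slices with `numpy_cosine_similarity_distance` and requires the result to be below 1e-2. A plain-Python sketch of that kind of metric follows, assuming it is one minus the cosine similarity; the helper's actual implementation is not shown in this thread, and `cosine_similarity_distance` is a hypothetical name.

```python
import math


def cosine_similarity_distance(a, b):
    """Return 1 - cosine similarity of two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Expected slice copied from the failing test above.
expected_slice = [0.0376, 0.0359, 0.0015, 0.0449, 0.0479, 0.0098, 0.0083, 0.0295, 0.0295]

# Identical slices give a distance of (numerically) zero, which passes the threshold.
assert cosine_similarity_distance(expected_slice, expected_slice) < 1e-2
```

Because the metric ignores overall scale, a failure here points to a change in the direction of the output slice rather than a uniform brightness shift.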

sayakpaul

Thanks for the fixes!

Apart from the comments I left, I think it might make sense to also test (integration tests) loading from quantized checkpoints and making sure they are working as expected.

Basically what's done in:

diffusers/tests/quantization/bnb/test_4bit.py

Line 531 in 825979d

class SlowBnb4BitFluxTests(Base4bitTests):

sayakpaul · 2024-12-25T03:59:57Z

docs/source/en/quantization/torchao.md

+image.save("output.png")
+```
+
+Some quantization methods, such as `uint4wo`, cannot be loaded directly and may result in an `UnpicklingError` when trying to load the models, but work as expected when saving them. In order to work around this, one can load the state dict manually into the model. Note, however, that this requires using `weights_only=False` in `torch.load`, so it should be run only if the weights were obtained from a trusted source.


Cc: @jerryzh168. Is this known?

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

sayakpaul · 2024-12-25T04:48:22Z

@a-r-r-o-w regarding the failures,

The first failure doesn't happen with bitsandbytes 0.44.1, so I have reported this internally (see link).
The other failure will go away if we match the nightly test environment (with bitsandbytes 0.44.1).
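
To see whether a local environment matches the version mentioned above, a small standard-library check could look like this sketch. `installed_version` is a hypothetical helper, and the exact-match comparison is an assumption made for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# 0.44.1 is the bitsandbytes version mentioned above.
EXPECTED = "0.44.1"
found = installed_version("bitsandbytes")
if found is None:
    print("bitsandbytes is not installed")
elif found == EXPECTED:
    print("bitsandbytes matches", EXPECTED)
else:
    print(f"bitsandbytes {found} differs from {EXPECTED}")
```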

sayakpaul

Thank you!

a-r-r-o-w · 2024-12-25T09:54:49Z

Fast tests all pass ✅

Fast test logs

root@e64a4756d90e:/__w/diffusers/diffusers# RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py 
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collecting ... The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
collected 20 items                                                                                                                                                                                                                                                                              

Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 20311.40it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 49.61it/s]
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 21509.25it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 22.54it/s]
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 20116.57it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 73.00it/s]
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 18196.55it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 75.41it/s]
Fetching 2 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 7200.52it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 107.65it/s]
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 18436.50it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 106.28it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 80.71it/s]
AUTOTUNE mixed_mm(512x32, 32x128)
  triton_mm_176 0.0275 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_165 0.0285 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_164 0.0286 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_167 0.0286 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_163 0.0287 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_171 0.0287 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_172 0.0287 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_173 0.0287 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=8
  triton_mm_174 0.0287 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_177 0.0287 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 2.0950 seconds and 0.0017 seconds precompiling
AUTOTUNE mixed_mm(256x32, 32x128)
  triton_mm_203 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_202 0.0277 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
  triton_mm_204 0.0278 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_199 0.0285 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_200 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_201 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_196 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=1, num_warps=2
  triton_mm_197 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_205 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_206 0.0288 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 2.0873 seconds and 0.0015 seconds precompiling
AUTOTUNE mixed_mm(768x160, 160x32)
  triton_mm_301 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_305 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_314 0.0307 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=256, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=3, num_warps=2
  triton_mm_307 0.0308 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_300 0.0316 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_312 0.0316 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_302 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_304 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=3, num_warps=8
  triton_mm_306 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_308 0.0317 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 2.1011 seconds and 0.0019 seconds precompiling
AUTOTUNE mixed_mm(768x32, 32x32)
  triton_mm_271 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_276 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=8
  triton_mm_265 0.0297 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_273 0.0299 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_264 0.0307 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=1, num_warps=2
  triton_mm_270 0.0307 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
  triton_mm_266 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_267 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_268 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_269 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 1.8829 seconds and 0.0012 seconds precompiling
AUTOTUNE mixed_mm(256x32, 32x4)
  triton_mm_316 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=1, num_warps=2
  triton_mm_319 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_320 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_323 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_328 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=1
  triton_mm_321 0.0305 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_317 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=2
  triton_mm_322 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_325 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=8
  triton_mm_327 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=1
SingleProcess AUTOTUNE benchmarking takes 1.5638 seconds and 0.0012 seconds precompiling
AUTOTUNE mixed_mm(256x4, 4x32)
  triton_mm_15 0.0298 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=2
  triton_mm_11 0.0307 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_9 0.0308 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_2 0.0317 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_14 0.0317 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=3, num_warps=2
  triton_mm_12 0.0318 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=2, num_warps=8
  triton_mm_10 0.0318 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_8 0.0318 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_13 0.0318 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_4 0.0327 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=False, GROUP_M=8, num_stages=5, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 1.9423 seconds and 0.0013 seconds precompiling
AUTOTUNE mixed_mm(512x32, 32x32)
  triton_mm_23 0.0297 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_18 0.0297 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_17 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_21 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_24 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_31 0.0307 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
  triton_mm_19 0.0317 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_25 0.0317 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_26 0.0317 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_16 0.0327 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=1, num_warps=2
SingleProcess AUTOTUNE benchmarking takes 1.9161 seconds and 0.0013 seconds precompiling
AUTOTUNE mixed_mm(256x32, 32x32)
  triton_mm_50 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_52 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_53 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_56 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_57 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_59 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
  triton_mm_61 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_62 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=2
  triton_mm_63 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
  triton_mm_54 0.0287 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 1.8770 seconds and 0.0012 seconds precompiling
AUTOTUNE mixed_mm(256x32, 32x32)
  triton_mm_128 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=1, num_warps=2
  triton_mm_129 0.0288 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_133 0.0288 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_135 0.0296 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_143 0.0296 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
  triton_mm_130 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_131 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_132 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_134 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
  triton_mm_137 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 1.8759 seconds and 0.0013 seconds precompiling
AUTOTUNE mixed_mm(512x128, 128x32)
  triton_mm_194 0.0287 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=2
  triton_mm_188 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_192 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_193 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_195 0.0297 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=2
  triton_mm_187 0.0298 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_179 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_181 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_182 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_184 0.0307 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 2.0841 seconds and 0.0020 seconds precompiling
AUTOTUNE mixed_mm(256x128, 128x32)
  triton_mm_215 0.0288 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_216 0.0289 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_230 0.0296 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=16, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=2
  triton_mm_218 0.0296 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_217 0.0297 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_220 0.0297 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
  triton_mm_223 0.0297 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_224 0.0297 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_228 0.0297 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_229 0.0297 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 2.0865 seconds and 0.0015 seconds precompiling
AUTOTUNE mixed_mm(768x32, 32x128)
  triton_mm_286 0.0276 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=8
  triton_mm_285 0.0285 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_280 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=1, num_warps=2
  triton_mm_281 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
  triton_mm_284 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_287 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
  triton_mm_290 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
  triton_mm_291 0.0287 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
  triton_mm_282 0.0287 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=32, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=8
  triton_mm_292 0.0289 ms 95.8% ACC_TYPE='tl.float32', ALLOW_TF32=True, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, B_PROLOGUE_CAST_TYPE='tl.bfloat16', EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 2.0763 seconds and 0.0012 seconds precompiling
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:49<00:00, 54.59s/it]
Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 14364.05it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.25it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.89s/it]
.......sssss

============================================= 15 passed, 5 skipped in 147.22s (0:02:27) =============================================

Slow test for pre-serialized model passes ✅

Slow Preserialized test logs

root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py -k SlowTorchAoPreserializedModelTests
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 20 items / 19 deselected / 1 selected                                                                                                                                                                                                                                                 

Fetching 2 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 45100.04it/s]
Loading pipeline components...:  43%|███████████████████████████████████████████████████████████████████████████████████████████████▏                                                                                                                              | 3/7 [00:00<00:00,  5.69it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.43s/it]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:05<00:00,  1.18it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00,  1.28it/s]
[0.0566, 0.0781, 0.1426, 0.0488, 0.0684, 0.1504, 0.0625, 0.0781, 0.1445, 0.0625, 0.0781, 0.1562, 0.0547, 0.0723, 0.1484, 0.0566, 0.5703, 0.8867, 0.7266, 0.5742, 0.875, 0.7148, 0.5586, 0.875, 0.7148, 0.5547, 0.8633, 0.7109, 0.5469, 0.8398, 0.6992, 0.5703]
.

========================================================================================================================= 1 passed, 19 deselected in 214.15s (0:03:34) ==========================================================================================================================
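
The 32 values printed in the log above are a flattened slice of the generated output. A minimal sketch of how such a slice might be checked against a stored reference with an absolute tolerance follows. The helper name `slices_close`, the tolerance values, and the reference slice are assumptions for illustration, not the test suite's actual code.

```python
import math


def slices_close(actual, expected, atol=1e-3):
    """Return True if both slices have equal length and every pair differs by at most atol."""
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=0.0, abs_tol=atol)
        for a, e in zip(actual, expected)
    )


# First eight values of the slice printed in the log above.
actual = [0.0566, 0.0781, 0.1426, 0.0488, 0.0684, 0.1504, 0.0625, 0.0781]

# Hypothetical reference: the same values with a small drift on one element,
# standing in for a stored expected slice.
expected = list(actual)
expected[2] += 5e-4

print(slices_close(actual, expected, atol=1e-3))  # drift is inside this tolerance
print(slices_close(actual, expected, atol=1e-4))  # drift exceeds this tolerance
```

Quantized runs rarely reproduce reference values bit for bit, so choosing the tolerance is a trade-off: too tight and the test is flaky, too loose and it stops catching regressions.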

Slow test for memory footprint passes ✅

Slow memory footprint test logs

root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py -k test_memory_footprint_int4wo
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 20 items / 19 deselected / 1 selected                                                                                                                                                                                                                                                 

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 21472.55it/s]
.

========================================================================================================================= 1 passed, 19 deselected in 203.11s (0:03:23) ==========================================================================================================================
root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py -k test_memory_footprint_int8wo
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 20 items / 19 deselected / 1 selected                                                                                                                                                                                                                                                 

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 44462.59it/s]
.

=============================================================================================================================== 1 passed, 19 deselected in 24.25s ===============================================================================================================================
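
The two runs above exercise the int4 and int8 weight-only configurations (`int4wo`, `int8wo`). As rough intuition for why their footprints differ, here is a back-of-the-envelope sketch of weight storage at different bit widths. The parameter count is a hypothetical placeholder, and the estimate ignores scales, zero points, and any layers left unquantized, so it is not what the tests assert.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate storage for num_params weights, rounded up to whole bytes."""
    return (num_params * bits_per_weight + 7) // 8


NUM_PARAMS = 1_000_000_000  # hypothetical model size, for illustration only

for name, bits in [("bf16", 16), ("int8wo", 8), ("int4wo", 4)]:
    gib = weight_bytes(NUM_PARAMS, bits) / 2**30
    print(f"{name:>7}: {gib:.2f} GiB")
```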

Slow test for quantization precision and layer check passes ✅

Slow quantization logs

root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py::SlowTorchAoTests::test_quantization
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 1 item                                                                                                                                                                                                                                                                                

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 54003.91it/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10330.80it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.35s/it]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:15<00:00,  1.29it/s]
.

================================================================================================================================= 1 passed in 202.72s (0:03:22) =================================================================================================================================
root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py::SlowTorchAoTests::test_quantization
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 1 item                                                                                                                                                                                                                                                                                

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 52211.25it/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10565.00it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.20s/it]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [02:18<00:00,  6.94s/it]
.

================================================================================================================================= 1 passed in 433.82s (0:07:13) =================================================================================================================================
root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py::SlowTorchAoTests::test_quantization
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 1 item                                                                                                                                                                                                                                                                                

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 53773.13it/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10330.80it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.21s/it]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:17<00:00,  1.11it/s]
.

================================================================================================================================= 1 passed in 315.50s (0:05:15) =================================================================================================================================
root@e64a4756d90e:/__w/diffusers/diffusers# RUN_SLOW=1 RUN_NIGHTLY=1 pytest -s tests/quantization/torchao/test_torchao.py::SlowTorchAoTests::test_quantization
====================================================================================================================================== test session starts ======================================================================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0
rootdir: /__w/diffusers/diffusers
configfile: pyproject.toml
plugins: requests-mock-1.10.0, xdist-3.6.1, timeout-2.3.1
collected 1 item                                                                                                                                                                                                                                                                                

Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 49932.19it/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10485.76it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.20s/it]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00,  1.49it/s]
.

================================================================================================================================= 1 passed in 304.99s (0:05:04) =================================================================================================================================

Looks good to merge, I think! Thanks for the reviews everyone, and apologies for bothering you during the vacation period! Going to start the patch release in a bit.

* Revert "Add support for sharded models when TorchAO quantization is enabled (#10256)"

  This reverts commit 41ba8c0.

* update tests
* udpate
* update
* update
* update device map tests
* apply review suggestions
* update
* make style
* fix
* update docs
* update tests
* update workflow
* update
* improve tests
* allclose tolerance
* Update src/diffusers/models/modeling_utils.py

  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* Update tests/quantization/torchao/test_torchao.py

  Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

* improve tests
* fix
* update correct slices

---------

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
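
The PR description states that loading with `device_map` now errors out when TorchAO quantization is in use. A minimal sketch of such a guard is shown here as a hypothetical standalone function. The name `check_device_map`, the `"torchao"` method identifier, and the exception type are assumptions for illustration, not the code in `modeling_utils.py`.

```python
from typing import Optional, Union


def check_device_map(
    quant_method: Optional[str],
    device_map: Optional[Union[str, dict]],
) -> None:
    """Raise if a device_map is combined with TorchAO quantization (hypothetical guard)."""
    if quant_method == "torchao" and device_map is not None:
        raise NotImplementedError(
            "device_map is not supported with TorchAO quantization; "
            "load the model without device_map."
        )


check_device_map("torchao", None)  # no error: no device_map requested
check_device_map(None, "auto")     # no error: not a TorchAO-quantized load

try:
    check_device_map("torchao", "auto")
except NotImplementedError as err:
    print(f"rejected: {err}")
```

Failing early like this is easier to debug than weights that load but turn out not to be quantized.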

a-r-r-o-w added 5 commits December 24, 2024 09:30

Revert "Add support for sharded models when TorchAO quantization is enabled (#10256)"

03049aa

This reverts commit 41ba8c0.

update tests

1a29a99

udpate

c6651f9

update

87bb2fe

update

ba1269d

a-r-r-o-w added the quantization label Dec 24, 2024

a-r-r-o-w requested review from DN6 and sayakpaul December 24, 2024 10:07

Merge branch 'main' into fix-torchao-related-bugs

bc47057

update device map tests

1873bb7

DN6 reviewed Dec 24, 2024

src/diffusers/models/modeling_utils.py (outdated)

tests/quantization/torchao/test_torchao.py (outdated)

apply review suggestions

d0b718a

DN6 reviewed Dec 24, 2024

src/diffusers/models/modeling_utils.py (outdated)

a-r-r-o-w added 6 commits December 24, 2024 12:30

update

a10f19c

make style

fb8b44e

fix

2bd9302

update docs

651666d

update tests

d1b6405

update workflow

1dcd24e

a-r-r-o-w commented Dec 24, 2024

a-r-r-o-w requested a review from DN6 December 24, 2024 13:16

update

e5dcdec

improve tests

1ae9fec

allclose tolerance

2663e94

sayakpaul reviewed Dec 25, 2024

a-r-r-o-w and others added 2 commits December 25, 2024 09:54

Update src/diffusers/models/modeling_utils.py

9f98833

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

Update tests/quantization/torchao/test_torchao.py

b7de749

Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>

a-r-r-o-w added 2 commits December 25, 2024 10:43

Merge branch 'main' into fix-torchao-related-bugs

77a3456

improve tests

3e72979

a-r-r-o-w commented Dec 25, 2024

tests/quantization/torchao/test_torchao.py (outdated)

a-r-r-o-w commented Dec 25, 2024

tests/quantization/torchao/test_torchao.py

fix

a98a184

sayakpaul reviewed Dec 25, 2024

tests/quantization/torchao/test_torchao.py

sayakpaul approved these changes Dec 25, 2024

DN6 approved these changes Dec 25, 2024

update correct slices

fa33949

a-r-r-o-w merged commit cd991d1 into main Dec 25, 2024
15 checks passed

a-r-r-o-w deleted the fix-torchao-related-bugs branch December 25, 2024 10:07
