Support batching for UsefulSensors Moonshine #35922
Conversation
Tested against Open ASR Leaderboard with batch size 256.
Perform attention mask downsampling inside of moonshine forward call.
- Correctly pipe encoder attention mask into decoder
- Add correct scaling factor if one is not already provided
- Fix formatting with ruff
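For context on the "attention mask downsampling" item above, here is a hypothetical sketch (not the actual modeling code) of how a sample-level padding mask can be reduced to the encoder's frame rate by chaining each conv layer's output-length formula. The `(kernel, stride)` pairs are placeholders, not Moonshine's real conv stem:

```python
import torch

def downsample_attention_mask(mask: torch.Tensor, convs=((3, 2), (3, 2))) -> torch.Tensor:
    """Shrink a sample-level padding mask (batch, samples) to the encoder's
    frame rate by applying each conv layer's output-length formula."""
    lengths = mask.sum(dim=-1)
    for kernel, stride in convs:
        # Output length of a conv with no padding and dilation 1.
        lengths = (lengths - kernel) // stride + 1
    lengths = lengths.clamp(min=0)
    out = torch.zeros(mask.shape[0], int(lengths.max().item()), dtype=mask.dtype)
    for i, length in enumerate(lengths.tolist()):
        out[i, :length] = 1
    return out

# Example: two sequences, the second padded to the length of the first.
mask = torch.ones(2, 32, dtype=torch.long)
mask[1, 20:] = 0
print(downsample_attention_mask(mask))
```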
Great initiative, thanks! 🤗 I had it in mind when integrating, but batch inference was working pretty well with 0-padding when I benchmarked on Fleurs (though non-optimal).
A few changes to call the proper functions of the codebase plus some small docstring fixes, but otherwise LGTM!
There's just the padding to a multiple of 8 that I have not seen in Transformers yet (to the best of my knowledge) and that I would not merge without a proper benchmark signal.
```python
# Pad head size dimension to next multiple of 8. Q, K and V always have equal head sizes.
pad_amount = 8 * ((query_states.shape[-1] + 7) // 8) - query_states.shape[-1]
if pad_amount > 0:
    # Ensure scaling is correct even with padding.
    if self.scaling is None:
        self.scaling = 1.0 / math.sqrt(query_states.shape[-1])

    query_states = torch.nn.functional.pad(query_states, (0, pad_amount))
    key_states = torch.nn.functional.pad(key_states, (0, pad_amount))
    value_states = torch.nn.functional.pad(value_states, (0, pad_amount))
```
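As a side note (not part of the diff), a minimal self-contained check, assuming PyTorch ≥ 2.1 so that `F.scaled_dot_product_attention` accepts a `scale` argument, that zero-padding the head dimension while keeping the original 1/sqrt(head_dim) scale leaves the attention output unchanged; all shapes are illustrative:

```python
import math
import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration only (not Moonshine's actual dims).
batch, heads, seq, head_dim = 2, 8, 16, 36  # 36 is not a multiple of 8

q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)
mask = torch.ones(batch, 1, seq, seq, dtype=torch.bool)

# Reference: unpadded SDPA with the default 1/sqrt(head_dim) scaling.
ref = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Pad the head dimension up to the next multiple of 8 with zeros.
pad = 8 * ((head_dim + 7) // 8) - head_dim
qp, kp, vp = (F.pad(t, (0, pad)) for t in (q, k, v))

# Zero columns do not change Q @ K^T, but the scale must stay 1/sqrt(original dim),
# which is exactly why the diff sets self.scaling explicitly before padding.
out = F.scaled_dot_product_attention(
    qp, kp, vp, attn_mask=mask, scale=1.0 / math.sqrt(head_dim)
)[..., :head_dim]  # drop the padded output channels

torch.testing.assert_close(out, ref)
```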
Can you justify the expected speedups a bit further here? Have you run benchmarks?
When I added an attention mask, I found I was only able to use batch sizes up to 32; otherwise I'd run out of memory. After some memory profiling, I found the culprit was the torch SDPA backend selection: the memory-efficient implementation only supports head sizes that are a multiple of eight when an attention mask is used, so we were falling back to the C++ math implementation.
Overall, this change allows going from batch size 32 to 256, with a corresponding ~4x increase in RTFx on the Open ASR Leaderboard.
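For anyone who wants to reproduce the fallback described above, a hedged probe sketch, assuming PyTorch ≥ 2.3's `torch.nn.attention.sdpa_kernel` API and a CUDA device; shapes and head sizes are illustrative:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def mem_efficient_accepts(head_dim: int, device: str = "cuda") -> bool:
    """Check whether the memory-efficient SDPA kernel will take a masked
    attention call at this head size (illustrative shapes, fp16)."""
    q = torch.randn(1, 8, 64, head_dim, device=device, dtype=torch.float16)
    mask = torch.ones(1, 1, 64, 64, dtype=torch.bool, device=device)
    try:
        # Restrict SDPA to the memory-efficient backend only; if that kernel
        # cannot handle these inputs, PyTorch raises a "No available kernel" error.
        with sdpa_kernel([SDPBackend.EFFICIENT_ATTENTION]):
            F.scaled_dot_product_attention(q, q, q, attn_mask=mask)
        return True
    except RuntimeError:
        return False

if torch.cuda.is_available():
    for d in (36, 40):  # a non-multiple-of-8 head size vs. a padded one
        print(d, mem_efficient_accepts(d))
```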
Nice take (here is the doc for the curious)!
I would rather have gone with another HF repo with an updated architecture (in `config.json`) and updated weights with 0.0s where necessary to avoid impacting dependencies, yet this would require modifying the modeling code anyway to handle correct scaling.
Let's add a config parameter (with explanation in the docstring) `pad_head_dim_to_multiple_of` that defaults to `None` (no effect) and that you would set to 8 in the model `config.json` (in their respective HF repos).
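A rough sketch of how the attention code could key the padding off such a config value; only `pad_head_dim_to_multiple_of` comes from this thread, the helper name and signature are illustrative:

```python
import math
import torch.nn.functional as F

def maybe_pad_heads(q, k, v, scaling, pad_to):
    """Pad the last (head) dim of q/k/v up to a multiple of `pad_to`
    (None disables padding) and return the scaling to pass to SDPA."""
    if pad_to is None:
        return q, k, v, scaling
    head_dim = q.shape[-1]
    pad = -head_dim % pad_to
    if pad == 0:
        return q, k, v, scaling
    # Scaling must stay tied to the original head size, not the padded one.
    if scaling is None:
        scaling = 1.0 / math.sqrt(head_dim)
    return F.pad(q, (0, pad)), F.pad(k, (0, pad)), F.pad(v, (0, pad)), scaling
```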
Tiny nit, but LGTM, thanks! It will very likely require a subsequent PR to update the expected logits for the CI runners; I'll take care of it.
* Add support for attention masking in moonshine. Tested against Open ASR Leaderboard with batch size 256.
* Update comments and ensure attention masks are passed everywhere. Perform attention mask downsampling inside of moonshine forward call.
* Hide padding behind conditional. Fix encoder/decoder masking.
  - Correctly pipe encoder attention mask into decoder
  - Add correct scaling factor if one is not already provided.
  - Fix formatting with ruff
* Add auto generated modeling_moonshine file.
* Update formatting in generated model file.
* Address review comments.
* Fix typo.
* Add `pad_head_dim_to_multiple_of` to moonshine config.
* Correct args order for MooonshineConfig.
* Update configuration moonshine too.
* Update src/transformers/models/moonshine/modular_moonshine.py
* Update src/transformers/models/moonshine/configuration_moonshine.py

Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
Add attention masking to the moonshine model.
Tested on Open ASR Leaderboard with batch_size=256.
Unblocks this OpenASR Leaderboard PR
@eustlb
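For reference, a hedged usage sketch of batched transcription through the ASR pipeline, assuming a transformers release that includes this PR; the model id, file paths, and batch size are illustrative:

```python
from transformers import pipeline

# The pipeline pads each batch and forwards the attention mask, which is what
# this PR teaches the Moonshine encoder/decoder to handle correctly.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")

audio_files = ["clip_0.wav", "clip_1.wav", "clip_2.wav"]  # placeholder paths
print(asr(audio_files, batch_size=8))
```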