🤘 TT-NN operator library, and TT-Metalium low level kernel programming model.

tenstorrent/tt-metal

TT-NN

TT-NN is a Python & C++ Neural Network OP library.


Grayskull (GS) Models

| Model | Batch | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|
| ResNet-50 (fps) | 20 | 5,100 | 6,600 | 10,000 |
| BERT-Large (sen/s) | 12 | 370 | 406 | 410 |
| Falcon7B-decode (t/s) | 32 | 135 | 135 | 140 |
| ViT (fps) | 8 | 860 | 1570 | 2000 |
| T5 small (sen/s) | | 140 | | |
| Bloom (sen/s) | | 70 | | |
| U-Net | coming soon | | | |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.
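Footnotes [1] and [2] imply that the gap between the two throughput columns comes from host-side dispatch overhead. As a rough illustration, the sketch below estimates what share of end-to-end time that overhead represents. It assumes end-to-end time per item is simply device time plus host overhead and that both columns were measured at the same batch size. The function name `host_overhead_share` is hypothetical and is not part of TT-NN.

```python
def host_overhead_share(end_to_end_throughput, device_throughput):
    """Estimate the fraction of end-to-end time spent outside kernel execution.

    Assumes time per item = 1 / throughput and that end-to-end time is
    device time plus host overhead (an assumption based on footnotes [1], [2]).
    """
    if end_to_end_throughput <= 0 or device_throughput <= 0:
        raise ValueError("throughputs must be positive")
    end_to_end_time = 1.0 / end_to_end_throughput
    device_time = 1.0 / device_throughput
    return (end_to_end_time - device_time) / end_to_end_time


# Grayskull rows from the table above: (end-to-end, device) throughput.
rows = {
    "ResNet-50": (5100, 6600),
    "BERT-Large": (370, 406),
    "Falcon7B-decode": (135, 135),
    "ViT": (860, 1570),
}
for model, (e2e, dev) in rows.items():
    print(f"{model}: {host_overhead_share(e2e, dev):.1%} host overhead")
```

Under these assumptions, a row whose two columns match (Falcon7B-decode) shows no measurable host overhead, while a large gap (ViT) suggests the host is the main limiter.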

Wormhole (WH) Models

Note

All model demos in this table run on both N150 and N300 Wormhole cards, unless otherwise stated.

All performance numbers here were measured on, or are based on, an N300 Wormhole card.

| Model | Last verified release | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|---|---|---|
| Falcon7B | v0.51.0-rc24 | 129th | 32 | 0.08 s | 16.7 t/s/u - 534 t/s | 19.6 t/s/u - 627 t/s | 26 |
| Mistral-7B | v0.51.0-rc28 | 129th | 32 | coming soon | 9.9 t/s/u - 317 t/s | 11.0 t/s/u - 352 t/s | 25 |
| Mamba-2.8B | v0.51.0-rc26 | any | 32 | 0.04 s | 12.3 t/s/u - 394 t/s | 17.1 t/s/u - 547 t/s | 41 |
| LLaMA-3.1-8B | v0.51.0-rc28 | 129th | 1 | coming soon | 8.3 t/s/u - 8.3 t/s | 9.7 t/s/u - 9.7 t/s | 23 |
| BERT-Large (sen/s) [5] | | - | 8 | - | 270 | 340 | 400 |
| Stable Diffusion 1.4 512x512 (sec/img) [6] | | - | 1 | - | 6 | 5 | 3 |
| ResNet-50 (fps) | | - | 16 | - | 4,100 | 5,010 | 7,000 |

[1] - Observed from the host. Includes dispatch overhead and kernel execution time. For LLMs, token-to-token decode throughput is reported.

[2] - Ignoring host overhead. Kernel execution time only. For LLMs, token-to-token decode throughput is reported.

[3] - Generating the i'th token in a sequence while the kv_cache is filled with i-1 rows.

[4] - Time to fill the kv_cache and generate the first output token (1st user).

[5] - This model demo does not work on N150. It does work on N300.

[6] - This model demo does not work on N300. It does work on N150.
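To make footnotes [1], [3] and [4] more concrete, here is a minimal host-side timing sketch. `prefill` and `decode_step` are hypothetical callables standing in for a model's prefill and single-token decode calls. They are not TT-NN APIs, and the demos' actual measurement code may differ. Note that the tables report decode throughput at a specific token position (for example the 129th token), whereas this sketch averages over a run of decode steps.

```python
import time


def measure_generation(prefill, decode_step, num_decode_steps, clock=time.perf_counter):
    """Return (time_to_first_token_s, tokens_per_second_per_user).

    prefill:      hypothetical callable that fills the kv_cache and produces
                  the first output token.
    decode_step:  hypothetical callable that generates one further token
                  per user from the current kv_cache.
    clock:        injectable timer, which makes the function easy to test.
    """
    if num_decode_steps <= 0:
        raise ValueError("num_decode_steps must be positive")

    start = clock()
    prefill()
    time_to_first_token = clock() - start

    decode_start = clock()
    for _ in range(num_decode_steps):
        decode_step()
    decode_elapsed = clock() - decode_start

    # Each decode step yields one token per user, so steps / seconds is t/s/u.
    return time_to_first_token, num_decode_steps / decode_elapsed


# Stand-in workloads so the sketch runs without a device.
ttft, per_user = measure_generation(
    prefill=lambda: time.sleep(0.02),
    decode_step=lambda: time.sleep(0.005),
    num_decode_steps=10,
)
print(f"time to first token: {ttft:.3f} s, decode: {per_user:.1f} t/s/u")
```

Because the timer wraps the calls from the host, the result corresponds to the end-to-end column [1] rather than the device-only column [2].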

TT-QuietBox & TT-LoudBox (2x4 mesh of WHs) Models

| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|---|---|---|---|
| Falcon7B | v0.51.0-rc36 | Data Parallel | 129th | 256 | 0.11 s | 13.4 t/s/u - 3430 t/s | 19.6 t/s/u - 5018 t/s | 26 t/s/u |
| LLaMA-2-70B | v0.51.0-rc36 | Tensor Parallel | 129th | 32 | coming soon | 10.4 t/s/u - 333 t/s | 16.6 t/s/u - 531 t/s | 20 t/s/u |
| LLaMA-3.1-70B | v0.51.0-rc36 | Tensor Parallel | 129th | 32 | coming soon | 10.4 t/s/u - 333 t/s | 15.8 t/s/u - 506 t/s | 20 t/s/u |
| Falcon40B | v0.51.0-rc35 | Tensor Parallel | 129th | 32 | coming soon | 5.3 t/s/u - 168 t/s | 12.2 t/s/u - 390 t/s | 36 t/s/u |
| Mixtral7Bx8 | v0.51.0-rc33 | Tensor Parallel | 129th | 32 | 0.19 s | 15.7 t/s/u - 502 t/s | 21.4 t/s/u - 685 t/s | 33 t/s/u |
| ResNet-50 (fps) | | Data Parallel | - | 128 | - | 31,250 | 40,080 | 56,000 |

Single Galaxy (8x4 mesh of WHs) Models

| Model | Last verified release | Technique | Gen. Token [3] | Batch | Time to first token [4] | End-to-end throughput [1] | Device throughput [2] | Target throughput |
|---|---|---|---|---|---|---|---|---|
| Falcon7B | v0.51.0-rc30 | Data Parallel | 129th | 1024 | 0.30 s | 4.0 t/s/u - 4096 t/s | 17.7 t/s/u - 18125 t/s | 26 t/s/u |
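The paired figures in the LLM rows (for example `16.7 t/s/u - 534 t/s`) appear to be related by the batch size: tokens per second per user, multiplied by the number of users in the batch, gives aggregate tokens per second. This is not stated explicitly above, so treat it as an inference from the published numbers. The helper name is hypothetical.

```python
def aggregate_tokens_per_second(tokens_per_second_per_user, batch):
    """Aggregate decode throughput, assuming one user per batch entry."""
    return tokens_per_second_per_user * batch


# (label, t/s/u, batch, reported t/s) taken from the end-to-end columns above.
reported = [
    ("Falcon7B on N300", 16.7, 32, 534),
    ("Falcon7B on 2x4 mesh", 13.4, 256, 3430),
    ("Falcon7B on Galaxy", 4.0, 1024, 4096),
]
for name, per_user, batch, total in reported:
    estimate = aggregate_tokens_per_second(per_user, batch)
    print(f"{name}: {estimate:.1f} t/s estimated vs {total} t/s reported")
```

Small differences are expected because the per-user figures are rounded: Falcon40B's 5.3 t/s/u at batch 32 gives about 170 t/s against a reported 168 t/s.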

Model Updates

For the latest model updates and features, please see MODEL_UPDATES.md

Using TT-NN ops and tensors

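import ttnn
import torch

# Open device 0; the context manager closes it on exit.
with ttnn.manage_device(device_id=0) as device:
    # Host-side inputs. b has shape (1, 7) and is broadcast across the rows of a.
    a = torch.ones((5, 7))
    b = torch.ones((1, 7))

    # Move both tensors to the device as bfloat16 in tile layout.
    a = ttnn.from_torch(a, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)
    b = ttnn.from_torch(b, device=device, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT)

    # Add on the device, then copy the result back into a torch tensor.
    output = a + b
    output = ttnn.to_torch(output)

print(output)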
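In the example above, `b` has shape (1, 7) and `a` has shape (5, 7), so the addition relies on broadcasting `b` across the five rows of `a`. The printed result should therefore be a 5x7 tensor of 2s (2.0 is exactly representable in bfloat16). The pure-Python sketch below mimics that broadcasting on nested lists so the expected values can be checked without a device. `broadcast_add_rows` is a hypothetical helper, not a TT-NN function.

```python
def broadcast_add_rows(a, b):
    """Add a single-row matrix b (1 x N) to every row of a (M x N)."""
    if len(b) != 1:
        raise ValueError("b must have exactly one row")
    row = b[0]
    if any(len(a_row) != len(row) for a_row in a):
        raise ValueError("column counts of a and b must match")
    return [[x + y for x, y in zip(a_row, row)] for a_row in a]


a = [[1.0] * 7 for _ in range(5)]  # stands in for torch.ones((5, 7))
b = [[1.0] * 7]                    # stands in for torch.ones((1, 7))
output = broadcast_add_rows(a, b)
print(len(output), len(output[0]), output[0][0])  # prints: 5 7 2.0
```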

TT-Metalium

TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.

Getting started

Get started with simple kernels.

Tech Reports