Replies: 2 comments
- Yes, that has been attempted in the past; it is very slow.
- For generating tokens you're I/O bound. Loading the weights from RAM to VRAM and then from VRAM into the GPU is going to be slower than just loading the weights from RAM into the CPU.
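The I/O-bound argument can be checked with back-of-envelope numbers. The sketch below compares the two weight-streaming paths; every bandwidth and model-size figure is an illustrative assumption, not a measurement, so substitute your own hardware's numbers:

```python
# Back-of-envelope check of the I/O-bound argument. All numbers below are
# illustrative assumptions; substitute your own hardware's figures.

MODEL_BYTES = 7e9    # assumed ~7 GB of weights (e.g. a 7B model at 8-bit)
PCIE_BPS    = 16e9   # assumed PCIe 4.0 x16 host-to-device bandwidth (B/s)
RAM_BPS     = 50e9   # assumed CPU RAM read bandwidth (B/s)

# Token generation touches every weight once per token, so the slowest link
# in the weight stream sets an upper bound on tokens/second for each path.
paged_gpu_toks = PCIE_BPS / MODEL_BYTES  # re-uploading all weights per token
cpu_toks       = RAM_BPS / MODEL_BYTES   # streaming weights to the CPU

print(f"paged GPU upper bound: {paged_gpu_toks:.1f} tok/s")
print(f"CPU upper bound:       {cpu_toks:.1f} tok/s")
```

Under these assumed numbers, paging every layer across PCIe for each token caps out below the CPU path, which matches the reply above.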
I have a 12 GB GPU and 128 GB of CPU RAM. I can do ~64 tok/s with the whole model on the GPU, but as soon as even one layer is on the CPU it drops to ~12 tok/s.
I couldn't find any discussion of, or approaches for, dynamic paging of model layers - e.g. load the first 12 layers, compute the 12th layer's output, load the next 12 layers, compute the 24th layer's output, and so on, all on the GPU. Is such paging really slower than letting the CPU crunch through the numbers?
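For concreteness, the paging schedule described here can be sketched as a loop that keeps only a window of layers "resident" at a time. The function name, the window size, and the toy layers are all hypothetical, chosen just to show the load-then-compute interleaving:

```python
# Sketch of the layer-paging idea: keep only `window` layers in (simulated)
# GPU memory, compute through them, then page in the next window.
# `paged_forward` and the toy layers are illustrative, not a real API.

def paged_forward(layers, x, window=12, transfers=None):
    """Run x through all layers, paging `window` layers at a time.

    `transfers` collects (start, end) index ranges of each simulated
    RAM-to-VRAM copy so the schedule can be inspected.
    """
    if transfers is None:
        transfers = []
    for start in range(0, len(layers), window):
        chunk = layers[start:start + window]
        transfers.append((start, start + len(chunk)))  # simulated upload
        for layer in chunk:          # compute while this window is resident
            x = layer(x)
    return x

# Toy model: 36 "layers" that each add 1 to their input.
layers = [(lambda x: x + 1) for _ in range(36)]
log = []
out = paged_forward(layers, 0, window=12, transfers=log)
print(out, log)  # 36 and three uploads: (0,12), (12,24), (24,36)
```

The schedule itself is cheap; the replies above argue the cost is in the uploads, since each token forces the full set of weights back across the PCIe bus.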