
Pre-allocate KV tensors and use inference mode #13

Closed
wants to merge 29 commits

Conversation


@jlamypoirier (Collaborator) commented on Jan 30, 2023

Fixes: #12 (Use bigcode-project/transformers#5)

  • Add a pre_allocate_cache option. Big speedup on GPU, but it doesn't help with the CPU bottleneck (see the sketch right below this list).
  • Use torch inference mode. Speeds up CPU calls, with only a marginal GPU speedup (a usage sketch follows the benchmark numbers below).
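As a rough illustration of the pre-allocation idea (not this PR's actual implementation; all names and shapes below are made up for the example): instead of concatenating each step's keys/values onto the cache with torch.cat, which reallocates and copies the whole tensor every generation step, the cache can be allocated once at its final length and filled in place.

```python
import torch

# Hypothetical sketch of a pre-allocated KV cache for one attention layer.
# Shapes, sizes, and the append helper are illustrative only.
batch_size, kv_heads, head_dim = 4, 16, 64
max_length = 512  # n_positions: prompt length + max_new_tokens
device = "cuda" if torch.cuda.is_available() else "cpu"

# Allocate the full-size cache once, before generation starts.
key_cache = torch.empty(batch_size, kv_heads, max_length, head_dim,
                        dtype=torch.float16, device=device)
value_cache = torch.empty_like(key_cache)
cache_length = 0  # number of positions currently filled


def append_kv(new_keys, new_values):
    """Copy this step's keys/values into the pre-allocated cache in place
    and return views over the filled prefix for attention to read."""
    global cache_length
    step = new_keys.size(2)
    key_cache[:, :, cache_length:cache_length + step] = new_keys
    value_cache[:, :, cache_length:cache_length + step] = new_values
    cache_length += step
    return key_cache[:, :, :cache_length], value_cache[:, :, :cache_length]


# One decoding step adds a single position per sequence:
k, v = append_kv(
    torch.randn(batch_size, kv_heads, 1, head_dim, dtype=torch.float16, device=device),
    torch.randn(batch_size, kv_heads, 1, head_dim, dtype=torch.float16, device=device),
)
```

This avoids the per-step reallocation and copy on the GPU, which is why the gain shows up mostly in the cuda timings below.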
python3 src/main.py --hidden_size=2048 --n_head=16 --n_layer=24 --pipeline_class=HF_Pipeline --model_class=GPT2 --dtype=float16 --device=cuda --cycles=5 --batch_size=256 --max_new_tokens=100 --n_positions=512 --attention_type=[1/2] --max_log_outputs=1 --activation_function=gelu_new_python [--pre_allocate_cache] [--profile]

MQA, before: e2e = 1242 ms, cuda = 762 ms
MQA, pre-allocate: e2e = 1248 ms, cuda = 693 ms (no e2e speedup for this CPU-bottlenecked case, but there will be one for cases that aren't CPU-bottlenecked)
MQA, inference mode: e2e = 1130 ms, cuda = 690 ms
MQA, inference mode, no pre-allocate: e2e = 1121 ms, cuda = 760 ms

MHA, before: e2e = 2241 ms, cuda = 2201 ms
MHA, pre-allocate: e2e = 1442 ms, cuda = 1048 ms
MHA, inference mode: e2e = 1249 ms, cuda = 1046 ms
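For the inference-mode change, a minimal usage sketch (the model, tokenizer, and checkpoint name below are placeholders, not this repo's pipeline): torch.inference_mode() goes further than torch.no_grad() by also disabling autograd's version counters and view tracking, which removes per-op CPU bookkeeping and is where the CPU-side speedup in the numbers above comes from.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model/tokenizer; the benchmark uses its own GPT2 pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

inputs = tokenizer("Hello world", return_tensors="pt").to(device)

# Everything inside this block skips autograd bookkeeping entirely;
# tensors created here cannot be used with autograd afterwards.
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The restriction that inference-mode tensors can never re-enter autograd is harmless for a pure generation/benchmarking path like this one.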
