[Infer] Revise and Adapt Triton Kernels for Spec-Dec #5401
Merged: FrankLeeeee merged 7 commits into hpcaitech:feat/speculative-decoding from yuanheng-zhao:feat/spec-dec/kernels on Feb 28, 2024
fix dependency in pytest
FrankLeeeee approved these changes on Feb 28, 2024
yuanheng-zhao added a commit that referenced this pull request on Apr 5, 2024
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* resolve conflicts for revising flash-attn
* adapt kv cache copy kernel for spec-dec
* fix seqlen-n kvcache copy kernel/tests
* test kvcache copy - use torch.equal
* add assertions
* (trivial) comment out
yuanheng-zhao added a commit that referenced this pull request on Apr 10, 2024
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* resolve conflicts for revising flash-attn
* adapt kv cache copy kernel for spec-dec
* fix seqlen-n kvcache copy kernel/tests
* test kvcache copy - use torch.equal
* add assertions
* (trivial) comment out
botbw added a commit that referenced this pull request on May 23, 2024
* [Inference] First PR for rebuild colossal-infer (#5143) * add engine and scheduler * add dirs --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Inference] Add readme (roadmap) and fulfill request handler (#5147) * request handler * add readme --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159) * [inference/nfc] remove outdated inference tests * remove outdated kernel tests * remove deprecated triton kernels * remove imports from deprecated kernels * [Inference]Add BatchInferState, Sequence and InferConfig (#5149) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. * update codes * add ci test for config and struct * [Inference] Add CacheBlock and KV-Cache Manager (#5156) * [Inference] Add KVCache Manager * function refactored * add test for KVCache Manager * add attr beam width * Revise alloc func in CacheManager * Fix docs and pytests * add tp slicing for head number * optimize shapes of tensors used as physical cache * Apply using InferenceConfig on KVCacheManager * rm duplicate config file * Optimize cache allocation: use contiguous cache * Fix config in pytest (and config) * [Inference]Update inference config and fix test (#5178) * unify the config setting * fix test * fix import * fix test * fix * fix * add logger * revise log info --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * [Inference] Add the logic of the inference engine (#5173) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. 
* update codes * add ci test for config and struct * Add the logic of the inference engine * update engine and test * Recover cache_manager.py * add logger * fix conflict * update codes * update codes * update model and tokenizer * fix add the logic about shardformer * change kvcache_manager docstring * add policy * fix ci bug in test_kvcache_manager.py * remove codes related o tokenizer and move model_policy * fix code style * add ordered_set to requirements-infer.txt * Delete extra empty lines * add ordered_set to requirements-test.txt * [Inference] add logit processor and request handler (#5166) * add logit processor and request handler * add * add * add * fix * add search tokens and update func * finish request handler * add running list test * fix test * fix some bug * add * add * fix bugs * fix some bugs * fix bug * fix * fix * add copy fun * del useless attn * fix request status --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> * Add padding llama model * Fixed a bug in the inference frame * fix bugs in request_handler * precision alignment * Fixed a writing error * [kernel] Add triton kernel for context attention (FAv2) without padding (#5192) * add context attn unpadded triton kernel * test compatibility * kv cache copy (testing) * fix k/v cache copy * fix kv cache copy and test * fix boundary of block ptrs * add support for GQA/MQA and testing * fix import statement --------- Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local> * add context_attention_unpadded * fix bugs in sampler * Fixed a typo * fix beam_width * [Inference] Pytorch Attention func, pad&nopad input support (#5219) * add attn * add attention test * fix attn forward * fix decoding * fix bugs in attention.py and request_handler.py * adapted to pad_context_forward * [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229) * fix accuracy * alignment in attention * fix attention * fix * fix bugs * fix bugs * fix bugs * fix bugs related to 
processing padding mask * fix CI bugs * rm torch.cuda.synchronize * fix bugs in request_handler.py and engine.py * [Inference] Kernel: no pad rotary embedding (#5252) * fix bugs * comment * use more accurate atol * fix * [kernel] Add flash decoding triton kernel for blocked kv cache (#5249) * add flash decoding unpad triton kernel * rename flash decoding kernel * add kernel testing (draft) * revise pytest * support kv group (GQA) * (trivial) fix api and pytest * (trivial) func renaming * (trivial) func/file renaming * refactor pytest for attention * (trivial) format and consistent vars of context/decode attn * (trivial) remove test redundancy * [git] fixed rebased files * [kernel] Add KV cache copy kernel during decoding (#5261) * add kv copy triton kernel during decoding stage * add pytest and fix kernel * fix test utilities * revise kernel config * add benchmark for kvcache copy * [doc] updated inference readme (#5269) * [Inference] Fix request handler and add recycle logic (#5260) * fix request handler * fix comment * [kernel] Revise KVCache copy triton kernel API (#5273) * [kernel/fix] revise kvcache copy kernel api * fix benchmark * [Inference]Adapted to the triton attn kernels (#5264) * adapted to the triton attn kernels * fix pad input * adapted to copy_kv_to_blocked_cache * fix ci test * update kv memcpy * remove print * [kernel] Add RMSLayerNorm triton kernel (#5262) * add layerrmsnorm triton kernel * add layerrmsnorm kernel * modify the atol and rtol in test file * Remove the logics of mean computations, and update the name of ther kernel functions and files * add benchmark of rms norm * [Hotfix] Fix bugs in testing continuous batching (#5270) * fix bug * fix bugs * fix bugs * fix bugs and add padding * add funcs and fix bugs * fix typos * fix bugs * add func * [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274) * prevent re-creating intermediate tensors * add singleton class holding intermediate values * fix triton kernel 
api * add benchmark in pytest * fix kernel api and add benchmark * revise flash decoding triton kernel in/out shapes * fix calling of triton kernel in modeling * fix pytest: extract to util functions * [inference] Adapted to Rotary Embedding and RMS Norm (#5283) * adapted to rotary_embedding * adapted to nopad rms norm * fix bugs in benchmark * fix flash_decoding.py * add utils.py * [Inference] Benchmarking rotary embedding and add a fetch function (#5277) * fix bugs and add a cos/sin cache fetch func * add docstring * fix bug * fix * [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301) * fix decoding kernel pytest * revise and add triton context attn benchmark * [Inference]Add fused rotary kernel and get cos cache kernel (#5302) * add fused rotary and get cos cache func * staged * fix bugs * fix bugs * [hotfix] fix boundary check in batch (#5306) * [inference]Optimize the usage of the mid tensors space in flash attn (#5304) * opt flash attn * opt tmp tensor * fix benchmark_llama * fix code style * fix None logic for output tensor * fix adapted to get_xine_cache * add comment * fix ci bugs * fix some codes * rm duplicated codes * rm duplicated codes * fix code style * add _get_dtype in config.py * fix (#5311) * [Inference] Update rms norm kernel, benchmark with vLLM (#5315) * add * xi * del * del * fix * [DOC] Update inference readme (#5280) * add readme * add readme * 1 * update engine * finish readme * add readme * [Inference]Add Nopadding Llama Modeling (#5327) * add nopadding llama modeling * add nopadding_llama.py * rm unused codes * fix bugs in test_xine_copy.py * fix code style * [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) * revise shape of kvcache (context attn kernel) * revise shape of kvcache (flash decoding kernel) * revise shape of kvcache (kvcache copy) and attn func * init of kvcache in kvcache manager * revise llama modeling * revise block size retrieval * use torch for rms_norm benchmarking * revise block 
size retrieval * [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) * revise rotary embedding * remove useless print * adapt * [inference] simplified config verification (#5346) * [inference] simplified config verification * polish * polish * [Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add (#5340) * add fused qkv * replace attn and mlp by shardformer * fix bugs in mlp * add docstrings * fix test_inference_engine.py * add optimize unbind * add fused_addmm * rm squeeze(1) * refactor codes * fix ci bugs * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention * Removed the dependency on LlamaFlashAttention2 * rollback test_inference_engine.py * [inference] removed redundancy init_batch (#5353) * [inference] moved ops tests to test_infer (#5354) * [doc] updated inference readme (#5343) * [Inference/opt]Optimize the mid tensor of RMS Norm (#5350) * opt rms_norm * fix bugs in rms_layernorm * [Inference]Optimize generation process of inference engine (#5356) * opt inference engine * fix run_benchmark.sh * fix generate in engine.py * rollback tesh_inference_engine.py * [Fix/Infer] Remove unused deps and revise requirements (#5341) * remove flash-attn dep * rm padding llama * revise infer requirements * move requirements out of module * [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365) * fused the gate and up proj in mlp * fix code styles * opt auto_grad * rollback test_inference_engine.py * modifications based on the review feedback. * fix bugs in flash attn * Change reshape to view * fix test_rmsnorm_triton.py * [Inference] Adapt to Fused rotary (#5348) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) This reverts commit 9f4ab2e. 
* [inference] added inference template (#5375) * [Inference/opt] Fused KVCahce Memcopy (#5374) * fused kv memcopy * add TODO in test_kvcache_copy.py * [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337) * add * fix * fix * pause * fix * fix pytest * align * fix * license * fix * fix * fix readme * fix some bugs * remove tokenizer config * [inference] refactored config (#5376) * [Inference]Support vllm testing in benchmark scripts (#5379) * add vllm benchmark scripts * fix code style * update run_benchmark.sh * fix code style * [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367) * add kvcache manager funcs for batching * add batch bucket for batching * revise RunningList struct in handler * add kvcache/batch funcs for compatibility * use new batching methods * fix indexing bugs * revise abort logic * use cpu seq lengths/block tables * rm unused attr in Sequence * fix type conversion/default arg * add and revise pytests * revise pytests, rm unused tests * rm unused statements * fix pop finished indexing issue * fix: use index in batch when retrieving inputs/update seqs * use dict instead of odict in batch struct * arg type hinting * fix make compress * refine comments * fix: pop_n_seqs to pop the first n seqs * add check in request handler * remove redundant conversion * fix test for request handler * fix pop method in batch bucket * fix prefill adding * [Inference]Fused kv copy into rotary calculation (#5383) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * fused kv copy * fused copy * colossalai/kernel/triton/no_pad_rotary_embedding.py * del padding llama * del * Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) * opt_view_and_memcopy * fix bugs in ci * fix ci bugs * update benchmark scripts * fix ci bugs * [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395) * Fix 
bugs in inference_engine * fix bugs in engine.py * rm CUDA_VISIBLE_DEVICES * add request_ids in generate * fix bug in engine.py * add logger.debug for BatchBucket * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * [Inference]Add CUDA KVCache Kernel (#5406) * add cuda KVCache kernel * annotation benchmark_kvcache_copy * add use cuda * fix import path * move benchmark scripts to example/ * rm benchmark codes in test_kv_cache_memcpy.py * rm redundancy codes * rm redundancy codes * pr was modified according to the review * [Inference]Move benchmark-related code to the example directory. (#5408) * move benchmark-related code to the example directory. * fix bugs in test_fused_rotary_embedding.py * add silu_and_mul for infer * [feat] cuda graph support and refactor non-functional api * add reusable utils for cuda * refactor code * feat rmsnorm cuda kernel and add unittest, benchmark script (#5417) * [fix] multi graphs capture error * [fix] multi graphs capture error * [doc] add doc * refactor code * optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441) * fix include path * fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454) * [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418) * add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline * [fix] pytest and fix dyn grid bug * diverse tests * add implementatino for GetGPULaunchConfig1D * [fix] tmp for test * add some comments * refactor vector utils * [feat] 
add use_cuda_kernel option * add vec_type_trait implementation (#5473) * [fix] unused option * [fix] * [fix] * [fix] remove unused comment * [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) * Support FP16/BF16 Flash Attention 2 * fix bugs in test_kv_cache_memcpy.py * add context_kv_cache_memcpy_kernel.cu * rm typename MT * add tail process * add high_precision * add high_precision to config.py * rm unused code * change the comment for the high_precision parameter * update test_rotary_embdding_unpad.py * fix vector_copy_utils.h * add comment for self.high_precision when using float32 * [fix] PR #5354 (#5501) * [fix] * [fix] * Update config.py docstring * [fix] docstring align * [fix] docstring align * [fix] docstring align * [Inference] Optimize request handler of llama (#5512) * optimize request_handler * fix ways of writing * The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519) * [Inference/Kernel]Add get_cos_and_sin Kernel (#5528) * Add get_cos_and_sin kernel * fix code comments * fix code typos * merge common codes of get_cos_and_sin kernel. * Fixed a typo * Changed 'asset allclose' to 'assert equal'. 
* [Inference] Add Reduce Utils (#5537) * add reduce utils * add using to delele namespace prefix * [Fix/Inference] Remove unused and non-functional functions (#5543) * [fix] remove unused func * rm non-functional partial * add cast and op_functor for cuda build-in types (#5546) * remove unused triton kernels * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove outdated triton test * [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * resolve conflicts for revising flash-attn * adapt kv cache copy kernel for spec-dec * fix seqlen-n kvcache copy kernel/tests * test kvcache copy - use torch.equal * add assertions * (trivial) comment out * [Inference/SpecDec] Add Basic Drafter Model Container (#5405) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * add drafter model container (basic ver) * [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) * fix flash decoding mask during verification * add spec-dec * add test for spec-dec * revise drafter init * remove drafter sampling * retire past kv in drafter * (trivial) rename attrs * (trivial) rename arg * revise how we enable/disable spec-dec * [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449) * fix drafter pastkv and usage of batch bucket * [Inference/SpecDec] Support GLIDE Drafter Model (#5455) * add glide-llama policy and modeling * update glide modeling, compitable with transformers 4.36.2 * revise glide llama modeling/usage * fix issues of glimpsing large kv * revise the way re-loading params for glide drafter * fix drafter and engine tests * enable convert to glide strict=False * revise glide llama modeling * revise vicuna prompt template * revise drafter and tests * apply usage of glide model in engine * [doc] Add inference/speculative-decoding README (#5552) * add README for spec-dec * 
update roadmap * [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557) - resolve conflicts of rebasing feat/speculative-decoding * [Fix] Llama Modeling Control with Spec-Dec (#5580) - fix ref before asgmt - fall back to use triton kernels when using spec-dec * refactor csrc (#5582) * [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593) * delete duplicated code and refactor vec_copy utils and reduce utils * delete unused header file * [inference/model]Adapted to the baichuan2-7B model (#5591) * Adapted to the baichuan2-7B model * modified according to the review comments. * Modified the method of obtaining random weights. * modified according to the review comments. * change mlp layewr 'NOTE' * [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Feat]Tensor Model Parallel Support For Inference (#5563) * tensor parallel support naive source * [fix]precision, model load and refactor the framework * add tp unit test * docstring * fix do_sample * feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611) * [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) * [fix] GQA calling of flash decoding triton * fix kv cache alloc shape * fix rotary triton - GQA * fix sequence max length assigning * Sequence max length logic * fix scheduling and spec-dec * skip without import error * fix pytest - skip without ImportError --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623) * fix rotary embedding GQA * change test_rotary_embdding_unpad.py KH * [example] Update Llama Inference example 
(#5629) * [example] add infernece benchmark llama3 * revise inference config - arg * remove unused args * add llama generation demo script * fix init rope in llama policy * add benchmark-llama3 - cleanup * [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613) * refactor compilation mechanism and unified multi hw * fix file path bug * add init.py to make pybind a module to avoid relative path error caused by softlink * delete duplicated micros * fix micros bug in gcc * [Fix/Inference]Fix vllm benchmark (#5630) * Fix bugs about OOM when running vllm-0.4.0 * rm used params * change generation_config * change benchmark log file name * [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Fix] Remove obsolete files - inference (#5650) * [Inference]Adapt to baichuan2 13B (#5614) * adapt to baichuan2 13B * adapt to baichuan2 13B * change BAICHUAN_MODEL_NAME_OR_PATH * fix test_decoding_attn.py * Modifications based on review comments. 
* change BAICHUAN_MODEL_NAME_OR_PATH * mv attn mask processes to test flash decoding * mv get_alibi_slopes baichuan modeling * fix bugs in test_baichuan.py * [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) * add context attn triton kernel - new kcache layout * add benchmark triton * tiny revise * trivial - code style, comment * [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656) * [Inference/Feat] Feat quant kvcache step2 (#5674) * [Inference] Adapt Baichuan2-13B TP (#5659) * adapt to baichuan2 13B * add baichuan2 13B TP * update baichuan tp logic * rm unused code * Fix TP logic * fix alibi slopes tp logic * rm nn.Module * Polished the code. * change BAICHUAN_MODEL_NAME_OR_PATH * Modified the logic for loading Baichuan weights. * fix typos * [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator * refactor decode_kv_cache_memcpy * enable alibi in pagedattention * [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680) * [inference]Add alibi to flash attn function (#5678) * add alibi to flash attn function * rm redundant modifications * [Inference] Fix quant bits order (#5681) * [kernel] Support New KCache Layout - Triton Kernel (#5677) * kvmemcpy triton for new kcache layout * revise tests for new kcache layout * naive triton flash decoding - new kcache layout * rotary triton kernel - new kcache layout * remove redundancy - triton decoding * remove redundancy - triton kvcache copy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * [Fix] Fix & Update Inference Tests (compatibility w/ main) * [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679) * [Inference/Feat] Add quant kvcache support for 
decode_kv_cache_memcpy (#5686) * [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695) - Fix key value number assignment in KVCacheManager, as well as method of accessing * [Fix] Fix Inference Example, Tests, and Requirements (#5688) * clean requirements * modify example inference struct * add test ci scripts * mark test_infer as submodule * rm deprecated cls & deps * import of HAS_FLASH_ATTN * prune inference tests to be run * prune triton kernel tests * increment pytest timeout mins * revert import path in openmoe * [hotfix] fix OpenMOE example import path (#5697) * [Inference]Adapt temperature processing logic (#5689) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg * [Inference] Support the logic related to ignoring EOS token (#5693) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg * support ignore EOS token * change variable's name * fix annotation * [Inference] ADD async and sync Api server using FastAPI (#5396) * add api server * fix * add * add completion service and fix bug * add generation config * revise shardformer * fix bugs * add docstrings and fix some bugs * fix bugs and add choices for prompt template * [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) * finish online test and add examples * fix test_contionus_batching * fix some bugs * fix bash * fix * fix inference * finish revision * fix typos * revision * [Online Server] Chat Api for streaming and not streaming response (#5470) * fix bugs * fix bugs * fix api server * fix api server * add chat api and test * del request.n * [Inference] resolve rebase conflicts fix * [Inference] Fix bugs and docs for feat/online-server (#5598) * fix test bugs * add do sample test * del useless lines * fix comments * fix tests * delete version tag * delete version tag * add * del test sever * fix test * fix * Revert "add" This 
reverts commit b9305fb. * resolve rebase conflicts on Branch feat/online-serving * [Inference] Add example test_ci script * [Inference/Feat] Add quant kvcache interface (#5700) * add quant kvcache interface * delete unused output * complete args comments * [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706) * add convert_fp8 op for fp8 test in the future * rerun ci * [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) * Adapt repetition_penalty and no_repeat_ngram_size * fix no_repeat_ngram_size_logit_process * remove batch_updated * fix annotation * modified codes based on the review feedback. * rm get_batch_token_ids * [Feat]Inference RPC Server Support (#5705) * rpc support source * kv cache logical/physical disaggregation * sampler refactor * colossalai launch built in * Unitest * Rpyc support --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * add paged-attetionv2: support seq length split across thread block (#5707) * [Inference] Delete duplicated copy_vector (#5716) * [ci] Fix example tests (#5714) * [fix] revise timeout value on example CI * trivial * [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717) * Fix Llama3 Load error * Omit Checkpoint IO Temporarily * [Inference] Fix API server, test and example (#5712) * fix api server * fix generation config * fix api server * fix comments * fix infer hanging bug * resolve comments, change backend to free port * 【Inference] Delete duplicated package (#5723) * [example] Update Inference Example (#5725) * [example] update inference example * [lazy] fix lazy cls init (#5720) * fix * fix * fix * fix * fix * remove kernel intall * rebase revert fix * fix * fix * [Inference] Fix Inference Generation Config and Sampling (#5710) * refactor and add * config default values * fix gen config passing * fix rpc generation config * [Fix/Inference] Add unsupported auto-policy error message (#5730) * [fix] auto policy error message * trivial 
* [doc] Update Inference Readme (#5736) * [doc] update inference readme * add contents * trivial * [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702) * [pre-commit.ci] auto fixes from pre-commit.com hooks * add parallel cross entropy output for falcon model & fix some typos in bloom.py * fix module name error, self.model -> self.transformers in bloom, falcon model * Fix the overflow bug of distributed cross entropy loss function when training with fp16 * add dtype to parallel cross entropy loss function * fix dtype related typos adn prettify the loss.py * fix grad dtype and update dtype mismatch error * fix typo bugs * [bug] fix silly bug * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [chore] add test for prefetch * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [ci] Temporary fix for build on pr (#5741) * temporary fix for CI * timeout to 90 * [NFC] Fix code factors on inference triton kernels (#5743) * [NFC] fix requirements (#5744) * [inference] release (#5747) * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release --------- Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com> Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Co-authored-by: yuehuayingxueluo <867460659@qq.com> Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local> Co-authored-by: FrankLeeeee <somerlee.9@gmail.com> Co-authored-by: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com> Co-authored-by: xs_courtesy <xs1580802568@gmail.com> Co-authored-by: Runyu Lu <runyulu@umich.edu> Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Co-authored-by: Runyu Lu 
<77330637+LRY89757@users.noreply.github.com> Co-authored-by: Yuanheng <jonathan.zhaoyh@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: CjhHa1 <cjh18671720497@outlook.com> Co-authored-by: flybird11111 <1829166702@qq.com> Co-authored-by: Haze188 <haze188@qq.com> Co-authored-by: binmakeswell <binmakeswell@gmail.com>
botbw added a commit that referenced this pull request on May 23, 2024
commit 4647ec28c8450ee96f4709626617763712efd77e Author: binmakeswell <binmakeswell@gmail.com> Date: Thu May 23 17:44:06 2024 +0800 [inference] release (#5747) * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release * [inference] release commit df6747603f11e2a1929db193ceb014799e02e2c1 Merge: 22ce873c 498f42c4 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 22 14:31:09 2024 +0800 [Colossal-Inference] (v0.1.0) Merge pull request #5739 from hpcaitech/feature/colossal-infer [Inference] Merge feature/colossal-infer commit 498f42c45b256b5cfc32d74b552e1e306f317a42 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 22 12:08:49 2024 +0800 [NFC] fix requirements (#5744) commit bd38fe6b912379080673a43d77fd3bdf0e5c852e Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 21 22:12:15 2024 +0800 [NFC] Fix code factors on inference triton kernels (#5743) commit c2c8c9cf17d67000df8a5b75ae9dbecee0e1c00a Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 21 18:20:57 2024 +0800 [ci] Temporary fix for build on pr (#5741) * temporary fix for CI * timeout to 90 commit c06208e72c35d74e150b6a83e72375f5021d10b1 Merge: d8b1ea4a 8633c15d Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 21 11:26:37 2024 +0800 Merge pull request #5737 from yuanheng-zhao/inference/sync/main [sync] Sync feature/colossal-infer with main commit 22ce873c3f26fd7f4217cdf19071c173683c2b47 Author: Haze188 <haze188@qq.com> Date: Tue May 21 11:07:13 2024 +0800 [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702) * [pre-commit.ci] auto fixes from pre-commit.com hooks * add parallel cross entropy output for falcon model & fix some typos in bloom.py * fix module name error, self.model -> self.transformers in bloom, falcon model * Fix the overflow 
bug of distributed cross entropy loss function when training with fp16 * add dtype to parallel cross entropy loss function * fix dtype related typos adn prettify the loss.py * fix grad dtype and update dtype mismatch error * fix typo bugs commit 8633c15da9b82c675c59ad292e7f0d77f092653c Merge: d8b1ea4a 9d83c6d7 Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Date: Mon May 20 15:50:53 2024 +0000 [sync] Sync feature/colossal-infer with main commit d8b1ea4ac90317ad6126acbd854e66583a8f9c8f Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon May 20 22:50:04 2024 +0800 [doc] Update Inference Readme (#5736) * [doc] update inference readme * add contents * trivial commit bdf9a001d61cfad4bb68752c4a808295165307a0 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon May 20 22:49:18 2024 +0800 [Fix/Inference] Add unsupported auto-policy error message (#5730) * [fix] auto policy error message * trivial commit 283c407a19002118bda7edd1b8a3acf099843205 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Sun May 19 15:08:42 2024 +0800 [Inference] Fix Inference Generation Config and Sampling (#5710) * refactor and add * config default values * fix gen config passing * fix rpc generation config commit 9d83c6d715e8cdb802f82335e651923baab5cfc6 Author: flybird11111 <1829166702@qq.com> Date: Fri May 17 18:18:59 2024 +0800 [lazy] fix lazy cls init (#5720) * fix * fix * fix * fix * fix * remove kernel intall * rebase revert fix * fix * fix commit 8bcfe360fdae7ccec7051aaced48497519afc2f2 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri May 17 11:28:53 2024 +0800 [example] Update Inference Example (#5725) * [example] update inference example commit a8d459f99a1d415fc843327e4dafce19ecee1f3e Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Thu May 16 10:49:03 2024 +0800 【Inference] Delete duplicated package (#5723) commit f47f2fbb2467df15548d2c663b119f4ae0103890 Author: Jianghai 
<72591262+CjhHa1@users.noreply.github.com> Date: Wed May 15 15:47:31 2024 +0800 [Inference] Fix API server, test and example (#5712) * fix api server * fix generation config * fix api server * fix comments * fix infer hanging bug * resolve comments, change backend to free port commit 74c47921facd26dbd93172bf887abcad4eab2d5c Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Tue May 14 20:17:43 2024 +0800 [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717) * Fix Llama3 Load error * Omit Checkpoint IO Temporarily commit 5bbab1533ae7672ab37e91b7bc9e584b3a4e9cc1 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 14 16:08:51 2024 +0800 [ci] Fix example tests (#5714) * [fix] revise timeout value on example CI * trivial commit 121d7ad629c746e52a96ec53d6e26c0194016a03 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue May 14 14:35:33 2024 +0800 [Inference] Delete duplicated copy_vector (#5716) commit 7806842f2dbb4b6d6e74014efc7db5be8ccf0bbd Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Tue May 14 12:46:54 2024 +0800 add paged-attetionv2: support seq length split across thread block (#5707) commit 18d67d0e8e79c22bded0745c7d3daf8ca40d445c Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Tue May 14 10:00:55 2024 +0800 [Feat]Inference RPC Server Support (#5705) * rpc support source * kv cache logical/physical disaggregation * sampler refactor * colossalai launch built in * Unitest * Rpyc support --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit de4bf3dedf2c7cb7ba6c3044745bab3c3ef6352d Author: yuehuayingxueluo <867460659@qq.com> Date: Sat May 11 15:13:25 2024 +0800 [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708) * Adapt repetition_penalty and no_repeat_ngram_size * fix no_repeat_ngram_size_logit_process * remove batch_updated * fix annotation * modified codes based on the review feedback. 
* rm get_batch_token_ids commit 50104ab340e6c7067fbaaf9b47c608eb828aa95b Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri May 10 18:39:54 2024 +0800 [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706) * add convert_fp8 op for fp8 test in the future * rerun ci commit bfad39357b0fe31ecf6f7639e2c4056165078a3f Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Thu May 9 18:03:24 2024 +0800 [Inference/Feat] Add quant kvcache interface (#5700) * add quant kvcache interface * delete unused output * complete args comments commit 492520dbdb962d207ac40d216e0414807f73eb19 Merge: d4829220 5d9a4948 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Thu May 9 17:19:45 2024 +0800 Merge pull request #5588 from hpcaitech/feat/online-serving [Feature]Online Serving commit 5d9a49483d98ccd4bebebbfd039162caceefe6bd Author: CjhHa1 <cjh18671720497@outlook.com> Date: Thu May 9 05:44:05 2024 +0000 [Inference] Add example test_ci script commit bc9063adf1598c3be32fc2d12577d76b9daa79bf Author: CjhHa1 <cjh18671720497@outlook.com> Date: Wed May 8 10:36:42 2024 +0000 resolve rebase conflicts on Branch feat/online-serving commit 61a1b2e798edcbf91ac35966a4047407ad6aa62d Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed May 8 15:14:06 2024 +0800 [Inference] Fix bugs and docs for feat/online-server (#5598) * fix test bugs * add do sample test * del useless lines * fix comments * fix tests * delete version tag * delete version tag * add * del test sever * fix test * fix * Revert "add" This reverts commit b9305fb02440d5cd566d32b508bee9f9c13dda15. 
commit 7bbb28e48bdb5849d9dfb118d7bf2959d79bbe02 Author: CjhHa1 <cjh18671720497@outlook.com> Date: Thu Apr 11 10:12:31 2024 +0800 [Inference] resolve rebase conflicts fix commit c06403286567f62cb0a6dfc5e075cf60e291cea9 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Sun Apr 7 14:45:43 2024 +0800 [Online Server] Chat Api for streaming and not streaming response (#5470) * fix bugs * fix bugs * fix api server * fix api server * add chat api and test * del request.n commit de378cd2abd77b464786dc5f8298c9edbf023fbc Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Mon Mar 18 17:06:05 2024 +0800 [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432) * finish online test and add examples * fix test_contionus_batching * fix some bugs * fix bash * fix * fix inference * finish revision * fix typos * revision commit 69cd7e069d5705c7e431b301ac14924711c74e41 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Fri Mar 1 14:47:36 2024 +0800 [Inference] ADD async and sync Api server using FastAPI (#5396) * add api server * fix * add * add completion service and fix bug * add generation config * revise shardformer * fix bugs * add docstrings and fix some bugs * fix bugs and add choices for prompt template commit d482922035ff7b6fe7ced8e6c4028faa2d68197f Author: yuehuayingxueluo <867460659@qq.com> Date: Wed May 8 19:59:10 2024 +0800 [Inference] Support the logic related to ignoring EOS token (#5693) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg * support ignore EOS token * change variable's name * fix annotation commit 9c2fe7935ff5aaec4f174cfba6f324df623c7447 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed May 8 17:58:29 2024 +0800 [Inference]Adapt temperature processing logic (#5689) * Adapt temperature processing logic * add ValueError for top_p and top_k * add GQA Test * fix except_msg commit 
12e7c28d5e8f219480d1dbc682fd225dc76fcc2b Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 8 15:48:47 2024 +0800 [hotfix] fix OpenMOE example import path (#5697) commit 55cc7f3df7c600deae2f344ee162abae5a5c63e1 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed May 8 11:30:15 2024 +0800 [Fix] Fix Inference Example, Tests, and Requirements (#5688) * clean requirements * modify example inference struct * add test ci scripts * mark test_infer as submodule * rm deprecated cls & deps * import of HAS_FLASH_ATTN * prune inference tests to be run * prune triton kernel tests * increment pytest timeout mins * revert import path in openmoe commit f9afe0addd89303de4819debd93efe97d5618238 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue May 7 23:13:14 2024 +0800 [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695) - Fix key value number assignment in KVCacheManager, as well as method of accessing commit 1ace1065e6bff175a0af88cae86d272acef29c9f Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon May 6 15:35:13 2024 +0800 [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686) commit db7b3051f4379862f88790bf1653ddb6443c002e Merge: 725fbd2e 8754abae Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon May 6 14:43:38 2024 +0800 [Sync] Update from main to feature/colossal-infer (Merge pull request #5685) [Sync] Update from main to feature/colossal-infer - Merge pull request #5685 from yuanheng-zhao/inference/merge/main commit 725fbd2ed067f9c58ac04670377d3e6f2a96fe00 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Mon May 6 10:55:34 2024 +0800 [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679) commit 8754abae24dbcc492d2992d1091428592b615285 Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Date: Sun May 5 16:28:56 2024 +0000 [Fix] Fix & Update Inference Tests (compatibility w/ 
main) commit 56ed09aba5e017fc0c211dac70215c2f83815919 Merge: 537a3cbc d3f34ee8 Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com> Date: Sun May 5 05:14:00 2024 +0000 [sync] resolve conflicts of merging main commit 537a3cbc4df445786c8ecf2af0a2998e2fd881b6 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri May 3 17:20:45 2024 +0800 [kernel] Support New KCache Layout - Triton Kernel (#5677) * kvmemcpy triton for new kcache layout * revise tests for new kcache layout * naive triton flash decoding - new kcache layout * rotary triton kernel - new kcache layout * remove redundancy - triton decoding * remove redundancy - triton kvcache copy * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit 9df016fc4520a5a5c95a11ed04a8ac62bde039c4 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Apr 30 19:38:00 2024 +0800 [Inference] Fix quant bits order (#5681) commit f79963199cd30c5e917d430aedd79113d06d608c Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Apr 30 19:35:05 2024 +0800 [inference]Add alibi to flash attn function (#5678) * add alibi to flash attn function * rm redundant modifications commit ef8e4ffe310bfe21f83feb965d962d816d75bc88 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Apr 30 18:33:53 2024 +0800 [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680) commit 5cd75ce4c7edc95bacd8ec5fc04b8add339e8331 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Tue Apr 30 15:52:23 2024 +0800 [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663) * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator * refactor decode_kv_cache_memcpy * enable alibi in pagedattention commit 5f00002e43bd738a99fea250306e54c8c908f05a Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Apr 30 
15:47:07 2024 +0800 [Inference] Adapt Baichuan2-13B TP (#5659) * adapt to baichuan2 13B * add baichuan2 13B TP * update baichuan tp logic * rm unused code * Fix TP logic * fix alibi slopes tp logic * rm nn.Module * Polished the code. * change BAICHUAN_MODEL_NAME_OR_PATH * Modified the logic for loading Baichuan weights. * fix typos commit 808ee6e4addccb51990398434547fa5df3c255b0 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Apr 30 11:26:36 2024 +0800 [Inference/Feat] Feat quant kvcache step2 (#5674) commit 8ccb6714e79137c8e6e50d9a585eadbf70ae6fc0 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri Apr 26 19:40:37 2024 +0800 [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656) commit 5be590b99eb6c58c3aa809d453680139fdd2b9f7 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri Apr 26 17:51:49 2024 +0800 [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658) * add context attn triton kernel - new kcache layout * add benchmark triton * tiny revise * trivial - code style, comment commit 3c91e3f1763d2a30a85187a3a606dbe4d1b9454d Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Apr 25 23:11:30 2024 +0800 [Inference]Adapt to baichuan2 13B (#5614) * adapt to baichuan2 13B * adapt to baichuan2 13B * change BAICHUAN_MODEL_NAME_OR_PATH * fix test_decoding_attn.py * Modifications based on review comments. 
* change BAICHUAN_MODEL_NAME_OR_PATH * mv attn mask processes to test flash decoding * mv get_alibi_slopes baichuan modeling * fix bugs in test_baichuan.py commit f342a9387168cedc2e5cc33155939c6d0c4e99a0 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Thu Apr 25 22:04:59 2024 +0800 [Fix] Remove obsolete files - inference (#5650) commit a8fd3b034235e1fa987a1ae85a9a2b465ee6128f Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Thu Apr 25 14:24:02 2024 +0800 [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643) * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit 90cd5227a348dfe506e95b2e49f2a8dcd34fdbca Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Apr 24 14:51:36 2024 +0800 [Fix/Inference]Fix vllm benchmark (#5630) * Fix bugs about OOM when running vllm-0.4.0 * rm used params * change generation_config * change benchmark log file name commit 279300dc5f34db219c90a297c0996d00221eae96 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Wed Apr 24 14:17:54 2024 +0800 [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613) * refactor compilation mechanism and unified multi hw * fix file path bug * add init.py to make pybind a module to avoid relative path error caused by softlink * delete duplicated micros * fix micros bug in gcc commit 04863a9b144fc7dd46a57d2c7b0cf2f4b351ffb6 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 23 22:23:07 2024 +0800 [example] Update Llama Inference example (#5629) * [example] add infernece benchmark llama3 * revise inference config - arg * remove unused 
args * add llama generation demo script * fix init rope in llama policy * add benchmark-llama3 - cleanup commit 12f10d5b0b49a180bc162e166337942e0bbfb96b Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Apr 23 13:44:49 2024 +0800 [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623) * fix rotary embedding GQA * change test_rotary_embdding_unpad.py KH commit 5d4c1fe8f5f7019284f6cbc0ed29506748f63bf1 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 23 13:09:55 2024 +0800 [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624) * [fix] GQA calling of flash decoding triton * fix kv cache alloc shape * fix rotary triton - GQA * fix sequence max length assigning * Sequence max length logic * fix scheduling and spec-dec * skip without import error * fix pytest - skip without ImportError --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> commit ccf72797e3bfafcbfc42870ce24ee484858d4852 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Fri Apr 19 15:34:53 2024 +0800 feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611) commit e37ee2fb65fc77c275b816968d91776322fd7695 Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Thu Apr 18 16:56:46 2024 +0800 [Feat]Tensor Model Parallel Support For Inference (#5563) * tensor parallel support naive source * [fix]precision, model load and refactor the framework * add tp unit test * docstring * fix do_sample commit be396ad6cc102fa610731291bf28e531a5641c7a Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Thu Apr 18 16:45:07 2024 +0800 [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531) * feat flash decoding for paged attention * refactor flashdecodingattention * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> commit 56b222eff8c996a4677a158d4b5d4834a1bc0cfc Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Apr 15 16:53:02 2024 +0800 [inference/model]Adapted to the baichuan2-7B model (#5591) * Adapted to the baichuan2-7B model * modified according to the review comments. * Modified the method of obtaining random weights. * modified according to the review comments. * change mlp layewr 'NOTE' commit d4cb023b62ea8e092783be437cb16d74a1afc6a7 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Apr 15 10:57:51 2024 +0800 [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593) * delete duplicated code and refactor vec_copy utils and reduce utils * delete unused header file commit a21912339a2c41627b43fd00e6adba38308a2ea0 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Thu Apr 11 15:41:36 2024 +0800 refactor csrc (#5582) commit 25928d84961b60264a6dabbddeae32af04a43fa2 Merge: d56c9633 f8598e3e Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Apr 10 18:39:27 2024 +0800 [Inference/Spec-Dec] Merge pull request #5565 from hpcaitech/feat/speculative-decoding Add Speculative Decoding and GLIDE Spec-Dec commit f8598e3ec56bbe6bc6dd9fd84a1e0543adbd3073 Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Wed Apr 10 11:14:04 2024 +0800 [Fix] Llama Modeling Control with Spec-Dec (#5580) - fix ref before asgmt - fall back to use triton kernels when using spec-dec commit e60d430cf53c9009af4682908d01742147654429 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Sun Apr 7 14:53:30 2024 +0800 [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557) - resolve conflicts of rebasing feat/speculative-decoding commit e1acb58423c53ece50b72db3bf9b91475d5d3d64 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Apr 3 18:06:23 2024 +0800 [doc] Add inference/speculative-decoding README (#5552) * add README for 
spec-dec * update roadmap commit d85d91435ae25d875bfeb012b1e66cbfce6f6525 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Apr 1 21:54:24 2024 +0800 [Inference/SpecDec] Support GLIDE Drafter Model (#5455) * add glide-llama policy and modeling * update glide modeling, compitable with transformers 4.36.2 * revise glide llama modeling/usage * fix issues of glimpsing large kv * revise the way re-loading params for glide drafter * fix drafter and engine tests * enable convert to glide strict=False * revise glide llama modeling * revise vicuna prompt template * revise drafter and tests * apply usage of glide model in engine commit 912e24b2aaf4acda0e2b9a45a7d4327fbfc8bd39 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Mar 12 17:57:01 2024 +0800 [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449) * fix drafter pastkv and usage of batch bucket commit a37f82629d7b9e3c3a0f430b8dd3ff6f38ddf1d4 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Mar 11 09:51:42 2024 +0800 [Inference/SpecDec] Add Speculative Decoding Implementation (#5423) * fix flash decoding mask during verification * add spec-dec * add test for spec-dec * revise drafter init * remove drafter sampling * retire past kv in drafter * (trivial) rename attrs * (trivial) rename arg * revise how we enable/disable spec-dec commit 5a9b05f7b297bc9ce3479990aeee94891c7f5edf Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Feb 28 13:48:17 2024 +0800 [Inference/SpecDec] Add Basic Drafter Model Container (#5405) * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * add drafter model container (basic ver) commit d63c469f45bc20115aaf5ba01e62dc67ab47953f Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Feb 28 13:47:00 2024 +0800 [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401) * [Infer/Fix] Fix 
Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest * resolve conflicts for revising flash-attn * adapt kv cache copy kernel for spec-dec * fix seqlen-n kvcache copy kernel/tests * test kvcache copy - use torch.equal * add assertions * (trivial) comment out commit d56c96334e8a0626696609c3803ba5c73798f073 Merge: 7ebdf48a 7ca1d1c5 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 9 10:09:34 2024 +0800 Sync main to feature/colossal-infer [Sync] Merge feature/colossal-infer with main commit 7ca1d1c5453de3e726bca6334c360045050f94c4 Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Mon Apr 8 17:00:55 2024 +0800 remove outdated triton test commit d78817539ea03b7b4bc79e0ef50db33d3e347f24 Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Date: Mon Apr 8 08:41:07 2024 +0000 [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci commit ce9401ad52b870012846abcde120f1e87d5da7fe Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Mon Apr 8 16:25:12 2024 +0800 remove unused triton kernels commit ed5ebd1735db4541709eebdd37839ad161f542e8 Merge: 7ebdf48a 641b1ee7 Author: Yuanheng <jonathan.zhaoyh@gmail.com> Date: Mon Apr 8 16:21:47 2024 +0800 [Fix] resolve conflicts of merging main commit 7ebdf48ac50ca7bab827ef611551c6c48113b684 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Apr 8 11:38:05 2024 +0800 add cast and op_functor for cuda build-in types (#5546) commit 4bb5d8923a6e85a0f89a483f15933698635a9f9c Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Apr 2 14:16:59 2024 +0800 [Fix/Inference] Remove unused and non-functional functions (#5543) * [fix] remove unused func * rm non-functional partial commit a2878e39f42f509f237f3d3fd0741f53e3feff0e Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Apr 1 15:34:25 2024 +0800 [Inference] Add Reduce Utils (#5537) * add reduce utils * add using to delele namespace prefix commit 
04aca9e55bd91ea4dd8d1231aa66df7848b08f03 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Apr 1 13:47:14 2024 +0800 [Inference/Kernel]Add get_cos_and_sin Kernel (#5528) * Add get_cos_and_sin kernel * fix code comments * fix code typos * merge common codes of get_cos_and_sin kernel. * Fixed a typo * Changed 'asset allclose' to 'assert equal'. commit 934e31afb22d2a281464aebde074eb2f238fb812 Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Mar 28 10:42:51 2024 +0800 The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519) commit e6496dd37144202c8602dfdd66bb83f297eb5805 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 26 16:37:14 2024 +0800 [Inference] Optimize request handler of llama (#5512) * optimize request_handler * fix ways of writing commit 6251d68dc9f92c333a8f07ddf94e80ff7462726e Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Mon Mar 25 15:24:17 2024 +0800 [fix] PR #5354 (#5501) * [fix] * [fix] * Update config.py docstring * [fix] docstring align * [fix] docstring align * [fix] docstring align commit 1d626233ce8dbf35405cb7d92a5638ee1d830e8f Merge: 87079cff 68e9396b Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Mon Mar 25 14:55:59 2024 +0800 Merge pull request #5434 from LRY89757/colossal-infer-cuda-graph [feat] cuda graph support and refactor non-functional api commit 68e9396bc084f03fe9315e9fed93292c0efc7a48 Merge: ff4998c6 87079cff Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 25 14:48:28 2024 +0800 [fix] merge conflicts commit 87079cffe8e006d4949aa7ca7cb60e6b813ff701 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Mar 25 13:40:34 2024 +0800 [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461) * Support FP16/BF16 Flash Attention 2 * fix bugs in test_kv_cache_memcpy.py * add context_kv_cache_memcpy_kernel.cu * rm typename MT * add tail process * add high_precision * add high_precision to 
config.py * rm unused code * change the comment for the high_precision parameter * update test_rotary_embdding_unpad.py * fix vector_copy_utils.h * add comment for self.high_precision when using float32 commit ff4998c6f39cbfd6d3d11f038c55cca3c9d3abd0 Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 25 12:00:57 2024 +0800 [fix] remove unused comment commit 9fe61b44753083c89a50540daa1e9a3daedeb335 Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 25 11:37:58 2024 +0800 [fix] commit 5b017d6324c9881e02a5440e0b1a3156612a8044 Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 21 15:55:25 2024 +0800 [fix] commit 606603bb8805c39f6ee01029337ddc614c8d46ef Merge: 4eafe0c8 7ff42cc0 Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 21 14:25:22 2024 +0800 Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into colossal-infer-cuda-graph commit 4eafe0c8141c120229be3ddce9c5591c1535348a Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 21 11:28:42 2024 +0800 [fix] unused option commit 7ff42cc06d007ae78fe091da65cb89c4bb62bc38 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 19 18:36:40 2024 +0800 add vec_type_trait implementation (#5473) commit b96557b5e15dbb521bf0f77b6b1f24dcbd9464d6 Merge: b6e97858 48c4f29b Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 19 13:53:26 2024 +0800 Merge pull request #5469 from Courtesy-Xs/add_vec_traits Refactor vector utils commit aabc9fb6aada9e7feb2ff8cf1f34e6ac37ade2e7 Author: Runyu Lu <runyulu@umich.edu> Date: Tue Mar 19 13:24:25 2024 +0800 [feat] add use_cuda_kernel option commit 48c4f29b275e2d8105842913cd84f5d66c378b36 Author: xs_courtesy <xs1580802568@gmail.com> Date: Tue Mar 19 11:32:01 2024 +0800 refactor vector utils commit b6e97858856ee8637216c51f14ac544b1bc0f872 Merge: f366a5ea 5724b9e3 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri Mar 15 11:23:44 2024 +0800 Merge pull request #5457 from Courtesy-Xs/ly_add_implementation_for_launch_config add implementatino for GetGPULaunchConfig1D commit 
5724b9e31e13e07d8ade0444c3e2f3e6894d13b1 Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 15 11:18:57 2024 +0800 add some comments commit 6e30248683c0e4ccc63d15f39f8149875cba1263 Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 14 16:13:00 2024 +0800 [fix] tmp for test commit 388e0439301834a1ad0d11da26b23f4cdc6c82d7 Author: xs_courtesy <xs1580802568@gmail.com> Date: Thu Mar 14 11:13:40 2024 +0800 add implementatino for GetGPULaunchConfig1D commit d02e257abd778812d64491dde893c0d691ed4328 Merge: ae24b4f0 f366a5ea Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com> Date: Thu Mar 14 10:37:05 2024 +0800 Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph commit ae24b4f025285949253a21c41bee4b80679a0bfe Author: Runyu Lu <runyulu@umich.edu> Date: Thu Mar 14 10:35:08 2024 +0800 diverse tests commit 1821a6dab0ad6ad24ae25216e56268c4b0c0d365 Author: Runyu Lu <runyulu@umich.edu> Date: Wed Mar 13 17:28:32 2024 +0800 [fix] pytest and fix dyn grid bug commit f366a5ea1f2626a7870acaf8866f21d5fb49c388 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Mar 13 17:20:03 2024 +0800 [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418) * add rotary embedding kernel * add rotary_embedding_kernel * add fused rotary_emb and kvcache memcopy * add fused_rotary_emb_and_cache_kernel.cu * add fused_rotary_emb_and_memcopy * fix bugs in fused_rotary_emb_and_cache_kernel.cu * fix ci bugs * use vec memcopy and opt the gloabl memory access * fix code style * fix test_rotary_embdding_unpad.py * codes revised based on the review comments * fix bugs about include path * rm inline commit ed431de4e4f73584e6b9c11ab041ef54a8e83de6 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Wed Mar 13 16:00:55 2024 +0800 fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454) commit 6fd355a5a6bb46bfee41d2bc75578e8fba001144 
Merge: b699f540 c1c45e9d Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Wed Mar 13 11:26:41 2024 +0800 Merge pull request #5452 from Courtesy-Xs/fix_include_path fix include path commit c1c45e9d8ecb6743e88e63dd151c617c0014e7c1 Author: xs_courtesy <xs1580802568@gmail.com> Date: Wed Mar 13 11:21:06 2024 +0800 fix include path commit b699f54007c52b2f4ec56326a495b06858cf8856 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Tue Mar 12 17:48:02 2024 +0800 optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441) commit 368a2aa5433d127adaa3674c6d00bb9dc3e0729c Merge: 21e1e364 095c070a Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Tue Mar 12 14:14:37 2024 +0800 Merge pull request #5445 from Courtesy-Xs/refactor_infer_compilation Refactor colossal-infer code arch commit 095c070a6eefe1a76fe3483b21986826114d6d17 Author: xs_courtesy <xs1580802568@gmail.com> Date: Mon Mar 11 17:06:57 2024 +0800 refactor code commit 21e1e3645c8f2e0d4e556f3e13d0d2aa5053911b Merge: f7aecc0c 5eb5ff14 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Mon Mar 11 11:15:29 2024 +0800 Merge pull request #5435 from Courtesy-Xs/add_gpu_launch_config Add query and other components commit 633e95b301336c4c237537f584882b3d8e5f4145 Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 11 10:56:51 2024 +0800 [doc] add doc commit 9dec66fad6c2f85166903aa80d0c077e37512fce Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 11 10:51:16 2024 +0800 [fix] multi graphs capture error commit b2c0d9ff2b4e4015660f2967837688cf7293b21e Author: Runyu Lu <runyulu@umich.edu> Date: Mon Mar 11 10:49:31 2024 +0800 [fix] multi graphs capture error commit f7aecc0c6bac001d10c1dd00274e0152e4c86df6 Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com> Date: Fri Mar 8 16:21:12 2024 +0800 feat rmsnorm cuda kernel and add unittest, benchmark script (#5417) commit 5eb5ff1464311ac16c29307d03a3c076aced7e03 Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 8 15:41:14 2024 +0800 refactor code 
commit 01d289d8e51384131d536b1c223c473aeea463e9 Merge: a46598ac 2b28b54a Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 8 15:04:55 2024 +0800 Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into add_gpu_launch_config commit a46598ac5984c7dc5804d0cf8621698f1a6a8720 Author: xs_courtesy <xs1580802568@gmail.com> Date: Fri Mar 8 14:53:29 2024 +0800 add reusable utils for cuda commit 2b28b54ac6d19d33079d9117b9717fd2779f2b08 Merge: 593a72e4 95c21498 Author: 傅剑寒 <Xs1580802568@gmail.com> Date: Fri Mar 8 14:44:37 2024 +0800 Merge pull request #5433 from Courtesy-Xs/add_silu_and_mul 【Inference】Add silu_and_mul for infer commit cefaeb5fdd551c8b95837a475cb810f4991cf674 Author: Runyu Lu <runyulu@umich.edu> Date: Fri Mar 8 14:19:35 2024 +0800 [feat] cuda graph support and refactor non-functional api commit 95c21498d4f6e640e218f4b00349020f4ae7c69a Author: xs_courtesy <xs1580802568@gmail.com> Date: Thu Mar 7 16:57:49 2024 +0800 add silu_and_mul for infer commit 593a72e4d58b8c3feebde2d19c78d44f702f7b06 Merge: 0aa27f19 0310b76e Author: Frank Lee <somerlee.9@gmail.com> Date: Mon Mar 4 10:13:59 2024 +0800 Merge pull request #5424 from FrankLeeeee/sync/main Sync/main commit 0310b76e9d485703d5afc128b8d97d01b00f3317 Merge: 0aa27f19 4b8312c0 Author: FrankLeeeee <somerlee.9@gmail.com> Date: Mon Mar 4 10:09:36 2024 +0800 Merge branch 'main' into sync/main commit 0aa27f196109bfb4ce6171d7ce921052b9eee969 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Feb 28 16:46:03 2024 +0800 [Inference]Move benchmark-related code to the example directory. (#5408) * move benchmark-related code to the example directory. 
* fix bugs in test_fused_rotary_embedding.py commit 600881a8ea9b17c436ded922a9d4e3d5969acd87 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Feb 28 14:36:50 2024 +0800 [Inference]Add CUDA KVCache Kernel (#5406) * add cuda KVCache kernel * annotation benchmark_kvcache_copy * add use cuda * fix import path * move benchmark scripts to example/ * rm benchmark codes in test_kv_cache_memcpy.py * rm redundancy codes * rm redundancy codes * pr was modified according to the review commit 19061188c396d851ef17bc34b526e2f2b4fc1479 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Feb 26 16:17:47 2024 +0800 [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399) fix dependency in pytest commit bc1da87366d81e144f1f133801d5f20520433c52 Author: yuehuayingxueluo <867460659@qq.com> Date: Fri Feb 23 10:51:35 2024 +0800 [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395) * Fix bugs in inference_engine * fix bugs in engine.py * rm CUDA_VISIBLE_DEVICES * add request_ids in generate * fix bug in engine.py * add logger.debug for BatchBucket commit 2a718c8be89918ec70b88f1f059148a7294dbccb Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Feb 21 13:23:57 2024 +0800 Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390) * opt_view_and_memcopy * fix bugs in ci * fix ci bugs * update benchmark scripts * fix ci bugs commit 730103819dc0636c85af1af80cc17914dcf196c1 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed Feb 21 11:31:48 2024 +0800 [Inference]Fused kv copy into rotary calculation (#5383) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix * fused kv copy * fused copy * colossalai/kernel/triton/no_pad_rotary_embedding.py * del padding llama * del commit b21aac5baeddf7ea19615fae454e6f78f7469cd2 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Feb 19 17:18:20 2024 
+0800 [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367) * add kvcache manager funcs for batching * add batch bucket for batching * revise RunningList struct in handler * add kvcache/batch funcs for compatibility * use new batching methods * fix indexing bugs * revise abort logic * use cpu seq lengths/block tables * rm unused attr in Sequence * fix type conversion/default arg * add and revise pytests * revise pytests, rm unused tests * rm unused statements * fix pop finished indexing issue * fix: use index in batch when retrieving inputs/update seqs * use dict instead of odict in batch struct * arg type hinting * fix make compress * refine comments * fix: pop_n_seqs to pop the first n seqs * add check in request handler * remove redundant conversion * fix test for request handler * fix pop method in batch bucket * fix prefill adding commit 8c69debdc7128e1b8839f12aa3f19ad327569017 Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Feb 8 15:27:26 2024 +0800 [Inference]Support vllm testing in benchmark scripts (#5379) * add vllm benchmark scripts * fix code style * update run_benchmark.sh * fix code style commit 9afa52061f89dde87a73e36f740f62781d658a01 Author: Frank Lee <somerlee.9@gmail.com> Date: Thu Feb 8 14:04:14 2024 +0800 [inference] refactored config (#5376) commit 1f8c7e70469191610d9536029f624b4f30db8caf Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed Feb 7 17:55:48 2024 +0800 [Inference] User Experience: update the logic of default tokenizer and generation config. 
(#5337) * add * fix * fix * pause * fix * fix pytest * align * fix * license * fix * fix * fix readme * fix some bugs * remove tokenizer config commit 6fb4bcbb2420b9f977ab74de60c6d311b6c9ed9a Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Feb 7 17:15:42 2024 +0800 [Inference/opt] Fused KVCahce Memcopy (#5374) * fused kv memcopy * add TODO in test_kvcache_copy.py commit 58740b5f6872bc5a26dbf7c3112b86a1b66c083a Author: Frank Lee <somerlee.9@gmail.com> Date: Wed Feb 7 17:11:43 2024 +0800 [inference] added inference template (#5375) commit 8106ede07fae7e239203feb815162efdf46975ec Author: Frank Lee <somerlee.9@gmail.com> Date: Wed Feb 7 14:27:04 2024 +0800 Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373) This reverts commit 9f4ab2eb924b938348df2c713bb4580972f18eb1. commit 9f4ab2eb924b938348df2c713bb4580972f18eb1 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed Feb 7 11:36:04 2024 +0800 [Inference] Adapt to Fused rotary (#5348) * revise rotary embedding * remove useless print * adapt * fix * add * fix * modeling * fix * fix * fix commit 35382a7fbf96c731ba1ed76cf5529ea3220a5b66 Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Feb 6 19:38:25 2024 +0800 [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365) * fused the gate and up proj in mlp * fix code styles * opt auto_grad * rollback test_inference_engine.py * modifications based on the review feedback. 
* fix bugs in flash attn * Change reshape to view * fix test_rmsnorm_triton.py commit 1dedb57747270f32be5d0e67abc1ad2fff658f8f Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Feb 6 17:27:45 2024 +0800 [Fix/Infer] Remove unused deps and revise requirements (#5341) * remove flash-attn dep * rm padding llama * revise infer requirements * move requirements out of module commit 631862f3390f874db118a25c0137f86630e9b167 Author: yuehuayingxueluo <867460659@qq.com> Date: Fri Feb 2 15:38:21 2024 +0800 [Inference]Optimize generation process of inference engine (#5356) * opt inference engine * fix run_benchmark.sh * fix generate in engine.py * rollback tesh_inference_engine.py commit 21ad4a27f91659220bec6c4d4f2d0f62f7093a45 Author: yuehuayingxueluo <867460659@qq.com> Date: Fri Feb 2 15:06:01 2024 +0800 [Inference/opt]Optimize the mid tensor of RMS Norm (#5350) * opt rms_norm * fix bugs in rms_layernorm commit 027aa1043f1c7b3668d5ca9b91d35c846736e9c4 Author: Frank Lee <somerlee.9@gmail.com> Date: Fri Feb 2 14:31:10 2024 +0800 [doc] updated inference readme (#5343) commit e76acbb076582e0aade1ee8a5fa7696d95c1bef5 Author: Frank Lee <somerlee.9@gmail.com> Date: Fri Feb 2 13:51:22 2024 +0800 [inference] moved ops tests to test_infer (#5354) commit db1a763307a54ca262751ebebd5f1c503d9bca74 Author: Frank Lee <somerlee.9@gmail.com> Date: Fri Feb 2 11:44:15 2024 +0800 [inference] removed redundancy init_batch (#5353) commit 249644c23b0402ccf9d0908f13ed15b41b95145f Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Feb 1 15:49:39 2024 +0800 [Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add (#5340) * add fused qkv * replace attn and mlp by shardformer * fix bugs in mlp * add docstrings * fix test_inference_engine.py * add optimize unbind * add fused_addmm * rm squeeze(1) * refactor codes * fix ci bugs * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention * 
Removed the dependency on LlamaFlashAttention2 * rollback test_inference_engine.py commit f8e456d20295af52665ca06a21f9fd8b468204d7 Author: Frank Lee <somerlee.9@gmail.com> Date: Thu Feb 1 15:31:01 2024 +0800 [inference] simplified config verification (#5346) * [inference] simplified config verification * polish * polish commit df0aa49585d2dd19d7397dfbd3b5f136abac609b Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed Jan 31 16:31:29 2024 +0800 [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336) * revise rotary embedding * remove useless print * adapt commit 1336838a9149fb210a956b0ad338197c4ae77821 Merge: 5f98a9d6 c5655199 Author: Frank Lee <somerlee.9@gmail.com> Date: Wed Jan 31 16:29:26 2024 +0800 Merge pull request #5339 from FrankLeeeee/sync/merge-main Sync/merge main commit c56551991379a457fc34df699710ab94132779fc Merge: 5f98a9d6 71321a07 Author: FrankLeeeee <somerlee.9@gmail.com> Date: Wed Jan 31 10:41:47 2024 +0800 merge commit commit 5f98a9d68a0a35031e1c740c19e33b32f4fa8d9c Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Jan 30 16:06:09 2024 +0800 [Infer] Optimize Blocked KVCache And Kernels Using It (#5325) * revise shape of kvcache (context attn kernel) * revise shape of kvcache (flash decoding kernel) * revise shape of kvcache (kvcache copy) and attn func * init of kvcache in kvcache manager * revise llama modeling * revise block size retrieval * use torch for rms_norm benchmarking * revise block size retrieval commit e8f0642f2841f6aeb6ed0e6695ff9d9ef14f198b Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 30 10:31:46 2024 +0800 [Inference]Add Nopadding Llama Modeling (#5327) * add nopadding llama modeling * add nopadding_llama.py * rm unused codes * fix bugs in test_xine_copy.py * fix code style commit c7c104cb7ccc353faa10667853ed210e042f1be8 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Mon Jan 29 16:21:06 2024 +0800 [DOC] Update inference readme 
(#5280) * add readme * add readme * 1 * update engine * finish readme * add readme commit 1f8a75d470d548bfd4db877e73102b8fad5cdfa9 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Mon Jan 29 10:22:33 2024 +0800 [Inference] Update rms norm kernel, benchmark with vLLM (#5315) * add * xi * del * del * fix commit 7ddd8b37f0f1160e28a2919a2e37f8e8ad199773 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Fri Jan 26 15:02:12 2024 +0800 fix (#5311) commit 4f28cb43c0c2afbc970b9f0f300e7aa28e39bd2e Author: yuehuayingxueluo <867460659@qq.com> Date: Fri Jan 26 14:00:10 2024 +0800 [inference]Optimize the usage of the mid tensors space in flash attn (#5304) * opt flash attn * opt tmp tensor * fix benchmark_llama * fix code style * fix None logic for output tensor * fix adapted to get_xine_cache * add comment * fix ci bugs * fix some codes * rm duplicated codes * rm duplicated codes * fix code style * add _get_dtype in config.py commit af8359c430ce3fabb22748870b67b0c6c33f610c Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Thu Jan 25 10:23:12 2024 +0800 [hotfix] fix boundary check in batch (#5306) commit c647e00e3c092d3d6219f7686f260f2932a0c27d Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Wed Jan 24 16:20:42 2024 +0800 [Inference]Add fused rotary kernel and get cos cache kernel (#5302) * add fused rotary and get cos cache func * staged * fix bugs * fix bugs commit 3da9993b0d03923755c1fcd6279cc4c7b8d00d1e Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Jan 23 17:16:02 2024 +0800 [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301) * fix decoding kernel pytest * revise and add triton context attn benchmark commit 8e606ecc7e89ffed80537e89a27bb1eb6759f4bc Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Tue Jan 23 12:11:53 2024 +0800 [Inference] Benchmarking rotary embedding and add a fetch function (#5277) * fix bugs and add 
a cos/sin cache fetch func * add docstring * fix bug * fix commit b7853196a0a46558d7c0cac7deac9a36c7a5ba38 Merge: bfff9254 cea9c86e Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Jan 22 17:07:14 2024 +0800 Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding [Inference/fix]Add utils.py for Rotary Embedding commit cea9c86e453e36b4848064312c9a4f0d2de6ea98 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Jan 22 16:06:27 2024 +0800 add utils.py commit bfff9254ac8ca866673746ec47cfd2f87aab2b66 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Jan 22 10:55:34 2024 +0800 [inference] Adapted to Rotary Embedding and RMS Norm (#5283) * adapted to rotary_embedding * adapted to nopad rms norm * fix bugs in benchmark * fix flash_decoding.py commit 6e487e7d3cf5295ca908fa69c8e03af8980391bf Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Fri Jan 19 15:47:16 2024 +0800 [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274) * prevent re-creating intermediate tensors * add singleton class holding intermediate values * fix triton kernel api * add benchmark in pytest * fix kernel api and add benchmark * revise flash decoding triton kernel in/out shapes * fix calling of triton kernel in modeling * fix pytest: extract to util functions commit 9e2342bde2c0ffe1a8cdd2fe8917254ef0a06e7f Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Thu Jan 18 16:31:14 2024 +0800 [Hotfix] Fix bugs in testing continuous batching (#5270) * fix bug * fix bugs * fix bugs * fix bugs and add padding * add funcs and fix bugs * fix typos * fix bugs * add func commit 5ae9099f9203a4f8350f383b838e8f2ad15d6fdd Author: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com> Date: Thu Jan 18 10:21:03 2024 +0800 [kernel] Add RMSLayerNorm triton kernel (#5262) * add layerrmsnorm triton kernel * add layerrmsnorm kernel * modify the atol and rtol in test file * Remove the logics of mean computations, and update the name 
of ther kernel functions and files * add benchmark of rms norm commit 86b63f720cf60deefe40874517b3d8e1dccb7af3 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Jan 17 16:03:10 2024 +0800 [Inference]Adapted to the triton attn kernels (#5264) * adapted to the triton attn kernels * fix pad input * adapted to copy_kv_to_blocked_cache * fix ci test * update kv memcpy * remove print commit 0f2b46a41c2c308cc6fbeaf0e86d0e0b93435b77 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Jan 16 14:41:02 2024 +0800 [kernel] Revise KVCache copy triton kernel API (#5273) * [kernel/fix] revise kvcache copy kernel api * fix benchmark commit d8db500efc0e67dea995c2124d20aadd07afb6f0 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Mon Jan 15 17:50:46 2024 +0800 [Inference] Fix request handler and add recycle logic (#5260) * fix request handler * fix comment commit c597678da475abd4ecc075c0b80996989f1bcdc0 Author: Frank Lee <somerlee.9@gmail.com> Date: Mon Jan 15 17:37:41 2024 +0800 [doc] updated inference readme (#5269) commit fa85e02b3b1b316009c4557482f998b903730ec3 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Jan 15 17:37:20 2024 +0800 [kernel] Add KV cache copy kernel during decoding (#5261) * add kv copy triton kernel during decoding stage * add pytest and fix kernel * fix test utilities * revise kernel config * add benchmark for kvcache copy commit 1ded7e81ef08d574798dd98d1f4d33da07b7f4c9 Author: FrankLeeeee <somerlee.9@gmail.com> Date: Thu Jan 11 13:50:45 2024 +0000 [git] fixed rebased files commit 1513f20f4d80f782fab381996368ff2c2f3c95c3 Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Thu Jan 11 18:06:39 2024 +0800 [kernel] Add flash decoding triton kernel for blocked kv cache (#5249) * add flash decoding unpad triton kernel * rename flash decoding kernel * add kernel testing (draft) * revise pytest * support kv group (GQA) * (trivial) fix api and pytest * 
(trivial) func renaming * (trivial) func/file renaming * refactor pytest for attention * (trivial) format and consistent vars of context/decode attn * (trivial) remove test redundancy commit fded91d049997ed87dee965fc42c35a239e3ec03 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Thu Jan 11 16:24:54 2024 +0800 [Inference] Kernel: no pad rotary embedding (#5252) * fix bugs * comment * use more accurate atol * fix commit d40eb26029e8c61fc2b8ef3a1b8126a229e48047 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Jan 10 10:38:53 2024 +0800 fix bugs in request_handler.py and engine.py commit 10e3c9f923caf4fb68ab61e96c244bd5cca9b9da Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 9 15:53:04 2024 +0800 rm torch.cuda.synchronize commit fab294c7f4a5db0a4e19109ac5656492ff3ca08b Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 9 15:18:28 2024 +0800 fix CI bugs commit 2a73e828eba565017d19eaf70a304e1b1eddba1f Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 9 14:29:45 2024 +0800 fix bugs related to processing padding mask commit e545a871b8a89093f5d01e3fea1fe873ef52d51a Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Mon Jan 8 15:56:00 2024 +0800 [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229) * fix accuracy * alignment in attention * fix attention * fix * fix bugs * fix bugs * fix bugs commit fa4fbdbffb6996e8aa1f65bddce5844f2bbbfdf1 Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 9 13:52:53 2024 +0800 adapted to pad_context_forward commit 47e53eaa1ca08fd55b657b53b75d13cc72f9cd05 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Jan 8 12:35:06 2024 +0800 fix bugs in attention.py and request_handler.py commit bfd9b1b494b4414835b22cbba52005921127e4f6 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Thu Jan 4 16:39:00 2024 +0800 [Inference] Pytorch Attention func, pad&nopad input support (#5219) * add attn * add attention test * fix attn forward * fix 
decoding commit 3ad1f3b78b830c90079ed9f1e0b5cd26601194fa Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Jan 4 16:48:53 2024 +0800 fix beam_width commit b2eb9cd18665317ec7900364ef21a38c3edb9e3f Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Jan 4 15:09:06 2024 +0800 Fixed a typo commit bbfebfb9fc5250c1e4d3a6f008af652f7a0a9ca0 Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Jan 4 15:03:18 2024 +0800 fix bugs in sampler commit 02c1bf8b2abef137a653b86b733d66b6dfbcc022 Author: yuehuayingxueluo <867460659@qq.com> Date: Wed Jan 3 18:50:26 2024 +0800 add context_attention_unpadded commit 07b5283b6a3899ebe84cbe8c7902d142ffbc4b9c Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Wed Jan 3 14:41:35 2024 +0800 [kernel] Add triton kernel for context attention (FAv2) without padding (#5192) * add context attn unpadded triton kernel * test compatibility * kv cache copy (testing) * fix k/v cache copy * fix kv cache copy and test * fix boundary of block ptrs * add support for GQA/MQA and testing * fix import statement --------- Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local> commit 4df8876fcad799ace567b2458df5feb3109ee917 Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 2 18:34:19 2024 +0800 Fixed a writing error commit 9489dc64d8e01b04c9033c3dcaee83e25afebe42 Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 2 18:30:11 2024 +0800 precision alignment commit 62968588d195126adc9b1bdb3adc02f199303ddf Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Jan 2 13:02:20 2024 +0800 fix bugs in request_handler commit 62fd08ee4425e031f8f1c43b25bf1ba5e7e33e8d Author: yuehuayingxueluo <867460659@qq.com> Date: Tue Dec 26 21:34:27 2023 +0800 Fixed a bug in the inference frame commit 86853a37d5243b40d4b229d163494624b8027cd0 Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Dec 25 14:07:43 2023 +0800 Add padding llama model commit 0e616462a7f9e8faaa33d1700a2020ceb03ccd34 Author: Jianghai 
<72591262+CjhHa1@users.noreply.github.com> Date: Mon Dec 25 12:15:15 2023 +0800 [Inference] add logit processor and request handler (#5166) * add logit processor and request handler * add * add * add * fix * add search tokens and update func * finish request handler * add running list test * fix test * fix some bug * add * add * fix bugs * fix some bugs * fix bug * fix * fix * add copy fun * del useless attn * fix request status --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> commit 8daee26989adad5ae5b152b24d3344db727986fe Author: yuehuayingxueluo <867460659@qq.com> Date: Mon Dec 18 10:40:47 2023 +0800 [Inference] Add the logic of the inference engine (#5173) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. * update codes * add ci test for config and struct * Add the logic of the inference engine * update engine and test * Recover cache_manager.py * add logger * fix conflict * update codes * update codes * update model and tokenizer * fix add the logic about shardformer * change kvcache_manager docstring * add policy * fix ci bug in test_kvcache_manager.py * remove codes related o tokenizer and move model_policy * fix code style * add ordered_set to requirements-infer.txt * Delete extra empty lines * add ordered_set to requirements-test.txt commit 93aeacca342ab03732362dbb9096ab1265f4a8b3 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Tue Dec 12 17:22:41 2023 +0800 [Inference]Update inference config and fix test (#5178) * unify the config setting * fix test * fix import * fix test * fix * fix * add logger * revise log info --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> commit 3de2e622995321b042d4a8cffcd61686cda4a58e Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Mon Dec 11 10:56:18 2023 +0800 [Inference] Add CacheBlock and 
KV-Cache Manager (#5156) * [Inference] Add KVCache Manager * function refactored * add test for KVCache Manager * add attr beam width * Revise alloc func in CacheManager * Fix docs and pytests * add tp slicing for head number * optimize shapes of tensors used as physical cache * Apply using InferenceConfig on KVCacheManager * rm duplicate config file * Optimize cache allocation: use contiguous cache * Fix config in pytest (and config) commit fab9b931d9e24c6e8ada8025cf8cf12719c3d2af Author: yuehuayingxueluo <867460659@qq.com> Date: Thu Dec 7 14:34:01 2023 +0800 [Inference]Add BatchInferState, Sequence and InferConfig (#5149) * add infer_struct and infer_config * update codes * change InferConfig * Add hf_model_config to the engine * rm _get_hf_model_config * update codes * made adjustments according to the feedback from the reviewer. * update codes * add ci test for config and struct commit 2bb92243d4151873d75a9d6d9c2275b390e1716a Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com> Date: Tue Dec 5 15:12:57 2023 +0800 [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159) * [inference/nfc] remove outdated inference tests * remove outdated kernel tests * remove deprecated triton kernels * remove imports from deprecated kernels commit 56e75eeb063279fbc0fc84e25f267f1ca208e784 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Fri Dec 1 17:31:31 2023 +0800 [Inference] Add readme (roadmap) and fulfill request handler (#5147) * request handler * add readme --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com> commit 4cf4682e70f70dea8e0510705d3383de0bf1a4a8 Author: Jianghai <72591262+CjhHa1@users.noreply.github.com> Date: Fri Dec 1 17:02:44 2023 +0800 [Inference] First PR for rebuild colossal-infer (#5143) * add engine and scheduler * add dirs --------- Co-authored-by: CjhHa1 <cjh18671720497outlook.com>
📌 Checklist before creating the PR
[doc/gemini/tensor/...]: A concise description
🚨 Issue number
Part of #5245
📝 What does this PR do?
Revise and adapt the Triton kernels for speculative decoding, which handles `n` tokens in parallel. This enables 1) the kv-cache-copy kernel to copy multiple tokens for each sequence, and 2) decoding attention to receive inputs with `q_len > 1`.
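As a rough illustration of the two changes described here, the sketch below mimics them in plain Python rather than Triton. It is not the kernel code from this PR: the function names, argument layout, and the causal rule for the `n` query tokens are all assumptions made for illustration. It shows how the last `n` tokens of each sequence could be scattered into a blocked KV cache through a block table, and how many cached positions each of the `n` query tokens would attend to when `q_len > 1`.

```python
def copy_new_tokens_to_blocked_cache(new_tokens, cache, block_tables, seq_lens, n, block_size):
    """Hypothetical reference for a multi-token kv-cache copy.

    new_tokens:   flat list of length batch_size * n, sequence-major
    cache:        list of physical blocks, each a list of block_size slots
    block_tables: per sequence, logical block index -> physical block id
    seq_lens:     per-sequence length AFTER the n new tokens are appended
    """
    for seq_idx, seq_len in enumerate(seq_lens):
        assert seq_len >= n, "each sequence must already contain its n new tokens"
        for i in range(n):
            pos = seq_len - n + i  # logical position of the i-th new token
            block_id = block_tables[seq_idx][pos // block_size]
            cache[block_id][pos % block_size] = new_tokens[seq_idx * n + i]
    return cache


def visible_kv_len(seq_len, n, i):
    # Assumed causal rule: the i-th of the n new query tokens attends to every
    # cached position up to and including its own.
    return seq_len - n + i + 1


# Two sequences, n = 2 new tokens each, blocks of 4 slots.
block_size, n = 4, 2
cache = [[None] * block_size for _ in range(4)]
block_tables = [[2, 0], [1, 3]]
seq_lens = [5, 3]
new_tokens = ["a3", "a4", "b1", "b2"]
copy_new_tokens_to_blocked_cache(new_tokens, cache, block_tables, seq_lens, n, block_size)
kv_lens_seq0 = [visible_kv_len(seq_lens[0], n, i) for i in range(n)]
```

With `n = 1` this reduces to the usual single-token decoding case: one slot is written per sequence and the single query token attends to the full sequence length.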
💥 Checklist before requesting a review
⭐️ Do you enjoy contributing to Colossal-AI?
Tell us more if you don't enjoy contributing to Colossal-AI.