[Infer] Revise and Adapt Triton Kernels for Spec-Dec #5401

Conversation

yuanheng-zhao
Contributor

@yuanheng-zhao yuanheng-zhao commented Feb 26, 2024

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

Part of #5245

📝 What does this PR do?

Summarize your work here.
if you have any plots/diagrams/screenshots/tables, please attach them here.

  • This PR revises the Triton kernels so that the main (large) model can verify n tokens in parallel during speculative decoding. It enables (1) the KV-cache copy kernel to copy multiple tokens per sequence and (2) the decoding attention kernel to accept inputs with q_len > 1.
  • Adds new test cases for both changes.
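
For readers unfamiliar with the two changes, here is a rough pure-Python reference sketch of the behaviour described above. It is not the Triton code from this PR. The function names, argument layout, the convention that `kv_lengths` already includes the n new tokens, and the causal masking among the n query tokens are all assumptions made for illustration; heads and batching in attention are omitted for brevity.

```python
import math
from typing import List

Vector = List[float]


def copy_kv_to_blocked_cache_ref(
    new_kv: List[List[Vector]],    # [bsz][n_tokens][head_dim], hypothetical layout
    cache: List[List[Vector]],     # [num_blocks][block_size][head_dim]
    kv_lengths: List[int],         # assumed to already include the n new tokens
    block_tables: List[List[int]], # [bsz][max_blocks_per_seq] -> physical block id
    n_tokens: int = 1,
) -> None:
    """Write the last n_tokens of every sequence into its cache blocks."""
    block_size = len(cache[0])
    for seq_idx, seq_new in enumerate(new_kv):
        assert len(seq_new) == n_tokens
        start = kv_lengths[seq_idx] - n_tokens  # position of the first new token
        assert start >= 0
        for i in range(n_tokens):
            pos = start + i
            block_id = block_tables[seq_idx][pos // block_size]
            assert block_id >= 0, "position maps to an unallocated block"
            cache[block_id][pos % block_size] = list(seq_new[i])


def decode_attention_ref(
    q: List[Vector],  # [q_len][head_dim], the n tokens being verified
    k: List[Vector],  # [>= kv_len][head_dim], already gathered from the cache
    v: List[Vector],  # same shape as k
    kv_len: int,      # assumed to already include the q_len new tokens
) -> List[Vector]:
    """Single-head attention where query i only sees keys up to its own position."""
    q_len, head_dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for i, q_vec in enumerate(q):
        visible = kv_len - q_len + i + 1  # causal mask: keys [0, visible)
        scores = [
            scale * sum(a * b for a, b in zip(q_vec, k[j])) for j in range(visible)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [
                sum(weights[j] * v[j][c] for j in range(visible)) / total
                for c in range(head_dim)
            ]
        )
    return out
```

Under these assumptions, calling `decode_attention_ref` once with q_len = n gives the same per-token outputs as n successive calls with q_len = 1, which is the property that lets the large model verify the drafted tokens in one pass.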

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@yuanheng-zhao yuanheng-zhao marked this pull request as ready for review February 27, 2024 07:33
@yuanheng-zhao yuanheng-zhao requested a review from a team as a code owner February 27, 2024 07:33
@FrankLeeeee FrankLeeeee merged commit 2d62aca into hpcaitech:feat/speculative-decoding Feb 28, 2024
@yuanheng-zhao yuanheng-zhao deleted the feat/spec-dec/kernels branch February 28, 2024 06:26
yuanheng-zhao added a commit that referenced this pull request Apr 5, 2024
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* resolve conflicts for revising flash-attn

* adapt kv cache copy kernel for spec-dec

* fix seqlen-n kvcache copy kernel/tests

* test kvcache copy - use torch.equal

* add assertions

* (trivial) comment out
yuanheng-zhao added a commit that referenced this pull request Apr 10, 2024
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* resolve conflicts for revising flash-attn

* adapt kv cache copy kernel for spec-dec

* fix seqlen-n kvcache copy kernel/tests

* test kvcache copy - use torch.equal

* add assertions

* (trivial) comment out
botbw added a commit that referenced this pull request May 23, 2024
* [Inference] First PR for rebuild colossal-infer (#5143)

* add engine and scheduler

* add dirs

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

* [Inference] Add readme (roadmap) and fulfill request handler (#5147)

* request handler

* add readme

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

* [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)

* [inference/nfc] remove outdated inference tests

* remove outdated kernel tests

* remove deprecated triton kernels

* remove imports from deprecated kernels

* [Inference]Add BatchInferState, Sequence and InferConfig (#5149)

* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct

* [Inference] Add CacheBlock and KV-Cache Manager (#5156)

* [Inference] Add KVCache Manager

* function refactored

* add test for KVCache Manager

* add attr beam width

* Revise alloc func in CacheManager

* Fix docs and pytests

* add tp slicing for head number

* optimize shapes of tensors used as physical cache

* Apply using InferenceConfig on KVCacheManager

* rm duplicate config file

* Optimize cache allocation: use contiguous cache

* Fix config in pytest (and config)

* [Inference]Update inference config and fix test (#5178)

* unify the config setting

* fix test

* fix import

* fix test

* fix

* fix

* add logger

* revise log info

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

* [Inference] Add the logic of the inference engine (#5173)

* add infer_struct and infer_config

* update codes

* change InferConfig

* Add hf_model_config to the engine

* rm _get_hf_model_config

* update codes

* made adjustments according to the feedback from the reviewer.

* update codes

* add ci test for config and struct

* Add the logic of the inference engine

* update engine and test

* Recover cache_manager.py

* add logger

* fix conflict

* update codes

* update codes

* update model and tokenizer

* fix add the logic about shardformer

* change kvcache_manager docstring

* add policy

* fix ci bug in test_kvcache_manager.py

* remove codes related to tokenizer and move model_policy

* fix  code style

* add ordered_set to requirements-infer.txt

* Delete extra empty lines

* add ordered_set to requirements-test.txt

* [Inference] add logit processor and request handler (#5166)

* add logit processor and request handler

* add

* add

* add

* fix

* add search tokens and update func

* finish request handler

* add running list test

* fix test

* fix some bug

* add

* add

* fix bugs

* fix some bugs

* fix bug

* fix

* fix

* add copy fun

* del useless attn

* fix request status

---------

Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

* Add padding llama model

* Fixed a bug in the inference frame

* fix bugs in request_handler

* precision alignment

* Fixed a writing error

* [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)

* add context attn unpadded triton kernel

* test compatibility

* kv cache copy (testing)

* fix k/v cache copy

* fix kv cache copy and test

* fix boundary of block ptrs

* add support for GQA/MQA and testing

* fix import statement

---------

Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>

* add context_attention_unpadded

* fix bugs in sampler

* Fixed a typo

* fix beam_width

* [Inference] Pytorch Attention func, pad&nopad input support (#5219)

* add attn

* add attention test

* fix attn forward

* fix decoding

* fix bugs in attention.py and request_handler.py

* adapted to pad_context_forward

* [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)

* fix accuracy

* alignment in attention

* fix attention

* fix

* fix bugs

* fix bugs

* fix bugs

* fix bugs related to processing padding mask

* fix CI bugs

* rm torch.cuda.synchronize

* fix bugs in request_handler.py and engine.py

* [Inference] Kernel: no pad rotary embedding (#5252)

* fix bugs

* comment

* use more accurate atol

* fix

* [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)

* add flash decoding unpad triton kernel

* rename flash decoding kernel

* add kernel testing (draft)

* revise pytest

* support kv group (GQA)

* (trivial) fix api and pytest

* (trivial) func renaming

* (trivial) func/file renaming

* refactor pytest for attention

* (trivial) format and consistent vars of context/decode attn

* (trivial) remove test redundancy

* [git] fixed rebased files

* [kernel] Add KV cache copy kernel during decoding  (#5261)

* add kv copy triton kernel during decoding stage

* add pytest and fix kernel

* fix test utilities

* revise kernel config

* add benchmark for kvcache copy

* [doc] updated inference readme (#5269)

* [Inference] Fix request handler and add recycle logic (#5260)

* fix request handler

* fix comment

* [kernel] Revise KVCache copy triton kernel API (#5273)

* [kernel/fix] revise kvcache copy kernel api

* fix benchmark

* [Inference]Adapted to the triton attn kernels (#5264)

* adapted to the triton attn kernels

* fix pad input

* adapted to copy_kv_to_blocked_cache

* fix ci test

* update kv memcpy

* remove print

* [kernel] Add RMSLayerNorm triton kernel (#5262)

* add layerrmsnorm triton kernel

* add layerrmsnorm kernel

* modify the atol and rtol in test file

* Remove the logic of mean computations, and update the names of the kernel functions and files

* add benchmark of rms norm

* [Hotfix] Fix bugs in testing continuous batching (#5270)

* fix bug

* fix bugs

* fix bugs

* fix bugs and add padding

* add funcs and fix bugs

* fix typos

* fix bugs

* add func

* [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)

* prevent re-creating intermediate tensors

* add singleton class holding intermediate values

* fix triton kernel api

* add benchmark in pytest

* fix kernel api and add benchmark

* revise flash decoding triton kernel in/out shapes

* fix calling of triton kernel in modeling

* fix pytest: extract to util functions

* [inference] Adapted to Rotary Embedding and RMS Norm (#5283)

* adapted to rotary_embedding

* adapted to nopad rms norm

* fix bugs in benchmark

* fix flash_decoding.py

* add utils.py

* [Inference] Benchmarking rotary embedding and add a fetch function (#5277)

* fix bugs and add a cos/sin cache fetch func

* add docstring

* fix bug

* fix

* [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)

* fix decoding kernel pytest

* revise and add triton context attn benchmark

* [Inference]Add fused rotary kernel and get cos cache kernel (#5302)

* add fused rotary and get cos cache func

* staged

* fix bugs

* fix bugs

* [hotfix] fix boundary check in batch (#5306)

* [inference]Optimize the usage of the mid tensors space in flash attn (#5304)

* opt flash attn

* opt tmp tensor

* fix benchmark_llama

* fix code style

* fix None logic for output tensor

* fix adapted to get_xine_cache

* add comment

* fix ci bugs

* fix some codes

* rm duplicated codes

* rm duplicated codes

* fix code style

* add _get_dtype in config.py

* fix (#5311)

* [Inference] Update rms norm kernel, benchmark with vLLM (#5315)

* add

* xi

* del

* del

* fix

* [DOC] Update inference readme  (#5280)

* add readme

* add readme

* 1

* update engine

* finish readme

* add readme

* [Inference]Add Nopadding Llama Modeling (#5327)

* add nopadding llama modeling

* add nopadding_llama.py

* rm unused codes

* fix bugs in test_xine_copy.py

* fix code style

* [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)

* revise shape of kvcache (context attn kernel)

* revise shape of kvcache (flash decoding kernel)

* revise shape of kvcache (kvcache copy) and attn func

* init of kvcache in kvcache manager

* revise llama modeling

* revise block size retrieval

* use torch for rms_norm benchmarking

* revise block size retrieval

* [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)

* revise rotary embedding

* remove useless print

* adapt

* [inference] simplified config verification (#5346)

* [inference] simplified config verification

* polish

* polish

* [Inference]Repalce Attention layer and MLP layer by shardformer to optimize the weight transpose operation,add fused_qkv and fused linear_add (#5340)

* add fused qkv

* replace attn and mlp by shardformer

* fix bugs in mlp

* add docstrings

* fix test_inference_engine.py

* add optimize unbind

* add fused_addmm

* rm squeeze(1)

* refactor codes

* fix ci bugs

* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

* Removed the dependency on LlamaFlashAttention2

* rollback test_inference_engine.py

* [inference] removed redundancy init_batch (#5353)

* [inference] moved ops tests to test_infer (#5354)

* [doc] updated inference readme (#5343)

* [Inference/opt]Optimize the mid tensor of RMS Norm (#5350)

* opt rms_norm

* fix bugs in rms_layernorm

* [Inference]Optimize generation process of inference engine (#5356)

* opt inference engine

* fix run_benchmark.sh

* fix generate in engine.py

* rollback tesh_inference_engine.py

* [Fix/Infer] Remove unused deps and revise requirements (#5341)

* remove flash-attn dep

* rm padding llama

* revise infer requirements

* move requirements out of module

* [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365)

* fused the gate and up proj in mlp

* fix code styles

* opt auto_grad

* rollback test_inference_engine.py

* modifications based on the review feedback.

* fix bugs in flash attn

* Change reshape to view

* fix test_rmsnorm_triton.py

* [Inference] Adapt to Fused rotary (#5348)

* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)

This reverts commit 9f4ab2e.

* [inference] added inference template (#5375)

* [Inference/opt] Fused KVCahce Memcopy (#5374)

* fused kv memcopy

* add TODO in test_kvcache_copy.py

* [Inference] User Experience: update the logic of default tokenizer and generation config.  (#5337)

* add

* fix

* fix

* pause

* fix

* fix pytest

* align

* fix

* license

* fix

* fix

* fix readme

* fix some bugs

* remove tokenizer config

* [inference] refactored config (#5376)

* [Inference]Support vllm testing in benchmark scripts (#5379)

* add vllm benchmark scripts

* fix code style

* update run_benchmark.sh

* fix code style

* [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)

* add kvcache manager funcs for batching

* add batch bucket for batching

* revise RunningList struct in handler

* add kvcache/batch funcs for compatibility

* use new batching methods

* fix indexing bugs

* revise abort logic

* use cpu seq lengths/block tables

* rm unused attr in Sequence

* fix type conversion/default arg

* add and revise pytests

* revise pytests, rm unused tests

* rm unused statements

* fix pop finished indexing issue

* fix: use index in batch when retrieving inputs/update seqs

* use dict instead of odict in batch struct

* arg type hinting

* fix make compress

* refine comments

* fix: pop_n_seqs to pop the first n seqs

* add check in request handler

* remove redundant conversion

* fix test for request handler

* fix pop method in batch bucket

* fix prefill adding

* [Inference]Fused kv copy into rotary calculation (#5383)

* revise rotary embedding

* remove useless print

* adapt

* fix

* add

* fix

* modeling

* fix

* fix

* fix

* fused kv copy

* fused copy

* colossalai/kernel/triton/no_pad_rotary_embedding.py

* del padding llama

* del

* Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)

* opt_view_and_memcopy

* fix bugs in ci

* fix ci bugs

* update benchmark scripts

* fix ci bugs

* [Fix/Inference] Fix format of input prompts and input model  in inference engine (#5395)

* Fix bugs in inference_engine

* fix bugs in engine.py

* rm  CUDA_VISIBLE_DEVICES

* add request_ids in generate

* fix bug in engine.py

* add logger.debug for BatchBucket

* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* [Inference]Add CUDA KVCache Kernel (#5406)

* add cuda KVCache kernel

* annotation benchmark_kvcache_copy

* add use cuda

* fix import path

* move benchmark scripts to example/

* rm benchmark codes in test_kv_cache_memcpy.py

* rm redundancy codes

* rm redundancy codes

* pr was modified according to the review

* [Inference]Move benchmark-related code to the example directory. (#5408)

* move benchmark-related code to the example directory.

* fix bugs in test_fused_rotary_embedding.py

* add silu_and_mul for infer

* [feat] cuda graph support and refactor non-functional api

* add reusable utils for cuda

* refactor code

* feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)

* [fix] multi graphs capture error

* [fix] multi graphs capture error

* [doc] add doc

* refactor code

* optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)

* fix include path

* fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454)

* [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)

* add rotary embedding kernel

* add rotary_embedding_kernel

* add fused rotary_emb and kvcache memcopy

* add fused_rotary_emb_and_cache_kernel.cu

* add fused_rotary_emb_and_memcopy

* fix bugs in fused_rotary_emb_and_cache_kernel.cu

* fix ci bugs

* use vec memcopy and opt the global memory access

* fix code style

* fix test_rotary_embdding_unpad.py

* codes revised based on the review comments

* fix bugs about include path

* rm inline

* [fix] pytest and fix dyn grid bug

* diverse tests

* add implementation for GetGPULaunchConfig1D

* [fix] tmp for test

* add some comments

* refactor vector utils

* [feat] add use_cuda_kernel option

* add vec_type_trait implementation (#5473)

* [fix] unused option

* [fix]

* [fix]

* [fix] remove unused comment

* [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)

* Support FP16/BF16 Flash Attention 2

* fix bugs in test_kv_cache_memcpy.py

* add context_kv_cache_memcpy_kernel.cu

* rm typename MT

* add tail process

* add high_precision

* add high_precision to config.py

* rm unused code

* change the comment for the high_precision parameter

* update test_rotary_embdding_unpad.py

* fix vector_copy_utils.h

* add comment for self.high_precision when using float32

* [fix] PR #5354 (#5501)

* [fix]

* [fix]

* Update config.py docstring

* [fix] docstring align

* [fix] docstring align

* [fix] docstring align

* [Inference] Optimize request handler of llama (#5512)

* optimize request_handler

* fix ways of writing

* The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519)

* [Inference/Kernel]Add get_cos_and_sin Kernel (#5528)

* Add get_cos_and_sin kernel

* fix code comments

* fix code typos

* merge common codes of get_cos_and_sin kernel.

* Fixed a typo

* Changed 'asset allclose' to 'assert equal'.

* [Inference] Add Reduce Utils (#5537)

* add reduce utils

* add using to delete namespace prefix

* [Fix/Inference] Remove unused and non-functional functions (#5543)

* [fix] remove unused func

* rm non-functional partial

* add cast and op_functor for cuda build-in types (#5546)

* remove unused triton kernels

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove outdated triton test

* [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)

* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* resolve conflicts for revising flash-attn

* adapt kv cache copy kernel for spec-dec

* fix seqlen-n kvcache copy kernel/tests

* test kvcache copy - use torch.equal

* add assertions

* (trivial) comment out

* [Inference/SpecDec] Add Basic Drafter Model Container (#5405)

* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

fix dependency in pytest

* add drafter model container (basic ver)

* [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)

* fix flash decoding mask during verification

* add spec-dec

* add test for spec-dec

* revise drafter init

* remove drafter sampling

* retire past kv in drafter

* (trivial) rename attrs

* (trivial) rename arg

* revise how we enable/disable spec-dec

* [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)

* fix drafter pastkv and usage of batch bucket

* [Inference/SpecDec] Support GLIDE Drafter Model (#5455)

* add glide-llama policy and modeling

* update glide modeling, compatible with transformers 4.36.2

* revise glide llama modeling/usage

* fix issues of glimpsing large kv

* revise the way re-loading params for glide drafter

* fix drafter and engine tests

* enable convert to glide strict=False

* revise glide llama modeling

* revise vicuna prompt template

* revise drafter and tests

* apply usage of glide model in engine

* [doc] Add inference/speculative-decoding README (#5552)

* add README for spec-dec

* update roadmap

* [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)

- resolve conflicts of rebasing feat/speculative-decoding

* [Fix] Llama Modeling Control with Spec-Dec (#5580)

- fix ref before asgmt
- fall back to use triton kernels when using spec-dec

* refactor csrc (#5582)

* [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)

* delete duplicated code and refactor vec_copy utils and reduce utils

* delete unused header file

* [inference/model]Adapted to the baichuan2-7B model (#5591)

* Adapted to the baichuan2-7B model

* modified according to the review comments.

* Modified the method of obtaining random weights.

* modified according to the review comments.

* change mlp layer 'NOTE'

* [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)

* feat flash decoding for paged attention

* refactor flashdecodingattention

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Feat]Tensor Model Parallel Support For Inference (#5563)

* tensor parallel support naive source

* [fix]precision, model load and refactor the framework

* add tp unit test

* docstring

* fix do_sample

* feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611)

* [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)

* [fix] GQA calling of flash decoding triton

* fix kv cache alloc shape

* fix rotary triton - GQA

* fix sequence max length assigning

* Sequence max length logic

* fix scheduling and spec-dec

* skip without import error

* fix pytest - skip without ImportError

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix/Inference]Fix CUDA Rotary Rmbedding GQA (#5623)

* fix rotary embedding GQA

* change test_rotary_embdding_unpad.py KH

* [example] Update Llama Inference example (#5629)

* [example] add inference benchmark llama3

* revise inference config - arg

* remove unused args

* add llama generation demo script

* fix init rope in llama policy

* add benchmark-llama3 - cleanup

* [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)

* refactor compilation mechanism and unified multi hw

* fix file path bug

* add init.py to make pybind a module to avoid relative path error caused by softlink

* delete duplicated micros

* fix micros bug in gcc

* [Fix/Inference]Fix vllm benchmark (#5630)

* Fix bugs about OOM when running vllm-0.4.0

* rm used params

* change generation_config

* change benchmark log file name

* [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)

* optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix] Remove obsolete files - inference (#5650)

* [Inference]Adapt to baichuan2 13B (#5614)

* adapt to baichuan2 13B

* adapt to baichuan2 13B

* change BAICHUAN_MODEL_NAME_OR_PATH

* fix test_decoding_attn.py

* Modifications based on review comments.

* change BAICHUAN_MODEL_NAME_OR_PATH

* mv attn mask processes to test flash decoding

* mv get_alibi_slopes baichuan modeling

* fix bugs in test_baichuan.py

* [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)

* add context attn triton kernel - new kcache layout

* add benchmark triton

* tiny revise

* trivial - code style, comment

* [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)

* [Inference/Feat] Feat quant kvcache step2 (#5674)

* [Inference] Adapt Baichuan2-13B TP (#5659)

* adapt to baichuan2 13B

* add baichuan2 13B TP

* update baichuan tp logic

* rm unused code

* Fix TP logic

* fix alibi slopes tp logic

* rm nn.Module

* Polished the code.

* change BAICHUAN_MODEL_NAME_OR_PATH

* Modified the logic for loading Baichuan weights.

* fix typos

* [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)

* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator

* refactor decode_kv_cache_memcpy

* enable alibi in pagedattention

* [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)

* [inference]Add alibi to flash attn function (#5678)

* add alibi to flash attn function

* rm redundant modifications

* [Inference] Fix quant bits order (#5681)

* [kernel] Support New KCache Layout - Triton Kernel (#5677)

* kvmemcpy triton for new kcache layout

* revise tests for new kcache layout

* naive triton flash decoding - new kcache layout

* rotary triton kernel - new kcache layout

* remove redundancy - triton decoding

* remove redundancy - triton kvcache copy

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [Fix] Fix & Update Inference Tests (compatibility w/ main)

* [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)

* [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)

* [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)

- Fix key value number assignment in KVCacheManager, as well as method of accessing

* [Fix] Fix Inference Example, Tests, and Requirements (#5688)

* clean requirements

* modify example inference struct

* add test ci scripts

* mark test_infer as submodule

* rm deprecated cls & deps

* import of HAS_FLASH_ATTN

* prune inference tests to be run

* prune triton kernel tests

* increment pytest timeout mins

* revert import path in openmoe

* [hotfix] fix OpenMOE example import path (#5697)

* [Inference]Adapt temperature processing logic (#5689)

* Adapt temperature processing logic

* add ValueError for top_p and top_k

* add GQA Test

* fix except_msg

* [Inference] Support the logic related to ignoring EOS token (#5693)

* Adapt temperature processing logic

* add ValueError for top_p and top_k

* add GQA Test

* fix except_msg

* support ignore EOS token

* change variable's name

* fix annotation

* [Inference] ADD  async and sync Api server using FastAPI (#5396)

* add api server

* fix

* add

* add completion service and fix bug

* add generation config

* revise shardformer

* fix bugs

* add docstrings and fix some bugs

* fix bugs and add choices for prompt template

* [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)

* finish online test and add examples

* fix test_contionus_batching

* fix some bugs

* fix bash

* fix

* fix inference

* finish revision

* fix typos

* revision

* [Online Server] Chat Api for streaming and not streaming response (#5470)

* fix bugs

* fix bugs

* fix api server

* fix api server

* add chat api and test

* del request.n

* [Inference] resolve rebase conflicts

fix

* [Inference] Fix bugs and docs for feat/online-server (#5598)

* fix test bugs

* add do sample test

* del useless lines

* fix comments

* fix tests

* delete version tag

* delete version tag

* add

* del test server

* fix test

* fix

* Revert "add"

This reverts commit b9305fb.

* resolve rebase conflicts on Branch feat/online-serving

* [Inference] Add example test_ci script

* [Inference/Feat] Add quant kvcache interface (#5700)

* add quant kvcache interface

* delete unused output

* complete args comments

* [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)

* add convert_fp8 op for fp8 test in the future

* rerun ci

* [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)

* Adapt repetition_penalty and no_repeat_ngram_size

* fix no_repeat_ngram_size_logit_process

* remove batch_updated

* fix annotation

* modified codes based on the review feedback.

* rm get_batch_token_ids

* [Feat]Inference RPC Server Support (#5705)

* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unitest
* Rpyc support

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add paged-attetionv2: support seq length split across thread block (#5707)

* [Inference] Delete duplicated copy_vector (#5716)

* [ci] Fix example tests (#5714)

* [fix] revise timeout value on example CI

* trivial

* [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)

* Fix Llama3 Load error
* Omit Checkpoint IO Temporarily

* [Inference] Fix API server, test and example (#5712)

* fix api server

* fix generation config

* fix api server

* fix comments

* fix infer hanging bug

* resolve comments, change backend to free port

* 【Inference] Delete duplicated package (#5723)

* [example] Update Inference Example (#5725)

* [example] update inference example

* [lazy] fix lazy cls init (#5720)

* fix

* fix

* fix

* fix

* fix

* remove kernel install

* rebase

revert

fix

* fix

* fix

* [Inference] Fix Inference Generation Config and Sampling (#5710)

* refactor and add

* config default values

* fix gen config passing

* fix rpc generation config

* [Fix/Inference] Add unsupported auto-policy error message (#5730)

* [fix] auto policy error message

* trivial

* [doc] Update Inference Readme (#5736)

* [doc] update inference readme

* add contents

* trivial

* [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

* add parallel cross entropy output for falcon model & fix some typos in bloom.py

* fix module name error, self.model -> self.transformers in bloom, falcon model

* Fix the overflow bug of distributed cross entropy loss function when training with fp16

* add dtype to parallel cross entropy loss function

* fix dtype related typos and prettify the loss.py

* fix grad dtype and update dtype mismatch error

* fix typo bugs

* [bug] fix silly bug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [chore] add test for prefetch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [ci] Temporary fix for build on pr (#5741)

* temporary fix for CI

* timeout to 90

* [NFC] Fix code factors on inference triton kernels (#5743)

* [NFC]  fix requirements (#5744)

* [inference] release (#5747)

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

* [inference] release

---------

Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
Co-authored-by: FrankLeeeee <somerlee.9@gmail.com>
Co-authored-by: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Co-authored-by: xs_courtesy <xs1580802568@gmail.com>
Co-authored-by: Runyu Lu <runyulu@umich.edu>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: Yuanheng <jonathan.zhaoyh@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: Haze188 <haze188@qq.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
botbw added a commit that referenced this pull request May 23, 2024
commit 4647ec28c8450ee96f4709626617763712efd77e
Author: binmakeswell <binmakeswell@gmail.com>
Date:   Thu May 23 17:44:06 2024 +0800

    [inference] release (#5747)

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

    * [inference] release

commit df6747603f11e2a1929db193ceb014799e02e2c1
Merge: 22ce873c 498f42c4
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 22 14:31:09 2024 +0800

    [Colossal-Inference] (v0.1.0) Merge pull request #5739 from hpcaitech/feature/colossal-infer

    [Inference] Merge feature/colossal-infer

commit 498f42c45b256b5cfc32d74b552e1e306f317a42
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 22 12:08:49 2024 +0800

    [NFC]  fix requirements (#5744)

commit bd38fe6b912379080673a43d77fd3bdf0e5c852e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 21 22:12:15 2024 +0800

    [NFC] Fix code factors on inference triton kernels (#5743)

commit c2c8c9cf17d67000df8a5b75ae9dbecee0e1c00a
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 21 18:20:57 2024 +0800

    [ci] Temporary fix for build on pr (#5741)

    * temporary fix for CI

    * timeout to 90

commit c06208e72c35d74e150b6a83e72375f5021d10b1
Merge: d8b1ea4a 8633c15d
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 21 11:26:37 2024 +0800

    Merge pull request #5737 from yuanheng-zhao/inference/sync/main

    [sync] Sync feature/colossal-infer with main

commit 22ce873c3f26fd7f4217cdf19071c173683c2b47
Author: Haze188 <haze188@qq.com>
Date:   Tue May 21 11:07:13 2024 +0800

    [Shardformer] Add parallel output for shardformer models(bloom, falcon) (#5702)

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    * add parallel cross entropy output for falcon model & fix some typos in bloom.py

    * fix module name error, self.model -> self.transformers in bloom, falcon model

    * Fix the overflow bug of distributed cross entropy loss function when training with fp16

    * add dtype to parallel cross entropy loss function

    * fix dtype related typos and prettify the loss.py

    * fix grad dtype and update dtype mismatch error

    * fix typo bugs

commit 8633c15da9b82c675c59ad292e7f0d77f092653c
Merge: d8b1ea4a 9d83c6d7
Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Date:   Mon May 20 15:50:53 2024 +0000

    [sync] Sync feature/colossal-infer with main

commit d8b1ea4ac90317ad6126acbd854e66583a8f9c8f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon May 20 22:50:04 2024 +0800

    [doc] Update Inference Readme (#5736)

    * [doc] update inference readme

    * add contents

    * trivial

commit bdf9a001d61cfad4bb68752c4a808295165307a0
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon May 20 22:49:18 2024 +0800

    [Fix/Inference] Add unsupported auto-policy error message (#5730)

    * [fix] auto policy error message

    * trivial

commit 283c407a19002118bda7edd1b8a3acf099843205
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Sun May 19 15:08:42 2024 +0800

    [Inference] Fix Inference Generation Config and Sampling (#5710)

    * refactor and add

    * config default values

    * fix gen config passing

    * fix rpc generation config

commit 9d83c6d715e8cdb802f82335e651923baab5cfc6
Author: flybird11111 <1829166702@qq.com>
Date:   Fri May 17 18:18:59 2024 +0800

    [lazy] fix lazy cls init (#5720)

    * fix

    * fix

    * fix

    * fix

    * fix

    * remove kernel install

    * rebase

    revert

    fix

    * fix

    * fix

commit 8bcfe360fdae7ccec7051aaced48497519afc2f2
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri May 17 11:28:53 2024 +0800

    [example] Update Inference Example (#5725)

    * [example] update inference example

commit a8d459f99a1d415fc843327e4dafce19ecee1f3e
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Thu May 16 10:49:03 2024 +0800

    [Inference] Delete duplicated package (#5723)

commit f47f2fbb2467df15548d2c663b119f4ae0103890
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed May 15 15:47:31 2024 +0800

    [Inference] Fix API server, test and example (#5712)

    * fix api server

    * fix generation config

    * fix api server

    * fix comments

    * fix infer hanging bug

    * resolve comments, change backend to free port

commit 74c47921facd26dbd93172bf887abcad4eab2d5c
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Tue May 14 20:17:43 2024 +0800

    [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)

    * Fix Llama3 Load error
    * Omit Checkpoint IO Temporarily

commit 5bbab1533ae7672ab37e91b7bc9e584b3a4e9cc1
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 14 16:08:51 2024 +0800

    [ci] Fix example tests (#5714)

    * [fix] revise timeout value on example CI

    * trivial

commit 121d7ad629c746e52a96ec53d6e26c0194016a03
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue May 14 14:35:33 2024 +0800

    [Inference] Delete duplicated copy_vector (#5716)

commit 7806842f2dbb4b6d6e74014efc7db5be8ccf0bbd
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Tue May 14 12:46:54 2024 +0800

    add paged-attentionv2: support seq length split across thread block (#5707)

commit 18d67d0e8e79c22bded0745c7d3daf8ca40d445c
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Tue May 14 10:00:55 2024 +0800

    [Feat]Inference RPC Server Support (#5705)

    * rpc support source
    * kv cache logical/physical disaggregation
    * sampler refactor
    * colossalai launch built in
    * Unit test
    * Rpyc support

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit de4bf3dedf2c7cb7ba6c3044745bab3c3ef6352d
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Sat May 11 15:13:25 2024 +0800

    [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)

    * Adapt repetition_penalty and no_repeat_ngram_size

    * fix no_repeat_ngram_size_logit_process

    * remove batch_updated

    * fix annotation

    * modified codes based on the review feedback.

    * rm get_batch_token_ids

commit 50104ab340e6c7067fbaaf9b47c608eb828aa95b
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri May 10 18:39:54 2024 +0800

    [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)

    * add convert_fp8 op for fp8 test in the future

    * rerun ci

commit bfad39357b0fe31ecf6f7639e2c4056165078a3f
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Thu May 9 18:03:24 2024 +0800

    [Inference/Feat] Add quant kvcache interface (#5700)

    * add quant kvcache interface

    * delete unused output

    * complete args comments

commit 492520dbdb962d207ac40d216e0414807f73eb19
Merge: d4829220 5d9a4948
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu May 9 17:19:45 2024 +0800

    Merge pull request #5588 from hpcaitech/feat/online-serving

    [Feature]Online Serving

commit 5d9a49483d98ccd4bebebbfd039162caceefe6bd
Author: CjhHa1 <cjh18671720497@outlook.com>
Date:   Thu May 9 05:44:05 2024 +0000

    [Inference] Add example test_ci script

commit bc9063adf1598c3be32fc2d12577d76b9daa79bf
Author: CjhHa1 <cjh18671720497@outlook.com>
Date:   Wed May 8 10:36:42 2024 +0000

    resolve rebase conflicts on Branch feat/online-serving

commit 61a1b2e798edcbf91ac35966a4047407ad6aa62d
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed May 8 15:14:06 2024 +0800

    [Inference] Fix bugs and docs for feat/online-server (#5598)

    * fix test bugs

    * add do sample test

    * del useless lines

    * fix comments

    * fix tests

    * delete version tag

    * delete version tag

    * add

    * del test server

    * fix test

    * fix

    * Revert "add"

    This reverts commit b9305fb02440d5cd566d32b508bee9f9c13dda15.

commit 7bbb28e48bdb5849d9dfb118d7bf2959d79bbe02
Author: CjhHa1 <cjh18671720497@outlook.com>
Date:   Thu Apr 11 10:12:31 2024 +0800

    [Inference] resolve rebase conflicts

    fix

commit c06403286567f62cb0a6dfc5e075cf60e291cea9
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Sun Apr 7 14:45:43 2024 +0800

    [Online Server] Chat Api for streaming and not streaming response (#5470)

    * fix bugs

    * fix bugs

    * fix api server

    * fix api server

    * add chat api and test

    * del request.n

commit de378cd2abd77b464786dc5f8298c9edbf023fbc
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Mar 18 17:06:05 2024 +0800

    [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)

    * finish online test and add examples

    * fix test_contionus_batching

    * fix some bugs

    * fix bash

    * fix

    * fix inference

    * finish revision

    * fix typos

    * revision

commit 69cd7e069d5705c7e431b301ac14924711c74e41
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Mar 1 14:47:36 2024 +0800

    [Inference] ADD  async and sync Api server using FastAPI (#5396)

    * add api server

    * fix

    * add

    * add completion service and fix bug

    * add generation config

    * revise shardformer

    * fix bugs

    * add docstrings and fix some bugs

    * fix bugs and add choices for prompt template

commit d482922035ff7b6fe7ced8e6c4028faa2d68197f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed May 8 19:59:10 2024 +0800

    [Inference] Support the logic related to ignoring EOS token (#5693)

    * Adapt temperature processing logic

    * add ValueError for top_p and top_k

    * add GQA Test

    * fix except_msg

    * support ignore EOS token

    * change variable's name

    * fix annotation

commit 9c2fe7935ff5aaec4f174cfba6f324df623c7447
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed May 8 17:58:29 2024 +0800

    [Inference]Adapt temperature processing logic (#5689)

    * Adapt temperature processing logic

    * add ValueError for top_p and top_k

    * add GQA Test

    * fix except_msg

commit 12e7c28d5e8f219480d1dbc682fd225dc76fcc2b
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 8 15:48:47 2024 +0800

    [hotfix] fix OpenMOE example import path (#5697)

commit 55cc7f3df7c600deae2f344ee162abae5a5c63e1
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed May 8 11:30:15 2024 +0800

    [Fix] Fix Inference Example, Tests, and Requirements (#5688)

    * clean requirements

    * modify example inference struct

    * add test ci scripts

    * mark test_infer as submodule

    * rm deprecated cls & deps

    * import of HAS_FLASH_ATTN

    * prune inference tests to be run

    * prune triton kernel tests

    * increment pytest timeout mins

    * revert import path in openmoe

commit f9afe0addd89303de4819debd93efe97d5618238
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue May 7 23:13:14 2024 +0800

    [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)

    - Fix key value number assignment in KVCacheManager, as well as method of accessing

commit 1ace1065e6bff175a0af88cae86d272acef29c9f
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon May 6 15:35:13 2024 +0800

    [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)

commit db7b3051f4379862f88790bf1653ddb6443c002e
Merge: 725fbd2e 8754abae
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon May 6 14:43:38 2024 +0800

    [Sync] Update from main to feature/colossal-infer (Merge pull request #5685)

    [Sync] Update from main to feature/colossal-infer

    - Merge pull request #5685 from yuanheng-zhao/inference/merge/main

commit 725fbd2ed067f9c58ac04670377d3e6f2a96fe00
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Mon May 6 10:55:34 2024 +0800

    [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)

commit 8754abae24dbcc492d2992d1091428592b615285
Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Date:   Sun May 5 16:28:56 2024 +0000

    [Fix] Fix & Update Inference Tests (compatibility w/ main)

commit 56ed09aba5e017fc0c211dac70215c2f83815919
Merge: 537a3cbc d3f34ee8
Author: Yuanheng Zhao <jonathan.zhaoyh@gmail.com>
Date:   Sun May 5 05:14:00 2024 +0000

    [sync] resolve conflicts of merging main

commit 537a3cbc4df445786c8ecf2af0a2998e2fd881b6
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri May 3 17:20:45 2024 +0800

    [kernel] Support New KCache Layout - Triton Kernel (#5677)

    * kvmemcpy triton for new kcache layout

    * revise tests for new kcache layout

    * naive triton flash decoding - new kcache layout

    * rotary triton kernel - new kcache layout

    * remove redundancy - triton decoding

    * remove redundancy - triton kvcache copy

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 9df016fc4520a5a5c95a11ed04a8ac62bde039c4
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Apr 30 19:38:00 2024 +0800

    [Inference] Fix quant bits order (#5681)

commit f79963199cd30c5e917d430aedd79113d06d608c
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Apr 30 19:35:05 2024 +0800

    [inference]Add alibi to flash attn function (#5678)

    * add alibi to flash attn function

    * rm redundant modifications

commit ef8e4ffe310bfe21f83feb965d962d816d75bc88
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Apr 30 18:33:53 2024 +0800

    [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)

commit 5cd75ce4c7edc95bacd8ec5fc04b8add339e8331
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Tue Apr 30 15:52:23 2024 +0800

    [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)

    * refactor kvcache manager and rotary_embedding and kvcache_memcpy operator

    * refactor decode_kv_cache_memcpy

    * enable alibi in pagedattention

commit 5f00002e43bd738a99fea250306e54c8c908f05a
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Apr 30 15:47:07 2024 +0800

    [Inference] Adapt Baichuan2-13B TP (#5659)

    * adapt to baichuan2 13B

    * add baichuan2 13B TP

    * update baichuan tp logic

    * rm unused code

    * Fix TP logic

    * fix alibi slopes tp logic

    * rm nn.Module

    * Polished the code.

    * change BAICHUAN_MODEL_NAME_OR_PATH

    * Modified the logic for loading Baichuan weights.

    * fix typos

commit 808ee6e4addccb51990398434547fa5df3c255b0
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Apr 30 11:26:36 2024 +0800

    [Inference/Feat] Feat quant kvcache step2 (#5674)

commit 8ccb6714e79137c8e6e50d9a585eadbf70ae6fc0
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri Apr 26 19:40:37 2024 +0800

    [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)

commit 5be590b99eb6c58c3aa809d453680139fdd2b9f7
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri Apr 26 17:51:49 2024 +0800

    [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)

    * add context attn triton kernel - new kcache layout

    * add benchmark triton

    * tiny revise

    * trivial - code style, comment

commit 3c91e3f1763d2a30a85187a3a606dbe4d1b9454d
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Apr 25 23:11:30 2024 +0800

    [Inference]Adapt to baichuan2 13B (#5614)

    * adapt to baichuan2 13B

    * adapt to baichuan2 13B

    * change BAICHUAN_MODEL_NAME_OR_PATH

    * fix test_decoding_attn.py

    * Modifications based on review comments.

    * change BAICHUAN_MODEL_NAME_OR_PATH

    * mv attn mask processes to test flash decoding

    * mv get_alibi_slopes baichuan modeling

    * fix bugs in test_baichuan.py

commit f342a9387168cedc2e5cc33155939c6d0c4e99a0
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Thu Apr 25 22:04:59 2024 +0800

    [Fix] Remove obsolete files - inference (#5650)

commit a8fd3b034235e1fa987a1ae85a9a2b465ee6128f
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Thu Apr 25 14:24:02 2024 +0800

    [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)

    * optimize flashdecodingattention: refactor code with different key cache layout(from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
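
The key cache layout change described in the commit above can be illustrated with a minimal pure-Python sketch of the index mapping. Only the two shapes come from the commit message; the helper names (`old_offset`, `new_offset`, `relayout`) are hypothetical, and treating `x` as a small factor that divides `head_size` (for example a vector width) is an assumption, not something the commit states.

```python
def old_offset(block, head, tok, d, num_kv_heads, block_size, head_size):
    """Flat offset in the old layout [num_blocks, num_kv_heads, block_size, head_size]."""
    return ((block * num_kv_heads + head) * block_size + tok) * head_size + d


def new_offset(block, head, tok, d, num_kv_heads, block_size, head_size, x):
    """Flat offset in the new layout [num_blocks, num_kv_heads, head_size/x, block_size, x]."""
    assert head_size % x == 0, "assumption: x divides head_size"
    d_outer, d_inner = divmod(d, x)  # split the head dimension into chunks of x
    chunks = head_size // x
    return (((block * num_kv_heads + head) * chunks + d_outer) * block_size + tok) * x + d_inner


def relayout(flat_old, num_blocks, num_kv_heads, block_size, head_size, x):
    """Copy a flat key cache from the old layout into the new one (illustrative only)."""
    flat_new = [None] * len(flat_old)
    for b in range(num_blocks):
        for h in range(num_kv_heads):
            for t in range(block_size):
                for d in range(head_size):
                    src = old_offset(b, h, t, d, num_kv_heads, block_size, head_size)
                    dst = new_offset(b, h, t, d, num_kv_heads, block_size, head_size, x)
                    flat_new[dst] = flat_old[src]
    return flat_new
```

In the new layout, the same `x`-element chunk of the head dimension for consecutive tokens in a block sits next to each other in memory, which is presumably what makes vectorized loads over tokens convenient; the actual kernels in the repository may organize this differently.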

commit 90cd5227a348dfe506e95b2e49f2a8dcd34fdbca
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Apr 24 14:51:36 2024 +0800

    [Fix/Inference]Fix vllm benchmark (#5630)

    * Fix bugs about OOM when running vllm-0.4.0

    * rm used params

    * change generation_config

    * change benchmark log file name

commit 279300dc5f34db219c90a297c0996d00221eae96
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Wed Apr 24 14:17:54 2024 +0800

    [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)

    * refactor compilation mechanism and unified multi hw

    * fix file path bug

    * add init.py to make pybind a module to avoid relative path error caused by softlink

    * delete duplicated macros

    * fix macros bug in gcc

commit 04863a9b144fc7dd46a57d2c7b0cf2f4b351ffb6
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 23 22:23:07 2024 +0800

    [example] Update Llama Inference example (#5629)

    * [example] add inference benchmark llama3

    * revise inference config - arg

    * remove unused args

    * add llama generation demo script

    * fix init rope in llama policy

    * add benchmark-llama3 - cleanup

commit 12f10d5b0b49a180bc162e166337942e0bbfb96b
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Apr 23 13:44:49 2024 +0800

    [Fix/Inference]Fix CUDA Rotary Embedding GQA (#5623)

    * fix rotary embedding GQA

    * change test_rotary_embdding_unpad.py KH

commit 5d4c1fe8f5f7019284f6cbc0ed29506748f63bf1
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 23 13:09:55 2024 +0800

    [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)

    * [fix] GQA calling of flash decoding triton

    * fix kv cache alloc shape

    * fix rotary triton - GQA

    * fix sequence max length assigning

    * Sequence max length logic

    * fix scheduling and spec-dec

    * skip without import error

    * fix pytest - skip without ImportError

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit ccf72797e3bfafcbfc42870ce24ee484858d4852
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Fri Apr 19 15:34:53 2024 +0800

    feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611)

commit e37ee2fb65fc77c275b816968d91776322fd7695
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Thu Apr 18 16:56:46 2024 +0800

    [Feat]Tensor Model Parallel Support For Inference (#5563)

    * tensor parallel support naive source

    * [fix]precision, model load and refactor the framework

    * add tp unit test

    * docstring

    * fix do_sample

commit be396ad6cc102fa610731291bf28e531a5641c7a
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Thu Apr 18 16:45:07 2024 +0800

    [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)

    * feat flash decoding for paged attention

    * refactor flashdecodingattention

    * [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

    ---------

    Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

commit 56b222eff8c996a4677a158d4b5d4834a1bc0cfc
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Apr 15 16:53:02 2024 +0800

    [inference/model]Adapted to the baichuan2-7B model (#5591)

    * Adapted to the baichuan2-7B model

    * modified according to the review comments.

    * Modified the method of obtaining random weights.

    * modified according to the review comments.

    * change mlp layer 'NOTE'

commit d4cb023b62ea8e092783be437cb16d74a1afc6a7
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Apr 15 10:57:51 2024 +0800

    [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)

    * delete duplicated code and refactor vec_copy utils and reduce utils

    * delete unused header file

commit a21912339a2c41627b43fd00e6adba38308a2ea0
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Thu Apr 11 15:41:36 2024 +0800

    refactor csrc (#5582)

commit 25928d84961b60264a6dabbddeae32af04a43fa2
Merge: d56c9633 f8598e3e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Apr 10 18:39:27 2024 +0800

    [Inference/Spec-Dec] Merge pull request #5565 from hpcaitech/feat/speculative-decoding

    Add Speculative Decoding and GLIDE Spec-Dec

commit f8598e3ec56bbe6bc6dd9fd84a1e0543adbd3073
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Wed Apr 10 11:14:04 2024 +0800

    [Fix] Llama Modeling Control with Spec-Dec (#5580)

    - fix ref before asgmt
    - fall back to use triton kernels when using spec-dec

commit e60d430cf53c9009af4682908d01742147654429
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Sun Apr 7 14:53:30 2024 +0800

    [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)

    - resolve conflicts of rebasing feat/speculative-decoding

commit e1acb58423c53ece50b72db3bf9b91475d5d3d64
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Apr 3 18:06:23 2024 +0800

    [doc] Add inference/speculative-decoding README (#5552)

    * add README for spec-dec

    * update roadmap

commit d85d91435ae25d875bfeb012b1e66cbfce6f6525
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Apr 1 21:54:24 2024 +0800

    [Inference/SpecDec] Support GLIDE Drafter Model (#5455)

    * add glide-llama policy and modeling

    * update glide modeling, compatible with transformers 4.36.2

    * revise glide llama modeling/usage

    * fix issues of glimpsing large kv

    * revise the way re-loading params for glide drafter

    * fix drafter and engine tests

    * enable convert to glide strict=False

    * revise glide llama modeling

    * revise vicuna prompt template

    * revise drafter and tests

    * apply usage of glide model in engine

commit 912e24b2aaf4acda0e2b9a45a7d4327fbfc8bd39
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Mar 12 17:57:01 2024 +0800

    [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)

    * fix drafter pastkv and usage of batch bucket

commit a37f82629d7b9e3c3a0f430b8dd3ff6f38ddf1d4
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Mar 11 09:51:42 2024 +0800

    [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)

    * fix flash decoding mask during verification

    * add spec-dec

    * add test for spec-dec

    * revise drafter init

    * remove drafter sampling

    * retire past kv in drafter

    * (trivial) rename attrs

    * (trivial) rename arg

    * revise how we enable/disable spec-dec

commit 5a9b05f7b297bc9ce3479990aeee94891c7f5edf
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Feb 28 13:48:17 2024 +0800

    [Inference/SpecDec] Add Basic Drafter Model Container (#5405)

    * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

    fix dependency in pytest

    * add drafter model container (basic ver)

commit d63c469f45bc20115aaf5ba01e62dc67ab47953f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Feb 28 13:47:00 2024 +0800

    [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)

    * [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

    fix dependency in pytest

    * resolve conflicts for revising flash-attn

    * adapt kv cache copy kernel for spec-dec

    * fix seqlen-n kvcache copy kernel/tests

    * test kvcache copy - use torch.equal

    * add assertions

    * (trivial) comment out

commit d56c96334e8a0626696609c3803ba5c73798f073
Merge: 7ebdf48a 7ca1d1c5
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 9 10:09:34 2024 +0800

    Sync main to feature/colossal-infer

    [Sync] Merge feature/colossal-infer with main

commit 7ca1d1c5453de3e726bca6334c360045050f94c4
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Mon Apr 8 17:00:55 2024 +0800

    remove outdated triton test

commit d78817539ea03b7b4bc79e0ef50db33d3e347f24
Author: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Date:   Mon Apr 8 08:41:07 2024 +0000

    [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci

commit ce9401ad52b870012846abcde120f1e87d5da7fe
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Mon Apr 8 16:25:12 2024 +0800

    remove unused triton kernels

commit ed5ebd1735db4541709eebdd37839ad161f542e8
Merge: 7ebdf48a 641b1ee7
Author: Yuanheng <jonathan.zhaoyh@gmail.com>
Date:   Mon Apr 8 16:21:47 2024 +0800

    [Fix] resolve conflicts of merging main

commit 7ebdf48ac50ca7bab827ef611551c6c48113b684
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Apr 8 11:38:05 2024 +0800

    add cast and op_functor for cuda built-in types (#5546)

commit 4bb5d8923a6e85a0f89a483f15933698635a9f9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Apr 2 14:16:59 2024 +0800

    [Fix/Inference] Remove unused and non-functional functions (#5543)

    * [fix] remove unused func

    * rm non-functional partial

commit a2878e39f42f509f237f3d3fd0741f53e3feff0e
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Apr 1 15:34:25 2024 +0800

    [Inference] Add Reduce Utils (#5537)

    * add reduce utils

    * add using to delete namespace prefix

commit 04aca9e55bd91ea4dd8d1231aa66df7848b08f03
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Apr 1 13:47:14 2024 +0800

    [Inference/Kernel]Add get_cos_and_sin Kernel (#5528)

    * Add get_cos_and_sin kernel

    * fix code comments

    * fix code typos

    * merge common codes of get_cos_and_sin kernel.

    * Fixed a typo

    * Changed 'assert allclose' to 'assert equal'.

commit 934e31afb22d2a281464aebde074eb2f238fb812
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Mar 28 10:42:51 2024 +0800

    The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519)

commit e6496dd37144202c8602dfdd66bb83f297eb5805
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 26 16:37:14 2024 +0800

    [Inference] Optimize request handler of llama (#5512)

    * optimize request_handler

    * fix ways of writing

commit 6251d68dc9f92c333a8f07ddf94e80ff7462726e
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Mon Mar 25 15:24:17 2024 +0800

    [fix] PR #5354 (#5501)

    * [fix]

    * [fix]

    * Update config.py docstring

    * [fix] docstring align

    * [fix] docstring align

    * [fix] docstring align

commit 1d626233ce8dbf35405cb7d92a5638ee1d830e8f
Merge: 87079cff 68e9396b
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Mon Mar 25 14:55:59 2024 +0800

    Merge pull request #5434 from LRY89757/colossal-infer-cuda-graph

    [feat] cuda graph support and refactor non-functional api

commit 68e9396bc084f03fe9315e9fed93292c0efc7a48
Merge: ff4998c6 87079cff
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 25 14:48:28 2024 +0800

    [fix] merge conflicts

commit 87079cffe8e006d4949aa7ca7cb60e6b813ff701
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Mar 25 13:40:34 2024 +0800

    [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)

    * Support FP16/BF16 Flash Attention 2

    * fix bugs in test_kv_cache_memcpy.py

    * add context_kv_cache_memcpy_kernel.cu

    * rm typename MT

    * add tail process

    * add high_precision

    * add high_precision to config.py

    * rm unused code

    * change the comment for the high_precision parameter

    * update test_rotary_embdding_unpad.py

    * fix vector_copy_utils.h

    * add comment for self.high_precision when using float32

commit ff4998c6f39cbfd6d3d11f038c55cca3c9d3abd0
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 25 12:00:57 2024 +0800

    [fix] remove unused comment

commit 9fe61b44753083c89a50540daa1e9a3daedeb335
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 25 11:37:58 2024 +0800

    [fix]

commit 5b017d6324c9881e02a5440e0b1a3156612a8044
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 21 15:55:25 2024 +0800

    [fix]

commit 606603bb8805c39f6ee01029337ddc614c8d46ef
Merge: 4eafe0c8 7ff42cc0
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 21 14:25:22 2024 +0800

    Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into colossal-infer-cuda-graph

commit 4eafe0c8141c120229be3ddce9c5591c1535348a
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 21 11:28:42 2024 +0800

    [fix] unused option

commit 7ff42cc06d007ae78fe091da65cb89c4bb62bc38
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 19 18:36:40 2024 +0800

    add vec_type_trait implementation (#5473)

commit b96557b5e15dbb521bf0f77b6b1f24dcbd9464d6
Merge: b6e97858 48c4f29b
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 19 13:53:26 2024 +0800

    Merge pull request #5469 from Courtesy-Xs/add_vec_traits

    Refactor vector utils

commit aabc9fb6aada9e7feb2ff8cf1f34e6ac37ade2e7
Author: Runyu Lu <runyulu@umich.edu>
Date:   Tue Mar 19 13:24:25 2024 +0800

    [feat] add use_cuda_kernel option

commit 48c4f29b275e2d8105842913cd84f5d66c378b36
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Tue Mar 19 11:32:01 2024 +0800

    refactor vector utils

commit b6e97858856ee8637216c51f14ac544b1bc0f872
Merge: f366a5ea 5724b9e3
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri Mar 15 11:23:44 2024 +0800

    Merge pull request #5457 from Courtesy-Xs/ly_add_implementation_for_launch_config

    add implementation for GetGPULaunchConfig1D

commit 5724b9e31e13e07d8ade0444c3e2f3e6894d13b1
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 15 11:18:57 2024 +0800

    add some comments

commit 6e30248683c0e4ccc63d15f39f8149875cba1263
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 14 16:13:00 2024 +0800

    [fix] tmp for test

commit 388e0439301834a1ad0d11da26b23f4cdc6c82d7
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Thu Mar 14 11:13:40 2024 +0800

    add implementation for GetGPULaunchConfig1D

commit d02e257abd778812d64491dde893c0d691ed4328
Merge: ae24b4f0 f366a5ea
Author: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Date:   Thu Mar 14 10:37:05 2024 +0800

    Merge branch 'feature/colossal-infer' into colossal-infer-cuda-graph

commit ae24b4f025285949253a21c41bee4b80679a0bfe
Author: Runyu Lu <runyulu@umich.edu>
Date:   Thu Mar 14 10:35:08 2024 +0800

    diverse tests

commit 1821a6dab0ad6ad24ae25216e56268c4b0c0d365
Author: Runyu Lu <runyulu@umich.edu>
Date:   Wed Mar 13 17:28:32 2024 +0800

    [fix] pytest and fix dyn grid bug

commit f366a5ea1f2626a7870acaf8866f21d5fb49c388
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Mar 13 17:20:03 2024 +0800

    [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)

    * add rotary embedding kernel

    * add rotary_embedding_kernel

    * add fused rotary_emb and kvcache memcopy

    * add fused_rotary_emb_and_cache_kernel.cu

    * add fused_rotary_emb_and_memcopy

    * fix bugs in fused_rotary_emb_and_cache_kernel.cu

    * fix ci bugs

    * use vec memcopy and opt the global memory access

    * fix code style

    * fix test_rotary_embdding_unpad.py

    * codes revised based on the review comments

    * fix bugs about include path

    * rm inline

commit ed431de4e4f73584e6b9c11ab041ef54a8e83de6
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Wed Mar 13 16:00:55 2024 +0800

    fix rmsnorm template function invocation problem(template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454)

commit 6fd355a5a6bb46bfee41d2bc75578e8fba001144
Merge: b699f540 c1c45e9d
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Wed Mar 13 11:26:41 2024 +0800

    Merge pull request #5452 from Courtesy-Xs/fix_include_path

    fix include path

commit c1c45e9d8ecb6743e88e63dd151c617c0014e7c1
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Wed Mar 13 11:21:06 2024 +0800

    fix include path

commit b699f54007c52b2f4ec56326a495b06858cf8856
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Tue Mar 12 17:48:02 2024 +0800

    optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)

commit 368a2aa5433d127adaa3674c6d00bb9dc3e0729c
Merge: 21e1e364 095c070a
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Tue Mar 12 14:14:37 2024 +0800

    Merge pull request #5445 from Courtesy-Xs/refactor_infer_compilation

    Refactor colossal-infer code arch

commit 095c070a6eefe1a76fe3483b21986826114d6d17
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Mon Mar 11 17:06:57 2024 +0800

    refactor code

commit 21e1e3645c8f2e0d4e556f3e13d0d2aa5053911b
Merge: f7aecc0c 5eb5ff14
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Mon Mar 11 11:15:29 2024 +0800

    Merge pull request #5435 from Courtesy-Xs/add_gpu_launch_config

    Add query and other components

commit 633e95b301336c4c237537f584882b3d8e5f4145
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 11 10:56:51 2024 +0800

    [doc] add doc

commit 9dec66fad6c2f85166903aa80d0c077e37512fce
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 11 10:51:16 2024 +0800

    [fix] multi graphs capture error

commit b2c0d9ff2b4e4015660f2967837688cf7293b21e
Author: Runyu Lu <runyulu@umich.edu>
Date:   Mon Mar 11 10:49:31 2024 +0800

    [fix] multi graphs capture error

commit f7aecc0c6bac001d10c1dd00274e0152e4c86df6
Author: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Date:   Fri Mar 8 16:21:12 2024 +0800

    feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)

commit 5eb5ff1464311ac16c29307d03a3c076aced7e03
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 8 15:41:14 2024 +0800

    refactor code

commit 01d289d8e51384131d536b1c223c473aeea463e9
Merge: a46598ac 2b28b54a
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 8 15:04:55 2024 +0800

    Merge branch 'feature/colossal-infer' of https://github.com/hpcaitech/ColossalAI into add_gpu_launch_config

commit a46598ac5984c7dc5804d0cf8621698f1a6a8720
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Fri Mar 8 14:53:29 2024 +0800

    add reusable utils for cuda

commit 2b28b54ac6d19d33079d9117b9717fd2779f2b08
Merge: 593a72e4 95c21498
Author: 傅剑寒 <Xs1580802568@gmail.com>
Date:   Fri Mar 8 14:44:37 2024 +0800

    Merge pull request #5433 from Courtesy-Xs/add_silu_and_mul

    【Inference】Add silu_and_mul for infer

commit cefaeb5fdd551c8b95837a475cb810f4991cf674
Author: Runyu Lu <runyulu@umich.edu>
Date:   Fri Mar 8 14:19:35 2024 +0800

    [feat] cuda graph support and refactor non-functional api

commit 95c21498d4f6e640e218f4b00349020f4ae7c69a
Author: xs_courtesy <xs1580802568@gmail.com>
Date:   Thu Mar 7 16:57:49 2024 +0800

    add silu_and_mul for infer

commit 593a72e4d58b8c3feebde2d19c78d44f702f7b06
Merge: 0aa27f19 0310b76e
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Mon Mar 4 10:13:59 2024 +0800

    Merge pull request #5424 from FrankLeeeee/sync/main

    Sync/main

commit 0310b76e9d485703d5afc128b8d97d01b00f3317
Merge: 0aa27f19 4b8312c0
Author: FrankLeeeee <somerlee.9@gmail.com>
Date:   Mon Mar 4 10:09:36 2024 +0800

    Merge branch 'main' into sync/main

commit 0aa27f196109bfb4ce6171d7ce921052b9eee969
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 28 16:46:03 2024 +0800

    [Inference]Move benchmark-related code to the example directory. (#5408)

    * move benchmark-related code to the example directory.

    * fix bugs in test_fused_rotary_embedding.py

commit 600881a8ea9b17c436ded922a9d4e3d5969acd87
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 28 14:36:50 2024 +0800

    [Inference]Add CUDA KVCache Kernel (#5406)

    * add cuda KVCache kernel

    * annotation benchmark_kvcache_copy

    * add use cuda

    * fix import path

    * move benchmark scripts to example/

    * rm benchmark codes in test_kv_cache_memcpy.py

    * rm redundancy codes

    * rm redundancy codes

    * pr was modified according to the review

commit 19061188c396d851ef17bc34b526e2f2b4fc1479
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Feb 26 16:17:47 2024 +0800

    [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)

    fix dependency in pytest

commit bc1da87366d81e144f1f133801d5f20520433c52
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Feb 23 10:51:35 2024 +0800

    [Fix/Inference] Fix format of input prompts and input model  in inference engine (#5395)

    * Fix bugs in inference_engine

    * fix bugs in engine.py

    * rm  CUDA_VISIBLE_DEVICES

    * add request_ids in generate

    * fix bug in engine.py

    * add logger.debug for BatchBucket

commit 2a718c8be89918ec70b88f1f059148a7294dbccb
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 21 13:23:57 2024 +0800

    Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)

    * opt_view_and_memcopy

    * fix bugs in ci

    * fix ci bugs

    * update benchmark scripts

    * fix ci bugs

commit 730103819dc0636c85af1af80cc17914dcf196c1
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Feb 21 11:31:48 2024 +0800

    [Inference]Fused kv copy into rotary calculation (#5383)

    * revise rotary embedding

    * remove useless print

    * adapt

    * fix

    * add

    * fix

    * modeling

    * fix

    * fix

    * fix

    * fused kv copy

    * fused copy

    * colossalai/kernel/triton/no_pad_rotary_embedding.py

    * del padding llama

    * del

commit b21aac5baeddf7ea19615fae454e6f78f7469cd2
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Feb 19 17:18:20 2024 +0800

    [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)

    * add kvcache manager funcs for batching

    * add batch bucket for batching

    * revise RunningList struct in handler

    * add kvcache/batch funcs for compatibility

    * use new batching methods

    * fix indexing bugs

    * revise abort logic

    * use cpu seq lengths/block tables

    * rm unused attr in Sequence

    * fix type conversion/default arg

    * add and revise pytests

    * revise pytests, rm unused tests

    * rm unused statements

    * fix pop finished indexing issue

    * fix: use index in batch when retrieving inputs/update seqs

    * use dict instead of odict in batch struct

    * arg type hinting

    * fix make compress

    * refine comments

    * fix: pop_n_seqs to pop the first n seqs

    * add check in request handler

    * remove redundant conversion

    * fix test for request handler

    * fix pop method in batch bucket

    * fix prefill adding

commit 8c69debdc7128e1b8839f12aa3f19ad327569017
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Feb 8 15:27:26 2024 +0800

    [Inference]Support vllm testing in benchmark scripts (#5379)

    * add vllm benchmark scripts

    * fix code style

    * update run_benchmark.sh

    * fix code style

commit 9afa52061f89dde87a73e36f740f62781d658a01
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Thu Feb 8 14:04:14 2024 +0800

    [inference] refactored config (#5376)

commit 1f8c7e70469191610d9536029f624b4f30db8caf
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Feb 7 17:55:48 2024 +0800

    [Inference] User Experience: update the logic of default tokenizer and generation config.  (#5337)

    * add

    * fix

    * fix

    * pause

    * fix

    * fix pytest

    * align

    * fix

    * license

    * fix

    * fix

    * fix readme

    * fix some bugs

    * remove tokenizer config

commit 6fb4bcbb2420b9f977ab74de60c6d311b6c9ed9a
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Feb 7 17:15:42 2024 +0800

    [Inference/opt] Fused KVCache Memcopy (#5374)

    * fused kv memcopy

    * add TODO in test_kvcache_copy.py

commit 58740b5f6872bc5a26dbf7c3112b86a1b66c083a
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Wed Feb 7 17:11:43 2024 +0800

    [inference] added inference template (#5375)

commit 8106ede07fae7e239203feb815162efdf46975ec
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Wed Feb 7 14:27:04 2024 +0800

    Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)

    This reverts commit 9f4ab2eb924b938348df2c713bb4580972f18eb1.

commit 9f4ab2eb924b938348df2c713bb4580972f18eb1
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Feb 7 11:36:04 2024 +0800

    [Inference] Adapt to Fused rotary (#5348)

    * revise rotary embedding

    * remove useless print

    * adapt

    * fix

    * add

    * fix

    * modeling

    * fix

    * fix

    * fix

commit 35382a7fbf96c731ba1ed76cf5529ea3220a5b66
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Feb 6 19:38:25 2024 +0800

    [Inference]Fused the gate and up proj in mlp,and optimized the autograd process. (#5365)

    * fused the gate and up proj in mlp

    * fix code styles

    * opt auto_grad

    * rollback test_inference_engine.py

    * modifications based on the review feedback.

    * fix bugs in flash attn

    * Change reshape to view

    * fix test_rmsnorm_triton.py

commit 1dedb57747270f32be5d0e67abc1ad2fff658f8f
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Feb 6 17:27:45 2024 +0800

    [Fix/Infer] Remove unused deps and revise requirements (#5341)

    * remove flash-attn dep

    * rm padding llama

    * revise infer requirements

    * move requirements out of module

commit 631862f3390f874db118a25c0137f86630e9b167
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Feb 2 15:38:21 2024 +0800

    [Inference]Optimize generation process of inference engine (#5356)

    * opt inference engine

    * fix run_benchmark.sh

    * fix generate in engine.py

    * rollback test_inference_engine.py

commit 21ad4a27f91659220bec6c4d4f2d0f62f7093a45
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Feb 2 15:06:01 2024 +0800

    [Inference/opt]Optimize the mid tensor of RMS Norm (#5350)

    * opt rms_norm

    * fix bugs in rms_layernorm

commit 027aa1043f1c7b3668d5ca9b91d35c846736e9c4
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Fri Feb 2 14:31:10 2024 +0800

    [doc] updated inference readme (#5343)

commit e76acbb076582e0aade1ee8a5fa7696d95c1bef5
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Fri Feb 2 13:51:22 2024 +0800

    [inference] moved ops tests to test_infer (#5354)

commit db1a763307a54ca262751ebebd5f1c503d9bca74
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Fri Feb 2 11:44:15 2024 +0800

    [inference] removed redundancy init_batch (#5353)

commit 249644c23b0402ccf9d0908f13ed15b41b95145f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Feb 1 15:49:39 2024 +0800

    [Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)

    * add fused qkv

    * replace attn and mlp by shardformer

    * fix bugs in mlp

    * add docstrings

    * fix test_inference_engine.py

    * add optimize unbind

    * add fused_addmm

    * rm squeeze(1)

    * refactor codes

    * fix ci bugs

    * rename ShardFormerLlamaMLP and ShardFormerLlamaAttention

    * Removed the dependency on LlamaFlashAttention2

    * rollback test_inference_engine.py

commit f8e456d20295af52665ca06a21f9fd8b468204d7
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Thu Feb 1 15:31:01 2024 +0800

    [inference] simplified config verification (#5346)

    * [inference] simplified config verification

    * polish

    * polish

commit df0aa49585d2dd19d7397dfbd3b5f136abac609b
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Jan 31 16:31:29 2024 +0800

    [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)

    * revise rotary embedding

    * remove useless print

    * adapt

commit 1336838a9149fb210a956b0ad338197c4ae77821
Merge: 5f98a9d6 c5655199
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Wed Jan 31 16:29:26 2024 +0800

    Merge pull request #5339 from FrankLeeeee/sync/merge-main

    Sync/merge main

commit c56551991379a457fc34df699710ab94132779fc
Merge: 5f98a9d6 71321a07
Author: FrankLeeeee <somerlee.9@gmail.com>
Date:   Wed Jan 31 10:41:47 2024 +0800

    merge commit

commit 5f98a9d68a0a35031e1c740c19e33b32f4fa8d9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Jan 30 16:06:09 2024 +0800

    [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)

    * revise shape of kvcache (context attn kernel)

    * revise shape of kvcache (flash decoding kernel)

    * revise shape of kvcache (kvcache copy) and attn func

    * init of kvcache in kvcache manager

    * revise llama modeling

    * revise block size retrieval

    * use torch for rms_norm benchmarking

    * revise block size retrieval

commit e8f0642f2841f6aeb6ed0e6695ff9d9ef14f198b
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 30 10:31:46 2024 +0800

    [Inference]Add Nopadding Llama Modeling (#5327)

    * add nopadding llama modeling

    * add nopadding_llama.py

    * rm unused codes

    * fix bugs in test_xine_copy.py

    * fix code style

commit c7c104cb7ccc353faa10667853ed210e042f1be8
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 29 16:21:06 2024 +0800

    [DOC] Update inference readme  (#5280)

    * add readme

    * add readme

    * 1

    * update engine

    * finish readme

    * add readme

commit 1f8a75d470d548bfd4db877e73102b8fad5cdfa9
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 29 10:22:33 2024 +0800

    [Inference] Update rms norm kernel, benchmark with vLLM (#5315)

    * add

    * xi

    * del

    * del

    * fix

commit 7ddd8b37f0f1160e28a2919a2e37f8e8ad199773
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Jan 26 15:02:12 2024 +0800

    fix (#5311)

commit 4f28cb43c0c2afbc970b9f0f300e7aa28e39bd2e
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Fri Jan 26 14:00:10 2024 +0800

    [inference]Optimize the usage of the mid tensors space in flash attn (#5304)

    * opt flash attn

    * opt tmp tensor

    * fix benchmark_llama

    * fix code style

    * fix None logic for output tensor

    * fix adapted to get_xine_cache

    * add comment

    * fix ci bugs

    * fix some codes

    * rm duplicated codes

    * rm duplicated codes

    * fix code style

    * add _get_dtype in config.py

commit af8359c430ce3fabb22748870b67b0c6c33f610c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Thu Jan 25 10:23:12 2024 +0800

    [hotfix] fix boundary check in batch (#5306)

commit c647e00e3c092d3d6219f7686f260f2932a0c27d
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Wed Jan 24 16:20:42 2024 +0800

    [Inference]Add fused rotary kernel and get cos cache kernel (#5302)

    * add fused rotary and get cos cache func

    * staged

    * fix bugs

    * fix bugs

commit 3da9993b0d03923755c1fcd6279cc4c7b8d00d1e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Jan 23 17:16:02 2024 +0800

    [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)

    * fix decoding kernel pytest

    * revise and add triton context attn benchmark

commit 8e606ecc7e89ffed80537e89a27bb1eb6759f4bc
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Tue Jan 23 12:11:53 2024 +0800

    [Inference] Benchmarking rotary embedding and add a fetch function (#5277)

    * fix bugs and add a cos/sin cache fetch func

    * add docstring

    * fix bug

    * fix

commit b7853196a0a46558d7c0cac7deac9a36c7a5ba38
Merge: bfff9254 cea9c86e
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 22 17:07:14 2024 +0800

    Merge pull request #5297 from yuehuayingxueluo/fix_rotary_embedding

    [Inference/fix]Add utils.py for Rotary Embedding

commit cea9c86e453e36b4848064312c9a4f0d2de6ea98
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 22 16:06:27 2024 +0800

    add utils.py

commit bfff9254ac8ca866673746ec47cfd2f87aab2b66
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 22 10:55:34 2024 +0800

    [inference] Adapted to Rotary Embedding and RMS Norm (#5283)

    * adapted to rotary_embedding

    * adapted to nopad rms norm

    * fix bugs in benchmark

    * fix flash_decoding.py

commit 6e487e7d3cf5295ca908fa69c8e03af8980391bf
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Fri Jan 19 15:47:16 2024 +0800

    [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)

    * prevent re-creating intermediate tensors

    * add singleton class holding intermediate values

    * fix triton kernel api

    * add benchmark in pytest

    * fix kernel api and add benchmark

    * revise flash decoding triton kernel in/out shapes

    * fix calling of triton kernel in modeling

    * fix pytest: extract to util functions

commit 9e2342bde2c0ffe1a8cdd2fe8917254ef0a06e7f
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu Jan 18 16:31:14 2024 +0800

    [Hotfix] Fix bugs in testing continuous batching (#5270)

    * fix bug

    * fix bugs

    * fix bugs

    * fix bugs and add padding

    * add funcs and fix bugs

    * fix typos

    * fix bugs

    * add func

commit 5ae9099f9203a4f8350f383b838e8f2ad15d6fdd
Author: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Date:   Thu Jan 18 10:21:03 2024 +0800

    [kernel] Add RMSLayerNorm triton kernel (#5262)

    * add layerrmsnorm triton kernel

    * add layerrmsnorm kernel

    * modify the atol and rtol in test file

    * Remove the logic of mean computations, and update the names of the kernel functions and files

    * add benchmark of rms norm

commit 86b63f720cf60deefe40874517b3d8e1dccb7af3
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Jan 17 16:03:10 2024 +0800

    [Inference]Adapted to the triton attn kernels (#5264)

    * adapted to the triton attn kernels

    * fix pad input

    * adapted to copy_kv_to_blocked_cache

    * fix ci test

    * update kv memcpy

    * remove print

commit 0f2b46a41c2c308cc6fbeaf0e86d0e0b93435b77
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Jan 16 14:41:02 2024 +0800

    [kernel] Revise KVCache copy triton kernel API (#5273)

    * [kernel/fix] revise kvcache copy kernel api

    * fix benchmark

commit d8db500efc0e67dea995c2124d20aadd07afb6f0
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 15 17:50:46 2024 +0800

    [Inference] Fix request handler and add recycle logic (#5260)

    * fix request handler

    * fix comment

commit c597678da475abd4ecc075c0b80996989f1bcdc0
Author: Frank Lee <somerlee.9@gmail.com>
Date:   Mon Jan 15 17:37:41 2024 +0800

    [doc] updated inference readme (#5269)

commit fa85e02b3b1b316009c4557482f998b903730ec3
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Jan 15 17:37:20 2024 +0800

    [kernel] Add KV cache copy kernel during decoding  (#5261)

    * add kv copy triton kernel during decoding stage

    * add pytest and fix kernel

    * fix test utilities

    * revise kernel config

    * add benchmark for kvcache copy

commit 1ded7e81ef08d574798dd98d1f4d33da07b7f4c9
Author: FrankLeeeee <somerlee.9@gmail.com>
Date:   Thu Jan 11 13:50:45 2024 +0000

    [git] fixed rebased files

commit 1513f20f4d80f782fab381996368ff2c2f3c95c3
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Thu Jan 11 18:06:39 2024 +0800

    [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)

    * add flash decoding unpad triton kernel

    * rename flash decoding kernel

    * add kernel testing (draft)

    * revise pytest

    * support kv group (GQA)

    * (trivial) fix api and pytest

    * (trivial) func renaming

    * (trivial) func/file renaming

    * refactor pytest for attention

    * (trivial) format and consistent vars of context/decode attn

    * (trivial) remove test redundancy

commit fded91d049997ed87dee965fc42c35a239e3ec03
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu Jan 11 16:24:54 2024 +0800

    [Inference] Kernel: no pad rotary embedding (#5252)

    * fix bugs

    * comment

    * use more accurate atol

    * fix

commit d40eb26029e8c61fc2b8ef3a1b8126a229e48047
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Jan 10 10:38:53 2024 +0800

    fix bugs in request_handler.py and engine.py

commit 10e3c9f923caf4fb68ab61e96c244bd5cca9b9da
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 15:53:04 2024 +0800

    rm torch.cuda.synchronize

commit fab294c7f4a5db0a4e19109ac5656492ff3ca08b
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 15:18:28 2024 +0800

    fix CI bugs

commit 2a73e828eba565017d19eaf70a304e1b1eddba1f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 14:29:45 2024 +0800

    fix bugs related to processing padding mask

commit e545a871b8a89093f5d01e3fea1fe873ef52d51a
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Jan 8 15:56:00 2024 +0800

    [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)

    * fix accuracy

    * alignment in attention

    * fix attention

    * fix

    * fix bugs

    * fix bugs

    * fix bugs

commit fa4fbdbffb6996e8aa1f65bddce5844f2bbbfdf1
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 9 13:52:53 2024 +0800

    adapted to pad_context_forward

commit 47e53eaa1ca08fd55b657b53b75d13cc72f9cd05
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Jan 8 12:35:06 2024 +0800

    fix bugs in attention.py and request_handler.py

commit bfd9b1b494b4414835b22cbba52005921127e4f6
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Thu Jan 4 16:39:00 2024 +0800

    [Inference] Pytorch Attention func, pad&nopad input support (#5219)

    * add attn

    * add attention test

    * fix attn forward

    * fix decoding

commit 3ad1f3b78b830c90079ed9f1e0b5cd26601194fa
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Jan 4 16:48:53 2024 +0800

    fix beam_width

commit b2eb9cd18665317ec7900364ef21a38c3edb9e3f
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Jan 4 15:09:06 2024 +0800

    Fixed a typo

commit bbfebfb9fc5250c1e4d3a6f008af652f7a0a9ca0
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Jan 4 15:03:18 2024 +0800

    fix bugs in sampler

commit 02c1bf8b2abef137a653b86b733d66b6dfbcc022
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Wed Jan 3 18:50:26 2024 +0800

    add context_attention_unpadded

commit 07b5283b6a3899ebe84cbe8c7902d142ffbc4b9c
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Wed Jan 3 14:41:35 2024 +0800

    [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)

    * add context attn unpadded triton kernel

    * test compatibility

    * kv cache copy (testing)

    * fix k/v cache copy

    * fix kv cache copy and test

    * fix boundary of block ptrs

    * add support for GQA/MQA and testing

    * fix import statement

    ---------

    Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>

commit 4df8876fcad799ace567b2458df5feb3109ee917
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 2 18:34:19 2024 +0800

    Fixed a writing error

commit 9489dc64d8e01b04c9033c3dcaee83e25afebe42
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 2 18:30:11 2024 +0800

    precision alignment

commit 62968588d195126adc9b1bdb3adc02f199303ddf
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Jan 2 13:02:20 2024 +0800

    fix bugs in request_handler

commit 62fd08ee4425e031f8f1c43b25bf1ba5e7e33e8d
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Tue Dec 26 21:34:27 2023 +0800

    Fixed a bug in the inference frame

commit 86853a37d5243b40d4b229d163494624b8027cd0
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Dec 25 14:07:43 2023 +0800

    Add padding llama model

commit 0e616462a7f9e8faaa33d1700a2020ceb03ccd34
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Mon Dec 25 12:15:15 2023 +0800

    [Inference] add logit processor and request handler (#5166)

    * add logit processor and request handler

    * add

    * add

    * add

    * fix

    * add search tokens and update func

    * finish request handler

    * add running list test

    * fix test

    * fix some bug

    * add

    * add

    * fix bugs

    * fix some bugs

    * fix bug

    * fix

    * fix

    * add copy fun

    * del useless attn

    * fix request status

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 8daee26989adad5ae5b152b24d3344db727986fe
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Mon Dec 18 10:40:47 2023 +0800

    [Inference] Add the logic of the inference engine (#5173)

    * add infer_struct and infer_config

    * update codes

    * change InferConfig

    * Add hf_model_config to the engine

    * rm _get_hf_model_config

    * update codes

    * made adjustments according to the feedback from the reviewer.

    * update codes

    * add ci test for config and struct

    * Add the logic of the inference engine

    * update engine and test

    * Recover cache_manager.py

    * add logger

    * fix conflict

    * update codes

    * update codes

    * update model and tokenizer

    * fix add the logic about shardformer

    * change kvcache_manager docstring

    * add policy

    * fix ci bug in test_kvcache_manager.py

    * remove codes related to tokenizer and move model_policy

    * fix  code style

    * add ordered_set to requirements-infer.txt

    * Delete extra empty lines

    * add ordered_set to requirements-test.txt

commit 93aeacca342ab03732362dbb9096ab1265f4a8b3
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Tue Dec 12 17:22:41 2023 +0800

    [Inference]Update inference config and fix test (#5178)

    * unify the config setting

    * fix test

    * fix import

    * fix test

    * fix

    * fix

    * add logger

    * revise log info

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 3de2e622995321b042d4a8cffcd61686cda4a58e
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Mon Dec 11 10:56:18 2023 +0800

    [Inference] Add CacheBlock and KV-Cache Manager (#5156)

    * [Inference] Add KVCache Manager

    * function refactored

    * add test for KVCache Manager

    * add attr beam width

    * Revise alloc func in CacheManager

    * Fix docs and pytests

    * add tp slicing for head number

    * optimize shapes of tensors used as physical cache

    * Apply using InferenceConfig on KVCacheManager

    * rm duplicate config file

    * Optimize cache allocation: use contiguous cache

    * Fix config in pytest (and config)

commit fab9b931d9e24c6e8ada8025cf8cf12719c3d2af
Author: yuehuayingxueluo <867460659@qq.com>
Date:   Thu Dec 7 14:34:01 2023 +0800

    [Inference]Add BatchInferState, Sequence and InferConfig (#5149)

    * add infer_struct and infer_config

    * update codes

    * change InferConfig

    * Add hf_model_config to the engine

    * rm _get_hf_model_config

    * update codes

    * made adjustments according to the feedback from the reviewer.

    * update codes

    * add ci test for config and struct

commit 2bb92243d4151873d75a9d6d9c2275b390e1716a
Author: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Date:   Tue Dec 5 15:12:57 2023 +0800

    [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)

    * [inference/nfc] remove outdated inference tests

    * remove outdated kernel tests

    * remove deprecated triton kernels

    * remove imports from deprecated kernels

commit 56e75eeb063279fbc0fc84e25f267f1ca208e784
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Dec 1 17:31:31 2023 +0800

    [Inference] Add readme (roadmap) and fulfill request handler (#5147)

    * request handler

    * add readme

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>

commit 4cf4682e70f70dea8e0510705d3383de0bf1a4a8
Author: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Date:   Fri Dec 1 17:02:44 2023 +0800

    [Inference] First PR for rebuild colossal-infer (#5143)

    * add engine and scheduler

    * add dirs

    ---------

    Co-authored-by: CjhHa1 <cjh18671720497outlook.com>