Commit
* [Inference] First PR for rebuild colossal-infer (#5143)
* add engine and scheduler
* add dirs
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [Inference] Add readme (roadmap) and fulfill request handler (#5147)
* request handler
* add readme
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)
* [inference/nfc] remove outdated inference tests
* remove outdated kernel tests
* remove deprecated triton kernels
* remove imports from deprecated kernels
* [Inference]Add BatchInferState, Sequence and InferConfig (#5149)
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* [Inference] Add CacheBlock and KV-Cache Manager (#5156)
* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)
* [Inference]Update inference config and fix test (#5178)
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [Inference] Add the logic of the inference engine (#5173)
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related to tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
* [Inference] add logit processor and request handler (#5166)
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* Add padding llama model
* Fixed a bug in the inference frame
* fix bugs in request_handler
* precision alignment
* Fixed a writing error
* [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
* add context_attention_unpadded
* fix bugs in sampler
* Fixed a typo
* fix beam_width
* [Inference] Pytorch Attention func, pad&nopad input support (#5219)
* add attn
* add attention test
* fix attn forward
* fix decoding
* fix bugs in attention.py and request_handler.py
* adapted to pad_context_forward
* [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
* fix bugs related to processing padding mask
* fix CI bugs
* rm torch.cuda.synchronize
* fix bugs in request_handler.py and engine.py
* [Inference] Kernel: no pad rotary embedding (#5252)
* fix bugs
* comment
* use more accurate atol
* fix
* [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
* [git] fixed rebased files
* [kernel] Add KV cache copy kernel during decoding (#5261)
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
* [doc] updated inference readme (#5269)
* [Inference] Fix request handler and add recycle logic (#5260)
* fix request handler
* fix comment
* [kernel] Revise KVCache copy triton kernel API (#5273)
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
* [Inference]Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
* [kernel] Add RMSLayerNorm triton kernel (#5262)
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logics of mean computations, and update the name of the kernel functions and files
* add benchmark of rms norm
* [Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
* [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
* [inference] Adapted to Rotary Embedding and RMS Norm (#5283)
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
* add utils.py
* [Inference] Benchmarking rotary embedding and add a fetch function (#5277)
* fix bugs and add a cos/sin cache fetch func
* add docstring
* fix bug
* fix
* [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
* fix decoding kernel pytest
* revise and add triton context attn benchmark
* [Inference]Add fused rotary kernel and get cos cache kernel (#5302)
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
* [hotfix] fix boundary check in batch (#5306)
* [inference]Optimize the usage of the mid tensors space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
* fix (#5311)
* [Inference] Update rms norm kernel, benchmark with vLLM (#5315)
* add
* xi
* del
* del
* fix
* [DOC] Update inference readme (#5280)
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
* [Inference]Add Nopadding Llama Modeling (#5327)
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
* [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
* [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)
* revise rotary embedding
* remove useless print
* adapt
* [inference] simplified config verification (#5346)
* [inference] simplified config verification
* polish
* polish
* [Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
* [inference] removed redundancy init_batch (#5353)
* [inference] moved ops tests to test_infer (#5354)
* [doc] updated inference readme (#5343)
* [Inference/opt]Optimize the mid tensor of RMS Norm (#5350)
* opt rms_norm
* fix bugs in rms_layernorm
* [Inference]Optimize generation process of inference engine (#5356)
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback test_inference_engine.py
* [Fix/Infer] Remove unused deps and revise requirements (#5341)
* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module
* [Inference]Fused the gate and up proj in mlp, and optimized the autograd process. (#5365)
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
* [Inference] Adapt to Fused rotary (#5348)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)
  This reverts commit 9f4ab2e.
* [inference] added inference template (#5375)
* [Inference/opt] Fused KVCache Memcopy (#5374)
* fused kv memcopy
* add TODO in test_kvcache_copy.py
* [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337)
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
* [inference] refactored config (#5376)
* [Inference]Support vllm testing in benchmark scripts (#5379)
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
* [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
* [Inference]Fused kv copy into rotary calculation (#5383)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
* Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
* [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* [Inference]Add CUDA KVCache Kernel (#5406)
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
* [Inference]Move benchmark-related code to the example directory. (#5408)
* move benchmark-related code to the example directory.
* fix bugs in test_fused_rotary_embedding.py
* add silu_and_mul for infer
* [feat] cuda graph support and refactor non-functional api
* add reusable utils for cuda
* refactor code
* feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)
* [fix] multi graphs capture error
* [fix] multi graphs capture error
* [doc] add doc
* refactor code
* optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)
* fix include path
* fix rmsnorm template function invocation problem (template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454)
* [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the global memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
* [fix] pytest and fix dyn grid bug
* diverse tests
* add implementation for GetGPULaunchConfig1D
* [fix] tmp for test
* add some comments
* refactor vector utils
* [feat] add use_cuda_kernel option
* add vec_type_trait implementation (#5473)
* [fix] unused option
* [fix]
* [fix]
* [fix] remove unused comment
* [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
* [fix] PR #5354 (#5501)
* [fix]
* [fix]
* Update config.py docstring
* [fix] docstring align
* [fix] docstring align
* [fix] docstring align
* [Inference] Optimize request handler of llama (#5512)
* optimize request_handler
* fix ways of writing
* The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519)
* [Inference/Kernel]Add get_cos_and_sin Kernel (#5528)
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'asset allclose' to 'assert equal'.
* [Inference] Add Reduce Utils (#5537)
* add reduce utils
* add using to delete namespace prefix
* [Fix/Inference] Remove unused and non-functional functions (#5543)
* [fix] remove unused func
* rm non-functional partial
* add cast and op_functor for cuda built-in types (#5546)
* remove unused triton kernels
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* remove outdated triton test
* [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* resolve conflicts for revising flash-attn
* adapt kv cache copy kernel for spec-dec
* fix seqlen-n kvcache copy kernel/tests
* test kvcache copy - use torch.equal
* add assertions
* (trivial) comment out
* [Inference/SpecDec] Add Basic Drafter Model Container (#5405)
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* add drafter model container (basic ver)
* [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
* [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)
* fix drafter pastkv and usage of batch bucket
* [Inference/SpecDec] Support GLIDE Drafter Model (#5455)
* add glide-llama policy and modeling
* update glide modeling, compatible with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
* [doc] Add inference/speculative-decoding README (#5552)
* add README for spec-dec
* update roadmap
* [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)
  - resolve conflicts of rebasing feat/speculative-decoding
* [Fix] Llama Modeling Control with Spec-Dec (#5580)
  - fix ref before asgmt
  - fall back to use triton kernels when using spec-dec
* refactor csrc (#5582)
* [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)
* delete duplicated code and refactor vec_copy utils and reduce utils
* delete unused header file
* [inference/model]Adapted to the baichuan2-7B model (#5591)
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layer 'NOTE'
* [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feat]Tensor Model Parallel Support For Inference (#5563)
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
* feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611)
* [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Fix/Inference]Fix CUDA Rotary Embedding GQA (#5623)
* fix rotary embedding GQA
* change test_rotary_embdding_unpad.py KH
* [example] Update Llama Inference example (#5629)
* [example] add inference benchmark llama3
* revise inference config - arg
* remove unused args
* add llama generation demo script
* fix init rope in llama policy
* add benchmark-llama3 - cleanup
* [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)
* refactor compilation mechanism and unified multi hw
* fix file path bug
* add init.py to make pybind a module to avoid relative path error caused by softlink
* delete duplicated macros
* fix macros bug in gcc
* [Fix/Inference]Fix vllm benchmark (#5630)
* Fix bugs about OOM when running vllm-0.4.0
* rm used params
* change generation_config
* change benchmark log file name
* [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)
* optimize flashdecodingattention: refactor code with different key cache layout (from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Fix] Remove obsolete files - inference (#5650)
* [Inference]Adapt to baichuan2 13B (#5614)
* adapt to baichuan2 13B
* adapt to baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* Modifications based on review comments.
* change BAICHUAN_MODEL_NAME_OR_PATH
* mv attn mask processes to test flash decoding
* mv get_alibi_slopes baichuan modeling
* fix bugs in test_baichuan.py
* [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)
* add context attn triton kernel - new kcache layout
* add benchmark triton
* tiny revise
* trivial - code style, comment
* [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)
* [Inference/Feat] Feat quant kvcache step2 (#5674)
* [Inference] Adapt Baichuan2-13B TP (#5659)
* adapt to baichuan2 13B
* add baichuan2 13B TP
* update baichuan tp logic
* rm unused code
* Fix TP logic
* fix alibi slopes tp logic
* rm nn.Module
* Polished the code.
* change BAICHUAN_MODEL_NAME_OR_PATH
* Modified the logic for loading Baichuan weights.
* fix typos
* [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention
* [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)
* [inference]Add alibi to flash attn function (#5678)
* add alibi to flash attn function
* rm redundant modifications
* [Inference] Fix quant bits order (#5681)
* [kernel] Support New KCache Layout - Triton Kernel (#5677)
* kvmemcpy triton for new kcache layout
* revise tests for new kcache layout
* naive triton flash decoding - new kcache layout
* rotary triton kernel - new kcache layout
* remove redundancy - triton decoding
* remove redundancy - triton kvcache copy
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Fix] Fix & Update Inference Tests (compatibility w/ main)
* [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)
* [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)
* [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)
  - Fix key value number assignment in KVCacheManager, as well as method of accessing
* [Fix] Fix Inference Example, Tests, and Requirements (#5688)
* clean requirements
* modify example inference struct
* add test ci scripts
* mark test_infer as submodule
* rm deprecated cls & deps
* import of HAS_FLASH_ATTN
* prune inference tests to be run
* prune triton kernel tests
* increment pytest timeout mins
* revert import path in openmoe
* [hotfix] fix OpenMOE example import path (#5697)
* [Inference]Adapt temperature processing logic (#5689)
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* [Inference] Support the logic related to ignoring EOS token (#5693)
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* support ignore EOS token
* change variable's name
* fix annotation
* [Inference] ADD async and sync Api server using FastAPI (#5396)
* add api server
* fix
* add
* add completion service and fix bug
* add generation config
* revise shardformer
* fix bugs
* add docstrings and fix some bugs
* fix bugs and add choices for prompt template
* [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)
* finish online test and add examples
* fix test_contionus_batching
* fix some bugs
* fix bash
* fix
* fix inference
* finish revision
* fix typos
* revision
* [Online Server] Chat Api for streaming and not streaming response (#5470)
* fix bugs
* fix bugs
* fix api server
* fix api server
* add chat api and test
* del request.n
* [Inference] resolve rebase conflicts fix
* [Inference] Fix bugs and docs for feat/online-server (#5598)
* fix test bugs
* add do sample test
* del useless lines
* fix comments
* fix tests
* delete version tag
* delete version tag
* add
* del test server
* fix test
* fix
* Revert "add"
  This reverts commit b9305fb.
* resolve rebase conflicts on Branch feat/online-serving
* [Inference] Add example test_ci script
* [Inference/Feat] Add quant kvcache interface (#5700)
* add quant kvcache interface
* delete unused output
* complete args comments
* [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)
* add convert_fp8 op for fp8 test in the future
* rerun ci
* [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)
* Adapt repetition_penalty and no_repeat_ngram_size
* fix no_repeat_ngram_size_logit_process
* remove batch_updated
* fix annotation
* modified codes based on the review feedback.
* rm get_batch_token_ids
* [Feat]Inference RPC Server Support (#5705)
* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unitest
* Rpyc support
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add paged-attention v2: support seq length split across thread block (#5707)
* [Inference] Delete duplicated copy_vector (#5716)
* [ci] Fix example tests (#5714)
* [fix] revise timeout value on example CI
* trivial
* [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)
* Fix Llama3 Load error
* Omit Checkpoint IO Temporarily
* [Inference] Fix API server, test and example (#5712)
* fix api server
* fix generation config
* fix api server
* fix comments
* fix infer hanging bug
* resolve comments, change backend to free port
* [Inference] Delete duplicated package (#5723)
* [example] Update Inference Example (#5725)
* [example] update inference example
* [lazy] fix lazy cls init (#5720)
* fix
* fix
* fix
* fix
* fix
* remove kernel install
* rebase revert fix
* fix
* fix
* [Inference] Fix Inference Generation Config and Sampling (#5710)
* refactor and add
* config default values
* fix gen config passing
* fix rpc generation config
* [Fix/Inference] Add unsupported auto-policy error message (#5730)
* [fix] auto policy error message
* trivial
* [doc] Update Inference Readme (#5736)
* [doc] update inference readme
* add contents
* trivial
* [Shardformer] Add parallel output for shardformer models (bloom, falcon) (#5702)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add parallel cross entropy output for falcon model & fix some typos in bloom.py
* fix module name error, self.model -> self.transformers in bloom, falcon model
* Fix the overflow bug of distributed cross entropy loss function when training with fp16
* add dtype to parallel cross entropy loss function
* fix dtype related typos and prettify the loss.py
* fix grad dtype and update dtype mismatch error
* fix typo bugs
* [bug] fix silly bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* [chore] add test for prefetch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* [ci] Temporary fix for build on pr (#5741)
* temporary fix for CI
* timeout to 90
* [NFC] Fix code factors on inference triton kernels (#5743)
* [NFC] fix requirements (#5744)
* [inference] release (#5747)
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
Co-authored-by: FrankLeeeee <somerlee.9@gmail.com>
Co-authored-by: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Co-authored-by: xs_courtesy <xs1580802568@gmail.com>
Co-authored-by: Runyu Lu <runyulu@umich.edu>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: Yuanheng <jonathan.zhaoyh@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: Haze188 <haze188@qq.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
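A few of the mechanisms referenced in the commit list above can be illustrated with short, self-contained sketches. Several items ("[Inference] Add CacheBlock and KV-Cache Manager (#5156)", the batch-bucket and block-table work in "[Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)") revolve around paged KV cache management: each sequence keeps a block table mapping its logical cache slots to physical cache blocks drawn from a shared pool. The sketch below shows that idea only in outline; the class and method names are invented for illustration and are not the repository's API.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical cache block (illustrative value)

@dataclass
class SequenceCache:
    """Per-sequence view of a paged KV cache."""
    block_table: list = field(default_factory=list)  # logical block index -> physical block id
    num_tokens: int = 0

class SimpleBlockAllocator:
    """Toy allocator handing out physical block ids from a fixed pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate_token(self, seq: SequenceCache) -> tuple[int, int]:
        """Return (physical_block_id, offset) for the next token of `seq`."""
        offset = seq.num_tokens % BLOCK_SIZE
        if offset == 0:  # current block is full (or none allocated yet): grab a new one
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted")
            seq.block_table.append(self.free_blocks.pop())
        seq.num_tokens += 1
        return seq.block_table[-1], offset

    def free(self, seq: SequenceCache) -> None:
        """Recycle all blocks of a finished sequence."""
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()
        seq.num_tokens = 0

# usage: two sequences sharing one pool
allocator = SimpleBlockAllocator(num_blocks=8)
seq_a, seq_b = SequenceCache(), SequenceCache()
for _ in range(20):           # 20 tokens at 16 per block -> seq_a spans two blocks
    allocator.allocate_token(seq_a)
print(seq_a.block_table)      # e.g. [0, 1]
allocator.free(seq_a)         # blocks return to the pool for seq_b to reuse
```

Handing out and recycling whole blocks is what lets the engine batch sequences of very different lengths without reserving max-length cache space per sequence.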
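"[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)" states the layout change explicitly: the blocked key cache moves from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]. A small PyTorch sketch of that reshaping follows; the concrete sizes and the split factor x are illustrative only, not the values used by the kernels.

```python
import torch

num_blocks, num_kv_heads, block_size, head_size = 16, 8, 32, 128
x = 8  # illustrative split factor for the head dimension

# original blocked key cache layout: [num_blocks, num_kv_heads, block_size, head_size]
k_cache = torch.randn(num_blocks, num_kv_heads, block_size, head_size, dtype=torch.float16)

# new layout: [num_blocks, num_kv_heads, head_size // x, block_size, x]
k_cache_new = (
    k_cache.view(num_blocks, num_kv_heads, block_size, head_size // x, x)
    .permute(0, 1, 3, 2, 4)
    .contiguous()
)
assert k_cache_new.shape == (num_blocks, num_kv_heads, head_size // x, block_size, x)

# a single (block, head, token) key vector is recovered by undoing the split
block_id, head_id, token_id = 3, 1, 5
recovered = k_cache_new[block_id, head_id, :, token_id, :].reshape(head_size)
assert torch.equal(recovered, k_cache[block_id, head_id, token_id])
```

Splitting the head dimension into chunks of x places the values for consecutive tokens of a block next to each other, so a thread can issue contiguous, vector-width-aligned loads when reading the key cache; that is the usual motivation for this kind of layout.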
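The Spec-Dec commits (#5401, #5405, #5423, #5449, #5455) add a drafter model that proposes a few tokens which the main model then verifies in one forward pass. The schematic below uses greedy verification, recomputes the drafter prefix without a KV cache for clarity, and assumes HuggingFace-style causal LMs returning `.logits` with batch size 1; it is a simplification, not the repository's implementation.

```python
import torch

@torch.no_grad()
def speculative_step(main_model, drafter, input_ids: torch.Tensor, n_spec: int = 5):
    """One speculative-decoding step with greedy verification (batch size 1)."""
    # 1) the cheap drafter proposes n_spec tokens autoregressively
    draft_ids = input_ids
    for _ in range(n_spec):
        logits = drafter(draft_ids).logits[:, -1, :]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2) the main model scores the whole extended sequence in a single forward pass
    logits = main_model(draft_ids).logits
    start = input_ids.shape[1] - 1
    target_ids = logits[:, start:-1, :].argmax(dim=-1)   # main model's pick at each drafted position
    proposed = draft_ids[:, input_ids.shape[1]:]         # drafter's proposals, shape [1, n_spec]

    # 3) keep the longest prefix where drafter and main model agree (greedy acceptance rule)
    matches = (target_ids == proposed)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum().item())

    accepted = proposed[:, :n_accept]
    # bonus token: the main model's own prediction right after the accepted prefix
    bonus = logits[:, start + n_accept, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```

The payoff is that, per step, the main model runs once regardless of how many drafted tokens are accepted, so accepted tokens come almost for free.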
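The sampling-related commits (the logit processor and request handler in #5166, temperature handling and the top_p/top_k ValueError guard in #5689, repetition_penalty and no_repeat_ngram_size in #5708) share one pattern: a chain of processors that rewrite the next-token logits before sampling. The compact sketch below mirrors the common HF-style formulas rather than the exact logic in request_handler.py; the function names are assumptions for illustration.

```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # temperature must be > 0; a ValueError guard like the one added in #5689 is assumed upstream
    return logits / temperature

def apply_top_k(logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # mask everything below the k-th largest logit
    kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

def apply_repetition_penalty(logits: torch.Tensor, generated: torch.Tensor, penalty: float) -> torch.Tensor:
    # penalize tokens that already appear in the generated sequence
    scores = logits.gather(-1, generated)
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated, scores)

# usage on a dummy batch of logits
logits = torch.randn(1, 32000)
generated = torch.tensor([[1, 5, 42]])
out = apply_temperature(logits, 0.7)
out = apply_top_k(out, 50)
out = apply_repetition_penalty(out, generated, 1.2)
probs = torch.softmax(out, dim=-1)  # ready for multinomial sampling
```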