Commit
* [Inference] First PR for rebuild colossal-infer (#5143)
* add engine and scheduler
* add dirs
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [Inference] Add readme (roadmap) and fulfill request handler (#5147)
* request handler
* add readme
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [Inference/NFC] Clean outdated inference tests and deprecated kernels (#5159)
* [inference/nfc] remove outdated inference tests
* remove outdated kernel tests
* remove deprecated triton kernels
* remove imports from deprecated kernels
* [Inference]Add BatchInferState, Sequence and InferConfig (#5149)
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* [Inference] Add CacheBlock and KV-Cache Manager (#5156)
* [Inference] Add KVCache Manager
* function refactored
* add test for KVCache Manager
* add attr beam width
* Revise alloc func in CacheManager
* Fix docs and pytests
* add tp slicing for head number
* optimize shapes of tensors used as physical cache
* Apply using InferenceConfig on KVCacheManager
* rm duplicate config file
* Optimize cache allocation: use contiguous cache
* Fix config in pytest (and config)
* [Inference]Update inference config and fix test (#5178)
* unify the config setting
* fix test
* fix import
* fix test
* fix
* fix
* add logger
* revise log info
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* [Inference] Add the logic of the inference engine (#5173)
* add infer_struct and infer_config
* update codes
* change InferConfig
* Add hf_model_config to the engine
* rm _get_hf_model_config
* update codes
* made adjustments according to the feedback from the reviewer.
* update codes
* add ci test for config and struct
* Add the logic of the inference engine
* update engine and test
* Recover cache_manager.py
* add logger
* fix conflict
* update codes
* update codes
* update model and tokenizer
* fix add the logic about shardformer
* change kvcache_manager docstring
* add policy
* fix ci bug in test_kvcache_manager.py
* remove codes related to tokenizer and move model_policy
* fix code style
* add ordered_set to requirements-infer.txt
* Delete extra empty lines
* add ordered_set to requirements-test.txt
* [Inference] add logit processor and request handler (#5166)
* add logit processor and request handler
* add
* add
* add
* fix
* add search tokens and update func
* finish request handler
* add running list test
* fix test
* fix some bug
* add
* add
* fix bugs
* fix some bugs
* fix bug
* fix
* fix
* add copy fun
* del useless attn
* fix request status
---------
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
* Add padding llama model
* Fixed a bug in the inference frame
* fix bugs in request_handler
* precision alignment
* Fixed a writing error
* [kernel] Add triton kernel for context attention (FAv2) without padding (#5192)
* add context attn unpadded triton kernel
* test compatibility
* kv cache copy (testing)
* fix k/v cache copy
* fix kv cache copy and test
* fix boundary of block ptrs
* add support for GQA/MQA and testing
* fix import statement
---------
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
* add context_attention_unpadded
* fix bugs in sampler
* Fixed a typo
* fix beam_width
* [Inference] Pytorch Attention func, pad&nopad input support (#5219)
* add attn
* add attention test
* fix attn forward
* fix decoding
* fix bugs in attention.py and request_handler.py
* adapted to pad_context_forward
* [Hotfix] Fix accuracy and align attention method api with Triton kernel (#5229)
* fix accuracy
* alignment in attention
* fix attention
* fix
* fix bugs
* fix bugs
* fix bugs
* fix bugs related to processing padding mask
* fix CI bugs
* rm torch.cuda.synchronize
* fix bugs in request_handler.py and engine.py
* [Inference] Kernel: no pad rotary embedding (#5252)
* fix bugs
* comment
* use more accurate atol
* fix
* [kernel] Add flash decoding triton kernel for blocked kv cache (#5249)
* add flash decoding unpad triton kernel
* rename flash decoding kernel
* add kernel testing (draft)
* revise pytest
* support kv group (GQA)
* (trivial) fix api and pytest
* (trivial) func renaming
* (trivial) func/file renaming
* refactor pytest for attention
* (trivial) format and consistent vars of context/decode attn
* (trivial) remove test redundancy
* [git] fixed rebased files
* [kernel] Add KV cache copy kernel during decoding (#5261)
* add kv copy triton kernel during decoding stage
* add pytest and fix kernel
* fix test utilities
* revise kernel config
* add benchmark for kvcache copy
* [doc] updated inference readme (#5269)
* [Inference] Fix request handler and add recycle logic (#5260)
* fix request handler
* fix comment
* [kernel] Revise KVCache copy triton kernel API (#5273)
* [kernel/fix] revise kvcache copy kernel api
* fix benchmark
* [Inference]Adapted to the triton attn kernels (#5264)
* adapted to the triton attn kernels
* fix pad input
* adapted to copy_kv_to_blocked_cache
* fix ci test
* update kv memcpy
* remove print
* [kernel] Add RMSLayerNorm triton kernel (#5262)
* add layerrmsnorm triton kernel
* add layerrmsnorm kernel
* modify the atol and rtol in test file
* Remove the logics of mean computations, and update the name of the kernel functions and files
* add benchmark of rms norm
* [Hotfix] Fix bugs in testing continuous batching (#5270)
* fix bug
* fix bugs
* fix bugs
* fix bugs and add padding
* add funcs and fix bugs
* fix typos
* fix bugs
* add func
* [kernel/fix] Performance Optimization for Decoding Kernel and Benchmarking (#5274)
* prevent re-creating intermediate tensors
* add singleton class holding intermediate values
* fix triton kernel api
* add benchmark in pytest
* fix kernel api and add benchmark
* revise flash decoding triton kernel in/out shapes
* fix calling of triton kernel in modeling
* fix pytest: extract to util functions
* [inference] Adapted to Rotary Embedding and RMS Norm (#5283)
* adapted to rotary_embedding
* adapted to nopad rms norm
* fix bugs in benchmark
* fix flash_decoding.py
* add utils.py
* [Inference] Benchmarking rotary embedding and add a fetch function (#5277)
* fix bugs and add a cos/sin cache fetch func
* add docstring
* fix bug
* fix
* [Kernel/Fix] Revise flash attention triton kernel API and add benchmark (#5301)
* fix decoding kernel pytest
* revise and add triton context attn benchmark
* [Inference]Add fused rotary kernel and get cos cache kernel (#5302)
* add fused rotary and get cos cache func
* staged
* fix bugs
* fix bugs
* [hotfix] fix boundary check in batch (#5306)
* [inference]Optimize the usage of the mid tensors space in flash attn (#5304)
* opt flash attn
* opt tmp tensor
* fix benchmark_llama
* fix code style
* fix None logic for output tensor
* fix adapted to get_xine_cache
* add comment
* fix ci bugs
* fix some codes
* rm duplicated codes
* rm duplicated codes
* fix code style
* add _get_dtype in config.py
* fix (#5311)
* [Inference] Update rms norm kernel, benchmark with vLLM (#5315)
* add
* xi
* del
* del
* fix
* [DOC] Update inference readme (#5280)
* add readme
* add readme
* 1
* update engine
* finish readme
* add readme
* [Inference]Add Nopadding Llama Modeling (#5327)
* add nopadding llama modeling
* add nopadding_llama.py
* rm unused codes
* fix bugs in test_xine_copy.py
* fix code style
* [Infer] Optimize Blocked KVCache And Kernels Using It (#5325)
* revise shape of kvcache (context attn kernel)
* revise shape of kvcache (flash decoding kernel)
* revise shape of kvcache (kvcache copy) and attn func
* init of kvcache in kvcache manager
* revise llama modeling
* revise block size retrieval
* use torch for rms_norm benchmarking
* revise block size retrieval
* [Inference] Kernel Fusion, fused copy kv cache into rotary embedding (#5336)
* revise rotary embedding
* remove useless print
* adapt
* [inference] simplified config verification (#5346)
* [inference] simplified config verification
* polish
* polish
* [Inference]Replace Attention layer and MLP layer by shardformer to optimize the weight transpose operation, add fused_qkv and fused linear_add (#5340)
* add fused qkv
* replace attn and mlp by shardformer
* fix bugs in mlp
* add docstrings
* fix test_inference_engine.py
* add optimize unbind
* add fused_addmm
* rm squeeze(1)
* refactor codes
* fix ci bugs
* rename ShardFormerLlamaMLP and ShardFormerLlamaAttention
* Removed the dependency on LlamaFlashAttention2
* rollback test_inference_engine.py
* [inference] removed redundancy init_batch (#5353)
* [inference] moved ops tests to test_infer (#5354)
* [doc] updated inference readme (#5343)
* [Inference/opt]Optimize the mid tensor of RMS Norm (#5350)
* opt rms_norm
* fix bugs in rms_layernorm
* [Inference]Optimize generation process of inference engine (#5356)
* opt inference engine
* fix run_benchmark.sh
* fix generate in engine.py
* rollback test_inference_engine.py
* [Fix/Infer] Remove unused deps and revise requirements (#5341)
* remove flash-attn dep
* rm padding llama
* revise infer requirements
* move requirements out of module
* [Inference]Fused the gate and up proj in mlp, and optimized the autograd process. (#5365)
* fused the gate and up proj in mlp
* fix code styles
* opt auto_grad
* rollback test_inference_engine.py
* modifications based on the review feedback.
* fix bugs in flash attn
* Change reshape to view
* fix test_rmsnorm_triton.py
* [Inference] Adapt to Fused rotary (#5348)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* Revert "[Inference] Adapt to Fused rotary (#5348)" (#5373)
  This reverts commit 9f4ab2e.
* [inference] added inference template (#5375)
* [Inference/opt] Fused KVCache Memcopy (#5374)
* fused kv memcopy
* add TODO in test_kvcache_copy.py
* [Inference] User Experience: update the logic of default tokenizer and generation config. (#5337)
* add
* fix
* fix
* pause
* fix
* fix pytest
* align
* fix
* license
* fix
* fix
* fix readme
* fix some bugs
* remove tokenizer config
* [inference] refactored config (#5376)
* [Inference]Support vllm testing in benchmark scripts (#5379)
* add vllm benchmark scripts
* fix code style
* update run_benchmark.sh
* fix code style
* [Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)
* add kvcache manager funcs for batching
* add batch bucket for batching
* revise RunningList struct in handler
* add kvcache/batch funcs for compatibility
* use new batching methods
* fix indexing bugs
* revise abort logic
* use cpu seq lengths/block tables
* rm unused attr in Sequence
* fix type conversion/default arg
* add and revise pytests
* revise pytests, rm unused tests
* rm unused statements
* fix pop finished indexing issue
* fix: use index in batch when retrieving inputs/update seqs
* use dict instead of odict in batch struct
* arg type hinting
* fix make compress
* refine comments
* fix: pop_n_seqs to pop the first n seqs
* add check in request handler
* remove redundant conversion
* fix test for request handler
* fix pop method in batch bucket
* fix prefill adding
* [Inference]Fused kv copy into rotary calculation (#5383)
* revise rotary embedding
* remove useless print
* adapt
* fix
* add
* fix
* modeling
* fix
* fix
* fix
* fused kv copy
* fused copy
* colossalai/kernel/triton/no_pad_rotary_embedding.py
* del padding llama
* del
* Optimized the execution interval time between cuda kernels caused by view and memcopy (#5390)
* opt_view_and_memcopy
* fix bugs in ci
* fix ci bugs
* update benchmark scripts
* fix ci bugs
* [Fix/Inference] Fix format of input prompts and input model in inference engine (#5395)
* Fix bugs in inference_engine
* fix bugs in engine.py
* rm CUDA_VISIBLE_DEVICES
* add request_ids in generate
* fix bug in engine.py
* add logger.debug for BatchBucket
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* [Inference]Add CUDA KVCache Kernel (#5406)
* add cuda KVCache kernel
* annotation benchmark_kvcache_copy
* add use cuda
* fix import path
* move benchmark scripts to example/
* rm benchmark codes in test_kv_cache_memcpy.py
* rm redundancy codes
* rm redundancy codes
* pr was modified according to the review
* [Inference]Move benchmark-related code to the example directory. (#5408)
* move benchmark-related code to the example directory.
* fix bugs in test_fused_rotary_embedding.py
* add silu_and_mul for infer
* [feat] cuda graph support and refactor non-functional api
* add reusable utils for cuda
* refactor code
* feat rmsnorm cuda kernel and add unittest, benchmark script (#5417)
* [fix] multi graphs capture error
* [fix] multi graphs capture error
* [doc] add doc
* refactor code
* optimize rmsnorm: add vectorized elementwise op, feat loop unrolling (#5441)
* fix include path
* fix rmsnorm template function invocation problem (template function partial specialization is not allowed in Cpp) and luckily pass e2e precision test (#5454)
* [Inference/kernel]Add Fused Rotary Embedding and KVCache Memcopy CUDA Kernel (#5418)
* add rotary embedding kernel
* add rotary_embedding_kernel
* add fused rotary_emb and kvcache memcopy
* add fused_rotary_emb_and_cache_kernel.cu
* add fused_rotary_emb_and_memcopy
* fix bugs in fused_rotary_emb_and_cache_kernel.cu
* fix ci bugs
* use vec memcopy and opt the global memory access
* fix code style
* fix test_rotary_embdding_unpad.py
* codes revised based on the review comments
* fix bugs about include path
* rm inline
* [fix] pytest and fix dyn grid bug
* diverse tests
* add implementation for GetGPULaunchConfig1D
* [fix] tmp for test
* add some comments
* refactor vector utils
* [feat] add use_cuda_kernel option
* add vec_type_trait implementation (#5473)
* [fix] unused option
* [fix]
* [fix]
* [fix] remove unused comment
* [Inference]Support FP16/BF16 Flash Attention 2 And Add high_precision Flag To Rotary Embedding (#5461)
* Support FP16/BF16 Flash Attention 2
* fix bugs in test_kv_cache_memcpy.py
* add context_kv_cache_memcpy_kernel.cu
* rm typename MT
* add tail process
* add high_precision
* add high_precision to config.py
* rm unused code
* change the comment for the high_precision parameter
* update test_rotary_embdding_unpad.py
* fix vector_copy_utils.h
* add comment for self.high_precision when using float32
* [fix] PR #5354 (#5501)
* [fix]
* [fix]
* Update config.py docstring
* [fix] docstring align
* [fix] docstring align
* [fix] docstring align
* [Inference] Optimize request handler of llama (#5512)
* optimize request_handler
* fix ways of writing
* The writing style of tail processing and the logic related to macro definitions have been optimized. (#5519)
* [Inference/Kernel]Add get_cos_and_sin Kernel (#5528)
* Add get_cos_and_sin kernel
* fix code comments
* fix code typos
* merge common codes of get_cos_and_sin kernel.
* Fixed a typo
* Changed 'asset allclose' to 'assert equal'.
* [Inference] Add Reduce Utils (#5537)
* add reduce utils
* add using to delete namespace prefix
* [Fix/Inference] Remove unused and non-functional functions (#5543)
* [fix] remove unused func
* rm non-functional partial
* add cast and op_functor for cuda built-in types (#5546)
* remove unused triton kernels
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* remove outdated triton test
* [Infer] Revise and Adapt Triton Kernels for Spec-Dec (#5401)
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* resolve conflicts for revising flash-attn
* adapt kv cache copy kernel for spec-dec
* fix seqlen-n kvcache copy kernel/tests
* test kvcache copy - use torch.equal
* add assertions
* (trivial) comment out
* [Inference/SpecDec] Add Basic Drafter Model Container (#5405)
* [Infer/Fix] Fix Dependency in test - RMSNorm kernel (#5399)
  fix dependency in pytest
* add drafter model container (basic ver)
* [Inference/SpecDec] Add Speculative Decoding Implementation (#5423)
* fix flash decoding mask during verification
* add spec-dec
* add test for spec-dec
* revise drafter init
* remove drafter sampling
* retire past kv in drafter
* (trivial) rename attrs
* (trivial) rename arg
* revise how we enable/disable spec-dec
* [SpecDec] Fix inputs for speculation and revise past KV trimming (#5449)
* fix drafter pastkv and usage of batch bucket
* [Inference/SpecDec] Support GLIDE Drafter Model (#5455)
* add glide-llama policy and modeling
* update glide modeling, compatible with transformers 4.36.2
* revise glide llama modeling/usage
* fix issues of glimpsing large kv
* revise the way re-loading params for glide drafter
* fix drafter and engine tests
* enable convert to glide strict=False
* revise glide llama modeling
* revise vicuna prompt template
* revise drafter and tests
* apply usage of glide model in engine
* [doc] Add inference/speculative-decoding README (#5552)
* add README for spec-dec
* update roadmap
* [Fix] resolve conflicts of rebasing feat/speculative-decoding (#5557)
  - resolve conflicts of rebasing feat/speculative-decoding
* [Fix] Llama Modeling Control with Spec-Dec (#5580)
  - fix ref before asgmt
  - fall back to use triton kernels when using spec-dec
* refactor csrc (#5582)
* [Inference/Refactor] Delete Duplicated code and refactor vec_copy utils and reduce utils (#5593)
* delete duplicated code and refactor vec_copy utils and reduce utils
* delete unused header file
* [inference/model]Adapted to the baichuan2-7B model (#5591)
* Adapted to the baichuan2-7B model
* modified according to the review comments.
* Modified the method of obtaining random weights.
* modified according to the review comments.
* change mlp layer 'NOTE'
* [Inference/Kernel] Add Paged Decoding kernel, sequence split within the same thread block (#5531)
* feat flash decoding for paged attention
* refactor flashdecodingattention
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Feat]Tensor Model Parallel Support For Inference (#5563)
* tensor parallel support naive source
* [fix]precision, model load and refactor the framework
* add tp unit test
* docstring
* fix do_sample
* feat baichuan2 rmsnorm whose hidden size equals to 5120 (#5611)
* [Fix/Inference] Fix GQA Triton and Support Llama3 (#5624)
* [fix] GQA calling of flash decoding triton
* fix kv cache alloc shape
* fix rotary triton - GQA
* fix sequence max length assigning
* Sequence max length logic
* fix scheduling and spec-dec
* skip without import error
* fix pytest - skip without ImportError
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Fix/Inference]Fix CUDA Rotary Embedding GQA (#5623)
* fix rotary embedding GQA
* change test_rotary_embdding_unpad.py KH
* [example] Update Llama Inference example (#5629)
* [example] add inference benchmark llama3
* revise inference config - arg
* remove unused args
* add llama generation demo script
* fix init rope in llama policy
* add benchmark-llama3 - cleanup
* [Inference/Refactor] Refactor compilation mechanism and unified multi hw (#5613)
* refactor compilation mechanism and unified multi hw
* fix file path bug
* add init.py to make pybind a module to avoid relative path error caused by softlink
* delete duplicated macros
* fix macros bug in gcc
* [Fix/Inference]Fix vllm benchmark (#5630)
* Fix bugs about OOM when running vllm-0.4.0
* rm used params
* change generation_config
* change benchmark log file name
* [Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)
* optimize flashdecodingattention: refactor code with different key cache layout (from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x])
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Fix] Remove obsolete files - inference (#5650)
* [Inference]Adapt to baichuan2 13B (#5614)
* adapt to baichuan2 13B
* adapt to baichuan2 13B
* change BAICHUAN_MODEL_NAME_OR_PATH
* fix test_decoding_attn.py
* Modifications based on review comments.
* change BAICHUAN_MODEL_NAME_OR_PATH
* mv attn mask processes to test flash decoding
* mv get_alibi_slopes baichuan modeling
* fix bugs in test_baichuan.py
* [kernel] Support new KCache Layout - Context Attention Triton Kernel (#5658)
* add context attn triton kernel - new kcache layout
* add benchmark triton
* tiny revise
* trivial - code style, comment
* [Inference/Feat] Add kvcache quantization support for FlashDecoding (#5656)
* [Inference/Feat] Feat quant kvcache step2 (#5674)
* [Inference] Adapt Baichuan2-13B TP (#5659)
* adapt to baichuan2 13B
* add baichuan2 13B TP
* update baichuan tp logic
* rm unused code
* Fix TP logic
* fix alibi slopes tp logic
* rm nn.Module
* Polished the code.
* change BAICHUAN_MODEL_NAME_OR_PATH
* Modified the logic for loading Baichuan weights.
* fix typos
* [Inference/Kernel] refactor kvcache manager and rotary_embedding and kvcache_memcpy oper… (#5663)
* refactor kvcache manager and rotary_embedding and kvcache_memcpy operator
* refactor decode_kv_cache_memcpy
* enable alibi in pagedattention
* [Inference/Feat] Add kvcache quant support for fused_rotary_embedding_cache_copy (#5680)
* [inference]Add alibi to flash attn function (#5678)
* add alibi to flash attn function
* rm redundant modifications
* [Inference] Fix quant bits order (#5681)
* [kernel] Support New KCache Layout - Triton Kernel (#5677)
* kvmemcpy triton for new kcache layout
* revise tests for new kcache layout
* naive triton flash decoding - new kcache layout
* rotary triton kernel - new kcache layout
* remove redundancy - triton decoding
* remove redundancy - triton kvcache copy
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Fix] Fix & Update Inference Tests (compatibility w/ main)
* [Inference] Remove unnecessary float4_ and rename float8_ to float8 (#5679)
* [Inference/Feat] Add quant kvcache support for decode_kv_cache_memcpy (#5686)
* [hotfix] Fix KV Heads Number Assignment in KVCacheManager (#5695)
  - Fix key value number assignment in KVCacheManager, as well as method of accessing
* [Fix] Fix Inference Example, Tests, and Requirements (#5688)
* clean requirements
* modify example inference struct
* add test ci scripts
* mark test_infer as submodule
* rm deprecated cls & deps
* import of HAS_FLASH_ATTN
* prune inference tests to be run
* prune triton kernel tests
* increment pytest timeout mins
* revert import path in openmoe
* [hotfix] fix OpenMOE example import path (#5697)
* [Inference]Adapt temperature processing logic (#5689)
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* [Inference] Support the logic related to ignoring EOS token (#5693)
* Adapt temperature processing logic
* add ValueError for top_p and top_k
* add GQA Test
* fix except_msg
* support ignore EOS token
* change variable's name
* fix annotation
* [Inference] ADD async and sync Api server using FastAPI (#5396)
* add api server
* fix
* add
* add completion service and fix bug
* add generation config
* revise shardformer
* fix bugs
* add docstrings and fix some bugs
* fix bugs and add choices for prompt template
* [Inference] Finish Online Serving Test, add streaming output api, continuous batching test and example (#5432)
* finish online test and add examples
* fix test_contionus_batching
* fix some bugs
* fix bash
* fix
* fix inference
* finish revision
* fix typos
* revision
* [Online Server] Chat Api for streaming and not streaming response (#5470)
* fix bugs
* fix bugs
* fix api server
* fix api server
* add chat api and test
* del request.n
* [Inference] resolve rebase conflicts fix
* [Inference] Fix bugs and docs for feat/online-server (#5598)
* fix test bugs
* add do sample test
* del useless lines
* fix comments
* fix tests
* delete version tag
* delete version tag
* add
* del test server
* fix test
* fix
* Revert "add"
  This reverts commit b9305fb.
* resolve rebase conflicts on Branch feat/online-serving
* [Inference] Add example test_ci script
* [Inference/Feat] Add quant kvcache interface (#5700)
* add quant kvcache interface
* delete unused output
* complete args comments
* [Inference/Feat] Add convert_fp8 op for fp8 test in the future (#5706)
* add convert_fp8 op for fp8 test in the future
* rerun ci
* [Inference]Adapt repetition_penalty and no_repeat_ngram_size (#5708)
* Adapt repetition_penalty and no_repeat_ngram_size
* fix no_repeat_ngram_size_logit_process
* remove batch_updated
* fix annotation
* modified codes based on the review feedback.
* rm get_batch_token_ids
* [Feat]Inference RPC Server Support (#5705)
* rpc support source
* kv cache logical/physical disaggregation
* sampler refactor
* colossalai launch built in
* Unitest
* Rpyc support
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* add paged-attention v2: support seq length split across thread block (#5707)
* [Inference] Delete duplicated copy_vector (#5716)
* [ci] Fix example tests (#5714)
* [fix] revise timeout value on example CI
* trivial
* [Fix] Llama3 Load/Omit CheckpointIO Temporarily (#5717)
* Fix Llama3 Load error
* Omit Checkpoint IO Temporarily
* [Inference] Fix API server, test and example (#5712)
* fix api server
* fix generation config
* fix api server
* fix comments
* fix infer hanging bug
* resolve comments, change backend to free port
* [Inference] Delete duplicated package (#5723)
* [example] Update Inference Example (#5725)
* [example] update inference example
* [lazy] fix lazy cls init (#5720)
* fix
* fix
* fix
* fix
* fix
* remove kernel install
* rebase revert fix
* fix
* fix
* [Inference] Fix Inference Generation Config and Sampling (#5710)
* refactor and add
* config default values
* fix gen config passing
* fix rpc generation config
* [Fix/Inference] Add unsupported auto-policy error message (#5730)
* [fix] auto policy error message
* trivial
* [doc] Update Inference Readme (#5736)
* [doc] update inference readme
* add contents
* trivial
* [Shardformer] Add parallel output for shardformer models (bloom, falcon) (#5702)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* add parallel cross entropy output for falcon model & fix some typos in bloom.py
* fix module name error, self.model -> self.transformers in bloom, falcon model
* Fix the overflow bug of distributed cross entropy loss function when training with fp16
* add dtype to parallel cross entropy loss function
* fix dtype related typos and prettify the loss.py
* fix grad dtype and update dtype mismatch error
* fix typo bugs
* [bug] fix silly bug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* [chore] add test for prefetch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* [ci] Temporary fix for build on pr (#5741)
* temporary fix for CI
* timeout to 90
* [NFC] Fix code factors on inference triton kernels (#5743)
* [NFC] fix requirements (#5744)
* [inference] release (#5747)
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
* [inference] release
---------
Co-authored-by: Jianghai <72591262+CjhHa1@users.noreply.github.com>
Co-authored-by: Yuanheng Zhao <54058983+yuanheng-zhao@users.noreply.github.com>
Co-authored-by: yuehuayingxueluo <867460659@qq.com>
Co-authored-by: Round Heng <yuanhengzhao@Rounds-MacBook-Pro.local>
Co-authored-by: FrankLeeeee <somerlee.9@gmail.com>
Co-authored-by: Yaozheng Fang <62918515+nkfyz@users.noreply.github.com>
Co-authored-by: xs_courtesy <xs1580802568@gmail.com>
Co-authored-by: Runyu Lu <runyulu@umich.edu>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: Runyu Lu <77330637+LRY89757@users.noreply.github.com>
Co-authored-by: Yuanheng <jonathan.zhaoyh@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: CjhHa1 <cjh18671720497@outlook.com>
Co-authored-by: flybird11111 <1829166702@qq.com>
Co-authored-by: Haze188 <haze188@qq.com>
Co-authored-by: binmakeswell <binmakeswell@gmail.com>
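A few of the mechanisms referenced in the commit list above can be illustrated with short, self-contained sketches. Several items ("[Inference] Add CacheBlock and KV-Cache Manager (#5156)", the batch-bucket and block-table work in "[Inference] Optimize and Refactor Inference Batching/Scheduling (#5367)") revolve around paged KV cache management: each sequence keeps a block table mapping its logical cache slots to physical cache blocks drawn from a shared pool. The sketch below shows that idea only in outline; the class and method names are invented for illustration and are not the repository's API.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per physical cache block (illustrative value)

@dataclass
class SequenceCache:
    """Per-sequence view of a paged KV cache."""
    block_table: list = field(default_factory=list)  # logical block index -> physical block id
    num_tokens: int = 0

class SimpleBlockAllocator:
    """Toy allocator handing out physical block ids from a fixed pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate_token(self, seq: SequenceCache) -> tuple[int, int]:
        """Return (physical_block_id, offset) for the next token of `seq`."""
        offset = seq.num_tokens % BLOCK_SIZE
        if offset == 0:  # current block is full (or none allocated yet): grab a new one
            if not self.free_blocks:
                raise RuntimeError("KV cache exhausted")
            seq.block_table.append(self.free_blocks.pop())
        seq.num_tokens += 1
        return seq.block_table[-1], offset

    def free(self, seq: SequenceCache) -> None:
        """Recycle all blocks of a finished sequence."""
        self.free_blocks.extend(seq.block_table)
        seq.block_table.clear()
        seq.num_tokens = 0

# usage: two sequences sharing one pool
allocator = SimpleBlockAllocator(num_blocks=8)
seq_a, seq_b = SequenceCache(), SequenceCache()
for _ in range(20):           # 20 tokens at 16 per block -> seq_a spans two blocks
    allocator.allocate_token(seq_a)
print(seq_a.block_table)      # e.g. [0, 1]
allocator.free(seq_a)         # blocks return to the pool for seq_b to reuse
```

Handing out and recycling whole blocks is what lets the engine batch sequences of very different lengths without reserving max-length cache space per sequence.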
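"[Inference/Kernel] Optimize paged attention: Refactor key cache layout (#5643)" states the layout change explicitly: the blocked key cache moves from [num_blocks, num_kv_heads, block_size, head_size] to [num_blocks, num_kv_heads, head_size/x, block_size, x]. A small PyTorch sketch of that reshaping follows; the concrete sizes and the split factor x are illustrative only, not the values used by the kernels.

```python
import torch

num_blocks, num_kv_heads, block_size, head_size = 16, 8, 32, 128
x = 8  # illustrative split factor for the head dimension

# original blocked key cache layout: [num_blocks, num_kv_heads, block_size, head_size]
k_cache = torch.randn(num_blocks, num_kv_heads, block_size, head_size, dtype=torch.float16)

# new layout: [num_blocks, num_kv_heads, head_size // x, block_size, x]
k_cache_new = (
    k_cache.view(num_blocks, num_kv_heads, block_size, head_size // x, x)
    .permute(0, 1, 3, 2, 4)
    .contiguous()
)
assert k_cache_new.shape == (num_blocks, num_kv_heads, head_size // x, block_size, x)

# a single (block, head, token) key vector is recovered by undoing the split
block_id, head_id, token_id = 3, 1, 5
recovered = k_cache_new[block_id, head_id, :, token_id, :].reshape(head_size)
assert torch.equal(recovered, k_cache[block_id, head_id, token_id])
```

Splitting the head dimension into chunks of x places the values for consecutive tokens of a block next to each other, so a thread can issue contiguous, vector-width-aligned loads when reading the key cache; that is the usual motivation for this kind of layout.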
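The Spec-Dec commits (#5401, #5405, #5423, #5449, #5455) add a drafter model that proposes a few tokens which the main model then verifies in one forward pass. The schematic below uses greedy verification, recomputes the drafter prefix without a KV cache for clarity, and assumes HuggingFace-style causal LMs returning `.logits` with batch size 1; it is a simplification, not the repository's implementation.

```python
import torch

@torch.no_grad()
def speculative_step(main_model, drafter, input_ids: torch.Tensor, n_spec: int = 5):
    """One speculative-decoding step with greedy verification (batch size 1)."""
    # 1) the cheap drafter proposes n_spec tokens autoregressively
    draft_ids = input_ids
    for _ in range(n_spec):
        logits = drafter(draft_ids).logits[:, -1, :]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2) the main model scores the whole extended sequence in a single forward pass
    logits = main_model(draft_ids).logits
    start = input_ids.shape[1] - 1
    target_ids = logits[:, start:-1, :].argmax(dim=-1)   # main model's pick at each drafted position
    proposed = draft_ids[:, input_ids.shape[1]:]         # drafter's proposals, shape [1, n_spec]

    # 3) keep the longest prefix where drafter and main model agree (greedy acceptance rule)
    matches = (target_ids == proposed)[0].long()
    n_accept = int(matches.cumprod(dim=0).sum().item())

    accepted = proposed[:, :n_accept]
    # bonus token: the main model's own prediction right after the accepted prefix
    bonus = logits[:, start + n_accept, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, accepted, bonus], dim=-1)
```

The payoff is that, per step, the main model runs once regardless of how many drafted tokens are accepted, so accepted tokens come almost for free.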
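The sampling-related commits (the logit processor and request handler in #5166, temperature handling and the top_p/top_k ValueError guard in #5689, repetition_penalty and no_repeat_ngram_size in #5708) share one pattern: a chain of processors that rewrite the next-token logits before sampling. The compact sketch below mirrors the common HF-style formulas rather than the exact logic in request_handler.py; the function names are assumptions for illustration.

```python
import torch

def apply_temperature(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    # temperature must be > 0; a ValueError guard like the one added in #5689 is assumed upstream
    return logits / temperature

def apply_top_k(logits: torch.Tensor, top_k: int) -> torch.Tensor:
    # mask everything below the k-th largest logit
    kth = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))

def apply_repetition_penalty(logits: torch.Tensor, generated: torch.Tensor, penalty: float) -> torch.Tensor:
    # penalize tokens that already appear in the generated sequence
    scores = logits.gather(-1, generated)
    scores = torch.where(scores > 0, scores / penalty, scores * penalty)
    return logits.scatter(-1, generated, scores)

# usage on a dummy batch of logits
logits = torch.randn(1, 32000)
generated = torch.tensor([[1, 5, 42]])
out = apply_temperature(logits, 0.7)
out = apply_top_k(out, 50)
out = apply_repetition_penalty(out, generated, 1.2)
probs = torch.softmax(out, dim=-1)  # ready for multinomial sampling
```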