Squashed commit of the following: · Hyper-Accel/vllm@a10068c

Squashed commit of the following:

commit 94bf9ae4e9b8199636668ccbe4dabcdc3b9e5ae6
Author: Andy Dai <76841985+Imss27@users.noreply.github.com>
Date:   Thu Oct 10 17:33:16 2024 -0700

    [Misc] Fix sampling from sonnet for long context case (#9235)

commit f990bab2a4198c4de6b5b349d35fc74bf0f36f3e
Author: omrishiv <327609+omrishiv@users.noreply.github.com>
Date:   Thu Oct 10 16:36:32 2024 -0700

    [Doc][Neuron] add note to neuron documentation about resolving triton issue (#9257)

    Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

commit e00c094f15e79c5a113fdf975df1ee9018cb65b3
Author: youkaichao <youkaichao@gmail.com>
Date:   Thu Oct 10 15:54:23 2024 -0700

    [torch.compile] generic decorators (#9258)

commit a78c6ba7c88a7bb42b38410f9dcfa5b342b95b57
Author: Kevin H. Luu <kevin@anyscale.com>
Date:   Thu Oct 10 15:45:09 2024 -0700

    [ci/build] Add placeholder command for custom models test (#9262)

commit fb870fd491482cfe5a41648b8c081d1bd6941205
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Oct 10 13:30:46 2024 -0700

    Bump actions/setup-python from 3 to 5 (#9195)

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 270953bafb1ccf444f2018d1c0a88c51472de22e
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Oct 10 13:30:35 2024 -0700

    Bump actions/checkout from 3 to 4 (#9196)

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit 9cc811c4ff3d5200cc23f16709f540821531b77c
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Thu Oct 10 13:30:24 2024 -0700

    Bump actions/github-script from 6 to 7 (#9197)

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit e4d652ea3ed9b2a60c1582cb2e2605695e61280f
Author: youkaichao <youkaichao@gmail.com>
Date:   Thu Oct 10 12:39:36 2024 -0700

    [torch.compile] integration with compilation control (#9058)

commit 78c0b4166cb097de749993970b51cb7b8becba58
Author: Simon Mo <simon.mo@hey.com>
Date:   Thu Oct 10 12:29:24 2024 -0700

    Suggest codeowners for the core components (#9210)

commit 21efb603f5f88a0d78ad11e4fbc6e18fe83916d4
Author: jordanyono <40174853+jyono@users.noreply.github.com>
Date:   Thu Oct 10 14:18:18 2024 -0400

    [CI/Build] Make the `Dockerfile.cpu` file's  `PIP_EXTRA_INDEX_URL` Configurable as a Build Argument (#9252)

commit 055f3270d40bbc492630d0f2c96ec8b64823ba34
Author: Rafael Vasquez <rafvasq21@gmail.com>
Date:   Thu Oct 10 13:48:51 2024 -0400

    [Doc] Improve debugging documentation (#9204)

    Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

commit 18511aeda64b473314bb7727a97a220565e0af41
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Thu Oct 10 13:39:56 2024 -0400

    [Bugfix] Fix Machete unittests failing with `NotImplementedError` (#9218)

commit 83ea5c72b9a287b65c9f7b95fbd868b3f613e6f5
Author: Ilya Lavrenov <ilya.lavrenov@intel.com>
Date:   Thu Oct 10 21:18:58 2024 +0400

    [OpenVINO] Use torch 2.4.0 and newer optimum version (#9121)

    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit 04de9057ab8099291e66ad876e78693c7c2f2ce5
Author: whyiug <whyiug@hotmail.com>
Date:   Thu Oct 10 23:00:47 2024 +0800

    [Model] support input image embedding for minicpmv (#9237)

commit 07c11cf4d4b9a913fa52142fe134849f1e25e393
Author: Isotr0py <2037008807@qq.com>
Date:   Thu Oct 10 21:11:56 2024 +0800

    [Bugfix] Fix lm_head weights tying with lora for llama (#9227)

commit f3a507f1d31e13a99c4fc8ac02738a73c3e3136f
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Wed Oct 9 23:17:17 2024 -0700

    [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149)

commit a64e7b940734b68d849ed2b07ca1bc3824713555
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Thu Oct 10 02:16:17 2024 -0400

    [Bugfix] Machete garbage results for some models (large K dim) (#9212)

commit ce00231a8bfb5eae85167b5a3def1b7304c723b6
Author: Michael Goin <michael@neuralmagic.com>
Date:   Thu Oct 10 02:15:40 2024 -0400

    [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models (#9213)

commit de895f1697d22ea19a5a4d4ab3dc17037a3e9af3
Author: youkaichao <youkaichao@gmail.com>
Date:   Wed Oct 9 21:58:27 2024 -0700

    [misc] improve model support check in another process (#9208)

commit cf25b93bddb607077e52cbe4681332ca61aff189
Author: Russell Bryant <rbryant@redhat.com>
Date:   Thu Oct 10 00:10:09 2024 -0400

    [Core] Fix invalid args to _process_request (#9201)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit d5fbb8706d2c7fd00b64cff2efbe7c771fe82c3c
Author: Michael Goin <michael@neuralmagic.com>
Date:   Wed Oct 9 14:51:47 2024 -0400

    [CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 (#9130)

    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit cdca8994bd856a234112875a92746c5782837768
Author: Russell Bryant <rbryant@redhat.com>
Date:   Wed Oct 9 13:15:28 2024 -0400

    [CI/Build] mypy: check vllm/entrypoints (#9194)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit ca77dd7a44f2bc103c668560818918ac0335835a
Author: Li, Jiang <jiang1.li@intel.com>
Date:   Thu Oct 10 00:28:08 2024 +0800

    [Hardware][CPU] Support AWQ for CPU backend (#7515)

commit 7dea289066eaed35538e74dfadafd1fea1dbe05d
Author: Ewout ter Hoeven <E.M.terHoeven@student.tudelft.nl>
Date:   Wed Oct 9 17:16:26 2024 +0200

    Add Dependabot configuration for GitHub Actions updates (#1217)

    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit cfaa6008e666d4e9bb5131ece68f8609b6f94ee4
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Wed Oct 9 22:59:57 2024 +0800

    [Bugfix] Access `get_vocab` instead of `vocab` in tool parsers (#9188)

commit 21906a6f50ee0edf49ede856a82e8840bab41471
Author: Ahmad Fahadh Ilyas <37577369+fahadh4ilyas@users.noreply.github.com>
Date:   Wed Oct 9 05:10:44 2024 -0700

    [Bugfix] Fix lora loading for Compressed Tensors in #9120 (#9179)

commit dc4aea677ab0520d91ff4979e80340cb5a090095
Author: Jiangtao Hu <ycool@users.noreply.github.com>
Date:   Wed Oct 9 16:59:42 2024 +0800

    [Doc] Fix VLM prompt placeholder sample bug (#9170)

commit c8627cd41b10747da393b76c382de5ef0eb635a2
Author: youkaichao <youkaichao@gmail.com>
Date:   Wed Oct 9 00:38:40 2024 -0700

    [ci][test] use load dummy for testing (#9165)

commit 8bfaa4e31eb63d41499fec933e68969ebbedb01f
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Wed Oct 9 15:36:55 2024 +0800

    [Bugfix] fix composite weight loading and EAGLE weight loading (#9160)

commit 0b5b5d767e7fdc0b1070b37319de749e46a4d42a
Author: AlpinDale <52078762+AlpinDale@users.noreply.github.com>
Date:   Wed Oct 9 07:03:14 2024 +0000

    [Frontend] Log the maximum supported concurrency (#8831)

commit cdc72e3c80b7029c49de9667150f68481f386956
Author: Hui Liu <96135754+hliuca@users.noreply.github.com>
Date:   Tue Oct 8 23:43:06 2024 -0700

    [Model] Remap FP8 kv_scale in CommandR and DBRX (#9174)

commit 7627172bf42b9cd628402c98845c6ac3de80859a
Author: Joe Rowell <joerowell4@gmail.com>
Date:   Wed Oct 9 06:43:34 2024 +0100

    [Bugfix][Doc] Report neuron error in output (#9159)

commit 480b7f40cfa9a900e03ea4e825abc1a46b5d085b
Author: Travis Johnson <tsjohnso@us.ibm.com>
Date:   Tue Oct 8 22:54:48 2024 -0600

    [Misc] Improve validation errors around best_of and n (#9167)

    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

commit acce7630c1dd655ca95a9f1abff23d92ef76262c
Author: Yuan Tang <terrytangyuan@gmail.com>
Date:   Tue Oct 8 23:58:49 2024 -0400

    Update link to KServe deployment guide (#9173)

commit ffc4b27ea8924b4b5add13552063c93d0a14fb85
Author: Yuan Tang <terrytangyuan@gmail.com>
Date:   Tue Oct 8 22:30:48 2024 -0400

    Add classifiers in setup.py (#9171)

commit 2f4117c38e101ee63b65521c93b22efe3526f77e
Author: chenqianfzh <51831990+chenqianfzh@users.noreply.github.com>
Date:   Tue Oct 8 18:52:19 2024 -0700

    support bitsandbytes quantization with more models (#9148)

commit 9ba0bd6aa6a9a3cefa5c320800ea736a0abbaf36
Author: Michael Goin <michael@neuralmagic.com>
Date:   Tue Oct 8 21:22:31 2024 -0400

    Add `lm-eval` directly to requirements-test.txt (#9161)

commit 2a131965a8144d571a4a211a44d1fc32e202ae10
Author: Russell Bryant <rbryant@redhat.com>
Date:   Tue Oct 8 18:08:22 2024 -0400

    mypy: check additional directories (#9162)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit bd37b9fbe274e28e12c0687cb9a8111dda270936
Author: bnellnm <49004751+bnellnm@users.noreply.github.com>
Date:   Tue Oct 8 17:28:12 2024 -0400

    [Bugfix] Try to handle older versions of pytorch (#9086)

commit de24046fcd24e8faa81de34b17351887bcdfbe51
Author: Rafael Vasquez <rafvasq21@gmail.com>
Date:   Tue Oct 8 16:22:08 2024 -0400

    [Doc] Improve contributing and installation documentation (#9132)

    Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>

commit 1874c6a1b0ae0f9eb2b485653b4e17ed1d861a32
Author: Sayak Paul <spsayakpaul@gmail.com>
Date:   Tue Oct 8 23:42:29 2024 +0530

    [Doc] Update vlm.rst to include an example on videos (#9155)

    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

commit 9a94ca4a5d31c0ba57ca67fc1c252233d3284012
Author: Daniele <36171005+dtrifiro@users.noreply.github.com>
Date:   Tue Oct 8 18:38:40 2024 +0200

    [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (#8537)

commit cfba685bd462f360994da7ac0d33f9759589506e
Author: Peter Pan <peter.pan@daocloud.io>
Date:   Wed Oct 9 00:37:34 2024 +0800

    [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models (#8758)

    Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

commit 069d3bd8d01a72e93c0a5b51f8b567e8aaddc6e9
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Tue Oct 8 08:31:26 2024 -0600

    [Frontend] Add Early Validation For Chat Template / Tool Call Parser (#9151)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

commit a3691b6b5eb7e60039a8ff34550be5a7e8365394
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Tue Oct 8 08:12:56 2024 -0600

    [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

commit 8c746226c956f7c8a4672689fee91c7d22befed6
Author: Brendan Wong <35351983+LunrEclipse@users.noreply.github.com>
Date:   Mon Oct 7 22:51:43 2024 -0700

    [Frontend] API support for beam search for MQLLMEngine (#9117)

commit e1faa2a59876bba99d804c0a94d427cee87b0995
Author: youkaichao <youkaichao@gmail.com>
Date:   Mon Oct 7 22:26:25 2024 -0700

    [misc] improve ux on readme (#9147)

commit 80b57f00d554db8a2126d351bb5374c190b56699
Author: Kunshang Ji <kunshang.ji@intel.com>
Date:   Tue Oct 8 11:51:14 2024 +0800

    [Intel GPU] Fix xpu decode input  (#9145)

commit 04c12f81572be22c819018c2fcbddac5f08715d0
Author: youkaichao <youkaichao@gmail.com>
Date:   Mon Oct 7 19:51:49 2024 -0700

    [misc] update utils to support comparing multiple settings (#9140)

commit 8eeb85708428b7735bbd1156c81692431fd5ff34
Author: Simon Mo <simon.mo@hey.com>
Date:   Mon Oct 7 17:06:21 2024 -0700

    Add Slack to README (#9137)

commit fa45513a5189b3a9f73a59730c9ac65d061e1311
Author: youkaichao <youkaichao@gmail.com>
Date:   Mon Oct 7 16:07:05 2024 -0700

    [misc] fix comment and variable name (#9139)

commit c0d9a98d0c7182b73c2e7f88508e690a186bf0e3
Author: Kuntai Du <kuntai@uchicago.edu>
Date:   Mon Oct 7 15:04:06 2024 -0700

    [Doc] Include performance benchmark in README (#9135)

commit e0dbdb013dfe5cdbe044317b4d7d55644d6399b3
Author: Russell Bryant <rbryant@redhat.com>
Date:   Mon Oct 7 17:18:10 2024 -0400

    [CI/Build] Add linting for github actions workflows (#7876)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 93cf74a8a7b0b483becdba95e3056adbf201b7b2
Author: TimWang <7367474+haitwang-cloud@users.noreply.github.com>
Date:   Tue Oct 8 04:31:45 2024 +0800

    [Doc]: Add deploying_with_k8s guide (#8451)

commit 151ef4efd2fb52554f4d30408aca619e181ea751
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Mon Oct 7 19:55:12 2024 +0800

    [Model] Support NVLM-D and fix QK Norm in InternViT (#9045)

    Co-authored-by: Roger Wang <ywang@roblox.com>
    Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

commit f19da64871065510691cd4fcaa5f4096b661dcec
Author: Isotr0py <2037008807@qq.com>
Date:   Mon Oct 7 18:01:46 2024 +0800

    [Core] Refactor GGUF parameters packing and forwarding (#8859)

commit 4f95ffee6f40198911ee824ed06d645fe9678511
Author: Isotr0py <2037008807@qq.com>
Date:   Mon Oct 7 14:50:35 2024 +0800

    [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (#9089)

commit 8c6de96ea1e6e51e49a170c28ad3efc16db9413e
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Mon Oct 7 14:10:35 2024 +0800

    [Model] Explicit interface for vLLM models and support OOT embedding models (#9108)

commit 18b296fdb2248e8a65bf005e7193ebd523b875b6
Author: youkaichao <youkaichao@gmail.com>
Date:   Sun Oct 6 22:47:04 2024 -0700

    [core] remove beam search from the core (#9105)

commit c8f26bb63694adb4202ab275efb0759c13edcaa8
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Sun Oct 6 20:52:42 2024 -0700

    [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103)

commit 487678d046fe56560ff5dc6c91c3f3c31af7de6f
Author: Isotr0py <2037008807@qq.com>
Date:   Mon Oct 7 10:14:27 2024 +0800

    [Bugfix][Hardware][CPU] Fix CPU model input for decode (#9044)

commit cb3b2b9ba4a95c413a879e30e2b8674187519a93
Author: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Date:   Sun Oct 6 15:48:11 2024 -0400

    [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (#9038)

    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

commit fdf59d30eaf1a62979b2a13016b4f47f28f12f88
Author: Yanyi Liu <wolfsonliu@163.com>
Date:   Sun Oct 6 20:51:08 2024 +0800

    [Bugfix] fix tool_parser error handling when serve a model not support it (#8709)

commit b22b79847153ae10710523cdb4a5fb98ac864cf4
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sun Oct 6 16:35:27 2024 +0800

    [Model] PP support for embedding models and update docs (#9090)

    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

commit f22619fe96c842ee2406638678d2b60009d8ff14
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sun Oct 6 16:33:52 2024 +0800

    [Misc] Remove user-facing error for removed VLM args (#9104)

commit 168cab6bbfb733f97defc8c1aa13df90c5319f19
Author: Brendan Wong <35351983+LunrEclipse@users.noreply.github.com>
Date:   Sat Oct 5 23:39:03 2024 -0700

    [Frontend] API support for beam search (#9087)

    Co-authored-by: youkaichao <youkaichao@126.com>

commit 23fea8714a1e90f018163e0eee59d73bc5a500e7
Author: TJian <tunjian1996@gmail.com>
Date:   Sat Oct 5 22:00:04 2024 -0700

    [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (#9101)

commit f4dd830e0945300dbe2039af79d1994f074ffcbb
Author: youkaichao <youkaichao@gmail.com>
Date:   Sat Oct 5 19:37:31 2024 -0700

    [core] use forward context for flash infer (#9097)

commit 5df183489537a155bbaad9232f25b8e57694d7b8
Author: Andy Dai <76841985+Imss27@users.noreply.github.com>
Date:   Sat Oct 5 10:35:11 2024 -0700

    [Bugfix] Fix order of arguments matters in config.yaml (#8960)

commit cfadb9c68798c0cc4d674de19970a8e3b5ea1273
Author: Chen Zhang <zhangch99@outlook.com>
Date:   Sat Oct 5 06:56:40 2024 -0700

    [Bugfix] Deprecate registration of custom configs to huggingface (#9083)

commit 15986f598c7b1f2969918c92f5c4cf7e28d5c0df
Author: Xin Yang <105740670+xyang16@users.noreply.github.com>
Date:   Fri Oct 4 23:57:05 2024 -0700

    [Model] Support Gemma2 embedding model (#9004)

commit 53b3a330273967a3c4124cbfef2cacac92f553ba
Author: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Date:   Fri Oct 4 22:05:37 2024 -0700

    [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (#8979)

commit dac914b0d6bc36de4eb4bf70a9d20954560893ea
Author: Chen Zhang <zhangch99@outlook.com>
Date:   Fri Oct 4 21:45:38 2024 -0700

    [Bugfix] use blockmanagerv1 for encoder-decoder (#9084)

    Co-authored-by: Roger Wang <ywang@roblox.com>

commit a95354a36ee65523a499b3eb42f70a4a0ea4322d
Author: Zhuohan Li <zhuohan123@gmail.com>
Date:   Fri Oct 4 19:54:45 2024 -0700

    [Doc] Update README.md with Ray summit slides (#9088)

commit 663874e048d88aa7bf087628430d50f9f5245175
Author: youkaichao <youkaichao@gmail.com>
Date:   Fri Oct 4 16:43:50 2024 -0700

    [torch.compile] improve allreduce registration (#9061)

commit cc90419e89c358f906e17a5ec484fbe04092c277
Author: Chongming Ni <chongmni@amazon.com>
Date:   Fri Oct 4 16:42:20 2024 -0700

    [Hardware][Neuron] Add on-device sampling support for Neuron (#8746)

    Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com>

commit 27302dd5841d4b0fa4788076ad9ff2993e133409
Author: Cody Yu <hao.yu.cody@gmail.com>
Date:   Fri Oct 4 16:07:54 2024 -0700

    [Misc] Fix CI lint (#9085)

commit 0cc566ca8fd2d21a94f3a8e48bf5c5b60d42b59f
Author: Andy Dai <76841985+Imss27@users.noreply.github.com>
Date:   Fri Oct 4 14:58:57 2024 -0700

    [Misc] Add random seed for prefix cache benchmark (#9081)

commit 05c531be476e8a864a1ab83a65f7e056315ea1fc
Author: Andy Dai <76841985+Imss27@users.noreply.github.com>
Date:   Fri Oct 4 14:38:42 2024 -0700

    [Misc] Improved prefix cache example (#9077)

commit fbb74420e7018bf0cc1bc81e6fd71a2392347227
Author: Kuntai Du <kuntai@uchicago.edu>
Date:   Fri Oct 4 14:01:44 2024 -0700

    [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (#7412)

commit 05d686432f2e13296127962861b21c25cdcdfc8b
Author: ElizaWszola <eliza@neuralmagic.com>
Date:   Fri Oct 4 20:34:44 2024 +0200

    [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (#8973)

    Co-authored-by: Dipika <dipikasikka1@gmail.com>
    Co-authored-by: Dipika Sikka <ds3822@columbia.edu>

commit 0dcc8cbe5abd4f2fafd495bd1c65fdd75d8dd919
Author: Flávia Béo <119421251+flaviabeo@users.noreply.github.com>
Date:   Fri Oct 4 15:31:40 2024 -0300

    Adds truncate_prompt_tokens param for embeddings creation (#8999)

    Signed-off-by: Flavia Beo <flavia.beo@ibm.com>

commit 26aa325f4ffe8bf1d9b921535cc02fb31d80a96d
Author: Roger Wang <136131678+ywang96@users.noreply.github.com>
Date:   Fri Oct 4 10:38:25 2024 -0700

    [Core][VLM] Test registration for OOT multimodal models (#8717)

    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit e5dc713c2343b3549b43d6e2764a1036e4052bf8
Author: Varad Ahirwadkar <86718090+varad-ahirwadkar@users.noreply.github.com>
Date:   Fri Oct 4 22:54:42 2024 +0530

    [Hardware][PowerPC] Make oneDNN dependency optional for Power (#9039)

    Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com>

commit 36eecfbddb9ac2c491174c86b28ee83c4773eb5e
Author: Simon Mo <simon.mo@hey.com>
Date:   Fri Oct 4 10:17:16 2024 -0700

    Remove AMD Ray Summit Banner (#9075)

commit 9ade8bbc8dc63c03b9399f05e85a0d0ddc6f5788
Author: Prashant Gupta <prashantgupta@us.ibm.com>
Date:   Fri Oct 4 09:24:40 2024 -0700

    [Model] add a bunch of supported lora modules for mixtral (#9008)

    Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>

commit 22482e495e00d409c9b5c78dade6e672ddf7fbc2
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Fri Oct 4 11:43:15 2024 -0400

    [Bugfix] Flash attention arches not getting set properly (#9062)

commit 3d826d2c52242f4f78789adcb7c02938c84ed18b
Author: whyiug <whyiug@hotmail.com>
Date:   Fri Oct 4 22:34:58 2024 +0800

    [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (#9071)

commit 0e36fd4909780392a9c5d0e367b0a84250d55fa8
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Fri Oct 4 18:01:37 2024 +0800

    [Misc] Move registry to its own file (#9064)

commit 0f6d7a9a347944bffd2204cbf9686299e9dd6557
Author: Murali Andoorveedu <37849411+andoorve@users.noreply.github.com>
Date:   Thu Oct 3 19:56:58 2024 -0700

    [Models] Add remaining model PP support (#7168)

    Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
    Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit 303d44790a2ccab86257f1b6097e67795f0845d4
Author: Michael Goin <michael@neuralmagic.com>
Date:   Thu Oct 3 22:55:42 2024 -0400

    [Misc] Enable multi-step output streaming by default (#9047)

commit aeb37c2a725554791ff6f258b1e18830867a3ab9
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Thu Oct 3 22:55:25 2024 -0400

    [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (#8845)

commit 3dbb215b38c010c050f7fde3528fe2c6673f7a07
Author: 代君 <sydnash@users.noreply.github.com>
Date:   Fri Oct 4 10:36:39 2024 +0800

    [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (#8405)

commit 2838d6b38e1e37b303b01f2af0a9ddee2dd66f39
Author: Domen Vreš <56541137+domenVres@users.noreply.github.com>
Date:   Fri Oct 4 01:53:29 2024 +0200

    [Bugfix] Weight loading fix for OPT model (#9042)

    Co-authored-by: dvres <dvres@fri.uni-lj.si>

commit 91add85ec409a3628d01a1e4d4b3230e0fd3aa3f
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Thu Oct 3 16:07:29 2024 -0700

    Fix failing spec decode test (#9054)

commit 9aaf14c62e16a7c74b5192a44d01a78125dab2fc
Author: youkaichao <youkaichao@gmail.com>
Date:   Thu Oct 3 12:09:42 2024 -0700

    [misc] add forward context for attention (#9029)

commit 63e39937f990818e2f22a9b821a4aa22387057a7
Author: xendo <xendoo@gmail.com>
Date:   Thu Oct 3 20:02:07 2024 +0200

    [Frontend] [Neuron] Parse literals out of override-neuron-config (#8959)

    Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com>

commit f5d72b2fc6771de19c351945f1fbbb0198d53b8e
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Thu Oct 3 09:44:21 2024 -0700

    [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (#8678)

commit 83caf35e082b2657dce5f71ff965a13653a763b0
Author: Guillaume Calmettes <guillaume.calmettes@gmail.com>
Date:   Thu Oct 3 10:44:52 2024 +0200

    [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (#9020)

commit 01843c89b8ddae00d4a0f0f56b8aa7fbaa3efc42
Author: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Date:   Wed Oct 2 23:31:07 2024 -0500

    [Misc] log when using default MoE config (#8971)

commit 19a4dd09904975d121a10e5e3f707927f3e09faa
Author: Travis Johnson <tsjohnso@us.ibm.com>
Date:   Wed Oct 2 21:04:17 2024 -0600

    [Bugfix] example template should not add parallel_tool_prompt if tools is none (#9007)

commit 18c2e30c5754dc83f86d9b8c75af0499a77e4b3f
Author: Nick Hill <nickhill@us.ibm.com>
Date:   Thu Oct 3 03:42:24 2024 +0100

    [Doc] Update Granite model docs (#9025)

commit 19f0d2579695e518c9bfc166544cf23775772bf8
Author: Shawn Tan <shawn@wtf.sg>
Date:   Wed Oct 2 21:33:57 2024 -0400

    [Model]  Adding Granite MoE. (#8206)

    Co-authored-by: Nick Hill <nickhill@us.ibm.com>

commit f58d4fccc9b270838be438f5f0db71bea156a56d
Author: Sergey Shlyapnikov <Sergeishlyapnikov@gmail.com>
Date:   Thu Oct 3 01:50:01 2024 +0400

    [OpenVINO] Enable GPU support for OpenVINO vLLM backend (#8192)

commit afb050b29d0cac27c32c19c8206a9ac2a4662de2
Author: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Date:   Wed Oct 2 15:44:39 2024 -0400

    [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645)

    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

commit 7f60520deb05d2e097b408e3310f1d383fbf1de6
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Wed Oct 2 05:44:38 2024 -0600

    [Misc] Update Default Image Mapper Error Log (#8977)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

commit 563649aafe7d4b9cb0047bba60d6f58efa53fd28
Author: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Date:   Wed Oct 2 03:52:20 2024 -0400

    [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (#8804)

    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
    Co-authored-by: Andrew Feldman <afeld2012@gmail.com>

commit 15702038642192002cd8973cf8948751b750fd07
Author: Lily Liu <lilyliupku@gmail.com>
Date:   Tue Oct 1 16:04:42 2024 -0700

    [Spec Decode] (1/2) Remove batch expansion (#8839)

commit 22f5851b807376a836eb3551903c7fc6c81eaa9b
Author: vlsav <vl_sav@mail.ru>
Date:   Tue Oct 1 21:07:06 2024 +0300

    Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (#8997)

commit 4f341bd4bf35c5b431dc523bab86e4ae210baaf8
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Wed Oct 2 00:35:39 2024 +0800

    [Doc] Update list of supported models (#8987)

commit 35bd2151684ffb20cdad825abe33e0e6f0cc005a
Author: Sebastian Schoennenbeck <sebastian.schoennenbeck@comma-soft.com>
Date:   Tue Oct 1 11:58:06 2024 +0200

    [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (#8965)

commit 1fe0a4264aa94ceeccc7e8d99ac0d72f0560f541
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Tue Oct 1 03:52:44 2024 -0600

    [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (#8991)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

commit bc4eb65b5492b4f84a1b714bfc14bcff73d401f1
Author: Isotr0py <2037008807@qq.com>
Date:   Tue Oct 1 17:51:41 2024 +0800

    [Bugfix] Fix Fuyu tensor parallel inference (#8986)

commit 82f3937e599a4f088a62e59abe81d51e11bb8f83
Author: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Date:   Mon Sep 30 22:46:41 2024 -0500

    [Misc] add process_weights_after_loading for DummyLoader (#8969)

commit 7da2487591888da043254f8c7045a48d5dbcc753
Author: youkaichao <youkaichao@gmail.com>
Date:   Mon Sep 30 20:40:48 2024 -0700

    [torch.compile] fix tensor alias (#8982)

commit aaccca2b4d3895d64d34b123e61731404c8fc2c0
Author: Kevin H. Luu <kevin@anyscale.com>
Date:   Mon Sep 30 20:33:12 2024 -0700

    [CI/Build] Fix machete generated kernel files ordering (#8976)

    Signed-off-by: kevin <kevin@anyscale.com>
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

commit 062c89e7c9c6fa9fd7fb2d28fd50321c6f78f389
Author: Joe Runde <Joseph.Runde@ibm.com>
Date:   Mon Sep 30 19:34:25 2024 -0600

    [Frontend][Core] Move guided decoding params into sampling params (#8252)

    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>

commit bce324487a8e36140143ea37f4b27d273a0fd661
Author: Lily Liu <lilyliupku@gmail.com>
Date:   Mon Sep 30 17:51:40 2024 -0700

    [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (#8975)

commit 1425a1bcf9c53e24fe5f4812acc5b656f2aa02f3
Author: Kevin H. Luu <kevin@anyscale.com>
Date:   Mon Sep 30 17:47:08 2024 -0700

    [ci] Add CODEOWNERS for test directories  (#8795)

    Signed-off-by: kevin <kevin@anyscale.com>

commit 1cabfcefb64a489c8ff9dcb289b4dd47cf8f89cf
Author: Jee Jee Li <pandaleefree@gmail.com>
Date:   Mon Sep 30 20:57:39 2024 +0800

    [Misc] Adjust max_position_embeddings for LoRA compatibility (#8957)

commit be76e5aabf8c026e1a82028ad70167e8c652cee9
Author: Sebastian Schoennenbeck <sebastian.schoennenbeck@comma-soft.com>
Date:   Mon Sep 30 14:28:44 2024 +0200

    [Core] Make scheduling policy settable via EngineArgs (#8956)

commit 2ae25f79cf1e8d21f7bcba097e4c039463c22be4
Author: Isotr0py <2037008807@qq.com>
Date:   Mon Sep 30 13:01:20 2024 +0800

    [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (#8946)

commit 8e60afa15eb9a0540ce6c453b974a945adff3320
Author: Jee Jee Li <pandaleefree@gmail.com>
Date:   Mon Sep 30 12:31:55 2024 +0800

    [Model][LoRA]LoRA support added for MiniCPMV2.6 (#8943)

    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit b6d7392579286b6dbd8ca96c0bcb4cc6f7c3c4a0
Author: Roger Wang <136131678+ywang96@users.noreply.github.com>
Date:   Sun Sep 29 21:28:26 2024 -0700

    [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]`  (#8951)

commit e01ab595d897698c9a5fe9eaebd983eb3e23470a
Author: whyiug <whyiug@hotmail.com>
Date:   Mon Sep 30 11:16:10 2024 +0800

    [Model] support input embeddings for qwen2vl (#8856)

commit f13a07b1f8c11ddbdc53b40f1fbb24bf3166b900
Author: Mor Zusman <mor.zusmann@gmail.com>
Date:   Mon Sep 30 00:35:58 2024 +0300

    [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (#8533)

commit 6c9ba48fdebe2f44c82eabfe136dc8dc6ad6f4ed
Author: danieljannai21 <100521221+danieljannai21@users.noreply.github.com>
Date:   Sun Sep 29 20:59:47 2024 +0300

    [Frontend] Added support for HF's new `continue_final_message` parameter (#8942)

commit 1fb9c1b0bf8e65e6576ff4c45f5623d233d7194b
Author: juncheoll <127460634+juncheoll@users.noreply.github.com>
Date:   Mon Sep 30 00:05:54 2024 +0900

    [Misc] Fix typo in BlockSpaceManagerV1 (#8944)

commit 31f46a0d35da80118bac5f80c533019cd50ddd9a
Author: Nick Hill <nickhill@us.ibm.com>
Date:   Sun Sep 29 10:43:14 2024 +0100

    [BugFix] Fix seeded random sampling with encoder-decoder models (#8870)

    Co-authored-by: Roger Wang <ywang@roblox.com>

commit 3d49776bbb25927abf91bb7c5537e0006c199c16
Author: Jee Jee Li <pandaleefree@gmail.com>
Date:   Sun Sep 29 14:59:45 2024 +0800

    [Model][LoRA]LoRA support added for MiniCPMV2.5 (#7199)

commit bc2ef1f77c1578612198f60ec392731efb3847c5
Author: Zilin Zhu <zilinzhu@tencent.com>
Date:   Sun Sep 29 12:19:39 2024 +0800

    [Model] Support Qwen2.5-Math-RM-72B (#8896)

commit 2e7fe7e79f41e294eeed2f484eeb791284ec48a2
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Sat Sep 28 23:13:01 2024 -0400

    [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (#8930)

commit 26a68d5d7e7dd47c7d8538a326493c8a171f5016
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sun Sep 29 10:50:51 2024 +0800

    [CI/Build] Add test decorator for minimum GPU memory (#8925)

commit d081da0064b5cda9e344f0fd519d67523a437a39
Author: ElizaWszola <eliza@neuralmagic.com>
Date:   Sun Sep 29 03:19:40 2024 +0200

    [Bugfix] Fix Marlin MoE act order when is_k_full == False (#8741)

    Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

commit 5bf8789b2a28df1305f92b9999fe60264f839caa
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Sat Sep 28 18:17:45 2024 -0700

    [Bugfix] Block manager v2 with preemption and lookahead slots (#8824)

commit d1537039ce7e6018db510d0c0d9b0c0fccb62b63
Author: Russell Bryant <rbryant@redhat.com>
Date:   Sat Sep 28 21:17:07 2024 -0400

    [Core] Improve choice of Python multiprocessing method (#8823)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>
    Co-authored-by: youkaichao <youkaichao@126.com>

commit cc276443b5ac0732b00a88472f4bc4330aa14606
Author: youkaichao <youkaichao@gmail.com>
Date:   Sat Sep 28 17:48:41 2024 -0700

    [doc] organize installation doc and expose per-commit docker (#8931)

commit e585b583a92903c9a5cc8055a444a208f4387891
Author: Chen Zhang <zhangch99@outlook.com>
Date:   Sat Sep 28 11:51:22 2024 -0700

    [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (#8891)

commit 090e945e36cfe849b484db5414f64df96e97d678
Author: Edouard B. <eduard.r.balzin@gmail.com>
Date:   Sat Sep 28 20:30:21 2024 +0200

    [Frontend] Make beam search emulator temperature modifiable (#8928)

    Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr>

commit e1a3f5e831a467b2867a66e0e56ac0f70ed44394
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sun Sep 29 00:54:35 2024 +0800

    [CI/Build] Update models tests & examples (#8874)

    Co-authored-by: Roger Wang <ywang@roblox.com>

commit 19d02ff93812fb6a28f0f1a0a0f9233e9388d616
Author: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Date:   Sat Sep 28 11:52:46 2024 -0400

    [Bugfix] Fix PP for Multi-Step (#8887)

commit 39d3f8d94fd2691b70ee809e7565402f8a061c6b
Author: tastelikefeet <58414341+tastelikefeet@users.noreply.github.com>
Date:   Sat Sep 28 23:24:12 2024 +0800

    [Bugfix] Fix code for downloading models from modelscope (#8443)

commit b0298aa8cc4a54bde659e57271778630785abc9b
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sat Sep 28 16:11:25 2024 +0800

    [Misc] Remove vLLM patch of `BaichuanTokenizer` (#8921)

commit 260024a3749fb6856625dfee28560a98a92dd339
Author: Tyler Titsworth <titswortht@gmail.com>
Date:   Fri Sep 27 23:45:50 2024 -0700

    [Bugfix][Intel] Fix XPU Dockerfile Build (#7824)

    Signed-off-by: tylertitsworth <tyler.titsworth@intel.com>
    Co-authored-by: youkaichao <youkaichao@126.com>

commit d86f6b2afb006ea4b4b14a49a58f64bf3b952de6
Author: youkaichao <youkaichao@gmail.com>
Date:   Fri Sep 27 22:10:44 2024 -0700

    [misc] fix wheel name (#8919)

commit bd429f2b75f3622fabaf9c9470ca2e921f6f56ca
Author: Sebastian Schoennenbeck <sebastian.schoennenbeck@comma-soft.com>
Date:   Sat Sep 28 00:07:10 2024 +0200

    [Core] Priority-based scheduling in async engine (#8850)

commit 18e60d7d1394541b48bf48b0a57a546a93607ac2
Author: youkaichao <youkaichao@gmail.com>
Date:   Fri Sep 27 14:27:56 2024 -0700

    [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (#8911)

commit c2ec430ab5713d0626c1a7809718ef6c4eebf389
Author: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Date:   Fri Sep 27 16:32:07 2024 -0400

    [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)

    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>

commit c5d55356f9d2b2075ac53cf20453358c1e2b7bde
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Fri Sep 27 15:12:34 2024 -0400

    [Bugfix] fix for deepseek w4a16 (#8906)

    Co-authored-by: mgoin <michael@neuralmagic.com>

commit 172d1cd27634e9e7adc9cb9feac73552cfae1b24
Author: Luka Govedič <ProExpertProg@users.noreply.github.com>
Date:   Fri Sep 27 14:25:10 2024 -0400

    [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (#7271)

commit a9b15c606fea67a072416ea0ea115261a2756058
Author: youkaichao <youkaichao@gmail.com>
Date:   Fri Sep 27 08:11:32 2024 -0700

    [torch.compile] use empty tensor instead of None for profiling (#8875)

commit 8df2dc3c8812c0abb97ce3e2913411d88524e59f
Author: Brittany <24945384+bvrockwell@users.noreply.github.com>
Date:   Fri Sep 27 01:16:55 2024 -0700

    [TPU] Update pallas.py to support trillium (#8871)

commit 6d792d2f31b2cfb335d1a4a7c45fe4ce143c203a
Author: Isotr0py <2037008807@qq.com>
Date:   Fri Sep 27 16:15:58 2024 +0800

    [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (#8892)

commit 0e088750af2e8035c07d356b56c03393cfb56004
Author: Peter Pan <peter.pan@daocloud.io>
Date:   Fri Sep 27 16:13:25 2024 +0800

    [MISC] Fix invalid escape sequence '\' (#8830)

    Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

commit dc4e3df5c23282b2ebaead95f179c25c9d7ec4d8
Author: youkaichao <youkaichao@gmail.com>
Date:   Fri Sep 27 00:26:38 2024 -0700

    [misc] fix collect env (#8894)

commit 3b00b9c26c91e9f9ada12975b613555698054e39
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Fri Sep 27 11:35:15 2024 +0800

    [Core] Rename `PromptInputs` and `inputs` (#8876)

commit 344cd2b6f4c22bf278cff96066001d216ec1fe82
Author: Maximilien de Bayser <mbayser@br.ibm.com>
Date:   Thu Sep 26 21:01:42 2024 -0300

    [Feature] Add support for Llama 3.1 and 3.2 tool use (#8343)

    Signed-off-by: Max de Bayser <mbayser@br.ibm.com>

commit 1b49148e474d4d18731e159ea0460145ae52e220
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Fri Sep 27 07:54:09 2024 +0800

    [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (#8764)

commit 4b377d6febed7ddd964f1b96079d7e78c231325e
Author: Nick Hill <nickhill@us.ibm.com>
Date:   Fri Sep 27 00:46:43 2024 +0100

    [BugFix] Fix test breakages from transformers 4.45 upgrade (#8829)

commit 71d21c73abfb9b12ea402ce6b11c1b8e31eddf4c
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Thu Sep 26 19:23:45 2024 -0400

    [Bugfix] Fixup advance_step.cu warning (#8815)

commit ee2da3e9efb38add804e2023d47e9f42f38bd638
Author: Chirag Jain <jain.chirag925@gmail.com>
Date:   Fri Sep 27 04:53:17 2024 +0530

    fix validation: Only set tool_choice `auto` if at least one tool is provided (#8568)

commit e2f6f26e8636b8a23e5c0cda533a70c40ade01ec
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Thu Sep 26 19:18:26 2024 -0400

    [Bugfix] Fix print_warning_once's line info (#8867)

commit b28d2104dea6ba80c0f1f6c4596b5703d7ef923d
Author: Michael Goin <michael@neuralmagic.com>
Date:   Thu Sep 26 19:18:14 2024 -0400

    [Misc] Change dummy profiling and BOS fallback warns to log once (#8820)

commit 93d364da3406f5523e5e4772ffbc3c72dac7bbf4
Author: Pernekhan Utemuratov <pernekhan@deepinfra.com>
Date:   Thu Sep 26 15:47:00 2024 -0700

    [Bugfix] Include encoder prompts len to non-stream api usage response (#8861)

commit d9cfbc891e2e1d62d74c7aae93bde436a29bd574
Author: Kevin H. Luu <kevin@anyscale.com>
Date:   Thu Sep 26 15:02:16 2024 -0700

    [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (#8872)

    Signed-off-by: kevin <kevin@anyscale.com>

commit 70de39f6b46f6b90aecba52358825127a50b3921
Author: youkaichao <youkaichao@gmail.com>
Date:   Thu Sep 26 13:19:04 2024 -0700

    [misc][installation] build from source without compilation (#8818)

commit 68988d4e0d8765901c51f07f9bfbda58f35f6f63
Author: fyuan1316 <yuanfang@alauda.io>
Date:   Fri Sep 27 02:04:39 2024 +0800

    [CI/Build] Fix missing ci dependencies (#8834)

commit 520db4dbc10cfc60be65e85ff4ef3a6aeeeb7836
Author: Michael Goin <michael@neuralmagic.com>
Date:   Thu Sep 26 14:02:52 2024 -0400

    [Docs] Add README to the build docker image (#8825)

commit f70bccac75a0aecc0a5fc934859158a3e1f019a5
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Thu Sep 26 13:07:18 2024 -0400

    [Build/CI] Upgrade to gcc 10 in the base build Docker image (#8814)

commit 4bb98f2190aaf408cb063df5184829fb54ee5f81
Author: Roger Wang <136131678+ywang96@users.noreply.github.com>
Date:   Thu Sep 26 07:45:30 2024 -0700

    [Misc] Update config loading for Qwen2-VL and remove Granite (#8837)

commit 7193774b1ff8603ad5bf4598e5efba0d9a39b436
Author: Michael Goin <michael@neuralmagic.com>
Date:   Wed Sep 25 17:46:22 2024 -0400

    [Misc] Support quantization of MllamaForCausalLM (#8822)

commit e2c6e0a8291126c868b669f631837c7781646fdc
Author: Roger Wang <136131678+ywang96@users.noreply.github.com>
Date:   Wed Sep 25 13:29:48 2024 -0700

    [Doc] Update doc for Transformers 4.45 (#8817)

commit 770ec6024fc00cd696899f5c6fdc53b7148876e6
Author: Chen Zhang <zhangch99@outlook.com>
Date:   Wed Sep 25 13:29:32 2024 -0700

    [Model] Add support for the multi-modal Llama 3.2 model (#8811)

    Co-authored-by: simon-mo <xmo@berkeley.edu>
    Co-authored-by: Chang Su <chang.s.su@oracle.com>
    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>

commit 4f1ba0844b83b4e7d0ff1672b7ba502ce8732f95
Author: Simon Mo <simon.mo@hey.com>
Date:   Wed Sep 25 10:36:26 2024 -0700

    Revert "rename PromptInputs and inputs with backward compatibility (#8760)" (#8810)

commit 873edda6cf8a2902e8b08eea0bf8f8f6d73704a8
Author: Michael Goin <michael@neuralmagic.com>
Date:   Wed Sep 25 12:43:36 2024 -0400

    [Misc] Support FP8 MoE for compressed-tensors (#8588)

commit 64840dfae48621c5c2004eb8f1cb7fba49f9b24e
Author: 科英 <abatom@163.com>
Date:   Thu Sep 26 00:37:41 2024 +0800

    [Frontend] MQLLMEngine supports profiling. (#8761)

commit 28e1299e60e565a56a2db41396380f74b8d29e57
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Thu Sep 26 00:36:47 2024 +0800

    rename PromptInputs and inputs with backward compatibility (#8760)

commit 0c4d2ad5e641de145682674066a84ffc632e714e
Author: DefTruth <31974251+DefTruth@users.noreply.github.com>
Date:   Thu Sep 26 00:35:53 2024 +0800

    [VLM][Bugfix] internvl with num_scheduler_steps > 1 (#8614)

commit c6f2485c823b5cd76cca70798e653c6eadb811de
Author: Jee Jee Li <pandaleefree@gmail.com>
Date:   Thu Sep 26 00:35:23 2024 +0800

    [Misc] Add extra deps for openai server image (#8792)

commit 300da09177477d0a4d2b55790addefd971f52ae0
Author: bnellnm <49004751+bnellnm@users.noreply.github.com>
Date:   Wed Sep 25 10:35:52 2024 -0400

    [Kernel] Fullgraph and opcheck tests (#8479)

commit 1c046447a6d1ac3c99b9f453796f0d355d673deb
Author: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Date:   Wed Sep 25 10:26:37 2024 -0400

    [CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade (#8777)

commit 8fae5ed7f6bfd63b81310fcb24b310d9205c9687
Author: Woo-Yeon Lee <wooyeon0.lee@samsung.com>
Date:   Wed Sep 25 16:53:03 2024 +0900

    [Misc] Fix minor typo in scheduler (#8765)

commit 3368c3ab36436af1342a3156971412e9efdb6419
Author: David Newman <darthhexx@gmail.com>
Date:   Wed Sep 25 17:52:26 2024 +1000

    [Bugfix] Ray 2.9.x doesn't expose available_resources_per_node (#8767)

    Signed-off-by: darthhexx <darthhexx@gmail.com>

commit 1ac3de09cd87290f7494ce6337623d6edd3f8667
Author: Adam Tilghman <agt@ucsd.edu>
Date:   Wed Sep 25 00:49:26 2024 -0700

    [Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer (#8672)

commit 3e073e66f1790f7ce339dad71514983e6e402f30
Author: sohamparikh <sohamparikh47@gmail.com>
Date:   Wed Sep 25 02:16:30 2024 -0400

    [Bugfix] load fc bias from config for eagle (#8790)

commit c23953675f78bc85045d66fa98aea7d0581c2167
Author: Isotr0py <2037008807@qq.com>
Date:   Wed Sep 25 14:16:11 2024 +0800

    [Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend (#8770)

commit e3dd0692fa2c803cd6f59a88d2fdf8bca26d8d96
Author: zifeitong <zifeitong@gmail.com>
Date:   Tue Sep 24 22:53:43 2024 -0700

    [BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv (#8250)

commit fc3afc20df410dd523f94967b98836084f561ab7
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Tue Sep 24 21:26:36 2024 -0700

    Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 (#8752)

commit b4522474a32b6e0bf5573a9b6a6830cb787dfb63
Author: sasha0552 <admin@sasha0552.org>
Date:   Wed Sep 25 04:26:33 2024 +0000

    [Bugfix][Kernel] Implement acquire/release polyfill for Pascal (#8776)

commit ee777d9c30418ffa9d98f98dd27c0ddea346c49c
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Tue Sep 24 21:26:18 2024 -0700

    Fix test_schedule_swapped_simple in test_scheduler.py (#8780)

commit 6e0c9d6bd07464b311eb098e2dac8196eed16721
Author: Joe Runde <Joseph.Runde@ibm.com>
Date:   Tue Sep 24 21:37:38 2024 -0600

    [Bugfix] Use heartbeats instead of health checks (#8583)

commit 6da1ab6b4134d76391a0c31a048e5d04b6283769
Author: Archit Patke <apatke@illinois.edu>
Date:   Tue Sep 24 21:50:50 2024 -0500

    [Core] Adding Priority Scheduling (#5958)

commit 01b6f9e1f0530a7cb81486ff34d3d935e4f75d28
Author: Travis Johnson <tsjohnso@us.ibm.com>
Date:   Tue Sep 24 18:29:56 2024 -0600

    [Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)

    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>

commit 13f9f7a3d0373421ee9fd7498e450214e134aa6c
Author: Jee Jee Li <pandaleefree@gmail.com>
Date:   Wed Sep 25 08:08:55 2024 +0800

    [Misc] Upgrade bitsandbytes to the latest version 0.44.0 (#8768)

commit 1e7d5c01f5c35424eede1bbe6f723dd8781120f0
Author: youkaichao <youkaichao@gmail.com>
Date:   Tue Sep 24 15:48:39 2024 -0700

    [misc] soft drop beam search (#8763)

commit 2467b642dd9bde32a334fe5967efd78a53aa49da
Author: Daniele <36171005+dtrifiro@users.noreply.github.com>
Date:   Tue Sep 24 21:38:12 2024 +0200

    [CI/Build] fix setuptools-scm usage (#8771)

commit 72fc97a0f100b92f1ff6c6a16e27d12f1c7569aa
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Tue Sep 24 14:33:21 2024 -0400

    [Bugfix] Fix torch dynamo fixes caused by `replace_parameters` (#8748)

commit 2529d09b5a4a124a316b6976e7d782f54e0bddde
Author: Andy <37781802+aandyw@users.noreply.github.com>
Date:   Tue Sep 24 12:44:11 2024 -0400

    [Frontend] Batch inference for llm.chat() API  (#8648)

    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
    Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
    Co-authored-by: Roger Wang <ywang@roblox.com>
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>

commit a928ded99519f803d4cf6389df6acc707239a5cc
Author: ElizaWszola <eliza@neuralmagic.com>
Date:   Tue Sep 24 18:31:42 2024 +0200

    [Kernel] Split Marlin MoE kernels into multiple files (#8661)

    Co-authored-by: mgoin <michael@neuralmagic.com>

commit cc4325b66ac49e403ed9e1a8c38156a5324e1174
Author: Hanzhi Zhou <hanzhi713@gmail.com>
Date:   Tue Sep 24 01:08:14 2024 -0700

    [Bugfix] Fix potentially unsafe custom allreduce synchronization (#8558)

commit 8ff7ced996d5dc8b682913471f36c9fefb0e843f
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Tue Sep 24 01:36:46 2024 -0600

    [Model] Expose Phi3v num_crops as a mm_processor_kwarg (#8658)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit 3f06bae9079ee495a34cfadcd9c1ef2a23636084
Author: Peter Salas <peter@fixie.ai>
Date:   Tue Sep 24 00:14:15 2024 -0700

    [Core][Model] Support loading weights by ID within models (#7931)

commit b8747e8a7c318ab774862f94ccbdbba5b7d9dd4a
Author: Cody Yu <hao.yu.cody@gmail.com>
Date:   Mon Sep 23 23:10:03 2024 -0700

    [MISC] Skip dumping inputs when unpicklable (#8744)

commit 3185fb0ccae73816018d0936c03171b7cf1ba2f8
Author: Simon Mo <simon.mo@hey.com>
Date:   Mon Sep 23 22:45:20 2024 -0700

    Revert "[Core] Rename `PromptInputs` to `PromptType`, and `inputs` to `prompt`" (#8750)

commit 0250dd68c5df12ead29d2ec7d922855c9a257b06
Author: youkaichao <youkaichao@gmail.com>
Date:   Mon Sep 23 22:08:12 2024 -0700

    re-implement beam search on top of vllm core (#8726)

    Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>

commit 88577ac92808cfd9468e4b54b757d5fcbe9aa486
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Mon Sep 23 21:43:13 2024 -0700

    Fix tests in test_scheduler.py that fail with BlockManager V2 (#8728)

commit 530821d00cb2beeb8dc62f74f0e4e0003868dc93
Author: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Date:   Mon Sep 23 21:52:39 2024 -0400

    [Hardware][AMD] ROCm6.2 upgrade (#8674)

commit 1a2aef3e59f5429299618bd3b242833cb377f554
Author: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Date:   Mon Sep 23 18:38:04 2024 -0400

    Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse (#8335)

commit 5f7bb584272ee15147a411b887e7ababd6b9b9d0
Author: jiqing-feng <107918818+jiqing-feng@users.noreply.github.com>
Date:   Tue Sep 24 03:32:27 2024 +0800

    Fix typical acceptance sampler with correct recovered token ids (#8562)

commit b05f5c9238c3e0c3a98080b4ffc90acfa33f9e1f
Author: Russell Bryant <rbryant@redhat.com>
Date:   Mon Sep 23 15:15:41 2024 -0400

    [Core] Allow IPv6 in VLLM_HOST_IP with zmq (#8575)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 9b0e3ec970f6a19427be358848a2ed663fd735e1
Author: Jee Jee Li <pandaleefree@gmail.com>
Date:   Tue Sep 24 02:57:42 2024 +0800

    [Kernel][LoRA]  Add assertion for punica sgmv kernels (#7585)

commit 86e9c8df29a954a7a2fc46e9985fecc2a2e15ae8
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Mon Sep 23 13:46:26 2024 -0400

    [Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin (#7701)

    Co-authored-by: mgoin <michael@neuralmagic.com>
    Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
    Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

commit ee5f34b1c2c71b2d56054a5ca23fe1c50c1458bb
Author: Daniele <36171005+dtrifiro@users.noreply.github.com>
Date:   Mon Sep 23 18:44:26 2024 +0200

    [CI/Build] use setuptools-scm to set __version__ (#4738)

    Co-authored-by: youkaichao <youkaichao@126.com>

commit f2bd246c17ba67d7749a2560a30711f74cd19177
Author: Jani Monoses <jani.monoses@gmail.com>
Date:   Mon Sep 23 17:43:09 2024 +0300

    [VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size (#8707)

commit a79e5229843e2800956956d0668b1b4858dbb61e
Author: Yanyi Liu <wolfsonliu@163.com>
Date:   Mon Sep 23 21:46:59 2024 +0800

    [Model] Support pp for qwen2-vl (#8696)

commit 3e83c12b5caa466bf533b144a9ec7944a9ce9d49
Author: Li, Jiang <jiang1.li@intel.com>
Date:   Mon Sep 23 21:15:16 2024 +0800

    [Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner (#8733)

commit e551ca1555b64ba1ecb2310ea658f3e25c62571d
Author: Isotr0py <2037008807@qq.com>
Date:   Mon Sep 23 20:12:20 2024 +0800

    [Hardware][CPU] Refactor CPU model runner (#8729)

commit 9b8c8ba1198cbcd311d28b7647f0f8d5dcdc9212
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Mon Sep 23 01:44:48 2024 -0600

    [Core][Frontend] Support Passing Multimodal Processor Kwargs (#8657)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

commit d23679eb9960ad2a876b88ebd0028dbe55c3172a
Author: Yan Ma <yan.ma@intel.com>
Date:   Mon Sep 23 13:54:18 2024 +0800

    [Bugfix] fix docker build for xpu (#8652)

commit 57a0702e63d9dc477ab7a82e686a30d14fb6c69d
Author: Luka Govedič <ProExpertProg@users.noreply.github.com>
Date:   Sun Sep 22 23:40:46 2024 -0400

    [Bugfix] Fix CPU CMake build (#8723)

    Co-authored-by: Yuan <yuan.zhou@intel.com>

commit 3dda7c22502033854e963fef3826c1f64627e33b
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Sun Sep 22 22:24:59 2024 -0400

    [Bugfix] Avoid some bogus messages RE CUTLASS's revision when building (#8702)

commit 92ba7e7477619ec81464ccb64a17226f3d5047bb
Author: youkaichao <youkaichao@gmail.com>
Date:   Sun Sep 22 15:41:59 2024 -0700

    [misc] upgrade mistral-common (#8715)

commit d4a2ac830291305f202a85e157bff3a07b58e616
Author: youkaichao <youkaichao@gmail.com>
Date:   Sun Sep 22 12:47:54 2024 -0700

    [build] enable existing pytorch (for GH200, aarch64, nightly) (#8713)

commit c6bd70d7728b50f358cb5cb6e66e02b75aeb3d20
Author: Lily Liu <lilyliupku@gmail.com>
Date:   Sun Sep 22 12:34:14 2024 -0700

    [SpecDec][Misc] Cleanup, remove bonus token logic. (#8701)

commit 5b59532760c82a9d91f65a3e227524da2af7d4ef
Author: litianjian <45817262+litianjian@users.noreply.github.com>
Date:   Mon Sep 23 01:51:44 2024 +0800

    [Model][VLM] Add LLaVA-Onevision model support (#8486)

    Co-authored-by: litianjian <litianjian@bytedance.com>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

commit ca2b628b3c25b014b9951731c0331b75262a59e0
Author: Huazhong Ji <hzji210@gmail.com>
Date:   Mon Sep 23 01:44:09 2024 +0800

    [MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler (#8703)

commit 8ca5051b9afb6f8d2b3ae1b71d45d84e5d1c6f57
Author: Alex Brooks <alex.brooks@ibm.com>
Date:   Sun Sep 22 06:56:20 2024 -0600

    [Misc] Use NamedTuple in Multi-image example (#8705)

    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>

commit 06ed2815e2be50e527839c7ab09ce2639b7910b6
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sun Sep 22 20:24:21 2024 +0800

    [Model] Refactor BLIP/BLIP-2 to support composite model loading (#8407)

commit 0e40ac9b7b5d953dfe38933bc7d2fb0a6c8da53c
Author: youkaichao <youkaichao@gmail.com>
Date:   Sat Sep 21 23:24:58 2024 -0700

    [ci][build] fix vllm-flash-attn (#8699)

commit 13d88d4137f97b8cf3c79f39d7df5e4c8348603a
Author: Isotr0py <2037008807@qq.com>
Date:   Sun Sep 22 12:33:27 2024 +0800

    [Bugfix] Refactor composite weight loading logic (#8656)

commit d66ac62854e04c8fda83506dc93ef7971ebf593a
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Sat Sep 21 19:45:02 2024 -0400

    [Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu (#8643)

commit 9dc7c6c7f332ac6c08311c7a946c6945e0782701
Author: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Date:   Sat Sep 21 16:09:39 2024 -0500

    [dbrx] refactor dbrx experts to extend FusedMoe class (#8518)

commit ec4aaad8124baadc7954e30c612ca9444b22d7e7
Author: rasmith <Randall.Smith@amd.com>
Date:   Sat Sep 21 04:20:54 2024 -0500

    [Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 (#8646)

commit 4dfdf4319676c3dca72cdfba20470ac76d0cadf4
Author: Andy Dai <76841985+Imss27@users.noreply.github.com>
Date:   Sat Sep 21 00:24:12 2024 -0700

    [Doc] Fix typo in AMD installation guide (#8689)

commit 5e85f4f82a5b6eaad6869198d6ac76a0c12cf6d0
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sat Sep 21 14:28:56 2024 +0800

    [VLM] Use `SequenceData.from_token_counts` to create dummy data (#8687)

commit 71c60491f287d8a23bed1743513b4b3e7927c69e
Author: Luka Govedič <ProExpertProg@users.noreply.github.com>
Date:   Sat Sep 21 02:27:10 2024 -0400

    [Kernel] Build flash-attn from source (#8245)

commit 0faab90eb006c677add65cd4c2d0f740a63e064d
Author: youkaichao <youkaichao@gmail.com>
Date:   Fri Sep 20 19:55:33 2024 -0700

    [beam search] add output for manually checking the correctness (#8684)

commit 0455c46ed434d70f0a6219204e89ee04f1d01336
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sat Sep 21 10:30:39 2024 +0800

    [Core] Factor out common code in `SequenceData` and `Sequence` (#8675)

commit d4bf085ad064ba68a77862e2022f37c33a66e94a
Author: Kunshang Ji <kunshang.ji@intel.com>
Date:   Sat Sep 21 10:03:55 2024 +0800

    [MISC] add support custom_op check (#8557)

    Co-authored-by: youkaichao <youkaichao@126.com>

commit 0057894ef7f8db0d51385aa7254219d7fbd6c784
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sat Sep 21 10:00:54 2024 +0800

    [Core] Rename `PromptInputs` and `inputs` (#8673)

commit 0f961b3ce9ac3d3fd13e201c4358884bc094905e
Author: zyddnys <zyddnys@outlook.com>
Date:   Fri Sep 20 18:48:32 2024 -0400

    [Bugfix] Fix incorrect llava next feature size calculation (#8496)

commit 7f9c8902e3d50a9d715b38e0531280a58d2bbe14
Author: omrishiv <327609+omrishiv@users.noreply.github.com>
Date:   Fri Sep 20 15:19:44 2024 -0700

    [Hardware][AWS] update neuron to 2.20 (#8676)

    Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

commit 7c8566aa4ff16b79a576436fbb50f03643febf07
Author: omrishiv <327609+omrishiv@users.noreply.github.com>
Date:   Fri Sep 20 15:04:37 2024 -0700

    [Doc] neuron documentation update (#8671)

    Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>

commit b4e4eda92e1d3a013fc4007db64b69d8604264ff
Author: Patrick von Platen <patrick.v.platen@gmail.com>
Date:   Fri Sep 20 23:33:03 2024 +0200

    [Bugfix][Core] Fix tekken edge case for mistral tokenizer (#8640)

commit 2874bac618052a079efd837fc82cf3f3519079c7
Author: Pastel！ <1627301104@qq.com>
Date:   Sat Sep 21 05:00:45 2024 +0800

    [Bugfix] Config got an unexpected keyword argument 'engine' (#8556)

commit 035fa895ecedea87810889aabbe50ba8a2ad7d5d
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Sat Sep 21 04:52:19 2024 +0800

    [Misc] Show AMD GPU topology in `collect_env.py` (#8649)

commit b28298f2f4bd4ec6d1020c10b923a9eb7993dc89
Author: saumya-saran <saumya.saran@c3.ai>
Date:   Fri Sep 20 12:46:02 2024 -0700

    [Bugfix] Validate SamplingParam n is an int (#8548)

commit 2940afa04e39fa9f248c565687d9a2acf7401355
Author: Alexey Kondratiev(AMD) <143633163+alexeykondrat@users.noreply.github.com>
Date:   Fri Sep 20 13:27:44 2024 -0400

    [CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build (#8670)

commit 3b63de9353ce51ba6c1c167ae8d4b87b8bcf9c9e
Author: Niklas Muennighoff <n.muennighoff@gmail.com>
Date:   Fri Sep 20 09:31:41 2024 -0700

    [Model] Add OLMoE (#7922)

commit 260d40b5ea48df9421325388abcc8d907a560fc5
Author: Jiaxin Shan <seedjeffwan@gmail.com>
Date:   Thu Sep 19 23:20:56 2024 -0700

    [Core] Support Lora lineage and base model metadata management (#6315)

commit 9e5ec35b1f8239453b1aaab28e7a02307db4ab1f
Author: William Lin <SolitaryThinker@users.noreply.github.com>
Date:   Thu Sep 19 20:49:54 2024 -0700

    [bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata (#8474)

commit 18ae428a0d8792d160d811a9cd5bb004d68ea8bd
Author: Amit Garg <mitgarg17495@gmail.com>
Date:   Thu Sep 19 17:54:02 2024 -0700

    [Bugfix] Fix Phi3.5 mini and MoE LoRA inference (#8571)

commit de6f90a13d7b98c4958ba107ec16cb6f95efb10f
Author: bnellnm <49004751+bnellnm@users.noreply.github.com>
Date:   Thu Sep 19 18:36:30 2024 -0400

    [Misc] guard against change in cuda library name (#8609)

commit 6cb748e190a94e20987314025614b8bd806602f2
Author: Alexey Kondratiev(AMD) <143633163+alexeykondrat@users.noreply.github.com>
Date:   Thu Sep 19 16:06:32 2024 -0400

    [CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail (#8551)

commit 9e99407e3ccbb290bae77af230da38c70a52a055
Author: Simon Mo <simon.mo@hey.com>
Date:   Thu Sep 19 12:16:28 2024 -0700

    Create SECURITY.md (#8642)

commit ea4647b7d77c4738c5ed2ab77a2c9f5ad335f6fb
Author: Isotr0py <2037008807@qq.com>
Date:   Fri Sep 20 03:15:55 2024 +0800

    [Doc] Add documentation for GGUF quantization (#8618)

commit e42c634acbd1b86b5becca51e8b8108a32a438d5
Author: 盏一 <w@hidva.com>
Date:   Fri Sep 20 02:28:25 2024 +0800

    [Core] simplify logits resort in _apply_top_k_top_p (#8619)

commit 9cc373f39036af789fb1ffc1e06b23766996d3f4
Author: Charlie Fu <charlifu@amd.com>
Date:   Thu Sep 19 12:37:57 2024 -0500

    [Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577)

commit 76515f303b44cb3ffc6de63c49148d5081a77119
Author: Nick Hill <nickhill@us.ibm.com>
Date:   Thu Sep 19 17:51:06 2024 +0100

    [Frontend] Use MQLLMEngine for embeddings models too (#8584)

commit 855c8ae2c9a4085b1ebd66d9a978fb23f47f822c
Author: Kunshang Ji <kunshang.ji@intel.com>
Date:   Thu Sep 19 13:33:20 2024 +0800

    [MISC] remove engine_use_ray in benchmark_throughput.py (#8615)

commit c52ec5f03471008fa1312d82fb17d40b95a3ca5d
Author: Kuntai Du <kuntai@uchicago.edu>
Date:   Wed Sep 18 22:24:24 2024 -0700

    [Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616)

commit 02c9afa2d04a85269faa2760e9af30527a61d7f6
Author: Roger Wang <136131678+ywang96@users.noreply.github.com>
Date:   Wed Sep 18 21:14:28 2024 -0700

    Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593)

commit 3118f63385c0d767fba8b6d2039fc35440678da9
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Wed Sep 18 19:24:15 2024 -0700

    [Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models.  (#8545)

commit 4c34ce8916da0e4967eadefcb7f91eb58dd7ac61
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Wed Sep 18 21:42:49 2024 -0400

    [Kernel] Remove marlin moe templating on thread_m_blocks (#8573)

    Co-authored-by: lwilkinson@neuralmagic.com

commit 0d47bf3bf40edfe9fcfd7e5cd909388497535bc5
Author: Joe Runde <Joseph.Runde@ibm.com>
Date:   Wed Sep 18 16:10:01 2024 -0600

    [Bugfix] add `dead_error` property to engine client (#8574)

    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

commit d9cd78eb718c233ebc5b84377fc2226af7ef0fa2
Author: Nick Hill <nickhill@us.ibm.com>
Date:   Wed Sep 18 21:17:55 2024 +0100

    [BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572)

commit db9120cdedba5033037432775417df0b6117495d
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Wed Sep 18 16:05:06 2024 -0400

    [Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039)

commit b3195bc9e4d57b6107af2222afea26c51475e262
Author: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Date:   Wed Sep 18 13:41:08 2024 -0400

    [AMD][ROCm] Quantization methods on ROCm; Fix _scaled_mm call (#8380)

    Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
    Co-authored-by: Michael Goin <michael@neuralmagic.com>

commit e18749ff09c277f7cdab278895ebdd9b1041b6e8
Author: Geun, Lim <shing100@Naver.com>
Date:   Thu Sep 19 02:04:00 2024 +0900

    [Model] Support Solar Model (#8386)

    Co-authored-by: Michael Goin <michael@neuralmagic.com>

commit d65798f78c76f03f068fc2f69a68cff430ee6b6f
Author: Russell Bryant <rbryant@redhat.com>
Date:   Wed Sep 18 12:10:27 2024 -0400

    [Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit a8c1d161a7d87dbc6c7cccfce303dcbe2e4ed6be
Author: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
Date:   Wed Sep 18 11:38:43 2024 -0400

    [Core] *Prompt* logprobs support in Multi-step (#8199)

commit 7c7714d856eee6fa94aade729b67f00584f72a4c
Author: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Date:   Wed Sep 18 09:56:58 2024 -0400

    [Core][Bugfix][Perf] Introduce `MQLLMEngine` to avoid `asyncio` OH (#8157)

    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
    Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
    Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
    Co-authored-by: Simon Mo <simon.mo@hey.com>

commit 9d104b5beb7bbb51c64b680e007f39169489ea86
Author: Aaron Pham <contact@aarnphm.xyz>
Date:   Wed Sep 18 07:00:56 2024 -0400

    [CI/Build] Update Ruff version (#8469)

    Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

commit 6ffa3f314c59e42238f1c5f923ff2839e0af9698
Author: Cyrus Leung <tlleungac@connect.ust.hk>
Date:   Wed Sep 18 18:38:11 2024 +0800

    [CI/Build] Avoid CUDA initialization (#8534)

commit e351572900f7d87e14fe203ea3a49c1c7ddae0d6
Author: Jiaxin Shan <seedjeffwan@gmail.com>
Date:   Wed Sep 18 02:51:59 2024 -0700

    [Misc] Add argument to disable FastAPI docs (#8554)

commit 95965d31b6ac2c9557816a6ffabe4a3117a5ccb2
Author: Daniele <36171005+dtrifiro@users.noreply.github.com>
Date:   Wed Sep 18 04:49:53 2024 +0200

    [CI/Build] fix Dockerfile.cpu on podman (#8540)

commit 8110e44529f431d54b02060528601c0d3e3f7d02
Author: Tyler Michael Smith <tyler@neuralmagic.com>
Date:   Tue Sep 17 19:44:27 2024 -0400

    [Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012)

commit 09deb4721f830602d0417604c7e18b7e384f9594
Author: Alexey Kondratiev(AMD) <143633163+alexeykondrat@users.noreply.github.com>
Date:   Tue Sep 17 19:40:29 2024 -0400

    [CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520)

commit fa0c114fad4e2b807503e78d5110558cfee92ba4
Author: youkaichao <youkaichao@gmail.com>
Date:   Tue Sep 17 16:24:06 2024 -0700

    [doc] improve installation doc (#8550)

    Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>

commit 98f9713399bd602ff954a83e6e6abcb4cf8b8864
Author: Joe Runde <Joseph.Runde@ibm.com>
Date:   Tue Sep 17 17:17:08 2024 -0600

    [Bugfix] Fix TP > 1 for new granite (#8544)

    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>

commit 56c3de018c35580fd088655c2f9951cd4da5335d
Author: Nick Hill <nickhill@us.ibm.com>
Date:   Tue Sep 17 20:24:29 2024 +0100

    [Misc] Don't dump contents of kvcache tensors on errors (#8527)

commit a54ed8024953dc6b59906072a7a89cd4791ec4f0
Author: Patrick von Platen <patrick.v.platen@gmail.com>
Date:   Tue Sep 17 19:50:37 2024 +0200

    [Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)

    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

commit 9855b99502c7537db5ef018129e603650800ac46
Author: chenqianfzh <51831990+chenqianfzh@users.noreply.github.com>
Date:   Tue Sep 17 08:09:12 2024 -0700

    [Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434)

commit 1009e93c5d634c724eeff3d4e453369337f502d4
Author: sroy745 <142070531+sroy745@users.noreply.github.com>
Date:   Tue Sep 17 07:35:01 2024 -0700

    [Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631)

commit 1b6de8352b878348974b3f117cbb68ed18daa609
Author: Isotr0py <2037008807@qq.com>
Date:   Tue Sep 17 15:34:27 2024 +0800

    [Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495)

commit cbdb25225914a04d94e8830f4e739faca8ff3b9d
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date:   Tue Sep 17 00:06:26 2024 -0700

    [Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change…


opus24 committed Oct 11, 2024

1 parent 6685a3c commit a10068c

.buildkite/check-wheel-size.py

            
@@ -1,36 +1,43 @@
     import os
+    import sys
     import zipfile
-    MAX_SIZE_MB = 250
+    # Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
+    VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
     def print_top_10_largest_files(zip_file):
         """Print the top 10 largest files in the given zip file."""
         with zipfile.ZipFile(zip_file, 'r') as z:
             file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
             file_sizes.sort(key=lambda x: x[1], reverse=True)
             for f, size in file_sizes[:10]:
-                print(f"{f}: {size/(1024*1024)} MBs uncompressed.")
+                print(f"{f}: {size / (1024 * 1024):.2f} MBs uncompressed.")
     def check_wheel_size(directory):
         """Check the size of .whl files in the given directory."""
         for root, _, files in os.walk(directory):
-            for f in files:
-                if f.endswith(".whl"):
-                    wheel_path = os.path.join(root, f)
-                    wheel_size = os.path.getsize(wheel_path)
-                    wheel_size_mb = wheel_size / (1024 * 1024)
-                    if wheel_size_mb > MAX_SIZE_MB:
-                        print(
-                            f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
-                            f"compare to the allowed size ({MAX_SIZE_MB} MB).")
+            for file_name in files:
+                if file_name.endswith(".whl"):
+                    wheel_path = os.path.join(root, file_name)
+                    wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
+                    if wheel_size_mb > VLLM_MAX_SIZE_MB:
+                        print(f"Not allowed: Wheel {wheel_path} is larger "
+                              f"({wheel_size_mb:.2f} MB) than the limit "
+                              f"({VLLM_MAX_SIZE_MB} MB).")
                         print_top_10_largest_files(wheel_path)
                         return 1
                     else:
                         print(f"Wheel {wheel_path} is within the allowed size "
-                              f"({wheel_size_mb} MB).")
+                              f"({wheel_size_mb:.2f} MB).")
         return 0
     if __name__ == "__main__":
-        import sys
-        sys.exit(check_wheel_size(sys.argv[1]))
+        if len(sys.argv) < 2:
+            print("Usage: python check-wheel-size.py <directory>")
+            sys.exit(1)
+        directory = sys.argv[1]
+        sys.exit(check_wheel_size(directory))
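
Reviewer note: the sketch below illustrates how an environment-variable override such as `VLLM_MAX_SIZE_MB` changes the outcome of the size check. It is a minimal sketch, not the committed script: the helpers `read_limit_mb`, `wheel_within_limit` and `demo`, the dummy wheel file name, and the explicit `env` argument are all hypothetical. The committed script reads the variable once at module level, whereas this sketch reads it on each call so both outcomes can be shown in one run.

```python
import os
import tempfile


def read_limit_mb(env=None, default=250):
    """Return the size limit in MB, taken from VLLM_MAX_SIZE_MB if set."""
    env = os.environ if env is None else env
    return int(env.get("VLLM_MAX_SIZE_MB", default))


def wheel_within_limit(path, limit_mb):
    """Hypothetical helper: True if the file at `path` is at most `limit_mb` MB."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb <= limit_mb


def demo():
    """Check one 2 MB dummy file against a 1 MB and a 300 MB limit."""
    with tempfile.TemporaryDirectory() as tmp:
        wheel = os.path.join(tmp, "example-0.0.0-py3-none-any.whl")
        with open(wheel, "wb") as fh:
            fh.write(b"\0" * (2 * 1024 * 1024))  # 2 MB of zero bytes
        # A plain dict stands in for os.environ so the real environment is untouched.
        rejected = not wheel_within_limit(wheel, read_limit_mb({"VLLM_MAX_SIZE_MB": "1"}))
        accepted = wheel_within_limit(wheel, read_limit_mb({"VLLM_MAX_SIZE_MB": "300"}))
        return rejected, accepted
```

With the real script, the equivalent override would presumably be set in the shell before invoking `python check-wheel-size.py <directory>`.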

...ldkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml

@@ -0,0 +1,11 @@
+    # bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
+    model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
+    tasks:
+    - name: "gsm8k"
+      metrics:
+      - name: "exact_match,strict-match"
+        value: 0.764
+      - name: "exact_match,flexible-extract"
+        value: 0.764
+    limit: 250
+    num_fewshot: 5

.buildkite/lm-eval-harness/configs/models-small.txt

@@ -1,7 +1,7 @@
     Meta-Llama-3-8B-Instruct.yaml
-    Meta-Llama-3-8B-Instruct-FP8.yaml
     Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
     Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
+    Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
     Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
     Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
     Minitron-4B-Base-FP8.yaml

.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh

            
@@ -2,7 +2,7 @@
     # We can use this script to compute baseline accuracy on GSM for transformers.
     #
     # Make sure you have lm-eval-harness installed:
-    #   pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@9516087b81a61d0e220b22cc1b75be76de23bc10
+    #   pip install lm-eval==0.4.4
     usage() {
         echo``

.buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh

@@ -3,7 +3,7 @@
     # We use this for fp8, which HF does not support.
     #
     # Make sure you have lm-eval-harness installed:
-    #   pip install lm-eval==0.4.3
+    #   pip install lm-eval==0.4.4
     usage() {
         echo``

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
         results = launch_lm_eval(eval_config)
         # Confirm scores match ground truth.
+        success = True
         for task in eval_config["tasks"]:
             for metric in task["metrics"]:
                 ground_truth = metric["value"]
                 measured_value = results["results"][task["name"]][metric["name"]]
                 print(f'{task["name"]} | {metric["name"]}: '
                       f'ground_truth={ground_truth} | measured={measured_value}')
-                assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
+                success = success and numpy.isclose(
+                    ground_truth, measured_value, rtol=RTOL)
+        # Assert at the end, print all scores even on failure for debugging.
+        assert success
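
Reviewer note: the change above replaces a per-metric `assert` with an accumulated flag, so every score is printed before the test fails. A minimal standalone sketch of that pattern follows. It makes several assumptions: `compare_metrics` is a hypothetical helper, the `RTOL` value here is a placeholder rather than the one in the test file, and `math.isclose` is swapped in for `numpy.isclose` to keep the sketch dependency-free (the two define relative tolerance slightly differently).

```python
import math

RTOL = 0.05  # placeholder tolerance, not taken from the commit


def compare_metrics(expected, measured, rtol=RTOL):
    """Print every comparison, then report whether all metrics were close."""
    success = True
    for name, ground_truth in expected.items():
        value = measured[name]
        print(f"{name}: ground_truth={ground_truth} | measured={value}")
        # Accumulate instead of asserting so later metrics are still printed.
        success = success and math.isclose(ground_truth, value, rel_tol=rtol)
    return success


expected = {
    "exact_match,strict-match": 0.764,
    "exact_match,flexible-extract": 0.764,
}
all_close = compare_metrics(
    expected,
    {"exact_match,strict-match": 0.77, "exact_match,flexible-extract": 0.76},
)
one_off = compare_metrics(
    expected,
    {"exact_match,strict-match": 0.50, "exact_match,flexible-extract": 0.76},
)
```

A single `assert` on the returned flag at the end then fails the test once, with the full table of scores already in the log.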

.buildkite/nightly-benchmarks/benchmark-pipeline.yaml

@@ -8,8 +8,7 @@ steps:
               containers:
               - image: badouralix/curl-jq
                 command:
-                - sh
-                - .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
+                - sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
       - wait
       - label: "A100"
         agents:

.buildkite/nightly-benchmarks/nightly-annotation.md

@@ -0,0 +1,28 @@
+    ## Description
+    This file contains the downloading link for benchmarking results.
+    - [benchmarking pipeline](artifact://nightly-pipeline.yaml)
+    - [benchmarking results](artifact://results.zip)
+    - [benchmarking code](artifact://nightly-benchmarks.zip)
+    Please download the visualization scripts in the post
+    ## Results reproduction
+    - Find the docker we use in `benchmarking pipeline`
+    - Deploy the docker, and inside the docker:
+      - Download `nightly-benchmarks.zip`.
+      - In the same folder, run the following code
+    ```
+    export HF_TOKEN=<your HF token>
+    apt update
+    apt install -y git
+    unzip nightly-benchmarks.zip
+    VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+    ```
+    And the results will be inside `./benchmarks/results`.

.buildkite/nightly-benchmarks/nightly-descriptions.md

@@ -1,45 +1,39 @@
     # Nightly benchmark
-    The main goal of this benchmarking is two-fold:
-    - Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
-    - Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().
-    ## Docker images
-    We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
-    - vllm/vllm-openai:v0.5.0.post1
-    - nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
-    - openmmlab/lmdeploy:v0.5.0
-    - ghcr.io/huggingface/text-generation-inference:2.1
-    <!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->
-    ## Hardware
-    One AWS node with 8x NVIDIA A100 GPUs.
-    ## Workload description
-    We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
-    - Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
-    - Output length: the corresponding output length of these 500 prompts.
-    - Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
-    - Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-    - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
-    <!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
-    ## Plots
-    In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
-    <img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >
-    ## Results
-    {nightly_results_benchmarking_table}
+    This benchmark aims to:
+    - Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
+    - Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
+    Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
+    Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
+    ## Setup
+    - Docker images:
+      - vLLM: `vllm/vllm-openai:v0.6.2`
+      - SGLang: `lmsysorg/sglang:v0.3.2-cu121`
+      - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
+      - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
+        - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
+      - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
+    - Hardware
+      - 8x Nvidia A100 GPUs
+    - Workload:
+      - Dataset
+        - ShareGPT dataset
+        - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output)
+        - Decode-heavy dataset (in average 462 input tokens, 256 output tokens)
+        - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
+      - Models: llama-3 8B, llama-3 70B.
+        - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
+      - Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
+        - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
+      - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
+    # Known issues
+    - TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
+    - TGI does not support `ignore-eos` flag.
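
Reviewer note: the workload description above says arrival patterns follow a Poisson process with a fixed random seed, at QPS values of 2, 4, 8, 16, 32 and inf. The sketch below shows one common way to generate such a schedule, using exponentially distributed gaps between requests. It is an illustration under stated assumptions, not the benchmark's code: the function name `arrival_times` is hypothetical, and treating `inf` as "send everything at time zero" is an assumption about what that setting means.

```python
import random


def arrival_times(num_requests, qps, seed=0):
    """Return cumulative send times in seconds for a Poisson arrival process.

    Gaps between consecutive requests are drawn from an exponential
    distribution with rate `qps`, so the mean gap is 1 / qps seconds.
    """
    if qps == float("inf"):
        # Assumed meaning of "inf": no pacing, all requests sent immediately.
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    now = 0.0
    times = []
    for _ in range(num_requests):
        now += rng.expovariate(qps)
        times.append(now)
    return times


schedule_a = arrival_times(500, qps=4, seed=0)
schedule_b = arrival_times(500, qps=4, seed=0)
burst = arrival_times(3, qps=float("inf"))
```

Because the generator is seeded, two runs with the same arguments produce the same schedule, which is what makes the comparison across serving engines repeatable.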

.buildkite/nightly-benchmarks/nightly-pipeline.yaml

            
@@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec
     common_container_settings: &common_container_settings
       command:
-        - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
+        - bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
       resources:
         limits:
           nvidia.com/gpu: 8

@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings
     steps:
       - block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
-      - label: "A100 trt benchmark"
+      - label: "A100 vllm step 10"
         priority: 100
         agents:
           queue: A100

@@ -46,7 +49,21 @@ steps:
               podSpec:
                 <<: *common_pod_spec
                 containers:
-                  - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
+                  - image: vllm/vllm-openai:v0.6.2
                     <<: *common_container_settings
+      - label: "A100 sglang benchmark"
+        priority: 100
+        agents:
+          queue: A100
+        plugins:
+          - kubernetes:
+              podSpec:
+                <<: *common_pod_spec
+                containers:
+                  - image: lmsysorg/sglang:v0.3.2-cu121
+                    <<: *common_container_settings
       - label: "A100 lmdeploy benchmark"

@@ -58,11 +75,13 @@ steps:
               podSpec:
                 <<: *common_pod_spec
                 containers:
-                  - image: openmmlab/lmdeploy:v0.5.0
+                  - image: openmmlab/lmdeploy:v0.6.1-cu12
                     <<: *common_container_settings
-      - label: "A100 vllm benchmark"
+      - label: "A100 trt llama-8B"
         priority: 100
         agents:
           queue: A100

@@ -71,10 +90,25 @@ steps:
               podSpec:
                 <<: *common_pod_spec
                 containers:
-                  - image: vllm/vllm-openai:latest
+                  - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
                     <<: *common_container_settings
+                    env:
+                      - name: VLLM_USAGE_SOURCE
+                        value: ci-test
+                      - name: HF_HOME
+                        value: /root/.cache/huggingface
+                      - name: VLLM_SOURCE_CODE_LOC
+                        value: /workspace/build/buildkite/vllm/performance-benchmark
+                      - name: HF_TOKEN
+                        valueFrom:
+                          secretKeyRef:
+                            name: hf-token-secret
+                            key: token
+                      - name: TEST_SELECTOR
+                        value: "llama8B"
-      - label: "A100 tgi benchmark"
+      - label: "A100 trt llama-70B"
         priority: 100
         agents:
           queue: A100

@@ -83,12 +117,54 @@ steps:
               podSpec:
                 <<: *common_pod_spec
                 containers:
-                  - image: ghcr.io/huggingface/text-generation-inference:2.1
+                  - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
                     <<: *common_container_settings
+                    env:
+                      - name: VLLM_USAGE_SOURCE
+                        value: ci-test
+                      - name: HF_HOME
+                        value: /root/.cache/huggingface
+                      - name: VLLM_SOURCE_CODE_LOC
+                        value: /workspace/build/buildkite/vllm/performance-benchmark
+                      - name: HF_TOKEN
+                        valueFrom:
+                          secretKeyRef:
+                            name: hf-token-secret
+                            key: token
+                      - name: TEST_SELECTOR
+                        value: "llama70B"
+      # FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
+      # - label: "A100 trt benchmark"
+      #   priority: 100
+      #   agents:
+      #     queue: A100
+      #   plugins:
+      #     - kubernetes:
+      #         podSpec:
+      #           <<: *common_pod_spec
+      #           containers:
+      #             - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
+      #               <<: *common_container_settings
+      # FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
+      # - label: "A100 tgi benchmark"
+      #   priority: 100
+      #   agents:
+      #     queue: A100
+      #   plugins:
+      #     - kubernetes:
+      #         podSpec:
+      #           <<: *common_pod_spec
+      #           containers:
+      #             - image: ghcr.io/huggingface/text-generation-inference:2.2.0
+      #               <<: *common_container_settings
       - wait
-      - label: "Plot"
+      - label: "Collect the results"
         priority: 100
         agents:
           queue: A100

@@ -117,4 +193,4 @@ steps:
                         name: hf-token-secret
                         key: token
       - wait
       - block: ":rocket: check the results!"
