# OpenVINO™ GenAI: How it works

## Stateful LLM

A common optimization for LLM inference is using a past KV (key/value) cache. This cache is represented by the corresponding inputs and outputs in a model originally implemented in a DL framework (e.g. PyTorch models from Hugging Face). For further optimization and easier use, the model is transformed to a stateful form. This transformation improves inference performance and decreases the amount of runtime memory allocated in long-running text generation scenarios. It is achieved by hiding the inputs and outputs of the model that represent past KV-cache tensors and handling them inside the model in a more efficient way. The cache is still accessible through the state API. This is in contrast to the stateless model approach, which requires manipulating these inputs and outputs explicitly. An introduction to stateful models can be found in the Stateful Models article.
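
Although the KV-cache is hidden, it can still be inspected or dropped through the state API. Below is a minimal sketch using the OpenVINO Python API; the model path is a placeholder for a stateful IR exported with optimum-intel:

```python
import openvino as ov

core = ov.Core()
# Placeholder path: a stateful LLM IR exported with optimum-intel.
compiled = core.compile_model("llm_stateful/openvino_model.xml", "CPU")
request = compiled.create_infer_request()

# ... run a few generation steps with request.infer(...) ...

# Each hidden KV-cache tensor is exposed as a variable state
# rather than an explicit model input/output.
for state in request.query_state():
    print(state.name)

# Drop the accumulated KV-cache before starting an unrelated prompt.
request.reset_state()
```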

Hiding the KV-cache introduces a peculiarity for the beam search algorithm. Beam search suggests batched inference of multiple beams. The design described so far would result in generating multiple independent sequences of tokens. The beam search algorithm, on the other hand, requires removing some of the ongoing beams and splitting other beams into multiple branches. Beam removal requires deleting the corresponding KV-cache entry, and beam splitting requires copying the corresponding KV-cache values.

To make it possible to implement beam search without accessing the model's internal state, a stateful LLM converted with optimum-intel or llm_bench introduces an additional 1-dimensional `beam_idx` input. `beam_idx` must contain the indices of the batch elements that are intended to be selected and will evolve during the next beam search iteration. There is only one beam when generation starts, and it corresponds to the initial prompt. Setting `beam_idx` to `[0, 0]` keeps the initial beam and introduces a copy of it; the dynamic batch size makes it possible to change the number of beams on the fly. Setting `beam_idx` to `[1]` removes the zeroth sequence and keeps only the second beam.

Assume there are two running beams. To proceed with generating both beams at the next iteration, `beam_idx` values must be `[0, 1]`, pointing to batch elements 0 and 1. To drop the last beam and split the other beam in two, `beam_idx` must be set to `[0, 0]`. This results in utilizing only the part of the KV-cache corresponding to the zeroth element in the batch. The process of selecting the proper entries in the cache is called Cache Reorder.
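
The effect of Cache Reorder can be illustrated with a plain NumPy gather on a mock KV-cache tensor; the shape below is illustrative only and does not reflect the exact layout inside the model:

```python
import numpy as np

# Mock past-key tensor for one attention layer:
# [batch (beams), num_heads, past_sequence_length, head_size]
kv_cache = np.arange(2 * 2 * 3 * 4, dtype=np.float32).reshape(2, 2, 3, 4)

# beam_idx = [0, 1]: keep both beams unchanged for the next iteration.
kept = kv_cache[[0, 1]]

# beam_idx = [0, 0]: drop beam 1 and split beam 0 into two branches
# by copying its history into both batch slots.
split = kv_cache[[0, 0]]

# beam_idx = [1]: remove the zeroth sequence and keep only beam 1.
pruned = kv_cache[[1]]

print(kept.shape, split.shape, pruned.shape)
```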

The images below represent stateless and stateful LLM pipelines. The model has 4 inputs:

1. `input_ids` contains the next selected token
2. `attention_mask` is filled with 1
3. `position_ids` encodes the position of the currently generated token in the sequence
4. `beam_idx` selects beams

The model has a single output, `logits`, describing the predicted distribution over the next tokens. In addition, there is the KV-cache state handled internally by the model.
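
A hedged sketch of one decoding step with these inputs, using the OpenVINO Python API, is shown below. The model path, the already-sampled `next_token`, and the current `step` are placeholders, and the input/output names are assumed to match a model converted with optimum-intel:

```python
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("llm_stateful/openvino_model.xml", "CPU")  # placeholder path
request = compiled.create_infer_request()

# Assume the prompt was already processed and `next_token` was sampled from
# the previous step's logits; `step` is this token's position in the sequence.
next_token, step, num_beams = 42, 7, 1

inputs = {
    "input_ids": np.array([[next_token]], dtype=np.int64),             # next selected token
    "attention_mask": np.ones((num_beams, step + 1), dtype=np.int64),  # filled with 1
    "position_ids": np.array([[step]], dtype=np.int64),                # position of this token
    "beam_idx": np.arange(num_beams, dtype=np.int32),                  # keep beams in order
}

logits = request.infer(inputs)["logits"]     # shape: [batch, 1, vocab_size]
next_token = int(np.argmax(logits[0, -1]))   # greedy pick, just for illustration
```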