OpenVINO GenAI Text Generation Samples

These samples showcase the use of OpenVINO's inference capabilities for text generation tasks, including different decoding strategies such as beam search, multinomial sampling, and speculative decoding. Each sample has a specific focus and demonstrates a unique aspect of text generation. The applications deliberately expose few configuration options to encourage the reader to explore and modify the source code, for example to change the device used for inference to GPU. There are also Jupyter notebooks for some samples; you can find links to them in the corresponding sample descriptions.

Table of Contents

  1. Download and Convert the Model and Tokenizers
  2. Sample Descriptions
  3. Troubleshooting
  4. Support and Contribution

Download and Convert the Model and Tokenizers

The --upgrade-strategy eager option is needed to ensure optimum-intel is upgraded to the latest version. Install ../../export-requirements.txt if model conversion is required.

pip install --upgrade-strategy eager -r ../../export-requirements.txt
optimum-cli export openvino --model <model> <output_folder>

If a converted model in OpenVINO IR format is already available in the collection of OpenVINO optimized LLMs on Hugging Face, it can be downloaded directly via huggingface-cli.

pip install --upgrade-strategy eager -r ../../export-requirements.txt
huggingface-cli download <model> --local-dir <output_folder>
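
For example, to export one of the recommended chat models and store it in a local folder of the same name (the model ID here is only illustrative), the command could look as follows:

optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 TinyLlama-1.1B-Chat-v1.0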

Sample Descriptions

Common information

Follow Get Started with Samples to get common information about OpenVINO samples. Follow the build instructions to build the GenAI samples.

GPUs usually provide better performance compared to CPUs. Modify the source code to change the device for inference to the GPU.

See https://github.com/openvinotoolkit/openvino.genai/blob/master/SUPPORTED_MODELS.md for the list of supported models.

Install ../../deployment-requirements.txt to run the samples:

pip install --upgrade-strategy eager -r ../../deployment-requirements.txt

1. Chat Sample (chat_sample)

  • Description: Interactive chat interface powered by OpenVINO. Here is a Jupyter notebook that provides an example of LLM-powered text generation in Python. A minimal API sketch is shown after the run command below. Recommended models: meta-llama/Llama-2-7b-chat-hf, TinyLlama/TinyLlama-1.1B-Chat-v1.0, etc.
  • Main Feature: Real-time chat-like text generation.
  • Run Command:
    ./chat_sample <MODEL_DIR>
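
The core of the sample is a chat loop built on top of ov::genai::LLMPipeline. The following is a minimal sketch of that loop, assuming the C++ API from openvino/genai/llm_pipeline.hpp; the actual sample may differ in details such as token streaming:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string models_path = argv[1];
    ov::genai::LLMPipeline pipe(models_path, "CPU");  // change "CPU" to "GPU" to offload inference

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;

    pipe.start_chat();  // keep the conversation history between generate() calls
    std::string prompt;
    std::cout << "question:\n";
    while (std::getline(std::cin, prompt)) {
        std::string answer = pipe.generate(prompt, config);
        std::cout << answer << "\n----------\nquestion:\n";
    }
    pipe.finish_chat();  // clear the accumulated history
}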

Missing chat template

If you encounter an exception indicating a missing "chat template" when launching ov::genai::LLMPipeline in chat mode, it likely means the model was not tuned for chat functionality. To work around this, manually add the chat template to the tokenizer_config.json of your model. The following template can be used as a default, but it may not work properly with every model:

"chat_template": "{% for message in messages %}{% if (message['role'] == 'user') %}{{'<|im_start|>user\n' + message['content'] + '<|im_end|>\n<|im_start|>assistant\n'}}{% elif (message['role'] == 'assistant') %}{{message['content'] + '<|im_end|>\n'}}{% endif %}{% endfor %}",

2. Greedy Causal LM (greedy_causal_lm)

  • Description: Basic text generation using a causal language model. Here is a Jupyter notebook that provides an example of LLM-powered text generation in Python. A minimal API sketch is shown after the run command below. Recommended models: meta-llama/Llama-2-7b-hf, etc.
  • Main Feature: Demonstrates simple text continuation.
  • Run Command:
    ./greedy_causal_lm <MODEL_DIR> "<PROMPT>"
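
Conceptually, the sample reduces to a single generate() call with the default (greedy) decoding settings. A minimal sketch, assuming the ov::genai C++ API (the values are illustrative), is shown below:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string models_path = argv[1];
    std::string prompt = argv[2];
    ov::genai::LLMPipeline pipe(models_path, "CPU");  // greedy decoding is the default strategy
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;  // limit the length of the continuation
    std::string result = pipe.generate(prompt, config);
    std::cout << result << '\n';
}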

3. Beam Search Causal LM (beam_search_causal_lm)

  • Description: Uses beam search for more coherent text generation. Here is a Jupyter notebook that provides an example of LLM-powered text generation in Python. A minimal API sketch is shown after the run command below. Recommended models: meta-llama/Llama-2-7b-hf, etc.
  • Main Feature: Improves text quality with beam search.
  • Run Command:
    ./beam_search_causal_lm <MODEL_DIR> "<PROMPT 1>" ["<PROMPT 2>" ...]
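
The decoding strategy is selected through ov::genai::GenerationConfig. A minimal sketch of the beam-search related fields, assuming the ov::genai C++ API (the field values are illustrative), is shown below:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char* argv[]) {
    std::string models_path = argv[1];
    std::vector<std::string> prompts(argv + 2, argv + argc);  // one or more prompts
    ov::genai::LLMPipeline pipe(models_path, "CPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_beams = 15;            // beams explored in parallel
    config.num_beam_groups = 3;       // group beam search produces more diverse hypotheses
    config.diversity_penalty = 1.0f;  // penalizes tokens shared across beam groups
    auto results = pipe.generate(prompts, config);
    for (const auto& text : results.texts)  // generated hypotheses for all prompts
        std::cout << text << '\n';
}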

4. Multinomial Causal LM (multinomial_causal_lm)

  • Description: Text generation with multinomial sampling for diversity. A minimal API sketch is shown after the run command below. Recommended models: meta-llama/Llama-2-7b-hf, etc.
  • Main Feature: Introduces randomness for creative outputs.
  • Run Command:
    ./multinomial_causal_lm <MODEL_DIR> "<PROMPT>"
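
Sampling behavior is also controlled through ov::genai::GenerationConfig. A minimal sketch, assuming the ov::genai C++ API (the sampling parameters are illustrative), is shown below:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string models_path = argv[1];
    std::string prompt = argv[2];
    ov::genai::LLMPipeline pipe(models_path, "CPU");
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.do_sample = true;    // switch from greedy search to multinomial sampling
    config.temperature = 0.7f;  // sharpen (<1) or flatten (>1) the token distribution
    config.top_k = 30;          // sample only from the 30 most likely tokens
    config.top_p = 0.9f;        // nucleus sampling threshold
    std::string result = pipe.generate(prompt, config);
    std::cout << result << '\n';
}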

5. Prompt Lookup Decoding LM (prompt_lookup_decoding_lm)

  • Description: Prompt lookup decoding is an assisted-generation technique where the draft model is replaced with simple string matching against the prompt to generate candidate token sequences. This method is highly effective for input-grounded generation (summarization, document QA, multi-turn chat, code editing), where there is high n-gram overlap between the LLM input (prompt) and the LLM output. This could be entity names, phrases, or code chunks that the LLM directly copies from the input while generating the output. Prompt lookup exploits this pattern to speed up autoregressive decoding in LLMs, which results in significant speedups with no effect on output quality. A minimal API sketch is shown after the run command below. Recommended models: meta-llama/Llama-2-7b-hf, etc.
  • Main Feature: Specialized prompt-based inference.
  • Run Command:
    ./prompt_lookup_decoding_lm <MODEL_DIR> "<PROMPT>"
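
A minimal sketch of how prompt lookup is enabled, assuming the ov::genai C++ API (the property and field names may differ between releases; check the sample source for your version), is shown below:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string models_path = argv[1];
    std::string prompt = argv[2];
    // Request the prompt-lookup implementation of the pipeline.
    ov::genai::LLMPipeline pipe(models_path, "CPU", ov::genai::prompt_lookup(true));
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_assistant_tokens = 5;  // candidate tokens proposed per validation step
    config.max_ngram_size = 3;        // n-gram size used to match candidates in the prompt
    std::string result = pipe.generate(prompt, config);
    std::cout << result << '\n';
}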

6. Speculative Decoding LM (speculative_decoding_lm)

  • Description: Speculative decoding (or assisted generation in HF terminology) is a recent technique that speeds up token generation by running an additional, smaller draft model alongside the main model.

Speculative decoding works as follows. The draft model predicts the next K tokens one by one in an autoregressive manner, while the main model validates these predictions and corrects them if necessary. We go through each predicted token, and if a difference is detected between the draft and main model, we stop and keep the last token predicted by the main model. Then the draft model gets the latest main prediction and again tries to predict the next K tokens, repeating the cycle.

This approach reduces the need for multiple infer requests to the main model, enhancing performance. For instance, in more predictable parts of text generation, the draft model can, in the best case, generate the next K tokens that exactly match the target. In that case they are validated in a single inference request to the main model (which is bigger, more accurate, but slower) instead of running K subsequent requests. More details can be found in the original papers: https://arxiv.org/pdf/2211.17192.pdf and https://arxiv.org/pdf/2302.01318.pdf.

Here is a Jupyter notebook that provides an example of LLM-powered text generation in Python.

Recommended models: meta-llama/Llama-2-13b-hf as the main model and TinyLlama/TinyLlama-1.1B-Chat-v1.0 as the draft model, etc. A minimal API sketch is shown after the run command below.

  • Main Feature: Reduces latency while generating high-quality text.
  • Run Command:
    ./speculative_decoding_lm <MODEL_DIR> <DRAFT_MODEL_DIR> "<PROMPT>"
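
A minimal sketch of how the draft model is attached to the main pipeline, assuming the ov::genai C++ API (property and field names may differ between releases), is shown below:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string main_model_path = argv[1];
    std::string draft_model_path = argv[2];
    std::string prompt = argv[3];
    // The draft model is passed as a pipeline property; it can even run on a different device.
    ov::genai::LLMPipeline pipe(main_model_path, "CPU",
                                ov::genai::draft_model(draft_model_path, "CPU"));
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_assistant_tokens = 5;  // K: tokens proposed by the draft model per cycle
    std::string result = pipe.generate(prompt, config);
    std::cout << result << '\n';
}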

7. LoRA Greedy Causal LM (lora_greedy_causal_lm)

  • Description: This sample demonstrates greedy decoding using Low-Rank Adaptation (LoRA) fine-tuned causal language models. LoRA enables efficient fine-tuning, reducing resource requirements for adapting large models to specific tasks. A minimal API sketch is shown after the run command below.
  • Main Feature: Lightweight fine-tuning with LoRA for efficient text generation.
  • Run Command:
    ./lora_greedy_causal_lm <MODEL_DIR> <ADAPTER_SAFETENSORS_FILE> "<PROMPT>"
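
A minimal sketch of how a LoRA adapter is attached to the pipeline, assuming the ov::genai C++ API (the Adapter and adapters names may differ between releases), is shown below:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string models_path = argv[1];
    std::string adapter_path = argv[2];  // LoRA adapter in safetensors format
    std::string prompt = argv[3];
    ov::genai::Adapter adapter(adapter_path);
    // Register the adapter at pipeline creation so it is applied during generation.
    ov::genai::LLMPipeline pipe(models_path, "CPU", ov::genai::adapters(adapter));
    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    std::string result = pipe.generate(prompt, config);
    std::cout << result << '\n';
}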

8. Encrypted Model Causal LM (encrypted_model_causal_lm)

  • Description: LLMPipeline and Tokenizer objects can be initialized directly from a memory buffer, e.g., when the user stores only encrypted files and decrypts them on the fly. The following code snippet demonstrates how to load the model from a memory buffer:
// decrypt_model is implemented in the sample; it returns the decrypted model and weights in memory
auto [model_str, weights_tensor] = decrypt_model(models_path + "/openvino_model.xml", models_path + "/openvino_model.bin");
ov::genai::Tokenizer tokenizer(models_path);
// Build the pipeline from the in-memory model instead of reading the files from disk
ov::genai::LLMPipeline pipe(model_str, weights_tensor, tokenizer, device);

For the sake of brevity, the code above does not include Tokenizer decryption. For more details, refer to the encrypted_model_causal_lm sample.

  • Main Feature: Read model directly from memory buffer
  • Run Command:
    ./encrypted_model_causal_lm <MODEL_DIR> "<PROMPT>"

9. LLMs benchmarking sample (benchmark_genai)

  • Description: This sample script demonstrates how to benchmark an LLM in OpenVINO GenAI. The script includes functionality for warm-up iterations, generating text, and calculating various performance metrics.

For more information on how the performance metrics are calculated, please refer to the performance-metrics tutorial.

  • Main Feature: Benchmark model via GenAI
  • Run Command:
    ./benchmark_genai [OPTIONS]

    Options

  • -m, --model: Path to the model and tokenizers base directory.
  • -p, --prompt (default: "The Sky is blue because"): The prompt to generate text.
  • -nw, --num_warmup (default: 1): Number of warmup iterations.
  • -mt, --max_new_tokens (default: 20): Maximal number of new tokens to generate.
  • -n, --num_iter (default: 3): Number of iterations.
  • -d, --device (default: "CPU"): Device to run the model on.
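
For example, a benchmarking run on CPU with 10 measured iterations could look as follows (the model directory name is only illustrative):

    ./benchmark_genai -m TinyLlama-1.1B-Chat-v1.0 -p "The Sky is blue because" -n 10 -d CPU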

Troubleshooting

Unicode characters encoding error on Windows

Example error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u25aa' in position 0: character maps to <undefined>

If you encounter the error described in the example when a sample prints output to the Windows console, it is likely due to the default Windows encoding not supporting certain Unicode characters. To resolve this:

  1. Enable Unicode characters for Windows cmd: open Region settings from the Control Panel, go to Administrative -> Change system locale -> Beta: Use Unicode UTF-8 for worldwide language support -> OK, and reboot.
  2. Enable UTF-8 mode by setting environment variable PYTHONIOENCODING="utf8".

Support and Contribution