Nonsense responses with n-gram speculative decoding #2997

Open

olliestanley opened this issue Feb 6, 2025 · 1 comment

Comments

@olliestanley

System Info

text-generation-inference 3.1.0 (saw the same issue on 3.0.0)

model="NousResearch/Meta-Llama-3.1-8B-Instruct"
volume="$PWD/data"
docker create --name llama3.1-speculate2 --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id $model --quantize eetq --speculate 2
docker start llama3.1-speculate2
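
Once the container is up, the running version and launch settings can be sanity-checked via TGI's /info endpoint (a quick check; exact fields vary by TGI version, and localhost:8080 follows the port mapping in the docker command above):

import requests

# GET /info returns the server's model and launch configuration as JSON,
# which confirms the image version and whether speculation is enabled.
print(requests.get("http://localhost:8080/info").json())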

nvidia-smi (EC2 VM in AWS)

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   26C    P8               9W / 300W |      3MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Run the startup script above with n-gram speculative decoding enabled
  2. Submit a query to the LLM (an example request is sketched below)
  3. Observe gibberish output. The gibberish often appears to include repetition from the prior context

Example input (to be clear, this is the entire input - there is no system prompt or chat history):

<|start_header_id|>user<|end_header_id|>

Do machine learning models need data?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
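
For step 2, the query can be submitted against TGI's /generate endpoint. A minimal sketch (the max_new_tokens value is an arbitrary choice; host/port follow the docker command above):

import requests

# Raw Llama 3.1 chat-template prompt, exactly as above
# (no system prompt or chat history).
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Do machine learning models need data?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 512}},
)
print(resp.json()["generated_text"])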


Response with no speculative decoding:

Yes, machine learning models need data to learn and make predictions. In fact, data is the fuel that powers machine learning models. Here's why:

1. **Training**: Machine learning models are trained on a dataset, which is a collection of examples or instances that the model learns from. The model uses this data to learn patterns, relationships, and decision boundaries that enable it to make predictions.
2. **Learning**: During training, the model adjusts its parameters to minimize the difference between its predictions and the actual outcomes in the training data. This process is called learning, and it's essential for the model to develop its predictive capabilities.
3. **Generalization**: Once a model is trained, it's tested on a separate dataset, called the validation set, to evaluate its performance. The model's ability to generalize to new, unseen data is critical for its effectiveness in real-world applications.
4. **Fine-tuning**: Even after a model is trained, it may need to be fine-tuned on additional data to adapt to changing conditions, such as new features or shifts in the underlying distribution of the data.

Types of data that machine learning models need:

1. **Labeled data**: This is data that has been annotated or labeled with the correct output or response. Labeled data is essential for supervised learning models, such as classification and regression models.
2. **Unlabeled data**: This is data that doesn't have any labels or annotations. Unlabeled data can be used for unsupervised learning models, such as clustering and dimensionality reduction.
3. **Feature data**: This is data that contains the features or attributes that the model uses to make predictions. Feature data can be numerical, categorical, or a combination of both.
4. **Contextual data**: This is data that provides context or additional information about the data, such as metadata or external knowledge.

In summary, machine learning models need data to learn, make predictions, and adapt to changing conditions. The type and quality of data can significantly impact the performance and effectiveness of a machine learning model.

Response with 2-gram speculation:

Yes, machine learning

Machine learning models model needs data to learn and make predictions. In fact, data is the fuel that powers machine learning

Machine learning

Machine learning.

I haven't tried higher than 2-gram with this minimal reproduction example, but in the more complex use case I was trying previously, the responses got longer and more garbled at 3-gram.

I would also note that I haven't tried without eetq quantisation due to VRAM constraints.

Expected behavior

As I understand it, the n-gram speculative decoding output should not differ at all from the normal output.
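
For context on why the outputs should match: n-gram speculation only drafts candidate tokens, while the target model still verifies every position, and the first rejected draft token is replaced by the model's own choice. An illustrative greedy-decoding sketch of that accept/verify loop (not TGI's implementation; model_argmax is a hypothetical stand-in for one forward pass returning the model's greedy next token):

# Illustrative accept/verify semantics for greedy n-gram speculative
# decoding. Not TGI's code; ctx is a list of token ids.

def ngram_draft(ctx, n=2, k=2):
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the trailing n-gram; empty if none exists."""
    tail = tuple(ctx[-n:])
    for i in range(len(ctx) - n - 1, -1, -1):
        if tuple(ctx[i:i + n]) == tail:
            return ctx[i + n:i + n + k]
    return []

def speculative_step(ctx, model_argmax):
    draft = ngram_draft(ctx)
    accepted = []
    for tok in draft:
        target = model_argmax(ctx + accepted)
        if target != tok:            # mismatch: keep the model's token
            accepted.append(target)  # and discard the rest of the draft
            return accepted
        accepted.append(tok)         # match: the draft token is "free"
    # All drafted tokens accepted; the model generates one more as usual.
    accepted.append(model_argmax(ctx + accepted))
    return accepted

Under greedy decoding, every emitted token is exactly what the model would have produced unassisted, so speculation should change only the number of forward passes, never the text. (A real implementation verifies the whole draft in a single batched forward pass; the per-token loop above just makes the semantics explicit.)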

@olliestanley (Author)

@Narsil @OlivierDehaene any thoughts on whether this is a TGI bug or there's something I should be doing differently?
