Nonsense responses with n-gram speculative decoding #2997

Open

olliestanley opened this issue Feb 6, 2025 · 1 comment

Comments

@olliestanley

System Info

text-generation-inference 3.1.0 (saw the same issue on 3.0.0)

model="NousResearch/Meta-Llama-3.1-8B-Instruct"
volume="$PWD/data"
docker create --name llama3.1-speculate2 --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id $model --quantize eetq --speculate 2
docker start llama3.1-speculate2
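
Once the container is up, the running version and launch settings can be sanity-checked via TGI's /info endpoint (a quick check; exact fields vary by TGI version, and localhost:8080 follows the port mapping in the docker command above):

import requests

# GET /info returns the server's model and launch configuration as JSON,
# which confirms the image version and whether speculation is enabled.
print(requests.get("http://localhost:8080/info").json())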

nvidia-smi (EC2 VM in AWS)

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   26C    P8               9W / 300W |      3MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. Run the startup script above with n-gram speculative decoding enabled
  2. Submit a query to the LLM (an example request is sketched below)
  3. Observe gibberish output. The gibberish often appears to include repetition from the prior context

Example input (to be clear, this is the entire input - there is no system prompt or chat history):

<|start_header_id|>user<|end_header_id|>

Do machine learning models need data?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
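
For step 2, the query can be submitted against TGI's /generate endpoint. A minimal sketch (the max_new_tokens value is an arbitrary choice; host/port follow the docker command above):

import requests

# Raw Llama 3.1 chat-template prompt, exactly as above
# (no system prompt or chat history).
prompt = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Do machine learning models need data?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 512}},
)
print(resp.json()["generated_text"])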


Response with no speculative decoding:

Yes, machine learning models need data to learn and make predictions. In fact, data is the fuel that powers machine learning models. Here's why:

1. **Training**: Machine learning models are trained on a dataset, which is a collection of examples or instances that the model learns from. The model uses this data to learn patterns, relationships, and decision boundaries that enable it to make predictions.
2. **Learning**: During training, the model adjusts its parameters to minimize the difference between its predictions and the actual outcomes in the training data. This process is called learning, and it's essential for the model to develop its predictive capabilities.
3. **Generalization**: Once a model is trained, it's tested on a separate dataset, called the validation set, to evaluate its performance. The model's ability to generalize to new, unseen data is critical for its effectiveness in real-world applications.
4. **Fine-tuning**: Even after a model is trained, it may need to be fine-tuned on additional data to adapt to changing conditions, such as new features or shifts in the underlying distribution of the data.

Types of data that machine learning models need:

1. **Labeled data**: This is data that has been annotated or labeled with the correct output or response. Labeled data is essential for supervised learning models, such as classification and regression models.
2. **Unlabeled data**: This is data that doesn't have any labels or annotations. Unlabeled data can be used for unsupervised learning models, such as clustering and dimensionality reduction.
3. **Feature data**: This is data that contains the features or attributes that the model uses to make predictions. Feature data can be numerical, categorical, or a combination of both.
4. **Contextual data**: This is data that provides context or additional information about the data, such as metadata or external knowledge.

In summary, machine learning models need data to learn, make predictions, and adapt to changing conditions. The type and quality of data can significantly impact the performance and effectiveness of a machine learning model.

Response with 2-gram speculation:

Yes, machine learning

Machine learning models model needs data to learn and make predictions. In fact, data is the fuel that powers machine learning

Machine learning

Machine learning.

I haven't tried higher than 2-gram with this minimal reproduction example, but in the more complex use case I was trying previously, the responses got longer and more garbled at 3-gram.

I would also note that I haven't tried without eetq quantisation due to VRAM constraints.

Expected behavior

As I understand it, the n-gram speculative decoding output should not differ at all from the normal output.
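
For context on why the outputs should match: n-gram speculation only drafts candidate tokens, while the target model still verifies every position, and the first rejected draft token is replaced by the model's own choice. An illustrative greedy-decoding sketch of that accept/verify loop (not TGI's implementation; model_argmax is a hypothetical stand-in for one forward pass returning the model's greedy next token):

# Illustrative accept/verify semantics for greedy n-gram speculative
# decoding. Not TGI's code; ctx is a list of token ids.

def ngram_draft(ctx, n=2, k=2):
    """Propose up to k tokens by copying what followed the most recent
    earlier occurrence of the trailing n-gram; empty if none exists."""
    tail = tuple(ctx[-n:])
    for i in range(len(ctx) - n - 1, -1, -1):
        if tuple(ctx[i:i + n]) == tail:
            return ctx[i + n:i + n + k]
    return []

def speculative_step(ctx, model_argmax):
    draft = ngram_draft(ctx)
    accepted = []
    for tok in draft:
        target = model_argmax(ctx + accepted)
        if target != tok:            # mismatch: keep the model's token
            accepted.append(target)  # and discard the rest of the draft
            return accepted
        accepted.append(tok)         # match: the draft token is "free"
    # All drafted tokens accepted; the model generates one more as usual.
    accepted.append(model_argmax(ctx + accepted))
    return accepted

Under greedy decoding, every emitted token is exactly what the model would have produced unassisted, so speculation should change only the number of forward passes, never the text. (A real implementation verifies the whole draft in a single batched forward pass; the per-token loop above just makes the semantics explicit.)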

@olliestanley (Author)

@Narsil @OlivierDehaene any thoughts on whether this is a TGI bug or there's something I should be doing differently?
