Best way to add knowledge to an LLM : r/LocalLLaMA #665
Related issues

#643: I finally got perfect labels (classification task) via prompting : r/LocalLLaMA
Similarity score: 0.89
- [ ] [I finally got perfect labels (classification task) via prompting : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1amvfua/i_finally_got_perfect_labels_classification_task/)
TITLE: I finally got perfect labels (classification task) via prompting : r/LocalLLaMA
DESCRIPTION: "I finally got perfect labels (classification task) via prompting (Tutorial | Guide)
For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral
Below is the plug-n-play template I finalized/am using:

Instruction: Label the text based on this question: "{task}" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
Text: {few-shot example}
Input: Text: {Text for it to label}
Response:

For experimentation, I found that discrepancies are your best friend. My setup was:
Comparison between M8x7b-t0-s1000.csv and M8x7b-t1-s1000.csv: number of times M8x7b-t0 said "Yes" and M8x7b-t1 said "No": 100

That was actually the result of my first test, where I increased the number of few-shot examples from 5 to 19. Looking at this, I could tell that the update led to more negative labels. After checking, there were some correct labels but mostly just false negatives. This was super helpful because it's more feasible to examine 100 outputs than 1000... or 1 million... Eventually I got it down to this:

Comparison between M8x7b-t1-s1000.csv and M8x7b-t2-s1000.csv: number of times M8x7b-t1 said "Yes" and M8x7b-t2 said "No": 2

When I reviewed the output, filtering for these cases, it turned out that the second round of testing corrected all of the mislabels. Now is this perfect? After sampling instances where they agreed, it seems to be in order. I think there is something really special about this approach - by forcing overfitting, we can turn that into a feature instead of a bug. Working with the flaws of a model is a lot easier than trying to blindly iterate. At least here, we have a way to measure outputs against each other.
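A minimal sketch of that discrepancy check, assuming the two labeling runs were saved as CSVs with hypothetical `id` and `label` columns (the actual file layout isn't shown in the thread):

```python
import pandas as pd

# Two labeling runs over the same 1000 samples (column names are assumptions).
run_a = pd.read_csv("M8x7b-t0-s1000.csv")  # columns: id, label ("Yes"/"No")
run_b = pd.read_csv("M8x7b-t1-s1000.csv")

merged = run_a.merge(run_b, on="id", suffixes=("_t0", "_t1"))

# Rows where the first run said "Yes" but the second said "No" --
# the discrepancies worth reviewing by hand.
flipped = merged[(merged["label_t0"] == "Yes") & (merged["label_t1"] == "No")]
print(f'M8x7b-t0 "Yes" but M8x7b-t1 "No": {len(flipped)}')
flipped.to_csv("discrepancies_to_review.csv", index=False)
```

Reviewing only the flipped rows is what keeps the manual effort at ~100 examples instead of the full dataset.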
aichiusagi: For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral.

GeeBrain: Oh very interesting, what did this look like exactly? Could you give me an example? I'm thinking about fine-tuning BERT for classification after this round, since using Mixtral takes forever and is unrealistic when I want to process millions of data points. (See the sketch after this thread.)

[another commenter]: Can you please provide an example of an actual prompt?

GeeBrain: It's literally the template + whatever you want in the {}. But here ya go...

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction: Label the comment based on this question: "Does this comment share personal details, like how friends might talk to each other, and share from little to big things in their lives?" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label. (Hint: If a comment merely expresses an opinion or admiration without any personal context or experience, label it as 'No'. But if the comment shares additional context about the commenter's life, it should be labeled as 'Yes'. The level of detail matters!)
Comment: Wow, you are so beautiful.
Input: Comment: "When he comes up?"
Response:

trapping_rainwater: What's your production use case for something like this?

GeeBrain: My project is around building an ML model that measures trust, kinda like a fandom score. But in general, I can see this type of setup being really helpful when you have a lot of unlabeled data and wanna get really close with it. Even though I'll likely end up fine-tuning BERT models in the future for production, this has helped me understand so much about the data space. Pretty fun"

Suggested labels
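GeeBrain mentions moving to a fine-tuned BERT classifier once the prompt-generated labels are good enough. A minimal sketch of that step with Hugging Face transformers, assuming a CSV of `text`/`label` pairs produced by the prompting loop above (the file name, column names, and hyperparameters are illustrative, not from the thread):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Labels distilled from the Mixtral prompting runs (hypothetical file/columns).
dataset = load_dataset("csv", data_files="mixtral_labels.csv")["train"]
dataset = dataset.map(lambda x: {"label": 1 if x["label"] == "Yes" else 0})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)
split = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-labeler", num_train_epochs=3,
                           per_device_train_batch_size=16, evaluation_strategy="epoch"),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()
```

Once trained, the small BERT model can label millions of comments far faster than running Mixtral on each one.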
#660: Qwen - supervised finetuning script and guide for SFT.
Similarity score: 0.87
- [ ] [Example - Qwen](https://qwen.readthedocs.io/en/latest/training/SFT/example.html)
DESCRIPTION:
In the following, we introduce more details about the usage of the script.

Installation

Data Preparation

```json
{
"type": "chatml",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Tell me something about large language models."
},
{
"role": "assistant",
"content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
}
],
"source": "unknown"
}
{
"type": "chatml",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is your name?"
},
{
"role": "assistant",
"content": "My name is Qwen."
}
],
"source": "self-made"
}
```

Above are two examples of data samples in the dataset. Each sample is a JSON object with the following fields: type, messages, and source. messages is required, while the others are optional for you to label your data format and data source. The messages field is a list of JSON objects, each of which has two fields: role and content. role can be system, user, or assistant. content is the text of the message. source is the source of the data, which can be self-made, alpaca, open-hermes, or any other string. To make the jsonl file, you can use json to save a list of dictionaries to the jsonl file:

```python
import json
with open('data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
```

Quickstart

```bash
cd examples/sft
bash finetune.sh -m <model_path> -d <data_path> --deepspeed <config_path> [--use_lora True] [--q_lora True]
```

Specify the model path, data path, and DeepSpeed config path as shown above.

Advanced Usages

Shell Script

To set up the environment variables for distributed training (or single-GPU training), specify the following variables:
There are a series of hyperparameters to tune. Passing in ...
URL: https://qwen.readthedocs.io/en/latest/training/SFT/example.html

Suggested labels

#315: A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog
Similarity score: 0.87
- [ ] [A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog](https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b)
A comprehensive RAG Cheat Sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds. (high-resolution version)

```python
# load data
documents = SimpleDirectoryReader(input_dir="...").load_data()

# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex.from_documents(documents=documents)

# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

# Use your Default RAG
response = query_engine.query("A user's query")
```

Suggested labels
{ "key": "RAG-Building", "value": "Techniques and strategies for building advanced Retrieval Augmented Generation systems for language models" }

#647: Qwen-1.5-8x7B : r/LocalLLaMA
Similarity score: 0.87
- [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/)
TITLE: Qwen-1.5-8x7B : r/LocalLLaMA
DESCRIPTION: "Qwen-1.5-8x7B
New Model
Model: Link to Model
Dataset: Link to Dataset
Thread: I'm excited to release a project I've been working on the last couple of weeks.
Qwen1.5-8x7b: Link to Model
And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset
The purpose and intention behind this project is better detailed in the model/dataset card, but basically: I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card. I then trained Qwen1.5-7b on a 100k subset over 4 epochs. Took that and made a MoE using @maximelabonne's lazymergekit, utilizing a random gate and no base model. Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs.
Good news: Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests. Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.
Thank you to @teknium1, @jon_durbin, @erhartford, Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family. Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.
We're just getting started."
URL: Link to Reddit Post
Suggested labels
{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}

#625: unsloth/README.md at main · unslothai/unsloth
Similarity score: 0.87
- [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
- 🦥 Unsloth.ai News
- 🔗 Links and Resources
- ⭐ Key Features
- 🥇 Performance Benchmarking
Suggested labels

#317: Streaming-llm: Efficient Streaming Language Models with Attention Sinks
Similarity score: 0.87
- [ ] [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)

Efficient Streaming Language Models with Attention Sinks [[paper](http://arxiv.org/abs/2309.17453)] [[slides](assets/StreamingLLM.pdf)] [[video](https://youtu.be/hvJsEzP34o8)]

TL;DR: We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.

News
Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach, but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.

Usage

Environment Setup

```bash
conda create -yn streaming python=3.8
conda activate streaming
pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
python setup.py develop
```

Run Streaming Llama Chatbot

```bash
CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
```

FAQ
TODOs
We will release the code and data in the following order, please stay tuned!
Citation
If you find StreamingLLM useful or relevant to your project and research, please kindly cite our paper:

```bibtex
@article{xiao2023streamingllm,
title={Efficient Streaming Language Models with Attention Sinks},
author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
journal={arXiv},
year={2023}
}
```
Best way to add knowledge to an LLM: r/LocalLLaMA
DESCRIPTION: Studies like this one show GPT-4 gets 75% accuracy with prompting alone, 80% with RAG, 81% with finetuning, and 86% with RAG + finetuning combined. Other studies, like this one, say that just for knowledge retrieval from huge datasets, RAG is enough.
Kaggle's LLM Science Exam competition (link) had participants answer hard science questions. The winning solution showed Llama-2 70b with prompting gets 80%; with finetuning via SFT you get 86%; but with finetuning + RAG you get 93%. All had to undergo finetuning since the output was MMLU-style classification, i.e. output A, B, C, D etc. (so a classification problem).
I would use RAG as a first try to see if it can work. The open questions then are which embeddings and which vector database to use, plus chunk size, reranking, and so on (a rough sketch is below).
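A minimal sketch of the retrieval half of such a RAG setup, assuming sentence-transformers for embeddings and brute-force cosine similarity in place of a real vector database; the model name, chunk size, and corpus here are illustrative choices, not recommendations from the post:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# Toy corpus standing in for your real documents.
documents = ["...long document one...", "...long document two..."]
chunks = [c for doc in documents for c in chunk(doc)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the top-k chunks by cosine similarity to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved chunks get pasted into the LLM prompt as context.
context = "\n\n".join(retrieve("What does our internal doc say about pricing?"))
```

Swapping the brute-force dot product for a vector database and adding a reranker are exactly the knobs (embeddings, database, chunk size, reranking) the post alludes to.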
If you find RAG too annoying to set up, another approach is to shove your dataset in for finetuning. It'll become a text completion model, so you might need, say, GPT-4 to create some instructions from the dataset to "prime" your model (a sketch of that step follows).
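A minimal sketch of that priming step, assuming the OpenAI Python client and a list of raw text passages; the prompt wording, file names, and output format are assumptions, not something the post specifies:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_instruction_pair(passage: str) -> dict:
    """Ask GPT-4 to turn a raw passage into an instruction/response pair."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You write supervised finetuning data."},
            {"role": "user", "content": (
                "Write one question a user might ask that this passage answers, "
                "then answer it using only the passage.\n\nPassage:\n" + passage
            )},
        ],
    )
    return {"passage": passage, "pair": completion.choices[0].message.content}

raw_passages = ["...chunk of your own dataset..."]  # your knowledge source
with open("sft_pairs.jsonl", "w") as f:
    for p in raw_passages:
        f.write(json.dumps(make_instruction_pair(p)) + "\n")
```

The resulting jsonl can then be fed to whatever SFT script you use, such as the Qwen example above.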
So RAG definitely works, pushing accuracies from 75% to 80%. But with finetuning on top you get 86%. There are some bad theories spreading that finetuning does not inject new knowledge, but these studies and the Kaggle comp prove otherwise.
Likewise see Open Hermes, and any finetuned model - finetuning is just continued pretraining. The weights of the model are definitely being edited to account for more information.
I'm also the dev of Unsloth :) If you're going to do finetuning, I have a free Colab notebook to finetune Mistral 7b 2x faster with 70% less VRAM: Colab Notebook. (A rough outline of what that kind of notebook does is sketched below.)
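A rough sketch of the Unsloth + LoRA recipe such a notebook follows, assuming the unsloth and trl packages; the model name, LoRA rank, data file, and training arguments are illustrative defaults, not the notebook's exact settings:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Mistral 7B in 4-bit so it fits on a free Colab GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Instruction data, e.g. the GPT-4-generated pairs from the sketch above.
dataset = load_dataset("json", data_files="sft_pairs.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="pair",   # column holding the formatted text
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="mistral-7b-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()
```

The 4-bit load plus LoRA is where the speed and VRAM savings come from; the trained adapters can be merged back or exported afterwards.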
All in all, I would first try prompt engineering, then RAG, then finetuning, then RAG + finetuning as the final step.
URL: r/LocalLLaMA
Suggested labels
{'label-name': 'Knowledge-Enhancement-Techniques', 'label-description': 'Methods and tools used to improve knowledge acquisition in AI models.', 'gh-repo': 'llm,finetuning,dataset,RAG,embeddings,Research', 'confidence': 70.22}