Best way to add knowledge to an LLM : r/LocalLLaMA #665
Related issues

#643: I finally got perfect labels (classification task) via prompting : r/LocalLLaMA
Similarity score: 0.89
- [ ] [I finally got perfect labels (classification task) via prompting : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1amvfua/i_finally_got_perfect_labels_classification_task/)
TITLE: I finally got perfect labels (classification task) via prompting : r/LocalLLaMA
DESCRIPTION: "I finally got perfect labels (classification task) via prompting (Tutorial | Guide)
For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral
Below is the plug-n-play template I finalized/am using:

Instruction: Label the text based on this question: "{task}" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label.
Text: {few-shot example}
Input: Text: {Text for it to label}
Response:

For experimentation, I found that discrepancies are your best friend. My setup was:
Comparison between M8x7b-t0-s1000.csv and M8x7b-t1-s1000.csv: number of times M8x7b-t0 said "Yes" and M8x7b-t1 said "No": 100

That was actually the result of my first test, where I increased the number of few-shot examples from 5 to 19. Looking at this, I could tell that the update led to more negative labels. After checking, there were some correct labels but mostly just false negatives. This was super helpful because it's more feasible to examine 100 outputs than 1000... or 1 million... Eventually I got it down to this:

Comparison between M8x7b-t1-s1000.csv and M8x7b-t2-s1000.csv: number of times M8x7b-t1 said "Yes" and M8x7b-t2 said "No": 2

When I reviewed the output, filtering for these cases, it turned out that the second round of testing corrected all of the mislabels. Now is this perfect? After sampling instances where they agreed, it seems to be in order. I think there is something really special about this approach - by forcing overfitting, we can turn that into a feature instead of a bug. Working with the flaws of a model is a lot easier than trying to blindly iterate. At least here, we have a way to measure outputs against each other.
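A minimal sketch of that discrepancy check, assuming the two labeling runs were saved as CSVs with hypothetical `id` and `label` columns (the actual file layout isn't shown in the thread):

```python
import pandas as pd

# Two labeling runs over the same 1000 samples (column names are assumptions).
run_a = pd.read_csv("M8x7b-t0-s1000.csv")  # columns: id, label ("Yes"/"No")
run_b = pd.read_csv("M8x7b-t1-s1000.csv")

merged = run_a.merge(run_b, on="id", suffixes=("_t0", "_t1"))

# Rows where the first run said "Yes" but the second said "No" --
# the discrepancies worth reviewing by hand.
flipped = merged[(merged["label_t0"] == "Yes") & (merged["label_t1"] == "No")]
print(f'M8x7b-t0 "Yes" but M8x7b-t1 "No": {len(flipped)}')
flipped.to_csv("discrepancies_to_review.csv", index=False)
```

Reviewing only the flipped rows is what keeps the manual effort at ~100 examples instead of the full dataset.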
aichiusagi: For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral.

GeeBrain: Oh very interesting, what did this look like exactly? Could you give me an example? I'm thinking about fine-tuning BERT for classification after this round, since using Mixtral takes forever and is unrealistic when I want to process millions of data points. (See the sketch after this thread.)

[another commenter]: Can you please provide an example of an actual prompt?

GeeBrain: It's literally the template + whatever you want in the {}. But here ya go...

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
Instruction: Label the comment based on this question: "Does this comment share personal details, like how friends might talk to each other, and share from little to big things in their lives?" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a "Yes/No" statement first as the label. (Hint: If a comment merely expresses an opinion or admiration without any personal context or experience, label it as 'No'. But if the comment shares additional context about the commenter's life, it should be labeled as 'Yes'. The level of detail matters!)
Comment: Wow, you are so beautiful.
Input: Comment: "When he comes up?"
Response:

trapping_rainwater: What's your production use case for something like this?

GeeBrain: My project is around building an ML model that measures trust, kinda like a fandom score. But in general, I can see this type of setup being really helpful when you have a lot of unlabeled data and wanna get really close with it. Even though I'll likely end up fine-tuning BERT models in the future for production, this has helped me understand so much about the data space. Pretty fun"

Suggested labels
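GeeBrain mentions moving to a fine-tuned BERT classifier once the prompt-generated labels are good enough. A minimal sketch of that step with Hugging Face transformers, assuming a CSV of `text`/`label` pairs produced by the prompting loop above (the file name, column names, and hyperparameters are illustrative, not from the thread):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Labels distilled from the Mixtral prompting runs (hypothetical file/columns).
dataset = load_dataset("csv", data_files="mixtral_labels.csv")["train"]
dataset = dataset.map(lambda x: {"label": 1 if x["label"] == "Yes" else 0})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)
split = dataset.train_test_split(test_size=0.1)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-labeler", num_train_epochs=3,
                           per_device_train_batch_size=16, evaluation_strategy="epoch"),
    train_dataset=split["train"],
    eval_dataset=split["test"],
)
trainer.train()
```

Once trained, the small BERT model can label millions of comments far faster than running Mixtral on each one.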
#660: Qwen - supervised finetuning script and guide for SFT.
Similarity score: 0.87
- [ ] [Example - Qwen](https://qwen.readthedocs.io/en/latest/training/SFT/example.html)
DESCRIPTION:
In the following, we introduce more details about the usage of the script.

Installation

Data Preparation

```json
{
"type": "chatml",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Tell me something about large language models."
},
{
"role": "assistant",
"content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..."
}
],
"source": "unknown"
}
{
"type": "chatml",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is your name?"
},
{
"role": "assistant",
"content": "My name is Qwen."
}
],
"source": "self-made"
}
```

Above are two examples of data samples in the dataset. Each sample is a JSON object with the following fields: type, messages, and source. messages is required, while the others are optional for you to label your data format and data source. The messages field is a list of JSON objects, each of which has two fields: role and content. role can be system, user, or assistant. content is the text of the message. source is the source of the data, which can be self-made, alpaca, open-hermes, or any other string. To make the jsonl file, you can use json to save a list of dictionaries to the jsonl file:

```python
import json
with open('data.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')
```

Quickstart

```bash
cd examples/sft
bash finetune.sh -m <model_path> -d <data_path> --deepspeed <config_path> [--use_lora True] [--q_lora True]
```

Specify the model path, data path, and DeepSpeed config path as shown above.

Advanced Usages

Shell Script

To set up the environment variables for distributed training (or single-GPU training), specify the following variables:
There are a series of hyperparameters to tune. Passing in ...
URL: https://qwen.readthedocs.io/en/latest/training/SFT/example.html

Suggested labels

#315: A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog
Similarity score: 0.87
- [ ] [A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog](https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b)
A comprehensive RAG Cheat Sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds. (high-resolution version)

```python
# load data
documents = SimpleDirectoryReader(input_dir="...").load_data()

# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex.from_documents(documents=documents)

# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

# Use your Default RAG
response = query_engine.query("A user's query")
```

Suggested labels
{ "key": "RAG-Building", "value": "Techniques and strategies for building advanced Retrieval Augmented Generation systems for language models" }

#647: Qwen-1.5-8x7B : r/LocalLLaMA
Similarity score: 0.87
- [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/)
TITLE: Qwen-1.5-8x7B : r/LocalLLaMA
DESCRIPTION: "Qwen-1.5-8x7B
New Model
Model: Link to Model
Dataset: Link to Dataset
Thread: I'm excited to release a project I've been working on the last couple of weeks.
Qwen1.5-8x7b: Link to Model
And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: Link to Dataset
The purpose and intention behind this project is better detailed in the model/dataset card, but basically: I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card. I then trained Qwen1.5-7b on a 100k subset over 4 epochs. Took that and made a MoE using @maximelabonne's lazymergekit, utilizing a random gate and no base model. Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs.
Good news: Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests. Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training.
Thank you to @teknium1, @jon_durbin, @erhartford, Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @mistralai for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family. Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity.
We're just getting started."
URL: Link to Reddit Post
Suggested labels
{'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}

#625: unsloth/README.md at main · unslothai/unsloth
Similarity score: 0.87
- [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

✨ Finetune for Free
All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.
- 🦥 Unsloth.ai News
- 🔗 Links and Resources
- ⭐ Key Features
- 🥇 Performance Benchmarking
Suggested labels

#317: Streaming-llm: Efficient Streaming Language Models with Attention Sinks
Similarity score: 0.87
- [ ] [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)

Efficient Streaming Language Models with Attention Sinks [[paper](http://arxiv.org/abs/2309.17453)] [[slides](assets/StreamingLLM.pdf)] [[video](https://youtu.be/hvJsEzP34o8)]

TL;DR: We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.

News
Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach, but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup.

Usage

Environment Setup

```bash
conda create -yn streaming python=3.8
conda activate streaming
pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
python setup.py develop
```

Run Streaming Llama Chatbot

```bash
CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
```

FAQ
TODOs
We will release the code and data in the following order, please stay tuned!
Citation
If you find StreamingLLM useful or relevant to your project and research, please kindly cite our paper:

```bibtex
@article{xiao2023streamingllm,
title={Efficient Streaming Language Models with Attention Sinks},
author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike},
journal={arXiv},
year={2023}
}
```
Best way to add knowledge to an LLM: r/LocalLLaMA
DESCRIPTION: Studies like this one show GPT-4 gets 75% accuracy with prompting alone, 80% with RAG, 81% with finetuning, and 86% with RAG + finetuning combined. Other studies, like this one, say that just for knowledge retrieval from huge datasets, RAG is enough.
Kaggle's LLM Science Exam competition (link) had participants answer hard science questions. The winning solution showed Llama-2 70b with prompting gets 80%; with finetuning via SFT you get 86%; but with finetuning + RAG you get 93%. All had to undergo finetuning since the output was MMLU-style classification, i.e. output A, B, C, D etc. (so a classification problem).
I would use RAG as a first try to see if it can work. The open questions then are which embeddings and which vector database to use, plus chunk size, reranking, and so on (a rough sketch is below).
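A minimal sketch of the retrieval half of such a RAG setup, assuming sentence-transformers for embeddings and brute-force cosine similarity in place of a real vector database; the model name, chunk size, and corpus here are illustrative choices, not recommendations from the post:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

# Toy corpus standing in for your real documents.
documents = ["...long document one...", "...long document two..."]
chunks = [c for doc in documents for c in chunk(doc)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the top-k chunks by cosine similarity to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q  # cosine similarity, since vectors are normalized
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved chunks get pasted into the LLM prompt as context.
context = "\n\n".join(retrieve("What does our internal doc say about pricing?"))
```

Swapping the brute-force dot product for a vector database and adding a reranker are exactly the knobs (embeddings, database, chunk size, reranking) the post alludes to.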
If you find RAG too annoying to set up, another approach is to shove your dataset in for finetuning. It'll become a text completion model, so you might need, say, GPT-4 to create some instructions from the dataset to "prime" your model (a sketch of that step follows).
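A minimal sketch of that priming step, assuming the OpenAI Python client and a list of raw text passages; the prompt wording, file names, and output format are assumptions, not something the post specifies:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_instruction_pair(passage: str) -> dict:
    """Ask GPT-4 to turn a raw passage into an instruction/response pair."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You write supervised finetuning data."},
            {"role": "user", "content": (
                "Write one question a user might ask that this passage answers, "
                "then answer it using only the passage.\n\nPassage:\n" + passage
            )},
        ],
    )
    return {"passage": passage, "pair": completion.choices[0].message.content}

raw_passages = ["...chunk of your own dataset..."]  # your knowledge source
with open("sft_pairs.jsonl", "w") as f:
    for p in raw_passages:
        f.write(json.dumps(make_instruction_pair(p)) + "\n")
```

The resulting jsonl can then be fed to whatever SFT script you use, such as the Qwen example above.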
So RAG definitely works, pushing accuracies from 75% to 80%. But with finetuning on top you get 86%. There are some bad theories spreading that finetuning does not inject new knowledge, but these studies and the Kaggle comp prove otherwise.
Likewise see Open Hermes, and any finetuned model - finetuning is just continued pretraining. The weights of the model are definitely being edited to account for more information.
I'm also the dev of Unsloth :) If you're going to do finetuning, I have a free Colab notebook to finetune Mistral 7b 2x faster with 70% less VRAM: Colab Notebook. (A rough outline of what that kind of notebook does is sketched below.)
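A rough sketch of the Unsloth + LoRA recipe such a notebook follows, assuming the unsloth and trl packages; the model name, LoRA rank, data file, and training arguments are illustrative defaults, not the notebook's exact settings:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load Mistral 7B in 4-bit so it fits on a free Colab GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices get trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Instruction data, e.g. the GPT-4-generated pairs from the sketch above.
dataset = load_dataset("json", data_files="sft_pairs.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="pair",   # column holding the formatted text
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="mistral-7b-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
    ),
)
trainer.train()
```

The 4-bit load plus LoRA is where the speed and VRAM savings come from; the trained adapters can be merged back or exported afterwards.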
All in all, I would first try prompt engineering, then RAG, then finetuning, then RAG + finetuning as the final step.
URL: r/LocalLLaMA
Suggested labels
{'label-name': 'Knowledge-Enhancement-Techniques', 'label-description': 'Methods and tools used to improve knowledge acquisition in AI models.', 'gh-repo': 'llm,finetuning,dataset,RAG,embeddings,Research', 'confidence': 70.22}