docs: add fedkseed (#47) #59

Merged: 1 commit merged on Feb 29, 2024
389 changes: 389 additions & 0 deletions doc/tutorial/fedkseed/fedkseed-example.ipynb
@@ -0,0 +1,389 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Federated Tuning with FedKSeed methods in FATE-LLM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we will demonstrate how to efficiently train federated large language models using the FATE-LLM framework. In FATE-LLM, we introduce the \"FedKSeed\" module, specifically designed for federated learning with large language models. The Idea of FedKSeed is to use Zeroth-Order-Optimizer to optimize model along given direction that generated with random seed. This method can be used to train large language models in a federated learning setting with extremely low communication cost.\n",
"\n",
"The Algorithm is based on the paper: [Federated Full-Parameter Tuning of Billion-Sized Language Models\n",
"with Communication Cost under 18 Kilobytes](https://arxiv.org/pdf/2312.06353.pdf) and the code is modified from the https://github.com/alibaba/FederatedScope/tree/FedKSeed. We refactor the code to make it more compatible with (transformers/PyTorch) framework and integrate it into the FATE-LLM framework.\n",
"\n",
"The main works include:\n",
"1. An KSeedZerothOrderOptimizer class that can be used to optimize model along given direction that generated with random seed.\n",
"2. An KSeedZOExtendedTrainer subclass of Trainer from transformers that can be used to train large language models with KSeedZerothOrderOptimizer.\n",
"3. Trainers for federated learning with large language models.\n",
"\n",
"In this tutorial, we will demonstrate how to use the FedKSeed method to train a large language model in a federated learning setting. "
]
},
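{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build intuition for how a seed-based zeroth-order update works, here is a minimal, self-contained sketch on a flat CPU parameter vector. It is an illustrative toy under stated assumptions (the names `kseed_zo_step`, `flat_params`, and `loss_fn` are made up for this sketch), not the actual `KSeedZerothOrderOptimizer`, which operates on the transformer modules inside the Trainer loop:\n",
"\n",
"```python\n",
"import torch\n",
"\n",
"def kseed_zo_step(flat_params, loss_fn, seed, eps=1e-3, lr=1e-5):\n",
"    # Regenerate the same random direction from the seed alone (CPU toy).\n",
"    gen = torch.Generator().manual_seed(seed)\n",
"    z = torch.randn(flat_params.shape, generator=gen)\n",
"    # Two forward passes give a finite-difference estimate of the\n",
"    # directional derivative along z (a single scalar).\n",
"    grad_proj = (loss_fn(flat_params + eps * z) - loss_fn(flat_params - eps * z)) / (2 * eps)\n",
"    # Only (seed, grad_proj) needs to be communicated; the direction z\n",
"    # can be regenerated anywhere from the seed.\n",
"    return flat_params - lr * grad_proj * z, grad_proj\n",
"```\n",
"\n",
"Because each update is fully described by a `(seed, scalar)` pair, FedKSeed parties can exchange these pairs instead of full model weights or gradients, which is where the communication saving comes from."
]
},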
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model: datajuicer/LLaMA-1B-dj-refine-150B\n",
"\n",
"This is the introduction from the Huggingface model hub: [datajuicer/LLaMA-1B-dj-refine-150B](https://huggingface.co/datajuicer/LLaMA-1B-dj-refine-150B)\n",
"\n",
"> The model architecture is LLaMA-1.3B and we adopt the OpenLLaMA implementation. The model is pre-trained on 150B tokens of Data-Juicer's refined RedPajama and Pile. It achieves an average score of 34.21 over 16 HELM tasks, beating Falcon-1.3B (trained on 350B tokens from RefinedWeb), Pythia-1.4B (trained on 300B tokens from original Pile) and Open-LLaMA-1.3B (trained on 150B tokens from original RedPajama and Pile).\n",
"\n",
"> For more details, please refer to our [paper](https://arxiv.org/abs/2309.02033).\n"
]
},
{
"cell_type": "code",
"outputs": [],
"source": [
"# model_name_or_path = \"datajuicer/LLaMA-1B-dj-refine-150B\"\n",
"model_name_or_path = \"gpt2\""
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-29T09:27:23.512735Z",
"start_time": "2024-02-29T09:27:23.508790Z"
}
},
"execution_count": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset: databricks/databricks-dolly-15k\n",
"\n",
"This is the introduction from the Huggingface dataset hub: [databricks/databricks-dolly-15k](https://huggingface.co/dataset/databricks/databricks-dolly-15k)\n",
"\n",
"> databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category\n",
"\n",
"To use this dataset, you first need to download it from the Huggingface dataset hub:\n",
"\n",
"```bash\n",
"mkdir -p ../../../examples/data/dolly && cd ../../../examples/data/dolly && wget wget https://huggingface.co/datasets/databricks/databricks-dolly-15k/resolve/main/databricks-dolly-15k.jsonl\\?download\\=true -O databricks-dolly-15k.jsonl\n",
"```\n",
"\n",
"### Check Dataset"
]
},
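{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before loading the data through FATE-LLM's `Dolly15K` dataset class, you can optionally peek at the raw jsonl file to confirm that the download succeeded. This is only a quick sanity check, not part of the FATE-LLM API; the path matches the one used in the wget command above:\n",
"\n",
"```python\n",
"import json\n",
"\n",
"# Read the first record of the downloaded dataset and list its fields.\n",
"with open('../../../examples/data/dolly/databricks-dolly-15k.jsonl') as f:\n",
"    first_record = json.loads(f.readline())\n",
"print(list(first_record.keys()))  # expect: instruction, context, response, category\n",
"```"
]
},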
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-29T09:27:26.987779Z",
"start_time": "2024-02-29T09:27:24.706218Z"
}
},
"outputs": [],
"source": [
"from fate_llm.dataset.hf_dataset import Dolly15K\n",
"from transformers import AutoTokenizer\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)\n",
"special_tokens = tokenizer.special_tokens_map\n",
"if \"pad_token\" not in tokenizer.special_tokens_map:\n",
" special_tokens[\"pad_token\"] = special_tokens[\"eos_token\"]\n",
"\n",
"tokenizer.pad_token = tokenizer.eos_token\n",
"ds = Dolly15K(split=\"train\", tokenizer_params={\"pretrained_model_name_or_path\": model_name_or_path, **special_tokens},\n",
" tokenizer_apply_params=dict(truncation=True, max_length=tokenizer.model_max_length, padding=\"max_length\", return_tensors=\"pt\"))\n",
"ds = ds.load('../../../examples/data/dolly')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-29T09:27:27.875025Z",
"start_time": "2024-02-29T09:27:27.867839Z"
}
},
"outputs": [
{
"data": {
"text/plain": "Dataset({\n features: ['instruction', 'context', 'response', 'category', 'text', 'input_ids', 'attention_mask'],\n num_rows: 15011\n})"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more details of FATE-LLM dataset setting, we recommend that you read through these tutorials first: [NN Dataset Customization](https://github.com/FederatedAI/FATE/blob/master/doc/tutorial/pipeline/nn_tutorial/Homo-NN-Customize-your-Dataset.ipynb), [Some Built-In Dataset](https://github.com/FederatedAI/FATE/blob/master/doc/tutorial/pipeline/nn_tutorial/Introduce-Built-In-Dataset.ipynb),"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check local training\n",
"\n",
"Before submitting a federated learning task, we will demonstrate how to perform local testing to ensure the proper functionality of your custom dataset, model. "
]
},
{
"cell_type": "code",
"outputs": [],
"source": [
"from transformers import AutoModelForCausalLM, TrainingArguments, DataCollatorForLanguageModeling\n",
"from fate_llm.fedkseed.trainer import KSeedZOExtendedTrainer, KSeedTrainingArguments\n",
"from fate_llm.fedkseed.zo_utils import build_seed_candidates, get_even_seed_probabilities\n",
"\n",
"def test_training(zo_mode=True):\n",
" tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path, **special_tokens)\n",
" data_collector = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)\n",
" model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_name_or_path)\n",
"\n",
" training_args = TrainingArguments(output_dir='./',\n",
" dataloader_num_workers=1,\n",
" dataloader_prefetch_factor=1,\n",
" remove_unused_columns=True,\n",
" learning_rate=1e-5,\n",
" per_device_train_batch_size=1,\n",
" num_train_epochs=0.01,\n",
" )\n",
" kseed_args = KSeedTrainingArguments(zo_optim=zo_mode)\n",
" trainer = KSeedZOExtendedTrainer(model=model, train_dataset=ds, training_args=training_args, kseed_args=kseed_args,\n",
" tokenizer=tokenizer, data_collator=data_collector)\n",
" if zo_mode:\n",
" seed_candidates = build_seed_candidates(k=kseed_args.k)\n",
" seed_probabilities = get_even_seed_probabilities(k=kseed_args.k)\n",
" trainer.configure_seed_candidates(seed_candidates, seed_probabilities)\n",
" return trainer.train()"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-29T09:38:33.175079Z",
"start_time": "2024-02-29T09:38:33.168844Z"
}
},
"execution_count": 16
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-29T09:39:37.602070Z",
"start_time": "2024-02-29T09:38:34.024223Z"
}
},
"outputs": [
{
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "\n <div>\n \n <progress value='151' max='151' style='width:300px; height:20px; vertical-align: middle;'></progress>\n [151/151 00:59, Epoch 0/1]\n </div>\n <table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: left;\">\n <th>Step</th>\n <th>Training Loss</th>\n </tr>\n </thead>\n <tbody>\n </tbody>\n</table><p>"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "TrainOutput(global_step=151, training_loss=1.2660519429390005, metrics={'train_runtime': 61.8249, 'train_samples_per_second': 2.428, 'train_steps_per_second': 2.442, 'total_flos': 78910193664000.0, 'train_loss': 1.2660519429390005, 'epoch': 0.01})"
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_training(zo_mode=True)"
]
},
{
"cell_type": "code",
"outputs": [
{
"data": {
"text/plain": "<IPython.core.display.HTML object>",
"text/html": "\n <div>\n \n <progress value='151' max='151' style='width:300px; height:20px; vertical-align: middle;'></progress>\n [151/151 01:29, Epoch 0/1]\n </div>\n <table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: left;\">\n <th>Step</th>\n <th>Training Loss</th>\n </tr>\n </thead>\n <tbody>\n </tbody>\n</table><p>"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "TrainOutput(global_step=151, training_loss=0.6093456950408733, metrics={'train_runtime': 92.6158, 'train_samples_per_second': 1.621, 'train_steps_per_second': 1.63, 'total_flos': 78910193664000.0, 'train_loss': 0.6093456950408733, 'epoch': 0.01})"
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_training(zo_mode=False)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-02-29T09:41:28.949449Z",
"start_time": "2024-02-29T09:39:54.802705Z"
}
},
"execution_count": 18
},
{
"cell_type": "markdown",
"source": [
"You can see that Zeroth-Order-Optimizer has much worse performance than AdamW, that's the price we need to pay for the low communication cost. "
],
"metadata": {
"collapsed": false
}
},
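{
"cell_type": "markdown",
"source": [
"The saving itself is easy to estimate: roughly speaking, only the scalar gradients associated with the K candidate seeds (plus the seeds, which are fixed integers known in advance) move between parties each round, instead of full model weights. K is exposed as `k` on `KSeedTrainingArguments`, as used in `build_seed_candidates(k=kseed_args.k)` above. A back-of-the-envelope calculation, where K = 4096 and 4-byte floats are assumptions for illustration rather than FATE-LLM defaults:\n",
"\n",
"```python\n",
"k = 4096                # number of candidate seeds (assumed value)\n",
"bytes_per_scalar = 4    # one float32 accumulated gradient per seed\n",
"payload_bytes = k * bytes_per_scalar\n",
"print(payload_bytes)    # 16384 bytes, i.e. roughly 16 KB per round\n",
"```\n",
"\n",
"That is in line with the \"under 18 Kilobytes\" figure in the paper title, and orders of magnitude smaller than shipping the full weights of even a 1B-parameter model."
],
"metadata": {
"collapsed": false
}
},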
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Submit Federated Task\n",
"Once you have successfully completed local testing, We can submit a task to FATE. Please notice that this tutorial is ran on a standalone version. **Please notice that in this tutorial we are using a standalone version, if you are using a cluster version, you need to bind the data with the corresponding name&namespace on each machine.**\n",
"\n",
"In this example we load pretrained weights for gpt2 model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"from fate_client.pipeline.components.fate.reader import Reader\n",
"from fate_client.pipeline import FateFlowPipeline\n",
"from fate_client.pipeline.components.fate.homo_nn import HomoNN, get_config_of_seq2seq_runner\n",
"from fate_client.pipeline.components.fate.nn.algo_params import TrainingArguments, FedAVGArguments\n",
"from fate_client.pipeline.components.fate.nn.loader import LLMModelLoader, LLMDatasetLoader, LLMDataFuncLoader\n",
"\n",
"guest = '10000'\n",
"host = '10000'\n",
"arbiter = '10000'\n",
"\n",
"epochs = 0.01\n",
"batch_size = 1\n",
"lr = 1e-5\n",
"\n",
"pipeline = FateFlowPipeline().set_parties(guest=guest, arbiter=arbiter)\n",
"pipeline.bind_local_path(path=\"/data/projects/fate/examples/data/dolly\", namespace=\"experiment\",\n",
" name=\"dolly\")\n",
"time.sleep(5)\n",
"\n",
"reader_0 = Reader(\"reader_0\", runtime_parties=dict(guest=guest, host=host))\n",
"reader_0.guest.task_parameters(\n",
" namespace=\"experiment\",\n",
" name=\"dolly\"\n",
")\n",
"reader_0.hosts[0].task_parameters(\n",
" namespace=\"experiment\",\n",
" name=\"dolly\"\n",
")\n",
"\n",
"tokenizer_params = dict(\n",
" pretrained_model_name_or_path=\"gpt2\",\n",
" trust_remote_code=True,\n",
")\n",
"conf = get_config_of_seq2seq_runner(\n",
" algo='fedkseed',\n",
" model=LLMModelLoader(\n",
" \"hf_model\",\n",
" \"HFAutoModelForCausalLM\",\n",
" # pretrained_model_name_or_path=\"datajuicer/LLaMA-1B-dj-refine-150B\",\n",
" pretrained_model_name_or_path=\"gpt2\",\n",
" trust_remote_code=True\n",
" ),\n",
" dataset=LLMDatasetLoader(\n",
" \"hf_dataset\",\n",
" \"Dolly15K\",\n",
" split=\"train\",\n",
" tokenizer_params=tokenizer_params,\n",
" tokenizer_apply_params=dict(\n",
" truncation=True,\n",
" max_length=1024,\n",
" )),\n",
" data_collator=LLMDataFuncLoader(\n",
" \"cust_func.cust_data_collator\",\n",
" \"get_seq2seq_tokenizer\",\n",
" tokenizer_params=tokenizer_params,\n",
" ),\n",
" training_args=TrainingArguments(\n",
" num_train_epochs=0.01,\n",
" per_device_train_batch_size=batch_size,\n",
" remove_unused_columns=True,\n",
" learning_rate=lr,\n",
" fp16=False,\n",
" use_cpu=False,\n",
" disable_tqdm=False,\n",
" use_mps_device=True,\n",
" ),\n",
" fed_args=FedAVGArguments(),\n",
" task_type='causal_lm',\n",
" save_trainable_weights_only=True,\n",
")\n",
"\n",
"conf[\"fed_args_conf\"] = {}\n",
"\n",
"homo_nn_0 = HomoNN(\n",
" 'nn_0',\n",
" runner_conf=conf,\n",
" train_data=reader_0.outputs[\"output_data\"],\n",
" runner_module=\"fedkseed_runner\",\n",
" runner_class=\"FedKSeedRunner\",\n",
")\n",
"\n",
"pipeline.add_tasks([reader_0, homo_nn_0])\n",
"pipeline.conf.set(\"task\", dict(engine_run={\"cores\": 1}))\n",
"\n",
"pipeline.compile()\n",
"pipeline.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use this script to submit the model, but submitting the model will take a long time to train and generate a long log, so we won't do it here."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}