
feat: change tracker API to initialize tracker early and track additional metrics. #50

Closed
wants to merge 13 commits

Conversation

dushyantbehl (Contributor) commented on Feb 19, 2024

  1. This PR adds a tracker API with Aimstack as the default tracker. It is a simple plug-and-play architecture which can support multiple trackers (see the interface sketch after the example invocation below).
  2. The tracker config is now taken as command line arguments, making it easier for any automation to pass tracker arguments.
  3. With the new API I have added support for tracking any additional metrics of interest.
  4. As an example, I have added a single line to track model_load_time in Aim, possibly fixing "Add support for collecting metrics programmatically" #33.
  5. Also bumps the Aim version to 3.18.1, a newer and more stable release. See "Bump aim from 3.17.5 to 3.18.1" #42.

Example of how the new API can be invoked:

torchrun --nnodes=1 --nproc_per_node=2 --master_port=1234 tuning/sft_trainer.py \
    --tokenizer_name_or_path ${MODEL_PATH} --model_name_or_path ${MODEL_PATH} --data_path ${DATA_PATH} \
    --use_peft --bf16 True --output_dir ${OUTPUT_PATH} --num_train_epochs 1 \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 \
    --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 \
    --fsdp "full_shard auto_wrap" --fsdp_config tuning/config/fsdp_config.json \
    --response_template "\n### Response:" --dataset_text_field "output" \
    --tracker aim --aim_repo /data/aim --experiment sft-llama7b-test
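For readers new to the code, here is a minimal sketch of what such a plug-and-play tracker interface could look like. The class names, method names, and config fields below are illustrative assumptions rather than the PR's actual API; only the "aim" tracker name and get_tracker mirror what appears later in this conversation.

from abc import ABC, abstractmethod

class Tracker(ABC):
    """Illustrative base class: each experiment tracker plugs in by implementing this."""

    def __init__(self, config=None):
        self.config = config

    @abstractmethod
    def get_hf_callback(self):
        """Return a transformers TrainerCallback wired to this tracker."""

    @abstractmethod
    def track(self, name, value):
        """Record an additional metric, e.g. track("model_load_time", 12.3)."""

class AimStackTracker(Tracker):
    """Hypothetical Aimstack implementation wrapping Aim's HuggingFace callback."""

    def get_hf_callback(self):
        from aim.hugging_face import AimCallback  # imported lazily
        return AimCallback(repo=self.config.aim_repo, experiment=self.config.experiment)

    def track(self, name, value):
        ...  # e.g. log the value on the underlying Aim run

# A small registry is what makes the architecture plug and play.
REGISTERED_TRACKERS = {"aim": AimStackTracker}

def get_tracker(name, config):
    return REGISTERED_TRACKERS[name](config)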

dushyantbehl (Contributor, Author) commented:

cc @VassilisVassiliadis, comments are welcome.

Tracker now takes command line arguments as config.
Aimstack is the default tracker, and the code contains an example of seamlessly tracking
additional metrics, like 'model_load_time', in Aimstack.

dushyantbehl changed the title from "Generic Tracker API with command line arguments." to "Change tracker API to initialize tracker early and track additional metrics." on Feb 20, 2024
Fix duplicate argument

dushyantbehl changed the title from "Change tracker API to initialize tracker early and track additional metrics." to "feat: change tracker API to initialize tracker early and track additional metrics." on Feb 21, 2024
VassilisVassiliadis (Contributor) left a comment:


I was thinking of something along these lines (I need to double-check the indentation, as I used the web interface to suggest the changes).

[Two suggested changes on tuning/sft_trainer.py, now outdated and resolved]
FSDP.

Add tracker factory.

dushyantbehl (Contributor, Author) commented:

@VassilisVassiliadis the design and some functions have changed due to inconsistencies with FSDP runs. I feel most of your asks are incorporated now; could you check again and let me know if you have any other questions?

This argument, --exp_metadata '{"gpu_model": "A100-80GB", "dataset_id": "my-amazing-dataset"}', passed to the run will result in the parameters being recorded like this:

[screenshot of the recorded run parameters]
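A sketch of how such a flag might be handled: parse the JSON string into a dict and hand the key/value pairs to the tracker as run parameters. Only --exp_metadata itself comes from the PR; parse_exp_metadata and the set_params call below are hypothetical names used for illustration.

import json

def parse_exp_metadata(raw):
    """Parse the --exp_metadata JSON string into a dict of run parameters."""
    if raw is None:
        return {}
    metadata = json.loads(raw)
    if not isinstance(metadata, dict):
        raise ValueError('--exp_metadata must be a JSON object, '
                         'e.g. \'{"gpu_model": "A100-80GB"}\'')
    return metadata

# Hypothetical usage: attach the parsed metadata to the tracked run.
# tracker.set_params(parse_exp_metadata(args.exp_metadata), name="additional_metadata")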

VassilisVassiliadis (Contributor) commented:

Thanks! This looks good to me.

ashokponkumar (Collaborator) commented:

@alex-jw-brooks PTAL

def train(
    model_args: configs.ModelArguments,
    data_args: configs.DataArguments,
    train_args: configs.TrainingArguments,
    peft_config: Optional[
        Union[peft_config.LoraConfig, peft_config.PromptTuningConfig]
    ] = None,
    callbacks: Optional[List[TrainerCallback]] = None,
    tracker: Optional[Tracker] = None,
Collaborator commented:

Since the train function originally accepts only configurations by design, I feel we need a strong reason to allow it to also take objects. The more natural way would be to accept one "tracker config" input.

Collaborator commented:

Using a YAML file as config was the first thought that came to mind too. But given that HF has chosen to use arguments for most configurations, we went with that as a design choice. If in fms-hf-tuning we choose to use config files for configs, we can do that too.
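Following that HF pattern, tracker options can be expressed as a dataclass and parsed from the command line with transformers.HfArgumentParser. The sketch below is illustrative only; the field names are chosen to match the flags in the example invocation above and may differ from the PR's actual dataclass.

from dataclasses import dataclass, field
from typing import Optional
from transformers import HfArgumentParser

@dataclass
class AimConfig:
    # Experiment name shown in the tracker UI.
    experiment: Optional[str] = field(default=None)
    # Path to the local Aim repository, e.g. /data/aim.
    aim_repo: Optional[str] = field(default=None)

parser = HfArgumentParser((AimConfig,))
(aim_config,) = parser.parse_args_into_dataclasses(
    ["--aim_repo", "/data/aim", "--experiment", "sft-llama7b-test"]
)
print(aim_config.aim_repo, aim_config.experiment)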

Contributor (Author) commented:

The train function expects callbacks which need to be associated with the training, hence main takes the config and initializes the tracker (which is needed for the callback).
The tracker here is the tracker object which the train function needs in order to track any extra metrics and metadata, hence the design choice to pass the tracker separately.

callbacks = [peft_saving_callback, file_logger_callback]

# Initialize the tracker
tracker = get_tracker(tracker_name, tracker_config)
Collaborator commented:

Why not just have a tracker_install_callbacks function somewhere, to reduce these 3 lines into 1?
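A sketch of what the suggested helper could look like; tracker_install_callbacks is the reviewer's proposed name, and the body below is a guess at its shape, reusing the hypothetical get_tracker/get_hf_callback names from the earlier sketch.

def tracker_install_callbacks(tracker_name, tracker_config, callbacks):
    """Initialize the requested tracker and append its HF callback in one call."""
    tracker = get_tracker(tracker_name, tracker_config)
    callback = tracker.get_hf_callback()
    if callback is not None:
        callbacks.append(callback)
    return tracker

# usage: tracker = tracker_install_callbacks(tracker_name, tracker_config, callbacks)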

Collaborator commented:

Agreed, here too it's a design choice we have to make. We have tried to keep it consistent with how config is managed in fms-hf-tuning, following the pattern used in the peft config block above. But if we choose a different approach we can follow that.

Contributor (Author) commented:

It's a simple choice between performing things explicitly and running the code in some other function that quietly falls back to no callback under the hood. It does not affect the functionality. If you feel strongly that a meta function would help, we can make that change.

[Resolved review threads on tuning/config/configs.py and tuning/sft_trainer.py]

requirements.txt (outdated diff)
@@ -2,7 +2,7 @@ numpy
 accelerate>=0.20.3
 transformers>=4.34.1
 torch
-aim==3.17.5
+aim==3.18.1
A reviewer commented:

Since there's a desire to implement multiple trackers, did we want to make the dependency (and imports) optional, used only when available and configured?

Contributor (Author) commented:

Can be done, but then do we still list these in requirements.txt?

Or do we throw an error and ask the user to install the required tracker before importing it in the code?
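One common pattern, sketched under the assumption that the factory from the earlier sketch exists (this is not necessarily what the PR settled on): keep the aim import inside the factory and raise a clear error when the package is missing, so the dependency can be dropped from the hard requirements.

def get_tracker(name, config):
    if name == "aim":
        try:
            # Import only when the Aim tracker is actually requested.
            from aim.hugging_face import AimCallback  # noqa: F401
        except ImportError as err:
            raise RuntimeError(
                "The 'aim' tracker was requested but the aim package is not installed. "
                "Install it (for example 'pip install aim') or choose a different tracker."
            ) from err
        return AimStackTracker(config)  # hypothetical class from the earlier sketch
    raise ValueError(f"Unknown tracker '{name}'")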

dushyantbehl (Contributor, Author) commented:

I have split this PR into two and opened them separately as #89 and #90

Not closing this right away due to the comments.

dushyantbehl (Contributor, Author) commented:

#89 and #90 are the split versions of this PR. We will track them for now.
