
Enable loading ckpt for t0 finetuning #309

Open · wants to merge 16 commits into main
Conversation

Muennighoff (Collaborator):
No description provided.

megatron/utils.py (resolved)
megatron/utils.py (resolved)
[0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0]]]]
Member:

Suggested change:
-[0, 0, 0, 0, 0, 0, 0]]]]
+[0, 0, 0, 0, 0, 0, 1]]]]

Muennighoff (Collaborator, author):

I don't think there should be a 1, because the last row & column are 100% padding.

Member:

Hmm, I'm wondering if this doesn't screw something up. Essentially you're going to compute a softmax on a row with only zeros...

Muennighoff (Collaborator, author):

The last row & last column are the attention scores of the last token with respect to the last token. Since the last token is masked out in our loss_mask, it doesn't matter, I think.
Also, it's a row with only -inf, no?

Member:

No, you compute a softmax; what should the result of a softmax over a row full of masked-out values be? It feels like that would return lots of NaNs.

Muennighoff (Collaborator, author):

Don't we fill it with -inf?
And the softmax of a row where all values are the same is just 1/n, no? Where would it cause NaNs?

thomasw21 (Member), Jul 12, 2022:

You can try writing a test, but I'm pretty sure the actual results are 0 (with the current kernel).
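
A quick check in plain PyTorch (not part of this PR, and not the fused kernel Megatron-DeepSpeed actually uses) illustrates both sides of this thread: with a -inf fill the vanilla softmax returns NaNs, while a large-but-finite fill gives the uniform 1/n row; the fused masked-softmax kernel may instead zero out such a row, as suggested above.

import torch

# Fully masked row filled with -inf: exp(-inf) = 0 everywhere, so the
# normalizer is 0 and vanilla softmax produces NaNs.
row = torch.full((1, 7), float("-inf"))
print(torch.softmax(row, dim=-1))   # tensor([[nan, nan, nan, nan, nan, nan, nan]])

# Fully masked row filled with a large negative finite value: all entries
# are equal, so the softmax is uniform (1/7 per position).
row = torch.full((1, 7), -10000.0)
print(torch.softmax(row, dim=-1))   # tensor([[0.1429, ..., 0.1429]])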

tools/preprocess_data.py (outdated, resolved)
finetune_t0_non_causal_decoder.py (outdated, resolved)
 loss_mask *= loss_on_targets_only * loss_on_non_pad_only

 attention_mask = get_packed_attention_mask(
-    # Run non-causal decoder
-    is_causal=False,
-    causal_mask=~(causal_mask.bool()),
+    is_causal=True,
Member:

let's rename this file finetune_t0_causal_decoder then

Muennighoff (Collaborator, author):

What about just finetune_t0.py?

Member:

Right, but do we hardcode this every time? I'd rather have this one be the script for the causal decoder.

Muennighoff (Collaborator, author):

Added an argument prefixlm
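
For context on the mask rows quoted above, here is a minimal sketch of how a packed attention mask can be built from segment ids; the real implementation is get_packed_attention_mask in megatron/utils.py, and the helper name, signature, and 1-means-attend convention below are assumptions made for illustration only.

import torch

# Hypothetical sketch, not the PR's get_packed_attention_mask: a token may only
# attend within its own packed segment, padding (segment id 0) attends to
# nothing, and an optional causal (lower-triangular) constraint is added on top.
def packed_attention_mask_sketch(segment_ids: torch.Tensor, is_causal: bool) -> torch.Tensor:
    seq_len = segment_ids.size(1)
    same_segment = segment_ids.unsqueeze(1) == segment_ids.unsqueeze(2)              # [b, s, s]
    not_padding = (segment_ids != 0).unsqueeze(1) & (segment_ids != 0).unsqueeze(2)  # [b, s, s]
    mask = same_segment & not_padding
    if is_causal:
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=segment_ids.device))
        mask = mask & causal
    return mask.unsqueeze(1)  # [b, 1, s, s], 1 = may attend

# Two packed sequences of length 3 plus one padding position; the last four
# rows of the printed mask match the rows quoted earlier in this thread,
# including the all-zero final (padding) row.
segment_ids = torch.tensor([[1, 1, 1, 2, 2, 2, 0]])
print(packed_attention_mask_sketch(segment_ids, is_causal=True).int())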

finetune_t0_non_causal_decoder.py (outdated, resolved)
megatron/model/gpt_model.py (outdated, resolved)
megatron/model/gpt_model.py (outdated, resolved)

* Tmp lossseq

* Efficient loss normalization

* Reuse variable

* Simplify division

* Add norm_target_loss arg

* Clarify loss on targets & remove kwarg

* Loss mask is already float

* Move norm to batch pipe

* Reshape loss mask

* Move view
thomasw21 (Member) left a comment:

Nice work! Some things I think shouldn't be in this PR.

    group.add_argument('--reweight-loss-based-on-position-frequency', action="store_true",
                       help='Some objectives require us to sample loss_mask. This might introduce bias towards '
                            'specific positions. This option tries to un-bias the loss by reweighting loss on specific '
                            'positions based on how frequently we train on that position. '
                            'This is mostly used for prefix_lm training')
    group.add_argument("--noise-density", type=float, default=None, help="Span corruption noise density")
    group.add_argument("--mean-noise-span-length", type=int, default=None, help="Span corruption mean noise span length")
    group.add_argument("--prefixlm", action='store_true', help="Whether to train a PrefixLM - To be used with finetune t0")
Member:

Yeah, actually let's remove that option. I don't think we've trained one successfully. We'll probably do it, since people have shown that it works, but in another PR IMO.

@@ -274,8 +274,8 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
     load_dir = getattr(args, load_arg)

     if args.deepspeed:
-        load_optimizer_states = False if args.no_load_optim else True
+        load_optimizer_states = not args.no_load_optim
         loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)
Member:

Just use no_load_optim directly in the method
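
One possible reading of that suggestion, sketched against the call quoted in the diff above (dropping the temporary variable is the only change):

if args.deepspeed:
    # Sketch of the suggestion: use the no_load_optim flag inline in the call
    # instead of going through an intermediate load_optimizer_states variable.
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir, load_optimizer_states=not args.no_load_optim)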

@@ -342,7 +342,7 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
     set_checkpoint_version(state_dict.get('checkpoint_version', 0))

     # Set iteration.
-    if args.finetune or release:
+    if args.finetune or release or args.reset_progress:
Member:

Why is it that we didn't set finetune to True?

@@ -361,7 +361,7 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
         # Check arguments.
         assert args.consumed_train_samples == 0
         assert args.consumed_valid_samples == 0
-    if 'args' in state_dict:
+    if 'args' in state_dict and not args.reset_progress:
Member:

Can you add a comment? Typically this is only used because the metadata loading mechanism screws with us.
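
A sketch of what the requested comment could say; the wording is a guess based on this thread, not text from the PR:

# Skip restoring metadata from the checkpoint's saved args when resetting
# progress: the metadata loading mechanism would otherwise overwrite the
# freshly reset consumed_train_samples / consumed_valid_samples.
if 'args' in state_dict and not args.reset_progress:
    ...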

@@ -399,7 +399,7 @@ def _build_index_mappings(
     shuffle_idx_filename = _filename + '_decoder_packed_shuffle_idx.npy'

     # Build the indexed mapping if not exist.
-    if torch.distributed.get_rank() == 0:
+    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
Member:

Why do you need that?

Muennighoff (Collaborator, author):

Afaik you added this code; I think it was for running tests or something.

Member:

Arf, probably because I wanted to use the data loader only... Maybe let's remove it for now, because we should assume that torch.distributed is always initialized, at least in Meg-DS IMO.

     assert counts[0].item() == (
         torch.distributed.get_world_size() //
         torch.distributed.get_world_size(group=mpu.get_tensor_model_parallel_group()))
+    if torch.distributed.is_initialized():
Member:

Ditto

Comment on lines +107 to +108
loss_mask = loss_mask.view(-1)
loss_mask = fast_normalize(loss_mask)
Member:

Maybe reshaping to the original structure is a better API? It's better to have the same shape as the labels IMO (we still flatten everything anyway).
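
A sketch of the reshaping being suggested; fast_normalize is the PR's helper and its internals are not shown here, so this only illustrates the shape handling:

# Normalize, then return loss_mask in the same [batch, seq] shape as the labels.
orig_shape = loss_mask.shape
loss_mask = fast_normalize(loss_mask.view(-1))
loss_mask = loss_mask.view(orig_shape)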

@@ -241,7 +241,7 @@ def test_decoder_packed_mtf_dataloader(self):
         last_padding_size = len([None for segment_id in items["decoder_segment_ids"][micro_batch_size - 1] if segment_id == 0])


-    def test_finetune_t0_non_causal_decoder_get_batch_pipe(self):
+    def test_finetune_t0_get_batch_pipe(self):
Member:

Yeah, let's make it so that the script is causal-decoder specific. Let's figure out the non-causal decoder later on.

    group.add_argument('--append-bos', action='store_true',
                       help='Append a bos token to the end of a document.')
    group.add_argument('--prepend-space', action='store_true',
                       help='Prepends a space to the beginning of a document')
Member:

Add a mention of the context in which it's useful; typically it's when you compute targets.
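
A sketch of the requested help text; the exact phrasing is a guess, not text from the PR:

    group.add_argument('--prepend-space', action='store_true',
                       help='Prepends a space to the beginning of a document. '
                            'Mainly useful when computing targets, so that the '
                            'target text tokenizes with a leading space.')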
