Mlm adaptation #287
Conversation
@property
def bos_token_id(self):
    raise NotImplementedError("Missing <bos>")

@property
def eos_token_id(self):
    raise NotImplementedError("Missing <eos>")
It's quite annoying, but we use HF's as-is already, so we shouldn't collapse tokens IMO.
tests/test_dataloaders.py (Outdated)
)

sample = train_ds[0]
self.assertEqual(len(sample["input_tokens"]) + len(sample["target_tokens"]), args.seq_length)
Very basic test... I couldn't think of a more robust one.
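One possible stronger check, purely as a hedged sketch: assuming the usual T5-style MLM layout where every sentinel that appears in input_tokens also opens a span in target_tokens, the two sentinel sequences should line up. Variable names mirror the test above; `tokenizer` being in scope is an assumption.

sample = train_ds[0]
sentinel_ids = set(tokenizer.additional_special_tokens_ids)  # assumes tokenizer is available in the test
input_sentinels = [t for t in sample["input_tokens"] if t in sentinel_ids]
target_sentinels = [t for t in sample["target_tokens"] if t in sentinel_ids]
# Input sentinels should reappear, in order, at the start of the target's sentinel sequence.
self.assertEqual(input_sentinels, target_sentinels[:len(input_sentinels)])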
hf_tokenizer_kwargs = {}
if vocab_extra_ids > 0:
    # TODO @thomasw21 we might need to concatenate to a pre-existing list?
    hf_tokenizer_kwargs["additional_special_tokens"] = [f"<extra_id_{_id}>" for _id in range(vocab_extra_ids)]
I think we'll want something cleaner here, but since we're not using `additional_special_tokens` in our tokenizer I'd say it's okay to override that value. I think at some point we'll push another tokenizer to the hub with the additional tokens. cc @SaulLu
Yeah, this will override the previous special tokens if they exist, right? Does it not work to add them later with `self.tokenizer.add_special_tokens`?
Not very familiar with this part of the tokenizer code, but I would also have expected it to expand the vocabulary instead of re-using existing extra tokens :)
OK, yes: if the embedding is padded later, then this is the right way.
I just checked: whether they are added in the `from_pretrained` method or with the `add_special_tokens` method, in both cases the tokens will be added, but the value of the `additional_special_tokens` property will be overwritten.
If we want to keep the previously added tokens in the `additional_special_tokens` property, we need to do:
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name_or_path, **hf_tokenizer_kwargs)
new_special_tokens = {
    "additional_special_tokens": self.tokenizer.additional_special_tokens + [
        f"<extra_id_{_id}>"
        for _id in range(vocab_extra_ids)
        if f"<extra_id_{_id}>" not in self.tokenizer.additional_special_tokens
    ]
}
self.tokenizer.add_special_tokens(new_special_tokens)
Hmm, let's not over-engineer this... we're not using any right now. I can add a warning saying we're going to overwrite the additional tokens (otherwise I have to switch the logic around a bit for no reason).
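A minimal sketch of what such a warning could look like; the logger name and wording are assumptions, not the PR's actual code:

import logging

logger = logging.getLogger(__name__)

if vocab_extra_ids > 0:
    # Warn that any pre-existing additional_special_tokens will be replaced (wording assumed).
    logger.warning(
        "Overwriting `additional_special_tokens` with %d `<extra_id_*>` sentinel tokens.",
        vocab_extra_ids,
    )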
Fine with a warning, but is there anything wrong with @lucile's solution? In any case, I don't think this is essential; I was just curious. As long as it is documented, I don't think we should spend too much time on it.
The issue is that the MLMDataset needs to be able to query the sentinel tokens. Right now I assume that all `additional_special_tokens` are sentinel tokens. So now we'd need to build a tokenizer that has specific MLM tokens; I can try and do that, I just didn't want to do it out of laziness :D
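For illustration, a hedged sketch of the assumption described above, i.e. treating every additional special token as a sentinel; the t5-small tokenizer here is just an example stand-in:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # any HF tokenizer with <extra_id_*> tokens
# Under the assumption above, these are exactly the sentinel tokens.
sentinel_tokens = tokenizer.additional_special_tokens
sentinel_token_ids = tokenizer.additional_special_tokens_ids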
    spans_start[1:], np.full((1,), len(sample), dtype=np.int32)]
)

sentinel_token_ids = all_sentinel_token_ids[:num_noise_spans]
Given `num_noise_spans` is always the same, it may be slightly faster to store `sentinel_token_ids` as a class attribute of MLMDataset and feed it as an argument to the function.
I also wonder if it wouldn't be better to make `num_noise_spans` probabilistic instead of deterministic.
I also have a strong intuition that we should want to change those values, but the idea is to have T5-style MLM here and rely on their numbers.
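For reference, a hedged sketch of the class-attribute suggestion above; the names are borrowed from the diff, and the constructor signature is an assumption:

class MLMDataset:
    def __init__(self, tokenizer, num_noise_spans):
        # Slice once at construction time instead of on every sample build.
        all_sentinel_token_ids = tokenizer.additional_special_tokens_ids
        self.sentinel_token_ids = all_sentinel_token_ids[:num_noise_spans]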
    up to num_items
    """
    mask_indices = np.arange(num_items - 1) < (num_segments - 1)
    # TODO @thomasw21 handle random state correctly, ie synchronized across TP.
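For context, a hedged reconstruction of the segmentation helper this hunk appears to come from, following the T5 reference implementation it mirrors; the function name and the plain np.random usage (exactly what the TODO flags) are assumptions:

import numpy as np

def random_segmentation(num_items: int, num_segments: int) -> np.ndarray:
    """Partition num_items into num_segments non-empty segments,
    returning the length of each segment."""
    # Mark num_segments - 1 random boundaries among the num_items - 1 gaps.
    mask_indices = np.arange(num_items - 1) < (num_segments - 1)
    # TODO @thomasw21 handle random state correctly, ie synchronized across TP.
    np.random.shuffle(mask_indices)
    first_in_segment = np.pad(mask_indices, [[1, 0]])
    segment_id = np.cumsum(first_in_segment)
    # Count how many items landed in each segment.
    return np.bincount(segment_id, minlength=num_segments)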
This scares me a bit because TP random-state issues are hard to debug, but tbh we should just test ASAP to see if the loss goes down at the expected rate.
Ah yes, I need to double-check that. I can have a go at it; I'd forgotten about this TODO.
Read through the PR and didn't catch anything worrying. Let's just test it ASAP.
@Muennighoff Waiting for your approval.
Nice job! Is the plan roughly as follows?
Merge this -> finish & merge MTF (& figure out a plan for multilingual retention) -> try out MLM+MTF on a small bloom model -> try out Enc-Dec+MLM+MTF on a small bloom model -> try out the best option on bloom176B
Not exactly; in priority order (in case we have idle compute, we go to the next item):
This is the MLM adaptation part.
@thomasw21