Add Bitfit #311

Muennighoff · 2022-07-10T18:00:50Z

This PR adds support for BitFit. I'd like to try BitFit + MTF to retain multilinguality.
Empirical evidence from this paper:

Note that adapters also add parameters to the model and increase inference complexity in Transformers, so BitFit is the best option imo.
Also see this paper, though they don't try BitFit.
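
For illustration, BitFit trains only the bias terms and freezes everything else. The following is a minimal sketch of that selection, not the code in this PR: the helper name `apply_bitfit` and the name-based match on a trailing `bias` are assumptions, and the toy parameter names are hypothetical. It relies only on each parameter exposing a `requires_grad` attribute, as PyTorch parameters do.

```python
from types import SimpleNamespace


def apply_bitfit(named_parameters):
    """Leave only bias parameters trainable; freeze the rest.

    `named_parameters` is any iterable of (name, param) pairs where each
    param has a `requires_grad` attribute. Returns the trainable names.
    """
    trainable = []
    for name, param in named_parameters:
        # Assumption: bias tensors are identified by a name ending in "bias".
        is_bias = name.endswith("bias")
        param.requires_grad = is_bias
        if is_bias:
            trainable.append(name)
    return trainable


# Hypothetical parameter names; SimpleNamespace stands in for real tensors.
toy_params = [
    ("layers.0.attention.dense.weight", SimpleNamespace(requires_grad=True)),
    ("layers.0.attention.dense.bias", SimpleNamespace(requires_grad=True)),
    ("layers.0.input_layernorm.weight", SimpleNamespace(requires_grad=True)),
    ("layers.0.input_layernorm.bias", SimpleNamespace(requires_grad=True)),
]

trainable_names = apply_bitfit(toy_params)
print(trainable_names)
```

With a PyTorch module, the same function could be called as `apply_bitfit(model.named_parameters())`, after which only the parameters that still have `requires_grad=True` would be handed to the optimizer.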

Automatic Tests: Happy to add one if we decide to merge this 🤗

Manual Tests:

1 Node, PP=2, TP=2
2 Nodes, PP=2, TP=2

The logs below show that the grad norm decreases with BitFit, as expected, because fewer parameters receive gradients.
I would also expect iteration time to decrease due to less communication, though probably only with more nodes.
Memory usage also decreases because there are fewer optimizer states to store.
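
To make the reasoning about grad norm and optimizer state concrete, here is a small arithmetic sketch. It assumes the reported grad norm is a global L2 norm over all trainable gradients and that the optimizer is Adam-style with two moment buffers per trainable value; the PR does not state either, and the gradient values are made up rather than taken from these runs.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every gradient value in a {name: [values]} mapping."""
    return math.sqrt(sum(g * g for values in grads.values() for g in values))


def adam_state_values(grads):
    """Count optimizer state values, assuming two moments per trainable value."""
    return 2 * sum(len(values) for values in grads.values())


# Made-up gradients, not taken from the runs in this PR.
all_grads = {
    "dense.weight": [0.3, -0.4, 0.1, 0.2],
    "dense.bias": [0.03, -0.04],
}
# Under BitFit only the bias entries contribute gradients.
bias_grads = {n: g for n, g in all_grads.items() if n.endswith("bias")}

full_norm = global_grad_norm(all_grads)
bitfit_norm = global_grad_norm(bias_grads)
print(f"grad norm: full={full_norm:.3f} bitfit={bitfit_norm:.3f}")
print(
    f"Adam state values: full={adam_state_values(all_grads)} "
    f"bitfit={adam_state_values(bias_grads)}"
)
```

Because the norm sums squares over fewer entries, dropping the weight gradients can only keep it the same or shrink it, and the optimizer state shrinks in proportion to the number of trainable values.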

With BitFit, 2 Nodes, PP=2, TP=2

[default3]: iteration        2/  868457 | consumed samples:          384 | consumed tokens:       786432 | elapsed time per iteration (s): 12.86 | learning rate: 6.291E-07 | global batch size:   192 | lm loss: 1.244176E+01 | loss scale: 4096.0 | grad norm: 0.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 14.925 | TFLOPs: 13.65 |
[default3]: iteration        3/  868457 | consumed samples:          576 | consumed tokens:      1179648 | elapsed time per iteration (s): 12.62 | learning rate: 9.437E-07 | global batch size:   192 | lm loss: 1.244014E+01 | loss scale: 4096.0 | grad norm: 0.062 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 15.209 | TFLOPs: 13.91 |

Without BitFit

[default3]: iteration        2/  868457 | consumed samples:          384 | consumed tokens:       786432 | elapsed time per iteration (s): 12.62 | learning rate: 6.291E-07 | global batch size:   192 | lm loss: 1.244176E+01 | loss scale: 4096.0 | grad norm: 0.291 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 15.214 | TFLOPs: 13.91 |
[default3]: iteration        3/  868457 | consumed samples:          576 | consumed tokens:      1179648 | elapsed time per iteration (s): 12.63 | learning rate: 9.437E-07 | global batch size:   192 | lm loss: 1.244006E+01 | loss scale: 4096.0 | grad norm: 0.309 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 15.201 | TFLOPs: 13.90 |

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>

Muennighoff added 3 commits July 10, 2022 13:15

Enable loading ckpt for t0 finetuning

90b8f46

Add BitFit

9e72b32

Enhance docstring

3f613be

Muennighoff requested review from thomasw21 and TevenLeScao July 10, 2022 18:13

Muennighoff and others added 16 commits July 11, 2022 12:40

Swap decoder_is_inputs & segment_ids

abdd703

Add prepend-space arg

0fcb19c

Update tools/preprocess_data.py

63daa46

Co-authored-by: Thomas Wang <24695242+thomasw21@users.noreply.github.com>

Add helpers & set is_causal to true

89460c0

Merge remote

fb8ecb8

JSON helper scripts

a55d2fb

Remove unnec imports

2dfe5d1

Remove helper scripts

ca740f1

Avoid loading module when not loading optim

cb0313b

Allow not using torch distributed

b62dcaf

Add prefixlm arg

b15ca2d

Add bos option

dc8d0ab

Merge branch 'main' into t0loading

0a32459

Add reset-progress key

2699721

Merge branch 'main' into bitfit

1ed00bd

Merge branch 't0loading' into bitfit

6067213
