This repository has been archived by the owner on Aug 7, 2024. It is now read-only.

[9/x]: make dynamic scaling default in Float8Linear #300

Closed
wants to merge 1 commit

Conversation

@vkuzo (Contributor) commented Jul 2, 2024

Summary:

1. Makes dynamic scaling the default in `Float8Linear`, to ease migration of
   callsites that currently use `Float8DynamicLinear`. Fixes tests as needed.
2. Updates the README to reference `Float8Linear` for dynamic scaling.
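
For context, a minimal sketch of what per-tensor dynamic scaling means: the scale is derived from the tensor's current amax immediately before the cast, instead of being tracked across iterations as in delayed scaling. This is illustrative only; it assumes a PyTorch build with the `float8_e4m3fn` dtype, and the function name is made up rather than `Float8Linear`'s actual internals.

```
import torch

def dynamic_scale_to_float8(x: torch.Tensor, float8_dtype=torch.float8_e4m3fn):
    # Largest representable magnitude of the target float8 format.
    f8_max = torch.finfo(float8_dtype).max
    # Per-tensor scale derived from the tensor's current absolute maximum ("dynamic").
    amax = x.abs().max().clamp(min=1e-12)
    scale = f8_max / amax.float()
    # Scale, clamp into range, and cast; keep the scale to dequantize later.
    x_f8 = (x.float() * scale).clamp(-f8_max, f8_max).to(float8_dtype)
    return x_f8, scale

x = torch.randn(16, 32)
x_f8, scale = dynamic_scale_to_float8(x)
x_approx = x_f8.float() / scale  # approximate reconstruction of x
```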

Test Plan:

```
./test/test_everything.sh
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@facebook-github-bot added the CLA Signed label Jul 2, 2024
vkuzo added a commit that referenced this pull request Jul 2, 2024
Summary:

1. Makes dynamic scaling the default in `Float8Linear`, to ease migration of
   callsites that currently use `Float8DynamicLinear`. Fixes tests as needed.
2. Updates the README to reference `Float8Linear` for dynamic scaling.

Test Plan:

```
./test/test_everything.sh
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: bb605c238a24d8d650430685c12f4fe77494ab86
Pull Request resolved: #300
@vkuzo requested a review from drisspg July 2, 2024 23:02
@vkuzo (Contributor, Author) commented Jul 2, 2024

@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

This pull request has been merged in d4cf2ad.

vkuzo added a commit to pytorch/torchtitan that referenced this pull request Jul 3, 2024
Summary:

In the stack ending in
pytorch-labs/float8_experimental#300,
we are unifying `Float8DynamicLinear` and `Float8Linear`,
with a follow-up PR planned to delete the `Float8DynamicLinear` object.

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched the
`fp8_linear` configuration to a boolean that controls whether the linears
are swapped. In the future we can add options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.
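
As a rough sketch of the config change described above, the snippet below shows a boolean `fp8_linear` flag gating a module swap; the config class and helper names are hypothetical, not torchtitan's actual ones.

```
from dataclasses import dataclass
import torch.nn as nn

@dataclass
class JobConfig:
    # Previously a string selecting a float8 linear variant; now just on/off.
    fp8_linear: bool = False

def maybe_swap_linears(module: nn.Module, cfg: JobConfig, to_float8_linear) -> nn.Module:
    # `to_float8_linear` stands in for the float8_experimental conversion entry point.
    if not cfg.fp8_linear:
        return module
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, to_float8_linear(child))
        else:
            maybe_swap_linears(child, cfg, to_float8_linear)
    return module
```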

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```

Reviewers:

Subscribers:

Tasks:

Tags:
vkuzo added a commit that referenced this pull request Jul 3, 2024
Summary:

We are standardizing on `Float8Linear` as the only float8 linear object:
1. the stack ending with
   #300 moved
   all of the functionality of `Float8DynamicLinear` to `Float8Linear`.
   The default settings of `Float8Linear` are to use dynamic scaling.
2. this PR deletes `Float8DynamicLinear` from the codebase and patches
   the relevant callsites in fbsource.
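
A rough sketch of the callsite patch described in (2). The `float8_experimental` import paths and the swap helper's signature below are assumptions rather than verified API at this commit, so the before/after is shown as comments only.

```
# Illustrative before/after for a typical callsite (assumed API, not verified):
#
# Before: explicitly request the dynamic-scaling module class.
#   from float8_experimental.float8_dynamic_linear import Float8DynamicLinear
#   swap_linear_with_float8_linear(model, Float8DynamicLinear)
#
# After: Float8Linear's default settings already use dynamic scaling.
#   from float8_experimental.float8_linear import Float8Linear
#   swap_linear_with_float8_linear(model, Float8Linear)
```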

Test Plan:

```
// all tests pass
./test_everything.sh

// also run all benchmarks and verify correctness
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
@vkuzo mentioned this pull request Jul 3, 2024
facebook-github-bot pushed a commit that referenced this pull request Jul 5, 2024
Summary:
Pull Request resolved: #304

We are standardizing on `Float8Linear` as the only float8 linear object:
1. the stack ending with
   #300 moved
   all of the functionality of `Float8DynamicLinear` to `Float8Linear`.
   The default settings of `Float8Linear` are to use dynamic scaling.
2. this PR deletes `Float8DynamicLinear` from the codebase and patches
   the relevant callsites in fbsource.

Reviewed By: drisspg

Differential Revision: D59342767

fbshipit-source-id: cfb09dd5f6517cfbf41d8b46eb6d7d6a5266006a
vkuzo added a commit to pytorch/torchtitan that referenced this pull request Jul 8, 2024
Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched the
`fp8_linear` configuration to a boolean that controls whether the linears
are swapped. In the future we can add options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```

Reviewers:

Subscribers:

Tasks:

Tags:
vkuzo added a commit that referenced this pull request Jul 9, 2024
Summary:

missed this in
#300

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Jul 11, 2024
* Set `record_shapes=True` for profiler

ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
Pull Request resolved: pytorch#419

* Improved `repeat_kv` eager perf

ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
Pull Request resolved: pytorch#418

* Adding FSDP Memory Tracking and Estimation

ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
Pull Request resolved: pytorch#425

* Adding integration test for FSDP Memory Tracking and Estimation

ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
Pull Request resolved: pytorch#426

* by default disable heavy memory profiling

ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
Pull Request resolved: pytorch#430

* Add the option to turn on async-TP

ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
Pull Request resolved: pytorch#429

* Modifying memory estimation options and minor changes

ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
Pull Request resolved: pytorch#435

* add comment pointing to Sequence Parallel optimization example

ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
Pull Request resolved: pytorch#438

* switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)

Summary:

After pytorch-labs/float8_experimental#300,
`Float8Linear` with default settings is equivalent to
`Float8DynamicLinear`. This PR changes `torchtitan` to use
`Float8Linear`.

To better support the new UX of `float8_experimental`, I also switched the
`fp8_linear` configuration to a boolean that controls whether the linears
are swapped. In the future we can add options for configuring each linear
(scaling type, scaling granularity, etc.); that is left for a future PR.

Test Plan:

```
// run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
// verify performance and loss values do not change meaningfully between
// baseline and this PR

// baseline (before this PR)
// 1. compile, bf16
// 2. compile, float8
// 3. compile, float8, fsdp_fp8_allgather=True
// 4. compile, float8, fsdp_fp8_allgather=True, tp=2
// logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce

// experiment (this PR): repeat all of the above, but with Float8Linear
// logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
```

Reviewers:

Subscribers:

Tasks:

Tags:

* Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`

ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
Pull Request resolved: pytorch#444

* Reordered TP parallel plan to follow execution order

ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
Pull Request resolved: pytorch#445

* Made some stylistic changes to `apply_dp`

ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
Pull Request resolved: pytorch#446

* Refactored activation checkpointing

ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
Pull Request resolved: pytorch#447

* compiled RMSNorm

ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
Pull Request resolved: pytorch#442

* Renamed parallel styles for transformer block weights

ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
Pull Request resolved: pytorch#448

* Added type annotations and more stylistic changes

ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
Pull Request resolved: pytorch#449

---------

Co-authored-by: Andrew Gu <andgu@fb.com>
Co-authored-by: Sanket Jayant Purandare <sanketpurandare@meta.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com>
facebook-github-bot pushed a commit that referenced this pull request Jul 12, 2024
Summary:
Pull Request resolved: #309

missed this in
#300

Reviewed By: awgu

Differential Revision: D59685260

fbshipit-source-id: ecee2ae926f0995bf2b95a26f7817a4909cbc50d