This repository was archived by the owner on Aug 7, 2024. It is now read-only.

[1/x]: Make Float8Linear support dynamic scaling #290

Closed
vkuzo wants to merge 5 commits

Conversation

vkuzo (Contributor) commented Jun 28, 2024

Stack from ghstack (oldest at bottom):

Summary:

At a high level, we need to make dynamic vs delayed scaling configurable
separately for activations, weights and gradients. The way I am
approaching this is as follows:

* PR 1 (this PR): add basic support for dynamic scaling, configurable by tensor, to `Float8Linear`
* PRs 2..n: one by one, add features implemented in `Float8DynamicLinear` to `Float8Linear`, as necessary
* last PR: delete `Float8DynamicLinear`

Test Plan:

```
./test/test_everything.sh
```


Differential Revision: D59305792
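
For context, a minimal sketch of the configuration surface this stack works toward: choosing dynamic or delayed scaling independently for each of the three tensors in a linear layer. The `TensorScalingType` name and `scaling_type_x` come from the review snippets below; the `scaling_type_w` / `scaling_type_dL_dY` names and the dict are assumptions for illustration, not the final API.

```
# Hypothetical sketch only; not the library's final API.
from enum import Enum


class TensorScalingType(Enum):
    DELAYED = "delayed"  # scale derived from amax history of previous iterations
    DYNAMIC = "dynamic"  # scale recomputed from the current tensor on the fly


# Intended shape of the per-tensor configuration: one scaling choice each for
# the activation (x), the weight (w), and the output gradient (dL_dY).
desired_config = {
    "scaling_type_x": TensorScalingType.DYNAMIC,
    "scaling_type_w": TensorScalingType.DELAYED,
    "scaling_type_dL_dY": TensorScalingType.DYNAMIC,
}
```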

vkuzo added a commit that referenced this pull request Jun 28, 2024
ghstack-source-id: 7020d5e88e29710c77baaf8c95ff1d859c91bebd
Pull Request resolved: #290
facebook-github-bot added the CLA Signed label on Jun 28, 2024
vkuzo added a commit that referenced this pull request Jun 28, 2024
ghstack-source-id: 7020d5e88e29710c77baaf8c95ff1d859c91bebd
Pull Request resolved: #290
vkuzo added a commit that referenced this pull request Jun 28, 2024
ghstack-source-id: 0aeb524df8b1b693f0538a85c898d24f3f347081
Pull Request resolved: #290
vkuzo added a commit that referenced this pull request Jun 28, 2024
ghstack-source-id: bb1bfe5234f6b654bc8e52d48a158d8b1b3bd616
Pull Request resolved: #290
"""Returns whether the given linear_type requires sync before forward."""
return linear_type in REQUIRES_SYNC
return linear_type is LinearType.DELAYED and any(
Contributor: should this be `or`?

Contributor Author (vkuzo): since `Float8DynamicLinear` does not support `TensorScalingType`, I think `and` is right?

Contributor: you are right 👍
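
To make the resolution concrete, here is a self-contained sketch of the predicate under discussion, assuming enums shaped like the ones in the snippets; keyword names other than `scaling_type_x` are assumptions. It shows why `and` is the right connective.

```
from enum import Enum


class LinearType(Enum):
    DELAYED = "delayed"  # Float8Linear
    DYNAMIC = "dynamic"  # Float8DynamicLinear


class TensorScalingType(Enum):
    DELAYED = "delayed"
    DYNAMIC = "dynamic"


def linear_requires_sync(
    linear_type: LinearType,
    scaling_type_x: TensorScalingType = TensorScalingType.DELAYED,
    scaling_type_w: TensorScalingType = TensorScalingType.DELAYED,
    scaling_type_dL_dY: TensorScalingType = TensorScalingType.DELAYED,
) -> bool:
    # `and` is correct: Float8DynamicLinear (LinearType.DYNAMIC) ignores
    # TensorScalingType entirely, so only a Float8Linear with at least one
    # tensor still on delayed scaling needs the amax/scale sync before forward.
    return linear_type is LinearType.DELAYED and any(
        s is TensorScalingType.DELAYED
        for s in (scaling_type_x, scaling_type_w, scaling_type_dL_dY)
    )
```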

```
        emulate=emulate,
    )
    if linear_type is LinearType.DYNAMIC:
        return Float8DynamicLinear.from_float(
```
Contributor: should we assert that none of the scaling types are True?

Contributor Author (vkuzo): technically yes, I was just lazy and unmotivated since this stack is trying to delete `Float8DynamicLinear`

```
def linear_requires_sync(linear_type: LinearType):
def linear_requires_sync(
    linear_type: LinearType,
    scaling_type_x: TensorScalingType = TensorScalingType.DELAYED,
```
Contributor: Nit: we should probably remove the defaults, right?

Contributor Author (vkuzo): hmm, long term probably yes. I'm taking the approach of tackling the "what's the default" recipe separately and not changing default behavior for now.
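
If the defaults are eventually removed, call sites will have to spell out every choice explicitly; a usage example against the sketch above (hypothetical names):

```
needs_sync = linear_requires_sync(
    LinearType.DELAYED,
    scaling_type_x=TensorScalingType.DYNAMIC,
    scaling_type_w=TensorScalingType.DELAYED,  # weight still on delayed scaling
    scaling_type_dL_dY=TensorScalingType.DYNAMIC,
)
assert needs_sync  # at least one tensor uses delayed scaling, so sync is required
```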

drisspg (Contributor) left a comment: Seems good to me

vkuzo (Contributor Author) commented Jul 2, 2024

@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot: This pull request has been merged in 3cb42e1.
