[1/x]: Make Float8Linear support dynamic scaling #290
Conversation
Summary:

At a high level, we need to make dynamic vs delayed scaling configurable separately for activations, weights and gradients. The way I am approaching this is as follows:

* PR 1 (this PR): add basic support for dynamic scaling, configurable by tensor, to `Float8Linear`
* PRs 2..n: one by one, add features implemented in `Float8DynamicLinear` to `Float8Linear`, as necessary
* last PR: delete `Float8DynamicLinear`

Test Plan:

```
./test/test_everything.sh
```

Reviewers:
Subscribers:
Tasks:
Tags:
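To make the per-tensor configuration concrete, here is a minimal usage sketch. The `TensorScalingType` enum and the `scaling_type_x` naming appear in the diff below; the other keyword names (`scaling_type_w`, `scaling_type_dL_dY`), the `from_float` signature, and the import path are assumptions for illustration, not a confirmed API.

```python
# Hypothetical sketch; keyword names and import path are assumptions based on
# the naming pattern in this PR, not a confirmed public API.
import torch
import torch.nn as nn

from float8_experimental.float8_linear import Float8Linear, TensorScalingType  # assumed path

m = nn.Linear(32, 16, bias=False)

# Configure scaling per tensor: dynamic for the activation (x) and the output
# gradient (dL_dY), delayed for the weight (w).
fp8_m = Float8Linear.from_float(
    m,
    emulate=True,
    scaling_type_x=TensorScalingType.DYNAMIC,
    scaling_type_w=TensorScalingType.DELAYED,
    scaling_type_dL_dY=TensorScalingType.DYNAMIC,
)

y = fp8_m(torch.randn(4, 32))
```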
"""Returns whether the given linear_type requires sync before forward.""" | ||
return linear_type in REQUIRES_SYNC | ||
return linear_type is LinearType.DELAYED and any( |
Should this be `or`?
Since `Float8DynamicLinear` does not support `TensorScalingType`, I think `and` is right?
you are right 👍
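To make the `and` vs `or` question concrete, here is a minimal sketch of the resulting predicate. The `scaling_type_x` parameter and its default come from the diff below; the remaining parameter names are assumed to follow the same pattern.

```python
# Sketch of the predicate under discussion; parameter names beyond
# scaling_type_x are assumptions following the same naming pattern.
def linear_requires_sync(
    linear_type: LinearType,
    scaling_type_x: TensorScalingType = TensorScalingType.DELAYED,
    scaling_type_w: TensorScalingType = TensorScalingType.DELAYED,
    scaling_type_dL_dY: TensorScalingType = TensorScalingType.DELAYED,
) -> bool:
    """Returns whether the given linear_type requires sync before forward."""
    # `and` is correct here: only Float8Linear (LinearType.DELAYED) can carry
    # delayed-scaled tensors, and a sync is needed only if at least one of its
    # tensors actually uses delayed scaling.
    return linear_type is LinearType.DELAYED and any(
        t is TensorScalingType.DELAYED
        for t in (scaling_type_x, scaling_type_w, scaling_type_dL_dY)
    )
```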
    emulate=emulate,
)
if linear_type is LinearType.DYNAMIC:
    return Float8DynamicLinear.from_float(
Should we assert that none of the scaling types are True?
Technically yes; I was just lazy and unmotivated, since this stack is trying to delete `Float8DynamicLinear`.
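For completeness, one possible shape for that guard (a hypothetical sketch; the surrounding dispatch signature and the exact invariant to assert are assumptions, since this check was explicitly left out of the PR):

```python
# Hypothetical guard, not part of this PR: Float8DynamicLinear does not consume
# the per-tensor scaling arguments, so one option is to reject any attempt to
# customize them (i.e. change them from the DELAYED default) on this path.
if linear_type is LinearType.DYNAMIC:
    scaling_types = (scaling_type_x, scaling_type_w, scaling_type_dL_dY)
    assert all(t is TensorScalingType.DELAYED for t in scaling_types), (
        "per-tensor scaling types are ignored by Float8DynamicLinear; "
        "configure them on Float8Linear instead"
    )
    return Float8DynamicLinear.from_float(mod, emulate=emulate)  # `mod` name assumed
```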
def linear_requires_sync(linear_type: LinearType):
def linear_requires_sync(
    linear_type: LinearType,
    scaling_type_x: TensorScalingType = TensorScalingType.DELAYED,
Nit: we should probably remove the defaults, right?
hmm, long term probably yes. I'm taking the approach of tackling the "what's the default" recipe separately and not changing default behavior for now.
Seems good to me
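As a usage note, call sites can already pass the scaling types explicitly, which would keep them working unchanged if the defaults are removed later (parameter names beyond `scaling_type_x` are assumptions, as above):

```python
# Passing the scaling types explicitly decouples this call site from the
# current DELAYED defaults.
needs_sync = linear_requires_sync(
    LinearType.DELAYED,
    scaling_type_x=TensorScalingType.DYNAMIC,
    scaling_type_w=TensorScalingType.DELAYED,
    scaling_type_dL_dY=TensorScalingType.DYNAMIC,
)
assert needs_sync  # the delayed-scaled weight still requires a sync before forward
```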
@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been merged in 3cb42e1.
Stack from ghstack (oldest at bottom):
Summary:
At a high level, we need to make dynamic vs delayed scaling configurable
separately for activations, weights and gradients. The way I am
approaching this is as follows:
* PR 1 (this PR): add basic support for dynamic scaling, configurable by tensor, to `Float8Linear`
* PRs 2..n: one by one, add features implemented in `Float8DynamicLinear` to `Float8Linear`, as necessary
* last PR: delete `Float8DynamicLinear`
Test Plan: `./test/test_everything.sh`
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D59305792