How to use an AdamW optimizer with gradient clipping? #445
-
Hi,

```python
# Exponential decay of the learning rate.
scheduler = optax.exponential_decay(
    init_value=start_learning_rate,
    transition_steps=1000,
    decay_rate=0.99)

# Combining gradient transforms using `optax.chain`.
gradient_transform = optax.chain(
    optax.clip_by_global_norm(1.0),      # Clip the gradients by their global norm.
    optax.scale_by_adam(),               # Use the updates from Adam.
    optax.scale_by_schedule(scheduler),  # Use the learning rate from the scheduler.
    # Scale updates by -1 since optax.apply_updates is additive and we want to descend on the loss.
    optax.scale(-1.0)
)
```

This example uses a simple Adam optimizer; there is no equivalent showing how to combine adamw with gradient clipping.
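For reference, here is a minimal, self-contained sketch of how a chained transform like `gradient_transform` above is typically driven in a training step. The linear model, the toy data, and the `start_learning_rate` value are made-up placeholders for illustration, not part of the original post.

```python
import jax
import jax.numpy as jnp
import optax

start_learning_rate = 1e-2  # placeholder value
scheduler = optax.exponential_decay(
    init_value=start_learning_rate,
    transition_steps=1000,
    decay_rate=0.99)

gradient_transform = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.scale_by_adam(),
    optax.scale_by_schedule(scheduler),
    optax.scale(-1.0),
)

# Toy linear model and data, purely for illustration.
params = {'w': jnp.zeros(3)}
opt_state = gradient_transform.init(params)

def loss_fn(params, x, y):
    return jnp.mean((x @ params['w'] - y) ** 2)

x = jnp.ones((8, 3))
y = jnp.zeros(8)

# One training step: compute gradients, transform them, apply them additively.
grads = jax.grad(loss_fn)(params, x, y)
updates, opt_state = gradient_transform.update(grads, opt_state)
params = optax.apply_updates(params, updates)
```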
-
If you'd like to use clipping with adamw, you could do something like the following:

```python
opt = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(1e-4),
)
```

This will cause the clipping to be applied to the gradients before they are forwarded to the adamw optimizer.
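To make the ordering concrete, here is a hedged, self-contained sketch of one training step with this chain; the toy loss and parameter shapes are illustrative assumptions, not from the original answer. Note that adamw applies weight decay, so the current params are passed to `update`.

```python
import jax
import jax.numpy as jnp
import optax

opt = optax.chain(
    optax.clip_by_global_norm(1.0),  # runs first: clips the raw gradients
    optax.adamw(1e-4),               # then AdamW consumes the clipped gradients
)

params = {'w': jnp.ones(4)}
opt_state = opt.init(params)

def loss_fn(params):
    return jnp.sum(params['w'] ** 2)  # toy loss for illustration

grads = jax.grad(loss_fn)(params)
# adamw's weight decay needs the current params here.
updates, opt_state = opt.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)
```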
-
I have the same question about the "scale by -1.0". Why do some examples perform this operation and others don't? Could you give more explicit examples of when we have to do this operation?
-
For the scale(-1.0) question: this effectively flips the sign of the updates, since the updates are applied by adding them to the parameters.
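A small sketch that illustrates the sign flip; the quadratic loss and the 0.1 step size are made-up choices, not from the original reply.

```python
import jax
import jax.numpy as jnp
import optax

def loss_fn(params):
    return jnp.sum(params ** 2)  # minimum at params == 0

params = jnp.array([1.0, -2.0])
grads = jax.grad(loss_fn)(params)

# Without the sign flip: apply_updates ADDS the gradient-signed updates,
# which moves the parameters uphill on this loss.
ascend = optax.chain(optax.scale_by_adam(), optax.scale(0.1))
state = ascend.init(params)
updates, state = ascend.update(grads, state)
print(optax.apply_updates(params, updates))   # farther from 0

# With a negative scale: the sign flip turns the step into gradient descent.
descend = optax.chain(optax.scale_by_adam(), optax.scale(-0.1))
state = descend.init(params)
updates, state = descend.update(grads, state)
print(optax.apply_updates(params, updates))   # closer to 0
```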