Update index.qmd #548
Conversation
Preview the changes: https://turinglang.org/docs/pr-previews/548
I think there might still be a misunderstanding here. The keyword argument So the real question is:
Ah, I see. Maybe @torfjelde is the right person to ask about this? I've not dug into exactly where to sort out RD gradient stuff yet in Turing.jl.
Typically this occurs in the initial step of the sampler, i.e. once (there are exceptions, but in those cases you can't really do much better anyway).
Unfortunately not for ReverseDiff.jl in compiled mode (they can however just not use compiled mode).
Cached tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model.
Thus, e.g., in the model definition and all implicitly and explicitly called functions in the model, all loops should be of fixed size, and `if`-statements should consistently execute the same branches.
For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the data will always execute the same branches during sampling (if the data is constant throughout sampling and, e.g., no mini-batching is used).
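As an illustrative sketch of this distinction (hypothetical model names; assumes Turing.jl's `@model` macro and the distributions it re-exports):

```julia
using Turing

# Safe to cache: the branch and the loop depend only on fixed properties
# of the data, so every execution records the same sequence of operations.
@model function fixed_flow(x)
    μ ~ Normal(0, 1)
    σ = length(x) > 10 ? 2.0 : 1.0  # decided once the data is fixed
    for i in eachindex(x)           # loop of fixed size
        x[i] ~ Normal(μ, σ)
    end
end

# NOT safe to cache: the branch depends on the sampled value of μ, so
# different executions can record different operation sequences, and a
# cached tape would silently replay the wrong branch.
@model function value_dependent_flow(x)
    μ ~ Normal(0, 1)
    if μ > 0
        x[1] ~ Normal(μ, 1)
    else
        x[1] ~ Normal(μ, 2)
    end
end
```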
Suggested change:
- For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the data will always execute the same branches during sampling (if the data is constant throughout sampling and, e.g., no mini-batching is used).
+ For instance, `if`-statements with conditions that can be determined at compile time or conditions that depend only on fixed properties of the model, e.g. fixed data.
I don't think there's much point mentioning minibatching here, as there's no "easy to use" support for this in Turing.jl and so it's really not something people do much of (don't think I've ever seen anyone actually do this in applications with Turing.jl).
Added some comments 👍
@@ -20,11 +20,12 @@ As of Turing version v0.30, the global configuration flag for the AD backend has
Users can pass the `adtype` keyword argument to the sampler constructor to select the desired AD backend, with the default being `AutoForwardDiff(; chunksize=0)`.
For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of 0 permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to the [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size).
Suggested change:
- For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of 0 permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size).
+ For `ForwardDiff`, pass `adtype=AutoForwardDiff(; chunksize)` to the sampler constructor. A `chunksize` of `nothing` permits the chunk size to be automatically determined. For more information regarding the selection of `chunksize`, please refer to [related section of `ForwardDiff`'s documentation](https://juliadiff.org/ForwardDiff.jl/dev/user/advanced/#Configuring-Chunk-Size).
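For context, a minimal sketch of how this keyword is passed (hypothetical model; whether the automatic-selection value is `0` or `nothing` is exactly what this suggestion is about):

```julia
using Turing

# Toy one-parameter model, purely for illustration.
@model function demo(y)
    m ~ Normal(0, 1)
    y ~ Normal(m, 1)
end

# Automatic chunk size (spelling as in the version under review):
chain_auto = sample(demo(1.5), NUTS(; adtype=AutoForwardDiff(; chunksize=0)), 1000)

# Explicit chunk size:
chain_fixed = sample(demo(1.5), NUTS(; adtype=AutoForwardDiff(; chunksize=4)), 1000)
```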
Maybe also add this: as @gdalle pointed out, this is not actually the case anymore (this was a left-over from using LogDensityProblemsAD.jl, I believe, but it changed when everything moved to being backed by ADTypes.jl).
@torfjelde I think that's the key confusion between us. My understanding is that, regardless of whether it is compiled or not, the tape only captures execution for one version of the control flow. Thus, the tape always becomes invalid if the control flow changes (e.g. a different branch of an `if`-statement). Source: https://juliadiff.org/ReverseDiff.jl/dev/api/#The-AbstractTape-API
The tape is purely internal and neither stored nor reused if
This is wrong. The tape is only reused if
Okay then IIUC it's a matter of inconsistency between
Note that in ADTypes, the `compile` argument to `AutoReverseDiff` is defined ambiguously too. So we should perhaps add more details to the struct, something like `AutoReverseDiff(tape=true, compile=false)`?
I'd love for you to chime in here @devmotion @torfjelde: SciML/ADTypes.jl#91
I'm not too opinionated about this, as compilation without caching seems somewhat useless? Are there scenarios where you'd like to do that?
Indeed, you can't compile a tape you never record in the first place. In any case, I think the ambiguous terminology was fixed by SciML/ADTypes.jl#91. It's just a shame that the word "compile" was chosen instead of "record", given how both are used in ReverseDiff's documentation. But it's a sunk cost now.
- For `ReverseDiff`, pass `adtype=AutoReverseDiff()` to the sampler constructor. An additional argument can be provided to `AutoReverseDiff` to specify whether to to compile the tape only once and cache it for later use (`false` by default, which means no caching tape). Be aware that the use of caching in certain types of models can lead to incorrect results and/or errors.
+ For `ReverseDiff`, pass `adtype=AutoReverseDiff()` to the sampler constructor. An additional argument can be provided to `AutoReverseDiff` to specify whether to to cache the tape only once and reuse it later use (`false` by default, which means no caching). This can substantially improve performance, but risks silently incorrect results if not used with care.
Suggested change:
- For `ReverseDiff`, pass `adtype=AutoReverseDiff()` to the sampler constructor. An additional argument can be provided to `AutoReverseDiff` to specify whether to to cache the tape only once and reuse it later use (`false` by default, which means no caching). This can substantially improve performance, but risks silently incorrect results if not used with care.
+ For `ReverseDiff`, pass `adtype=AutoReverseDiff()` to the sampler constructor. An additional keyword argument called `compile` can be provided to `AutoReverseDiff`. It specifies whether to pre-record the tape only once and reuse it later (`compile` is set to `false` by default, which means no pre-recording). This can substantially improve performance, but risks silently incorrect results if not used with care.
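A hedged usage sketch of the two settings (hypothetical model; assumes `NUTS` accepts the `adtype` keyword as described above):

```julia
using Turing

# Model with a fixed-size loop, so every execution records the same
# operations — safe for a pre-recorded tape.
@model function gdemo(x)
    s ~ InverseGamma(2, 3)
    m ~ Normal(0, sqrt(s))
    for i in eachindex(x)
        x[i] ~ Normal(m, sqrt(s))
    end
end

data = [1.5, 2.0]

# Default: a fresh tape is recorded on every gradient call — always correct.
sample(gdemo(data), NUTS(; adtype=AutoReverseDiff()), 1000)

# Pre-recorded (compiled) tape: faster, but only valid when the model's
# control flow is identical on every execution, as discussed above.
sample(gdemo(data), NUTS(; adtype=AutoReverseDiff(; compile=true)), 1000)
```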
Cached tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model.
Suggested change:
- Cached tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model.
+ Pre-recorded tapes should only be used if you are absolutely certain that the sequence of operations performed in your code does not change between different executions of your model.
Addresses part of #547.
@gdalle does this read more correctly?