Add support for noisy problems to the Ax Benchmarks #2255
Conversation
This pull request was exported from Phabricator. Differential Revision: D51339180
Codecov Report

Attention: Patch coverage is

@@            Coverage Diff             @@
##             main    #2255      +/-   ##
==========================================
- Coverage   94.86%   94.85%   -0.02%
==========================================
  Files         469      476       +7
  Lines       46545    47030     +485
==========================================
+ Hits        44156    44609     +453
- Misses       2389     2421      +32

☔ View full report in Codecov by Sentry.
This pull request has been merged in 639b731.
Summary:
# Support noisy evaluations (and observation noise) in the Ax Benchmarks
This diff makes a number of key changes to the benchmarking code (see below for incremental changes version-by-version). The primary high-level change is that benchmarks now support running optimization with noisy evaluations and evaluating the results with respect to the ground truth (if available).
### Key changes in behavior
1. `BotorchTestProblemRunner` now returns a `run_metadata` dict that contains not just `Ys` (the actual observation, potentially including noise), but also `Ys_true` (the noiseless observation / ground truth) and `Ystds` (the observation noise standard deviations). A small sketch of this payload follows the list.
2. Introduced a new `BenchmarkMetricBase` base class for metrics used in benchmarks, with a `has_ground_truth` property and a `make_ground_truth_metric()` method. This gets around the issue that we don't know whether a standard `Metric` would permit a ground truth; it means that all metrics used in benchmarks now need to subclass `BenchmarkMetricBase`. Also introduced a `BenchmarkMetric` that unifies what was previously implemented by `BotorchTestProblemMetric` and `SurrogateMetric` in a single metric. This metric extracts from the modified `run_metadata` not just the `Ys` but also the `Ystds` (`Ys_true` are not extracted, as these are for performance analysis only and must not be returned as metric values).
3. `SurrogateRunner` takes a `noise_stds` argument that specifies the standard deviation of the noise added to the surrogate prediction (a `Dict[str, float]` mapping metric names to the std of their respective noise levels). `SurrogateBenchmarkProblemBase` and its subclasses also take this `noise_stds` argument and pass it down to the runner instantiation.
4. We leverage additional tracking metrics to retrieve the ground-truth observations from the trials' `run_metadata`. The basic idea is to (i) generate ground-truth versions of all the metrics on an experiment upon creation and add them to the tracking metrics, (ii) run the experiment with the original optimization config, (iii) during evaluation, replace the optimization config with one of the same specs that uses the ground-truth metrics instead (which we can grab from the tracking metrics), and (iv) leverage the `Scheduler.get_trace()` method as is with that modified optimization config. A schematic of this flow follows the list.
5. The observation noise level of individual arms in a `BatchTrial` can be scaled by the weights in the `arm_weights` of the trial. This enables running benchmarks with non-uniform observation noise levels across trials, which is needed to understand the behavior of different algorithms / batch sizes under a fixed total sample budget. The assumption here is that the `noise_std` of the problem (either synthetic or surrogate) is the standard deviation of the observation noise *if all of the sample budget were allocated to a single arm*. If there are multiple arms in a `BatchTrial`, the observation noise for arm `i` is `noise_std / sqrt(arm_weight_i)`, where `sum(arm_weight_i) = 1`. This implicitly assumes that the sample budget is the same for all trials (not arms!); relaxing that assumption would require an additional way of specifying this (a lower priority right now). A short numeric sketch of this scaling also follows the list.
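To make items 1 and 2 concrete, here is a minimal sketch of the kind of `run_metadata` payload a runner might attach to a trial and how a `BenchmarkMetric`-style fetch would read from it. The key names follow the summary above, but the arm names, shapes, and numbers are purely illustrative, not the actual Ax implementation.

```python
# Illustrative only: arm names and numbers are made up; the key names follow
# the summary above, not necessarily the exact Ax implementation.
run_metadata = {
    "Ys": {"0_0": [1.23], "0_1": [0.87]},       # observed values (noise included)
    "Ys_true": {"0_0": [1.10], "0_1": [0.95]},  # noiseless ground truth, analysis only
    "Ystds": {"0_0": [0.30], "0_1": [0.42]},    # observation-noise standard deviations
}

# A BenchmarkMetric-style fetch surfaces Ys and Ystds as the metric's mean and
# sem, and deliberately ignores Ys_true.
def fetch_arm(arm_name: str) -> tuple:
    mean = run_metadata["Ys"][arm_name][0]
    sem = run_metadata["Ystds"][arm_name][0]
    return mean, sem

print(fetch_arm("0_0"))  # -> (1.23, 0.3)
```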
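A tiny numeric sketch of the per-metric `noise_stds` mapping from item 3 and the arm-weight scaling rule from item 5 (`noise_std / sqrt(arm_weight_i)` with weights summing to 1). The metric names, arm names, and numbers are hypothetical; only the formula comes from the summary.

```python
import math

# Hypothetical per-metric noise levels, as passed to a surrogate benchmark
# problem / runner via `noise_stds` (metric name -> noise std).
noise_stds = {"objective": 0.5, "constraint_slack": 0.1}

# Arm-weight scaling from item 5: noise_std is the std if the *entire* sample
# budget went to a single arm; in a BatchTrial, arm i is observed with
# noise_std / sqrt(arm_weight_i), where the weights sum to 1.
arm_weights = {"0_0": 0.5, "0_1": 0.25, "0_2": 0.25}
assert abs(sum(arm_weights.values()) - 1.0) < 1e-9

per_arm_noise = {
    arm: noise_stds["objective"] / math.sqrt(w) for arm, w in arm_weights.items()
}
print(per_arm_noise)
# {'0_0': 0.707..., '0_1': 1.0, '0_2': 1.0} -- arms that get less of the
# budget are observed with proportionally more noise.
```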
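A schematic of the ground-truth evaluation flow in item 4, deliberately written with stand-in types rather than real Ax classes or signatures: ground-truth twins of the metrics are registered as tracking metrics up front, the optimization runs against the noisy metrics, and the trace is then computed against a config that swaps in the ground-truth versions.

```python
from dataclasses import dataclass

# Stand-in type -- the real flow uses Ax Experiment / OptimizationConfig /
# Scheduler objects; only the method name make_ground_truth_metric and the
# four steps below come from the summary.
@dataclass
class FakeMetric:
    name: str
    is_ground_truth: bool = False

    def make_ground_truth_metric(self) -> "FakeMetric":
        return FakeMetric(name=f"{self.name}_ground_truth", is_ground_truth=True)

def run_benchmark(opt_metrics: list) -> dict:
    # (i) create ground-truth twins and keep them as "tracking metrics".
    tracking = {m.name: m.make_ground_truth_metric() for m in opt_metrics}
    # (ii) the experiment runs with the original (noisy) optimization config.
    # ... optimization loop elided ...
    # (iii) for evaluation, build a config with the same specs that points at
    # the ground-truth metrics instead.
    eval_metrics = [tracking[m.name] for m in opt_metrics]
    # (iv) the trace (e.g. Scheduler.get_trace) is then computed w.r.t. these.
    return {"eval_metrics": [m.name for m in eval_metrics]}

print(run_benchmark([FakeMetric("objective")]))
# {'eval_metrics': ['objective_ground_truth']}
```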
### Re-organization of files

- Benchmark-specific metrics and runners were moved from `ax/metrics` and `ax/runners` into `ax/benchmark/metrics` and `ax/benchmark/runners`, respectively.

### Naming changes in order to increase clarity

- The `infer_noise` arguments and properties are now `observe_noise_sd`, to make clear that this is about whether noise variances will be observed or not (whether the noise level will be inferred is up to the model, which has nothing to do with the problem itself).

### Other changes
- The `BenchmarkProblemBase` protocol has new attributes:
  - `has_ground_truth`: indicates whether the problem admits noiseless ground-truth observations (true for all synthetic problems and for surrogate problems with deterministic predictions; not true for nondeterministic "real" problems).
  - `is_noiseless`: indicates whether the problem is noiseless.
  - `observe_noise_stds` (`Union[bool, Dict[str, bool]]`): indicates whether the noise level is observed (a single `bool` means the noise level is observed for either all or none of the metrics; the dict format allows specifying that only some metrics have observed noise levels).

  NOTE: the following are orthogonal (`is_noiseless` implies `has_ground_truth`, but not vice versa); see the illustrative flag combinations at the end of this summary: (i) `is_noiseless`: whether the observations are noisy (either because the evaluation itself is noisy, as for a real problem, or because synthetic noise is added), (ii) `observe_noise_sd`: whether the noise level of the observations is observed (it could be zero), (iii) `has_ground_truth`: whether the problem has ground-truth observations (this is the case if the problem is synthetic, or a real problem that is noiseless).
- `WaveguideMetric` is replaced with `SurrogateMetric`; this was achieved by changing the output format of `WaveguideSurrogateRunner.run` to be consistent with what `SurrogateMetric` expects.
- Speed up the `test_timeout` test significantly by using `fast_botorch_optimize` and fewer seeds.
- Consolidated the `_fetch_trial_data` helper from `BotorchTestProblemMetric` and `SurrogateMetric` into a shared helper at `ax/benchmark/metrics/base.py` (I believe we can further consolidate things and merge `BotorchTestProblemMetric` and `SurrogateMetric` into a single metric, given that we've consolidated the `run_metadata` representation).
- Deduplicated some shared logic from `BotorchTestProblemRunner` and `SurrogateRunner` into a new `BenchmarkRunner` class that both now subclass.
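As a concrete illustration of why `is_noiseless`, `observe_noise_sd`, and `has_ground_truth` are distinct (per the NOTE above), here are a few hypothetical flag combinations; the problem names are examples, not actual Ax problems.

```python
# Hypothetical flag combinations illustrating the NOTE above:
# is_noiseless implies has_ground_truth, but not vice versa.
problems = {
    # synthetic function, no noise added, (zero) noise level reported
    "synthetic_noiseless":    {"is_noiseless": True,  "observe_noise_sd": True,  "has_ground_truth": True},
    # synthetic function with added noise whose std is reported to the model
    "synthetic_known_noise":  {"is_noiseless": False, "observe_noise_sd": True,  "has_ground_truth": True},
    # surrogate with added noise, noise level hidden from the model
    "surrogate_hidden_noise": {"is_noiseless": False, "observe_noise_sd": False, "has_ground_truth": True},
    # a "real" noisy problem: no noiseless ground truth available
    "real_world":             {"is_noiseless": False, "observe_noise_sd": False, "has_ground_truth": False},
}

for name, flags in problems.items():
    # sanity-check the implication stated in the summary
    assert not flags["is_noiseless"] or flags["has_ground_truth"], name
```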