Add support for noisy problems to the Ax Benchmarks #2255

Closed

Conversation

@Balandat Balandat commented Mar 7, 2024

Summary:

Support noisy evaluations (and observation noise) in the Ax Benchmarks

This diff makes a number of key changes to the benchmarking code (see below for incremental changes version-by-version). The primary high-level change is that benchmarks now support running optimization with noisy evaluations and evaluating the results with respect to the ground truth (if available).

Key changes in behavior:

  1. BotorchTestProblemRunner now returns a run_metadata dict that contains not just Ys (the actual observation, potentially including noise), but also Ys_true (the noiseless observation / ground truth) and Ystd (the observation noise standard deviation).
  2. Introduced a new BenchmarkMetricBase base class for metrics used in benchmarks, with a has_ground_truth property and a make_ground_truth_metric() method. This gets around the issue that we don't know whether a standard Metric would permit a ground truth. It means that all metrics used in benchmarks now need to subclass BenchmarkMetricBase. Also introduced a BenchmarkMetric that unifies what was previously implemented by BotorchTestProblemMetric and SurrogateMetric into a single metric. This metric extracts from the modified run_metadata not just the Ys but also the Ystds (Ys_true are not extracted, as these are for performance analysis only and must not be returned as metric values).
  3. SurrogateRunner takes in a noise_stds argument that specifies the standard deviation of the noise added to the surrogate prediction (a Dict[str, float] mapping metric names to the standard deviations of their respective noise levels). SurrogateBenchmarkProblemBase and its subclasses also take in this noise_stds arg and pass it down to the runner instantiation.
  4. We leverage additional tracking metrics to retrieve the ground truth observations from the trials' run_metadata. The basic idea is to (i) generate ground truth versions of all the metrics on an experiment upon creation and add them to the tracking metrics, (ii) run the experiment with the original optimization config, (iii) during the evaluation, replace the optimization config with one of the same specs but using the ground truth metrics instead (which we can grab from the tracking metrics), and (iv) leverage the Scheduler.get_trace() method as is with that modified optimization config.
  5. The observation noise level of individual arms in a BatchTrial can be scaled by the weights in the arm_weights of the trial. This enables running benchmarks with non-uniform observation noise levels across trials, which is needed to understand the behavior of different algorithms / batch sizes under a fixed total sample budget allocation. The assumption here is that the noise_std of the problem (either synthetic or surrogate) is the standard deviation of the observation noise if all of the sample budget were allocated to a single arm. If there are multiple arms in a BatchTrial, the observation noise for arm i is noise_std / sqrt(arm_weight_i), where sum(arm_weight_i) = 1 (see the sketch after this list). This implicitly assumes that the sample budget is the same for all trials (not arms!). Relaxing that assumption would require us to build out some additional way of specifying this (a lower priority right now).
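To make items 1 and 5 concrete, here is a minimal sketch of the run_metadata layout and of the arm-weight-based noise scaling. The key names, list layout, and the helper scaled_noise_std are illustrative assumptions for this description, not the actual Ax implementation.

```python
import math
from typing import Dict

# Hypothetical run_metadata layout per item 1 (a single metric, one arm shown);
# the key names and nesting are assumptions, not the exact Ax schema.
run_metadata = {
    "Ys": {"arm_0": [0.71]},       # noisy observation(s) per arm
    "Ys_true": {"arm_0": [0.65]},  # noiseless ground truth (analysis only)
    "Ystds": {"arm_0": [0.10]},    # std of the observation noise applied
}


def scaled_noise_std(
    noise_std: float, arm_weights: Dict[str, float]
) -> Dict[str, float]:
    """Per-arm observation noise std as described in item 5 (illustrative).

    Assumes noise_std is the observation-noise std if the entire sample
    budget were allocated to a single arm; weights are normalized to sum to 1.
    """
    total = sum(arm_weights.values())
    return {
        arm: noise_std / math.sqrt(weight / total)
        for arm, weight in arm_weights.items()
    }


# Example: a BatchTrial splitting the budget 3:1 across two arms.
print(scaled_noise_std(0.1, {"arm_0": 0.75, "arm_1": 0.25}))
# -> {'arm_0': 0.1155 (approx.), 'arm_1': 0.2}
```

Note that an arm receiving a smaller share of the budget gets a proportionally larger noise std, reflecting its smaller effective sample size.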

Re-organization of files

  • Benchmark-specific metrics and runners were moved from /ax/metrics and /ax/runners into /ax/benchmark/metrics and /ax/benchmark/runners, respectively.

Naming changes in order to increase clarity:

  • The infer_noise arguments and properties are now observe_noise_sd, to make clear that this is about whether noise variances are observed or not (whether the noise level is inferred is up to the model, which has nothing to do with the problem itself).

Other changes

  • BenchmarkProblemBase protocol has new attributes:
    • has_ground_truth: indicates whether the problem admits noiseless ground truth observations (true for all synthetic problems and surrogate problems with deterministic predictions - not true for nondeterministic "real" problems).
    • is_noiseless: indicates whether the problem is noiseless.
    • observe_noise_stds (Union[bool, Dict[str, bool]]) indicates whether the noise level is observed (if a single bool, the noise level is observed for either all or none of the metrics; the dict format allows specifying that only some metrics have observed noise levels).

NOTE: the following are largely orthogonal (though is_noiseless implies has_ground_truth, not vice versa): (i) is_noiseless: the observations are not noisy (neither is the evaluation itself noisy, as it may be for a real problem, nor is synthetic noise added), (ii) observe_noise_sd: the noise level of the observations is observed (it could be zero), (iii) has_ground_truth: the problem has ground truth observations (this is the case if the problem is synthetic, or a real problem that is noiseless).
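As a sketch of what these attributes might look like on the protocol, consider the following; the attribute names follow the description above, but the class itself is illustrative and not the actual BenchmarkProblemBase definition.

```python
from typing import Dict, Protocol, Union


class NoisyBenchmarkProblem(Protocol):
    """Illustrative protocol mirroring the attributes described above."""

    # Noiseless ground-truth observations are available (synthetic problems,
    # surrogate problems with deterministic predictions).
    has_ground_truth: bool
    # The evaluations themselves carry no observation noise.
    is_noiseless: bool
    # Whether the noise std is observed: a single flag for all metrics,
    # or a per-metric mapping.
    observe_noise_stds: Union[bool, Dict[str, bool]]
```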

  • WaveguideMetric is replaced with SurrogateMetric - this was achieved by changing the output format of WaveguideSurrogateRunner.run to be consistent with what SurrogateMetric expects.
  • Speed up the test_timeout test significantly by using fast_botorch_optimize and fewer seeds.
  • Consolidated the _fetch_trial_data helper from BotorchTestProblemMetric and SurrogateMetric into a shared helper at ax/benchmark/metrics/base.py (see the sketch after this list; I believe we can further consolidate things and merge BotorchTestProblemMetric and SurrogateMetric into a single metric given that we've consolidated the run_metadata representation).
  • Deduplicated some shared logic from BotorchTestProblemRunner and SurrogateRunner into a new BenchmarkRunner class that both BotorchTestProblemRunner and SurrogateRunner now subclass.
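For illustration, here is a hypothetical sketch of such a shared fetch helper, reading Ys and Ystds from a trial's run_metadata (using the same assumed layout as the earlier sketch) and assembling a flat records frame; the name fetch_benchmark_data, its signature, and the column names are assumptions, not the actual helper in ax/benchmark/metrics/base.py.

```python
from typing import Any, Dict

import pandas as pd


def fetch_benchmark_data(
    metric_name: str, run_metadata: Dict[str, Any], include_noise_sd: bool
) -> pd.DataFrame:
    """Hypothetical shared helper assembling metric data from run_metadata.

    Reads the noisy observations (Ys) and, when the noise level is meant to
    be observed, the noise stds (Ystds). Ys_true is deliberately not exposed,
    since ground truth is reserved for performance analysis.
    """
    records = []
    for arm_name, ys in run_metadata["Ys"].items():
        ystds = run_metadata["Ystds"][arm_name]
        for y, ystd in zip(ys, ystds):
            records.append(
                {
                    "arm_name": arm_name,
                    "metric_name": metric_name,
                    "mean": y,
                    # Report the noise std only when it is observed; otherwise
                    # leave it unknown (NaN) so the model must infer it.
                    "sem": ystd if include_noise_sd else float("nan"),
                }
            )
    return pd.DataFrame(records)
```

Here an unobserved noise level is represented as NaN so downstream models treat it as unknown, mirroring the observe_noise_sd semantics described under the naming changes above.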

@facebook-github-bot added the CLA Signed label Mar 7, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D51339180

codecov-commenter commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 94.35484%, with 42 lines in your changes missing coverage. Please review.

Project coverage is 94.85%. Comparing base (c7dadf4) to head (b174962).

❗ Current head b174962 differs from pull request most recent head 651aca2. Consider uploading reports for the commit 651aca2 to get more accurate results.

Files Patch % Lines
ax/benchmark/benchmark.py 79.62% 11 Missing ⚠️
ax/benchmark/runners/base.py 83.67% 8 Missing ⚠️
ax/benchmark/runners/botorch_test.py 85.00% 6 Missing ⚠️
ax/benchmark/metrics/utils.py 86.20% 4 Missing ⚠️
ax/benchmark/problems/surrogate.py 84.21% 3 Missing ⚠️
ax/benchmark/runners/surrogate.py 95.08% 3 Missing ⚠️
ax/benchmark/metrics/jenatton.py 94.87% 2 Missing ⚠️
...x/benchmark/tests/runners/test_surrogate_runner.py 95.45% 2 Missing ⚠️
ax/benchmark/benchmark_problem.py 96.42% 1 Missing ⚠️
ax/benchmark/metrics/base.py 94.44% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2255      +/-   ##
==========================================
- Coverage   94.86%   94.85%   -0.02%     
==========================================
  Files         469      476       +7     
  Lines       46545    47030     +485     
==========================================
+ Hits        44156    44609     +453     
- Misses       2389     2421      +32     



Balandat added a commit to Balandat/Ax that referenced this pull request Mar 7, 2024

Balandat added a commit to Balandat/Ax that referenced this pull request Mar 9, 2024

Balandat added a commit to Balandat/Ax that referenced this pull request Mar 9, 2024

Balandat added a commit to Balandat/Ax that referenced this pull request Mar 9, 2024

Balandat added a commit to Balandat/Ax that referenced this pull request Mar 9, 2024
Balandat added a commit to Balandat/Ax that referenced this pull request Mar 11, 2024

Balandat added a commit to Balandat/Ax that referenced this pull request Mar 11, 2024

@facebook-github-bot

This pull request has been merged in 639b731.
