Release v0.2.0rc1 #695
Merged
Conversation
Merge master back into develop after release
If TPE keeps sampling points that were already suggested, the suggest loop will run forever. This adds a `max_retry` limit on the number of iterations of the loop.
Avoid infinite loop in TPE.suggest
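A minimal sketch of the retry guard this commit describes, using a toy sampler; the class, the `_sample_one` stand-in, and the `random.randint` draw are illustrative, not Oríon's actual TPE internals:

```python
import random

class DuplicateGuardedSampler:
    """Toy sampler illustrating the `max_retry` guard; not Oríon's TPE."""

    def __init__(self, max_retry=100):
        self.max_retry = max_retry
        self.suggested = set()

    def _sample_one(self):
        # Stand-in for TPE's model-based sampling.
        return random.randint(0, 5)

    def suggest(self, num=1):
        points = []
        retries = 0
        while len(points) < num and retries < self.max_retry:
            point = self._sample_one()
            if point in self.suggested:
                # Duplicate suggestion: count it so the loop is bounded.
                retries += 1
                continue
            self.suggested.add(point)
            points.append(point)
        return points
```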
Why: The meta-class factory was confusing and sometimes hard to debug. The factory should be a separate object from the constructors of the classes. When calling a class, we expect to receive an instance of that class, not of a child class. A separate factory also makes it simpler to add new features, such as fetching class objects based on strings or grouping singleton instances per factory.

How: Each class requiring the factory will use ``Factory(base_class)`` as a factory to instantiate child classes. The explicit method ``factory.create()`` will be used instead of the confusing ``MyFactory()``.

Co-authored-by: François Corneau-Tremblay <corneau90@gmail.com>
Improve Getting started example command
Rework factory
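A minimal sketch of the redesigned pattern, assuming the factory looks up child classes by name among the subclasses of a base class; all names here are illustrative, not Oríon's actual implementation:

```python
class Factory:
    """Instantiate child classes of `base_class`, looked up by name."""

    def __init__(self, base_class):
        self.base_class = base_class

    def create(self, of_type, *args, **kwargs):
        for subclass in self.base_class.__subclasses__():
            if subclass.__name__.lower() == of_type.lower():
                return subclass(*args, **kwargs)
        raise NotImplementedError(
            f"No subclass of {self.base_class.__name__} named {of_type!r}"
        )


class Algorithm:
    """Hypothetical base class."""


class Random(Algorithm):
    def __init__(self, seed=None):
        self.seed = seed


# The explicit `factory.create()` replaces the confusing `MyFactory()` call,
# and calling `Algorithm()` itself now returns an Algorithm, not a child class.
factory = Factory(Algorithm)
algo = factory.create("random", seed=1)
assert isinstance(algo, Random)
```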
Why: The algorithms will work with trial objects instead of tuples of values. This means the space needs to sample trials, and space transformations should be applied to trials as well, instead of to tuples of values.

How: For simplicity, only the interfaces of the space classes (TransformedSpace and ReshapedSpace) will work with trials. The per-dimension transformations will still be applied on tuples of values so that, in particular, reshaping operations remain straightforward. To facilitate debugging, transformed trials are wrapped so that the original trial remains accessible. This will prove handy in algorithms if we need access to the original trial objects, and also because the ID of the transformed trial should always be based on the original parameters (otherwise the ID becomes inconsistent with the database).
The use of `__getattr__` to copy the TransformedTrial was causing an infinite recursion error.
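A condensed illustration of the pitfall and one kind of guard that avoids it; the wrapper below is a hypothetical simplification, not the actual class:

```python
import copy

class TransformedTrial:
    """Hypothetical, simplified wrapper delegating to the wrapped trial."""

    def __init__(self, trial):
        self.trial = trial

    def __getattr__(self, name):
        # copy.copy() builds a bare instance without calling __init__; at
        # that point `self.trial` does not exist, so looking it up below
        # would re-enter __getattr__ and recurse forever. Failing fast
        # with AttributeError breaks the cycle.
        if name == "trial":
            raise AttributeError(name)
        return getattr(self.trial, name)

copied = copy.copy(TransformedTrial(trial=object()))  # no RecursionError
```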
Why: If the names of the dimensions share the same prefix, but some have a shape, the transformed space will have a different ordering of dimensions. For example, the following dimensions will have their names sorted differently: dim (shape 2), dim1 (no shape) --> dim1, dim[0], dim[1]. This causes an issue when we try to restore the shape of the transformed dimensions, with the names being swapped.

How: When restoring shape, keep track of the original keys and their order, and reassign the restored dimensions to the correct index (correct dim name).
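The swap in miniature, reproduced with plain Python sorting:

```python
# "1" sorts before "[" in ASCII, so the flattened keys of the shape-(2,)
# dimension "dim" land after the scalar "dim1", swapping the order
# relative to the original keys ["dim", "dim1"].
print(sorted(["dim[0]", "dim[1]", "dim1"]))  # ['dim1', 'dim[0]', 'dim[1]']
```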
Why: When flattening the space, dims of shape (1,) should be flattened as well; otherwise their parameters will be lists of one element.
Why: The deepcopy is failing on github-actions with the error `RuntimeError: dictionary changed size during iteration`. I have been unable to reproduce the issue locally with either python 3.6 or 3.7, but it does fail on 3.7 on github-actions. Taking a copy of the dictionary before the deep copy should fix the issue, unless a dictionary inside some trials is the source of the issue. The stack trace seems to hint towards trials_info as the culprit, however.

```
tests/unittests/benchmark/test_benchmark_client.py:345:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/orion/benchmark/__init__.py:90: in process
    study.execute(n_workers)
src/orion/benchmark/__init__.py:341: in execute
    experiment.workon(self.task, n_workers=n_workers, max_trials=max_trials)
src/orion/client/experiment.py:767: in workon
    for _ in range(n_workers)
src/orion/executor/joblib_backend.py:32: in wait
    return joblib.Parallel(n_jobs=self.n_workers)(futures)
.tox/py/lib/python3.7/site-packages/joblib/parallel.py:1056: in __call__
    self.retrieve()
.tox/py/lib/python3.7/site-packages/joblib/parallel.py:935: in retrieve
    self._output.extend(job.get(timeout=self.timeout))
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/multiprocessing/pool.py:657: in get
    raise self._value
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/multiprocessing/pool.py:121: in worker
    result = (True, func(*args, **kwds))
.tox/py/lib/python3.7/site-packages/joblib/_parallel_backends.py:595: in __call__
    return self.func(*args, **kwargs)
.tox/py/lib/python3.7/site-packages/joblib/parallel.py:263: in __call__
    for func, args, kwargs in self.items]
.tox/py/lib/python3.7/site-packages/joblib/parallel.py:263: in <listcomp>
    for func, args, kwargs in self.items]
src/orion/client/experiment.py:781: in _optimize
    with self.suggest(pool_size=pool_size) as trial:
src/orion/client/experiment.py:560: in suggest
    trial = reserve_trial(self._experiment, self._producer, pool_size)
src/orion/client/experiment.py:54: in reserve_trial
    producer.produce(pool_size)
src/orion/core/worker/producer.py:115: in produce
    self.algorithm.set_state(self.naive_algorithm.state_dict)
src/orion/core/worker/primary_algo.py:47: in state_dict
    return self.algorithm.state_dict
src/orion/algo/tpe.py:265: in state_dict
    _state_dict = super(TPE, self).state_dict
src/orion/algo/base.py:132: in state_dict
    return {"_trials_info": copy.deepcopy(self._trials_info)}
/opt/hostedtoolcache/Python/3.7.12/x64/lib/python3.7/copy.py:150: in deepcopy
    y = copier(x, memo)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
x = {'0ef99dad51485ac9518b49c19b43f4ec': (Trial(experiment=2, status='completed', params=x[0]:0.5497,x[1]:2.397,x[2]:-4.29...params=x[0]:-0.8816,x[1]:2.087,x[2]:1.176), {'constraint': [], 'gradient': None, 'objective': 1187.240632192948}), ...}
memo = {140575410255504: datetime.datetime(2021, 11, 11, 22, 57, 32, 364584), 140575419437584: datetime.datetime(2021, 11, 11....datetime(2021, 11, 11, 22, 57, 32, 177829), 140575419438016: datetime.datetime(2021, 11, 11, 22, 57, 32, 241378), ...}
deepcopy = <function deepcopy at 0x7fdac6dfb680>

    def _deepcopy_dict(x, memo, deepcopy=deepcopy):
        y = {}
        memo[id(x)] = y
>       for key, value in x.items():
E       RuntimeError: dictionary changed size during iteration
```
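A sketch of the workaround described above, applied where the traceback points (`state_dict` and `_trials_info` appear in the trace); the class is a stripped-down stand-in, not the actual base class:

```python
import copy

class AlgorithmStateSketch:
    """Only the attribute relevant to the fix is shown."""

    def __init__(self):
        self._trials_info = {}

    @property
    def state_dict(self):
        # dict(...) takes a shallow snapshot first, so the deepcopy
        # iterates over a dict that no concurrent worker can resize
        # mid-iteration.
        return {"_trials_info": copy.deepcopy(dict(self._trials_info))}
```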
Refactor algorithms observe/suggest to work on trial objects directly
Why: Currently only one promotion can be done for each suggest call. This means that with a pool-size greater than 1, ASHA would always end up suggesting many new trials while doing only one promotion at a time.
Why: The deep copy required to save the algorithm's state needed to deepcopy all brackets, which are already copied as another attribute. We should instead keep indices to the corresponding brackets so that the dictionary is more lightweight during the copy.
Support batched promotions in ASHA
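A sketch of what batched promotion looks like, assuming hypothetical helpers: each bracket exposes a `promote()` that returns None when nothing is promotable, and `sample()` creates a fresh trial. This is not the actual ASHA implementation:

```python
def suggest_batch(brackets, sample, num):
    """Drain promotions from all brackets, then top up with new samples."""
    trials = []
    for bracket in brackets:
        while len(trials) < num:
            promoted = bracket.promote()
            if promoted is None:
                break  # nothing left to promote in this bracket
            trials.append(promoted)
    # Fill the remainder of the batch with fresh samples.
    while len(trials) < num:
        trials.append(sample())
    return trials
```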
Why: We had two levels of patience when reserving a trial. There was the customizable `max_idle_time` used in producer.produce() to limit the time spent trying to generate new trials, and there was `_max_depth` in `reserve_trial` limiting the number of times a reservation would be attempted and producer.produce would be called. This led to misleading error messages. For instance, with many workers it happened that a worker was consistently unable to reserve a trial because each time it executed producer.produce(), all the other workers reserved the new trials before the current worker had time to reserve one. In such a scenario the error message stated that the algorithm was unable to sample new points and was waiting for trials to complete, which was not true. We should instead state the number of trials that were generated during these reservation attempts and recommend increasing the pool-size and timeout.

How: Producer.produce only attempts producing `pool-size` trials once (calling algo.suggest only once) and returns the number of successfully produced trials. The whole patience is moved to `reserve_trial`, where reservation and production are attempted until the timeout is reached, in which case a helpful error message is raised.
Improve resiliency of trial reservation
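A sketch of the single-patience loop described above. The names are assumed for illustration: `experiment.reserve()` returning a trial or None, `producer.produce()` returning the number of trials generated, and `ReservationTimeout` as a stand-in error type:

```python
import time

class ReservationTimeout(Exception):
    """Raised when no trial could be reserved before the timeout."""

def reserve_trial(experiment, producer, pool_size, timeout=60):
    start = time.perf_counter()
    n_produced = 0
    while time.perf_counter() - start < timeout:
        trial = experiment.reserve()
        if trial is not None:
            return trial
        # Attempt producing `pool_size` trials once, then try reserving again.
        n_produced += producer.produce(pool_size)
    raise ReservationTimeout(
        f"No trial reserved within {timeout}s even though {n_produced} trials "
        "were produced during the attempts. Consider increasing the pool-size "
        "or the timeout."
    )
```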
The algorithm interface for suggest() and observe() is changed to use trials instead of lists of points. This will provide more flexibility when designing new algorithms. There is no impact for users, except for algorithm plugins, which will not be backward compatible with previous versions of Oríon.

The factory abstract class used to create database, storage, algorithm, adapter and parallel strategy instances has been redesigned to better support external implementations and be easier to debug. This redesign should also have no direct impact for users, except for backward compatibility issues with algorithm plugins from previous versions of Oríon.
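A minimal sketch of what an algorithm plugin might look like under the trial-based interface, assuming the space samples Trial objects as described above; the `register` bookkeeping method is an illustrative placeholder:

```python
class MyAlgorithm:
    def __init__(self, space):
        self.space = space

    def suggest(self, num):
        # Return Trial objects rather than tuples of parameter values.
        return self.space.sample(num)

    def observe(self, trials):
        # Results arrive as Trial objects as well.
        for trial in trials:
            self.register(trial)

    def register(self, trial):
        pass  # bookkeeping placeholder
```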
🏗 Enhancements
🐛 Bug Fixes