diff --git a/docs/source/algorithms.rst b/docs/source/algorithms.rst
index 06471717..360b25a9 100644
--- a/docs/source/algorithms.rst
+++ b/docs/source/algorithms.rst
@@ -5,8 +5,7 @@ Algorithm configuration
 
 .. warning::
 
-   These adaptive algorithms are (currently) experimental and may change at any
-   time.
+   The API for these algorithms is (currently) unstable.
 
 There are many queries to ask about in triplet embedding tasks. Most of these
 queries aren't useful; chances are most queries will have obvious answers and
diff --git a/docs/source/api.rst b/docs/source/api.rst
index 68ddc35f..ca29dbad 100644
--- a/docs/source/api.rst
+++ b/docs/source/api.rst
@@ -5,7 +5,7 @@ Algorithm API
 
 .. warning::
 
-   These APIs are experimental and may change at any time.
+   These APIs are unstable.
 
 All triplet embedding algorithms must conform to this API:
 
@@ -44,11 +44,24 @@ Active Algorithms
    :toctree: generated/
    :template: only-init.rst
 
+   salmon.triplets.algs.RR
    salmon.triplets.algs.TSTE
    salmon.triplets.algs.STE
    salmon.triplets.algs.CKL
    salmon.triplets.algs.GNMDS
 
+These adaptive algorithms differ only in their underlying noise model, with
+one exception: :class:`~salmon.triplets.algs.RR`, which introduces some
+randomness by fixing the head and adding the top ``5 * n`` triplets to the
+database. This randomness is useful because the information gain measure used
+by all of these algorithms (by default) is a rule of thumb.
+
+.. note::
+
+   Use of :class:`~salmon.triplets.algs.RR` is recommended as it performs well
+   in :ref:`the experiments we have run <experiments>`.
+
 Interface
 ---------
 
diff --git a/docs/source/benchmarks/adaptive.rst b/docs/source/benchmarks/adaptive.rst
index 411f419a..c6a7fed7 100644
--- a/docs/source/benchmarks/adaptive.rst
+++ b/docs/source/benchmarks/adaptive.rst
@@ -1,3 +1,5 @@
+.. _experiments:
+
 Adaptive algorithms
 ===================
 
@@ -6,15 +8,20 @@ about a random question like random sampling. This can mean that higher
 accuracies are reached sooner, or that fewer human responses are required to
 reach a particular accuracy.
 
+.. note::
+
+   This page shows results of experiments run with Salmon.
+   For complete details, see https://github.com/stsievert/salmon-experiments
+
 Synthetic simulation
 --------------------
 
 Let's compare adaptive sampling and random sampling. Specifically, let's use
 Salmon like an experimentalist would:
 
-1. Launch Salmon with the "alien eggs" dataset (with :math:`n=50` objects and
-   using :math:`d=2` dimensions).
-2. Simulate human users (6 users with mean response time of 1s).
+1. Launch Salmon with the "alien eggs" dataset, with :math:`n=30` objects
+   embedded into :math:`d=2` dimensions.
+2. Simulate human users (10 users with mean response time of 1s).
 3. Download the human responses from Salmon.
 4. Generate the embedding offline.
 
@@ -28,57 +35,51 @@ is the graph that's produced:
 
 These are synthetic results, though they use a human noise model. These
 experiments provide evidence that Salmon works well with adaptive sampling.
-This measure provide evidence to support the hypothesis that Salmon has better
-performance than NEXT for adaptive triplet embeddings. For reference, in NEXT's
-introduction paper, the authors found "no evidence for gains from adaptive
-sampling" for the triplet embedding problem [2]_.
+This provides evidence that Salmon's active sampling approach outperforms
+random sampling. If so, this is an improvement over existing software for
+deploying triplet queries to crowdsourced audiences: in NEXT's introduction
+paper [2]_, the authors found "no evidence for gains from adaptive sampling"
+for (nearly) the same problem. [#same]_
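+
+Here's a minimal sketch of step 4 above (generating the embedding offline). It
+assumes the responses from step 3 are in an array ``X`` of ``(head, winner,
+loser)`` indices, and that ``OfflineEmbedding`` is importable from
+``salmon.triplets.offline`` as in the offline embedding documentation:
+
+.. code-block:: python
+
+   from sklearn.model_selection import train_test_split
+   from salmon.triplets.offline import OfflineEmbedding
+
+   n, d = 30, 2  # n objects embedded into d dimensions
+   X_train, X_test = train_test_split(X, random_state=42, test_size=0.2)
+
+   model = OfflineEmbedding(n=n, d=d)
+   model.fit(X_train, X_test)
+   model.embedding_  # the (n, d) embedding
+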
 
-.. [1] "Active Perceptual Similarity Modeling with Auxiliary Information" by E.
-   Heim, M. Berger, and L. Seversky, and M. Hauskrecht. 2015.
-   https://arxiv.org/pdf/1511.02254.pdf
-
-.. [2] "NEXT: A System for Real-World Development, Evaluation, and Application
-   of Active Learning" by K. Jamieson, L. Jain, C. Fernandez, N. Glattard
-   and R. Nowak. 2017.
-   http://papers.nips.cc/paper/5868-next-a-system-for-real-world-development-evaluation-and-application-of-active-learning.pdf
 
+Simulation with human responses
+-------------------------------
 
-Search efficacy
----------------
 
+The Zappos shoe dataset has :math:`n=85` shoes, and every possible triplet was
+asked of crowdsourcing users 4 times. Let's run a simulation with Salmon on
+that dataset. We'll embed into :math:`d = 3` dimensions, and have a response
+rate of about 2 responses/second (5 users with an average response time of 2.5
+seconds).
 
-Adaptive algorithms are more adaptive if they search more queries. Random sampling
-can be thought of as an adaptive algorithm that only searches over one possible
-query. An algorithm that searches over 50,000 queries is more adaptive than a
-algorithm that can only search 50 queries.
 
+Let's again compare adaptive sampling and random sampling:
 
-How much do these searches matter? Let's run another experiment with this setup:
 
+.. image:: imgs/zappos.png
+   :width: 600px
+   :align: center
 
-* Dataset: strange fruit dataset. The response model will be determined from human
-  responses. There will be :math:`n=200` objects and that will be embedded into :math:`d=2`
-  dimensions.
-* Let's measure **search efficacy.** To aid this, let's say model updates run instantly.
-  That means we'll run offline using essentially this code:
 
+The likelihood of a true response conveys the "margin by which the models
+adhere to all responses." [1]_ The performance above mirrors that reported by
+Heim et al. in their Figure 3 [1]_.
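+
+For reference, here is a minimal sketch of how an adaptive sampler like
+:class:`~salmon.triplets.algs.RR` can be driven in simulations like the ones
+above. The constructor and ``get_queries`` signatures follow
+``salmon/triplets/algs/_adaptive_runners.py`` later in this diff; running the
+sampler standalone (outside of Salmon's server) is an assumption:
+
+.. code-block:: python
+
+   from salmon.triplets.algs import RR
+
+   # n=85 shoes, embedded into d=3 dimensions
+   alg = RR(n=85, d=3)
+
+   # Search queries, then keep the top-scored (head, left, right) triplets
+   # with randomized scores; answers would be fed to alg.process_answers.
+   queries, scores, meta = alg.get_queries()
+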
-.. code-block:: python
 
-   responses_per_search = 10
-   n_search = 10
-   alg = TSTE(n=n, d=d, ...)
 
+.. rubric:: References
 
-   for k in itertools.count():
-       queries, scores = alg.score_queries(num=n_search * responses_per_search)
-       queries = _get_top_N_queries(queries, scores, N=responses_per_search)
-       answers = [_get_answer(query) for query in queries]
 
+.. [1] "Active Perceptual Similarity Modeling with Auxiliary Information" by E.
-       alg.partial_fit(answers)  # performs 1 pass over all answers received thus far
+   Heim, M. Berger, L. Seversky, and M. Hauskrecht. 2015.
+   https://arxiv.org/pdf/1511.02254.pdf
 
-With that, we see this performance:
 
-.. image:: imgs/search-efficacy.png
-   :width: 600px
-   :align: center
 
-If you only have the budget for 4,000 queries the most complete search will reach about 82% accuracy. The least complete search will only reach about 60% accuracy.
 
+.. [2] "NEXT: A System for Real-World Development, Evaluation, and Application
+   of Active Learning" by K. Jamieson, L. Jain, C. Fernandez, N. Glattard
+   and R. Nowak. 2017.
+   http://papers.nips.cc/paper/5868-next-a-system-for-real-world-development-evaluation-and-application-of-active-learning.pdf
 
-If you want to reach 80% accuracy, the most complete searches will require about 3,800 queries. The least complete searches will require 5,100 queries.
 
+.. rubric:: Footnotes
+
+.. [#same] Both experiments use :math:`n=30` objects and embed into :math:`d=2`
+   dimensions. The human noise model used in the Salmon experiments is
+   generated from the responses collected during NEXT's experiment. They
+   are the same experiment, up to different responses (NEXT
+   actually runs crowdsourcing experiments; Salmon's noise model is
+   generated from those responses).
diff --git a/docs/source/benchmarks/imgs/search-efficacy.png b/docs/source/benchmarks/imgs/search-efficacy.png
deleted file mode 100644
index 7630469a..00000000
Binary files a/docs/source/benchmarks/imgs/search-efficacy.png and /dev/null differ
diff --git a/docs/source/benchmarks/imgs/synth-eg-acc.graffle/data.plist b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/data.plist
new file mode 100644
index 00000000..b7e0b1a8
Binary files /dev/null and b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/data.plist differ
diff --git a/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image11.png b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image11.png
new file mode 100644
index 00000000..8e849eef
Binary files /dev/null and b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image11.png differ
diff --git a/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image13.png b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image13.png
new file mode 100644
index 00000000..7dfcaedf
Binary files /dev/null and b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image13.png differ
diff --git a/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image14.tiff b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image14.tiff
new file mode 100644
index 00000000..442b6549
Binary files /dev/null and b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image14.tiff differ
diff --git a/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image6.png b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image6.png
new file mode 100644
index 00000000..32dd1b76
Binary files /dev/null and b/docs/source/benchmarks/imgs/synth-eg-acc.graffle/image6.png differ
diff --git a/docs/source/benchmarks/imgs/synth-eg-acc.png b/docs/source/benchmarks/imgs/synth-eg-acc.png
index 4af93508..f276db1d 100644
Binary files a/docs/source/benchmarks/imgs/synth-eg-acc.png and b/docs/source/benchmarks/imgs/synth-eg-acc.png differ
diff --git a/docs/source/benchmarks/imgs/zappos-afrl.png b/docs/source/benchmarks/imgs/zappos-afrl.png
new file mode 100644
index 00000000..31ca58e3
Binary files /dev/null and b/docs/source/benchmarks/imgs/zappos-afrl.png differ
diff --git a/docs/source/benchmarks/imgs/zappos.graffle/data.plist b/docs/source/benchmarks/imgs/zappos.graffle/data.plist
new file mode 100644
index 00000000..edaf8727
Binary files /dev/null and b/docs/source/benchmarks/imgs/zappos.graffle/data.plist differ
diff --git a/docs/source/benchmarks/imgs/zappos.graffle/image1.png b/docs/source/benchmarks/imgs/zappos.graffle/image1.png
new file mode 100644
index 00000000..7bc2e7bf
Binary files /dev/null and b/docs/source/benchmarks/imgs/zappos.graffle/image1.png differ
diff --git a/docs/source/benchmarks/imgs/zappos.graffle/image3.png b/docs/source/benchmarks/imgs/zappos.graffle/image3.png
new file mode 100644
index 00000000..2bedeffe
Binary files /dev/null and b/docs/source/benchmarks/imgs/zappos.graffle/image3.png differ
diff --git a/docs/source/benchmarks/imgs/zappos.graffle/image4.png b/docs/source/benchmarks/imgs/zappos.graffle/image4.png
new file mode 100644
index 00000000..741f11dc
Binary files /dev/null and b/docs/source/benchmarks/imgs/zappos.graffle/image4.png differ
diff --git a/docs/source/benchmarks/imgs/zappos.png b/docs/source/benchmarks/imgs/zappos.png
new file mode 100644
index 00000000..dd49009a
Binary files /dev/null and b/docs/source/benchmarks/imgs/zappos.png differ
diff --git a/docs/source/generated/salmon.triplets.algs.RR.rst b/docs/source/generated/salmon.triplets.algs.RR.rst
new file mode 100644
index 00000000..1ff621d2
--- /dev/null
+++ b/docs/source/generated/salmon.triplets.algs.RR.rst
@@ -0,0 +1,7 @@
+:mod:`salmon.triplets.algs`.RR
+=====================================
+
+.. currentmodule:: salmon.triplets.algs
+
+.. autoclass:: RR
+   :members: __init__
\ No newline at end of file
diff --git a/docs/source/getting-started.rst b/docs/source/getting-started.rst
index b79c6c83..05a7e356 100644
--- a/docs/source/getting-started.rst
+++ b/docs/source/getting-started.rst
@@ -10,7 +10,7 @@ the following options:
 2. Upload of a YAML file describing the experiment, and a ZIP file for the targets.
 3. Upload of a database dump from Salmon.
 
-.. warning::
+.. note::
 
    By default, Salmon does not support HTTPS. Make sure the URL begins with
    ``http://``, not ``https://``.
@@ -35,7 +35,7 @@ page:
 
    This image is almost certainly out of date.
 
-.. warning::
+.. note::
 
    Please include the version in any bug reports or feature requests. The
    version number is available at ``http://[url]:8421/docs`` and should look
diff --git a/docs/source/imgs/face-embedding.png b/docs/source/imgs/face-embedding.png
new file mode 100644
index 00000000..79d49be3
Binary files /dev/null and b/docs/source/imgs/face-embedding.png differ
diff --git a/docs/source/imgs/query.graffle/data.plist b/docs/source/imgs/query.graffle/data.plist
index 023a1724..78e48eb4 100644
Binary files a/docs/source/imgs/query.graffle/data.plist and b/docs/source/imgs/query.graffle/data.plist differ
diff --git a/docs/source/imgs/query.png b/docs/source/imgs/query.png
index 4141ba5d..58ba2b98 100644
Binary files a/docs/source/imgs/query.png and b/docs/source/imgs/query.png differ
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 75b4e21b..95309ceb 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -6,13 +6,21 @@ are of the form "is object :math:`a` more similar to object :math:`b` or
 :math:`c`?" An example is shown below with facial similarities:
 
 .. image:: imgs/query.png
-   :width: 400px
+   :width: 300px
    :align: center
 
 These queries are interesting because they provide some relative similarity
 structure: a response might indicate that object :math:`a` is closer to object
 :math:`b` than object :math:`c` as determined by humans and the instructions
-they are given.
+they are given. For example, these triplet queries have been used by
+psychologists to determine what facial emotions humans find similar:
+
+.. image:: imgs/face-embedding.png
+   :width: 500px
+   :align: center
+
+Only distance is relevant in this embedding, not the vertical/horizontal axes.
+However, if you look closely, you can see two axes: positivity and intensity.
 
 Salmon provides efficient methods for collecting these triplet queries. Salmon
 can be configured to only require (say) 10,000 answers from crowdsourcing
@@ -28,15 +36,7 @@ the same confidence.
 
    getting-started
    monitoring
    offline
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Algorithms
-
    algorithms
-   adaptive
-   api
-   developers
 
 .. toctree::
    :maxdepth: 2
@@ -45,6 +45,14 @@ the same confidence.
 
    benchmarks/server
    benchmarks/adaptive
 
+.. toctree::
toctree:: + :maxdepth: 2 + :caption: Algorithm Developers + + adaptive + developers + api + Indices and tables ================== diff --git a/docs/source/installation.rst b/docs/source/installation.rst index 1f79a702..f0adb7ea 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -8,10 +8,6 @@ machine. After you get Salmon running, detail on how to launch experiments in Experimentalist --------------- -.. warning:: - - This process is only ready for testing. It is **not** ready for deployment. - 1. Sign into Amazon AWS (http://aws.amazon.com/) 2. Select the "Oregon" region (or ``us-west-2``) in the upper right. 3. Go to Amazon EC2 @@ -69,7 +65,7 @@ To start using Salmon, these endpoints will be available: Download all files when stopping or terminating the machine -- especially the responses and experiment file. -.. warning:: +.. note:: If you have an issue with the machine running Salmon, be sure to include the logs when contacting the Salmon developers. They'd also appreciate it if diff --git a/docs/source/offline.rst b/docs/source/offline.rst index 164f501c..b024a1eb 100644 --- a/docs/source/offline.rst +++ b/docs/source/offline.rst @@ -47,7 +47,7 @@ This code will generate an embedding: d = 2 # embed into 2 dimensions X_train, X_test = train_test_split(X, random_state=42, test_size=0.2) - model = OfflineEmbedding(n=n, d=d) + model = OfflineEmbedding(n=n, d=d, max_epochs=1_000_000) model.fit(X_train, X_test) model.embedding_ # embedding @@ -66,9 +66,11 @@ the dashboard by downloading the "embeddings" file (or visiting the embedding coordinates and the name of the embedding that generated the algorithm. -To visualize the embedding, I would use standard plotting tools to visualize +To visualize the embedding, standard plotting tools can be used to visualize the embedding, which might be `Matplotlib`_, the `Pandas visualization API`_, -`Bokeh`_ or `Altair`_. Salmon uses Bokeh for it's visualization. +`Bokeh`_ or `Altair`_. The Pandas visualization API is likely the easiest to +use, but won't support showing HTML (images/video/etc). To do that, Salmon uses +Bokeh for it's visualization. .. _Pandas visualization API: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html diff --git a/examples/colors/colors.py b/examples/colors/colors.py index d4ae439e..2e654d1e 100644 --- a/examples/colors/colors.py +++ b/examples/colors/colors.py @@ -59,16 +59,20 @@ {"r": 0.9829, "g": 0.65643, "b": 0.078187}, ] + def _fmt_color(c): c = hex(int(255 * c))[2:] if len(c) == 1: c = f"0{c}" assert len(c) == 2 return c + + def _convert(r, g, b): r, g, b = map(_fmt_color, [r, g, b]) return f"#{r}{g}{b}" + def _fmt(hc): target = ( "
= 100: raise ValueError("infinite loop?") query, score = self.alg.get_query() diff --git a/examples/queries-searched/run.py b/examples/queries-searched/run.py index 414cc5b9..96a3ec66 100644 --- a/examples/queries-searched/run.py +++ b/examples/queries-searched/run.py @@ -246,7 +246,7 @@ def _score_fruit(self, X, y): if __name__ == "__main__": import salmon - assert salmon.__version__ == 'v0.4.1+8.geafdca2.dirty' + assert salmon.__version__ == "v0.4.1+8.geafdca2.dirty" queries_per_search = 10 # _searches = [[1 * 10 ** i, 2 * 10 ** i, 5 * 10 ** i] for i in range(0, 5 + 1)] diff --git a/salmon/backend/alg.py b/salmon/backend/alg.py index 3a256371..444c5e54 100644 --- a/salmon/backend/alg.py +++ b/salmon/backend/alg.py @@ -110,11 +110,7 @@ def submit(fn: str, *args, allow_other_workers=True, **kwargs): if hasattr(self, "get_queries"): f_search = submit( - "get_queries", - self_future, - random_state=k, - stop=done, - workers=workers[2], + "get_queries", self_future, stop=done, workers=workers[2], ) else: f_search = client.submit(lambda x: ([], []), 0) diff --git a/salmon/frontend/private.py b/salmon/frontend/private.py index 2f137bed..840d0709 100644 --- a/salmon/frontend/private.py +++ b/salmon/frontend/private.py @@ -485,8 +485,7 @@ def _fmt_embedding( @app.get("/embeddings", tags=["private"]) async def get_embeddings( - authorized: bool = Depends(_authorize), - alg: Optional[str] = None, + authorized: bool = Depends(_authorize), alg: Optional[str] = None, ): """ Get the embeddings for algorithms. diff --git a/salmon/frontend/utils.py b/salmon/frontend/utils.py index fdc57a5e..336d0926 100644 --- a/salmon/frontend/utils.py +++ b/salmon/frontend/utils.py @@ -20,7 +20,6 @@ def __init__(self, msg): raise HTTPException(status_code=500, detail=msg) - def _extract_zipfile(raw_zipfile, directory="targets"): p = Path(__file__).absolute().parent # directory to this file imgs = p / "static" / directory diff --git a/salmon/triplets/algs/_adaptive_runners.py b/salmon/triplets/algs/_adaptive_runners.py index dc12f963..cefe5c47 100644 --- a/salmon/triplets/algs/_adaptive_runners.py +++ b/salmon/triplets/algs/_adaptive_runners.py @@ -9,7 +9,6 @@ import numpy as np import numpy.linalg as LA import pandas as pd -from sklearn.utils import check_random_state import salmon.triplets.algs.adaptive as adaptive from salmon.triplets.algs.adaptive import InfoGainScorer, UncertaintyScorer @@ -34,8 +33,6 @@ be positive. optimizer__momentum : float The momentum to use with the optimizer. - random_state : int, None, np.random.RandomState - The seed used to generate psuedo-random numbers. sampling : str "adaptive" by default. Use ``sampling="random"`` to perform random sampling with the same optimization method and noise model. 
@@ -52,7 +49,6 @@ def __init__( optimizer: str = "Embedding", optimizer__lr=0.050, optimizer__momentum=0.9, - random_state=None, R: float = 10, sampling: str = "adaptive", scorer: str = "infogain", @@ -82,7 +78,6 @@ def __init__( optimizer=torch.optim.SGD, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, warm_start=True, max_epochs=500, **kwargs, @@ -91,18 +86,13 @@ def __init__( if scorer == "infogain": search = InfoGainScorer( - embedding=self.opt.embedding(), - probs=self.opt.net_.module_.probs, - random_state=random_state, + embedding=self.opt.embedding(), probs=self.opt.net_.module_.probs, ) elif scorer == "uncertainty": - search = UncertaintyScorer( - embedding=self.opt.embedding(), random_state=random_state, - ) + search = UncertaintyScorer(embedding=self.opt.embedding(),) else: raise ValueError(f"scorer={scorer} not in ['uncertainty', 'infogain']") - self.random_state_ = check_random_state(random_state) self.search = search self.search.push([]) self.meta = {"num_ans": 0, "model_updates": 0, "process_answers_calls": 0} @@ -111,7 +101,6 @@ def __init__( "d": d, "R": R, "sampling": sampling, - "random_state": random_state, "optimizer": optimizer, "optimizer__lr": optimizer__lr, "optimizer__momentum": optimizer__momentum, @@ -120,28 +109,25 @@ def __init__( def get_query(self) -> Tuple[Optional[Dict[str, int]], Optional[float]]: if (self.meta["num_ans"] <= self.R * self.n) or self.sampling == "random": - head, left, right = _random_query(self.n, random_state=self.random_state_) + head, left, right = _random_query(self.n) return {"head": int(head), "left": int(left), "right": int(right)}, -9999 return None, -9999 - def get_queries( - self, num=None, stop=None, random_state=None - ) -> Tuple[List[Query], List[float], dict]: + def get_queries(self, num=None, stop=None) -> Tuple[List[Query], List[float], dict]: if num: - queries, scores = self.search.score(num=num, random_state=random_state) + queries, scores = self.search.score(num=num) return queries[:num], scores[:num] ret_queries = [] ret_scores = [] - rng = None - if random_state: - rng = check_random_state(random_state) n_searched = 0 for pwr in range(12, 40 + 1): # I think there's a memory leak in search.score -- Dask workers # kept on dying on get_queries. min(pwr, 16) to fix that (and # verified too). + # + # pwr in range(12, 41) => about 1.7 million queries searched pwr = min(pwr, 16) - queries, scores = self.search.score(num=2 ** pwr, random_state=rng) + queries, scores = self.search.score(num=2 ** pwr) n_searched += len(queries) ret_queries.append(queries) ret_scores.append(scores) @@ -150,7 +136,14 @@ def get_queries( # let's limit it to be 32MB in size if (n_searched >= 2e6) or (stop is not None and stop.is_set()): break - return np.concatenate(ret_queries), np.concatenate(ret_scores), {} + queries = np.concatenate(ret_queries).astype(int) + scores = np.concatenate(ret_scores) + + ## Rest of this function takes about 450ms + df = pd.DataFrame(queries) + hashes = pd.util.hash_pandas_object(df, index=False) + _, idx = np.unique(hashes.to_numpy(), return_index=True) + return queries[idx], scores[idx], {} def process_answers(self, answers: List[Answer]): if not len(answers): @@ -252,8 +245,6 @@ class TSTE(Adaptive): be positive. optimizer__momentum : float The momentum to use with the optimizer. - random_state : int, None, np.random.RandomState - The seed used to generate psuedo-random numbers. sampling : str "adaptive" by default. 
Use ``sampling="random"`` to perform random sampling with the same optimization method and noise model.
@@ -298,7 +289,6 @@ def __init__(
         optimizer: str = "Embedding",
         optimizer__lr=0.075,
         optimizer__momentum=0.9,
-        random_state=None,
         sampling="adaptive",
         scorer="infogain",
         alpha=1,
@@ -311,7 +301,6 @@ def __init__(
             optimizer=optimizer,
             optimizer__lr=optimizer__lr,
             optimizer__momentum=optimizer__momentum,
-            random_state=random_state,
             module__alpha=alpha,
             module="TSTE",
             sampling=sampling,
@@ -321,15 +310,58 @@
 class RR(Adaptive):
+    """
+    A randomized round robin algorithm.
+
+    Parameters
+    ----------
+    d : int
+        Embedding dimension.
+    R : int (default ``1``)
+        Adaptive sampling starts after ``R * n`` responses have been received.
+    optimizer : str
+        The optimizer underlying the embedding. This method specifies how to
+        change the batch size. Choices are
+        ``["Embedding", "PadaDampG", "GeoDamp"]``.
+    optimizer__lr : float
+        Which learning rate to use with the optimizer. The learning rate must
+        be positive.
+    optimizer__momentum : float
+        The momentum to use with the optimizer.
+    scorer : str (default ``"infogain"``)
+        The scoring method to use.
+    module : str, optional (default ``"TSTE"``)
+        The noise model to use.
+    kwargs : dict
+        Arguments to pass to :class:`~Adaptive`.
+
+    Notes
+    -----
+    This algorithm is proposed in [1]_. The authors propose it because
+    "scoring every triplet is prohibitively expensive." It's also useful
+    because it adds some randomness to the queries, which helps in a couple
+    of use cases:
+
+    * When models don't update instantly (common). In that case, the user will
+      query the database for multiple queries, and queries with the same head
+      object may be returned.
+    * When the noise model does not precisely model the human responses. In
+      this case, the query scored as most informative may not actually be the
+      most useful query to ask.
+
+    References
+    ----------
+    .. [1] Heim, Eric, et al. "Active perceptual similarity modeling with
+       auxiliary information." arXiv preprint arXiv:1511.02254 (2015).
+       https://arxiv.org/abs/1511.02254
+
+    """
+
     def __init__(
         self,
         n: int,
         d: int = 2,
+        R: int = 1,
         ident: str = "",
         optimizer: str = "Embedding",
         optimizer__lr=0.075,
         optimizer__momentum=0.9,
-        random_state=None,
         sampling="adaptive",
         scorer="infogain",
         module="TSTE",
@@ -338,11 +370,11 @@ def __init__(
         super().__init__(
             n=n,
             d=d,
+            R=R,
             ident=ident,
             optimizer=optimizer,
             optimizer__lr=optimizer__lr,
             optimizer__momentum=optimizer__momentum,
-            random_state=random_state,
             module=module,
             sampling=sampling,
             scorer=scorer,
@@ -358,10 +390,10 @@ def get_queries(self, *args, **kwargs):
         top_scores_by_head = df.groupby(by="h")["score"].nlargest(n=5)
         top_idx = top_scores_by_head.index.droplevel(0)
-        top_queries = df.loc[top_idx].sample(random_state=self.random_state_, frac=1)
+        top_queries = df.loc[top_idx].sample(frac=1)
 
         posted = top_queries[["h", "l", "r"]].values.astype("int64")
         r_scores = 10 + np.linspace(0, 1, num=len(posted))
-        self.random_state_.shuffle(r_scores)
+        np.random.shuffle(r_scores)
 
         meta.update({"n_queries_scored_(complete)": len(df)})
         return posted, r_scores, meta
 
@@ -384,8 +416,6 @@ class STE(Adaptive):
         be positive.
     optimizer__momentum : float
         The momentum to use with the optimizer.
-    random_state : int, None, np.random.RandomState
-        The seed used to generate psuedo-random numbers.
     sampling : str
         "adaptive" by default. Use ``sampling="random"`` to perform random
        sampling with the same optimization method and noise model.
@@ -405,7 +435,6 @@ def __init__( optimizer: str = "Embedding", optimizer__lr=0.075, optimizer__momentum=0.9, - random_state=None, sampling="adaptive", scorer="infogain", **kwargs, @@ -417,7 +446,6 @@ def __init__( optimizer=optimizer, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, module="STE", sampling=sampling, scorer=scorer, @@ -442,8 +470,6 @@ class GNMDS(Adaptive): be positive. optimizer__momentum : float The momentum to use with the optimizer. - random_state : int, None, np.random.RandomState - The seed used to generate psuedo-random numbers. sampling : str "adaptive" by default. Use ``sampling="random"`` to perform random sampling with the same optimization method and noise model. @@ -463,7 +489,6 @@ def __init__( optimizer: str = "Embedding", optimizer__lr=0.075, optimizer__momentum=0.9, - random_state=None, sampling="adaptive", scorer="uncertainty", **kwargs, @@ -475,7 +500,6 @@ def __init__( optimizer=optimizer, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, module="GNMDS", sampling=sampling, **kwargs, @@ -501,8 +525,6 @@ class CKL(Adaptive): be positive. optimizer__momentum : float The momentum to use with the optimizer. - random_state : int, None, np.random.RandomState - The seed used to generate psuedo-random numbers. sampling : str "adaptive" by default. Use ``sampling="random"`` to perform random sampling with the same optimization method and noise model. @@ -516,7 +538,6 @@ def __init__( optimizer: str = "Embedding", optimizer__lr=0.075, optimizer__momentum=0.9, - random_state=None, mu=1, sampling="adaptive", **kwargs, @@ -528,7 +549,6 @@ def __init__( optimizer=optimizer, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, module__mu=mu, module="CKL", sampling=sampling, @@ -555,8 +575,6 @@ class SOE(Adaptive): be positive. optimizer__momentum : float The momentum to use with the optimizer. - random_state : int, None, np.random.RandomState - The seed used to generate psuedo-random numbers. sampling : str "adaptive" by default. Use ``sampling="random"`` to perform random sampling with the same optimization method and noise model. 
@@ -570,7 +588,6 @@ def __init__( optimizer: str = "Embedding", optimizer__lr=0.075, optimizer__momentum=0.9, - random_state=None, mu=1, sampling="adaptive", **kwargs, @@ -582,7 +599,6 @@ def __init__( optimizer=optimizer, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, module__mu=mu, module="SOE", sampling=sampling, diff --git a/salmon/triplets/algs/_random_sampling.py b/salmon/triplets/algs/_random_sampling.py index 625a3703..3ec2b572 100644 --- a/salmon/triplets/algs/_random_sampling.py +++ b/salmon/triplets/algs/_random_sampling.py @@ -3,7 +3,6 @@ from typing import List, Tuple, Optional import numpy as np -from sklearn.utils import check_random_state from .utils import Answer, Query from ...backend.alg import Runner @@ -11,9 +10,8 @@ logger = logging.getLogger(__name__) -def _get_query(n, random_state=None) -> Tuple[int, int, int]: - random_state = check_random_state(random_state) - a, b, c = random_state.choice(n, size=3, replace=False) +def _get_query(n) -> Tuple[int, int, int]: + a, b, c = np.random.choice(n, size=3, replace=False) return int(a), int(b), int(c) @@ -25,21 +23,18 @@ class RandomSampling(Runner): ---------- n : int Number of objects - random_state: Optional[int] - Seed for random generateor ident : str Identifier of the algorithm """ - def __init__(self, n, d=2, random_state=None, ident=""): + def __init__(self, n, d=2, ident=""): self.n = n self.d = d - self.random_state = check_random_state(random_state) super().__init__(ident=ident) def get_query(self) -> Tuple[Query, Optional[float]]: - h, a, b = _get_query(self.n, random_state=self.random_state) + h, a, b = _get_query(self.n) query = {"head": int(h), "left": int(a), "right": int(b)} return query, -9999 diff --git a/salmon/triplets/algs/_round_robin.py b/salmon/triplets/algs/_round_robin.py index b95afa96..eb070c02 100644 --- a/salmon/triplets/algs/_round_robin.py +++ b/salmon/triplets/algs/_round_robin.py @@ -1,7 +1,6 @@ import logging from typing import List, Tuple - -from sklearn.utils import check_random_state +import numpy as np from .utils import Answer, Query from ...backend.alg import Runner @@ -10,11 +9,10 @@ logger = logging.getLogger(__name__) -def _get_query(n, head, random_state=None) -> Tuple[int, int, int]: - random_state = check_random_state(random_state) +def _get_query(n, head) -> Tuple[int, int, int]: a = head while True: - b, c = random_state.choice(n, size=2) + b, c = np.random.choice(n, size=2) if a != b and b != c and c != a: break return a, b, c @@ -34,24 +32,21 @@ class RoundRobin(Runner): ---------- n : int Number of objects - random_state: Optional[int] - Seed for random generateor ident : str Identifier of the algorithm """ - def __init__(self, n, d=2, random_state=None, ident=""): + def __init__(self, n, d=2, ident=""): self.n = n self.d = d self.answers = [] - self.random_state = check_random_state(random_state) self.counter = 0 super().__init__(ident=ident) def get_query(self) -> Query: head = self.counter % self.n - a, b = self.random_state.choice(self.n, size=2, replace=False) + a, b = np.random.choice(self.n, size=2, replace=False) self.counter += 1 score = max(abs(head - a), abs(head - b)) return {"head": int(head), "left": int(a), "right": int(b)}, float(score) diff --git a/salmon/triplets/algs/adaptive/_embed.py b/salmon/triplets/algs/adaptive/_embed.py index 87903111..652ec4a9 100644 --- a/salmon/triplets/algs/adaptive/_embed.py +++ b/salmon/triplets/algs/adaptive/_embed.py @@ -12,7 +12,6 @@ from torch.utils.data import 
TensorDataset from sklearn.base import BaseEstimator -from sklearn.utils import check_random_state from scipy.special import binom from skorch.utils import is_dataset @@ -46,7 +45,6 @@ def __init__( optimizer=optim.SGD, optimizer__lr=0.04, optimizer__momentum=0.9, - random_state=None, warm_start=True, max_epochs=100, initial_batch_size=512, @@ -58,7 +56,6 @@ def __init__( self.optimizer = optimizer self.optimizer__lr = optimizer__lr self.optimizer__momentum = optimizer__momentum - self.random_state = random_state self.warm_start = warm_start self.max_epochs = max_epochs self.initial_batch_size = initial_batch_size @@ -67,7 +64,6 @@ def __init__( def initialize(self): self.meta_ = {"num_answers": 0, "model_updates": 0, "num_grad_comps": 0} self.initialized_ = True - self.random_state_ = check_random_state(self.random_state) self.answers_ = np.zeros((1000, 3), dtype="uint16") opt_kwargs = {} @@ -81,7 +77,6 @@ def initialize(self): module=self.module, module__n=self.module__n, module__d=self.module__d, - module__random_state=self.random_state, optimizer=self.optimizer, warm_start=self.warm_start, **self.kwargs, @@ -244,7 +239,7 @@ def optimizer_(self) -> np.ndarray: def get_train_idx(self, n_ans): bs = min(n_ans, self.initial_batch_size) - idx = self.random_state_.choice(n_ans, replace=False, size=bs) + idx = np.random.choice(n_ans, replace=False, size=bs) return idx @@ -262,7 +257,7 @@ def __init__(self, shuffle=True, dwell=1, **kwargs): def get_train_idx(self, n_ans): bs = self.initial_batch_size + int(self.meta_["model_updates"] / self.dwell) n_idx = min(bs, n_ans) - rng = self.random_state_ + rng = np.random.RandomState() if self.shuffle: return rng.choice(n_ans, size=n_idx, replace=True) @@ -285,11 +280,9 @@ class Damper(Embedding): """ Damp the learning rate. """ + def __init__( - self, - initial_batch_size=64, - max_batch_size=None, - **kwargs, + self, initial_batch_size=64, max_batch_size=None, **kwargs, ): self.initial_batch_size = initial_batch_size self.max_batch_size = max_batch_size @@ -297,7 +290,7 @@ def __init__( def get_train_idx(self, len_ans): bs = self.batch_size_ - idx_train = self.random_state_.choice(len_ans, size=bs) + idx_train = np.random.choice(len_ans, size=bs) return idx_train def initialize(self): @@ -327,7 +320,6 @@ def damping(self): raise NotImplementedError - class CntsLRDamper(Damper): """ Decays the learning rate like 1 / (1 + mu) like Thm. 4.7 of [1]_. 
@@ -359,7 +351,6 @@ def __init__( optimizer=None, optimizer__lr=None, optimizer__momentum=0.9, - random_state=None, initial_batch_size=64, max_batch_size=None, growth_factor=1.01, @@ -373,7 +364,6 @@ def __init__( optimizer=optimizer, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, initial_batch_size=initial_batch_size, max_batch_size=max_batch_size, **kwargs, @@ -400,7 +390,6 @@ def __init__( optimizer=None, optimizer__lr=None, optimizer__momentum=0.9, - random_state=None, initial_batch_size=64, max_batch_size=None, dwell=10, @@ -414,7 +403,6 @@ def __init__( optimizer=optimizer, optimizer__lr=optimizer__lr, optimizer__momentum=optimizer__momentum, - random_state=random_state, initial_batch_size=initial_batch_size, max_batch_size=max_batch_size, **kwargs, diff --git a/salmon/triplets/algs/adaptive/_noise_models.py b/salmon/triplets/algs/adaptive/_noise_models.py index 7f3d061b..8c21e070 100644 --- a/salmon/triplets/algs/adaptive/_noise_models.py +++ b/salmon/triplets/algs/adaptive/_noise_models.py @@ -1,7 +1,6 @@ from typing import Union import numpy as np -from sklearn.utils import check_random_state import torch import torch.nn as nn @@ -16,7 +15,6 @@ class TripletDist(nn.Module): ---------- n : int d : int - random_state : None, int, np.random.RandomState Attributes ---------- @@ -25,12 +23,11 @@ class TripletDist(nn.Module): """ - def __init__(self, n: int = None, d: int = 2, random_state=None): + def __init__(self, n: int = None, d: int = 2): super().__init__() - self.random_state = random_state self.n = n self.d = d - rng = check_random_state(self.random_state) + rng = np.random.RandomState() embedding = 1e-4 * rng.randn(n, d).astype("float32") self._embedding = torch.nn.Parameter( torch.from_numpy(embedding), requires_grad=True @@ -130,8 +127,8 @@ class TSTE(TripletDist): For details """ - def __init__(self, n=None, d=2, alpha=1, random_state=None): - super().__init__(n=n, d=d, random_state=random_state) + def __init__(self, n=None, d=2, alpha=1): + super().__init__(n=n, d=d) self.alpha = alpha def _probs(self, win2, lose2): @@ -146,8 +143,8 @@ class CKL(TripletDist): The crowd kernel embedding. """ - def __init__(self, n=None, d=2, mu=1e-4, random_state=None): - super().__init__(n=n, d=d, random_state=random_state) + def __init__(self, n=None, d=2, mu=1e-4): + super().__init__(n=n, d=d) self.mu = mu def _probs(self, win2, lose2): diff --git a/salmon/triplets/algs/adaptive/_score.py b/salmon/triplets/algs/adaptive/_score.py index 464d9f43..30e1bcfe 100644 --- a/salmon/triplets/algs/adaptive/_score.py +++ b/salmon/triplets/algs/adaptive/_score.py @@ -2,7 +2,6 @@ from numba import jit, prange import numpy as np from joblib import Parallel, delayed -from sklearn.utils import check_random_state from .search import gram_utils, score import salmon.utils as utils @@ -16,9 +15,6 @@ class QueryScorer: Parameters ---------- - random_state : None, int, RandomState - The random state for query searches - embedding : array-like The embedding of points. @@ -28,11 +24,6 @@ class QueryScorer: where ``win2`` and ``lose2`` are the squared Euclidean distances between the winner and loser. 
- Attributes - ---------- - random_state_ : RandomState - Initialized random state - Notes ----- Inputs: include an embedding, noise model and history of answers @@ -49,23 +40,19 @@ class QueryScorer: """ - def __init__(self, random_state=None, embedding=None, probs=None): - self.random_state = random_state + def __init__(self, embedding=None, probs=None): self.embedding = embedding self.probs = probs def _initialize(self): - self.random_state_ = np.random.RandomState(self.random_state) self.initialized_ = True n = len(self.embedding) self._tau_ = np.zeros((n, n), dtype="float32") self.push([]) - def _random_queries(self, n, num=1000, random_state=None, trim=True): + def _random_queries(self, n, num=1000, trim=True): new_num = int(num * 1.1 + 3) - rng = self.random_state_ - if random_state is not None: - rng = check_random_state(random_state) + rng = np.random.RandomState() queries = rng.choice(n, size=(new_num, 3)) repeated = ( (queries[:, 0] == queries[:, 1]) @@ -137,7 +124,7 @@ def push(self, history): class InfoGainScorer(QueryScorer): - def score(self, *, queries=None, num=1000, random_state=None): + def score(self, *, queries=None, num=1000): """ Score the queries using (almost) the information gain. @@ -155,9 +142,7 @@ def score(self, *, queries=None, num=1000, random_state=None): if queries is not None and num != 1000: raise ValueError("Only specify one of `queries` or `num`") if queries is None: - queries = self._random_queries( - len(self.embedding), num=num, random_state=random_state - ) + queries = self._random_queries(len(self.embedding), num=num) Q = np.array(queries).astype("int64") H, O1, O2 = Q[:, 0], Q[:, 1], Q[:, 2] @@ -166,7 +151,7 @@ def score(self, *, queries=None, num=1000, random_state=None): class UncertaintyScorer(QueryScorer): - def score(self, *, queries=None, num=1000, trim=True, random_state=None): + def score(self, *, queries=None, num=1000, trim=True): """ Score the queries using (almost) the information gain. 
@@ -184,9 +169,7 @@ def score(self, *, queries=None, num=1000, trim=True, random_state=None): if queries is not None and num != 1000: raise ValueError("Only specify one of `queries` or `num`") if queries is None: - queries = self._random_queries( - len(self.embedding), num=num, random_state=random_state, trim=trim - ) + queries = self._random_queries(len(self.embedding), num=num, trim=trim) Q = np.array(queries).astype("int64") H, O1, O2 = Q[:, 0], Q[:, 1], Q[:, 2] diff --git a/salmon/triplets/algs/adaptive/search/__triplets.py b/salmon/triplets/algs/adaptive/search/__triplets.py index ee0840d4..e851a9d5 100644 --- a/salmon/triplets/algs/adaptive/search/__triplets.py +++ b/salmon/triplets/algs/adaptive/search/__triplets.py @@ -3,13 +3,11 @@ from gram_utils import dist2 from time import time import search -from sklearn.utils import check_random_state import numpy.linalg as LA -def random_query(n, random_state=None): - random_state = check_random_state(random_state) - return random_state.choice(n, size=3, replace=False) +def random_query(n): + return np.random.choice(n, size=3, replace=False) def update(G, head, winner, loser): @@ -80,12 +78,11 @@ def update(G, head, winner, loser): class NoSearch: - def __init__(self, n, d=2, random_state=None, **kwargs): - self.random_state = check_random_state(random_state) + def __init__(self, n, d=2, **kwargs): self.n = n self.d = d - self.X = self.random_state.randn(n, d) / 1000 + self.X = np.random.randn(n, d) / 1000 self.G = self.X @ self.X.T # self.G = np.eye(n) # self.X = gram_utils.decompose(self.G, d=self.d) @@ -98,8 +95,7 @@ def get_query(self): """ Returns [head, predicted_winner, predicted_loser] """ - q = random_query(self.n, random_state=None) - return q + return random_query(self.n) def process_answer(self, head, winner, loser): self.num_answers = len(self.answers) @@ -110,7 +106,7 @@ def process_answer(self, head, winner, loser): beta = self.n if num_ans > 2 * beta: - for i in self.random_state.choice(num_ans, size=min(20, beta - 1),): + for i in np.random.choice(num_ans, size=min(20, beta - 1),): self.G = update(self.G, *self.answers[i]) self._times = {"update_time": time() - start} if len(self.answers) % self.n == 0: @@ -122,7 +118,7 @@ class RandomSearch(NoSearch): def __init__(self, n, t_max=0.05, R=10, **kwargs): super().__init__(n, **kwargs) self.t_max = t_max - self.tau = self.random_state.rand(n, n) + self.tau = np.random.rand(n, n) self.n = n self.R = R self._summary = {"t_max": t_max, "name": type(self).__name__} @@ -134,7 +130,7 @@ def get_query(self): or np.abs(self.t_max) < 0 or np.allclose(self.t_max, 0) ): - q = random_query(self.n, random_state=self.random_state) + q = random_query(self.n) self._saved = {"query": q, "score": -np.inf, "searched": 0} return q @@ -147,9 +143,7 @@ def get_query(self): best_q = random_query(n) while time() - start < self.t_max: searched += 10 - queries = [ - random_query(n, random_state=self.random_state) for _ in range(10) - ] + queries = [random_query(n) for _ in range(10)] scores = [search.score(q, tau, D) for q in queries] best_idx = np.argmax(scores) diff --git a/salmon/triplets/algs/adaptive/search/_search.py b/salmon/triplets/algs/adaptive/search/_search.py index 70eee070..c6741cd2 100644 --- a/salmon/triplets/algs/adaptive/search/_search.py +++ b/salmon/triplets/algs/adaptive/search/_search.py @@ -5,7 +5,6 @@ import numpy as np import numpy.linalg as LA -from sklearn.utils import check_random_state import torch try: @@ -16,12 +15,12 @@ Array = Union[np.ndarray, torch.Tensor] -def 
random_query(n, random_state=None): - random_state = check_random_state(random_state) +def random_query(n): + rng = np.random.RandomState() while True: - a = random_state.choice(n) - b = random_state.choice(n) - c = random_state.choice(n) + a = rng.choice(n) + b = rng.choice(n) + c = rng.choice(n) if a != b and b != c and c != a: break return [a, b, c] diff --git a/salmon/triplets/algs/adaptive/search/tests/test_search_refactor.py b/salmon/triplets/algs/adaptive/search/tests/test_search_refactor.py index 11f5f9ca..2066a5ce 100644 --- a/salmon/triplets/algs/adaptive/search/tests/test_search_refactor.py +++ b/salmon/triplets/algs/adaptive/search/tests/test_search_refactor.py @@ -45,7 +45,7 @@ def test_score_refactor(seed=None): for i in range(n): tau[i] /= tau[i].sum() - queries = [search.random_query(n, random_state=rng) for _ in range(100)] + queries = [search.random_query(n) for _ in range(100)] # old_score has been refactored to take in [h, w, l] old_scores = [_score_next([w, l, h], tau, X) for h, w, l in queries] @@ -214,8 +214,8 @@ def test_salmon_integration(): n, d = 10, 2 rng = np.random.RandomState(42) X = rng.randn(n, d).astype("float32") - est = TSTE(n, random_state=42) - search = InfoGainScorer(random_state=42, embedding=X, probs=est.probs) + est = TSTE(n) + search = InfoGainScorer(embedding=X, probs=est.probs) history = [_simple_triplet(n, rng) for _ in range(1000)] search.push(history) queries, scores = search.score() @@ -253,8 +253,8 @@ def test_salmon_integration(): def test_salmon_posterior_refactor(n=30, d=2): rng = np.random.RandomState(42) X = rng.randn(n, d).astype("float32") - est = TSTE(n, random_state=42) - search = InfoGainScorer(random_state=42, embedding=X, probs=est.probs) + est = TSTE(n) + search = InfoGainScorer(embedding=X, probs=est.probs) history = [_simple_triplet(n, rng) for _ in range(2000)] search.push(history) queries, scores = search.score() diff --git a/salmon/triplets/algs/adaptive/search/tests/test_uncertainty.py b/salmon/triplets/algs/adaptive/search/tests/test_uncertainty.py index 0ee41bd2..b121fefc 100644 --- a/salmon/triplets/algs/adaptive/search/tests/test_uncertainty.py +++ b/salmon/triplets/algs/adaptive/search/tests/test_uncertainty.py @@ -8,10 +8,11 @@ def test_uncertainty_sampling(random_state=42): rng = np.random.RandomState(random_state) X = rng.uniform(size=(n, d)) - search = UncertaintyScorer(embedding=X, random_state=random_state,) + search = UncertaintyScorer(embedding=X) queries, scores = search.score(num=100) distances = [ - abs(LA.norm(X[h] - X[l]) - LA.norm(X[h] - X[r])) for h, l, r in queries + abs(LA.norm(X[h] - X[l]) ** 2 - LA.norm(X[h] - X[r]) ** 2) + for h, l, r in queries ] idx_best_score = np.argmax(scores) - assert distances[idx_best_score] == min(distances) + assert np.allclose(distances[idx_best_score], min(distances)) diff --git a/salmon/triplets/algs/adaptive/tests/test_basics.py b/salmon/triplets/algs/adaptive/tests/test_basics.py deleted file mode 100644 index 7c6b063c..00000000 --- a/salmon/triplets/algs/adaptive/tests/test_basics.py +++ /dev/null @@ -1,36 +0,0 @@ -import numpy as np -from sklearn.utils import check_random_state - -from salmon.triplets.algs.adaptive import TSTE, Embedding -from torch.optim import SGD - - -def test_random_state(): - n, d = 20, 2 - random_state = 10 - - rng = check_random_state(random_state) - answers = rng.choice(n, size=(4 * n, 3)) - - kwargs = dict( - module=TSTE, - module__n=n, - module__d=2, - optimizer=SGD, - optimizer__lr=0.1, - optimizer__momentum=0.9, - 
random_state=random_state, - ) - - est1 = Embedding(**kwargs) - est1.initialize() - est1.partial_fit(answers) - s1 = est1.score(answers) - - est2 = Embedding(**kwargs) - est2.initialize() - est2.partial_fit(answers) - s2 = est2.score(answers) - - assert np.allclose(est1.embedding(), est2.embedding()) - assert np.allclose(s1, s2) diff --git a/salmon/triplets/algs/tests/test_rr.py b/salmon/triplets/algs/tests/test_rr.py index f667b7cb..82321482 100644 --- a/salmon/triplets/algs/tests/test_rr.py +++ b/salmon/triplets/algs/tests/test_rr.py @@ -5,11 +5,11 @@ def test_rr(): - alg = RoundRobin(n=10, random_state=42) + alg = RoundRobin(n=10) + alg.foo = "bar" ir = cloudpickle.dumps(alg) alg2 = cloudpickle.loads(ir) assert type(alg2) == RoundRobin assert alg2.n == 10 - assert np.allclose( - alg2.random_state.randint(10, size=10), alg.random_state.randint(10, size=10) - ) + assert alg.foo == alg2.foo + assert alg.meta_ == alg2.meta_ diff --git a/salmon/triplets/manager.py b/salmon/triplets/manager.py index 66e43e8b..c051d44e 100644 --- a/salmon/triplets/manager.py +++ b/salmon/triplets/manager.py @@ -1,9 +1,9 @@ from datetime import datetime, timedelta from typing import Any, Dict, List import random +import numpy as np from pydantic import BaseModel -from sklearn.utils import check_random_state class Answer(BaseModel): @@ -63,8 +63,9 @@ def get_responses(answers: List[Dict[str, Any]], targets, start_time=0): out[-1].update({**idxs, **names, **meta}) return out -def random_query(n: int, random_state=None) -> Dict[str, int]: - rng = check_random_state(random_state) + +def random_query(n: int) -> Dict[str, int]: + rng = np.random.RandomState() while True: a, b, c = rng.choice(n, size=3) if a != b and b != c and c != a: diff --git a/salmon/triplets/offline.py b/salmon/triplets/offline.py index ec16a77f..5fd68ae5 100644 --- a/salmon/triplets/offline.py +++ b/salmon/triplets/offline.py @@ -4,7 +4,7 @@ from time import time from copy import deepcopy, copy from typing import Dict, Union -from number import Number +from numbers import Number from sklearn.model_selection import train_test_split from sklearn.base import BaseEstimator @@ -15,6 +15,7 @@ from salmon.triplets.algs.adaptive import CKL import salmon.triplets.algs.adaptive as adaptive + def _get_params(opt_): return { k: v @@ -64,7 +65,6 @@ def initialize(self, X_train): module=noise_model, module__n=self.n, module__d=self.d, - random_state=42 ** 4, optimizer=optim.Adadelta, max_epochs=self.max_epochs, shuffle=self.shuffle, diff --git a/tests/data/exp-active-bad.yaml b/tests/data/exp-active-bad.yaml index e8832cf5..439d9f04 100644 --- a/tests/data/exp-active-bad.yaml +++ b/tests/data/exp-active-bad.yaml @@ -9,4 +9,3 @@ max_queries: 20 samplers: foobar: class: FooBar - random_state: 42 diff --git a/tests/data/round-robin.yaml b/tests/data/round-robin.yaml index ddc58301..dcd05b62 100644 --- a/tests/data/round-robin.yaml +++ b/tests/data/round-robin.yaml @@ -2,5 +2,4 @@ targets: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] max_queries: 20 samplers: - RoundRobin: - random_state: 42 + RoundRobin: {} diff --git a/tests/test_active_search.py b/tests/test_active_search.py index 0ba6590d..3f60ef8b 100644 --- a/tests/test_active_search.py +++ b/tests/test_active_search.py @@ -33,16 +33,14 @@ def test_same_salmon_next(n=40, d=2, num_ans=4000): X, y = dataset(n, num_ans=num_ans, random_state=42) ans = answer(X, y) - est = TSTE(n=n, d=d, random_state=42) + est = TSTE(n=n, d=d) new_embedding = (np.arange(n * d) // d).reshape(n, d).astype("float32") for s1, s2 in [(30, 
10)]: new_embedding[s1] = s2 new_embedding[s2] = s1 - search = InfoGainScorer( - random_state=42, embedding=new_embedding, probs=est.opt.module_.probs - ) + search = InfoGainScorer(embedding=new_embedding, probs=est.opt.module_.probs) search.push(ans) _, (useful, useless) = search.score(queries=[(29, 30, 10), (29, 11, 28)])