Testing strategy
Some thoughts on what we should/could test and why. See also issue #45 on this repo.
Writing unit tests usually relies on comparing the output of a function with some expected, known-correct value. This paradigm, however, breaks down when the function correctly produces different values on each invocation, as is the case with code that includes calls to a random number generator (RNG). There are workarounds for this (most often setting the seed of the RNG before running the tests, and relying on the consistent outputs that this produces), but they are not robust to code changes.
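As an illustration of how brittle the seed-setting workaround can be, here is a minimal sketch; `sample_measurement` is a hypothetical stand-in for the code under test, not something from this repo:

```python
import random


def sample_measurement(rng: random.Random) -> float:
    """Hypothetical function under test: returns a noisy measurement."""
    return 10.0 + rng.gauss(0.0, 1.0)


def test_sample_measurement_with_fixed_seed():
    # Fixing the seed makes the output reproducible across runs...
    value = sample_measurement(random.Random(42))
    # ...so it can be compared against a "golden" value recorded from an
    # earlier run (regenerated here only to keep the example self-contained).
    expected = 10.0 + random.Random(42).gauss(0.0, 1.0)
    assert value == expected
    # Any refactor that changes the order or number of RNG draws invalidates
    # the golden value, even if the new behaviour is statistically identical.
```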
A more informative approach would be to test properties of the output. These can be:
- properties that can be tested from a single invocation (e.g. "the value is negative" or "the list is sorted in descending order")
- properties that describe the statistics of the outputs, and are therefore more naturally considered in the context of multiple invocations
In the following, the focus is on the latter.
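To make the distinction concrete, here is a minimal sketch of both kinds of checks for a hypothetical `sample_delays` function; the function, the distribution and the tolerance are placeholders chosen for illustration:

```python
import random


def sample_delays(rng: random.Random, n: int) -> list[float]:
    """Hypothetical function under test: n non-negative random delays."""
    return sorted(rng.expovariate(2.0) for _ in range(n))


def test_single_invocation_properties():
    # Properties checkable from one call: non-negativity and sorted order.
    delays = sample_delays(random.Random(), 100)
    assert all(d >= 0 for d in delays)
    assert delays == sorted(delays)


def test_statistical_property():
    # A statistical property: the sample mean over many draws should be
    # close to the theoretical mean (1 / rate = 0.5 here). The tolerance
    # below is picked by hand; choosing it in a principled way is exactly
    # the question discussed on the rest of this page.
    rng = random.Random()
    delays = [rng.expovariate(2.0) for _ in range(10_000)]
    assert abs(sum(delays) / len(delays) - 0.5) < 0.05
```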
Consider a function f(...) with non-deterministic output y. For simplicity, let us assume for now that this is the only return value of the function, although in realistic scenarios we will likely want to test multiple outputs.

We assume that there exists some distribution p (not necessarily known to us!) that the output should follow; that is, p defines the correct behaviour of the function. Ideally, we would like to assert that y follows p. However, this is not always possible:
- We may not know p (its parametrisation or even its general family, if it even follows a "common" distribution)
- Verifying that samples are drawn from a particular distribution is not always straightforward (see the sketch after this list for the case where p is fully known)
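When p is fully specified, one option is a goodness-of-fit check such as the one-sample Kolmogorov-Smirnov test from `scipy.stats`; it does not help when p is unknown, and it only moves the problem to choosing a significance level, but for completeness here is a minimal sketch (the distribution and threshold are illustrative):

```python
import numpy as np
from scipy import stats


def test_samples_follow_standard_normal():
    # Hypothetical case where p is fully known (a standard normal here):
    # the one-sample KS test compares the empirical distribution of the
    # samples against the specified CDF.
    rng = np.random.default_rng()
    samples = rng.normal(loc=0.0, scale=1.0, size=5_000)
    result = stats.kstest(samples, stats.norm(loc=0.0, scale=1.0).cdf)
    # The significance level (0.01 here) is itself a tolerance choice:
    # roughly 1% of runs of a *correct* implementation will still fail.
    assert result.pvalue > 0.01
```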
Therefore, we must use some proxy checks. Commonly, this is something of the form: "the mean/variance/... of y across multiple runs is as expected". Here, "is as expected" means "is close to a particular value which I can compute, possibly with some tolerance" or "lies in a known interval". Similarly, we may want to check that the distribution of the mean is "as expected", i.e. within specific limits.
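A minimal sketch of such a proxy check, for a hypothetical f whose output is meant to be normally distributed with mean 3 and standard deviation 2; the intervals below are hand-picked, which is exactly the problem discussed next:

```python
import random
import statistics


def f(rng: random.Random) -> float:
    """Hypothetical function under test; y should follow N(3, 2**2)."""
    return rng.gauss(3.0, 2.0)


def test_mean_and_variance_are_as_expected():
    rng = random.Random()
    ys = [f(rng) for _ in range(20_000)]
    # Proxy checks: the sample mean and variance should be close to the
    # values implied by the intended behaviour. The intervals below are
    # hand-picked rather than derived from a chosen confidence level.
    assert 2.9 < statistics.fmean(ys) < 3.1
    assert 3.6 < statistics.variance(ys) < 4.4
```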
One important question is how to choose the tolerance (or width of the interval) mentioned above. In some cases, when the sampling distributions involved are convenient, we may know the probability of y lying in an interval [M - d, M + d]; or, conversely, for a given confidence threshold, we may be able to compute the interval limits. Choosing the tolerance then becomes a question of choosing a confidence level. For the "derived" case of the distribution of the mean, this is still more complicated, but potentially manageable in some cases. A related question is how many samples we should be taking to verify that such a property holds.
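For the simplest case, assuming the sample mean is approximately normal (CLT) and that we have a reliable value for the standard deviation sigma, the half-width d can be computed directly from the chosen confidence level; the helper below is a sketch, not part of any existing framework:

```python
import math

from scipy import stats


def mean_tolerance(sigma: float, n_samples: int, confidence: float) -> float:
    """Half-width d of the interval [M - d, M + d] that should contain the
    sample mean with the given probability, assuming the CLT normal
    approximation and a known (or well-estimated) standard deviation sigma.
    """
    z = stats.norm.ppf(0.5 + confidence / 2.0)
    return z * sigma / math.sqrt(n_samples)


# For example, with sigma = 2.0 and 10_000 samples, a 99% confidence level
# gives d of roughly 0.0515: a correct implementation will still land outside
# [M - d, M + d] in about 1% of test runs.
print(mean_tolerance(sigma=2.0, n_samples=10_000, confidence=0.99))
```

The same relation can be inverted to answer the sample-size question: for a desired half-width d, take n = (z * sigma / d)**2.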
Orthogonal to this is the question of how the testing code can implement such checks. This includes questions like:
- How does this fit within testing frameworks like pytest, where each test function is supposed to provide an assertion that passes?
- Do we run a test multiple times and record the percentage of times the property is satisfied? Or run some code multiple times, compute statistics of the output, and verify that they match expected behaviour? (Both options are sketched after this list.)
- Should we be setting the seed, or does that defeat the purpose of testing statistical behaviour?
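Both options can be expressed with plain pytest; the sketch below uses a made-up N(3, 2**2) generator in place of the real code under test, and hand-picked thresholds:

```python
import random
import statistics

import pytest


@pytest.fixture
def rng() -> random.Random:
    # Unseeded by default; a fixed seed could be injected here if we decide
    # reproducibility matters more than exercising fresh randomness.
    return random.Random()


def test_statistics_of_many_invocations(rng):
    # Option 2: one test function that calls the code under test many times
    # and asserts on aggregate statistics, so the test still ends in a
    # single ordinary assertion.
    ys = [rng.gauss(3.0, 2.0) for _ in range(20_000)]  # stand-in for f(...)
    assert abs(statistics.fmean(ys) - 3.0) < 0.1


def run_fraction_passing(check, n_runs: int) -> float:
    # Option 1: repeat a per-invocation check many times and report the
    # fraction of runs in which it held.
    return sum(check() for _ in range(n_runs)) / n_runs


def test_property_holds_often_enough(rng):
    # The per-run property here is deliberately probabilistic: a single
    # N(3, 2**2) draw lies within one standard deviation of the mean about
    # 68% of the time, so we assert the observed fraction is in that region.
    fraction = run_fraction_passing(
        lambda: abs(rng.gauss(3.0, 2.0) - 3.0) < 2.0, 5_000
    )
    assert 0.63 < fraction < 0.73
```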
What we would like to have, practically, is:
- a general "dial" we can tweak that sets the tolerances for the different tests
- a framework (functions, classes, decorators...) that lets us easily express constraints like the above, without bloating the test code or requiring too much repetition (for example, a pytest plugin?)
The above also implies that we need a good structure for separating out the various parts of a test, to promote readability and avoid redundancy.
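As an illustration of the "dial" and plugin ideas, a global tolerance multiplier could be exposed through pytest's standard `conftest.py` hooks; the option name, fixture and scaling scheme below are invented for the sketch:

```python
# conftest.py -- a minimal sketch of the "dial" idea as a local pytest plugin.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--statistical-tolerance",
        action="store",
        default=1.0,
        type=float,
        help="Global multiplier applied to the tolerances of statistical tests.",
    )


@pytest.fixture
def tolerance(request):
    """Return a function mapping a baseline tolerance to the scaled one."""
    scale = request.config.getoption("--statistical-tolerance")
    return lambda baseline: baseline * scale
```

```python
# test_example.py -- using the dial in a test.
import random
import statistics


def test_mean_with_global_dial(tolerance):
    ys = [random.gauss(3.0, 2.0) for _ in range(20_000)]
    # tolerance(0.1) is 0.1 by default; running
    # `pytest --statistical-tolerance 2.0` loosens every such check at once.
    assert abs(statistics.fmean(ys) - 3.0) < tolerance(0.1)
```

If this shape turns out to be useful, the conftest.py hooks could later be packaged as a standalone pytest plugin.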