Added ufloat_from_sample function #277

Open
wants to merge 21 commits into base: master
1 change: 1 addition & 0 deletions CHANGES.rst
@@ -12,6 +12,7 @@ Changes
time. Now such a function can be imported but if the user attempts to
execute it, a `NotImplementedError` is raised indicating that the
function can't be used because `numpy` couldn't be imported.
- Added `uncertainties.ufloat_from_sample()` to create a ufloat from a random sample of a variable.

Fixes:

6 changes: 6 additions & 0 deletions doc/user_guide.rst
@@ -534,6 +534,12 @@ manner**. This is what the :func:`nominal_value` and
>>> uncertainties.std_dev(3)
0.0

A number with an uncertainty can be estimated from a sample of
numbers without uncertainties using :func:`ufloat_from_sample`,
which returns an estimate of the true value together with its
uncertainty. The currently implemented "gaussian" method returns
the mean and the standard error on the mean, so it works best for
large samples that are normally distributed.
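For illustration, the result of the "gaussian" method can be reproduced by hand with the standard library (a sketch based on this PR's description; `ufloat_from_sample` itself is the function being added here and may change):

```python
from math import sqrt
from statistics import mean, stdev

sample = [-1.5, -0.5, 0, 0.5, 1.5]

# nominal value: the sample mean
nominal = mean(sample)

# uncertainty: the standard error on the mean, i.e. the Bessel-corrected
# sample standard deviation divided by sqrt(n)
std_err = stdev(sample) / sqrt(len(sample))

print(nominal, std_err)  # mean 0.0 with standard error ~0.5
```

This matches the first test case in the PR's test suite, where a sample of five values centered on zero yields a nominal value of 0 and an uncertainty of 0.5.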

Finally, a utility method is provided that directly yields the
`standard score <http://en.wikipedia.org/wiki/Standard_score>`_
(number of standard deviations) between a number and a result with
67 changes: 67 additions & 0 deletions tests/test_uncertainties.py
@@ -27,6 +27,8 @@
except ImportError:
np = None

if np is not None:
    from uncertainties.unumpy.core import nominal_values, std_devs

def test_value_construction():
"""
@@ -146,6 +148,71 @@ def test_ufloat_fromstr():
assert numbers_close(num.std_dev, values[1])
assert num.tag == "test variable"

def test_ufloat_from_sample():
    "Test generating a number with an uncertainty from a sample"

    # the test inputs for the sample parameter
    test_samples = [
        [-1.5, -0.5, 0, 0.5, 1.5]  # test standard list input
    ]

    # the test inputs for the other arguments
    optional_args = [{}]

    # the expected nominal values
    expected_n = [0]

    # the expected standard deviations
    expected_s = [0.5]

    if np is not None:
        # include extra tests for numpy arrays
        test_samples += [
            np.array([-1.5, -0.5, 0, 0.5, 1.5]),
            np.array([
                [-3, -1, 0, 1, 3],
                [-1.5, -0.5, 0, 0.5, 1.5],
                [-0.75, -0.25, 0, 0.25, 0.75],
                [0, 0, 0, 0, 0],
                [1.5, 0.5, 0, -0.5, -1.5],
            ]),
            np.array([
                [-3, -1, 0, 1, 3],
                [-1.5, -0.5, 0, 0.5, 1.5],
                [-0.75, -0.25, 0, 0.25, 0.75],
                [0, 0, 0, 0, 0],
                [1.5, 0.5, 0, -0.5, -1.5],
            ]),
        ]
        optional_args += [
            {},
            {"axis": 0},
            {"axis": 1},
        ]
        expected_n += [
            0,
            [-0.75, -0.25, 0.0, 0.25, 0.75],
            [0, 0, 0, 0, 0],
        ]
        expected_s += [
            0.5,
            [0.75, 0.25, 0.0, 0.25, 0.75],
            [1, 0.5, 0.25, 0, 0.5],
        ]

    # run the tests
    for i, sample in enumerate(test_samples):
        num = uncert_core.ufloat_from_sample(sample, **optional_args[i])

        if np is None:
            # without numpy only the plain-list case runs, so compare scalars
            assert numbers_close(num.nominal_value, expected_n[i])
            assert numbers_close(num.std_dev, expected_s[i])
        else:
            # check nominal values
            assert np.allclose(nominal_values(num), expected_n[i])
            # check standard deviations
            assert np.allclose(std_devs(num), expected_s[i])


###############################################################################

74 changes: 73 additions & 1 deletion uncertainties/core.py
@@ -17,6 +17,9 @@
from builtins import str, zip, range, object
from math import sqrt, isfinite # Optimization: no attribute look-up

from statistics import mean as stats_mean
from statistics import stdev as stats_stdev

import copy
import collections

@@ -71,7 +74,6 @@
except ImportError:
numpy = None


def correlated_values(nom_values, covariance_mat, tags=None):
"""
Return numbers with uncertainties (AffineScalarFunc objects)
@@ -1004,6 +1006,76 @@ def ufloat_fromstr(representation, tag=None):
(nom, std) = str_to_number_with_uncert(representation.strip())
return ufloat(nom, std, tag)

def ufloat_from_sample(sample, method="gaussian", axis=None):
    """
    Convert a collection of values into a ufloat.

    Arguments:
    ----------
    sample: list or numpy array of numbers
        The sample of values.

    method: optional string
        The method used to calculate the ufloat. Currently, only
        the "gaussian" method is implemented.

        gaussian: The nominal value is the mean of the sample and
        the standard deviation is the error on the mean. This
        method assumes that the sample follows a Gaussian
        distribution and works best for large samples. It is well
        suited to estimating a fixed value that has been measured
        multiple times with some random error.

    axis: integer or None
        Only used when the sample is a numpy array. The axis along
        which the ufloats are computed. If None (the default), the
        sample is the whole flattened array.
    """

    if method == "gaussian":

        if numpy is None:
            # if numpy is not present, use Python's statistics functions instead
            mean_value = stats_mean(sample)
            # standard error on the mean: the (Bessel-corrected) sample
            # standard deviation divided by sqrt(n)
            error_on_mean = stats_stdev(sample) / sqrt(len(sample))
Member:

I believe statistics.stdev already applies Bessel's correction (it divides the variance by len(samples) - 1), and it should not be applied again here.

But also, I think we will need to discuss/document this, and perhaps have an option for using Bessel's "-1", perhaps by adding a ddof argument (default=0). I think the default should match numpy.std(), which divides the variance by len(samples).

Author:

Yes, you are right, I mistakenly assumed that statistics.stdev behaved the same as numpy.std does without any optional arguments.

Adding ddof=0 as an optional parameter should be simple. For the non-numpy code I can use statistics.pstdev and then include "-ddof" in the denominator.
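The plan in this reply could look roughly like the following (a sketch, not code from the PR; `sample_std` is a hypothetical helper name):

```python
from math import sqrt
from statistics import pstdev

def sample_std(sample, ddof=0):
    # pstdev divides the sum of squared deviations by n;
    # rescale so the divisor is n - ddof, matching numpy.std(..., ddof=ddof)
    n = len(sample)
    return pstdev(sample) * sqrt(n / (n - ddof))

print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))           # 2.0 (ddof=0)
print(sample_std([2, 4, 4, 4, 5, 5, 7, 9], ddof=1))   # ~2.138
```

With ddof=0 this matches numpy.std's default; with ddof=1 it matches statistics.stdev.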


            return ufloat(mean_value, error_on_mean)

        else:
            # if numpy is present, use the faster numpy functions,
            # which can also handle a wider range of inputs
            mean_value = numpy.mean(sample, axis=axis)

            # the size of each sample being collected
            if axis is None:
                sample_size = numpy.size(sample)
            else:
                sample_size = numpy.shape(sample)[axis]

            error_on_mean = numpy.std(sample, ddof=1, axis=axis) / numpy.sqrt(sample_size)
Member:

I think the plain ndarray.std() should be used by default, ddof should default to 1, and dividing by the sample size here is not correct.

Author:

Sorry, I don't understand what you mean. The error on the mean is the standard deviation of the sample divided by the square root of the sample size.

Member:

Sorry, I meant ddof should default to 0, as in numpy.std().

Dividing by sqrt(len(samples)) gives the standard deviation of the mean, not the standard deviation of the value. The standard deviation of the mean tells you how much you expect the mean value to change if you select a different set of samples/measurements from a larger pool of values.

For propagating uncertainties, you (usually, normally?) want the standard deviation of the value itself.

I think two good rules of thumb are "what does Wikipedia say?" and "what does numpy do?".
Taking the basic example at https://en.wikipedia.org/wiki/Standard_deviation#Basic_examples:

>>> import numpy as np
>>> samples = [2, 4, 4, 4, 5, 5, 7, 9]
>>> print(np.mean(samples), np.std(samples))
5.0 2.0
>>> print(np.mean(samples), np.std(samples, ddof=1))
5.0 2.13808993529939

which agrees with Wikipedia. I would be reluctant to have

>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])

return anything except ufloat(5.0, 2.0) - that would require extra explanation and documentation.

An option to divide by sqrt(len(samples) - ddof) would be OK.

Author (@Myles244, Dec 22, 2024):

Thank you, I understand now.

I originally intended the function to return the mean and the standard error on the mean. This is because if I am measuring a constant value where my measurement apparatus has some random error, then my measurements are a sample from a normal distribution, where the best estimate for the value is the mean of the sample and the uncertainty is the uncertainty on the mean.

Perhaps it could be a method "from measurement". I think this would be clearer anyway.

Member:

@Myles244 if your samples are from a normal distribution, and you are measuring noisy values from an experimental measurement, then I think you do want to use the mean and the standard deviation (as from numpy.mean() and numpy.std()). Those give the average (and, for normal distributions, most likely) value and the appropriate measure of the variation observed.

The standard deviation of the mean is a useful quantity, but I think it is not what you want here. Let's take the mass of Christmas trees (and, just to keep things jolly, let's pretend those have Gaussian distributions), where the total sample size in the survey is 1 million trees. A random sample of 100 would be small, but would give a mean and standard deviation and not be terrible. If you select a different 100 samples, you would get a mean and standard deviation pretty consistent with the first sample of 100. If you increase the sample size to 1000, you would not expect the mean or the standard deviation to change much from the sample of 100. But if you took a second sample of 1000, you would expect that mean value to be closer to the mean of the first 1000 than was the case for the two samples of 100.

Another classic way to think about it is with test scores across many classrooms: give the same test to all students in 10 classrooms of 25 students each. The standard deviation of the 250 scores tells you the variation between students. The standard deviation of the mean of each classroom tells you the variation between classrooms. One is a measure of the students; the other is a measure of the teachers (or of a bias in how students are assigned classrooms, so the administration, perhaps) ;).

If you are measuring voltages and see a distribution [2, 4, 4, 4, 5, 5, 7, 9], the standard deviation is 2. If you repeat the measurements 100 times and the values are [[2, 4, 4, 4, 5, 5, 7, 9]*100], the standard deviation is still 2.

OTOH, if those are 100 different samples of a set of 8 different, possibly related, voltages, then the standard deviation of the mean of those 100 observations of 8 voltages is now much smaller than 2 -- the mean value is very consistent (across those 100 "classrooms"). But to know the difference, I think you would need to provide the 100 means and standard deviations.

I think the most natural and common use case is a bunch of samples of a single quantity, in which case the standard deviation is wanted. An option to divide by sqrt(len(samples) - ddof) would be OK.
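The voltage example in this comment can be checked numerically with the standard library alone (a sketch; the figures follow from the comment's own numbers):

```python
from math import sqrt
from statistics import pstdev

once = [2, 4, 4, 4, 5, 5, 7, 9]
repeated = once * 100  # the same readings observed 100 times over

# the spread of the values is unchanged by repetition
print(pstdev(once), pstdev(repeated))  # 2.0 2.0

# the standard error on the mean shrinks like 1/sqrt(n)
sem_once = pstdev(once) / sqrt(len(once))
sem_repeated = pstdev(repeated) / sqrt(len(repeated))
print(round(sem_once, 4), round(sem_repeated, 4))  # 0.7071 0.0707
```

This is exactly the distinction at issue: repeating measurements leaves the standard deviation fixed while the standard error of the mean keeps falling.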

Author (@Myles244, Dec 23, 2024):

@jagerber48 I think I have to agree with you; clearly, the function is ambiguous, and perhaps it would be better not to include it.

Thanks, anyway.

P.S. Should I close the issue or leave it open?

Contributor:

I think you can leave it open for now so we can continue to consider it for at least a few days. While I like the idea of having this as a constructor on UFloat, I think it is too specific and there are too many different things people could want. I'm more sympathetic to the idea of a module with a variety of useful functions like this one, where they could be named and documented more specifically, e.g. ufloat_from_sample_sem, ufloat_from_sample_std, etc.

We could include e.g. those two constructor functions on UFloat. But in any case, my opinion is that this "utility" work should be held off until we are able to complete some of the "core" maintenance targets we have for uncertainties (re-architect the core classes, rework numpy integration, eliminate reliance on confusing monkey patching, etc.). In the lifetime of uncertainties this group of maintainers is very new, so I think it makes sense to make changes, especially API changes, slowly as we learn.

Contributor:

Yes, please leave this open.

The wiki page mentions the ambiguity:

    When only a sample of data from a population is available, the term standard deviation of the sample or sample standard deviation can refer to either the above-mentioned quantity as applied to those data, or to a modified quantity that is an unbiased estimate of the population standard deviation (the standard deviation of the entire population).

So I agree with having explicit names like ufloat_from_sample_sem and ufloat_from_sample_std, or a required keyword.

Contributor:

@andrewgsavage to be clear, that snippet is from the Wikipedia page on standard deviation and is referring to the distinction between the sample standard deviation and an unbiased estimator of the population/distribution standard deviation.

Neither of these is the standard error on the mean, which is the uncertainty of an unbiased estimator of the population mean calculated using the sample mean.

So there are three possible options for what the std_dev attribute on the returned UFloat should be:

  • The sample standard deviation
  • An unbiased estimator for the population/distribution standard deviation, calculated using ddof != 0
  • The standard error on the mean

I agree that these could be selected by using different functions or a required keyword argument.

If we want this stuff to be in uncertainties then here's my proposal:

  • We make a new module named utils.py (I don't think it should actually be named utils.py, we should try to come up with something better; I'm just using that as a placeholder).
  • This module has one function called ufloat_from_sample_std_dev, which accepts a ddof argument, matching the numpy API.
  • This module has another function called ufloat_from_sample_std_err. I don't think any arguments other than the samples are necessary.

I express a preference for following numpy conventions rather than the built-in statistics module's conventions, because numpy is better documented. It looks like the statistics module does some guessing under the hood about what ddof should equal, but it doesn't really explain it.

These functions should all work as expected when numpy is unavailable. In fact, at first pass, maybe they should be implemented assuming numpy is not available. Then we can do benchmarking to see whether numpy makes a measurable performance difference; if so, we can configure the code to use numpy when it is available.

I do NOT think this function should support numpy arrays, at least at first pass. We are still sorting out interoperability with numpy, and I think there's a decent chance we'd do something the wrong way and have to redo it if we tried to include numpy support for these helper functions. In other words: I could tolerate including helper functions to create a UFloat, but not helper functions to create arrays of UFloat or a UArray or something.

Other functions could be moved into utils.py (under whatever name it gets):

  • The functions to create sequences of UFloat from nominal values and a covariance or correlation matrix
  • The functions to calculate the covariance or correlation matrix from a sequence of UFloat
  • The ufloat_from_sample_std_dev function
  • The ufloat_from_sample_std_err function
  • The uncertainty-weighted mean calculation in unumpy.average: init #265
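To make the proposal concrete, the two constructors might be sketched as follows (hypothetical names based on the proposal; the real functions would wrap the result pair in ufloat(...), which is omitted here so the sketch stays standard-library only):

```python
from math import sqrt
from statistics import fmean, pstdev

def ufloat_from_sample_std_dev_args(sample, ddof=0):
    """Return the (nominal_value, std_dev) pair for ufloat():
    the mean and the standard deviation with numpy-style ddof."""
    n = len(sample)
    return fmean(sample), pstdev(sample) * sqrt(n / (n - ddof))

def ufloat_from_sample_std_err_args(sample):
    """Return the (nominal_value, std_dev) pair for ufloat():
    the mean and the standard error on the mean (ddof=1 spread / sqrt(n))."""
    n = len(sample)
    std = pstdev(sample) * sqrt(n / (n - 1))
    return fmean(sample), std / sqrt(n)

print(ufloat_from_sample_std_dev_args([2, 4, 4, 4, 5, 5, 7, 9]))  # (5.0, 2.0)
```

With ddof=0 the first function reproduces the numpy.std default behavior the reviewer asked for, and the second gives the standard-error behavior this PR originally implemented.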

Contributor:

Note that one downside of this proposal is that it misses the "slick" opportunity to include e.g. UFloat.from_sample_std_dev. But this is OK. At this time we don't construct UFloats with the class constructor directly anyway; it is done with the ufloat() helper function. In the future we could consider moving away from ufloat and making UFloat the direct constructor, and at that time we could consider adding helper methods onto the class for alternative construction. But right now, since construction is done through a function, I think it makes sense for alternative construction also to be done through a function.


            if len(numpy.shape(mean_value)) == 0:
                # if the output is a single ufloat
                return ufloat(mean_value, error_on_mean)
            else:
                # if the output is an array of ufloats (duplicate of code
                # from unumpy.core.uarray to avoid a circular import)
                return numpy.vectorize(
                    # ! Looking up uncert_core.Variable beforehand through
                    # '_Variable = uncert_core.Variable' does not result in a
                    # significant speed up:
                    lambda v, s: Variable(v, s),
                    otypes=[object],
                )(mean_value, error_on_mean)
    else:
        msg = "{} is not one of the implemented methods".format(method)
        raise ValueError(msg)

Member:

Maybe just an f-string:

    msg = f"method='{method}' is not implemented"



def ufloat(nominal_value, std_dev=None, tag=None):
"""
Expand Down