Added ufloat_from_sample function #277
base: master
@@ -17,6 +17,9 @@
from builtins import str, zip, range, object
from math import sqrt, isfinite  # Optimization: no attribute look-up

from statistics import mean as stats_mean
from statistics import stdev as stats_stdev

import copy
import collections
@@ -71,7 +74,6 @@ | |
except ImportError: | ||
numpy = None | ||
|
||
|
||
def correlated_values(nom_values, covariance_mat, tags=None): | ||
""" | ||
Return numbers with uncertainties (AffineScalarFunc objects) | ||
|
@@ -1004,6 +1006,76 @@ def ufloat_fromstr(representation, tag=None):
    (nom, std) = str_to_number_with_uncert(representation.strip())
    return ufloat(nom, std, tag)


def ufloat_from_sample(sample, method="gaussian", axis=None):
    '''
    Converts a collection of values into a ufloat.

    Arguments:
    ----------
    sample: list or numpy array of numbers
        The sample of values.

    method: optional string
        The method used to calculate the ufloat. Currently, only
        the 'gaussian' method is implemented.

        gaussian: The nominal value is the mean of the sample.
            The standard deviation is the error on the mean. This
            method assumes that the sample follows a Gaussian
            distribution, and works best for large samples. It
            works well for estimating a fixed value that has been
            measured multiple times with some random error.

    axis: integer or None
        Only used when the sample is a numpy array. The axis along
        which the ufloats are computed. If None (the default),
        the sample is the whole flattened array.
    '''

    if method == "gaussian":

        if numpy is None:
            # If numpy is not present, use Python's statistics
            # functions instead.
            mean_value = stats_mean(sample)
            error_on_mean = stats_stdev(sample) / sqrt(len(sample) - 1)

            return ufloat(mean_value, error_on_mean)

        else:
            # If numpy is present, use the faster numpy functions,
            # which can handle a wider range of inputs.
            mean_value = numpy.mean(sample, axis=axis)

            # The size of each sample being collected:
            if axis is None:
                sample_size = numpy.size(sample)
            else:
                sample_size = numpy.shape(sample)[axis]

            error_on_mean = numpy.std(sample, ddof=1, axis=axis) / numpy.sqrt(sample_size)
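To pin down the quantities discussed in the review below, the value this branch computes (a standard error of the mean) can be compared against the two standard deviations using only the standard library. This is a sketch with illustrative numbers, not code from the PR:

```python
from math import sqrt
from statistics import pstdev, stdev

# Illustrative sample whose population standard deviation is exactly 2;
# not data from the PR.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

std_plain = pstdev(sample)   # divide by n (numpy.std's default, ddof=0)
std_bessel = stdev(sample)   # divide by n - 1 (Bessel's correction, ddof=1)
sem = std_bessel / sqrt(n)   # standard error of the mean

print(std_plain)   # 2.0
print(std_bessel)  # ~2.138
print(sem)         # ~0.756
```

The three numbers answer different questions: the first two describe the spread of the values, while the third describes how well the mean is known.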
I think the plain standard deviation (ddof=0) should be used here.

Sorry, I don't understand what you mean. The error on the mean is the variance of the sample divided by the square root of the sample size.

Sorry, I meant ddof should default to 0, as in numpy.std(). Dividing by sqrt(len(samples)) gives the standard deviation of the mean, not the standard deviation of the value. For propagating uncertainties, you (usually, normally?) want the standard deviation of the value itself. I think two good rules of thumb are "what does Wikipedia say?" and "what does numpy do?", and numpy's default agrees with Wikipedia here. I would be reluctant to have the function return anything except the plain standard deviation. An option to divide by sqrt(n) could be added.

Thank you, I understand now. I originally intended the function to return the mean and the standard error on the mean. This is because if I am measuring a constant value where my measurement apparatus has some random error, then my measurements are a sample from a normal distribution, where the best estimate for the value is the mean of the sample and the uncertainty is the uncertainty on the mean. Perhaps it could be a method "from measurement". I think this would be clearer anyway.

@Myles244 if your samples are from a normal distribution, and you are measuring noisy values from an experimental measurement, then I think you do want to use the mean and the standard deviation (as from numpy.mean() and numpy.std()). Those give the average (and, for normal distributions, most likely) value and the appropriate measure of the variation observed. The standard deviation of the mean is a useful quantity, but I think it is not what you want here.

Let's take the mass of Christmas trees (and, just to keep things jolly, let's pretend those masses have Gaussian distributions), where the total population in the survey is 1 million trees. A random sample of 100 would be small, but would give a mean and standard deviation, and not be terrible. If you selected a different 100 samples, you would get a mean and standard deviation pretty consistent with the first sample of 100. If you increased the sample size to 1000, you would not expect the mean or the standard deviation to change by much from the sample of 100.
But if you took a second sample of 1000, you would expect that mean value to be closer to the mean of the first 1000 than was the case for the two samples of 100.

Another classic way to think about it is with test scores across many classrooms: give the same test to all students in 10 classrooms of 25 students each. The standard deviation of the 250 scores tells you the variation between students. The standard deviation of the mean of each classroom tells you the variation between classrooms. One is a measure of the students; the other is a measure of the teachers (or of a bias in how students are assigned to classrooms, so the administration, perhaps) ;).

If you are measuring voltages and see a distribution [2, 4, 4, 4, 5, 5, 7, 9], the standard deviation is 2. If you repeat the measurements 100 times and the values are [[2, 4, 4, 4, 5, 5, 7, 9]]*100, the standard deviation is still 2. On the other hand, if those are 100 different samples of a set of 8 different, possibly related, voltages, then the standard deviation of the mean of those 100 observations of 8 voltages is much smaller than 2: the mean value is very consistent (across those 100 "classrooms"). But to know the difference, I think you would need to provide the 100 means and standard deviations.

I think the most natural and common use case would be to have a bunch of samples of a single quantity, in which case the standard deviation is wanted. An option to divide by sqrt(n) could be added.

@jagerber48 I think I have to agree with you; clearly, the function is ambiguous, and perhaps it would be better not to include it. Thanks anyway. P.S. Should I close the issue or leave it open?
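The voltage example above can be checked mechanically: repeating the same readings leaves the spread of the values untouched, while the standard error of the mean keeps shrinking. A sketch, not code from the PR:

```python
from math import sqrt
from statistics import pstdev

base = [2, 4, 4, 4, 5, 5, 7, 9]
repeated = base * 100  # the same 8 readings observed 100 times over

# The spread of the values is unchanged by repetition:
spread_base = pstdev(base)          # 2.0
spread_repeated = pstdev(repeated)  # still 2.0

# ...but the standard error of the mean shrinks as 1/sqrt(n):
sem_base = spread_base / sqrt(len(base))              # ~0.707
sem_repeated = spread_repeated / sqrt(len(repeated))  # ~0.0707

print(spread_base, spread_repeated, sem_base, sem_repeated)
```

This is exactly the "100 classrooms" distinction: the two quantities diverge precisely when the sample grows.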
I think you can leave it open for now so we can continue to consider it for at least a few days. While I like the idea of having this as a constructor on […], we could include e.g. those two constructor functions on […].

Yes, please leave this open. The Wikipedia page mentions the ambiguity, so I agree with having explicit names like […].

@andrewgsavage to be clear, that snippet is from the Wikipedia page on standard deviation and is referring to the distinction between the sample standard deviation and an unbiased estimator for the population/distribution standard deviation. Neither of these is the standard error on the mean, which is the uncertainty of an unbiased estimator of the population mean calculated using the sample mean. So there are three possible options for what the function could return.

I agree that these could be selected by using different functions or a required keyword argument. If we want this stuff to be in […], I express a preference for following […]. These functions should all work as expected when […].

I do NOT think this function should support numpy arrays, at least at first pass. We are still sorting out interoperability with numpy. I think there's a decent chance we'll do something the wrong way and have to redo it if we try to include numpy support for these helper functions. In other words: I could tolerate including helper functions to create […]. Other functions can be moved into […].

Note that one downside of this proposal is that it misses the "slick" opportunity to include e.g. […].
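One way to make the three candidate meanings explicit, in the spirit of the naming discussion above, is three separate helpers. This is a sketch only; the function names are hypothetical and not from the PR:

```python
from math import sqrt
from statistics import pstdev, stdev

def sample_spread_population(sample):
    # Option 1: plain standard deviation, dividing by n
    # (numpy.std's default, ddof=0).
    return pstdev(sample)

def sample_spread_bessel(sample):
    # Option 2: Bessel-corrected sample standard deviation,
    # dividing by n - 1 (numpy.std with ddof=1).
    return stdev(sample)

def sample_error_on_mean(sample):
    # Option 3: standard error of the mean, the uncertainty of the
    # sample mean as an estimator of the population mean.
    return stdev(sample) / sqrt(len(sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_spread_population(data),
      sample_spread_bessel(data),
      sample_error_on_mean(data))
```

With explicit names like these, no keyword argument is needed and a caller can never pick the wrong statistic by accident.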
            if len(numpy.shape(mean_value)) == 0:
                # If the output is a single ufloat:
                return ufloat(mean_value, error_on_mean)
            else:
                # If the output is an array of ufloats (duplicate of code
                # from unumpy.core.uarray to avoid a circular import):
                return numpy.vectorize(
                    # ! Looking up uncert_core.Variable beforehand through
                    # '_Variable = uncert_core.Variable' does not result in a
                    # significant speed up:
                    lambda v, s: Variable(v, s),
                    otypes=[object],
                )(mean_value, error_on_mean)
    else:
        msg = "{} is not one of the implemented methods".format(method)
        raise ValueError(msg)

Maybe just an f-string: msg = f"{method} is not one of the implemented methods".
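The non-numpy branch above can be exercised standalone. Illustrative data only; note that statistics.stdev already applies Bessel's correction, which is the point raised later in this review:

```python
from math import sqrt
from statistics import mean, stdev

sample = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]  # illustrative readings
n = len(sample)

nominal = mean(sample)

# Error on the mean exactly as written in the diff's non-numpy branch:
pr_error = stdev(sample) / sqrt(n - 1)

# stdev() already divides the squared deviations by n - 1, so the usual
# standard error of the mean divides by sqrt(n) instead:
sem = stdev(sample) / sqrt(n)

print(nominal, pr_error, sem)
```

The two error values differ by a factor of sqrt(n / (n - 1)), which is small for large samples but noticeable for a handful of readings.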
def ufloat(nominal_value, std_dev=None, tag=None):
    """
I believe statistics.stdev already divides by sqrt(len(samples)-1), and it should not be done again here. But also, I think we will need to discuss/document this, and perhaps have an option for using the "-1" ("Bessel's -1"), perhaps adding a ddof argument (default=0). I think the default should be to match numpy.std(), which divides the variance by sqrt(len(samples)).

Yes, you are right, I mistakenly assumed that statistics.stdev behaved the same as numpy.std does without any optional arguments. Adding ddof=0 as an optional parameter should be simple. For the non-numpy code I can use statistics.pstdev and then include "-ddof" in the denominator.
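The statistics.pstdev approach mentioned here could look roughly like this. A sketch only; stdev_with_ddof is a hypothetical helper, not code from the PR:

```python
from math import sqrt
from statistics import pvariance

def stdev_with_ddof(sample, ddof=0):
    """Numpy-style ddof using only the standard library: pvariance()
    divides the squared deviations by n, so rescaling its result by
    n / (n - ddof) reproduces an (n - ddof) denominator."""
    n = len(sample)
    if n - ddof <= 0:
        raise ValueError("sample too small for ddof={}".format(ddof))
    return sqrt(pvariance(sample) * n / (n - ddof))
```

With ddof=0 this matches statistics.pstdev (and numpy.std's default); with ddof=1 it matches statistics.stdev (and numpy.std with ddof=1).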