Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eliminate the Variable class #262

Open
wants to merge 53 commits into
base: master
Choose a base branch
from

Conversation

jagerber48
Copy link
Contributor

  • Closes # (insert issue number)
  • Executed pre-commit run --all-files with no errors
  • The change is fully covered by automated unit tests
  • Documented in docs/ as appropriate
  • Added an entry to the CHANGES file

This PR:

  • Eliminates the Variable class and eliminates the AffineScalarFunc.derivatives property.
  • The AffineScalarFunc class is renamed to UFloat with AffineScalarFunc left as a legacy name.
  • The LinearCombo class is refactored into the UCombo class but the architectural role is similar: track the uncertainties of UFloat objects as a lazily expanded combination of objects with uncertainty
  • The UAtom object is introduced as the fundamental element which captures uncertainty. Each UAtom models a random variable with zero mean and unity standard deviation/variance. Each UAtom has a uuid and all UAtom are mutually statistically independent. The expanded version of the UCombo maps dict[UAtom, float].

For more detail/discussion see #251 (comment)

Note that as of the time opening this PR the PR is still a work in progress. Not all elements in the bullet list above are implemented and a good handful of tests are still failing.

@jagerber48 jagerber48 changed the title Feature/linear combo refactor Eliminate the Variable class Aug 12, 2024
@@ -104,9 +105,6 @@ def test_ufloat_fromstr():
"3.4(nan)e10": (3.4e10, float("nan")),
# NaN value:
"nan+/-3.14e2": (float("nan"), 314),
# "Double-floats"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These floats are too large to work with. Prior to this PR the large float value could be saved into the AffineScalarFunc object and displayed, but any arithmetic with such large values would result in an OverflowError. In the new framework even displaying a UFloat the first time requires a calculation, even if it's as simple as sqrt(x**2). So in the new framework the OverflowError appears immediately.

@jagerber48 jagerber48 mentioned this pull request Dec 7, 2024
@jagerber48
Copy link
Contributor Author

All the tests are passing except some having to do with the uarray inverse and pseudo inverse functions. The code for this is pretty complicated. It might make sense to just rewrite the code to convert array-compatible functions to ufloat array compatible functions.

Copy link

codecov bot commented Dec 12, 2024

Codecov Report

Attention: Patch coverage is 88.03922% with 61 lines in your changes missing coverage. Please review.

Project coverage is 93.62%. Comparing base (61f688f) to head (9607a0d).

Files with missing lines Patch % Lines
uncertainties/unumpy/core.py 78.41% 30 Missing ⚠️
uncertainties/ucombo.py 73.95% 25 Missing ⚠️
uncertainties/core.py 92.15% 4 Missing ⚠️
tests/helpers.py 97.22% 1 Missing ⚠️
uncertainties/ops.py 92.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #262      +/-   ##
==========================================
- Coverage   96.55%   93.62%   -2.94%     
==========================================
  Files          17       18       +1     
  Lines        1947     2039      +92     
==========================================
+ Hits         1880     1909      +29     
- Misses         67      130      +63     
Flag Coverage Δ
macos-latest-3.10 92.25% <88.03%> (-2.77%) ⬇️
macos-latest-3.11 92.25% <88.03%> (-2.77%) ⬇️
macos-latest-3.12 92.25% <88.03%> (-2.77%) ⬇️
macos-latest-3.8 92.23% <88.01%> (-2.78%) ⬇️
macos-latest-3.9 92.23% <88.01%> (-2.78%) ⬇️
no-numpy 72.24% <63.33%> (-2.90%) ⬇️
ubuntu-latest-3.10 92.25% <88.03%> (-2.77%) ⬇️
ubuntu-latest-3.11 92.25% <88.03%> (-2.77%) ⬇️
ubuntu-latest-3.12 92.25% <88.03%> (-2.77%) ⬇️
ubuntu-latest-3.8 92.23% <88.01%> (-2.78%) ⬇️
ubuntu-latest-3.9 92.23% <88.01%> (-2.78%) ⬇️
windows-latest-3.10 92.25% <88.03%> (-2.77%) ⬇️
windows-latest-3.11 92.25% <88.03%> (-2.77%) ⬇️
windows-latest-3.12 92.25% <88.03%> (-2.77%) ⬇️
windows-latest-3.8 92.23% <88.01%> (-2.78%) ⬇️
windows-latest-3.9 92.23% <88.01%> (-2.78%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jagerber48
Copy link
Contributor Author

jagerber48 commented Dec 13, 2024

Wow all tests are passing now. That's a big milestone for this PR. If folks are interested I'd be curious to hear general feedback on the source code changes in this PR. I tried to do as little wholesale rewriting as possible but did chose to do some of that in unumpy/core.py.

Here's a summary of the changes so far. See #251 for a more thorough discussion of these changes and their motivations.

  • Remove the Variable class.
  • Remove the LinearCombination class.
  • Add the UAtom class. A UAtom models a standard normally distributed random variable (zero mean, unity variance).
  • Add the UCombo class. This class essentially maps UAtoms to float. This class models a linear combination of UAtoms. Exposes a method to extract its own standard deviation and its covariance with other UCombos. This class supports multiplication by floats/ints and additions with other UCombo's. This simple arithmetic support makes uncertainty propagation calculations nice.
  • Swap the names UFloat and AffineScalarFunc. Now UFloat defines the actual class and AffineScalarFunc is an alias (perhaps it could be deprecated later).
  • The UFloat class now stores uncertainty as a UCombo rather than a LinearCombination. Previously uncertainty was roughly recorded as the derivatives of an AffineScalarFunc with respect to Variables. This information was stored and accessed in the derivatives attribute. Now there is no longer a derivatives attribute. There is only error_components which now returns a mapping from UAtom to float indicating the weighting of each UAtom within a UFloat. This is the biggest user-facing conceptual change in this PR.
  • It was relatively straightforward to give UFloat a simple hash. This will resolve Pandas sort_values with multiple columns does not work for AffineScalarFunc #186 and Should `Variable.__eq__` be based on `id()` or a separate identifier? #218 and Implement hash for AffineScalarFunc #219. I'll point out that these conversations were major concrete motivators for this refactor (together with a less concrete desire to simplify the codebase)
  • Previously AffineScalarFunc and Variable could be tagged with a user-readable tag. Now UFloats cannot be tagged, but UAtoms can.
  • Because of the previous architecture there were some confusing behaviors with respect to copying/pickling. In the new framework I think copying and pickling are much more straightforward. The behaviors are different than before, but I argue they are better. You can look at how the tests changed for more details onthis. See also Strange behavior on copying #260.
  • The code in unumpy/core.py responsible for extending np.linalg.inv and np.linalg.pinv to operate sensibly on umatrix or arrays of UFloat were rewritten. The old code (1) relied centrally on AffineScalarFunc.derivatives and (2) was quite confusing to understand. In this case I found it easier to rewrite the code than to understand and modify the old code. The old code could only uncertainty-ify functions that took one array as an argument. The new code detect UFloats or arrays containing UFloat's in any argument. By default it uses numerical differentiation, but the function can be passed functions to calculate analytical derivatives.

Some todo items

  • Check that performance isn't impacted
  • Multiple passes of clean up on the code
  • Some passes on the source code for the tests. Many tests were changed, some dramatically. Some tests were added. I think the tests as they are now are probably ok. I think some checks that this PR can go through would be good, but a more thorough addressing of tests should be done in support of resolving Clean up tests #273 in later PRs.
  • Correct and clean up comments within the source code. The legacy uncertainties code has a lot of comments and documentation. This PR is a part of a thrust to help simplify the source code so that it is a bit more self-documenting. With this, in the places touched by this PR, I want to slim down any comments, or at the very least make sure they are accurate. Especially with the removal of the derivatives attribute, I think there may be comments in lots of places that are no longer accurate. There are probably places I could add some comments.
  • Update the documentation. I haven't touched the docs documentation at all. I think there will need to be a lot of updating there. Off the top of my head I know the docs make reference to the "derivatives of variables" philosophy that is being replaced with the "linear combination of random variables" philosophy. I also know there will be some docs about copying/pickling that need to be updated, and of course docs on derivatives and error_components will need to be updated.
  • Bikeshed the naming of the UFloat, UCombo, and UAtom classes.
  • Write the changelog entries

Other notes:

  • COMMENT FOR REVIEWERS: Obviously I would appreciate wholistic reviews, But I'm especially curious for reviews on changes to the tests that start to illustrate how the public API is changing.
  • See this change. The to_uarray_func conversion function assumes inputs are numpy arrays and doesn't "respect" numpy matrices. I made this choice because of the comment on numpy.matrix documentation discouraging their use. See 4.x Release plans #244 (comment). It seems like the maintainers are in some agreement that we can drop matrix support, so maybe consider this a step in that direction.
  • I'll update this comment or later ones as I think of more changes or todos. I'll likely use this comment as a start for the changelog.

This test wasn't running because the generator defining ufloat_args was consumed while calculating output so it couldn't be iterated over again for the for loop. Need to save the result as a list before using it.
These functions need to be overhauled so for now trying to make minimal changes while keeping functionality to keep the PR as slim as possibly (though it's already pretty huge)
return self._hash

def __lt__(self: UAtom, other: UAtom) -> bool:
return self.uuid < other.uuid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codecov gives a load of warnings for UAtom and UCombo. Would bit be worth having tests to show methods like eq, lt behave as expected?

it's also a bit suprising some of these lines (eg for UCombo's eq) don't get hit by any of the existing tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. I haven't looked at codecov yet. The story for why UCombos __eq__ doesn't get hit is worth sharing.

First off, before this PR and as of this comment, UFloat.__eq__ does not check for equality of the nominal value and equality of the uncertainty as I would have naively implemented. Rather, it looks at the difference between self and other and checks that the resulting UFloat has zero nominal value and zero std_dev. Because subtraction is supported between UFloat and float, this allows __eq__ to easily support comparison between UFloat and float. In going through the tests I've learned that, for better or for worse, uncertainties has definitely chosen the convention that a UFloat with a std_dev of zero should behave like a regular float.

I am still using this difference calculation for UFloat.__eq__. It would be more natural to me to have __eq__ be something like

def __eq__(self, other):
    if isinstance(other, UFloat):
        return (self.nominal_value == other.nominal_value and self.uncertainty = other.uncertainty)
    else:
        if self.std_dev != 0:
            return False
        else:
            return self.nominal_value == other

But currently the code implementing __eq__ is defined in ops.py and gets monkey patched onto UFloat. I can't import UFloat into this function because it would create circular imports between ops.py and core.py.

Anyways, I can look into disentangling this.


Note though, that I tried something like the above code for a second and found out that the current version of UCombo.__eq__ was broken because I didn't know sorted returned a list! Easy fix, but emphasizes the importance of testing.

Note that I made the decision that UCombo __eq__ and __hash__ use the expanded dictionary for comparison. Thuis means that two UCombos with different ucombo_tuples could compare equal. For UFloat I think it is clear that we want two UFloats to compare based on the expanded dictionary from the uncertainty. For UCombo it wasn't as obvious to me. For example, if dx is a UAtom then it would be possible for there to be a nested UCombo storing something like

dx + (-dx + (dx + (-dx + (dx + (-dx + ...)))))

Which ultimately is equal to 0. But the former might take a longer time to compute.

Actually I think I need to add a special case for 0 so that the presence of a UAtom in one UCombo expanded dict but not the other doesn't matter if the weight is equal to 0.

In any case, does this make sense to allow two UCombo to compare the same even if they have different ucombo_tuple? Or should that distinction be made apparent in __eq__ and __hash__ (but UFloat will still compare based on expanded dict)?

sorted returns a (unhashable)  list, so much convert this list to a tuple before hashing.
jagerber48 added a commit that referenced this pull request Dec 28, 2024
- [x] Closes #274 
- [x] Executed `pre-commit run --all-files` with no errors
- [x] The change is fully covered by automated unit tests
- [x] Documented in docs/ as appropriate
- [x] Added an entry to the CHANGES file

add a performance benchmark test. This test is important especially to
ensure #262 doesn't introduce a performance regression.

---------

Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>
Copy link

codspeed-hq bot commented Dec 28, 2024

CodSpeed Performance Report

Merging #262 will degrade performances by 35.86%

Comparing jagerber48:feature/linear_combo_refactor (9607a0d) with master (61f688f)

Summary

❌ 5 regressions

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark master jagerber48:feature/linear_combo_refactor Change
test_repeated_summation_speed[100000] 2.9 s 4.3 s -33.86%
test_repeated_summation_speed[10000] 274.9 ms 424.7 ms -35.26%
test_repeated_summation_speed[1000] 27 ms 42.1 ms -35.86%
test_repeated_summation_speed[100] 2.8 ms 4.3 ms -34.39%
test_repeated_summation_speed[10] 401.2 µs 562.5 µs -28.68%

@jagerber48
Copy link
Contributor Author

Ok, it is time to pick the names of the classes because I want to start updating the docs (I'm not technically blocked but would be nice to get it right earlier). Right now we have UFloat, UComboandUAtom`. These classes are modeling the following.

For reference, suppose we have

x = x_0 + dx
y = y_0 + dy
z = x*y = x_0 * y_0 + (y_0 * dx + x_0 * dy)

dx and dy are UAtoms (individual random elements with zero mean and unity standard deviation/variance), 1*dx, 1*dy, and y_0*dx + x_0*dy are UCombos (float-weighted linear combinations of UAtoms) and x, y, and z are UFloats (nominal value + uncertainty given as a UCombo).

Notes on names:

  • UFloat: I think we're all in agreement that this name is fine. AffineScalarFunc will remain a legacy alias that I suggest we add a FutureWarning too in an upcoming release (we shouldn't do that in this PR however since I don't want to get bogged down in that discussion yet. For the record, I will say there has been discussion about the name UNumber for something that would support more types of nominal values and uncertainties than just floats. E.g. Decimal or possibly complex numbers or something. But given UFloat is already in use I say it makes sense to keep it.
  • UCombo: This class roughly replaces the old LinearCombination class. I don't have strong opinions on this name. LinearCombination would be fine, but I would like to somehow emphasize that is a linear combination of zero-mean random variables which is not captured by just LinearCombination. Other obvious options: ULinCombo, ULinearCombo, ULinearCombination. The same options without a U. Again, my opinions aren't strong here. If I don't hear otherwise I will keep UCombo.
  • UAtom. I think this is the most controversial. UAtom kind of serves to replace the old Variable class, but it is much simpler than the Variable class and this reduction in complexity is what drives the overall improvement in the architectures (IMO). I chose the name UAtom because it is an atomic/irreducible/indivisible unit (this is what is meant by "atom") of uncertainty. A UAtom has zero mean, unity variance, and is not correlated with any other UAtom. This is in contrast with the UCombo which is a combination of many UAtom. However, it's been pointed out that users outside of the physical sciences may be less familiar with the term "Atom" and what it might mean here. Open to proposals for alternative names. I don't have good ideas. UnitUncertainty, StdNormUnit (standing for "standard normal unit" referring to the standard normal distribution), RandomUnit, URandom. I prefer UAtom to all of these but open to other proposals and discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants