Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Isolating jagged arrays for numpy compatibility #1726

Merged
merged 44 commits into from
Jul 11, 2024
Merged

Isolating jagged arrays for numpy compatibility #1726

merged 44 commits into from
Jul 11, 2024

Conversation

mgjarrett
Copy link
Contributor

@mgjarrett mgjarrett commented Jun 11, 2024

What is the change?

Implement a JaggedArray class to interface between our jagged data structures (e.g., pin-level data within a block) and HDF5 database without using numpy to flatten the jagged array.

Why is the change being made?

The current code feeds jagged arrays to numpy and then uses numpy to flatten them for storing in the database and reconstitute them when reading data back from the database. Numpy no longer supports jagged arrays, so we are currently pinned to version 1.23.5 (the latest release is 1.26.4).

The current version of numpy we are using gives a VisibleDeprecationWarning about this:

.\armi\bookkeeping\db\database3.py:1361: VisibleDeprecationWarning: Creating an ndarray from ragged nested
sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If
you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  data = numpy.array(layout.location)[layoutIndicesForType][

This change introduces a JaggedArray class that handles the packing into and unpacking out of 1D data structures for the HDF5 database. This allows us to use a current version of numpy.

There is a discussion about this topic: #1669

Also, I found this open ticket (#246) after having done all of the work in this PR. I didn't realize how much others had already thought about how to work around this. Well, I guess I'm throwing my potential solution in the ring. As was noted many times in that ticket, matching the performance of numpy might be the main challenge. Hopefully, the amount of jagged data we have is small enough that the performance hit isn't too significant.

Notes about numpy compatibility

Using a current version of numpy, if you try to pass jagged data into numpy.array(), you get an error:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6, 7]
>>> c = [0, 0, 1]
>>> import numpy
>>> d = numpy.array((a, c))
>>> e = numpy.array((b, d))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

You can get numpy to accept it if you pass in dtype=object, but this data type doesn't have the features needed to facilitate flattening the array to store in HDF5, i.e., array.flatten() doesn't do anything. So I don't know whether this would help us at all. I'm not 100% positive, but I think we'd still be in the same position of needing to implement the packing and unpacking routines ourselves.

>>> e = numpy.array((b, d), dtype=object)
>>> print(e)
[list([4, 5, 6, 7]) array([[1, 2, 3],
                           [0, 0, 1]])]
>>> e.flatten()
array([list([4, 5, 6, 7]), array([[1, 2, 3],
                                  [0, 0, 1]])], dtype=object)

Checklist

  • The release notes have been updated if necessary.
  • The documentation is still up-to-date in the doc folder.
  • The dependencies are still up-to-date in pyproject.toml.

@mgjarrett mgjarrett self-assigned this Jun 11, 2024
@mgjarrett mgjarrett added the enhancement New feature or request label Jun 11, 2024
@john-science
Copy link
Member

@mgjarrett I hope you don't mind, I fixed the merge conflicts in this branch, so the CI would run again.

And while I was at it, I moved JaggedArray to it's own file. I can't help myself, sorry!

@mgjarrett
Copy link
Contributor Author

@mgjarrett I hope you don't mind, I fixed the merge conflicts in this branch, so the CI would run again.

And while I was at it, I moved JaggedArray to it's own file. I can't help myself, sorry!

Thank you! I'm hoping to get back to this today, look at your comments and changes and clean up whatever needs to be fixed.

@mgjarrett
Copy link
Contributor Author

mgjarrett commented Jul 4, 2024

I did some simple profiling to make sure there's not an unreasonable increase in run time between the old numpy version and the new implementation. (A small increase is probably acceptable because this will enable us to move to Python 3.11 or 3.12, which is reportedly significantly faster than Python 3.9).

The test case I created for profiling is quite simple, consisting of two separate runs. For the first run:

  1. Load a reactor from inputs
  2. Populate b.p.linPowByPin with random numbers on each block (this is a jagged dataset)
  3. Write the reactor to a database

For the second run:

  1. Load from the database saved in the previous run

The time to write the database is about 11-12 seconds, compared to an overall run time of about 100 s. Most of the database writing is related to writing the layout. The _writeParams portion is less than 5 seconds, and the packSpecialData portion of the database write is less than 1 second in both cases. The database write is slightly faster in the base case (current nala main, numpy 1.23.5), than the new implementation.

Profiling results for writing to the database

  writeToDB (s) _writeParams (s) packSpecialData (s)
Current Main 11.5 4.39 0.336
This PR 11.8 4.72 0.503
This PR (numpy 1.23.5) 11.9 4.97 0.559

Profiling results for reading from the database

Again, the new implementation is slightly slower than the old implementation, although not by much (1.4 s vs. 1.5 s). PR#589 is slower across the board, even for parts of the code that are not touched by the JaggedArray update. This difference persisted after multiple runs. It is possibly due to the newer version of numpy (1.26.4) is slower than the older version (1.23.5) on Python 3.9, or something else that has changed in the code. We can't compare on Python 3.11 or 3.12 because numpy 1.23.5 is not compatible. Running the new code with numpy 1.23.5 did slightly improve the run time for unpackSpecialData and for the _readParams.

  load (s) _readParams (s) unpackSpecialData (s)
Current Main Run 1 36.3 3.78 1.45
Current Main Run 2 40.9 3.84 1.35
Current Main Run 3 41.6 4.28 1.35
This PR Run 1 40.9 4.04 1.53
This PR Run 2 40.3 4.16 1.66
This PR Run 3 41.6 3.75 1.40
This PR (numpy 1.23.5) 39.2 3.62 1.31

@john-science
Copy link
Member

@mgjarrett Your timing results are really interesting. I suppose Python 3.9 is the target version now, but 3.11 will be soon.

On the one hand, people might not like the speed decrease, on the other hand we need to be able to unlock modern versions of NumPy and Python. And both of those are blocked by this NumPy issue.

@opotowsky
Copy link
Member

@mgjarrett Your timing results are really interesting. I suppose Python 3.9 is the target version now, but 3.11 will be soon.

On the one hand, people might not like the speed decrease, on the other hand we need to be able to unlock modern versions of NumPy and Python. And both of those are blocked by this NumPy issue.

The speed decrease is not even noticeable (considering the normal variation in run times) for an actual case. I'm quite pumped about this!

@john-science
Copy link
Member

@mgjarrett When this PR is ready for review, please move it from Draft to normal. Right now, I can't tell when you think this PR is ready for prime time.

(And again, ARMI discourages opening Draft PRs.)

@mgjarrett mgjarrett marked this pull request as ready for review July 11, 2024 18:08
@john-science john-science self-requested a review July 11, 2024 18:18
@john-science john-science self-requested a review July 11, 2024 22:27
Copy link
Member

@john-science john-science left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

Thanks so much! I have really been wanting this feature, but couldn't find time on the schedule.

You rock.

@mgjarrett mgjarrett merged commit aa758f3 into main Jul 11, 2024
19 checks passed
@mgjarrett mgjarrett deleted the jaggedArrays branch July 11, 2024 22:50
mgjarrett added a commit that referenced this pull request Jul 12, 2024
Make the new jagged data unpacking routine compatible with the old database format (prior to #1726). Add a unit test to check compatibility.
mgjarrett added a commit that referenced this pull request Sep 27, 2024
Empty lists could be stored in the database under the old
jagged array implementation. The new implementation in
PR #1726 treated all empty arrays as None, and was incapable
of reading empty lists from old databases that had them.

This PR updates the packing and unpacking routines for special data
so that empty lists can be stored and read back from the database.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove jagged numpy arrays
3 participants