Isolating jagged arrays for numpy compatibility #1726

mgjarrett · 2024-06-11T01:16:15Z

What is the change?

Implement a JaggedArray class to interface between our jagged data structures (e.g., pin-level data within a block) and HDF5 database without using numpy to flatten the jagged array.

Why is the change being made?

The current code feeds jagged arrays to numpy and then uses numpy to flatten them for storing in the database and reconstitute them when reading data back from the database. Numpy no longer supports jagged arrays, so we are currently pinned to version 1.23.5 (the latest release is 1.26.4).

The current version of numpy we are using gives a VisibleDeprecationWarning about this:

.\armi\bookkeeping\db\database3.py:1361: VisibleDeprecationWarning: Creating an ndarray from ragged nested
sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If
you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  data = numpy.array(layout.location)[layoutIndicesForType][

This change introduces a JaggedArray class that handles the packing into and unpacking out of 1D data structures for the HDF5 database. This allows us to use a current version of numpy.

There is a discussion about this topic: #1669

Also, I found this open ticket (#246) after having done all of the work in this PR. I didn't realize how much others had already thought about how to work around this. Well, I guess I'm throwing my potential solution in the ring. As was noted many times in that ticket, matching the performance of numpy might be the main challenge. Hopefully, the amount of jagged data we have is small enough that the performance hit isn't too significant.

Notes about numpy compatibility

Using a current version of numpy, if you try to pass jagged data into numpy.array(), you get an error:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6, 7]
>>> c = [0, 0, 1]
>>> import numpy
>>> d = numpy.array((a, c))
>>> e = numpy.array((b, d))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

You can get numpy to accept it if you pass in dtype=object, but this data type doesn't have the features needed to facilitate flattening the array to store in HDF5, i.e., array.flatten() doesn't do anything. So I don't know whether this would help us at all. I'm not 100% positive, but I think we'd still be in the same position of needing to implement the packing and unpacking routines ourselves.

>>> e = numpy.array((b, d), dtype=object)
>>> print(e)
[list([4, 5, 6, 7]) array([[1, 2, 3],
                           [0, 0, 1]])]
>>> e.flatten()
array([list([4, 5, 6, 7]), array([[1, 2, 3],
                                  [0, 0, 1]])], dtype=object)

Checklist

This PR has only one purpose or idea.
Tests have been added/updated to verify any new/changed code.

The code style follows good practices.
The commit message(s) follow good practices.

The release notes have been updated if necessary.
The documentation is still up-to-date in the doc folder.
The dependencies are still up-to-date in pyproject.toml.

This class allows us to circumvent the need to create jagged numpy arrays, which is no longer supported after numpy 1.23.5.

tabulate 0.9.0 crashes when you feed it a non-scalar value for a cell entry.

armi/physics/neutronics/__init__.py

armi/bookkeeping/db/database3.py

john-science · 2024-07-03T12:53:27Z

@mgjarrett I hope you don't mind, I fixed the merge conflicts in this branch, so the CI would run again.

And while I was at it, I moved JaggedArray to it's own file. I can't help myself, sorry!

mgjarrett · 2024-07-03T18:18:05Z

@mgjarrett I hope you don't mind, I fixed the merge conflicts in this branch, so the CI would run again.

And while I was at it, I moved JaggedArray to it's own file. I can't help myself, sorry!

Thank you! I'm hoping to get back to this today, look at your comments and changes and clean up whatever needs to be fixed.

mgjarrett · 2024-07-04T00:17:47Z

I did some simple profiling to make sure there's not an unreasonable increase in run time between the old numpy version and the new implementation. (A small increase is probably acceptable because this will enable us to move to Python 3.11 or 3.12, which is reportedly significantly faster than Python 3.9).

The test case I created for profiling is quite simple, consisting of two separate runs. For the first run:

Load a reactor from inputs
Populate b.p.linPowByPin with random numbers on each block (this is a jagged dataset)
Write the reactor to a database

For the second run:

Load from the database saved in the previous run

The time to write the database is about 11-12 seconds, compared to an overall run time of about 100 s. Most of the database writing is related to writing the layout. The _writeParams portion is less than 5 seconds, and the packSpecialData portion of the database write is less than 1 second in both cases. The database write is slightly faster in the base case (current nala main, numpy 1.23.5), than the new implementation.

Profiling results for writing to the database

	writeToDB (s)	_writeParams (s)	packSpecialData (s)
Current Main	11.5	4.39	0.336
This PR	11.8	4.72	0.503
This PR (numpy 1.23.5)	11.9	4.97	0.559

Profiling results for reading from the database

Again, the new implementation is slightly slower than the old implementation, although not by much (1.4 s vs. 1.5 s). PR#589 is slower across the board, even for parts of the code that are not touched by the JaggedArray update. This difference persisted after multiple runs. It is possibly due to the newer version of numpy (1.26.4) is slower than the older version (1.23.5) on Python 3.9, or something else that has changed in the code. We can't compare on Python 3.11 or 3.12 because numpy 1.23.5 is not compatible. Running the new code with numpy 1.23.5 did slightly improve the run time for unpackSpecialData and for the _readParams.

	load (s)	_readParams (s)	unpackSpecialData (s)
Current Main Run 1	36.3	3.78	1.45
Current Main Run 2	40.9	3.84	1.35
Current Main Run 3	41.6	4.28	1.35
This PR Run 1	40.9	4.04	1.53
This PR Run 2	40.3	4.16	1.66
This PR Run 3	41.6	3.75	1.40
This PR (numpy 1.23.5)	39.2	3.62	1.31

john-science · 2024-07-09T21:31:53Z

@mgjarrett Your timing results are really interesting. I suppose Python 3.9 is the target version now, but 3.11 will be soon.

On the one hand, people might not like the speed decrease, on the other hand we need to be able to unlock modern versions of NumPy and Python. And both of those are blocked by this NumPy issue.

opotowsky · 2024-07-11T04:58:55Z

@mgjarrett Your timing results are really interesting. I suppose Python 3.9 is the target version now, but 3.11 will be soon.

On the one hand, people might not like the speed decrease, on the other hand we need to be able to unlock modern versions of NumPy and Python. And both of those are blocked by this NumPy issue.

The speed decrease is not even noticeable (considering the normal variation in run times) for an actual case. I'm quite pumped about this!

john-science · 2024-07-11T12:41:53Z

@mgjarrett When this PR is ready for review, please move it from Draft to normal. Right now, I can't tell when you think this PR is ready for prime time.

(And again, ARMI discourages opening Draft PRs.)

…Arrays

armi/bookkeeping/db/jaggedArray.py

john-science

Great work!

Thanks so much! I have really been wanting this feature, but couldn't find time on the schedule.

You rock.

Make the new jagged data unpacking routine compatible with the old database format (prior to #1726). Add a unit test to check compatibility.

Empty lists could be stored in the database under the old jagged array implementation. The new implementation in PR #1726 treated all empty arrays as None, and was incapable of reading empty lists from old databases that had them. This PR updates the packing and unpacking routines for special data so that empty lists can be stored and read back from the database.

mgjarrett added 23 commits March 25, 2024 19:29

Implement a JaggedArray class

7befacb

This class allows us to circumvent the need to create jagged numpy arrays, which is no longer supported after numpy 1.23.5.

Remove numpy version pin.

4b48b8a

Fix linting errors.

cc10751

Ignore a linting error

7c6b39c

Actually ignore the error.

c29c5e7

Handle None values in jagged array.

943fe7c

Don't check JaggedArray for NoDefault

3dc3b8b

Implement __contains__ for JaggedArray

1512505

Rearrange ops in packSpecialData

033f2cc

Flatten multi-dimensional lists of iterators.

02ce481

Retain structure of multi-dimensional array if possible

5e690be

Update numpyHackForEqual

06d59f5

Ignore linting error.

4b31b60

Protect numpyHackForEqual with type chekc

640e5b0

Black formatting.

3c25266

Update xsLibrary test.

59ae4ff

Merge branch 'main' into jaggedArrays

b95ec48

Merge branch 'main' into jaggedArrays

b697a6e

Fix a bug in staticmethod

56a8277

Update test_xsLibraries from list to numpy array

d709ff3

Use tuples for location instead of numpy array.

32ad281

update tabulation for tabulate 0.9.0

7f2124a

tabulate 0.9.0 crashes when you feed it a non-scalar value for a cell entry.

Merge branch 'main' into HEAD

2aa0702

mgjarrett self-assigned this Jun 11, 2024

mgjarrett added the enhancement New feature or request label Jun 11, 2024

mgjarrett commented Jun 11, 2024

View reviewed changes

armi/physics/neutronics/__init__.py Show resolved Hide resolved

Bring back floor on numpy version.

d9f11d8

mgjarrett commented Jun 11, 2024

View reviewed changes

armi/bookkeeping/db/database3.py Show resolved Hide resolved

mgjarrett commented Jun 11, 2024

View reviewed changes

armi/bookkeeping/db/database3.py Outdated Show resolved Hide resolved

mgjarrett commented Jun 11, 2024

View reviewed changes

armi/bookkeeping/db/database3.py Show resolved Hide resolved

Adding module-level docstring

110e544

mgjarrett added 3 commits July 3, 2024 11:20

Add some type hints.

e4fbeb8

Update some docstrings.

804bd80

Log an error message if JaggedArray fails.

4ed0a9b

mgjarrett added 3 commits July 8, 2024 14:02

Fix special data packing.

33b9937

Improve efficiency of special data unpacking

d76eea3

Merge branch 'main' into jaggedarrays

1784272

Remove change in numpy version

205416d

mgjarrett marked this pull request as ready for review July 11, 2024 18:08

mgjarrett added 2 commits July 11, 2024 11:13

Remove the numpy ceiling.

3c8cbff

Update release notes.

15cc36e

john-science self-requested a review July 11, 2024 18:18

Merge branch 'main' of https://github.com/terrapower/armi into jagged…

b51657d

…Arrays

john-science reviewed Jul 11, 2024

View reviewed changes

armi/bookkeeping/db/jaggedArray.py Show resolved Hide resolved

mgjarrett added 3 commits July 11, 2024 15:08

Add unit tests for jaggedArray

e92e31b

Add unit test for JaggedArray.flatten()

399f0b2

Fix linting errors.

0b9bd13

john-science self-requested a review July 11, 2024 22:27

john-science approved these changes Jul 11, 2024

View reviewed changes

mgjarrett merged commit aa758f3 into main Jul 11, 2024
19 checks passed

mgjarrett deleted the jaggedArrays branch July 11, 2024 22:50

mgjarrett mentioned this pull request Jul 12, 2024

Make new jaggedArray backwards compatible #1778

Merged

7 tasks

mgjarrett added a commit that referenced this pull request Jul 12, 2024

Make new jaggedArray backwards compatible (#1778)

b3986c1

Make the new jagged data unpacking routine compatible with the old database format (prior to #1726). Add a unit test to check compatibility.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Isolating jagged arrays for numpy compatibility #1726

Isolating jagged arrays for numpy compatibility #1726

mgjarrett commented Jun 11, 2024 •

edited

Loading

john-science commented Jul 3, 2024

mgjarrett commented Jul 3, 2024

mgjarrett commented Jul 4, 2024 •

edited

Loading

john-science commented Jul 9, 2024

opotowsky commented Jul 11, 2024

john-science commented Jul 11, 2024

john-science left a comment

Isolating jagged arrays for numpy compatibility #1726

Isolating jagged arrays for numpy compatibility #1726

Conversation

mgjarrett commented Jun 11, 2024 • edited Loading

What is the change?

Why is the change being made?

Notes about numpy compatibility

Checklist

john-science commented Jul 3, 2024

mgjarrett commented Jul 3, 2024

mgjarrett commented Jul 4, 2024 • edited Loading

Profiling results for writing to the database

Profiling results for reading from the database

john-science commented Jul 9, 2024

opotowsky commented Jul 11, 2024

john-science commented Jul 11, 2024

john-science left a comment

Choose a reason for hiding this comment

mgjarrett commented Jun 11, 2024 •

edited

Loading

mgjarrett commented Jul 4, 2024 •

edited

Loading