Isolating jagged arrays for numpy compatibility #1726
Conversation
This class allows us to circumvent the need to create jagged numpy arrays, which are no longer supported after numpy 1.23.5.
tabulate 0.9.0 crashes when you feed it a non-scalar value for a cell entry.
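For context, a tiny illustration of the kind of input being described here (the values are made up); per this comment, a non-scalar cell like the array below makes tabulate 0.9.0 crash:

```python
import numpy as np
from tabulate import tabulate

# One cell holds a non-scalar value (a small array) instead of a scalar.
rows = [["block A", 1.0], ["block B", np.array([1.0, 2.0, 3.0])]]
print(tabulate(rows, headers=["name", "param"]))
```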
@mgjarrett I hope you don't mind: I fixed the merge conflicts in this branch so the CI would run again. And while I was at it, I moved
Thank you! I'm hoping to get back to this today to look at your comments and changes and clean up whatever needs to be fixed.
I did some simple profiling to make sure there's not an unreasonable increase in run time between the old numpy-based implementation and the new one. (A small increase is probably acceptable because this will enable us to move to Python 3.11 or 3.12, which are reportedly significantly faster than Python 3.9.) The test case I created for profiling is quite simple, consisting of two separate runs. For the first run:
For the second run:
The time to write the database is about 11-12 s, compared to an overall run time of about 100 s. Most of the database writing is related to writing the layout. Profiling results for writing to the database:
Profiling results for reading from the database:
Again, the new implementation is slightly slower than the old one, although not by much (1.4 s vs. 1.5 s). PR #589 is slower across the board, even for parts of the code that are not touched by the
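For reference, timings like these can be gathered with cProfile; the sketch below uses a placeholder writeCase() function rather than the actual profiling script used here:

```python
import cProfile
import pstats

def writeCase():
    # Placeholder: load the test case and write its database with the
    # new JaggedArray-based packing.
    ...

cProfile.run("writeCase()", "dbWrite.prof")
pstats.Stats("dbWrite.prof").sort_stats("cumulative").print_stats(20)
```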
@mgjarrett Your timing results are really interesting. I suppose Python 3.9 is the target version now, but 3.11 will be soon. On the one hand, people might not like the speed decrease; on the other hand, we need to be able to unlock modern versions of NumPy and Python. And both of those are blocked by this NumPy issue.
The speed decrease is not even noticeable (considering the normal variation in run times) for an actual case. I'm quite pumped about this!
@mgjarrett When this PR is ready for review, please move it from Draft to normal. Right now, I can't tell when you think this PR is ready for prime time. (And again, ARMI discourages opening Draft PRs.)
Great work!
Thanks so much! I have really been wanting this feature, but couldn't find time on the schedule.
You rock.
Make the new jagged data unpacking routine compatible with the old database format (prior to #1726). Add a unit test to check compatibility.
Empty lists could be stored in the database under the old jagged array implementation. The new implementation in PR #1726 treated all empty arrays as None, and was incapable of reading empty lists from old databases that had them. This PR updates the packing and unpacking routines for special data so that empty lists can be stored and read back from the database.
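The distinction being fixed can be seen with a bare h5py round trip of a zero-length dataset; this is only a sketch with made-up dataset names, not the ARMI database layer:

```python
import h5py
import numpy as np

# Write an entry that happens to be an empty list under a jagged layout.
with h5py.File("example.h5", "w") as h5:
    h5.create_dataset("someBlockParam/data", data=np.array([], dtype=float))
    h5["someBlockParam"].attrs["shapes"] = np.array([[0]])  # illustrative bookkeeping

# Read it back: the fix is to recover an empty list, not None.
with h5py.File("example.h5", "r") as h5:
    data = h5["someBlockParam/data"][:]
    value = [] if data.size == 0 else data.tolist()
```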
What is the change?
Implement a JaggedArray class to interface between our jagged data structures (e.g., pin-level data within a block) and the HDF5 database without using numpy to flatten the jagged array.

Why is the change being made?
The current code feeds jagged arrays to numpy and then uses numpy to flatten them for storing in the database and reconstitute them when reading data back from the database. Numpy no longer supports jagged arrays, so we are currently pinned to version 1.23.5 (the latest release is 1.26.4).
The current version of numpy we are using gives a VisibleDeprecationWarning about this:
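A minimal illustration of that warning (not taken from the ARMI code):

```python
import numpy as np  # assuming numpy <= 1.23.x here

# Pin-like jagged data: rows of different lengths.
jagged = [[1.0, 2.0], [3.0, 4.0, 5.0]]

# On numpy 1.23.x this emits:
#   VisibleDeprecationWarning: Creating an ndarray from ragged nested
#   sequences ... is deprecated. If you meant to do this, you must
#   specify 'dtype=object' when creating the ndarray.
arr = np.array(jagged)
```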
This change introduces a JaggedArray class that handles the packing into and unpacking out of 1D data structures for the HDF5 database. This allows us to use a current version of numpy.

There is a discussion about this topic: #1669
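To make the packing/unpacking idea concrete, here is a minimal sketch of the approach; the class name, method, and offsets/shapes bookkeeping are illustrative assumptions, not the actual JaggedArray implementation in this PR:

```python
import numpy as np

class JaggedArraySketch:
    """Illustrative stand-in for the JaggedArray idea (not the ARMI class):
    flatten jagged per-object data into one 1D array plus the per-entry
    shapes and offsets needed to rebuild it after a round trip to HDF5."""

    def __init__(self, jaggedData):
        rows = [np.asarray(row) for row in jaggedData]
        self.shapes = [row.shape for row in rows]
        sizes = [row.size for row in rows]
        # Offsets into the flattened buffer; empty rows just get size 0.
        self.offsets = np.concatenate(([0], np.cumsum(sizes)))[:-1]
        nonEmpty = [row.ravel() for row in rows if row.size]
        self.flattenedArray = np.concatenate(nonEmpty) if nonEmpty else np.array([])

    def unpack(self):
        """Rebuild the original list of (possibly empty) arrays."""
        out = []
        for offset, shape in zip(self.offsets, self.shapes):
            size = int(np.prod(shape))
            out.append(self.flattenedArray[int(offset):int(offset) + size].reshape(shape))
        return out


# Round trip, including an empty entry that should stay empty (not None).
packed = JaggedArraySketch([[1.0, 2.0], [], [3.0, 4.0, 5.0]])
unpacked = packed.unpack()
```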
Also, I found this open ticket (#246) after having done all of the work in this PR. I didn't realize how much others had already thought about how to work around this. Well, I guess I'm throwing my potential solution in the ring. As was noted many times in that ticket, matching the performance of numpy might be the main challenge. Hopefully, the amount of jagged data we have is small enough that the performance hit isn't too significant.
Notes about numpy compatibility
Using a current version of numpy, if you try to pass jagged data into numpy.array(), you get an error:
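For example (an illustration, not output from the PR itself), with numpy >= 1.24:

```python
import numpy as np  # assuming numpy >= 1.24 here

# Rows of different lengths are no longer silently accepted:
np.array([[1.0, 2.0], [3.0, 4.0, 5.0]])
# ValueError: setting an array element with a sequence. The requested
# array has an inhomogeneous shape after 1 dimensions. ...
```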
You can get numpy to accept it if you pass in dtype=object, but this data type doesn't have the features needed to facilitate flattening the array to store in HDF5, i.e., array.flatten() doesn't do anything. So I don't know whether this would help us at all. I'm not 100% positive, but I think we'd still be in the same position of needing to implement the packing and unpacking routines ourselves.
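A small illustration of the flatten() behavior described above (again, not code from the PR):

```python
import numpy as np

objArr = np.array([[1.0, 2.0], [3.0, 4.0, 5.0]], dtype=object)
flat = objArr.flatten()
# flat is still a length-2 object array whose elements are the original
# lists -- not a 1D float array of length 5 -- so it doesn't help with
# writing a flat dataset to HDF5.
```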
Checklist

- The documentation is still up-to-date in the doc folder.
- The dependencies are still up-to-date in pyproject.toml.