Update dataloaders to use Aggregation Lists #264

Merged
merged 28 commits into from
Mar 23, 2023

Collaborator

@al-rigazzi al-rigazzi commented Feb 27, 2023

This PR updates the TF and PyTorch data loaders to make use of SmartRedis's aggregation lists.

When training in parallel, we need to adopt a round-robin distribution of Datasets across ranks. This works well as long as the simulation produces data at roughly the rate the training consumes it, but if the simulation is much faster (or the list already holds many Datasets when training starts), we end up making many calls to fetch the interleaved batches. We could add another parameter in the future to change the interleaving/stride across ranks, but for now I think it will be fine.
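
To make the round-robin scheme above concrete, here is a minimal sketch of which list positions each rank would consume. The function name `datasets_for_rank` and its parameters are hypothetical and are not taken from the SmartSim code in this PR; the sketch only illustrates the striding.

```python
from typing import List


def datasets_for_rank(num_datasets: int, rank: int, num_replicas: int) -> List[int]:
    """Return the list positions one rank consumes under round-robin striding.

    Hypothetical helper for illustration only; not part of SmartSim.
    """
    if num_replicas < 1 or not 0 <= rank < num_replicas:
        raise ValueError("rank must satisfy 0 <= rank < num_replicas")
    # Rank r takes positions r, r + num_replicas, r + 2 * num_replicas, ...
    return list(range(rank, num_datasets, num_replicas))


if __name__ == "__main__":
    # Ten Datasets already in the list, four training ranks.
    for r in range(4):
        print(r, datasets_for_rank(10, r, 4))
```

With ten Datasets and four ranks, rank 1 gets positions 1, 5 and 9. The positions are never adjacent, so, assuming a single request can only return a contiguous range of the list, a long backlog translates into roughly one request per Dataset on every rank. That is the cost the comment above refers to.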

@al-rigazzi al-rigazzi marked this pull request as ready for review February 28, 2023 09:00
@al-rigazzi al-rigazzi added the type: refactor, area: ML, and API break labels Feb 28, 2023
Member

@MattToast MattToast left a comment

Looks great so far!!

I have marked a couple of places where I think some methods can be rewritten a bit more cleanly, and I am asking for a pretty substantial architecture change for the TF and Torch data generators, but overall, the meat of this PR looks fantastic!

As always, feel free to lmk what you think!!

tutorials/ml_training/surrogate/train_surrogate.ipynb (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/torch/data.py (outdated, resolved)
smartsim/ml/torch/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/data.py (resolved)
Contributor

@mellis13 mellis13 left a comment

Just a couple of comments in addition to Matt's comments

smartsim/ml/data.py (resolved)
smartsim/ml/data.py (outdated, resolved)

codecov bot commented Mar 10, 2023

Codecov Report

Merging #264 (7959514) into develop (fb967d9) will increase coverage by 2.64%.
The diff coverage is 93.98%.


@@             Coverage Diff             @@
##           develop     #264      +/-   ##
===========================================
+ Coverage    84.89%   87.53%   +2.64%     
===========================================
  Files           60       60              
  Lines         3423     3386      -37     
===========================================
+ Hits          2906     2964      +58     
+ Misses         517      422      -95     
Impacted Files Coverage Δ
smartsim/_core/control/controller.py 84.45% <ø> (ø)
smartsim/_core/control/manifest.py 94.78% <ø> (ø)
smartsim/experiment.py 80.43% <ø> (ø)
smartsim/ml/data.py 93.71% <93.70%> (+31.26%) ⬆️
smartsim/ml/torch/data.py 95.65% <94.44%> (+38.50%) ⬆️
smartsim/ml/tf/data.py 85.36% <94.73%> (+3.54%) ⬆️
smartsim/ml/__init__.py 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes

Member

@MattToast MattToast left a comment

Looks great!! I found a couple of last-minute things to address, but overall this looks about ready to go!

smartsim/ml/data.py (resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
smartsim/ml/tf/data.py (outdated, resolved)
smartsim/ml/tf/data.py (outdated, resolved)
smartsim/ml/torch/data.py (outdated, resolved)
smartsim/ml/torch/data.py (outdated, resolved)
tests/backends/test_dataloader.py (outdated, resolved)
smartsim/ml/tf/data.py (outdated, resolved)
smartsim/ml/data.py (outdated, resolved)
@@ -41,7 +41,7 @@
"outputs": [
Contributor

I tried running the notebook in a new dev environment (numpy 1.24). I got the error below. Downgrading to 1.23 worked. Do you think we should pin to a version or update the heat transfer source files? It seems like the data type in question was deprecated several years ago.

Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[3], line 14
     11 size = 64
     13 for _ in range(3):
---> 14     u_s = fd2d_heat_steady_test01 (size, size)
     15     pcolor_list(u_s, "Left: initial temperature. Right: steady state.")

File ~/craylabs/SmartSim/tutorials/ml_training/surrogate/steady_state.py:270, in fd2d_heat_steady_test01(nx, ny)
    267 source_centers = 0.2+np.random.rand(np.random.randint(1,6),2)*0.6
    269 Xgrid, Ygrid = np.meshgrid(xvec, yvec)
--> 270 u_init = np.zeros_like(Xgrid).astype(np.bool)
    271 for center in source_centers:
    272   u_init |= (Xgrid-center[0])**2 + (Ygrid-center[1])**2 < 0.05**2

File ~/miniconda3/envs/ss_test/lib/python3.9/site-packages/numpy/__init__.py:305, in __getattr__(attr)
    300     warnings.warn(
    301         f"In the future `np.{attr}` will be defined as the "
    302         "corresponding NumPy scalar.", FutureWarning, stacklevel=2)
    304 if attr in __former_attrs__:
--> 305     raise AttributeError(__former_attrs__[attr])
    307 # Importing Tester requires importing all of UnitTest which is not a
    308 # cheap import Since it is mainly used in test suits, we lazy import it
    309 # here to save on the order of 10 ms of import time for most users
    310 #
...

AttributeError: module 'numpy' has no attribute 'bool'.
`np.bool` was a deprecated alias for the builtin `bool`. To avoid this error in existing code, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Member

@MattToast MattToast Mar 16, 2023

Was able to confirm, as the deprecation message suggests, that swapping np.bool -> bool appears to give the desired output with both numpy==1.23.0 and numpy==1.24.2. I'm in favor of simply updating the simulation code!
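
For reference, a minimal sketch of the `np.bool` -> `bool` swap applied to the lines shown in the traceback. The function name `initial_sources`, the `np.linspace` grid, and the `centers`/`radius` parameters are assumptions added so the snippet runs on its own; only the `astype` call and the mask update mirror `steady_state.py`.

```python
import numpy as np


def initial_sources(nx, ny, centers, radius=0.05):
    """Boolean mask of circular heat sources on a unit-square grid.

    Hypothetical stand-in for the lines around steady_state.py:270.
    """
    # Assumed grid definition; the original xvec/yvec are not shown in the traceback.
    xvec = np.linspace(0.0, 1.0, nx)
    yvec = np.linspace(0.0, 1.0, ny)
    Xgrid, Ygrid = np.meshgrid(xvec, yvec)

    # Was: np.zeros_like(Xgrid).astype(np.bool), which raises AttributeError on NumPy >= 1.24.
    u_init = np.zeros_like(Xgrid).astype(bool)
    for cx, cy in centers:
        # Mark every grid point that lies inside the circle around this source.
        u_init |= (Xgrid - cx) ** 2 + (Ygrid - cy) ** 2 < radius ** 2
    return u_init


if __name__ == "__main__":
    mask = initial_sources(64, 64, [(0.5, 0.5)])
    print(mask.dtype, mask.shape, int(mask.sum()))
```

The builtin `bool` is accepted as a dtype by both NumPy versions mentioned above, so no version pin is needed for this line.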

Collaborator Author

Yeah, that code is old, fixing it won't hurt! Thanks for the catch!

Member

@MattToast MattToast left a comment

LGTM!! Thanks for all the hard work on this one! The use of aggregation lists here makes this whole section much easier to understand/utilize/extend imo!

Contributor

@mellis13 mellis13 left a comment

LGTM -- thanks!

@al-rigazzi al-rigazzi merged commit 35857f6 into CrayLabs:develop Mar 23, 2023
@al-rigazzi al-rigazzi deleted the update_dataloaders branch March 23, 2023 15:16