-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SmartSim Multidb example #409
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## develop #409 +/- ##
===========================================
- Coverage 89.65% 89.60% -0.06%
===========================================
Files 59 59
Lines 3617 3617
===========================================
- Hits 3243 3241 -2
- Misses 374 376 +2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a very good start. Some improvements are left inline, but otherwise thanks for starting to tackle this! Also, can you rename your PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor changes and I think this will be good to go!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely done, some small change to request. I also agree with Bill's comments so be sure to address both of ours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to message me to clarify anything!
|
||
.. code-block:: bash | ||
|
||
Model: colo logger@00-00-00:The colocated db has tensor_1: True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you maybe print the tensors like you did above? Or just be consistent with how you should the results, so is it clear the same thing is happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda like how it's polling for the tensors. I think its kinda the same a little? Checks if the tensor are in the database, and it shows another client.function(). I think we could really do either but I like your idea of remaining consistent throughout the example hmmmmm, let me add a description on what poll_tensor does!
doc/orchestrator.rst
Outdated
experiment printed before it is launched. The summary | ||
will stay for 10 seconds, and it is useful as a last | ||
check. If we set `summary=False`, then the experiment | ||
would be launched immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The summary will show each instance that has been launched and completed in this Experiment" - from the SS API. A little misleading to say it will show a summary before it is launched? Is it true that it only appears for 10 seconds and then disappears?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! It does not disappear tho, it just waits to push output, so you basically have 10 seconds to scan the summary before it continue in the experiment, Ill change this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok changed, let me know if its more clear!
doc/orchestrator.rst
Outdated
exp.stop(single_shard_db, multi_shard_db) | ||
logger.info(exp.summary()) | ||
|
||
Application Script |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a good idea to add the name of the scripts that contain this information that you are using in your examples above (eg. exe_args="/lus/scratch/richaama/model_ex.py") is the application script. Also probably a good idea to have step by step instructions that show where files should be located in relation to smartsim, the contents of those files, what should be run (eg. only running experiment script, which uses the application script).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a step by step on what is needed to run the example would be useful (allocation, any module load stuff, the structure of the file system, contents of files, what exactly to run)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this but let me know what u think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im not sure how specific I can be on the setup instructions bc people might be installing in many dif ways, but hmmm @ashao and @juliaputko let me know if I can put submitting an allocation or anything else? It should be the same on all slurm machines right? But someone might want to change the launcher? so I could just say make sure you have allocated resources, lemme know what yall think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say point them at the beginning of the section, you could just say something along the lines of "This section assumes that you have already installed and built SmartSim and SmartRedis. Please refer to Section "Installing SmartSim" for further details. For simplicity, this example assumes that you are running on a SLURM-based HPC-platform. Refer to "Running this example" for more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dern!
doc/orchestrator.rst
Outdated
logger = get_logger("Multidb Experiment Log") | ||
exp = Experiment("getting-started-multidb", launcher="auto") | ||
|
||
single_shard_db = exp.create_database(port=6379, db_nodes=1, interface="ib0", db_identifier="single_shard_db_identifier") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a note on which interface to use depending on how you are running the example? I had to change mine to run it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is interesting! I believe smartsim is suppose to find the interface for you, hmmm let me ask someone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok ok this is being changed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
making a note that ill need to rerun example in future and make sure this does not happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to omit the ib0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
omitted
doc/orchestrator.rst
Outdated
visualization workflow connected to the simulation (requiring a standard orchestrator). | ||
|
||
Below is a simple example demonstrating the process of: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe have the description of these steps directly below the list of the steps. I was a bit confused when the first heading was "Application script" when I was expecting an explanation od the launching of two standard Orchestrators, which I think you have lower down. Might be good just for easily finding the information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay so I changed this, let me know if this its more clear, good find btw I can very much see where ur coming from
doc/orchestrator.rst
Outdated
|----|------------------------------|---------------|-------------|---------|---------|-----------|--------------| | ||
| 1 | colo_model | Model | 1538905.8 | 0 | 3.5465 | Completed | 0 | | ||
| 2 | single_shard_db_identifier_0 | DBNode | 1538905.5 | 0 | 73.1798 | Cancelled | 0 | | ||
| 3 | multi-shard-db-identifier | DBNode | 1538905.6+2 | 0 | 49.8503 | Cancelled | 0 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it true that these are inconsistent with whether they have an _0 appended or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
omg good catch, this is true, I pasted this from the terminal, I think I might know why this is happening tho
doc/orchestrator.rst
Outdated
@@ -497,12 +507,12 @@ to the ``create_model()`` function and assign to the variable ``model``. | |||
|
|||
Step 2: Colocate | |||
"""""""""""""""" | |||
To colocate the model, use the ``Model.colocate_db_tcp()`` function to | |||
To colocate the model, use the ``Model.colocate_db_uds()`` function to | |||
Colocate an Orchestrator instance with this Model over TCP/IP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UDS isn't a TCP/IP connection, but a Unix domain socket connection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that I need to watch a vid on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your hard work on this! A couple minor typos, but nothing that will require another round of review from me for. Please get signoff from Andrew and Julia before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more minor comments, but it's getting very close!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thanks for all the hard work on this
This PR merges the SmartSim mutlidb example into the craylab docs. It is step by step instructions on how to setup multiple orchestrators, connect to the orchs using smartredis clients and send, retrieve and manipulate data using the launched dbs and the ConfigOptions object. The manipulation of tensors is done with a torch function. Lastly, the source code is provided at the end of the example.