Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ambiguous API for DB Object Placement Swallows first_device Parameter #534

Open
MattToast opened this issue Mar 27, 2024 · 1 comment
Open
Labels
API break Issues that include incompatible API changes area: api Issues related to API changes area: ML Issues related to SmartSim ML classes and utilities bug: minor A minor bug short task Issues that can be completed and reviewed quickly type: refactor Issues focused on refactoring existing code

Comments

@MattToast
Copy link
Member

Description

When a user specifies a first_device parameter, we do not use it to format the enumerated devices IFF the num_devices_per_node != 1. This can lead to some unexpected formatting of enumerated devices.

How to reproduce

Consider the Following Script

from smartsim.entity.dbobject import DBModel

def pprint_db_obj(obj):
    print(f"{obj.name}: {obj.devices}")

def main():
    o1 = DBModel(name="Normal Object", model=b"some-model", backend="TORCH",
                 device="GPU")
    o2 = DBModel(name="Object w/ Device as str", model=b"some-model", backend="TORCH",
                 device="GPU:1")
    o3 = DBModel(name="Object w/ First Device", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=1)
    o4 = DBModel(name="Object w/ Single Enumerated First Device", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=1, devices_per_node=1)
    o5 = DBModel(name="Object w/ Multi Enumerated First Device", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=1, devices_per_node=3)
    o6 = DBModel(name="Object w/ Conflicting First Device", model=b"some-model", backend="TORCH",
                 device="GPU:123", first_device=456)

    pprint_db_obj(o1)
    pprint_db_obj(o2)
    pprint_db_obj(o3)
    pprint_db_obj(o4)
    pprint_db_obj(o5)
    pprint_db_obj(o6)
    print("=" * 20)


    o7 = DBModel(name="Object where/ First Device should be 123", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=123)

    pprint_db_obj(o7)
    print("     Yikes!! We have no pin for this GPU!!! ^^^")

    return 0

if __name__ == "__main__":
    raise SystemExit(main())

This is the current output:

Normal Object: ['GPU']
Object w/ Device as str: ['GPU:1']
Object w/ First Device: ['GPU']
Object w/ Single Enumerated First Device: ['GPU']
Object w/ Multi Enumerated First Device: ['GPU:1', 'GPU:2', 'GPU:3']
Object w/ Conflicting First Device: ['GPU:123']
====================
Object where/ First Device should be 123: ['GPU']
     Yikes!! We have no pin for this GPU!!! ^^^

Expected behavior

This is what I would expect for output

Normal Object: ['GPU']
Object w/ Device as str: --> N/A: Constructing a DBObject where `device` != `CPU` or `GPU` (case insensitive) should raise an error <--
                         --> NOTE: THIS WOULD CONSTITUTE AN API BREAK <--
Object w/ First Device: ['GPU:1']
Object w/ Single Enumerated First Device: ['GPU:1']
Object w/ Multi Enumerated First Device: ['GPU:1', 'GPU:2', 'GPU:3']
Object w/ Conflicting First Device: --> N/A: Constructing a DBObject where `device` != `CPU` or `GPU` (case insensitive) should raise an error <--
                                    --> NOTE: THIS WOULD CONSTITUTE AN API BREAK <--
====================
Object where/ First Device should be 123: ['GPU:123']
         Hooray! This is what I expected!!  ^^^^^^^

But of course this can be open to discussion/debate. At the very least, I think we should error if a user specifies device="GPU:123" and first_device=456.

System

Any/All

@MattToast MattToast added type: refactor Issues focused on refactoring existing code area: ML Issues related to SmartSim ML classes and utilities area: api Issues related to API changes bug: minor A minor bug labels Mar 27, 2024
@MattToast MattToast added the API break Issues that include incompatible API changes label Mar 27, 2024
@mellis13 mellis13 added the short task Issues that can be completed and reviewed quickly label Apr 5, 2024
@mellis13
Copy link
Contributor

mellis13 commented Apr 5, 2024

Consensus was to raise an error if the device has a ":" specifier and the first device is specified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API break Issues that include incompatible API changes area: api Issues related to API changes area: ML Issues related to SmartSim ML classes and utilities bug: minor A minor bug short task Issues that can be completed and reviewed quickly type: refactor Issues focused on refactoring existing code
Projects
None yet
Development

No branches or pull requests

2 participants