Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix/walltime-conversion #414

Merged
merged 1 commit into from
Apr 12, 2023
Merged

Conversation

bgunnar5
Copy link
Member

@bgunnar5 bgunnar5 commented Apr 3, 2023

This is still a draft but it's a start to a solution for #412. In the comments on that issue I came up with two solutions and I'm using the one I think is best. I'm open to feedback and willing to change this if we think there's a better solution.

This is most likely an issue that I'll revisit in the future since @jwhite242 and I have talked about creating a one-size-fits all walltime value in Maestro at some point.

@bgunnar5 bgunnar5 marked this pull request as draft April 3, 2023 18:32
@koning
Copy link
Member

koning commented Apr 3, 2023

The flux calls need a different time format, should have tests for slurm, flux and pbs for worker alloc as well as run:

flux:
-t, --time-limit=FSD
Set a time limit for the job in Flux standard duration (RFC 23).
FSD is a floating point number with a single character units
suffix ("s", "m", "h", or "d"). If unspecified, the job is sub-
ject to the system default time limit.

slurm:
-t, --time=
Set a limit on the total run time of the job allocation. If the
requested time limit exceeds the partition's time limit, the job
will be left in a PENDING state (possibly indefinitely). The
default time limit is the partition's default time limit. When
the time limit is reached, each task in each job step is sent
SIGTERM followed by SIGKILL. The interval between signals is
specified by the Slurm configuration parameter KillWait. The
OverTimeLimit configuration parameter may permit the job to run
longer than scheduled. Time resolution is one minute and second
values are rounded up to the next minute.

          A time limit of zero requests that no  time  limit  be  imposed.
          Acceptable  time  formats  include "minutes", "minutes:seconds",
          "hours:minutes:seconds", "days-hours", "days-hours:minutes"  and
          "days-hours:minutes:seconds".

pbs:
qsub -l walltime=hours:minutes:seconds script

@lucpeterson
Copy link
Member

I think instead we should use this function:
merlin.utils.convert_timestring

def convert_timestring(timestring: Union[str, int], format_method: str = "HMS") -> str:

it takes a string and returns another string in the right format, which is an argument, and already has hooks into the proper pre presentation for slurm and flux. It also checks for ints vs strings

Implementing should be something like (bad pseudo code)
if batch==flux: format_method=“FSD”
if batch==slurm: format_method=“HMS”
walltime = convert_timestring(yaml_str, format_method=format_method)

etc

@koning
Copy link
Member

koning commented Apr 4, 2023

Yes, that function is used in the scriptadapter but not the batch alloc yet.

@bgunnar5
Copy link
Member Author

bgunnar5 commented Apr 5, 2023

Whoops I somehow didn't see that the convert_timestring function took in ints, I'll try switching to that.

Should we just ignore the walltime value if lsf is the scheduler since it doesn't seem to have a walltime flag? I did a little digging and they do have a walltime flag (-W) for bmod but not one for jsrun that I can find.

@koning
Copy link
Member

koning commented Apr 5, 2023

I think your original plan to keep the string in the resulting yaml is a good one. We can just use convert_timestring internally. I don't think jsrun has a walltime flag either. On LSF you really need the bsub/jsrun combo. So the ultimate solution will be to emit a bsub script with the jsrun launch.

@bgunnar5
Copy link
Member Author

bgunnar5 commented Apr 5, 2023

I have a feeling that's going to require a lot of changes so maybe we hold off on that in this PR?

@koning
Copy link
Member

koning commented Apr 5, 2023

Yes, that is a future improvement and a low priority.

add test for walltime conversion

modify changelog

run fix-style and fix issue with unit tests

modify walltime to work with different formats for all schedulers

modify walltime tests
@bgunnar5 bgunnar5 marked this pull request as ready for review April 12, 2023 16:18
@bgunnar5
Copy link
Member Author

I got the nod of approval on this from both Luc and Joe via teams so I'm merging this now.

@bgunnar5 bgunnar5 merged commit 7f75cd4 into LLNL:develop Apr 12, 2023
@bgunnar5 bgunnar5 deleted the bugfix/walltime branch April 12, 2023 17:19
@bgunnar5 bgunnar5 mentioned this pull request Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants