added flux adaptor v0.31.0 #395

bhatiaharsh · 2022-05-16T23:23:20Z

copied over the v0.26 file to a new one for v0.31.
added a new "latest" version value.

simply followed the existing style and strategy to expand to a new version. are we going to need a new file for each version?

FrankD412 · 2022-05-16T23:29:16Z

@bhatiaharsh -- Thanks for filing this PR; since you made a new file, I can't tell if anything is actually different in the adapter compared to 0.26.0. I can check later, but if this is the same interface as 0.26.0 then there's no need to check in a new file for it. The file just needs to be checked in if 0.31.0 has different API calls and changes that the prior adapter can't handle.

bhatiaharsh · 2022-05-17T00:07:55Z

okay. i just copied the file, since i thought you're grabbing the appropriate module out. i think the new file is not needed. when you get a chance, would you please add the minimal needed here?

FrankD412 · 2022-05-17T01:33:39Z

@bhatiaharsh -- does the adapter work when specifying 0.26.0 in the most recent version of Flux? If there's nothing abnormal happening I don't think any modifications need to be made.

One thing that I had been intending to do is to make the Flux adapters more version range focused, or to outright try and detect what version was in the environment. But that would have to be a future todo item.

jwhite242 · 2022-05-17T18:48:33Z

@bhatiaharsh -- does the adapter work when specifying 0.26.0 in the most recent version of Flux? If there's nothing abnormal happening I don't think any modifications need to be made.

One thing that I had been intending to do is to make the Flux adapters more version range focused, or to outright try and detect what version was in the environment. But that would have to be a future todo item.

So, looking at the flux factory this range selection looks easy to implement on the dispatch side at least (though sounds like detecting flux version could be trouble?). Question there is are the existing versions amenable to that such that we could change to ranges bracketed by the existing adapters plus a >= 0.26, or do we need something more like exact versions for the old ones ( < 0.26 ) and use the ranges only going forward from 0.26?

bhatiaharsh · 2022-05-17T19:37:24Z

the v0.26 adapter did work just fine, so i dont think any changes are needed.

the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.

FrankD412 · 2022-05-17T21:36:53Z

the v0.26 adapter did work just fine, so i dont think any changes are needed.

the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.

The number at the top of the script is reflective of the backend adapter key that was used to generate and submit the script. That's so that there's a record of what was used for debug if something goes wrong. There would be something like that with the ranged versioning most likely since we'd want to be able to root cause issues on the right adapter.

This conversation is giving me ideas on how to best handle this issue, but as Jeremy said the ideal would be to detect the version which means that Flux has to be in the environment with the right Python bindings. If I'm recalling correctly, @jwhite242 was running into some issues with that on LC machines (and that was the reason for going with a purely Spack installed version of Flux + getting latest functionality).

jwhite242 · 2022-05-18T16:11:34Z

the v0.26 adapter did work just fine, so i dont think any changes are needed.
the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.

The number at the top of the script is reflective of the backend adapter key that was used to generate and submit the script. That's so that there's a record of what was used for debug if something goes wrong. There would be something like that with the ranged versioning most likely since we'd want to be able to root cause issues on the right adapter.

This conversation is giving me ideas on how to best handle this issue, but as Jeremy said the ideal would be to detect the version which means that Flux has to be in the environment with the right Python bindings. If I'm recalling correctly, @jwhite242 was running into some issues with that on LC machines (and that was the reason for going with a purely Spack installed version of Flux + getting latest functionality).

Yeah, until this gets resolved (flux-framework/flux-core#2145) I think we do have to just continue building it fresh every time to ensure it's talking to the same python as maestro, or make a cli based version like with the other adapters if that ui is stable enough yet. Even with that though, that would be useful for provenance/debugging if we can get the actual version dumped into the generated files.

bhatiaharsh · 2022-05-18T19:26:53Z

it does look like something is broken. just documenting the log here until i figure out more detail. @FrankD412 will recognize this mummi logging.

2022-05-18 14:48:56,952 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:504 - DEBUG - [148.312 MB]: [createsim] Fetching status for 45 jobs
2022-05-18 14:48:56,952 - [wf_manager:164728] maestrowf.interfaces.script.fluxscriptadapter:check_jobs:226 - DEBUG - [148.312 MB]: Joblist type -- <class 'list'>
2022-05-18 14:48:56,953 - [wf_manager:164728] maestrowf.interfaces.script.fluxscriptadapter:check_jobs:227 - DEBUG - [148.312 MB]: Joblist contents -- [989855790080, 995375494144, 1000928752640, 1006414902272, 1011917829120, 1017403978752, 1022890128384, 1028376278016, 1033895982080, 1039398908928, 1044885058560, 1050371208192, 1055924466688, 1061427393536, 1066913543168, 1072433247232, 1077919396864, 1083405546496, 1088891696128, 1094377845760, 1099863995392, 1105350145024, 1110819517440, 1116305667072, 1121791816704, 1127277966336, 1132764115968, 1138250265600, 1143736415232, 1149239342080, 1154742268928, 1160245195776, 1165731345408, 1171217495040, 1176703644672, 1182189794304, 1187726275584, 1193245979648, 1198732129280, 1204218278912, 1209721205760, 1215391904768, 1220945163264, 1226431312896, 1231934239744]
2022-05-18 14:48:56,953 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:get_statuses:129 - DEBUG - [148.312 MB]: Handle address -- 0x20001ffcb550
2022-05-18 14:48:56,991 - [wf_manager:164728] mummi_core.workflow.jobTracker:check_sim_status:263 - DEBUG - [148.312 MB]: [createsim] found (/gpfs/alpine/proj-shared/lrn005/Campaign4/roots/mummi_c4b_051422/sims-cg/mu18-0ras0raf2ras4A1raf4A-i6_000000001657)/(createsims_success)
2022-05-18 14:48:56,995 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:561 - DEBUG - [148.312 MB]: [JOBID 1132764115968, JOB Job[createsim]: id = 1132764115968, sims = ['mu18-0ras0raf2ras4A1raf4A-i6_000000001657']] : Status: [<SimulationStatus.Failed: 2>]    | Running? True | Timedout? False       | Continue? False       | Cancel? True  |
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:566 - DEBUG - [148.312 MB]: [createsim] sims: success = []
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:567 - DEBUG - [148.312 MB]: [createsim] sims: failed = ['mu18-0ras0raf2ras4A1raf4A-i6_000000001657']
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:568 - DEBUG - [148.312 MB]: [createsim] sims: continue = []
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:582 - INFO - [148.312 MB]: [createsim] processed all jobs. (#jobs: continue = 44, reclaim = 1, cancel = 44), (#sims: success = 0, failed = 1, continue = 0)
2022-05-18 14:48:56,998 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:600 - DEBUG - [148.312 MB]: [createsim] Cancelling simulations (njobs_cancel) = 1
2022-05-18 14:48:56,998 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:164 - DEBUG - [148.312 MB]: Handle address -- 0x20001ffcb550
2022-05-18 14:48:56,999 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:166 - DEBUG - [148.312 MB]: Attempting to cancel jobs.
Joblist:
1132764115968
2022-05-18 14:48:56,999 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:175 - DEBUG - [148.312 MB]: Cancelling Job 1132764115968...
2022-05-18 14:48:57,001 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:178 - ERROR - [148.312 MB]: [Errno 2] unknown job id
2022-05-18 14:48:57,001 - [wf_manager:164728] mummi_core.workflow.jobTracker:cancel_jobs:800 - ERROR - [148.312 MB]: [createsim] Unknown error: <maestrowf.interfaces.script.CancellationRecord object at 0x20001fffe340>

FrankD412 · 2022-05-19T02:56:57Z

@bhatiaharsh -- do you happen to know if the job you're requesting to cancel is the failed job? I seem to recall that we try and cancel it anyway because the workflow checks for a dropped file. If that's the case, then it's likely cancelling a job that's been marked as failed.

FrankD412 · 2022-08-13T23:27:30Z

@bhatiaharsh Just want to follow up on this -- was this resolved?

Jmast · 2022-10-13T14:51:52Z

The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).

The following change is needed to the state method in FluxInterface_0260 to make it work:

maestrowf/interfaces/script/_flux/flux0_26_0.py

@staticmethod
def state(state):
    if state == "D":
        return State.PENDING
    elif state == "S":
        return State.QUEUED
    elif state == "R":
        return State.RUNNING
    elif state == "C":
        return State.FINISHING
    elif state == "CD":
        return State.FINISHED
    elif state == "F":
        return State.FAILED
    elif state == "CA":
        return State.CANCELLED
    elif state == "TO":
        return State.TIMEDOUT
    else:
        LOGGER.error(f"Unhandled state: {state}")
        return State.UNKNOWN

jwhite242 · 2022-10-13T17:00:25Z

The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).

The following change is needed to the state method in FluxInterface_0260 to make it work:

maestrowf/interfaces/script/_flux/flux0_26_0.py

@staticmethod
def state(state):
    if state == "D":
        return State.PENDING
    elif state == "S":
        return State.QUEUED
    elif state == "R":
        return State.RUNNING
    elif state == "C":
        return State.FINISHING
    elif state == "CD":
        return State.FINISHED
    elif state == "F":
        return State.FAILED
    elif state == "CA":
        return State.CANCELLED
    elif state == "TO":
        return State.TIMEDOUT
    else:
        LOGGER.error(f"Unhandled state: {state}")
        return State.UNKNOWN

Thanks for sorting that out. I'll start rolling that up into an updated version of the adapter and get a patch out in the next couple days. @Jmast: Do you need a release pushed up to pypi for this asap or do you just work on the develop branch and don't care much about that?

Jmast · 2022-10-13T17:36:32Z

We are working from released versions of maestro, so It would be very helpful If you could roll this out ASAP to pypi. Thanks, -- Jeff

…

On Thu, Oct 13, 2022 at 11:00 AM Jeremy White ***@***.***> wrote: The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona). The following change is needed to the state method in FluxInterface_0260 to make it work: maestrowf/interfaces/script/_flux/flux0_26_0.py @staticmethod def state(state): if state == "D": return State.PENDING elif state == "S": return State.QUEUED elif state == "R": return State.RUNNING elif state == "C": return State.FINISHING elif state == "CD": return State.FINISHED elif state == "F": return State.FAILED elif state == "CA": return State.CANCELLED elif state == "TO": return State.TIMEDOUT else: LOGGER.error(f"Unhandled state: {state}") return State.UNKNOWN Thanks for sorting that out. I'll start rolling that up into an updated version of the adapter and get a patch out in the next couple days. @Jmast <https://github.com/Jmast>: Do you need a release pushed up to pypi for this asap or do you just work on the develop branch and don't care much about that? — Reply to this email directly, view it on GitHub <#395 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AATXDIAIMGTRYRWR2JXCH53WDA53HANCNFSM5WDBWJ7A> . You are receiving this because you were mentioned.Message ID: ***@***.***>

added flux adaptor v0.31.0

8b10a1a

FrankD412 assigned bhatiaharsh May 16, 2022

FrankD412 added Adapters Items that are directly related to Maestro's adapter structure. labels May 16, 2022

FrankD412 modified the milestone: Release 1.1.9 May 16, 2022

FrankD412 removed this from the Release 1.1.9 milestone May 28, 2022

Jmast mentioned this pull request Nov 23, 2022

dill>0.3.5.1 unable to pickle CompiledFFI object #406

Open

FrankD412 mentioned this pull request Nov 26, 2022

Introduces additional states for handling new versions of Flux #407

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added flux adaptor v0.31.0 #395

added flux adaptor v0.31.0 #395

bhatiaharsh commented May 16, 2022

FrankD412 commented May 16, 2022

bhatiaharsh commented May 17, 2022

FrankD412 commented May 17, 2022

jwhite242 commented May 17, 2022

bhatiaharsh commented May 17, 2022

FrankD412 commented May 17, 2022

jwhite242 commented May 18, 2022

bhatiaharsh commented May 18, 2022

FrankD412 commented May 19, 2022

FrankD412 commented Aug 13, 2022

Jmast commented Oct 13, 2022 •

edited

Loading

jwhite242 commented Oct 13, 2022

Jmast commented Oct 13, 2022 via email

added flux adaptor v0.31.0 #395

Are you sure you want to change the base?

added flux adaptor v0.31.0 #395

Conversation

bhatiaharsh commented May 16, 2022

FrankD412 commented May 16, 2022

bhatiaharsh commented May 17, 2022

FrankD412 commented May 17, 2022

jwhite242 commented May 17, 2022

bhatiaharsh commented May 17, 2022

FrankD412 commented May 17, 2022

jwhite242 commented May 18, 2022

bhatiaharsh commented May 18, 2022

FrankD412 commented May 19, 2022

FrankD412 commented Aug 13, 2022

Jmast commented Oct 13, 2022 • edited Loading

jwhite242 commented Oct 13, 2022

Jmast commented Oct 13, 2022 via email

Jmast commented Oct 13, 2022 •

edited

Loading