Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added flux adaptor v0.31.0 #395

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

bhatiaharsh
Copy link

  • copied over the v0.26 file to a new one for v0.31.
  • added a new "latest" version value.

simply followed the existing style and strategy to expand to a new version. are we going to need a new file for each version?

@FrankD412 FrankD412 added Adapters Items that are directly related to Maestro's adapter structure. labels May 16, 2022
@FrankD412 FrankD412 modified the milestone: Release 1.1.9 May 16, 2022
@FrankD412
Copy link
Member

@bhatiaharsh -- Thanks for filing this PR; since you made a new file, I can't tell if anything is actually different in the adapter compared to 0.26.0. I can check later, but if this is the same interface as 0.26.0 then there's no need to check in a new file for it. The file just needs to be checked in if 0.31.0 has different API calls and changes that the prior adapter can't handle.

@bhatiaharsh
Copy link
Author

okay. i just copied the file, since i thought you're grabbing the appropriate module out. i think the new file is not needed. when you get a chance, would you please add the minimal needed here?

@FrankD412
Copy link
Member

@bhatiaharsh -- does the adapter work when specifying 0.26.0 in the most recent version of Flux? If there's nothing abnormal happening I don't think any modifications need to be made.

One thing that I had been intending to do is to make the Flux adapters more version range focused, or to outright try and detect what version was in the environment. But that would have to be a future todo item.

@jwhite242
Copy link
Collaborator

@bhatiaharsh -- does the adapter work when specifying 0.26.0 in the most recent version of Flux? If there's nothing abnormal happening I don't think any modifications need to be made.

One thing that I had been intending to do is to make the Flux adapters more version range focused, or to outright try and detect what version was in the environment. But that would have to be a future todo item.

So, looking at the flux factory this range selection looks easy to implement on the dispatch side at least (though sounds like detecting flux version could be trouble?). Question there is are the existing versions amenable to that such that we could change to ranges bracketed by the existing adapters plus a >= 0.26, or do we need something more like exact versions for the old ones ( < 0.26 ) and use the ranges only going forward from 0.26?

@bhatiaharsh
Copy link
Author

the v0.26 adapter did work just fine, so i dont think any changes are needed.

the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.

@FrankD412
Copy link
Member

the v0.26 adapter did work just fine, so i dont think any changes are needed.

the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.

The number at the top of the script is reflective of the backend adapter key that was used to generate and submit the script. That's so that there's a record of what was used for debug if something goes wrong. There would be something like that with the ranged versioning most likely since we'd want to be able to root cause issues on the right adapter.

This conversation is giving me ideas on how to best handle this issue, but as Jeremy said the ideal would be to detect the version which means that Flux has to be in the environment with the right Python bindings. If I'm recalling correctly, @jwhite242 was running into some issues with that on LC machines (and that was the reason for going with a purely Spack installed version of Flux + getting latest functionality).

@jwhite242
Copy link
Collaborator

the v0.26 adapter did work just fine, so i dont think any changes are needed.
the only reason i went in to make this change was that when maestro creates the submit script, it adds the wrong version (0.26), because that seems to be hardcoded in. if there was a way to pull out the correct version, then these adapters would be more generic. ideally, the factory would pick the correct adapter based on the flux version.

The number at the top of the script is reflective of the backend adapter key that was used to generate and submit the script. That's so that there's a record of what was used for debug if something goes wrong. There would be something like that with the ranged versioning most likely since we'd want to be able to root cause issues on the right adapter.

This conversation is giving me ideas on how to best handle this issue, but as Jeremy said the ideal would be to detect the version which means that Flux has to be in the environment with the right Python bindings. If I'm recalling correctly, @jwhite242 was running into some issues with that on LC machines (and that was the reason for going with a purely Spack installed version of Flux + getting latest functionality).

Yeah, until this gets resolved (flux-framework/flux-core#2145) I think we do have to just continue building it fresh every time to ensure it's talking to the same python as maestro, or make a cli based version like with the other adapters if that ui is stable enough yet. Even with that though, that would be useful for provenance/debugging if we can get the actual version dumped into the generated files.

@bhatiaharsh
Copy link
Author

it does look like something is broken. just documenting the log here until i figure out more detail. @FrankD412 will recognize this mummi logging.

2022-05-18 14:48:56,952 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:504 - DEBUG - [148.312 MB]: [createsim] Fetching status for 45 jobs
2022-05-18 14:48:56,952 - [wf_manager:164728] maestrowf.interfaces.script.fluxscriptadapter:check_jobs:226 - DEBUG - [148.312 MB]: Joblist type -- <class 'list'>
2022-05-18 14:48:56,953 - [wf_manager:164728] maestrowf.interfaces.script.fluxscriptadapter:check_jobs:227 - DEBUG - [148.312 MB]: Joblist contents -- [989855790080, 995375494144, 1000928752640, 1006414902272, 1011917829120, 1017403978752, 1022890128384, 1028376278016, 1033895982080, 1039398908928, 1044885058560, 1050371208192, 1055924466688, 1061427393536, 1066913543168, 1072433247232, 1077919396864, 1083405546496, 1088891696128, 1094377845760, 1099863995392, 1105350145024, 1110819517440, 1116305667072, 1121791816704, 1127277966336, 1132764115968, 1138250265600, 1143736415232, 1149239342080, 1154742268928, 1160245195776, 1165731345408, 1171217495040, 1176703644672, 1182189794304, 1187726275584, 1193245979648, 1198732129280, 1204218278912, 1209721205760, 1215391904768, 1220945163264, 1226431312896, 1231934239744]
2022-05-18 14:48:56,953 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:get_statuses:129 - DEBUG - [148.312 MB]: Handle address -- 0x20001ffcb550
2022-05-18 14:48:56,991 - [wf_manager:164728] mummi_core.workflow.jobTracker:check_sim_status:263 - DEBUG - [148.312 MB]: [createsim] found (/gpfs/alpine/proj-shared/lrn005/Campaign4/roots/mummi_c4b_051422/sims-cg/mu18-0ras0raf2ras4A1raf4A-i6_000000001657)/(createsims_success)
2022-05-18 14:48:56,995 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:561 - DEBUG - [148.312 MB]: [JOBID 1132764115968, JOB Job[createsim]: id = 1132764115968, sims = ['mu18-0ras0raf2ras4A1raf4A-i6_000000001657']] : Status: [<SimulationStatus.Failed: 2>]    | Running? True | Timedout? False       | Continue? False       | Cancel? True  |
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:566 - DEBUG - [148.312 MB]: [createsim] sims: success = []
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:567 - DEBUG - [148.312 MB]: [createsim] sims: failed = ['mu18-0ras0raf2ras4A1raf4A-i6_000000001657']
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:568 - DEBUG - [148.312 MB]: [createsim] sims: continue = []
2022-05-18 14:48:56,996 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:582 - INFO - [148.312 MB]: [createsim] processed all jobs. (#jobs: continue = 44, reclaim = 1, cancel = 44), (#sims: success = 0, failed = 1, continue = 0)
2022-05-18 14:48:56,998 - [wf_manager:164728] mummi_core.workflow.jobTracker:update:600 - DEBUG - [148.312 MB]: [createsim] Cancelling simulations (njobs_cancel) = 1
2022-05-18 14:48:56,998 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:164 - DEBUG - [148.312 MB]: Handle address -- 0x20001ffcb550
2022-05-18 14:48:56,999 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:166 - DEBUG - [148.312 MB]: Attempting to cancel jobs.
Joblist:
1132764115968
2022-05-18 14:48:56,999 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:175 - DEBUG - [148.312 MB]: Cancelling Job 1132764115968...
2022-05-18 14:48:57,001 - [wf_manager:164728] maestrowf.interfaces.script._flux.flux0_26_0:cancel:178 - ERROR - [148.312 MB]: [Errno 2] unknown job id
2022-05-18 14:48:57,001 - [wf_manager:164728] mummi_core.workflow.jobTracker:cancel_jobs:800 - ERROR - [148.312 MB]: [createsim] Unknown error: <maestrowf.interfaces.script.CancellationRecord object at 0x20001fffe340>

@FrankD412
Copy link
Member

@bhatiaharsh -- do you happen to know if the job you're requesting to cancel is the failed job? I seem to recall that we try and cancel it anyway because the workflow checks for a dropped file. If that's the case, then it's likely cancelling a job that's been marked as failed.

@FrankD412 FrankD412 removed this from the Release 1.1.9 milestone May 28, 2022
@FrankD412
Copy link
Member

@bhatiaharsh Just want to follow up on this -- was this resolved?

@Jmast
Copy link

Jmast commented Oct 13, 2022

The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).

The following change is needed to the state method in FluxInterface_0260 to make it work:

maestrowf/interfaces/script/_flux/flux0_26_0.py

@staticmethod
def state(state):
    if state == "D":
        return State.PENDING
    elif state == "S":
        return State.QUEUED
    elif state == "R":
        return State.RUNNING
    elif state == "C":
        return State.FINISHING
    elif state == "CD":
        return State.FINISHED
    elif state == "F":
        return State.FAILED
    elif state == "CA":
        return State.CANCELLED
    elif state == "TO":
        return State.TIMEDOUT
    else:
        LOGGER.error(f"Unhandled state: {state}")
        return State.UNKNOWN

@jwhite242
Copy link
Collaborator

The current version of maestro (1.1.9) does not work with flux 0.43.0 (running on corona).

The following change is needed to the state method in FluxInterface_0260 to make it work:

maestrowf/interfaces/script/_flux/flux0_26_0.py

@staticmethod
def state(state):
    if state == "D":
        return State.PENDING
    elif state == "S":
        return State.QUEUED
    elif state == "R":
        return State.RUNNING
    elif state == "C":
        return State.FINISHING
    elif state == "CD":
        return State.FINISHED
    elif state == "F":
        return State.FAILED
    elif state == "CA":
        return State.CANCELLED
    elif state == "TO":
        return State.TIMEDOUT
    else:
        LOGGER.error(f"Unhandled state: {state}")
        return State.UNKNOWN

Thanks for sorting that out. I'll start rolling that up into an updated version of the adapter and get a patch out in the next couple days. @Jmast: Do you need a release pushed up to pypi for this asap or do you just work on the develop branch and don't care much about that?

@Jmast
Copy link

Jmast commented Oct 13, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Adapters Items that are directly related to Maestro's adapter structure.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants