stdout can't be viewed as job progresses #209

Open
stsievert opened this issue Apr 8, 2020 · 3 comments

@stsievert

stsievert commented Apr 8, 2020

What's your issue?
I have launched several (supposedly) short jobs. On EC2 with a modern NVIDIA GPU, they take around 40 minutes. I have launched these jobs on HTCondor, and specified a GPU that's less modern. The jobs apparently take at least 120 minutes on this lower capability GPU.

I'd like some idea of the job progress, and am printing some items to stdout to view the progress. So, let's view the output of one of the running jobs:

(base) [stsievert@submit2 exp-cifar10]$ htmap stdout adadamp 0
# hangs...

This hangs indefinitely. This means I can't monitor the progress of any one component; I have to wait for that component to complete.

What would resolve your issue?
It would be resolved if a job's stdout could be viewed before the job has completed:

(base) [stsievert@submit2 exp-cifar10]$ htmap stdout foo 0  # get output so far
iteration 0 out of 100, loss: 2.2
iteration 1 out of 100, loss: 1.2
iteration 2 out of 100, loss: 0.9
(base) [stsievert@submit2 exp-cifar10]$ # wait a while
(base) [stsievert@submit2 exp-cifar10]$ htmap stdout foo 0
iteration 0 out of 100, loss: 2.2
iteration 1 out of 100, loss: 1.2
iteration 2 out of 100, loss: 0.9
# ...
iteration 20 out of 100, loss: 0.02
iteration 21 out of 100, loss: 0.017
(base) [stsievert@submit2 exp-cifar10]$ # wait even longer
(base) [stsievert@submit2 exp-cifar10]$ htmap stdout foo 0
iteration 0 out of 100, loss: 2.2
iteration 1 out of 100, loss: 1.2
iteration 2 out of 100, loss: 0.9
# ...
iteration 98 out of 100, loss: 0.0012
iteration 99 out of 100, loss: 0.001
@stsievert stsievert added the enhancement New feature or request label Apr 8, 2020
@bbockelm

bbockelm commented Apr 9, 2020

Josh - do we currently export the API for condor_tail? That should allow dynamic fetching of the stdout.

Scott - if there was a standard API from the map to send percent-complete notifications to the submit host, would you be in a place to use it?

@JoshKarpel

@bbockelm we don't, and I think that's probably the right solution for this specific problem (wanting to look at the raw stdout of a single job). @stsievert, does that describe your use case? Do you only want to use this from the CLI, or do you want to be able to retrieve live stdout programmatically?

If you'd like to look at stdout/stderr from multiple jobs, you could set the map option stream_output = "True". You shouldn't do this unless you actually want to watch all of the stdout live, since it puts unnecessary load on HTCondor.

Quick brainstorming on a more generic "progress tracker" API: I think we could provide execute-side functions that look like this:

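def map_this_function(x):
    number_of_steps = 10  # placeholder; depends on the work being done
    for step_index in range(number_of_steps):
        # do work...
        # proposed function, not an existing htmap API
        htmap.update_progress(step_index, number_of_steps)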

and we could do the tqdm trick to wrap it up as an iterator:

for step in htmap.range(10):
    ...
for item in htmap.progress(items):
    ...

and could do map.progress[component] submit-side to get those numbers back. Percentage completion is probably ill-defined for a lot of work, so reporting a pair of integers seems more flexible (the user can then choose how to display the data). Anyway, this feature is probably out of scope for this particular issue, but it can become a separate issue if this sounds useful.
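
To make the iterator idea concrete, here is a minimal sketch of a tqdm-style wrapper. None of it is htmap API: `progress`, `progress_range`, and the `report` callback are hypothetical names, and the default reporter only prints the pair of integers instead of sending it to the submit host.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def print_progress(done: int, total: int) -> None:
    # Stand-in reporter (assumption): a real one would ship the pair
    # of integers back to the submit host.
    print(f"{done}/{total}")


def progress(
    items: Sequence[T],
    report: Callable[[int, int], None] = print_progress,
) -> Iterator[T]:
    """Yield each item, reporting (steps finished, total steps) after each one."""
    total = len(items)
    for index, item in enumerate(items):
        yield item
        # Runs once the caller has finished with `item` and asks for the next.
        report(index + 1, total)


def progress_range(
    n: int,
    report: Callable[[int, int], None] = print_progress,
) -> Iterator[int]:
    """Hypothetical counterpart of the range-style wrapper."""
    return progress(range(n), report)
```

With this, `for step in progress_range(10): ...` reports after every completed step, and the reporter receives two integers rather than a percentage, so the display is left to the user.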

@stsievert

export the API for condor_tail ... does that describe your use case?

Yeah, that's perfect. Viewing the last couple lines from the job would work, something along the lines of this:

(base) [stsievert@submit2 exp-cifar10]$ htmap tail foo 0
iteration 45 out of 100, loss: 0.015
iteration 46 out of 100, loss: 0.01
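
For a sense of the "last couple of lines" behavior, here is a minimal sketch on a plain local file. It is only an illustration: `tail_lines` is a hypothetical helper, and it reads a file already on the local disk rather than fetching output from a running job the way condor_tail does.

```python
from collections import deque
from pathlib import Path


def tail_lines(path, n=2):
    """Return the last n lines currently in the file at `path`."""
    with Path(path).open() as f:
        # A deque with maxlen keeps only the most recent n lines while reading.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```

Calling it again later returns whatever the last n lines are at that moment, which is all I'd need for debugging.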

use this from the CLI, or do you want to be able to retrieve live stdout programmatically?

My immediate use case is with the CLI: for debugging purposes, I only really care about the output of one job, and don't need the output of many jobs.

I can see an update_progress function being useful for monitoring progress programmatically, something of the form

def execute(N=100):
    for k in range(N):
        loss = ...
        htmap.update_progress(k, N, msg=f"iter={k}, loss={loss}")
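
One way the record-keeping could work, sketched under assumptions: each call overwrites a small JSON file, and the reader loads the latest record. Nothing here is htmap API; `update_progress`, `read_progress`, and the `progress.json` file name are hypothetical, and how the file would travel from the execute node to the submit host is left out entirely.

```python
import json
import os
from pathlib import Path

PROGRESS_FILE = Path("progress.json")  # hypothetical location


def update_progress(step, total, msg="", path=PROGRESS_FILE):
    """Overwrite the progress file with the latest (step, total, msg) record."""
    record = {"step": step, "total": total, "msg": msg}
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps(record))
    # Rename into place so a reader never sees a half-written file.
    os.replace(tmp, path)


def read_progress(path=PROGRESS_FILE):
    """Return the latest record as a dict, or None if nothing was reported yet."""
    path = Path(path)
    if not path.exists():
        return None
    return json.loads(path.read_text())
```

Keeping step and total as two integers and the message as free text leaves it up to the reader how to display them.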

@JoshKarpel JoshKarpel self-assigned this Apr 29, 2020