
UGE showing wrong wallclock and/or dispatch time? #199

Closed
johrstrom opened this issue Jun 10, 2020 · 8 comments · Fixed by #206

@johrstrom
Contributor

This discourse topic about UGE showing the wrong time remaining in a batch connect application came in today. It's assumed this is a bug in the adapter and not the dashboard.

Here's the output of UGE's 'qstat -r -xml' to help replicate the issue. I'd say we need a test case to replicate it, as I can't by hacking around in irb. Here's the relevant dashboard code that displays the time remaining.

Note that Time.now is taken into account in these calculations, so it should be stubbed to reflect a time later than the dispatch time but less than 4 hours past it.

@matthu017
Contributor

It depends on which method the code is calling in batch.rb:
https://github.com/OSC/ood_core/blob/master/lib/ood_core/job/adapters/sge/batch.rb#L59-L116
If it is calling qstat_xml_j_r_listener, wallclock_time simply doesn't exist, which may have caused the problem.

@matthu017
Contributor

We might want to combine this with issue #197

@johrstrom
Contributor Author

You're right that it's the JR listener, and 'qstat', '-r', '-xml', '-j', job_id.to_s seems to be the command.

But looking at the example output, it's under h_rt. Do we just need to extract it?

This can't go into native attributes, because things like the dashboard access the wall_time directly. They shouldn't have to toggle to do something strange for separate adapters. Native is for things that can't go everywhere, or it's a catch-all for things that don't fit. Wall time is one of those things all schedulers use (classic schedulers anyhow).

@matthu017
Contributor

My bad, it is qstat_xml_r where wallclock_time doesn't exist.
After looking at the JR listener, it should work as intended.

@johrstrom
Contributor Author

No issues, that's what the 'needs investigation' tag means: it's not clear why it's going on.

And looking at the spec file, this is supposed to work, so there could be something amiss in the way we parse or in the test case, or UGE is somehow different from SGE. I'm sure the discourse user would be happy to provide new output to test against. Maybe the first step would be just to copy this test case and run it against the UGE output.

Btw, you can get a trial version of UGE to test against here.

@ericfranz ericfranz modified the milestones: OOD1.8, OOD2.0 Jun 16, 2020
@johrstrom
Contributor Author

OK I got it. UGE's JAT_start_time is returning milliseconds past the epoch (13 digits), whereas SGE (and our tests) use seconds past the epoch (10 digits).

As a result, wallclock_time is being set to a very large negative value here.

def end_JAT_start_time
  @parsed_job[:status] = :running
  @parsed_job[:dispatch_time] = @current_text.to_i
  @parsed_job[:wallclock_time] = Time.now.to_i - @parsed_job[:dispatch_time]
end

To get the remaining time, we subtract this from the time limit, say an hour (3600), resulting in a very large positive number.

irb(main):041:0> wallclock = Time.now.to_i - 1592928425042
=> -1591335487093
irb(main):042:0> distance_of_time_in_words(3600 - wallclock, 0, false, :only => [:minutes, :hours], :accumulate_on => :hours)
=> "442037636 hours and 18 minutes"
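
The same arithmetic can be reproduced in plain Ruby without the Rails helper. This is only a sketch: `now` is not a live clock reading but the value implied by the irb output above (1592928425042 - 1591335487093), and the variable names are illustrative rather than taken from ood_core.

```ruby
# Values from the irb session above; `now` is derived from its output,
# not read from Time.now, so the result is reproducible.
now           = 1_592_937_949        # seconds past the epoch
dispatch_time = 1_592_928_425_042    # UGE's JAT_start_time, in milliseconds
time_limit    = 3600                 # one hour, in seconds

# Subtracting milliseconds from seconds gives a huge negative wallclock.
wallclock = now - dispatch_time
remaining = time_limit - wallclock

# Break the remaining seconds into whole hours and whole minutes.
hours, leftover = remaining.divmod(3600)
minutes = leftover / 60

puts "wallclock: #{wallclock}"
puts "remaining: #{hours} hours and #{minutes} minutes"
```

The hour and minute counts agree with the distance_of_time_in_words output shown above.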

@ericfranz
Contributor

I guess we could use a UGE image :-)

@ericfranz
Contributor

But with that description we can write a test.

ericfranz pushed a commit that referenced this issue Jul 2, 2020
Some Grid Engines (like UGE) use milliseconds where others use seconds past the epoch.

Fix for #199.
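
One way such a normalization could look is sketched below. This is an assumption for illustration, not the code from the linked pull request: `normalize_epoch` and `MILLISECOND_THRESHOLD` are hypothetical names, and the threshold is a heuristic chosen so that present-day seconds timestamps (10 digits) pass through while millisecond timestamps (13 digits) are divided down.

```ruby
# Hypothetical helper, not taken from ood_core: treat any epoch value too
# large to be a plausible seconds timestamp as milliseconds.
# 1e11 seconds is roughly the year 5138; 1e11 milliseconds is early 1973.
MILLISECOND_THRESHOLD = 100_000_000_000

def normalize_epoch(raw)
  value = raw.to_i
  value >= MILLISECOND_THRESHOLD ? value / 1000 : value
end

now      = 1_592_937_949                      # fixed "now", in seconds
dispatch = normalize_epoch('1592928425042')   # UGE-style milliseconds
puts now - dispatch                           # wallclock in seconds
```

With the UGE value from this issue, the wallclock comes out as a positive number of seconds under 4 hours. The heuristic would misread millisecond timestamps from before 1973, which should not matter for job dispatch times.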
@matthu017 matthu017 linked a pull request Jul 2, 2020 that will close this issue