Agent task spawning performance #341

Closed
mturilli opened this issue Sep 10, 2014 · 9 comments

@mturilli
Contributor

Evidence for the need to optimize task startup time:
[Image: 541049a420a64132fe7d6fe2]
The diagram shows 128 short tasks run by one of the two available pilots. Note that the pilot never reaches full utilization. A possible interpretation is: "the pilot manages to start ~40 CUs, then the first ones seem to die off and the pilot does not keep up with unit spawning". If confirmed, this is evidence for the need to optimize task spawning time within the agent.
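
The interpretation above can be illustrated with a toy model. This is only a sketch, not the RP agent code: it assumes a single serial spawner that needs `spawn_interval` seconds per unit and units that all run for `duration` seconds. The function name, the core count, and the numbers (0.5 s per spawn, 20 s per unit, 128 cores) are hypothetical values picked to show the effect, not measurements from the run in the diagram.

```python
def simulate(n_units, cores, spawn_interval, duration):
    """Toy model of a serial spawner (hypothetical, not the RP agent).

    Returns (peak_concurrency, makespan, utilization).
    """
    running = []        # finish times of units that are still executing
    t = 0.0             # current time of the spawner
    peak = 0
    makespan = 0.0
    for _ in range(n_units):
        # Launching one unit costs the spawner spawn_interval seconds.
        t += spawn_interval
        # Forget units that finished while the spawner was busy.
        running = [f for f in running if f > t]
        if len(running) >= cores:
            # No free core: wait for the earliest unit to finish.
            t = min(running)
            running = [f for f in running if f > t]
        finish = t + duration
        running.append(finish)
        peak = max(peak, len(running))
        makespan = max(makespan, finish)
    # Fraction of available core time that was spent running units.
    utilization = n_units * duration / (cores * makespan)
    return peak, makespan, utilization


if __name__ == "__main__":
    # Assumed numbers: 128 units, 128 cores, 0.5 s per spawn, 20 s per unit.
    peak, makespan, util = simulate(128, 128, 0.5, 20.0)
    print(f"peak concurrency: {peak}, makespan: {makespan}s, "
          f"utilization: {util:.2f}")
```

In this model the number of concurrently running units plateaus at roughly `duration / spawn_interval` (40 for the assumed numbers), regardless of how many cores the pilot holds. That matches the shape of the interpretation: once the first units finish, the spawner only replaces them and never fills the remaining cores.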

@marksantcroos
Contributor

Thanks! This looks similar to the data Andre presented, right?

Anyway, full utilisation is not an absolute goal. We need to get a better intuition of what we would consider acceptable performance for the application workloads that we want to support.

Startup time can be improved; I think that is the general consensus.

I guess we can use this ticket to discuss some possible improvements, but as said, I would also like to collect reasonable expectations from an application perspective.

@andre-merzky
Member

Mark, agree with all. The purpose of the ticket is indeed to collect data, and to discuss the target line...

Thanks!

@andre-merzky andre-merzky added this to the MS-X milestone Sep 10, 2014
@andre-merzky andre-merzky self-assigned this Sep 10, 2014
@mturilli
Contributor Author

Agreed on all.

Shall we keep a list somewhere of the user communities and type/sample of workloads that we already/need to support? I believe having something we could run ourselves would be very valuable for testing and profiling.

@mturilli
Contributor Author

We had an interesting conversation about this ticket with Shantenu. Here is a brief summary of the topics and ideas discussed:

  • The reference use cases should be the 4 or 5 patterns and related real-life workloads implemented in the radical.ensemblemd project.
  • RP seems to be getting mature enough to be tested by running the workloads served by radical.ensemblemd.
  • The goal of this run would be:
    • qualitative: the characteristics of the real-life workload are kept while only its scale is reduced. The idea is: we run a smaller version of the real deal, not a simplified version of the real deal at a small scale.
    • quantitative: we run the real-life workloads at the maximum scale we can with the current release. In this way we measure how far we are from the scalability end goals given to the radical.pilot project.
  • We use the data collected by running these tests to inform decisions on which optimizations to prioritize in the development roadmap of RP.

@andre-merzky
Member

I certainly agree with the above. Can we (as in RP) ask/task you (as in MD-folx) to take care of this, i.e. to gather those use cases and to derive explicit qualitative and quantitative RP requirements?

@andre-merzky
Member

ping to team MD to reply.

@andre-merzky andre-merzky modified the milestones: MS-X, MS.9 Oct 2, 2014
@andre-merzky
Member

ping to team MD to reply.

Otherwise the ticket can probably go, as the pilot start time (the specific ticket topic) has mostly been taken care of, and performance / utilization is sufficiently addressed / discussed in other contexts.

@andre-merzky
Member

ping to team MD to reply.

This is the final call before the ticket is closed. We'll raise the topic during the next pilot call, too.

@andre-merzky
Member

obsolete and superseded...
