Pass GPU diagnostics from worker to scheduler #2932

Merged: 2 commits merged on Aug 9, 2019

@mrocklin (Member) commented Aug 6, 2019

This does a few things:

1. Uses `pynvml` to collect information about any CUDA GPUs present
2. Optionally adds those metrics to the worker's initial handshake and heartbeats
3. Stores that information on the scheduler, in the `WorkerState` object
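A minimal sketch of step 1, assuming the `pynvml` bindings are installed and an NVIDIA driver is present. The function name `gpu_metrics` and the metric keys are hypothetical, not the names used in this PR. Returning an empty dict whenever `pynvml` or a GPU is unavailable means nothing extra gets added to the handshake or heartbeat on CPU-only machines.

```python
from typing import Any, Dict


def gpu_metrics() -> Dict[str, Any]:
    """Return {"gpu": [per-device metrics]} or {} if no GPU can be queried."""
    try:
        import pynvml  # optional dependency; absent on most CPU-only machines
    except ImportError:
        return {}
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # pynvml is installed, but no NVIDIA driver or GPU is available
        return {}
    try:
        devices = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            devices.append(
                {
                    "name": name,
                    "memory-used": mem.used,  # bytes
                    "memory-total": mem.total,  # bytes
                    "utilization": pynvml.nvmlDeviceGetUtilizationRates(handle).gpu,  # percent
                }
            )
        return {"gpu": devices}
    except pynvml.NVMLError:
        # Some devices do not support every query; report nothing rather than fail
        return {}
    finally:
        pynvml.nvmlShutdown()
```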

For now these metrics just sit in the scheduler's worker information,
but they might later be used for dashboards,
or possibly for scheduling decisions.
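One way to picture steps 2 and 3 on the scheduler side is sketched below. `WorkerStateSketch`, `SchedulerSketch`, `add_worker`, and `heartbeat_worker` are hypothetical names that only loosely mirror the description; they are not the scheduler's actual API. The point is that extra keys such as `"gpu"` are stored alongside the regular metrics without the scheduler needing to understand them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class WorkerStateSketch:
    """Hypothetical stand-in for the scheduler's per-worker record."""
    address: str
    metrics: Dict[str, Any] = field(default_factory=dict)


class SchedulerSketch:
    """Hypothetical scheduler that only tracks worker metrics."""

    def __init__(self) -> None:
        self.workers: Dict[str, WorkerStateSketch] = {}

    def add_worker(self, address: str, metrics: Dict[str, Any]) -> None:
        # Initial handshake: create the record, seeded with whatever metrics arrived
        self.workers[address] = WorkerStateSketch(address, dict(metrics))

    def heartbeat_worker(self, address: str, metrics: Dict[str, Any]) -> None:
        # Later heartbeats overwrite stale values; extra keys such as "gpu" are kept as-is
        self.workers[address].metrics.update(metrics)


scheduler = SchedulerSketch()
scheduler.add_worker("tcp://10.0.0.1:4000", {"cpu": 5.0})
scheduler.heartbeat_worker(
    "tcp://10.0.0.1:4000", {"cpu": 12.0, "gpu": [{"utilization": 40}]}
)
```

After the heartbeat, the worker's record holds both the updated CPU value and the GPU entry.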

I believe everything GPU-specific here is well separated
and general enough that others could follow this pattern to add
more diagnostics relatively easily, but it would be good to hear from
others on whether this is out of scope.
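The "follow this pattern to add more diagnostics" idea could look something like the registry below: each diagnostic is a zero-argument function returning a dict to merge into the heartbeat. `DIAGNOSTICS`, `register_diagnostic`, `collect_metrics`, and `fake_disk_metrics` are all hypothetical illustrations, not part of this PR.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: each entry returns a dict of metrics to merge into the
# heartbeat, or {} when it has nothing to report (e.g. no GPU present).
DIAGNOSTICS: List[Callable[[], Dict[str, Any]]] = []


def register_diagnostic(func: Callable[[], Dict[str, Any]]) -> Callable[[], Dict[str, Any]]:
    """Add a diagnostic function to the registry; usable as a decorator."""
    DIAGNOSTICS.append(func)
    return func


def collect_metrics(base: Dict[str, Any]) -> Dict[str, Any]:
    """Merge every registered diagnostic into a copy of the worker's base metrics."""
    result = dict(base)
    for func in DIAGNOSTICS:
        try:
            result.update(func())
        except Exception:
            # Design choice: a failing optional diagnostic never breaks the heartbeat
            continue
    return result


@register_diagnostic
def fake_disk_metrics() -> Dict[str, Any]:
    # Placeholder diagnostic for illustration only
    return {"disk": {"free-bytes": 1024}}
```

A GPU collector like the one sketched for step 1 would be registered the same way, and because it returns `{}` when no GPU is found, the heartbeat is unchanged in that case.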

@TomAugspurger (Member) commented:

In general, this kind of special or extra information seems fine, as long as it doesn't affect the common case where no GPU is present (and this one doesn't).


2 participants