
make HostDeviceVector single gpu only #4773

Merged: 29 commits merged into dmlc:master from single-gpu-hdv on Aug 25, 2019

Conversation

rongou (Contributor) commented Aug 14, 2019

Closes #4531 ([RFC] Remove support for single process multi-GPU)
Part of the 1.0.0 roadmap (#4680)

Finally got the CI to pass. I can do more cleanups, but thought you guys should take a look first.

@RAMitchell @trivialfis @sriramch

@rongou rongou changed the title from "[WIP] make HostDeviceVector single gpu only" to "make HostDeviceVector single gpu only" on Aug 21, 2019
trivialfis (Member):

@rongou I haven't looked into this in detail yet, but it seems GPUDistribution can be removed too?

@trivialfis trivialfis self-requested a review August 21, 2019 06:31
rongou (Contributor, Author) commented Aug 21, 2019

@trivialfis Yes, there is a whole bunch of stuff that can be removed or cleaned up. I didn't want to make this bigger than it already is, but I'll do some of the more obvious ones.

trivialfis (Member):

@rongou Nice. Please ping me when it's ready.

rongou (Contributor, Author) commented Aug 22, 2019 via email

@@ -352,40 +310,45 @@ class GPUPredictor : public xgboost::Predictor {
InitModel(model, tree_begin, tree_end);

size_t batch_offset = 0;
auto* preds = out_preds;
std::unique_ptr<HostDeviceVector<bst_float>> batch_preds{nullptr};
Contributor:

Instead of the dual copy:

  • copying the input prediction vector into a temporary one based on the batch size, and
  • copying the output prediction vector from the kernel back into out_preds,

is it not possible to:

  • slice out_preds based on the batch size; basically, get a device span from out_preds and take a subspan based on the batch size and batch offset, and
  • pass the subspan to predict internal as a pass-through to the kernel? (A sketch of this idea follows below.)
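
A minimal sketch of the proposed subspan approach. The accessor names DeviceSpan() and subspan() are assumptions based on xgboost's HostDeviceVector and common::Span, and PredictAllBatches with its commented-out PredictInternal call is hypothetical, not the PR's actual code:

#include <algorithm>  // std::min
#include <cstddef>    // std::size_t

#include "xgboost/host_device_vector.h"  // HostDeviceVector, bst_float
#include "xgboost/span.h"                // common::Span

namespace xgboost {

// Hypothetical driver: run prediction batch by batch, writing through
// device-side views of out_preds instead of a temporary buffer.
void PredictAllBatches(HostDeviceVector<bst_float>* out_preds,
                       std::size_t batch_size) {
  // One device-side view of the output buffer; no temporary
  // HostDeviceVector and no copy back into out_preds are needed.
  common::Span<bst_float> d_preds = out_preds->DeviceSpan();  // assumed accessor
  for (std::size_t offset = 0; offset < d_preds.size(); offset += batch_size) {
    std::size_t n = std::min<std::size_t>(batch_size, d_preds.size() - offset);
    auto batch = d_preds.subspan(offset, n);  // a view, not a copy
    // PredictInternal(batch, ...);  // pass the subspan straight to the kernel
  }
}

}  // namespace xgboost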

rongou (Contributor, Author) commented Aug 24, 2019

That'll assume the whole prediction vector will be on the device, right? I think the current approach doesn't put the whole vector on the device in external-memory mode, although I haven't 100% verified it.

Contributor:

Even if the device copy is prevented here, it'll be copied right back (soon after this) in the objective function while computing the gradients, or later when the prediction cache is updated during training.
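
For context on this point: HostDeviceVector transfers its data lazily on first access from the other side, so skipping the device copy in the predictor only postpones it. A hedged illustration, where ComputeGradients is a stand-in for an objective function and ConstDeviceSpan() an assumed accessor name, not xgboost's actual signature:

#include "xgboost/host_device_vector.h"  // HostDeviceVector, bst_float

namespace xgboost {

// Stand-in for an objective's gradient computation.
void ComputeGradients(const HostDeviceVector<bst_float>& preds) {
  // If the predictions currently live on the host, this first
  // device-side access copies the whole vector to the GPU anyway.
  auto d_preds = preds.ConstDeviceSpan();  // assumed accessor
  // ... launch the gradient kernel over d_preds ...
}

}  // namespace xgboost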

rongou (Contributor, Author):

Sounds reasonable. I'll send a follow-up PR.

rongou (Contributor, Author) commented Aug 24, 2019

@RAMitchell @trivialfis @sriramch I think this is as far as this PR will go. Please take another look.

I still want to remove the sharding in updater_gpu_hist.cu, but it seems a bit more involved, so I'll probably leave it to a follow-up PR.

trivialfis (Member) left a comment

Awesome! This will allow us to focus more on the algorithms instead of data distribution.

@RAMitchell RAMitchell merged commit 38ab79f into dmlc:master Aug 25, 2019
@rongou rongou deleted the single-gpu-hdv branch September 19, 2019 23:09
@lock lock bot locked as resolved and limited conversation to collaborators Dec 19, 2019