Move device lock before the execution instead of tensor gathering #3457
Conversation
Regarding this, I think the behavior is OK/correct but the naming could be improved so it is more obvious from the API names what is happening.
LGTM, although I am also waiting to hear your results to confirm correctness/performance across your benchmark workloads
@wconstab @desertfire CPU and GPU both failed at test
@desertfire the failure was coming from the OpByOp executor; I fixed it by adding the barrier. However, there is one change I think we need to apply to the LTC code: we need to check if
Makes sense. Let me update the LTC change.
@desertfire I found a bug that I don't have a very good solution for, in Line 1290 in 81f0845
This step is to make sure Line 1370 in 81f0845
Hmm, the LTC behavior diverges here. If this change doesn't work for XLA, we can either add an option for control or derive a subclass of LazyGraphExecutor. Given how small this change is, I would say add an option and XLA can override its value. What do you think?
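For illustration only, here is a minimal sketch of the "add an option and let XLA override its value" idea; the class and method names below are hypothetical placeholders, not the actual torch::lazy API.

// Hypothetical sketch: the base executor exposes a virtual knob with a
// default for the TS backend, and the XLA backend derives a subclass to
// flip it. Names are illustrative only.
class LazyGraphExecutorSketch {
 public:
  virtual ~LazyGraphExecutorSketch() = default;
  // Placeholder name for whichever behavior diverges between backends.
  virtual bool ShouldCheckDataHandles() const { return true; }
};

class XlaGraphExecutorSketch : public LazyGraphExecutorSketch {
 public:
  // XLA placeholders do not carry a valid handle yet, so XLA opts out.
  bool ShouldCheckDataHandles() const override { return false; }
};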
Well, we really want a solution for XLA since AWS is the first reporter of this issue. (Even though you're right that we could land it for TS only and it would strictly be an improvement).
Hmm, you don't have a way, before launching the computation, to know how many data handles will be returned, or whether they will be unique?
@desertfire the placeholder does not have a valid handle, for XLA at least; it will get filled later. The issue here is that RunPostOrder wants to access the real handle.
Let me run some experiments. I think when two graphs do not have any dependency, we can run, for example
@wconstab the graph is being run in a post-order fashion; the same tensor might be the input to multiple results. For example
In this case a will show up in the graph twice, I think; I need to verify. I think checking the device handle is the easiest way, since if the device handles are the same, we know the tensors are the same (from the runtime's point of view). I don't know if there is ever a case where two tensors have the same handle. If this is never the case, we can probably do the dedup at the Tensor/Data level, not at the handle level.
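As a rough illustration of what handle-level dedup could look like, here is a sketch with a placeholder BackendDataSketch type; it is not the actual LTC/XLA class, just the pattern of keeping the first occurrence of each runtime handle so a tensor that feeds multiple results is passed to the computation once.

#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Stand-in for a backend data object exposing a device handle.
struct BackendDataSketch {
  int64_t handle = 0;
};
using DataPtr = std::shared_ptr<BackendDataSketch>;

// Keep only the first occurrence of each handle.
std::vector<DataPtr> DedupByHandle(const std::vector<DataPtr>& inputs) {
  std::unordered_set<int64_t> seen;
  std::vector<DataPtr> unique;
  for (const DataPtr& data : inputs) {
    if (seen.insert(data->handle).second) {
      unique.push_back(data);
    }
  }
  return unique;
}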
Update: I am trying the above test scenarios from @JackCaoG + the tests in
It turns out this implementation leads to a couple of segfault scenarios.
[WIP]
It turns out when
The current implementation calls
The
@miladm just catching up, sounds like you fixed at least one of the major failures. Is there still something else failing, or is it just a matter of cleaning up the PR at this point?
To compare the performance gain of this implementation, I ran it against ResNet50 and MNIST models. Neither model shows performance gains compared to the current PyTorch/XLA
In light of the above observation, we can consider dropping the conditional barriers and replacing them with an implementation that doesn't require dependency tracking via
I think this is an incorrect assumption. It is possible that this
Yup, I will extract the profile to confirm if the same performance observation made on GPU (here) holds true for TPU devices.
The profiling numbers for this branch and
I was not able to capture the ExecuteComputation diagram as it was plotted many rows away from this profile row. Though, as expected, it started immediately after
Are you concluding that the bug/fix is not observable on TPU because of this relative time difference, and therefore we have to test the fix on GPU? Or what?
You are saying that on TPU, 'ExecuteComputation' happens back to back both with and without the fix? I think this indicates that for some reason the bug was never present on TPU, but I can't confirm this from the screenshot, since I only see the second 'ExecuteComputation', not the previous one.
Update on the results shared above:
To verify this PR gives the intended performance gain + respond to @will-cromar's questions above:
As intended, the conditional barrier enables an
Summary: cherry-picking pytorch/xla#3457 Pull Request resolved: #76974 Approved by: https://github.com/wconstab, https://github.com/JackCaoG Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/36150c63a777a7cc854006d5d4bcef011a71671a Reviewed By: malfet, mikekgfb Differential Revision: D36269299 Pulled By: desertfire fbshipit-source-id: a2a793cfdc4538ed6713d73d30015dd9c3c0a841
This seems related to s4tf/s4tf#14, which I am trying to fix. Swift for TensorFlow and PyTorch share the same code base for LazyTensor/XLA execution. The pytorch/xla directory
Do you have any insight into why the crash might have happened? I'm not an expert on XLA. Even if this patch resolves the crash, the crash could pop up somewhere else.
@philipturner This might be a bit tricky to debug. I don't see why tf would affect
xla/third_party/xla_client/xrt_computation_client.h Lines 148 to 150 in f0efaf5
which is used as a placeholder in
Lines 1405 to 1408 in f0efaf5
but never gets an assignment in
Lines 1440 to 1445 in f0efaf5
That being said, this part of the logic is purely lazy tensor and XRT related, so I'm not sure why tf would affect this. This might also be a race condition.
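To tie the three referenced snippets together, here is a conceptual sketch of the lifecycle being described; DataSketch is a simplified stand-in for XrtData / ComputationClient::Data, not the actual xla_client code.

#include <cstdint>
#include <memory>

struct DataSketch {
  std::shared_ptr<int64_t> handle_ptr;  // null until the execution completes
  bool HasValue() const { return handle_ptr != nullptr; }
  void Assign(const DataSketch& result) { handle_ptr = result.handle_ptr; }
};

int main() {
  // 1) CreateDataPlaceholder: the data starts out with a null handle.
  DataSketch placeholder;                              // HasValue() == false

  // 2) The placeholder is handed out as the tensor's device data while the
  //    computation runs asynchronously.

  // 3) When the execution finishes, the real result is assigned into the
  //    placeholder; only then does HasValue() become true. If this assignment
  //    never happens, any later HasValue() check fails, which matches the
  //    crash being discussed.
  DataSketch result;
  result.handle_ptr = std::make_shared<int64_t>(42);   // pretend device handle
  placeholder.Assign(result);                          // HasValue() == true
  return 0;
}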
I synchronized S4TF's
I located the third section of code you mentioned above. It's inside
for (size_t i = 0; i < results.size(); ++i) {
  TF_VLOG(0) << "Starter ============";
  if (async->tensors_data[i] != nullptr) {
    TF_VLOG(0) << "Case 1";
    if (async->tensors_data[i]->HasValue()) {
      TF_VLOG(0) << "Case 1.1";
    } else {
      TF_VLOG(0) << "Case 1.2";
    }
  } else {
    TF_VLOG(0) << "Case 2";
    // tensors_data[i] is null in this branch, so check the incoming result
    // instead of dereferencing it.
    if (results[i] != nullptr && results[i]->HasValue()) {
      TF_VLOG(0) << "Case 2.1";
    } else {
      TF_VLOG(0) << "Case 2.2";
    }
  }
  if (async->tensors_data[i] != nullptr) {
    async->tensors_data[i]->Assign(*results[i]);
  } else {
    async->tensors_data[i] = std::move(results[i]);
  }
}
This is all I found in the tutorial, up to the point where it crashed. It's "Case 1.2" every time - not a single 1.1, 2.1, or 2.2. That means the tensor data exists but never receives a value.
For the first/second sections of code you mentioned, we initialize data a bit differently. But the overall result is the same, creating a
Code section 2:
xla_data = xla::GetX10Device(tensor_device)
               ->CreateDataPlaceholder(std::move(shape));
tensor.SetXlaData(xla_data, config.sync_xla_data);
class XrtComputationClient::XrtDevice : public ComputationClient::Device {
 public:
  ...
  DataPtr CreateDataPlaceholder(Shape shape) override {
    return std::make_shared<XrtData>(this, std::move(shape));
  }
};
S4TF version of your first code section from the previous comment:
struct XrtData : public Data {
  XrtData(Device* device, Shape device_shape)
      : Data(device, std::move(device_shape)),
        handle_ptr(nullptr) {}
  XrtData(XrtDevice* device, Shape device_shape, int64_t handle);
  XrtData(XrtDevice* device, Shape device_shape, XrtHandlePtr handle);
  ...
};
The Swift test
I'll see if I can undo most of the refactoring done by the old S4TF team, which should bring our implementation closer in line with yours. That will also allow more features from PyTorch to be integrated into S4TF, such as setting the XLA RNG seed asynchronously. Most importantly, it will be easier to debug because our source code differs a lot less.
Here's a more precise look at the crash site.
xla::ComputationClient::DataPtr XLATensor::GetXlaData() {
  bool up_to_date = true;
  // My additions to existing code for the sake of debugging begin here.
  xla::ComputationClient::DataPtr xla_data = CurrentXlaData();
  if ((xla_data != nullptr) && (xla_data->HasValue() == false)) {
    TF_VLOG(0) << "Encountered something that would cause the crash.";
    ApplyPendingGraph();
  }
  // My additions to existing code for the sake of debugging end here.
  ir::Value ir_value;
  if (up_to_date) {
    xla::ComputationClient::DataPtr xla_data = CurrentXlaData();
    if (xla_data != nullptr) {
      // This still fails, even though I applied the pending graph.
      XLA_CHECK(xla_data->HasValue())
          << "Trying to access XLA data while an async operation is in flight: "
          << xla_data->shape();
      return xla_data;
    }
  }
  if (ir_value) {
    AssignIrValue(std::move(ir_value));
  }
  if (data()->ir_value) {
    ApplyPendingGraph();
  } else {
    XLA_CHECK(data()->tensor_data);
    data()->xla_data = TensorToXlaData(*data()->tensor_data, GetDevice());
  }
  return data()->xla_data;
}
Terminal output
Repository branch: philipturner/s4tf:fix-xla-bug for s4tf/s4tf#19.
void XLATensor::ApplyPendingGraph() {
  DeviceBarrier(GetDevice());
  // This method is called to ensure that the tensor data is available on
  // device, so that a call to CurrentXlaData() returns a valid pointer.
  if (CurrentXlaData() == nullptr) {
    std::vector<XLATensor> tensors({*this});
    SyncTensorsGraph(&tensors, {}, /*wait=*/true, /*sync_xla_data=*/false);
  }
}
This PR ports pytorch/pytorch#74864 to pytorch/xla and should help #3434. This will need some more testing, but the concept seems right to me.
I think this optimization is based on the assumption that even if the previous execution has not finished, we still truncate the pending IR of the tensor to become a BackendData with the correct shape. The actual handle assignment to the BackendData won't happen until the previous execution finishes, but that is OK since we only need the true data handle in the next execution.
There is a catch: the placeholder and IR truncating only happen when force_xla_data is true. AFAICT, GetTensors will have force_xla_data == false and SyncTensors will have force_xla_data == true. I think this is fine. If we consider a case of
we expect print not to change the IR of c anyway, so there is no need to wait for that execution to finish before we finalize our second graph.
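For what it's worth, here is a minimal sketch of the idea described above, using simplified stand-in types (PlaceholderData and LazyTensorSketch are illustrative, not the actual LTC classes): on the sync path the pending IR is cut over to a shape-only placeholder before the execution is launched, and a later graph can be traced against that placeholder while the real handle is still in flight.

#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct PlaceholderData {
  std::vector<int64_t> shape;  // known at graph-building time
  int64_t handle = -1;         // filled once the previous execution finishes
  bool HasValue() const { return handle >= 0; }
};
using PlaceholderPtr = std::shared_ptr<PlaceholderData>;

struct LazyTensorSketch {
  PlaceholderPtr data;         // stands in for the tensor's BackendData
  bool has_pending_ir = true;  // stands in for the pending IR value
};

// Sync path (force_xla_data == true): cut the pending IR over to a placeholder
// with the correct shape before the execution is launched. A later graph only
// needs this placeholder object as an input; the real handle can arrive later.
void TruncateToPlaceholder(LazyTensorSketch& t, std::vector<int64_t> shape) {
  t.data = std::make_shared<PlaceholderData>();
  t.data->shape = std::move(shape);
  t.has_pending_ir = false;
}

// Called by the async execution once it completes.
void AssignHandle(const PlaceholderPtr& placeholder, int64_t real_handle) {
  placeholder->handle = real_handle;
}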