[BugFix] Make program thread-local to support multi-threading #338
Conversation
ef28b31 to c378481
@yzh119 This PR should fix the multi-GPU issue. Can you double-check?
Thanks, the PR solved my problem.
Lingfan, could you run treelstm on multi gpu to see the improvement?
Well, I find that enabling multi-GPU currently does not speed up training; the number of active GPUs is always one during training.
@yzh119 Sounds weird. When I ran your transformer on a 4-GPU instance, all GPUs were active but with low utilization (<25%).
LGTM. The performance issue seems to be about multi-threading itself, which is not the purpose of this PR.
Description
DGLGraph uses one global execution plan / program, which leads to data races when multi-threading (e.g. PyTorch DataParallel) is used. This PR fixes the bug by making the schedule a threading.local object. (#302)
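The fix relies on Python's `threading.local`, which gives each thread its own copy of an attribute instead of one shared global. The sketch below illustrates that pattern in isolation; `Scheduler`, `get_program`, and `worker` are hypothetical names for illustration, not DGL's actual classes or API.

```python
import threading

class Scheduler:
    """Sketch of per-thread program storage: concurrent threads
    (e.g. DataParallel replicas) no longer race on one shared program."""

    def __init__(self):
        # threading.local() gives each thread an independent namespace;
        # attributes set on it in one thread are invisible to others.
        self._local = threading.local()

    def get_program(self):
        # Lazily create a fresh program for the calling thread.
        if not hasattr(self._local, "program"):
            self._local.program = []
        return self._local.program

def worker(sched, results, idx):
    prog = sched.get_program()
    prog.append(idx)          # mutation stays private to this thread
    results[idx] = list(prog)

sched = Scheduler()
results = {}
threads = [threading.Thread(target=worker, args=(sched, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Each thread observed only its own appends, so results[i] == [i]
# rather than one list shared (and corrupted) across threads.
print(results)
```

With a plain global list, the four workers would all append into the same object and each would see the others' entries, which is exactly the data race the PR removes.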
Checklist
- Related examples are either not affected by this change, or have been fixed to be compatible with this change
Changes