Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic timeline #2215

Merged
merged 46 commits into from
Sep 15, 2020
Merged

Dynamic timeline #2215

merged 46 commits into from
Sep 15, 2020

Conversation

Vikas-kum
Copy link
Contributor

@Vikas-kum Vikas-kum commented Aug 28, 2020

This PR
1)introduces a new Python API for starting and stopping the timeline:

hvd.start_timeline('/path/to/timeline.json', mark_cycles=True)
...
hvd.stop_timeline()
  1. Makes sure that timeline json file is valid json.
  2. Including starttime as metadata in timeline.

TODO
1/ thread detach() re-introduce.

Fixes #2008.
This is followup from this PR: #2129 and addresses concurrent access.

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes # (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@tgaddair
Copy link
Collaborator

Nice work, @Vikas-kum! Thanks for picking this up. I'll take a look today or tomorrow.

Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! A few questions.

horovod/common/timeline.cc Show resolved Hide resolved
horovod/common/timeline.cc Outdated Show resolved Hide resolved
} else {
// last event closed the json ']' , need to seek to one position back and write ',' to continue
long pos = file_.tellp();
file_.seekp (pos-1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any performance penalty to doing this? Timeline already introduces a lot of overhead, especially when using markers. Can you test it with synthetic benchmark with and without this op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think it should cause any perf penalty, given that we are already going to write bytes from next position.
Would you be able to help me with quickly running a small job and verifying on this PR as I will ahve to go through setups, which i dont currently have. :) I know I am asking for a favor but will appreciate if you can help.

test/test_torch.py Outdated Show resolved Hide resolved
test/test_torch.py Outdated Show resolved Hide resolved
test/test_torch.py Show resolved Hide resolved
test/test_torch.py Outdated Show resolved Hide resolved
horovod/common/operations.cc Outdated Show resolved Hide resolved
@Vikas-kum
Copy link
Contributor Author

Thanks @tgaddair for review. Will update PR soon addressing comments.

@Vikas-kum
Copy link
Contributor Author

@tgaddair Gentle reminder. Would you be take another round at this.

Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about, I believe your approach here makes sense then what I was trying to achieve in my last couple of commits to #2129, as we keep all the state modification in the RunLoopOnce, which avoids race conditions.

The one thing I would still like to see is to make these calls blocking so the user doesn't need to sleep after each call (otherwise, they could miss allreduce information on the timeline when starting, or the end of the timeline if they stop and then exit the program). Does that sound reasonable to you?

@Vikas-kum
Copy link
Contributor Author

Vikas-kum commented Sep 10, 2020

The one thing I would still like to see is to make these calls blocking so the user doesn't need to sleep after each call (otherwise, they could miss allreduce information on the timeline when starting, or the end of the timeline if they stop and then exit the program). Does that sound reasonable to you?

I took this approach because it simplifies the implementation quite a bit and also makes sure that locks are held for much smaller and confined scope.
We can make stop timeline blocking, but even then we would need sleep in test code to ensure that events are written.
Say you start_timeline

Start takes into effect only when BGThread goes to next iteration. To make sure of correctness test checks for cycle markers and events getting captured we would need to wait for at least 2 full cycle, so that event and cycle markers are indeed written.

hvd.start_Timeline()--> (sleep here in test code makes sure that timeline is started before next line test code)starts takes in to effect --> allreduce (event will be captured) --> events are written and cycles are written to queue --> stop_timeline(No more events)--> WriterThread dump any existing events to a file.

Somehow, I don't feel strongly about the idea of blocking call because it blocks my training script completely. I do see a side effect of having non-blocking calls and that is actual events may be written in next cycle of thread. But at this point, I expect user would be calling start_timeline to stop_timeline for sufficient gap which gives them a good picture of what is happening. If user wants to block, they can put this kind of sleep 1s(and know that they are knowingly blocking their script) in their code to make sure that events get dumped.

For this reason, I would like to have this non-blocking. Please let me know if you have strong feeling about blocking call and I can make stop timeline blocking.

@Vikas-kum Vikas-kum force-pushed the dynamic-timeline branch 3 times, most recently from 6365cce to 896dca7 Compare September 12, 2020 02:02
@Vikas-kum Vikas-kum closed this Sep 12, 2020
@Vikas-kum Vikas-kum reopened this Sep 12, 2020
@Vikas-kum Vikas-kum force-pushed the dynamic-timeline branch 2 times, most recently from 3bb245f to 4613b7a Compare September 13, 2020 03:01
tgaddair and others added 12 commits September 12, 2020 20:02
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Travis Addair <taddair@uber.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Adding mroe test

Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
remove debug statement
addressed comments

Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
Signed-off-by: Vikas Kumar <vikumar@amazon.com>
@tgaddair
Copy link
Collaborator

Thanks for making these changes, @Vikas-kum!

Last thing I would ask: can run clang-format from the root of the project to make sure code style aligns with the rest of the code base?

If tests pass after that, we should be good to land :)

@Vikas-kum
Copy link
Contributor Author

Vikas-kum commented Sep 15, 2020

Thanks for making these changes, @Vikas-kum!

Last thing I would ask: can run clang-format from the root of the project to make sure code style aligns with the rest of the code base?

If tests pass after that, we should be good to land :)

thanks @tgaddair , I ran clang-format for only timeline.h timeline.cc
I see running clang-format for other files introduces a lot of formatting changes, unrelated to this PR(See here if I run clang-format on horovod-common dir https://github.com/horovod/horovod/pull/2278/files). So, I would suggest that we should have different PR just with clang-format run. WDYT?

Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will take a look at the follow-up PR!

@tgaddair tgaddair merged commit a7de4ae into horovod:master Sep 15, 2020
@Vikas-kum
Copy link
Contributor Author

Sounds good, will take a look at the follow-up PR!

Thanks. Created this PR: https://github.com/horovod/horovod/pull/2281/files ,
Ran only for horovod common to keep the PR manageable for review.

@mengxr
Copy link
Contributor

mengxr commented Oct 6, 2021

@Vikas-kum We found an issue introduced by this commit. Our distributed filesystem doesn't support random writes (including re-open and append). It used to work fine with horovod timeline because it only create, writes the data sequentially, and then close. I wonder if we can keep the old behavior if users do not use the dynamic timeline feature. Any pointers on what we should change?

@danzafar
Copy link

Any next steps on this @Vikas-kum ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Is there a way to dynamically turn off/on the timeline without killing application ?
4 participants