
[Hexagon] Asynchronous DMA support #12411

Merged · 11 commits · Aug 26, 2022
Conversation

adstraw (Contributor) commented Aug 12, 2022

Adds asynchronous DMA support through the Hexagon User DMA engine, with unit tests to validate basic functionality. Asynchronous DMA support here means the ability to asynchronously "kick off" a number of DMAs using the Copy API and then to Poll for or Wait on a number of "in flight" (not done) DMAs. This enables future testing and development for asynchronous memory copy on Hexagon. For now, Hexagon DMA support remains synchronous in nature through the existing hexagon_user_dma_1d_sync interface, which uses the asynchronous-capable HexagonUserDMA class in a synchronous way --- calling Copy and Wait back to back for each request.

cc @mehrdadh
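
To make the Copy / Poll / Wait pattern concrete, here is a toy sketch; the class name and signatures below are hypothetical stand-ins rather than the actual TVM interface, and the "DMA" is modeled with memcpy:

```cpp
#include <cstdint>
#include <cstring>
#include <deque>
#include <functional>

// Toy stand-in for the DMA engine, used only to illustrate the Copy / Poll / Wait
// pattern described above; names and signatures are assumptions.
class ToyUserDMA {
 public:
  // "Kick off" one copy; a real engine would queue a DMA descriptor instead.
  int Copy(void* dst, const void* src, uint32_t length) {
    pending_.emplace_back([=] { std::memcpy(dst, src, length); });
    return 0;
  }
  // Number of copies still "in flight" (not yet done).
  uint32_t Poll() const { return static_cast<uint32_t>(pending_.size()); }
  // Block until at most `max_in_flight` copies remain in flight.
  void Wait(uint32_t max_in_flight) {
    while (pending_.size() > max_in_flight) {
      pending_.front()();  // toy model: "complete" the oldest copy
      pending_.pop_front();
    }
  }

 private:
  std::deque<std::function<void()>> pending_;
};

// Asynchronous use: queue several copies, then drain them all.
void AsyncExample(ToyUserDMA& dma, void* dst, const void* src, uint32_t chunk, int n) {
  for (int i = 0; i < n; ++i) {
    dma.Copy(static_cast<char*>(dst) + i * chunk,
             static_cast<const char*>(src) + i * chunk, chunk);
  }
  dma.Wait(0);  // wait until zero copies remain in flight
}

// Synchronous use, in the spirit of hexagon_user_dma_1d_sync: Copy then Wait per request.
void SyncExample(ToyUserDMA& dma, void* dst, const void* src, uint32_t length) {
  dma.Copy(dst, src, length);
  dma.Wait(0);
}
```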

@adstraw adstraw force-pushed the straw-hex-async-dma branch 2 times, most recently from 72be034 to cdbc2be on August 16, 2022 16:19
@adstraw adstraw marked this pull request as ready for review August 17, 2022 12:21
@github-actions github-actions bot requested a review from mehrdadh August 17, 2022 12:23
adstraw (Contributor Author) commented Aug 17, 2022

Attention @arangasa, please review. Are the #if __HEXAGON_ARCH__ >= 68 guards that this PR removes necessary? How do we want the HexagonUserDMA class to behave on earlier architectures?

arangasa (Contributor)

> Attention @arangasa, please review. Are the #if __HEXAGON_ARCH__ >= 68 guards that this PR removes necessary? How do we want the HexagonUserDMA class to behave on earlier architectures?

Hi Krzysztof @kparzysz-quic, could you please advise Adam @adstraw? Thank you.

kparzysz-quic (Contributor)

The guards are necessary: earlier architectures don't have DMA, and if a program attempts to execute a DMA instruction, it will trigger a hardware exception.

For earlier architectures we could just do a synchronous memcpy.

adstraw (Contributor Author) commented Aug 17, 2022

@kparzysz-quic and @arangasa I put the Hexagon arch check around the calls to the 1D sync DMA flow in the Hexagon Device API and also in the HexagonBuffer class. If we are on a device that does not support DMA, we simply memcpy instead. This keeps the arch check out of the HexagonUserDMA class, which I believe is better. I believe this PR is ready for review.
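
For illustration, a minimal sketch of that fallback, assuming a guard and call shape roughly like the following (the exact TVM code and the hexagon_user_dma_1d_sync signature may differ):

```cpp
#include <cstring>

// Copy helper with an architecture guard: DMA-capable targets use the 1D sync
// DMA flow, older targets fall back to a plain synchronous memcpy.
void CopyWithFallback(void* dst, void* src, unsigned length) {
#if __HEXAGON_ARCH__ >= 68
  // DMA-capable target: use the synchronous DMA helper from the runtime.
  hexagon_user_dma_1d_sync(dst, src, length);
#else
  // Older targets (e.g. v66) have no user DMA engine; a DMA instruction would
  // raise a hardware exception, so just do a synchronous memcpy instead.
  std::memcpy(dst, src, length);
#endif
}
```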

@github-actions github-actions bot requested a review from mehrdadh August 25, 2022 05:08

Contributor

It occurred to me that if we maintain only a single command queue / descriptor chain over the lifetime of a program, a DMA descriptor that was complete may eventually be "resurrected" by the ring buffer as a new descriptor. E.g.

Ringbuffer[size=4]

Initial state:
desc1@0[finished] , desc2@1[finished] , desc3@2[finished] , desc4@3[finished]

New DMA needed:
desc5@0[inflight] , desc2@1[finished] , desc3@2[finished] , desc4@3[finished] , desc5@0[inflight]

With the last descriptor occupying the same memory space as the first.

Do you think this will be problematic? An integration test with the UserDMA class could reassure us.

Contributor Author

I would change your example slightly to make the descriptor IDs zero-based, which is how it works in the code. The first descriptor has ID=0 and uses ring buffer slot 0. The second has ID=1 and uses slot 1. When the ID equals the ring buffer size, we wrap around and reuse slot 0, and so on.

Ringbuffer[size=4]

Initial state:
desc0@0[finished] , desc1@1[finished] , desc2@2[finished] , desc3@3[finished]

New DMA needed:
desc4@0[inflight] , desc1@1[finished] , desc2@2[finished] , desc3@3[finished]

The case you have described above should work OK, and I believe we can cover it with RingBuffer unit tests. Existing tests already hit this case to some degree, but I will add a test case to explicitly cover this scenario.

When we go to add desc4, we first check whether the ring buffer is full by checking that the InFlight count (0) is less than the size of the ring buffer (4). All good. We then GetAddr of the "next" ID = 4, which reuses ring buffer slot 0, increment the "next" ID to 5, and return a pointer to ring buffer slot 0.
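
A rough model of that ID-to-slot arithmetic (hypothetical names and details; the real RingBuffer in the PR differs):

```cpp
#include <cstdint>
#include <vector>

// Toy ring buffer: IDs grow monotonically and map to slots via ID % size,
// so with size 4, ID 4 wraps back to slot 0 as in the example above.
template <typename T>
class ToyRingBuffer {
 public:
  explicit ToyRingBuffer(uint32_t size) : size_(size), slots_(size) {}

  // Number of IDs handed out but not yet retired.
  uint32_t InFlight() const { return next_id_ - oldest_id_; }

  // Hand out the next slot, or nullptr if the ring buffer is full.
  T* Next() {
    if (InFlight() == size_) return nullptr;
    T* slot = &slots_[next_id_ % size_];
    ++next_id_;
    return slot;
  }

  // Retire the oldest in-flight entry (e.g. once its DMA is done).
  void Retire() {
    if (oldest_id_ < next_id_) ++oldest_id_;  // never move oldest past next
  }

 private:
  uint32_t size_;
  std::vector<T> slots_;
  uint32_t next_id_{0};
  uint32_t oldest_id_{0};
};
```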

There is one issue with this flow that I thought of: the DMA descriptor at ring buffer slot 0 is "finished" and we simply return a pointer to it, expecting the user to add an "inflight" DMA descriptor at that location. If the user were to call InFlight before adding the "inflight" descriptor, ring buffer slot 0 would be treated as "finished" from the perspective of the RingBuffer class while it is still "inflight" from the perspective of the user.

To get around this issue, I had initially coded an Add function which took a function pointer that would add a T to the RingBuffer, rather than Next. It felt too complicated, but it does expose a gap.
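
For reference, that Add-with-callback shape might look roughly like this (a hypothetical sketch, not code from the PR):

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Alternative API sketch: the caller passes a writer callback, so the slot is
// already filled with an "inflight" descriptor before the ring buffer counts
// it, closing the gap between Next() and the user's write.
template <typename T>
class ToyRingBufferAdd {
 public:
  explicit ToyRingBufferAdd(uint32_t size) : size_(size), slots_(size) {}

  bool Add(const std::function<void(T*)>& fill) {
    if (next_id_ - oldest_id_ == size_) return false;  // ring buffer full
    fill(&slots_[next_id_ % size_]);  // slot is consistent before Add returns
    ++next_id_;
    return true;
  }

 private:
  uint32_t size_;
  std::vector<T> slots_;
  uint32_t next_id_{0};
  uint32_t oldest_id_{0};
};
```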

Contributor Author

By the way, it's interesting you were thinking about corner cases because I woke up thinking about the Next vs. Add API as described above. And, I also realized there is a bug in the InFlight method where we might increment the "oldest" ID beyond the "next" ID. Will add a test case that can hit that bug and fix it in the next commit.
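
A hedged sketch of the kind of guard such a fix implies (not the actual code): when retiring finished descriptors, the "oldest" ID must stop at the "next" ID.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: advance `oldest_id` past finished descriptors, but never
// beyond `next_id`, then report how many descriptors remain in flight.
uint32_t InFlightSketch(uint32_t& oldest_id, uint32_t next_id,
                        const std::vector<bool>& finished, uint32_t size) {
  while (oldest_id < next_id && finished[oldest_id % size]) {
    ++oldest_id;  // retire finished descriptors, stopping at next_id
  }
  return next_id - oldest_id;
}
```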

Contributor Author

> By the way, it's interesting you were thinking about corner cases because I woke up thinking about the Next vs. Add API as described above. And, I also realized there is a bug in the InFlight method where we might increment the "oldest" ID beyond the "next" ID. Will add a test case that can hit that bug and fix it in the next commit.

This is now fixed.

> There is one issue with this flow that I thought of: the DMA descriptor at ring buffer slot 0 is "finished" and we simply return a pointer to it, expecting the user to add an "inflight" DMA descriptor at that location. If the user were to call InFlight before adding the "inflight" descriptor, ring buffer slot 0 would be treated as "finished" from the perspective of the RingBuffer class while it is still "inflight" from the perspective of the user.

I decided to leave this as-is and encoded this corner case in a test called wrap_corner.
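
The wrap-around corner might be exercised with a test shaped roughly like this (a sketch using GoogleTest and the ToyRingBuffer model from the earlier sketch; the real wrap_corner test in the PR likely differs):

```cpp
#include <gtest/gtest.h>
// Assumes the ToyRingBuffer sketch from earlier in this thread is available.

TEST(ToyRingBuffer, WrapCorner) {
  ToyRingBuffer<int> ring(4);
  int* first = ring.Next();  // ID 0 -> slot 0
  ASSERT_NE(first, nullptr);
  for (int i = 1; i < 4; ++i) ASSERT_NE(ring.Next(), nullptr);  // IDs 1..3
  for (int i = 0; i < 4; ++i) ring.Retire();                    // all finished
  EXPECT_EQ(ring.InFlight(), 0u);
  // ID 4 wraps to slot 0: the pointer handed out is the same memory as the
  // very first descriptor, which is exactly the corner case described above.
  EXPECT_EQ(ring.Next(), first);
  EXPECT_EQ(ring.InFlight(), 1u);
}
```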

Contributor Author

I talked with @csullivan, and his point was less about the RingBuffer class and more about the HexagonUserDMA engine: would there be any issue if we overwrite descriptors (even if they are done) in a HexagonUserDMA chain? I.e., is it OK to overwrite the head of the descriptor chain once it is done? Perhaps @kparzysz-quic or @arangasa have some comment. A test could not hurt.

Contributor Author

Added a unit test, named overflow_ring_buffer, which overflows the HexagonUserDMA ring buffer for DMA descriptors. Also added a DMA_RETRY return from the HexagonUserDMA Copy function in case the ring buffer is full. This passes the onus onto the user of the class to retry the Copy. The other way to handle it would be for Copy to Wait until there is a free slot in the ring buffer. Not sure what is best at this moment. I am hoping the unit test provides the desired coverage; I would like to add integration tests separately.
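
Caller-side, the DMA_RETRY contract could look roughly like this (the DmaLike interface and its signatures below are assumptions for illustration, not the real HexagonUserDMA API; the return codes match the #defines in the diff):

```cpp
#include <cstdint>

// Return codes as defined in the PR.
#define DMA_SUCCESS 0
#define DMA_FAILURE -1
#define DMA_RETRY 1

// Minimal interface assumed for illustration only.
struct DmaLike {
  virtual int Copy(void* dst, void* src, uint32_t length) = 0;
  virtual uint32_t Poll() = 0;
  virtual void Wait(uint32_t max_in_flight) = 0;
  virtual ~DmaLike() = default;
};

// DMA_RETRY means the descriptor ring buffer was full: wait for some in-flight
// DMAs to finish, then retry the Copy.
int CopyWithRetry(DmaLike& dma, void* dst, void* src, uint32_t length) {
  int ret = dma.Copy(dst, src, length);
  while (ret == DMA_RETRY) {
    uint32_t in_flight = dma.Poll();
    dma.Wait(in_flight > 0 ? in_flight - 1 : 0);  // free at least one descriptor slot
    ret = dma.Copy(dst, src, length);
  }
  return ret;  // DMA_SUCCESS or DMA_FAILURE
}
```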

csullivan (Contributor) left a comment

This looks really good, thanks for the great feature and improvements over the past week! We can clarify the DMA ownership (remove the static singleton) in a follow-up PR after we refactor the overall resource ownership model.


#define DMA_SUCCESS 0
#define DMA_FAILURE -1
#define DMA_RETRY 1
Contributor

nit: A docstring on the meaning of DMA_RETRY would be helpful for quicker understanding.

@csullivan csullivan merged commit 7f1856d into apache:main Aug 26, 2022
@adstraw adstraw deleted the straw-hex-async-dma branch August 30, 2022 18:41
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022

* use ring buffer to store DMA descriptors

* add RingBuffer class; used by HexUserDMA to store descriptors

* add test to overflow the HexagonUserDMA ring buffer
aryamazaheri

@adstraw @kparzysz-quic It seems that this PR removed the #if __HEXAGON_ARCH__ >= 68 guard. If so, has TVM officially discontinued support for v66 DSPs?
