Software CSMA and Link Retries for AT86RF2XX, and Fix Concurrency Bugs #8332

Draft
wants to merge 2 commits into
base: 2017.10-branch

Conversation

samkumar
Contributor

@samkumar samkumar commented Jan 8, 2018

Contribution description

In this pull request, I have implemented Software CSMA and Link Retries for the AT86RF2XX Radio Driver, and fixed some concurrency issues in the radio driver.

There are really three parts to this, which I will explain in this order:

  1. Software CSMA with AT86RF233
  2. Adding a Send Queue in netdev
  3. Concurrency issues in AT86RF2XX driver

Software CSMA with AT86RF233

I've been doing some work lately with high-bandwidth networking over IEEE 802.15.4, using the Hamilton (@immesys) platform, which uses the AT86RF233 radio. After much experimentation, I have concluded that the automatic CSMA and link retry mechanism provided by the AT86RF233's Extended Operating Mode interacts very poorly with protocols with bidirectional data transfer. (TCP is an example of a protocol with bidirectional data transfer: data goes in one direction, and acknowledgments in the opposite direction.) The reason is that the radio cannot receive packets in between link retries or CSMA attempts. Therefore, if two devices send each other a frame at approximately the same time (a common occurrence), the first packet to be sent is consistently lost, as the other device is performing CSMA backoff and therefore cannot receive packets.

The solution is to implement CSMA in such a way that packets can be received during CSMA backoff. Unfortunately this isn't possible using the built-in CSMA capability in the AT86RF233. The fundamental limitation is that the AT86RF233 has a single frame buffer for both incoming and outgoing packets. Therefore, if a packet is received during CSMA backoff, the CPU needs to write the outgoing packet to the frame buffer again.

However, we would still like to take advantage of the AT86RF233's Extended Operating Mode where possible. Therefore, we set the number of link retries to 0 and the number of CSMA retries to 0 in the AT86RF233 hardware. Now, initiating a TX operation tells the radio to perform a CCA probe and send the frame only if the channel is clear. After the transaction, software can read the TRAC_STATUS register to find out what really happened. Using this trick, we implement Software CSMA. Sending a packet writes it to a software buffer used to buffer frames, and sets a timer that initiates a TX operation after a CSMA backoff period. After the operation, we read the TRAC_STATUS register to learn whether the CSMA probe was successful or not. We also implement link retries in software, using this same mechanism.
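As a rough illustration of the retry logic described above, here is a minimal host-side sketch, not the code in this pull request. The type names, the limits, and the backoff exponents are all assumptions made for the example; `tx_result_t` merely stands in for whatever the driver decodes from TRAC_STATUS.

```c
#include <stdint.h>

/* Hypothetical outcome of one hardware TX attempt (CCA probe + send),
 * standing in for what a driver would decode from TRAC_STATUS. */
typedef enum { TX_OK, TX_CHANNEL_BUSY, TX_NO_ACK } tx_result_t;

/* What the driver should do next after an attempt. */
typedef enum { CSMA_DONE, CSMA_FAILED, CSMA_RETRY_LATER } csma_action_t;

/* Illustrative limits; not taken from the pull request. */
#define MAX_CSMA_ATTEMPTS 5
#define MAX_LINK_RETRIES  3
#define MIN_BE            3
#define MAX_BE            5

typedef struct {
    uint8_t csma_attempts;  /* failed CCA probes in the current CSMA round */
    uint8_t link_retries;   /* transmissions that were not acknowledged */
    uint8_t backoff_exp;    /* current backoff exponent */
} soft_csma_t;

void soft_csma_start(soft_csma_t *s)
{
    s->csma_attempts = 0;
    s->link_retries = 0;
    s->backoff_exp = MIN_BE;
}

/* Upper bound (in backoff periods) for the next random delay: 2^BE - 1. */
uint32_t soft_csma_max_backoff(const soft_csma_t *s)
{
    return (1u << s->backoff_exp) - 1u;
}

/* Called from the event loop once the radio reports the attempt result.
 * On CSMA_RETRY_LATER the caller arms a timer and returns to the event
 * loop, so frames can still be received while the backoff timer runs. */
csma_action_t soft_csma_on_result(soft_csma_t *s, tx_result_t r)
{
    switch (r) {
    case TX_OK:
        return CSMA_DONE;
    case TX_CHANNEL_BUSY:
        if (++s->csma_attempts >= MAX_CSMA_ATTEMPTS) {
            return CSMA_FAILED;
        }
        if (s->backoff_exp < MAX_BE) {
            s->backoff_exp++;
        }
        return CSMA_RETRY_LATER;
    case TX_NO_ACK:
        if (++s->link_retries > MAX_LINK_RETRIES) {
            return CSMA_FAILED;
        }
        /* A link retry starts a fresh CSMA round. */
        s->csma_attempts = 0;
        s->backoff_exp = MIN_BE;
        return CSMA_RETRY_LATER;
    }
    return CSMA_FAILED;
}
```

The key property is that `soft_csma_on_result` never waits: the delay between attempts is left to a timer, which is what allows reception during backoff.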

Netdev already has a solution for Software CSMA, but it was insufficient for our purpose because it sleeps in the event loop. As a result, packets cannot be received during the CSMA backoff, so it cannot be used to solve this problem.

Adding a Send Queue in netdev

Implementing the above Software CSMA makes sending a packet a split-phase operation. It cannot simply block the thread until the radio sends the packet, because that would block the same event loop used for reading new packets. However, netdev, as currently written, does not support split-phase sending. Whenever it receives a SEND message in its queue, it immediately calls dev->send(). Currently, the radio driver handles this by spinning in the current thread until the radio is ready (see https://github.com/RIOT-OS/RIOT/blob/2017.10-branch/drivers/at86rf2xx/at86rf2xx.c#L147-L150). However, we need the event loop to be able to handle received packets while waiting; in other words, send() needs to return early, during the CSMA backoff.

This causes a problem, because netdev may now call send() on a packet while the radio driver is in CSMA backoff. To solve this problem, we introduce a send queue in netdev. Netdev waits for the TX_DONE event, and keeps track of whether a frame is currently being sent by the radio. If it receives a SEND message while the radio driver is busy sending a packet (which includes the CSMA backoff), it places the frame on a queue, and dispatches it once it receives the TX_DONE call.
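The queueing behaviour described above could look roughly like the following sketch. It is a simplified host-side model under assumed names (`send_queue_t`, `netdev_handle_send`, `netdev_handle_tx_done`, and `radio_start_tx` are hypothetical and are not the identifiers used in the commits); the radio is replaced by a stub that only records which frame was started.

```c
#include <stdbool.h>
#include <stddef.h>

#define SEND_QUEUE_LEN 4   /* illustrative capacity */

/* Hypothetical stand-in for the driver's non-blocking send(); here it
 * only records the frame handed to the radio. */
static const void *last_started;
static void radio_start_tx(const void *frame) { last_started = frame; }

typedef struct {
    const void *frames[SEND_QUEUE_LEN];  /* ring buffer of waiting frames */
    size_t head;
    size_t count;
    bool tx_in_progress;  /* true from send() until TX_DONE, backoff included */
} send_queue_t;

/* Handle a SEND message: start immediately if idle, otherwise enqueue.
 * Returns false if the queue is full and the frame must be dropped. */
bool netdev_handle_send(send_queue_t *q, const void *frame)
{
    if (!q->tx_in_progress) {
        q->tx_in_progress = true;
        radio_start_tx(frame);
        return true;
    }
    if (q->count == SEND_QUEUE_LEN) {
        return false;
    }
    q->frames[(q->head + q->count) % SEND_QUEUE_LEN] = frame;
    q->count++;
    return true;
}

/* Handle the TX_DONE event: dispatch the next queued frame, if any. */
void netdev_handle_tx_done(send_queue_t *q)
{
    if (q->count == 0) {
        q->tx_in_progress = false;
        return;
    }
    const void *next = q->frames[q->head];
    q->head = (q->head + 1) % SEND_QUEUE_LEN;
    q->count--;
    radio_start_tx(next);  /* radio stays busy with the next frame */
}
```

Both handlers return immediately, so the event loop remains free to process receive events between a SEND message and the matching TX_DONE.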

Concurrency Issues in the AT86RF233 Radio Driver

Doing this work in the radio driver brought out some concurrency issues. Most notably, when the radio driver tries to send a packet but the radio is busy, it spin-waits until the radio is ready (see https://github.com/RIOT-OS/RIOT/blob/2017.10-branch/drivers/at86rf2xx/at86rf2xx.c#L147-L150). In particular, if the radio is receiving a packet and netdev tries to send a frame, the driver will wait until the frame is fully received. However, because we are spin-waiting in the event loop, the final interrupt raised once the frame is received will not be serviced until after the outgoing frame is sent. Sending overwrites the frame buffer with the outgoing packet. As a result, when the interrupt is serviced, the driver reads the outgoing frame that was just sent, rather than the incoming frame that was overwritten.

To solve this problem, the at86rf2xx_tx_prepare function should not spin-wait while the radio is busy, but instead atomically perform the following operation: check if the radio is busy or if a receive interrupt is pending, and return a failure code if so; otherwise send the packet. What should happen if at86rf2xx_tx_prepare fails? In my opinion, the correct thing to do is to count it as a CSMA failure: the frame was not transmitted because another frame was being transmitted, and the fact that it was being received by this radio isn't relevant. Fixing this bug properly therefore requires Software CSMA, which is why I've included the fix in this pull request.
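The check-then-send step can be sketched as follows. This is a host-side model only: `radio_sim_t`, `tx_prepare_checked`, the return codes, and the `irq_off`/`irq_on` helpers are hypothetical names invented for the example, and the critical-section helpers are no-ops here, whereas a real driver would need an actual mechanism to make the check and the buffer write atomic.

```c
#include <stdbool.h>

/* Simulated radio state; in a real driver this information would come
 * from the radio's status and the pending-interrupt flag. */
typedef struct {
    bool rx_busy;             /* a frame is currently being received */
    bool rx_irq_pending;      /* reception finished, IRQ not yet serviced */
    bool frame_buf_holds_tx;  /* frame buffer was overwritten with TX data */
} radio_sim_t;

/* Hypothetical critical-section helpers. On a real MCU these would mask
 * and restore interrupts; here they are no-ops so the sketch runs on a host. */
static unsigned irq_off(void) { return 0; }
static void irq_on(unsigned state) { (void)state; }

#define TX_PREPARE_OK    0
#define TX_PREPARE_BUSY  (-1)

/* Instead of spin-waiting, fail fast when a reception is in progress or
 * still unprocessed, so the frame buffer is never overwritten. The caller
 * would then count TX_PREPARE_BUSY as a CSMA failure and back off. */
int tx_prepare_checked(radio_sim_t *r)
{
    unsigned state = irq_off();
    int res;

    if (r->rx_busy || r->rx_irq_pending) {
        res = TX_PREPARE_BUSY;
    } else {
        r->frame_buf_holds_tx = true;  /* safe to load the outgoing frame */
        res = TX_PREPARE_OK;
    }
    irq_on(state);
    return res;
}
```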

Issues/PRs references

I haven't opened an issue for the concurrency bug in the AT86RF2XX driver. However, I do believe that my changes incidentally solve #8242.

Additional notes

In the experiments I've done, these changes increase performance by an order of magnitude (from < 6 kbps before the changes to 63 kbps after) when using the AT86RF233 radio and Hamilton board.

I've done this work on top of the October 2017 release. I see that there have been some major changes since then (e.g., a lot of netdev has disappeared). I'd be happy to rebase this onto master. But first I would like a bit of guidance -- for example, where should the send queue go now that netdev is no more? Or more generally, what module now does the work that netdev used to do?

Before this commit, the gnrc_netdev->send function was called immediately when a SEND
message was received by netdev. This requires the radio driver to "block" the thread until
the radio is free to send a packet. This commit changes netdev to support split-phase
sending in the radio driver. The netdev thread now waits until the current send operation
finishes before issuing a new one. This requires it to queue messages to send until the
radio driver is no longer busy. This allows the event loop to handle other events, such
as servicing interrupts or receiving packets, while the radio is busy sending a packet.
Therefore, it has better concurrency characteristics than the previous model, which
requires the gnrc_netdev->send function to block the event loop.
@samkumar
Contributor Author

samkumar commented Jan 8, 2018

Looking at the AT86RF2XX driver authors, good reviewers might be @kaspar030 @thomaseichinger or @haukepetersen

Although CSMA and link retries are provided in hardware by the AT86RF233 radio's
"Extended Operating Mode", the radio is unable to receive any frames in between
CSMA probes and link retries. This interacts very poorly with protocols that
require bidirectional communication. If two nodes try to send frames to each
other at the same time, the CSMA backoff happens as it should, but when one of
the nodes finally attempts to send a frame after a CSMA backoff, the other node
is still performing CSMA backoff and therefore is not listening. The packet is
therefore lost. Therefore, we implement CSMA and link retries in software, in
such a way that packets may be received between CSMA probes or link retries.
This functionality is enabled by defining the AT86RF2XX_SOFTWARE_CSMA flag at
compile-time. As this operation is split-phase, it requires the send queue at the
netdev layer, introduced in the previous commit.

We also fix concurrency issues in the at86rf2xx driver that were exacerbated when
we introduced software CSMA and link retries. Concurrency bugs in at86rf2xx driver
may cause a sent packet to overwrite the frame buffer while a packet is being received,
causing outgoing packets to be "received" by the driver, and incoming packets to be
dropped AFTER the radio sends a link-layer acknowledgment. Our fix is to detect, in a
race-free way, if a frame is being received when a frame is sent, and to mark it as a
CSMA failure before overwriting the radio's frame buffer. Note that the fix only works
when software CSMA is enabled for the AT86RF2XX radio (the AT86RF2XX_SOFTWARE_CSMA flag).
@cgundogan
Member

@samkumar could you rebase this to the current master, please? There seem to be a few conflicts.

@cgundogan
Member

I've done this work on top of the October 2017 release. I see that there have been some major changes since then (e.g., a lot of netdev has disappeared). I'd be happy to rebase this onto master. But first I would like a bit of guidance -- for example, where should the send queue go now that netdev is no more? Or more generally, what module now does the work that netdev used to do?

oops, sorry. I didn't read that far (: I was trying to rebase it onto master and try it with a couple of nodes then I just wrote my comment. Should've read all the text..

@miri64
Member

miri64 commented Mar 15, 2018

@samkumar netdev still exists. The GNRC glue-code you are referring to was moved to gnrc_netif. But how about putting the queue into a netdev-layer (see #8198), so your code can also be used without GNRC?

@DipSwitch
Member

DipSwitch commented Aug 12, 2018

Bug is related to: #7276 (comment)

And: #7115 #7275 #7276 #8186

@miri64 miri64 added Area: drivers Area: Device drivers Area: network Area: Networking Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation labels Oct 17, 2018
@stale

stale bot commented Aug 10, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Aug 10, 2019
@PeterKietzmann PeterKietzmann added the State: don't stale State: Tell state-bot to ignore this issue label Aug 19, 2019
@stale stale bot removed the State: stale State: The issue / PR has no activity for >185 days label Aug 19, 2019
@PeterKietzmann
Member

@jia200x aware of this?

@jia200x jia200x mentioned this pull request Nov 13, 2019
14 tasks
@MrKevinWeiss MrKevinWeiss added this to the Release 2021.07 milestone Jun 21, 2021
@MrKevinWeiss MrKevinWeiss removed this from the Release 2021.07 milestone Jul 15, 2021
@fjmolinas
Contributor

@jia200x Is this one still valid?

@jia200x
Member

jia200x commented Nov 18, 2021

I think there are some parts that would still be valid, even considering the fact we have moved most of the MAC tasks to the SubMAC and there's an ongoing PR for the radio HAL version of this radio.

The concurrency issues described in 3. are still present in current master and such a solution would definitely improve the situation.

Regarding the software CSMA-CA (and although I acknowledge the problem described with handshakes actually occurs quite often, even more in multi-hop scenarios), I'm not sure the solution is to enable frame reception during CSMA-CA. The standard doesn't clarify this, but suggests that frames may be discarded during CCA.

Although turning on RX during CSMA-CA would definitely increase the average number of received packets, we are somehow assuming it's fine for a second node to send data while the first node is still trying to transmit. That's what CSMA-CA tries to avoid. Also, the timings of CSMA-CA would definitely change using this approach.

In peer-to-peer handshake cases, this problem shouldn't occur so much because the response is triggered after the request. For handshakes between nodes more than one hop away, the problem can be mitigated by choosing the right CSMA-CA backoff parameters and calibrating the CCA threshold. The best solution would be to use a slotted mechanism (TSCH, DSME).

I would also avoid adding a queue in the driver. On one hand, there are already queues on top (e.g., gnrc_netif_pktq), and on the other hand, the device only has one frame buffer. This implies a pending packet can be stored in a pointer if required, instead of pushing it to a queue.
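A single-pointer slot of the kind suggested here might look like the sketch below. The names `pending_slot_t`, `pending_put`, and `pending_take` are hypothetical and chosen only for illustration; the sketch assumes an upper layer keeps any further frames in its own queue.

```c
#include <stdbool.h>
#include <stddef.h>

/* One-element "queue": the radio has a single frame buffer, so at most
 * one frame waits in the driver; anything more stays in the upper layer. */
typedef struct {
    const void *pending;  /* NULL when no frame is waiting */
} pending_slot_t;

/* Store a frame while the radio is busy. Returns false if the slot is
 * already occupied, in which case the caller keeps the frame queued. */
bool pending_put(pending_slot_t *s, const void *frame)
{
    if (s->pending != NULL) {
        return false;
    }
    s->pending = frame;
    return true;
}

/* Fetch and clear the waiting frame, e.g. when TX_DONE arrives.
 * Returns NULL if nothing was pending. */
const void *pending_take(pending_slot_t *s)
{
    const void *frame = s->pending;
    s->pending = NULL;
    return frame;
}
```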

@mguetschow
Contributor

Since this PR has been stale for several years, I'll convert it to a draft.

Please feel free to remove the draft state if anyone wants to pick this up again.

@mguetschow mguetschow marked this pull request as draft June 11, 2024 09:18
Labels
Area: drivers Area: Device drivers Area: network Area: Networking State: don't stale State: Tell state-bot to ignore this issue Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
9 participants