Software CSMA and Link Retries for AT86RF2XX, and Fix Concurrency Bugs #8332
base: 2017.10-branch
Before this commit, the gnrc_netdev->send function was called immediately when a SEND message was received by netdev. This requires the radio driver to "block" the thread until the radio is free to send a packet. This commit changes netdev to support split-phase sending in the radio driver. The netdev thread now waits until the current send operation finishes before issuing a new one, so it queues outgoing messages until the radio driver is no longer busy. This allows the event loop to handle other events, such as servicing interrupts or receiving packets, while the radio is busy sending a packet. It therefore has better concurrency characteristics than the previous model, which requires the gnrc_netdev->send function to block the event loop.
Looking at the AT86RF2XX driver authors, good reviewers might be @kaspar030, @thomaseichinger or @haukepetersen
Although automatic CSMA and link retries are provided by the AT86RF233 radio's "Extended Operating Mode", the radio is unable to receive any frames in between CSMA probes and link retries. This interacts very poorly with protocols that require bidirectional communication. If two nodes try to send frames to each other at the same time, the CSMA backoff happens as it should, but when one of the nodes finally attempts to send a frame after a CSMA backoff, the other node is still performing CSMA backoff and therefore is not listening, so the packet is lost. We therefore implement CSMA and link retries in software, in such a way that packets may be received between CSMA probes or link retries. This functionality is enabled by defining the AT86RF2XX_SOFTWARE_CSMA flag at compile time. As this operation is split-phase, it requires the send queue at the netdev layer, introduced in the previous commit.
We also fix concurrency issues in the at86rf2xx driver that were exacerbated when we introduced software CSMA and link retries. Concurrency bugs in the at86rf2xx driver may cause a sent packet to overwrite the frame buffer while a packet is being received, causing outgoing packets to be "received" by the driver, and incoming packets to be dropped AFTER the radio sends a link-layer acknowledgment. Our fix is to detect, in a race-free way, whether a frame is being received when a frame is about to be sent, and if so to mark the send as a CSMA failure instead of overwriting the radio's frame buffer. Note that the fix only works when software CSMA is enabled for the AT86RF2XX radio (the AT86RF2XX_SOFTWARE_CSMA flag).
89e4f55
to
7ae8c4d
@samkumar could you rebase this to the current master, please? There seem to be a few conflicts.
oops, sorry. I didn't read that far (: I was trying to rebase it onto master and try it with a couple of nodes then I just wrote my comment. Should've read all the text..
Bug is related to: #7276 (comment)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.
@jia200x aware of this?
@jia200x Is this one still valid?
I think there are some parts that would still be valid, even considering the fact we have moved most of the MAC tasks to the SubMAC and there's an ongoing PR for the radio HAL version of this radio. The concurrency issues described in 3. are still present in current master and such a solution would definitely improve the situation. Regarding the software CSMA-CA (and although I acknowledge the problem described with handshakes actually occurs quite often, even more in multi-hop scenarios), I'm not sure if the solution is to enable frame reception during CSMA-CA. The standard doesn't clarify this, but suggests that frames may be discarded during CCA. Although turning on RX during CSMA-CA would definitely increase the average number of received packets, we are somehow assuming it's fine for a second node to send data while the first node is still trying to transmit. That's what CSMA-CA tries to avoid. Also, the timings of CSMA-CA would definitely change using this approach. In peer-to-peer handshake cases, this problem shouldn't occur so much because the response is triggered after the request. For handshakes between nodes more than one hop away, the problem can be mitigated by choosing the right CSMA-CA backoff parameters and calibrating the CCA threshold. The best solution would be to use a slotted mechanism (TSCH, DSME). I would also avoid adding a queue in the driver. On one side, there are already queues on top (e.g
Since this PR has been stale for several years, I'll convert it to a draft. Please feel free to remove the draft state if anyone wants to pick this up again.
Contribution description
In this pull request, I have implemented Software CSMA and Link Retries for the AT86RF2XX Radio Driver, and fixed some concurrency issues in the radio driver.
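There are really three parts to this, which I will explain in this order:
1. Software CSMA with AT86RF233
2. Adding a Send Queue in netdev
3. Concurrency Issues in the AT86RF233 Radio Driver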
Software CSMA with AT86RF233
I've been doing some work lately with high-bandwidth networking over IEEE 802.15.4, using the Hamilton (@immesys) platform, which uses the AT86RF233 radio. After much experimentation, I have concluded that the automatic CSMA and link retry mechanism provided by the AT86RF233's Extended Operating Mode interacts very poorly with protocols with bidirectional data transfer. (TCP is an example of a protocol with bidirectional data transfer: data goes in one direction, and acknowledgments in the opposite direction.) The reason is that the radio cannot receive packets in between link retries or CSMA attempts. Therefore, if two devices send each other a frame at approximately the same time (a common occurrence), the first packet to be sent is consistently lost, as the other device is performing CSMA backoff and therefore cannot receive packets.
The solution is to implement CSMA in such a way that packets can be received during CSMA backoff. Unfortunately this isn't possible using the built-in CSMA capability in the AT86RF233. The fundamental limitation is that the AT86RF233 has a single frame buffer for both incoming and outgoing packets. Therefore, if a packet is received during CSMA backoff, the CPU needs to write the outgoing packet to the frame buffer again.
However, we would still like to take advantage of the AT86RF233's Extended Operating Mode where possible. Therefore, we set the number of link retries to 0 and the number of CSMA retries to 0 in the AT86RF233 hardware. Now, initiating a TX operation tells the radio to perform a CCA probe and send the frame only if the channel is clear. After the transaction, software can read the TRAC_STATUS register to find out what really happened. Using this trick, we implement Software CSMA. Sending a packet writes it to a software buffer used to buffer frames, and sets a timer that initiates a TX operation after a CSMA backoff period. After the operation, we read the TRAC_STATUS register to learn whether the CSMA probe was successful or not. We also implement link retries in software, using this same mechanism.
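The scheme described above can be sketched as a small state machine. This is only an illustration, not the code in this pull request: the names (`sw_csma_t`, `sw_csma_send`, `sw_csma_on_status`) are hypothetical, the hardware is reduced to a status value that software would derive from TRAC_STATUS after a zero-retry attempt, and the constants are assumed to be the IEEE 802.15.4 defaults (macMinBE 3, macMaxBE 5, macMaxCSMABackoffs 4, macMaxFrameRetries 3, 320 µs unit backoff period in the 2.4 GHz band). The caller is expected to arm a timer with the returned delay; the radio stays in receive mode until that timer fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical outcome of one zero-retry hardware attempt, as software
 * would derive it from TRAC_STATUS. */
typedef enum { TX_OK, TX_CHANNEL_BUSY, TX_NO_ACK } tx_status_t;

/* Result reported to the layer above. */
typedef enum { SEND_PENDING, SEND_DONE, SEND_MEDIUM_BUSY, SEND_NOACK } send_result_t;

/* Assumed IEEE 802.15.4 defaults; the actual PR may use other values. */
#define MAX_FRAME_LEN     127u
#define MIN_BE            3u
#define MAX_BE            5u
#define MAX_CSMA_BACKOFFS 4u
#define MAX_FRAME_RETRIES 3u
#define BACKOFF_PERIOD_US 320u

typedef struct {
    uint8_t frame[MAX_FRAME_LEN]; /* software copy: the radio's single frame
                                     buffer may be overwritten by an RX */
    uint8_t len;
    uint8_t be;                   /* current backoff exponent */
    uint8_t csma_tries;
    uint8_t frame_retries;
    uint32_t rng;                 /* state of a toy pseudo-random generator */
    bool busy;
} sw_csma_t;

/* Random delay, uniform in [0, 2^BE - 1] unit backoff periods. */
static uint32_t next_backoff_us(sw_csma_t *c)
{
    c->rng = c->rng * 1103515245u + 12345u;
    uint32_t periods = (c->rng >> 16) % (1u << c->be);
    return periods * BACKOFF_PERIOD_US;
}

static send_result_t finish(sw_csma_t *c, send_result_t r)
{
    c->busy = false;
    return r;
}

/* Phase 1: buffer the frame and return the delay before the first attempt. */
uint32_t sw_csma_send(sw_csma_t *c, const uint8_t *data, uint8_t len)
{
    assert(!c->busy && len <= MAX_FRAME_LEN);
    memcpy(c->frame, data, len);
    c->len = len;
    c->be = MIN_BE;
    c->csma_tries = 0;
    c->frame_retries = 0;
    c->busy = true;
    return next_backoff_us(c);
}

/* Phase 2: called once the hardware attempt has completed. Returns
 * SEND_PENDING and sets *delay_us when another attempt is needed. */
send_result_t sw_csma_on_status(sw_csma_t *c, tx_status_t st, uint32_t *delay_us)
{
    assert(c->busy);
    switch (st) {
    case TX_OK:
        return finish(c, SEND_DONE);
    case TX_CHANNEL_BUSY:
        if (++c->csma_tries > MAX_CSMA_BACKOFFS) {
            return finish(c, SEND_MEDIUM_BUSY);
        }
        if (c->be < MAX_BE) {
            c->be++;
        }
        break;
    case TX_NO_ACK:
        if (++c->frame_retries > MAX_FRAME_RETRIES) {
            return finish(c, SEND_NOACK);
        }
        c->be = MIN_BE;       /* a link retry restarts CSMA from scratch */
        c->csma_tries = 0;
        break;
    }
    *delay_us = next_backoff_us(c);
    return SEND_PENDING;
}
```

Because both phases return immediately, the event loop remains free to service receive interrupts during every backoff interval, which is the property the hardware retry mechanism lacks.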
Netdev already has a solution for Software CSMA, but it was insufficient for our purpose because it sleeps in the event loop during the backoff. Packets therefore cannot be received during the CSMA backoff, so it cannot be used to solve this problem.
Adding a Send Queue in netdev
Implementing the above Software CSMA makes sending a packet a split-phase operation. It cannot simply block the thread until the radio sends the packet, because that would block the same event loop used for reading new packets. However, netdev, as currently written, does not support split-phase sending. Whenever it receives a SEND message in its queue, it immediately calls dev->send(). Currently, the radio driver handles this by spinning in the current thread until the radio is ready (see https://github.com/RIOT-OS/RIOT/blob/2017.10-branch/drivers/at86rf2xx/at86rf2xx.c#L147-L150). However, we need the event loop to be able to handle received packets while waiting; in other words, send() needs to return early, during the CSMA backoff.
This causes a problem, because netdev may now call send() on a packet while the radio driver is in CSMA backoff. To solve this problem, we introduce a send queue in netdev. Netdev waits for the TX_DONE event, and keeps track of whether a frame is currently being sent by the radio. If it receives a SEND message while the radio driver is busy sending a packet (which includes the CSMA backoff), it places the frame on a queue, and dispatches it once it receives the TX_DONE event.
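A minimal sketch of this queueing logic follows. It is not the netdev code from this pull request: `tx_sched_t`, `tx_sched_on_send`, `tx_sched_on_tx_done` and the fixed queue capacity are hypothetical, and `fake_send` is a stub standing in for a non-blocking dev->send() so the behaviour can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TXQ_CAP 4 /* assumed capacity, chosen only for illustration */

/* Fixed-size FIFO of frames waiting for the radio. */
typedef struct {
    const void *slots[TXQ_CAP];
    size_t head;
    size_t count;
} txq_t;

/* Stand-in for a split-phase dev->send(): it must return immediately. */
typedef int (*send_fn_t)(void *dev, const void *frame);

typedef struct {
    txq_t q;
    bool tx_in_flight; /* true from send() until the TX_DONE event */
    send_fn_t send;
    void *dev;
} tx_sched_t;

static bool txq_push(txq_t *q, const void *frame)
{
    if (q->count == TXQ_CAP) {
        return false;
    }
    q->slots[(q->head + q->count) % TXQ_CAP] = frame;
    q->count++;
    return true;
}

static const void *txq_pop(txq_t *q)
{
    if (q->count == 0) {
        return NULL;
    }
    const void *frame = q->slots[q->head];
    q->head = (q->head + 1) % TXQ_CAP;
    q->count--;
    return frame;
}

/* Handler for a SEND message. Returns false if the queue is full, in which
 * case the caller still owns the frame and has to drop or release it. */
bool tx_sched_on_send(tx_sched_t *s, const void *frame)
{
    if (s->tx_in_flight) {
        return txq_push(&s->q, frame);
    }
    s->tx_in_flight = true;
    s->send(s->dev, frame);
    return true;
}

/* Handler for the TX_DONE event: dispatch the next queued frame, if any. */
void tx_sched_on_tx_done(tx_sched_t *s)
{
    assert(s->tx_in_flight);
    const void *next = txq_pop(&s->q);
    if (next != NULL) {
        s->send(s->dev, next);
    }
    else {
        s->tx_in_flight = false;
    }
}

/* Demo stub: records the order in which frames reach the "driver". */
typedef struct {
    const void *sent[8];
    size_t n;
} fake_dev_t;

static int fake_send(void *dev, const void *frame)
{
    fake_dev_t *d = dev;
    if (d->n < 8) {
        d->sent[d->n++] = frame;
    }
    return 0;
}
```

With `fake_send` installed, three back-to-back SEND messages result in a single call into the driver; the other two frames are handed over one at a time as each TX_DONE event arrives, in FIFO order.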
Concurrency Issues in the AT86RF233 Radio Driver
Doing this work in the radio driver brought out some concurrency issues. Most notably, when the radio driver tries to send a packet, but the radio is busy, it spin-waits until the radio is ready (see https://github.com/RIOT-OS/RIOT/blob/2017.10-branch/drivers/at86rf2xx/at86rf2xx.c#L147-L150). In particular, if the radio is receiving a packet and netdev tries to send a frame, it will wait until the frame is fully received. However, because we are spin-waiting in the event loop, the final interrupt raised once the frame is received will not be serviced until after the outgoing frame has been written and sent. This overwrites the frame buffer with the outgoing packet. As a result, when the interrupt is finally serviced, the driver reads the outgoing frame that was just sent, rather than the incoming frame that was overwritten.
To solve this problem, the at86rf2xx_tx_prepare function should not spin-wait while the radio is busy, but instead atomically perform the following operation: check if the radio is busy or if a receive interrupt is pending, and return a failure code if so; otherwise send the packet. What should happen if at86rf2xx_tx_prepare fails? In my opinion, the correct thing to do is to count it as a CSMA failure: the frame was not transmitted because another frame was on the air, and the fact that this radio happened to be the one receiving it isn't relevant. Fixing this bug properly therefore requires Software CSMA, which is why I've included the fix in this pull request.
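The check-then-write step can be sketched against a toy model of the radio. Everything here is an assumption made for illustration: `radio_model_t`, `tx_prepare_nonblocking` and the `model_irq_*` helpers are hypothetical and only mimic interrupt masking; a real driver would query the radio's state over SPI and use the platform's own critical-section primitives.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the radio state that matters for the race. */
typedef struct {
    bool rx_in_progress;    /* a frame is currently being received */
    bool rx_irq_pending;    /* frame received, interrupt not yet serviced */
    uint8_t frame_buf[127]; /* the single buffer shared by RX and TX */
    uint8_t frame_len;
} radio_model_t;

/* Hypothetical stand-ins for the platform's interrupt masking. */
static unsigned irq_depth;

static unsigned model_irq_disable(void)
{
    return irq_depth++;
}

static void model_irq_restore(unsigned state)
{
    irq_depth = state;
}

/* Write an outgoing frame only if no reception is in progress or awaiting
 * service. Returns 0 on success, -EBUSY if the frame buffer must not be
 * touched, -EINVAL if the frame is too long. Never waits. */
int tx_prepare_nonblocking(radio_model_t *r, const uint8_t *data, uint8_t len)
{
    assert(r != NULL && data != NULL);
    if (len > sizeof(r->frame_buf)) {
        return -EINVAL;
    }

    int res = 0;
    unsigned state = model_irq_disable(); /* check and write as one step */
    if (r->rx_in_progress || r->rx_irq_pending) {
        res = -EBUSY; /* incoming frame stays intact in the buffer */
    }
    else {
        memcpy(r->frame_buf, data, len);
        r->frame_len = len;
    }
    model_irq_restore(state);
    return res;
}
```

A caller would treat `-EBUSY` exactly like a failed channel probe: return to the event loop, let the pending receive be handled, and retry the frame after another backoff.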
Issues/PRs references
I haven't opened an issue for the concurrency bug in the AT86RF2XX driver. However, I do believe that my changes incidentally solve #8242.
Additional notes
In the experiments I've done, these changes increase throughput by an order of magnitude (from < 6 kbps before the changes to 63 kbps after) when using the AT86RF233 radio and the Hamilton board.
I've done this work on top of the October 2017 release. I see that there have been some major changes since then (e.g., a lot of netdev has disappeared). I'd be happy to rebase this onto master. But first I would like a bit of guidance -- for example, where should the send queue go now that netdev is no more? Or more generally, what module now does the work that netdev used to do?