[libbladeRF] Device hangs after repeated open/stream/close #408

Closed
jynik opened this issue Jul 21, 2015 · 3 comments
Labels: Issue: Bug (It's a bug and it oughta be fixed)

jynik (Contributor) commented Jul 21, 2015

It has been reported that running a long series of repeated "open, stream samples, close" operations can cause the device to hang. It is generally not possible to open the device after this, as it's left in the "running" state.

I've used the following scripts to reproduce this; the fault generally appears around a thousand iterations in:

rx_test.sh:

#!/bin/bash
set -e

i=1
while true; do
    echo "======================================================================"
    echo "  Collecting data for iteration $i @ $(date '+%H:%M:%S')"
    echo "======================================================================"
    bladeRF-cli -p
    bladeRF-cli -s test_script.txt -v verbose 2>> /tmp/bladeRF-log.txt
    i=$((i + 1))

    if [ -f "/tmp/stop_test" ]; then
        echo "======================================================================"
        echo " Exiting test "
        echo "======================================================================"
        exit 0
    fi
done

test_script.txt:

info

set frequency 433M
set samplerate 5M

rx config file=/tmp/s.csv
rx config format=csv n=10M

rx
rx start
rx
rx wait
rx
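
Equivalently, the same repro can be driven directly through libbladeRF's synchronous interface. A minimal sketch, assuming the current module-based API; the buffer sizing is illustrative, and most error handling is trimmed for brevity:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <libbladeRF.h>

int main(void)
{
    /* 8192 complex SC16 Q11 samples per bladerf_sync_rx() call */
    int16_t *samples = malloc(8192 * 2 * sizeof(int16_t));
    unsigned int iteration;

    for (iteration = 1; ; iteration++) {
        struct bladerf *dev;
        unsigned int n;
        int status;

        printf("Iteration %u\n", iteration);

        status = bladerf_open(&dev, NULL);
        if (status != 0) {
            /* Once the device hangs, this is where we get stuck */
            fprintf(stderr, "open failed: %s\n", bladerf_strerror(status));
            return EXIT_FAILURE;
        }

        bladerf_set_frequency(dev, BLADERF_MODULE_RX, 433000000);
        bladerf_set_sample_rate(dev, BLADERF_MODULE_RX, 5000000, NULL);

        bladerf_sync_config(dev, BLADERF_MODULE_RX, BLADERF_FORMAT_SC16_Q11,
                            16, 8192, 8, 3500);
        bladerf_enable_module(dev, BLADERF_MODULE_RX, true);

        /* Stream ~10M samples, mirroring the rx config in the script */
        for (n = 0; n < 1280; n++) {
            bladerf_sync_rx(dev, samples, 8192, NULL, 3500);
        }

        bladerf_enable_module(dev, BLADERF_MODULE_RX, false);
        bladerf_close(dev);
    }
}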

The hang occurs when we request that the device shut down.

When ending the stream, we tell the async handler to stop submitting new transfers and to indicate when all pending transfers have come back (either with data, or with the error code resulting from our request that pending transfers be canceled).
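
That handshake reduces to a timed wait on a worker state flag that the transfer callbacks are expected to advance. A minimal sketch of the pattern, with illustrative names rather than the actual libbladeRF symbols:

#include <pthread.h>
#include <time.h>

enum worker_state { WORKER_RUNNING, WORKER_SHUTTING_DOWN, WORKER_DONE };

struct worker {
    pthread_mutex_t   lock;
    pthread_cond_t    state_changed; /* Signaled on every state transition */
    enum worker_state state;
};

/* Wait up to timeout_ms for the worker to reach `target`. Returns 0 on
 * success, or ETIMEDOUT if no callback ever advanced the state -- which
 * is the "Wait on state change failed" case in the log below. */
static int wait_for_state(struct worker *w, enum worker_state target,
                          unsigned int timeout_ms)
{
    int status = 0;
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->lock);
    while (w->state != target && status == 0) {
        status = pthread_cond_timedwait(&w->state_changed, &w->lock, &deadline);
    }
    pthread_mutex_unlock(&w->lock);

    return status;
}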

However, the log below shows that an unexpected libusb failure occurs when requesting that transfers be canceled. Furthermore, the "Wait on state change failed" message indicates that none of the pending transfers are completing.

[DEBUG @ bladerf.c:345] Enable Module: RX - False
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[DEBUG @ sync_worker.c:319] sync_worker_wait_for_state: Wait on state change failed: Operation timed out
[WARNING @ sync_worker.c:255] Timed out while stopping worker. Canceling thread.
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...

In moving forward with this, I propose the following tasks:

  1. Reproduce this with verbose libusb logs to see if we can gather more info about these transfers that appear to go missing.
  2. Run this test on Windows with the Cypress backend, and then again with the libusb backend. Can we isolate this to libusb (or our particular usage of it) specifically?
  3. Enable FPGA counter mode and reproduce this with a USB protocol analyzer attached, and look for any observations regarding those final transfers. (The counter data payload simply provides more structured RX data; a verification sketch follows this list.)
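
For item 3, counter-mode data makes dropped or missing samples trivially detectable in a capture. Below is a hedged sketch of the kind of continuity check a capture could be run through; it assumes each 32-bit sample slot holds a little-endian count that increments once per sample, which should be confirmed against the FPGA's counter-mode source:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Scan a buffer of counter-mode "samples" and report discontinuities.
 * Assumption: one little-endian 32-bit count per sample slot. */
static int check_counters(const uint32_t *samples, size_t count,
                          uint32_t *expected)
{
    size_t i;
    int gaps = 0;

    for (i = 0; i < count; i++) {
        if (samples[i] != *expected) {
            fprintf(stderr, "Gap at sample %zu: got 0x%08x, expected 0x%08x\n",
                    i, samples[i], *expected);
            *expected = samples[i];  /* Resynchronize and keep scanning */
            gaps++;
        }
        (*expected)++;
    }

    return gaps;
}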
jynik added the Issue: Bug label Jul 21, 2015
jynik self-assigned this Jul 21, 2015
jynik (Contributor, Author) commented Aug 27, 2015

Based on my testing so far, this issue seems to occur only on OS X.

I'm currently trying to determine whether we can better handle failed transfer cancellations by forcibly entering the "Stream done" state and continuing to shut down:

diff --git a/host/libraries/libbladeRF/src/backend/usb/libusb.c b/host/libraries/libbladeRF/src/backend/usb/libusb.c
index 01f5e8b..6ccd321 100644
--- a/host/libraries/libbladeRF/src/backend/usb/libusb.c
+++ b/host/libraries/libbladeRF/src/backend/usb/libusb.c
@@ -835,9 +835,10 @@ static int lusb_get_string_descriptor(void *driver, uint8_t index,
  * error -- no big deal.  Just accepting that alleviates the need to track
  * the status of each transfer...
  */
-static inline void cancel_all_transfers(struct bladerf_stream *stream)
+static inline int cancel_all_transfers(struct bladerf_stream *stream)
 {
     size_t i;
+    int retval = 0;
     int status;
     struct lusb_stream_data *stream_data = stream->backend_data;

@@ -846,6 +847,11 @@ static inline void cancel_all_transfers(struct bladerf_stream *stream)

             status = libusb_cancel_transfer(stream_data->transfers[i]);
             if (status < 0 && status != LIBUSB_ERROR_NOT_FOUND) {
+
+                if (retval == 0) {
+                    retval = status;
+                }
+
                 log_error("Error canceling transfer (%d): %s\r\n",
                         status, libusb_error_name(status));
             } else {
@@ -853,6 +859,8 @@ static inline void cancel_all_transfers(struct bladerf_stream *stream)
             }
         }
     }
+
+    return retval;
 }

 static inline size_t transfer_idx(struct lusb_stream_data *stream_data,
@@ -883,6 +891,13 @@ static void LIBUSB_CALL lusb_stream_cb(struct libusb_transfer *transfer)

     MUTEX_LOCK(&stream->lock);

+    /* Avoid touching anything if we're somehow in the DONE state as the
+     * result of any previous errors */
+    if (stream->state == STREAM_DONE) {
+        log_error("Transfer completed while stream was in DONE state.\n");
+        goto out;
+    }
+
     transfer_i = transfer_idx(stream_data, transfer);
     assert(stream_data->transfer_status[transfer_i] == TRANSFER_IN_FLIGHT ||
            stream_data->transfer_status[transfer_i] == TRANSFER_CANCEL_PENDING);
@@ -978,10 +993,20 @@ static void LIBUSB_CALL lusb_stream_cb(struct libusb_transfer *transfer)
         if (stream_data->num_avail == stream_data->num_transfers) {
             stream->state = STREAM_DONE;
         } else {
-            cancel_all_transfers(stream);
+            if (cancel_all_transfers(stream) != 0) {
+                /* We've observed that if a request to cancel a transfer
+                 * occurs, we may never get a callback. In this case,
+                 * we'll just head to the STREAM_DONE state, since
+                 * there's not much left we can do.
+                 */
+                stream->state = STREAM_DONE;
+                log_info("Forcibly shutting down stream due to failure to "
+                         "cancel transfer(s).\n");
+            }
         }
     }

+out:
     MUTEX_UNLOCK(&stream->lock);
 }

@@ -1210,7 +1235,17 @@ static int lusb_stream(void *driver, struct bladerf_stream *stream,
              * have libusb fire off callbacks with the cancelled status*/
             if (status < 0) {
                 stream->error_code = status;
-                cancel_all_transfers(stream);
+
+                if (cancel_all_transfers(stream) != 0) {
+                    /* We've observed that if a request to cancel a transfer
+                     * occurs, we may never get a callback. In this case,
+                     * we'll just head to the STREAM_DONE state, since
+                     * there's not much left we can do.
+                     */
+                    stream->state = STREAM_DONE;
+                    log_info("Forcibly shutting down stream due to failure to "
+                            "cancel transfer(s).\n");
+                }
             }
         }
     }

However, this is yielding segfaults; more digging and review of the code paths in this condition are required.

staticfloat commented

Just wanted to throw in and say that I was also seeing this on OS X. I have been experimenting with creating a tool similar to rtl_power using pybladeRF (via my own, slightly updated fork). My first attempt at scanning over frequencies quickly was to open, stream, and close, then open again with a different frequency; I will need to see if I can force it to change frequencies instead of opening and closing.

jynik (Contributor, Author) commented Dec 22, 2015

@staticfloat I can now reproduce this on all OSes via the libbladeRF_repeated_stream test program. In the past I did not see this, but I am now seeing it even on very early changesets that I do not recall exhibiting the problem. After the upcoming RC1, I am going to dive further into this.

This is what causes the lockup in DC calibration table generation. For the time being, I am working around it by using a single stream (rather than repeatedly opening and closing one) and using timestamps to determine when to receive samples post-retune. That approach may work well for you.
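
For reference, the single-stream approach looks roughly like the sketch below: one long-lived metadata-format stream, with each read scheduled some margin past the retune. The 5000-tick margin (~1 ms at 5 Msps) is illustrative, not a tuned value:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <libbladeRF.h>

/* Sweep frequencies on a single stream instead of open/stream/close */
static void sweep(struct bladerf *dev, int16_t *samples)
{
    unsigned int f;

    bladerf_sync_config(dev, BLADERF_MODULE_RX, BLADERF_FORMAT_SC16_Q11_META,
                        16, 8192, 8, 3500);
    bladerf_enable_module(dev, BLADERF_MODULE_RX, true);

    for (f = 400000000; f <= 500000000; f += 1000000) {
        struct bladerf_metadata meta;
        uint64_t now;

        bladerf_set_frequency(dev, BLADERF_MODULE_RX, f);

        /* Schedule the read past the retune; with flags == 0,
         * bladerf_sync_rx() waits until meta.timestamp is reached. */
        bladerf_get_timestamp(dev, BLADERF_MODULE_RX, &now);
        memset(&meta, 0, sizeof(meta));
        meta.timestamp = now + 5000;
        meta.flags     = 0;

        bladerf_sync_rx(dev, samples, 8192, &meta, 3500);
    }

    bladerf_enable_module(dev, BLADERF_MODULE_RX, false);
}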

jynik added a commit that referenced this issue Dec 28, 2015

This code is a re-designed set of DC calibration functionality
built atop libbladeRF.

This functionality provides the ability to:
    - Run LMS6002D DC calibrations per module (or all at once)

    - Search for nominal RX/TX DC correction values for a single
      frequency or a range of frequencies.

Issue #408 is avoided by utilizing a single stream, rather than
repeatedly starting and stopping streams.

No TX thread is used, as this is unnecessary; it is sufficient to
flush 0+0j samples through the TX path at the start of calibration
and allow the DAC to remain held in this state as the FPGA
(intentionally) underruns.
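
The "flush 0+0j" step amounts to a single synchronous TX of zero-valued samples; once the FPGA underruns, the DAC is left holding zero. A rough sketch, with illustrative buffer sizing:

#include <stdbool.h>
#include <stdint.h>
#include <libbladeRF.h>

/* Push one buffer of 0+0j through TX, then let the FPGA underrun so
 * the DAC holds zero for the remainder of calibration. */
static int flush_tx_zeros(struct bladerf *dev)
{
    static int16_t zeros[2 * 4096];    /* 4096 zero-valued SC16 Q11 samples */
    int status;

    status = bladerf_sync_config(dev, BLADERF_MODULE_TX,
                                 BLADERF_FORMAT_SC16_Q11, 8, 4096, 4, 3500);
    if (status != 0) {
        return status;
    }

    status = bladerf_enable_module(dev, BLADERF_MODULE_TX, true);
    if (status != 0) {
        return status;
    }

    return bladerf_sync_tx(dev, zeros, 4096, NULL, 3500);
}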
rtucker closed this as completed Aug 2, 2017
jcmartin added a commit to jcmartin/bladeRF that referenced this issue May 18, 2023
…transfer

is called on one transfer, all transfers on the same endpoint are
cancelled. Calling libusb_cancel_transfer on additional transfers on the
same endpoint _may_ result in macOS returning kIOReturnAborted. The net
result is that the bladeRF can no longer be communicated with.

The fix to this is to only call libusb_cancel_transfer once in
cancel_all_transfers and set the appropriate state for all transfers.
Refer to the libusb_cancel_transfer documentation for more details.
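
In terms of the cancel_all_transfers() shown earlier in this thread, the described fix amounts to something like the following sketch (illustrative; refer to the actual commit for the real change):

/* On macOS, canceling one transfer aborts every transfer on the same
 * endpoint, so issue libusb_cancel_transfer() only once and mark the
 * remaining in-flight transfers as cancel-pending ourselves. */
static inline void cancel_all_transfers(struct bladerf_stream *stream)
{
    size_t i;
    bool cancel_issued = false;
    struct lusb_stream_data *stream_data = stream->backend_data;

    for (i = 0; i < stream_data->num_transfers; i++) {
        if (stream_data->transfer_status[i] == TRANSFER_IN_FLIGHT) {
            if (!cancel_issued) {
                libusb_cancel_transfer(stream_data->transfers[i]);
                cancel_issued = true;
            }

            /* These will still come back with a cancelled status,
             * without any further cancel requests. */
            stream_data->transfer_status[i] = TRANSFER_CANCEL_PENDING;
        }
    }
}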