[libbladeRF] Device hangs after repeated open/stream/close #408

Closed
jynik opened this issue Jul 21, 2015 · 3 comments
Labels: Issue: Bug (It's a bug and it oughta be fixed)

jynik (Contributor) commented Jul 21, 2015

It has been reported that running a long series of repeated "open, stream samples, close" operations can cause the device to hang. It is generally not possible to open the device after this, as it's left in the "running" state.

I've used the following scripts to reproduce this; the fault generally appears around a thousand iterations in:

rx_test.sh:

#!/bin/bash
set -e

i=1
while true; do
    echo "======================================================================"
    echo "  Collecting data for iteration $i @ $(date '+%H:%M:%S')"
    echo "======================================================================"
    bladeRF-cli -p
    bladeRF-cli -s test_script.txt -v verbose 2>> /tmp/bladeRF-log.txt
    i=$((i + 1))

    if [ -f "/tmp/stop_test" ]; then
        echo "======================================================================"
        echo " Exiting test "
        echo "======================================================================"
        exit 0
    fi
done

test_script.txt:

info

set frequency 433M
set samplerate 5M

rx config file=/tmp/s.csv
rx config format=csv n=10M

rx
rx start
rx
rx wait
rx
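
Equivalently, the same repro can be driven directly through libbladeRF's synchronous interface. A minimal sketch, assuming the current module-based API; the buffer sizing is illustrative, and most error handling is trimmed for brevity:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <libbladeRF.h>

int main(void)
{
    /* 8192 complex SC16 Q11 samples per bladerf_sync_rx() call */
    int16_t *samples = malloc(8192 * 2 * sizeof(int16_t));
    unsigned int iteration;

    for (iteration = 1; ; iteration++) {
        struct bladerf *dev;
        unsigned int n;
        int status;

        printf("Iteration %u\n", iteration);

        status = bladerf_open(&dev, NULL);
        if (status != 0) {
            /* Once the device hangs, this is where we get stuck */
            fprintf(stderr, "open failed: %s\n", bladerf_strerror(status));
            return EXIT_FAILURE;
        }

        bladerf_set_frequency(dev, BLADERF_MODULE_RX, 433000000);
        bladerf_set_sample_rate(dev, BLADERF_MODULE_RX, 5000000, NULL);

        bladerf_sync_config(dev, BLADERF_MODULE_RX, BLADERF_FORMAT_SC16_Q11,
                            16, 8192, 8, 3500);
        bladerf_enable_module(dev, BLADERF_MODULE_RX, true);

        /* Stream ~10M samples, mirroring the rx config in the script */
        for (n = 0; n < 1280; n++) {
            bladerf_sync_rx(dev, samples, 8192, NULL, 3500);
        }

        bladerf_enable_module(dev, BLADERF_MODULE_RX, false);
        bladerf_close(dev);
    }
}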

The hang occurs when we request that the device shut down.

When ending the stream, we tell the async handler to stop submitting new transfers and to indicate when all pending transfers have come back (either with data, or with the error code resulting from our request that pending transfers be canceled).
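
That handshake reduces to a timed wait on a worker state flag that the transfer callbacks are expected to advance. A minimal sketch of the pattern, with illustrative names rather than the actual libbladeRF symbols:

#include <pthread.h>
#include <time.h>

enum worker_state { WORKER_RUNNING, WORKER_SHUTTING_DOWN, WORKER_DONE };

struct worker {
    pthread_mutex_t   lock;
    pthread_cond_t    state_changed; /* Signaled on every state transition */
    enum worker_state state;
};

/* Wait up to timeout_ms for the worker to reach `target`. Returns 0 on
 * success, or ETIMEDOUT if no callback ever advanced the state -- which
 * is the "Wait on state change failed" case in the log below. */
static int wait_for_state(struct worker *w, enum worker_state target,
                          unsigned int timeout_ms)
{
    int status = 0;
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->lock);
    while (w->state != target && status == 0) {
        status = pthread_cond_timedwait(&w->state_changed, &w->lock, &deadline);
    }
    pthread_mutex_unlock(&w->lock);

    return status;
}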

However, the log below shows that an unexpected libusb failure occurs when requesting that transfers be canceled. Furthermore, the "Wait on state change failed" message indicates that none of the pending transfers are completing.

[DEBUG @ bladerf.c:345] Enable Module: RX - False
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[DEBUG @ sync_worker.c:319] sync_worker_wait_for_state: Wait on state change failed: Operation timed out
[WARNING @ sync_worker.c:255] Timed out while stopping worker. Canceling thread.
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[ERROR @ libusb.c:850] Error canceling transfer (-99): LIBUSB_ERROR_OTHER
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...
[VERBOSE @ async.c:214] Stream not done...

In moving forward with this, I propose the following tasks:

  1. Reproduce this with verbose libusb logs to see if we can gather more info about these transfers that appear to go missing.
  2. Run this test on Windows with the Cypress backend, and then again with the libusb backend. Can we isolate this to libusb (or our particular usage of it) specifically?
  3. Enable FPGA counter mode and reproduce this with a USB protocol analyzer attached, and look for any observations regarding those final transfers. (The counter data payload simply provides more structured RX data; a verification sketch follows this list.)
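
For item 3, counter-mode data makes dropped or missing samples trivially detectable in a capture. Below is a hedged sketch of the kind of continuity check a capture could be run through; it assumes each 32-bit sample slot holds a little-endian count that increments once per sample, which should be confirmed against the FPGA's counter-mode source:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Scan a buffer of counter-mode "samples" and report discontinuities.
 * Assumption: one little-endian 32-bit count per sample slot. */
static int check_counters(const uint32_t *samples, size_t count,
                          uint32_t *expected)
{
    size_t i;
    int gaps = 0;

    for (i = 0; i < count; i++) {
        if (samples[i] != *expected) {
            fprintf(stderr, "Gap at sample %zu: got 0x%08x, expected 0x%08x\n",
                    i, samples[i], *expected);
            *expected = samples[i];  /* Resynchronize and keep scanning */
            gaps++;
        }
        (*expected)++;
    }

    return gaps;
}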
jynik added the Issue: Bug label Jul 21, 2015
jynik self-assigned this Jul 21, 2015
jynik (Contributor, Author) commented Aug 27, 2015

Based on my testing so far, this issue seems to occur only on OS X.

I'm currently trying to determine whether we can better handle failed transfer cancellations by forcibly entering the "Stream done" state and continuing to shut down:

diff --git a/host/libraries/libbladeRF/src/backend/usb/libusb.c b/host/libraries/libbladeRF/src/backend/usb/libusb.c
index 01f5e8b..6ccd321 100644
--- a/host/libraries/libbladeRF/src/backend/usb/libusb.c
+++ b/host/libraries/libbladeRF/src/backend/usb/libusb.c
@@ -835,9 +835,10 @@ static int lusb_get_string_descriptor(void *driver, uint8_t index,
  * error -- no big deal.  Just accepting that alleviates the need to track
  * the status of each transfer...
  */
-static inline void cancel_all_transfers(struct bladerf_stream *stream)
+static inline int cancel_all_transfers(struct bladerf_stream *stream)
 {
     size_t i;
+    int retval = 0;
     int status;
     struct lusb_stream_data *stream_data = stream->backend_data;

@@ -846,6 +847,11 @@ static inline void cancel_all_transfers(struct bladerf_stream *stream)

             status = libusb_cancel_transfer(stream_data->transfers[i]);
             if (status < 0 && status != LIBUSB_ERROR_NOT_FOUND) {
+
+                if (retval == 0) {
+                    retval = status;
+                }
+
                 log_error("Error canceling transfer (%d): %s\r\n",
                         status, libusb_error_name(status));
             } else {
@@ -853,6 +859,8 @@ static inline void cancel_all_transfers(struct bladerf_stream *stream)
             }
         }
     }
+
+    return retval;
 }

 static inline size_t transfer_idx(struct lusb_stream_data *stream_data,
@@ -883,6 +891,13 @@ static void LIBUSB_CALL lusb_stream_cb(struct libusb_transfer *transfer)

     MUTEX_LOCK(&stream->lock);

+    /* Avoid touching anything if we're somehow in the DONE state as the
+     * result of any previous errors */
+    if (stream->state == STREAM_DONE) {
+        log_error("Transfer completed while stream was in DONE state.\n");
+        goto out;
+    }
+
     transfer_i = transfer_idx(stream_data, transfer);
     assert(stream_data->transfer_status[transfer_i] == TRANSFER_IN_FLIGHT ||
            stream_data->transfer_status[transfer_i] == TRANSFER_CANCEL_PENDING);
@@ -978,10 +993,20 @@ static void LIBUSB_CALL lusb_stream_cb(struct libusb_transfer *transfer)
         if (stream_data->num_avail == stream_data->num_transfers) {
             stream->state = STREAM_DONE;
         } else {
-            cancel_all_transfers(stream);
+            if (cancel_all_transfers(stream) != 0) {
+                /* We've observed that if a request to cancel a transfer
+                 * occurs, we may never get a callback. In this case,
+                 * we'll just head to the STREAM_DONE state, since
+                 * there's not much left we can do.
+                 */
+                stream->state = STREAM_DONE;
+                log_info("Forcibly shutting down stream due to failure to "
+                         "cancel transfer(s).\n");
+            }
         }
     }

+out:
     MUTEX_UNLOCK(&stream->lock);
 }

@@ -1210,7 +1235,17 @@ static int lusb_stream(void *driver, struct bladerf_stream *stream,
              * have libusb fire off callbacks with the cancelled status*/
             if (status < 0) {
                 stream->error_code = status;
-                cancel_all_transfers(stream);
+
+                if (cancel_all_transfers(stream) != 0) {
+                    /* We've observed that if a request to cancel a transfer
+                     * occurs, we may never get a callback. In this case,
+                     * we'll just head to the STREAM_DONE state, since
+                     * there's not much left we can do.
+                     */
+                    stream->state = STREAM_DONE;
+                    log_info("Forcibly shutting down stream due to failure to "
+                            "cancel transfer(s).\n");
+                }
             }
         }
     }

However, this is yielding segfaults; more digging and review of the code paths in this condition are required.

staticfloat commented

Just wanted to throw in and say that I was also seeing this on OS X. I have been experimenting with creating a tool similar to rtl_power using pybladeRF (via my own, slightly updated fork). My first attempt at scanning over frequencies quickly was to open, stream, and close, then open again with a different frequency; I will need to see if I can force it to change frequencies instead of opening and closing.

jynik (Contributor, Author) commented Dec 22, 2015

@staticfloat I can now reproduce this on all OSes via the libbladeRF_repeated_stream test program. In the past I did not see this, but I am now seeing it even on very early changesets that I do not recall exhibiting the problem. After the upcoming RC1, I am going to dive further into this.

This is what causes the lockup in DC calibration table generation. For the time being, I am working around it by using a single stream (rather than repeatedly opening and closing one) and using timestamps to determine when to receive samples post-retune. That approach may work well for you.
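
For reference, the single-stream approach looks roughly like the sketch below: one long-lived metadata-format stream, with each read scheduled some margin past the retune. The 5000-tick margin (~1 ms at 5 Msps) is illustrative, not a tuned value:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <libbladeRF.h>

/* Sweep frequencies on a single stream instead of open/stream/close */
static void sweep(struct bladerf *dev, int16_t *samples)
{
    unsigned int f;

    bladerf_sync_config(dev, BLADERF_MODULE_RX, BLADERF_FORMAT_SC16_Q11_META,
                        16, 8192, 8, 3500);
    bladerf_enable_module(dev, BLADERF_MODULE_RX, true);

    for (f = 400000000; f <= 500000000; f += 1000000) {
        struct bladerf_metadata meta;
        uint64_t now;

        bladerf_set_frequency(dev, BLADERF_MODULE_RX, f);

        /* Schedule the read past the retune; with flags == 0,
         * bladerf_sync_rx() waits until meta.timestamp is reached. */
        bladerf_get_timestamp(dev, BLADERF_MODULE_RX, &now);
        memset(&meta, 0, sizeof(meta));
        meta.timestamp = now + 5000;
        meta.flags     = 0;

        bladerf_sync_rx(dev, samples, 8192, &meta, 3500);
    }

    bladerf_enable_module(dev, BLADERF_MODULE_RX, false);
}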

jynik added a commit that referenced this issue Dec 28, 2015

This code is a re-designed set of DC calibration functionality
built atop libbladeRF.

This functionality provides the ability to:
    - Run LMS6002D DC calibrations per module (or all at once)

    - Search for nominal RX/TX DC correction values for a single
      frequency or a range of frequencies.

Issue #408 is avoided by utilizing a single stream, rather than
repeatedly starting and stopping streams.

No TX thread is used, as this is unnecessary; it is sufficient to
flush 0+0j samples through the TX path at the start of calibration
and allow the DAC to remain held in this state as the FPGA
(intentionally) underruns.
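
The "flush 0+0j" step amounts to a single synchronous TX of zero-valued samples; once the FPGA underruns, the DAC is left holding zero. A rough sketch, with illustrative buffer sizing:

#include <stdbool.h>
#include <stdint.h>
#include <libbladeRF.h>

/* Push one buffer of 0+0j through TX, then let the FPGA underrun so
 * the DAC holds zero for the remainder of calibration. */
static int flush_tx_zeros(struct bladerf *dev)
{
    static int16_t zeros[2 * 4096];    /* 4096 zero-valued SC16 Q11 samples */
    int status;

    status = bladerf_sync_config(dev, BLADERF_MODULE_TX,
                                 BLADERF_FORMAT_SC16_Q11, 8, 4096, 4, 3500);
    if (status != 0) {
        return status;
    }

    status = bladerf_enable_module(dev, BLADERF_MODULE_TX, true);
    if (status != 0) {
        return status;
    }

    return bladerf_sync_tx(dev, zeros, 4096, NULL, 3500);
}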
rtucker closed this as completed Aug 2, 2017
jcmartin added a commit to jcmartin/bladeRF that referenced this issue May 18, 2023
…transfer

is called on one transfer, all transfers on the same endpoint are
cancelled. Calling libusb_cancel_transfer on additional transfers on the
same endpoint _may_ result in macOS returning kIOReturnAborted. The net
result is that the bladeRF can no longer be communicated with.

The fix to this is to only call libusb_cancel_transfer once in
cancel_all_transfers and set the appropriate state for all transfers.
Refer to the libusb_cancel_transfer documentation for more details.
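
In terms of the cancel_all_transfers() shown earlier in this thread, the described fix amounts to something like the following sketch (illustrative; refer to the actual commit for the real change):

/* On macOS, canceling one transfer aborts every transfer on the same
 * endpoint, so issue libusb_cancel_transfer() only once and mark the
 * remaining in-flight transfers as cancel-pending ourselves. */
static inline void cancel_all_transfers(struct bladerf_stream *stream)
{
    size_t i;
    bool cancel_issued = false;
    struct lusb_stream_data *stream_data = stream->backend_data;

    for (i = 0; i < stream_data->num_transfers; i++) {
        if (stream_data->transfer_status[i] == TRANSFER_IN_FLIGHT) {
            if (!cancel_issued) {
                libusb_cancel_transfer(stream_data->transfers[i]);
                cancel_issued = true;
            }

            /* These will still come back with a cancelled status,
             * without any further cancel requests. */
            stream_data->transfer_status[i] = TRANSFER_CANCEL_PENDING;
        }
    }
}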