Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve robustness to bad and lost airspeed data #10733

Closed
wants to merge 18 commits into from

Conversation

priseborough
Copy link
Contributor

@priseborough priseborough commented Oct 22, 2018

Description

Added in response to #9869

Enables detection and recovery from disconnected and blocked airspeed sensors.

Uses a combined airspeed innovation check and normal load factor sanity check.

False positives to bad EKF operation are reduced by requiring that GPS and magnetomer innovations are within limits from other sensors are within limits before declaring airspeed innovations bad.

The normal load factor sanity check will fail the airspeed sensor if the normal load factor sustains a value that is greater than what is physically possible given the measured airspeed.

The feature is off by default and controlled by the following parameters:

COM_ASPD_FS_ACT
Sets failsafe action when bad airspeed measurements are detected:
0 do nothing
1 log a message
2 log a message, warn the user
3 log a message, warn the user, switch to non-airspeed TECS mode
4 log a message, warn the user, switch to non-airspeed TECS mode, switch to Return mode

COM_ASPD_STALL
This is the minimum indicated airspeed at which the wing can produce 1g of lift. It is used by the airspeed sensor fault detection and failsafe calculation to detect a significant airspeed low measurement error condition and should be set based on flight test for reliable operation. The failsafe response is controlled by the COM_ASPD_FS_ACT parameter.

COM_TAS_FS_INNOV
This scales the minimum airspeed innovation level required to trigger a failsafe. Increase to make the check less sensitive, decrease to make it more sensitive. The default value of 1.0 uses the same threshold as the EKF does for airspeed acceptance.

COM_TAS_FS_T1
Number of seconds checks need to fail before an airspeed fault is declared and any failsafe response is actioned. Note - if both

COM_TAS_FS_T2
Number of seconds checks need to pass before an airspeed fault is cleared.

Test data / coverage
Preliminary testing in SITL was performed by modifying it here :https://github.com/PX4/sitl_gazebo/blob/3d80f63562c24c1537e1f8423ce649c99ebc15ea/src/gazebo_mavlink_interface.cpp#L549-L555 to fail the airspeed at specified times or heights, eg

// block airspeed sensor immediately after takeoff
// clear blockage afer 30 seconds
static common::Time fail_time = 0.0;
static bool faulty = false;
if (faulty || ((pos_n.Z() < -5.0f) && (fail_time == 0.0))) {
sensor_msg.diff_pressure = 0.0f;
if (!faulty) {
fail_time = world_->GetSimTime();
faulty = true;
}
}
if ((world_->SimTime() - fail_time) > 30.0) {
faulty = false;
}

SITL test logs
COM_ASPD_FS_ACT = 4
COM_ASPD_STALL = 10
COM_TAS_FS_INNOV = 1.0
COM_TAS_FS_T1 = 3
COM_TAS_FS_T2 = 100

Pitot permanently blocked immediately after takeoff.
https://logs.px4.io/plot_app?log=569f7887-378a-465d-87fe-2e22fbd1163b

Pitot blocked immediately after takeoff and unblocked after 30 sec. Airspeed fault status eventually clears, but vehicle remains in RTL loiter as expected.
https://logs.px4.io/plot_app?log=b42fbf34-25aa-42d6-ab16-52baf613a81a

Flight testing will be required.

@dagar dagar self-requested a review October 22, 2018 13:37
@dagar
Copy link
Member

dagar commented Oct 22, 2018

What about airspeed in the fixed wing attitude controller (gain scheduling)? Either pause the airspeed update preserving the current/last value or mark it invalid to force the gain scheduling to use trim airspeed.

hrt_abstime _time_tas_good_declared{0}; /**< time TAS use was started (uSec) */
hrt_abstime _time_tas_bad_declared{0}; /**< time TAS use was stopped (uSec) */
hrt_abstime _time_last_airspeed{0}; /**< time last airspeed measurement was received (uSec) */
char *_airspeed_fault_type = new char[7];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this with flags instead of strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but that will take more time than I have to spend on this PR at this point in time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reading your comment, I may have misunderstood it first time. Can you describe what you mean by using flags and why the objection to using strings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of something like using more status flag booleans for each of the possibilities instead of strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this uses some more memory with the 7 character class variable, however it avoids the necessity to duplicate branching wherever the message is printed. If you have an idea how this can be avoided, please provide an outline. TXS

@dagar
Copy link
Member

dagar commented Oct 22, 2018

The behaviour should probably be dependant on the vehicle control mode so that RTL isn't engaged in manual or stabilized mode.


# airspeed fault and airspeed use status
bool aspd_check_failing # true when airspeed consistency checks are failing
bool aspd_fault_declared # true when an airspeed fault has been declared
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we think about handling this similarly to position and velocity checks? https://github.com/PX4/Firmware/blob/master/msg/vehicle_status_flags.msg#L10

  • condition_airspeed_valid
  • similar probation logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is also publishing separate booleans so I don't understand the point you are making.

status.aspd_use_inhibit = true;
status.aspd_fail_rtl = true;
// let us send the critical message even if already in RTL
if (TRANSITION_DENIED != main_state_transition(status, commander_state_s::MAIN_STATE_AUTO_RTL, status_flags, &internal_state)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only if in an airspeed mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment.

@Antiheavy
Copy link
Contributor

We will be available to help flight test this starting in a couple weeks.

@priseborough
Copy link
Contributor Author

This has been addressed by
1bbec34

The behaviour should probably be dependant on the vehicle control mode so that RTL isn't engaged in manual or stabilized mode.

@priseborough
Copy link
Contributor Author

Jenkins is failing on the omnibus-f4sd target build due to lack of flash space.

@Antiheavy
Copy link
Contributor

latest flight test: https://review.px4.io/plot_app?log=672b97f6-1ba8-4426-a07a-aa51925253bb

Firmware using this commit: 5600afe

Failsafe still not triggering, _load_factor_ratio still stuck at 0 the whole time.

@Antiheavy
Copy link
Contributor

Great progress, but there are a couple open issues/bugs:

1) Return mode does not send/log proper warnings (should send/log all messages):

Return mode correctly triggers in this log (COM_ASPD_FS_ACT set to 4), but PX4 doesn't send down and log the proper RTL warning messages.
https://review.px4.io/plot_app?log=0019719c-079b-4a62-9581-93e95ba955f0

2) Silently switches to non-airspeed TECS mode in some cases (should not silently switch):

Negative airspeed: The pitot is fully blocked and airspeed is reading negative. There is no indication the airspeed failsafe triggered, however, the vehicle seems to fly normally, possibly in non-airspeed TECS mode?

Full block test. I messed up and had COM_ASPD_FS_ACT set to 0. However for some reason the plane flew as if it had failsafed?
https://review.px4.io/plot_app?log=9be15b64-869b-4ccf-95f9-d2d9327393d9
Same as above. Confusing behavior...
https://review.px4.io/plot_app?log=6a671670-93f4-4c00-9e3b-a9c277e34b68

Killing airspeed driver (simulating airspeed sensor electrical fault): We attempted to stop the airspeed with the console, but it was not registered. Airspeed stoppped, but the vehicle kept flying well. What is happening here? There is no information/feedback. I feel the failsafe should be the mechanism for the TECS switch. Either way there shouldn't be silent switching.
https://review.px4.io/plot_app?log=fa18a869-1a7b-4a5f-88ef-8dd5e6c22e79

I think we will be in really good shape if we can address these issues and then do some more flight testing to establish good default parameters.

@priseborough
Copy link
Contributor Author

1) Return mode does not send/log proper warnings (should send/log all messages):

Return mode correctly triggers in this log (COM_ASPD_FS_ACT set to 4), but PX4 doesn't send down and log the proper RTL warning messages.
https://review.px4.io/plot_app?log=0019719c-079b-4a62-9581-93e95ba955f0

I can see a " AIRSPEED DATA FAULTY - stopping use and returning" warning message in the log:
screen shot 2019-02-19 at 6 08 57 pm

@Antiheavy
Copy link
Contributor

I can see a " AIRSPEED DATA FAULTY - stopping use and returning" warning message in the log:

yes, however, the RTL messages seem to be bypassed somehow. Some examples of bypassed messages:
https://github.com/PX4/Firmware/blob/5bcd7c0a0d17c3f62b15bc32371486874f05f2bb/src/modules/navigator/rtl.cpp#L116
https://github.com/PX4/Firmware/blob/5bcd7c0a0d17c3f62b15bc32371486874f05f2bb/src/modules/navigator/rtl.cpp#L161
https://github.com/PX4/Firmware/blob/5bcd7c0a0d17c3f62b15bc32371486874f05f2bb/src/modules/navigator/rtl.cpp#L190

@dagar might understand why these are being bypassed?

@priseborough
Copy link
Contributor Author

OK, so you want a separate RTL message. Did the " AIRSPEED DATA FAULTY - stopping use and returning" message appear on the GCS during flight?

@priseborough
Copy link
Contributor Author

What if the message sent to the ground and logged read: "RTL: AIRSPEED DATA FAULTY" ?

@Antiheavy
Copy link
Contributor

OK, so you want a separate RTL message. Did the " AIRSPEED DATA FAULTY - stopping use and returning" message appear on the GCS during flight?

If I recall correctly, yes, but it was a while ago we did the test. Also, QGC audibly announces the switch to RETURN mode. The issue is that the other RTL related messages that normally are logged (and show up in console) are being improperly bypassed or suppressed or something.

@Antiheavy
Copy link
Contributor

What if the message sent to the ground and logged read: "RTL: AIRSPEED DATA FAULTY" ?

If the goal is to reduce characters <50 I prefer: "Airspeed data faulty, stopping use and returning"

@Antiheavy
Copy link
Contributor

Killing airspeed driver (simulating airspeed sensor electrical fault): We attempted to stop the airspeed with the console, but it was not registered.

after some more ground testing is seems that after stopping the airpseed driver, restarting it does not allow the airspeed sensor/estimate to fully reinitialize. It seems to require an actual full system reboot. I'm not sure if or how this might factor in with the failsafe. Just commenting on it in case it does factor.

@priseborough
Copy link
Contributor Author

priseborough commented Feb 26, 2019

@dagar I ran a test in SITL to test the negative pressure case with COM_ASPD_FS_ACT = 4, see https://logs.px4.io/plot_app?log=51e7247d-fc91-4ca3-bd83-076a5c0c056e

The Mavlink GCS console received the following messages:

WARN [commander] AIRSPEED DATA FAULTY - stopping use and returning
INFO [navigator] RTL HOME activated

However only the 'WARN [commander] AIRSPEED DATA FAULTY - stopping use and returning' message was logged to flash.

The 'RTL HOME activated' message is produced here https://github.com/priseborough/Firmware/blob/pr-airSpdFailSafe/src/modules/navigator/navigator_main.cpp#L679

However the name of the function is misleading as it does not log.

Edit:

@Antiheavy I do not think this is a case of the RTL message being bypassed. It is being set here: https://github.com/priseborough/Firmware/blob/pr-airSpdFailSafe/src/modules/navigator/navigator_main.cpp#L679

@dagar Do you have suggestions on how we can ensure that that call results in a message being logged ad not just sent to console and to ground over Mavlink.

@priseborough
Copy link
Contributor Author

I've done a first pass through the
https://review.px4.io/plot_app?log=9be15b64-869b-4ccf-95f9-d2d9327393d9 log and TECS is using an airspeed midway between the high and low value internally and running in the airspeed invalid mode, eg:

screen shot 2019-02-27 at 7 52 34 pm

This is because the negative airspeed has triggered another protection mechanism here: https://github.com/priseborough/Firmware/blob/pr-airSpdFailSafe/src/modules/fw_pos_control_l1/FixedwingPositionControl.cpp#L385

This is a legacy mechanism and always runs.

@dagar
Copy link
Member

dagar commented Feb 27, 2019

@dagar Do you have suggestions on how we can ensure that that call results in a message being logged ad not just sent to console and to ground over Mavlink.

PX4_WARN() and PX4_ERR() are messages go to the console and are logged. PX4_INFO() only goes to the console.

mavlink_log_emergency() == PX4_ERR + mavlink (emergency priority)
mavlink_log_critical() == PX4_WARN + mavlink (critical priority)
mavlink_and_console_log_info() == PX4_INFO + mavlink

So if you want messages that are visible on the console, mavlink, and logged they need to be critical (mavlink_log_critical()) or emergency (mavlink_log_emergency()).

@Antiheavy
Copy link
Contributor

@Antiheavy
Copy link
Contributor

This is a legacy mechanism and always runs.

Can we roll this into the failsafe functionality such that the the appropriate user notification and action is performed (set by COM_ASPD_FS_ACT)? COM_TAS_FS_T1 should still be respected for any failsafe action. What would be a good approach here?

@priseborough
Copy link
Contributor Author

This is a legacy mechanism and always runs.

Can we roll this into the failsafe functionality such that the the appropriate user notification and action is performed (set by COM_ASPD_FS_ACT)? COM_TAS_FS_T1 should still be respected for any failsafe action. What would be a good approach here?

Protection against using a negative or NaN airspeed is something the fixed wing position controller (TECS) always requires and should not be something that should be bypassed by a parameter and therefore is not something that I would delegate totally to the commander.

I will add a check for this to the commander so that there will be immediate setting of aspd_use_inhibit to true, followed by a notification after 1 second.

@priseborough
Copy link
Contributor Author

The switch to RTL is always logged as part of the vehicle_status message, so we need to decide on what circumstances we actually need need an additional console message over and above a fault notification.

@priseborough
Copy link
Contributor Author

Following discussions with @Antiheavy I have added a parameter to enable delaying the return into RTL. This will prevent unnecessary mission failure if wind shear or other wind estimation errors cause the EKF innovation check to temporarily fail. This will allow the mission to continue for a few minutes using the TECS pitch to throttle, albeit at a less efficient flight condition whilst giving a change for wind estimates to re-converge. The following figure shows a flight where launch and wind conditions caused a temporary wind estimation failure.

Screen Shot 2019-03-11 at 12 35 45 pm

This wind speed estimation error in the North direction was associated with a large state variance until sufficient turns had been completed, eg:

Screen Shot 2019-03-11 at 12 43 48 pm

I will investigate the possibility of delaying use of airspeed innovation checks until wind state variances have converged.

* @max 2.0
* @group Commander
*/
PARAM_DEFINE_FLOAT(COM_TAS_FS_INNOV, 1.0f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on flight testing and review of flight data, we feel like 1.0 causes too many false positives for COM_TAS_INNOV. It is one thing to throw a flag indicating airspeed readings aren't in agreement with expected values and it is another to declare them bad enough to trigger a failsafe. Hence, I suggested the default for COM_TAS_FS_INNOV = 2.0. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be acceptable because we still have the load factor test. What value did the airspeed innovation ratio peak at for the false positives you experienced?

Copy link
Contributor

@Antiheavy Antiheavy Apr 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes the airspeed innovations can get pretty large, especially in windy conditions with a lot of turning.

Here are two flight logs with a lot of false positives. Airspeed was generally working, might be not calibrated perfectly, but it was reasonable. The biggest contributor here is high winds. You will see that we were changing params mid-flight to test the sensitivity to false positives.
This flight was flown in wind speeds exceeding the specified limit of the airframe:
https://review.px4.io/plot_app?log=d9a42e33-32b2-4530-ad71-bc9d917465f6
image

This flight was somewhat less windy, airspeed innovation test ratios are not as bad, but still exceed 1.0 regularly even though airspeed is functioning properly.
https://review.px4.io/plot_app?log=fa18a869-1a7b-4a5f-88ef-8dd5e6c22e79
image

In this short flight, the learned wind vector start out incorrect and caused a different kind of false positive.
https://review.px4.io/plot_app?log=8b9df1c8-82ca-417b-9f76-30fde4dd190a

image

Here is another log from an actual crash. This pitot had some moisture inside it and cause the airspeed to read "something", but it was not correct. This vehicle's firmware was similar to v1.8.0, so it did not have the benefit of this PR. You can see the airspeed test ratio slowly grow to peak at ~3.75.
https://review.px4.io/plot_app?log=1f6bc663-44e5-4bbc-8ecf-6f3fe4e51d27
image

note: airspeed is scaled by 0.1 in the above plots.

Edit: I fixed the plots above to show the tas_test_ratio instead of the airspeed_innov since it is actually the test_ratio that we care about. I updated the text above as well.
Edit 2: I added another example above of a real crash caused by a blocked pitot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Antiheavy Thanks for sharing those.

Besides increasing the threshold we can also reduce false positives by integrating the error when above the threshold so that larger transient conditions do not trigger, but a smaller persistent condition does.

I will have a go at this and also look at the log where the wind vector initialised incorrectly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides increasing the threshold we can also reduce false positives by integrating the error when above the threshold so that larger transient conditions do not trigger, but a smaller persistent condition does.

@priseborough this is a neat idea, but maybe it isn't worth the effort. It could be worth it if the solution allowed us to eliminate the need to parameterize COM_TAS_FS_INNOV, making this feature simpler for users/integrators. However, if we need a parameter to tune COM_TAS_FS_INNOV for different types of vehicles and users, then we might as well keep it a bit simpler to understand as a simple scaler on the airspeed test ratio. I feel with COM_ASPD_FS_DLY the solution is equally "safe" even with some false positives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory with a sensible integral error threshold we should be able to have the same COM_TAS_FS_INNOV for different vehicles because that is based on the innovation test ratio which if above 1 indicates the airspeed has been rejected by the EKF.

However in practice there will be many users that are flying with badly calibrated airspeed installations that are likely to fail the test. For example see the ones supplied by a FW tester for the 1.9 release here: #11785 where there was no awareness of the large impact of static source position error. That log would have triggered any sensible threshold, eg:

Screen Shot 2019-04-22 at 9 32 47 am

However in this example I think the user should have been alerted that they have a bad calibration. One way around this is for the initial implementation would be to use the innovation check in a 'warning only' capacity and relay on the load factor check for the failsafe.

I will implement the 'integral' check method and provide a link to it for further discussion here: #11846

@LorenzMeier
Copy link
Member

@RomanBapst Could you take a pass here and review / comment? If possible I'd love to have this in 1.9, but it might be too big to bring in.

@dagar
Copy link
Member

dagar commented Apr 14, 2019

Attempted rebase got a bit nasty. I squashed the branch, manually resolved the conflicts, and pushed to pr-airSpdFailSafe off of the main repo.

Let's continue in #11846.

@dagar dagar closed this Apr 14, 2019
@Antiheavy
Copy link
Contributor

1) Return mode does not send/log proper warnings (should send/log all messages):

Return mode correctly triggers in this log (COM_ASPD_FS_ACT set to 4), but PX4 doesn't send down and log the proper RTL warning messages.
https://review.px4.io/plot_app?log=0019719c-079b-4a62-9581-93e95ba955f0

After talking with @dagar I now undertand that the logged RTL messages are working as intended. This item can be set aside for a separate and unrelated discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants