Navigator running erratic or very slow #13536

julianoes · 2019-11-20T11:03:03Z

Describe the bug
We had a flight where the vehicle_status.nav_state was set to NAVIGATION_STATE_AUTO_RTL for about 0.6 seconds and no message or triplet from navigator was published, as if it was not running.

In the logfile below there are times where RTL should have been engaged by navigator because the nav state was set:

279 s / 7.02 --> No reaction from navigator
280.5 s --> RTL actions were taken again
284.5 s --> RTL actions were taken again

To Reproduce
Not sure.

Expected behavior
RTL should have been initiated.

Log Files and Screenshots
https://logs.px4.io/plot_app?log=7c4e2956-6ab5-409f-a2ac-0b908652a1de

Additional context
Navigator is polling on local_position and there are regular updates of this topic.

We need to investigate the effects of accessing the SD card in some of the navigator modes when not inactive. E.g. Mission and Rtl modes check mission items respective rally points even if not active! Also, we check for geofences set in every iteration.

FYI @MaEtUgR

The text was updated successfully, but these errors were encountered:

julianoes · 2019-11-20T13:30:29Z

I just tried in SITL and did some rough measurements about what navigator is doing just sitting in manual:

no waypoints and no rally points set: ~ 22 dm_read per second.
no waypoints and 4 rally points set: ~ 94 dm_read per second.
10 waypoints and 4 rally points set: ~ 94 dm_read per second, 174 dm_read right after upload.
10 waypoints and 4 rally points, and 4 geofence points: ~ 130 dm_read per second.

FYI @sfuhrer. We'll have to look into this.

dagar · 2019-11-20T15:41:16Z

RTL - constantly re-reading the safepoints when inactive

https://github.com/PX4/Firmware/blob/4ce03dfc1e77425f4ccab6c67f3d357e29127e18/src/modules/navigator/rtl.cpp#L61

One potential quick fix would be to add something to throttle RTL::find_closest_landing_point().

https://github.com/PX4/Firmware/blob/master/src/modules/navigator/rtl.cpp#L66

sfuhrer · 2019-11-20T15:53:11Z

@julianoes yes we should look into this. Let's though first get in #13320 . It's mainly about enabling the rally points also in other than RTL_TYPE = 3 mode. Note that the flight with which we had the issue was actually already with this PR.

sfuhrer · 2019-11-20T16:04:29Z

@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?

dagar · 2019-11-20T18:51:24Z

@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?

We don't really have a mechanism for that, but something similar to mission (https://github.com/PX4/Firmware/blob/master/msg/mission.msg) would probably make sense.

In this case it's a bit different though because it's rechecking based on current position.

Stifael · 2019-11-21T09:03:20Z

not sure if this info helps: we also have large delays with a merge that is about 3/4 years behind current master. We have logs where there is more than a second delay between vehicle_status and triplet message.

julianoes · 2019-11-21T10:59:38Z

we also have large delays with a merge that is about 3/4 years behind current master.

@dagar this hints that it has nothing to do with the recent addition of rally points or even your bigger architectural changes recently. It sounds more like something off with how/when navigator runs.

julianoes · 2019-11-21T11:12:34Z

@Stifael when this happened, did you have a geofence set, and or a mission?

Stifael · 2019-11-21T11:47:52Z

The people in kunshan reproduced the error several times by:

load mission
Manual takeoff
Switch into mission

Not sure if geofence was active. Let me check

Stifael · 2019-11-21T11:53:18Z

Not sure if geofence was active. Let me check

Geofence was set to default, which is Warning.

julianoes · 2019-11-21T13:23:25Z

@Stifael right but were there geofence items set? This would cause SD card access.

julianoes · 2019-11-21T13:23:59Z

This is in HITL:

navigator: 134324 events, 231147587us elapsed, 1720.82us avg, min 14us max 1401415us 26976.525us rms

I read this as 1.4 seconds worst case.

Stifael · 2019-11-21T14:07:49Z

@Stifael right but were there geofence items set? This would cause SD card access.

How do I see that from the log?

julianoes · 2019-11-21T14:33:23Z

How do I see that from the log?

I don't think you do. But it doesn't matter. I can reproduce it without a geofence set. According to my profiling using perf_counters it seems like the dm_read() call is what takes so much time.

julianoes · 2019-11-21T15:11:01Z

@dagar @bkueng @davids5 I narrowed the call talking in the order of 1s to:

https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L412-L413

more specifically the actual read() call:
https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L639

I can reproduce it in HITL on a Pixracer flying a mission. It probably happens without flying a mission, the main point is that position is initialized and navigator runs.

Edit: I could not reproduce read times above about 5ms when logger is not running!

dagar · 2019-11-21T15:44:50Z

The safe points are tiny, let's just keep a full copy of them in RTL.

Then let's add another orb message with full dataman metadata so we can limit updates.

After that I'd still like to see what the remaining dataman access pattern is like for typical missions. Some simple caching and prefetching in dataman might be interesting.

julianoes · 2019-11-21T15:47:46Z

@bkueng just mentioned that this could be happening with cheaper SD cards, and indeed I can reproduce it with a cheap SanDisk EDGE 8GB class 4 (which presumably comes with Pixhawk 4) but I don't see reads above 11ms with a more expensive SanDisk Extreme PRO U3 card.

julianoes · 2019-11-21T15:49:59Z

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

dagar · 2019-11-21T16:36:28Z

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

It's both. For RTL we can store all the safe points in a small array. In geofence (IIRC) the polygon is already stored, but dataman is still checked constantly for an update. This is the part we get rid of with something similar to ORB_ID(mission) (https://github.com/PX4/Firmware/blob/master/msg/mission.msg). The mission message could turn into a more general dataman message that includes info about all the dataman entries, or we could add separate messages for each type (safe points, fence, waypoints).

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

Dataman is already like a poor man's async. Caching the first entry (where safe points and geofence store the number of items) might be a quick way to eliminate a lot of the disk reads (similar to the orb idea).

dagar · 2019-11-21T18:03:15Z

TODO: review dataman locking. We don't seem to be using it consistently.

stale · 2020-02-20T02:13:06Z

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

MikeCharikov · 2020-10-29T08:15:33Z

Regarding geofence. We performed some tests with 60-point poly fence. Read+calculations takes from 30ms to 200ms on different UAVs. Looks like it depends on SD card condition. But 200ms is pretty high value.

@dagar Do you have any plans to rework this logic?

julianoes · 2020-10-29T13:20:58Z

I have plans but no time. @dagar might know better.

JacobCrabill · 2021-05-12T16:57:43Z

TODO: review dataman locking. We don't seem to be using it consistently.

This didn't age well....

I'll be taking at look at as time allows within the next few weeks to see if any changes on master have fixed this issue since it was first noted here. Hopefully it's fixed 🤞🏻; I'll post again if not

julianoes added the bug label Nov 20, 2019

dagar mentioned this issue Nov 20, 2019

dataman: add read and write perf counters #13539

Merged

sfuhrer mentioned this issue Nov 21, 2019

Rally Points: enable rally points always, independent of RTL_TYPE set… #13320

Merged

julianoes self-assigned this Nov 21, 2019

stale bot added the stale label Feb 20, 2020

stale bot removed the stale label Oct 29, 2020

julianoes removed their assignment Oct 29, 2020

MikeCharikov mentioned this issue Nov 12, 2020

If SD-card becomes unreadable during the flight Geofence Violation occurs (for poly fences) #16173

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Navigator running erratic or very slow #13536

Navigator running erratic or very slow #13536

julianoes commented Nov 20, 2019

julianoes commented Nov 20, 2019 •

edited

Loading

dagar commented Nov 20, 2019

sfuhrer commented Nov 20, 2019

sfuhrer commented Nov 20, 2019

dagar commented Nov 20, 2019

Stifael commented Nov 21, 2019 •

edited

Loading

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019

Stifael commented Nov 21, 2019

Stifael commented Nov 21, 2019

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019

Stifael commented Nov 21, 2019 •

edited

Loading

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019 •

edited

Loading

dagar commented Nov 21, 2019

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019

dagar commented Nov 21, 2019

dagar commented Nov 21, 2019

stale bot commented Feb 20, 2020

MikeCharikov commented Oct 29, 2020

julianoes commented Oct 29, 2020

JacobCrabill commented May 12, 2021

Navigator running erratic or very slow #13536

Navigator running erratic or very slow #13536

Comments

julianoes commented Nov 20, 2019

julianoes commented Nov 20, 2019 • edited Loading

dagar commented Nov 20, 2019

RTL - constantly re-reading the safepoints when inactive

sfuhrer commented Nov 20, 2019

sfuhrer commented Nov 20, 2019

dagar commented Nov 20, 2019

Stifael commented Nov 21, 2019 • edited Loading

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019

Stifael commented Nov 21, 2019

Stifael commented Nov 21, 2019

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019

Stifael commented Nov 21, 2019 • edited Loading

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019 • edited Loading

dagar commented Nov 21, 2019

julianoes commented Nov 21, 2019

julianoes commented Nov 21, 2019

dagar commented Nov 21, 2019

dagar commented Nov 21, 2019

stale bot commented Feb 20, 2020

MikeCharikov commented Oct 29, 2020

julianoes commented Oct 29, 2020

JacobCrabill commented May 12, 2021

julianoes commented Nov 20, 2019 •

edited

Loading

Stifael commented Nov 21, 2019 •

edited

Loading

Stifael commented Nov 21, 2019 •

edited

Loading

julianoes commented Nov 21, 2019 •

edited

Loading