Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Navigator running erratic or very slow #13536

Open
julianoes opened this issue Nov 20, 2019 · 24 comments
Open

Navigator running erratic or very slow #13536

julianoes opened this issue Nov 20, 2019 · 24 comments
Labels

Comments

@julianoes
Copy link
Contributor

Describe the bug
We had a flight where the vehicle_status.nav_state was set to NAVIGATION_STATE_AUTO_RTL for about 0.6 seconds and no message or triplet from navigator was published, as if it was not running.

In the logfile below there are times where RTL should have been engaged by navigator because the nav state was set:

  • 279 s / 7.02 --> No reaction from navigator
  • 280.5 s --> RTL actions were taken again
  • 284.5 s --> RTL actions were taken again

To Reproduce
Not sure.

Expected behavior
RTL should have been initiated.

Log Files and Screenshots
https://logs.px4.io/plot_app?log=7c4e2956-6ab5-409f-a2ac-0b908652a1de

rtl times

Additional context
Navigator is polling on local_position and there are regular updates of this topic.

We need to investigate the effects of accessing the SD card in some of the navigator modes when not inactive. E.g. Mission and Rtl modes check mission items respective rally points even if not active! Also, we check for geofences set in every iteration.

FYI @MaEtUgR

@julianoes julianoes added the bug label Nov 20, 2019
@julianoes
Copy link
Contributor Author

julianoes commented Nov 20, 2019

I just tried in SITL and did some rough measurements about what navigator is doing just sitting in manual:

  • no waypoints and no rally points set: ~ 22 dm_read per second.
  • no waypoints and 4 rally points set: ~ 94 dm_read per second.
  • 10 waypoints and 4 rally points set: ~ 94 dm_read per second, 174 dm_read right after upload.
  • 10 waypoints and 4 rally points, and 4 geofence points: ~ 130 dm_read per second.

FYI @sfuhrer. We'll have to look into this.

@dagar
Copy link
Member

dagar commented Nov 20, 2019

RTL - constantly re-reading the safepoints when inactive

https://github.com/PX4/Firmware/blob/4ce03dfc1e77425f4ccab6c67f3d357e29127e18/src/modules/navigator/rtl.cpp#L61

One potential quick fix would be to add something to throttle RTL::find_closest_landing_point().

https://github.com/PX4/Firmware/blob/master/src/modules/navigator/rtl.cpp#L66

@sfuhrer
Copy link
Contributor

sfuhrer commented Nov 20, 2019

@julianoes yes we should look into this. Let's though first get in #13320 . It's mainly about enabling the rally points also in other than RTL_TYPE = 3 mode. Note that the flight with which we had the issue was actually already with this PR.

@sfuhrer
Copy link
Contributor

sfuhrer commented Nov 20, 2019

@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?

@dagar
Copy link
Member

dagar commented Nov 20, 2019

@dagar guess we should change that RTL::find_closest_landing_point() is only executed when the rally points are updated, right? Could you point me to the easiest way to do that?

We don't really have a mechanism for that, but something similar to mission (https://github.com/PX4/Firmware/blob/master/msg/mission.msg) would probably make sense.

In this case it's a bit different though because it's rechecking based on current position.

@Stifael
Copy link
Contributor

Stifael commented Nov 21, 2019

not sure if this info helps: we also have large delays with a merge that is about 3/4 years behind current master. We have logs where there is more than a second delay between vehicle_status and triplet message.

@julianoes
Copy link
Contributor Author

we also have large delays with a merge that is about 3/4 years behind current master.

@dagar this hints that it has nothing to do with the recent addition of rally points or even your bigger architectural changes recently. It sounds more like something off with how/when navigator runs.

@julianoes
Copy link
Contributor Author

@Stifael when this happened, did you have a geofence set, and or a mission?

@Stifael
Copy link
Contributor

Stifael commented Nov 21, 2019

The people in kunshan reproduced the error several times by:

  1. load mission
  2. Manual takeoff
  3. Switch into mission

Not sure if geofence was active. Let me check

@Stifael
Copy link
Contributor

Stifael commented Nov 21, 2019

Not sure if geofence was active. Let me check

Geofence was set to default, which is Warning.

@julianoes
Copy link
Contributor Author

@Stifael right but were there geofence items set? This would cause SD card access.

@julianoes
Copy link
Contributor Author

This is in HITL:

navigator: 134324 events, 231147587us elapsed, 1720.82us avg, min 14us max 1401415us 26976.525us rms

I read this as 1.4 seconds worst case.

@julianoes julianoes self-assigned this Nov 21, 2019
@Stifael
Copy link
Contributor

Stifael commented Nov 21, 2019

@Stifael right but were there geofence items set? This would cause SD card access.

How do I see that from the log?

@julianoes
Copy link
Contributor Author

How do I see that from the log?

I don't think you do. But it doesn't matter. I can reproduce it without a geofence set. According to my profiling using perf_counters it seems like the dm_read() call is what takes so much time.

@julianoes
Copy link
Contributor Author

julianoes commented Nov 21, 2019

@dagar @bkueng @davids5 I narrowed the call talking in the order of 1s to:

https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L412-L413

more specifically the actual read() call:
https://github.com/PX4/Firmware/blob/1371887578af6af9fbd3775104c505ce1fc69e92/src/modules/dataman/dataman.cpp#L639

I can reproduce it in HITL on a Pixracer flying a mission. It probably happens without flying a mission, the main point is that position is initialized and navigator runs.

Edit: I could not reproduce read times above about 5ms when logger is not running!

@dagar
Copy link
Member

dagar commented Nov 21, 2019

The safe points are tiny, let's just keep a full copy of them in RTL.

Then let's add another orb message with full dataman metadata so we can limit updates.

After that I'd still like to see what the remaining dataman access pattern is like for typical missions. Some simple caching and prefetching in dataman might be interesting.

@julianoes
Copy link
Contributor Author

@bkueng just mentioned that this could be happening with cheaper SD cards, and indeed I can reproduce it with a cheap SanDisk EDGE 8GB class 4 (which presumably comes with Pixhawk 4) but I don't see reads above 11ms with a more expensive SanDisk Extreme PRO U3 card.

@julianoes
Copy link
Contributor Author

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

@dagar
Copy link
Member

dagar commented Nov 21, 2019

The safe points are tiny, let's just keep a full copy of them in RTL.

This is currently not even about safe points. This is about geofence points.

It's both. For RTL we can store all the safe points in a small array. In geofence (IIRC) the polygon is already stored, but dataman is still checked constantly for an update. This is the part we get rid of with something similar to ORB_ID(mission) (https://github.com/PX4/Firmware/blob/master/msg/mission.msg). The mission message could turn into a more general dataman message that includes info about all the dataman entries, or we could add separate messages for each type (safe points, fence, waypoints).

I think we need to implement some caching at some level but also think about moving the read calls to the background and access them async.

Dataman is already like a poor man's async. Caching the first entry (where safe points and geofence store the number of items) might be a quick way to eliminate a lot of the disk reads (similar to the orb idea).

@dagar
Copy link
Member

dagar commented Nov 21, 2019

TODO: review dataman locking. We don't seem to be using it consistently.

@stale
Copy link

stale bot commented Feb 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. Thank you for your contributions.

@stale stale bot added the stale label Feb 20, 2020
@MikeCharikov
Copy link

Regarding geofence. We performed some tests with 60-point poly fence. Read+calculations takes from 30ms to 200ms on different UAVs. Looks like it depends on SD card condition. But 200ms is pretty high value.

@dagar Do you have any plans to rework this logic?

@stale stale bot removed the stale label Oct 29, 2020
@julianoes julianoes removed their assignment Oct 29, 2020
@julianoes
Copy link
Contributor Author

I have plans but no time. @dagar might know better.

@JacobCrabill
Copy link
Member

TODO: review dataman locking. We don't seem to be using it consistently.

This didn't age well....

I'll be taking at look at as time allows within the next few weeks to see if any changes on master have fixed this issue since it was first noted here. Hopefully it's fixed 🤞🏻; I'll post again if not

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants