Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CAN cleanup #8987

Merged
merged 28 commits into from
Aug 12, 2018
Merged

CAN cleanup #8987

merged 28 commits into from
Aug 12, 2018

Conversation

OXINARF
Copy link
Member

@OXINARF OXINARF commented Jul 22, 2018

This is a CAN cleanup:

  • adds a new interface called CANProtocol that all CAN protocols should implement: it's basically a init method that all should have
  • removes CAN thread management from HALs and leaves it for the each CANProtocol to create a thread, using the new thread_create API
  • both of the above open space for new CAN protocols being added
  • fixes ChibiOS driver that wasn't correctly handling multiple CANManager instances neither the separation of interfaces for each of the CANManager interfaces; this effectively fixes an observed issue of CAN using a lot of CPU time even when not doing much
  • cleans and reorders AP_UAVCAN to make it more readable
  • adds CAN filtering in UAVCAN for boards with space; unfortunately this increases size too much to have it there
  • removes debug messages (and parameter) by default; this amounts to almost less 2.5 KB in binary size
  • fixes two bugs and a warning in other libraries

This has been verified as working with a Zubax Babel. I'm waiting arrival of a Zubax GNSS this week and I'll do further testing. Any more testing anybody can do is appreciated.

I don't have an Emlid Edge to test the UAVCAN LED, but in the Babel messages look fine. Hopefully @staroselskii can test it.

As @bugobliterator (#8972), I've done some experiments on breaking AP_UAVCAN message handling into parts, but it really increases size so I've left it out for now. I have some ideas there, but no time to pursue them at the moment.

Copy link
Member

@bugobliterator bugobliterator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OXINARF if we are serious about #8972 then, the changes at sensor level will not be relevant, do you mind if we merge our PR into one? I am willing to do the rebase for the same, if you want.

AP_Param::setup_object_defaults(this, var_info);
#if CONFIG_HAL_BOARD == HAL_BOARD_SITL
if (_singleton != nullptr) {
AP_HAL::panic("AP_Proximity must be singleton");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AP_Proximity?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ups, copy paste done wrong 😞 I'll fix it.

@bugobliterator
Copy link
Member

@OXINARF As far as the size issue goes, I think its going to be discussed in dev call by @tridge. I think the issue is inly pertaining to px4-v2 builds, which we can make defunct and replace them with either fmuv2 or get rid of altogether. Meaning boards with 1M+ size will be the only ones to support CAN Protocol, which includes all existing boards with CAN support except for Pixhawks containing uC's with Flash USB errata.

@magicrub
Copy link
Contributor

I've tested the branch from @bugobliterator that is on top of this and I LOVE IT. Tested it with a babel, a Zubax GNSS and an Emlid Edge. The class abstraction and AP_UAVCAN.cpp cleanup is greatly appreciated and very good. However, I'm seeing some heavy cpu scheduler numbers. I've never done proper comparisons so i don't know if it's big or small but on plane if I change the scheduler rate from 50 to 300 the mavlink connection really slows down. takes awhile to get a param download. At 400 I can barely connect enough to get the param changed

@OXINARF
Copy link
Member Author

OXINARF commented Jul 25, 2018

@magicrub I answered in the other PR, but that's a clear sign that you haven't compiled the ChibiOS driver change in. Make sure to update the submodules before building the firmware.

@tridge
Copy link
Contributor

tridge commented Jul 30, 2018

same problem here as in #8972, I find the zubax v1 is detected with plane, but not with copter. I suspect it is timing related.
Interestingly, for it to work with either fw I need to set both CAN_P1_DRIVER=1 and CAN_P2_DRIVER=1. I'm on a CubeBlack, which mis-labels CAN1 and CAN2. My zubax is connected to the port labelled CAN1 (which is really CAN2)
If I plug the zubax into the port labelled CAN2, and set CAN_P1_DRIVER=1 then I do see the sensors

@OXINARF
Copy link
Member Author

OXINARF commented Jul 30, 2018

Interestingly, for it to work with either fw I need to set both CAN_P1_DRIVER=1 and CAN_P2_DRIVER=1. I'm on a CubeBlack, which mis-labels CAN1 and CAN2. My zubax is connected to the port labelled CAN1 (which is really CAN2)
If I plug the zubax into the port labelled CAN2, and set CAN_P1_DRIVER=1 then I do see the sensors

@tridge That's a known issue I've told you about. You can't use the STM32 CAN2 device without enabling CAN1 - because of our multiple CANManager instances it's not very safe to force CAN1 to be initialized, although it's doable; I'd prefer to wait until I've removed the multiple CANManager stuff later though.

One thing you can test is initializing CAN1 device (port CAN2 in the Cube) without any protocol attached to it:

  • CAN_P1_DRIVER = 1
  • CAN_D1_PROTOCOL = 0
  • CAN_P2_DRIVER = 2
  • CAN_D2_PROTOCOL = 1 (should be by default)

same problem here as in #8972, I find the zubax v1 is detected with plane, but not with copter. I suspect it is timing related.

Is this with both ports enabled? And do you have ESCs enabled? I've found this problem too, but I don't know how to fix it; the issue is that because one of the ports doesn't have a device attached, all messages fail sending and so libuavcan will continuously try to re-send them during the spin - so the CAN thread doesn't sleep. It's basically the same issue we had before I fixed the ChibiOS driver, except there it also didn't work with just one port enabled and a device attached.

I've thought about changing spin to spinOnce and then doing the sleep ourselves, but the issue is that when things are working, that could delay sending messages as spinOnce will take care of all received messages, but won't try to send all the pending messages, instead only sends one message out plus one per received message.

@pavel-kirienko Any ideas here?

@pavel-kirienko
Copy link
Contributor

Any ideas here?

Profiling on-target should help. Libuavcan is supposed to sleep even if there are unused interfaces because from the viewpoint of the library the only significant difference between a functioning interface and a floating interface is that in the latter case all frames timeout (it's not entirely correct to say that they "fail sending", they are aborted by the driver). If a floating interface creates a higher CPU load, something is probably wrong with the driver, hence I recommend profiling.

@bugobliterator
Copy link
Member

bugobliterator commented Jul 30, 2018

@pavel-kirienko @tridge I did some experiments to check performance of this PR:

  • I found that it takes around 400us from getting a message in Dispacher::spin, i.e. call of handleFrame to actual callback, Which I think is massive amount of time spent just parsing the message. This explained the MAX CPU time between sleeps I am getting as shown here: https://gist.github.com/bugobliterator/2553af46a2824fa3b66ecffec3411c17 in uavcan_0 thread.
    Note: This test was performed with just GNSS Fix message being sent at 5Hz. I think if we are to increase the amount of packets as with MAG and Baro attached, Issue as seen by @tridge can be explained.
  • I also tried -O3 flag for performance optimisation and had no impact whatsoever on uavcan threads performance.
  • I saw normal behaviour with no device connected and all ESC and SRV bitmask enabled. The MAX CPU time reached ~90us instead of ~60us which is reasonable.
  • Also I am not observing any performance related issues with floating interface, as mentioned by @OXINARF.

@pavel-kirienko
Copy link
Contributor

Not sure if the function call instrumentation is enabled on your target, but it is known to cause dramatic slowdown for libuavcan due to its design. I discovered this a long time ago in PX4; more info here: PX4/PX4-Autopilot#1660. It was fixed in PX4 since then, but I don't know whether it affects ArduPilot. I can't pass on this opportunity to emphasize the benefits of regular global profiling; if system performance is a concern, it must be done: PX4/PX4-Autopilot#6829 (comment)

@bugobliterator
Copy link
Member

bugobliterator commented Jul 31, 2018

@pavel-kirienko there are no function instrumentation for the ChibiOS builds, that I am using. Further analysis revealed that maximum time is being spent in message's decode method ~300us, and the check was done ensuring that interrupt and context switches are postponed when between the measurements. Digging deeper it was found that around ~10us of CPU time is consumed per field and most of the time is spent here at this line https://github.com/ArduPilot/uavcan/blob/master/libuavcan/src/transport/uc_transfer_buffer.cpp#L175 which leads to https://github.com/ArduPilot/uavcan/blob/master/libuavcan/src/transport/uc_transfer_buffer.cpp#L101
. The time spent there is somewhere around 7-8us per field. The CPU time other than UAVCAN thread is way under the requisite bounds. It is observed that parsing a message can take more time than any other thread between sleeps (which includes Vehicle thread that does sensor fusion and controls), which I think is considerable amount of time.

@pavel-kirienko
Copy link
Contributor

Looks like I was onto something here https://github.com/ArduPilot/uavcan/blob/70df64480a7512fbbb85b442234dbd9b3bdae548/libuavcan/src/transport/uc_transfer_buffer.cpp#L168.

The linked list based buffer requires sequential iteration of the buffer segments in order to reach the desirable offset, which makes it slow. Additionally, the code that you identified as the culprit makes a lot of auxiliary operations per every byte read in order to facilitate the segment traversal logic at the outer layer. One could get rid of the list traversal problem by using some sort of a stateful reader object which would keep references to the current chunk. Is anybody willing to give it a try?

@bugobliterator
Copy link
Member

Is anybody willing to give it a try?

@pavel-kirienko If it helps with resolution of timing issue, certainly. I will give it a try. Also let me know if you have sample implementation lying around.

@pavel-kirienko
Copy link
Contributor

pavel-kirienko commented Jul 31, 2018 via email

@OXINARF OXINARF force-pushed the pr/can-cleanup branch 2 times, most recently from 81adacb to 80882fa Compare July 31, 2018 23:56
@OXINARF
Copy link
Member Author

OXINARF commented Aug 1, 2018

I've updated this PR. I haven't been able to reproduce my previous issue where it seemed that having both ports assigned to one driver and one port without anything attached, would lead to the previous issue of high CPU utilization.

I don't know what I did wrong back in that day, but I've tried multiple options and couldn't lose MAVLink connection in any again.
Part of the problem was still there though: in those conditions, port 1 wasn't sending almost any message out and it was due to a bug I had introduced while fixing the ChibiOS driver. This is now fixed.

I've also found out that #7863 introduced a bug by making ESC and Actuator messages have a timeout of 20ms instead of 2ms. This is now fixed as well. I've left LEDs with a timeout of 20ms as it shouldn't hurt, given that those messages are sent at 20Hz.

@tridge Can you please re-test this? I've been testing with the Zubax GNSS and I don't see any problem - besides the occasional bad CRC messages in MAVProxy (not sure if they are too frequent).

Regarding the findings from @bugobliterator, they seem quite important and I hope we can reach some solution.

@tridge
Copy link
Contributor

tridge commented Aug 3, 2018

rebased on master
ahh, darn, I can't, the branch isn't editable by me

@tridge
Copy link
Contributor

tridge commented Aug 3, 2018

I've pushed a rebased version here: https://github.com/tridge/ardupilot/tree/pr/can-cleanup
this rebase includes the change to ChibiOS 18.2.x which went into master today

OXINARF added 23 commits August 12, 2018 13:35
../../libraries/AP_BattMonitor/AP_BattMonitor_UAVCAN.cpp: In member function ‘virtual void AP_BattMonitor_UAVCAN::init()’:
../../libraries/AP_BattMonitor/AP_BattMonitor_UAVCAN.cpp:15:123: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘AP_Int32 {aka AP_ParamT<int, (ap_var_type)3u>}’ [-Wformat=]
 #define debug_bm_uavcan(level, fmt, args...) do { if ((level) <= AP_BoardConfig_CAN::get_can_debug()) { printf(fmt, ##args); }} while (0)
                                                                                                                           ^
../../libraries/AP_BattMonitor/AP_BattMonitor_UAVCAN.cpp:36:33: note: in expansion of macro ‘debug_bm_uavcan’
                                 debug_bm_uavcan(2, "UAVCAN BattMonitor BatteryInfo registered id: %d\n\r", _params._serial_number);
Also change the order of logical OR so that led_write in UAVCAN_RGB_LED is called for all UAVCAN instances and not only first one
This includes creating own thread

Also adapts example
Also a bit more cleanup
@OXINARF OXINARF merged commit a53c848 into ArduPilot:master Aug 12, 2018
@OXINARF OXINARF deleted the pr/can-cleanup branch August 12, 2018 13:54
@magicrub
Copy link
Contributor

Woohoo!

@proficnc
Copy link
Contributor

Awesome job team!!!

@jpkh
Copy link
Contributor

jpkh commented Aug 16, 2018

Great to see this happening.. yay good work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants