[DON'T MERGE] debug CI #11401

Closed
wants to merge 22 commits from the debug-jenkins-fw branch

Conversation

@julianoes
Contributor

This is just to debug #11380.

@julianoes
Contributor Author

successful-fw-test.log

@julianoes
Contributor Author

Something new 😄
[screenshot from 2019-02-07 08-40-44]

@julianoes
Contributor Author

unsuccessful-fw-test.log

Yay, I got a log after a force push.

@lamping7
Member

lamping7 commented Feb 7, 2019

RE: #11401 (comment)

Started by user Julian Oes
Restarted from build #6, stage ROS Tests.
.
.
[Pipeline] { (Build)
Stage "Build" skipped due to this build restarting at stage "ROS Tests"

You can't just restart that stage, I guess; you need to restart the whole job.

@julianoes
Contributor Author

You can't just restart that stage, I guess; you need to restart the whole job.

This means the restart button should probably not be there...

@julianoes force-pushed the debug-jenkins-fw branch 2 times, most recently from eccb159 to b228790 on February 7, 2019 at 13:29
@lamping7
Member

lamping7 commented Feb 7, 2019

It's because we're using stashes to hold onto the built px4 & sitl_gazebo packages across stages, and stashes don't persist between job builds. We might be able to address this if people think it's a worthwhile feature: the solution would involve pulling the artifact, which does persist, rather than the stash. I'm not sure how this would work if you hit restart more times than artifacts are allowed to stick around (currently 5 job builds), as the build discarder would likely throw the artifact away at restart #6 and we'd have the same problem.

@julianoes
Contributor Author

It's confusing because restarting via the button used to work.

@lamping7
Member

lamping7 commented Feb 7, 2019

Interesting. I assume that was before the latest Jenkins upgrade? I can look into this.

@julianoes
Contributor Author

julianoes commented Feb 7, 2019

Something new 😕

Cannot contact ec2_docker_slave (i-01fdf8c5be63a67e3): java.lang.InterruptedException
Creating placeholder flownodes because failed loading originals.

GitHub has been notified of this commit’s build result

java.io.IOException: Tried to load head FlowNodes for execution Owner[PX4_misc/Firmware-SITL_tests/debug-jenkins-fw/15:PX4_misc/Firmware-SITL_tests/debug-jenkins-fw #15] but FlowNode was not found in storage for head id:FlowNodeId 1:23
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:678)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:715)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:664)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:526)
	at hudson.model.RunMap.retrieve(RunMap.java:225)
	at hudson.model.RunMap.retrieve(RunMap.java:57)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
	at hudson.model.RunMap.getById(RunMap.java:205)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:901)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:912)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
	at jenkins.model.Jenkins.<init>(Jenkins.java:989)
	at hudson.model.Hudson.<init>(Hudson.java:85)
	at hudson.model.Hudson.<init>(Hudson.java:81)
	at hudson.WebAppMain$3.run(WebAppMain.java:233)
Finished: FAILURE

@julianoes force-pushed the debug-jenkins-fw branch 2 times, most recently from d5fcc71 to 3a97a39 on February 7, 2019 at 18:11
@julianoes
Contributor Author

I'm looking at http://ci.px4.io:8080/blue/rest/organizations/jenkins/pipelines/PX4_misc/pipelines/Firmware-SITL_tests/branches/debug-jenkins-fw/runs/17/nodes/48/steps/134/log/?start=0
and I can see that the actuator message is published to uORB 10 times but received only 9 times on the simulator side.
I think the next step is to check whether it was successfully sent by simulator_mavlink.
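A minimal sketch of the kind of counter instrumentation this implies; the names here are made up for illustration and are not actual PX4 code. The idea is simply to count every uORB publication and every successful TCP send, then compare the totals after a failing run:

#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical debug counters; these symbols do not exist in PX4.
static std::atomic<uint64_t> g_actuator_published{0};
static std::atomic<uint64_t> g_actuator_sent{0};

void on_actuator_published() { ++g_actuator_published; }  // call next to the orb publish
void on_actuator_sent() { ++g_actuator_sent; }            // call after a successful TCP send

void dump_actuator_counters()
{
    printf("actuator published: %llu, sent: %llu\n",
           (unsigned long long)g_actuator_published.load(),
           (unsigned long long)g_actuator_sent.load());
}

If published and sent diverge, the loss is inside PX4; if they match, the loss is on the wire or on the simulator side.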

@lamping7
Member

lamping7 commented Feb 8, 2019

RE: #11401 (comment) 👍

Yeah... we're working on it. Non-technically: FlowNodes are used to maintain state so that Jenkins can resume builds when something bad happens. It has to do with durability settings, which unfortunately you can't disable. Anyway, hopefully changing the EC2 instance type fixes it.

It's fun when other people start to see what we bang our heads against.

@julianoes
Contributor Author

When you want it to fail, it passes; I need to force-push this again. 😄

@julianoes
Contributor Author

julianoes commented Feb 10, 2019

I can see that the actuator message is published to uORB 10 times but received only 9 times on the simulator side.

The same happened again. The message is published to uORB and successfully sent via TCP 22 times, but received only 21 times on the simulator side. Supposedly, with TCP these messages should never get lost. And we already use TCP_NODELAY to flush everything immediately.
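For context, TCP_NODELAY disables Nagle's algorithm so that small writes go out immediately instead of being coalesced into larger segments. Setting it is a single setsockopt call on the connected socket; a minimal sketch using the standard POSIX API (not the PX4 code itself):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a connected TCP socket so that each small
// message is flushed immediately instead of waiting to be coalesced.
int enable_tcp_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}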

Actuators were published to uORB 22 times and sent 21 times via TCP. On the simulator side, the actuators were received 21 times. This means I should check whether the sensor message somehow gets lost.

And another thought: if anything goes wrong with the transmission, why would this happen for fixedwing only?

@julianoes force-pushed the debug-jenkins-fw branch 3 times, most recently from 1e81ac6 to 79a7304 on February 10, 2019 at 15:57
@dagar
Member

dagar commented Feb 11, 2019

Let's merge #11426, then grind this branch (let Jenkins run 20 consecutive builds) with the FW tests re-enabled.

@julianoes
Contributor Author

We get up to this point, but then sensors doesn't receive a gyro update:
https://github.com/PX4/Firmware/blob/6dfd526452c12d1e116951cec1389f613965cd0e/src/modules/sensors/sensors.cpp#L631-L632

@lamping7
Member

@dagar upgrading the timestamper plugin in Jenkins to v1.9 might help here. It is supposed to fix the missing timestamps in the console output for slave agents. https://plugins.jenkins.io/timestamper

@dagar
Member

dagar commented Feb 14, 2019

@dagar upgrading the timestamper plugin in Jenkins to v1.9 might help here. It is supposed to fix the missing timestamps in the console output for slave agents. https://plugins.jenkins.io/timestamper

Updated.

@lamping7
Member

Never mind... buffered stdout makes this mostly useless.

@julianoes force-pushed the debug-jenkins-fw branch 3 times, most recently from 11b82ad to f38eed5 on February 15, 2019 at 12:06
@julianoes
Contributor Author

@dagar so here is the latest I've found:
What happens is that in sensors we usually enter px4_poll before the next gyro update becomes available. In that case we correctly get the update through the semaphore and move on.

In the failing case, px4_poll in sensors is not called until after a new gyro sample has already been published. In that case we should never actually start waiting, but instead return immediately from the poll here:
https://github.com/PX4/Firmware/blob/6302452066e3629f40d85823a07939eb192b1204/src/modules/uORB/uORBDeviceNode.cpp#L616-L626

According to @bkueng this should work unless there was already some orb_copy beforehand, although I wonder whether that means an orb_copy by sensors or potentially by someone else.

What we see from the printfs is that this appears_updated() returns false in the failing case.

INFO  [gyrosim] publishing sensor accel: 2100000
INFO  [gyrosim] publishing sensor gyro: 2100000
INFO  [accelsim] publishing sensor accel: 2100000
INFO  [sensors] before poll sleep with fds 8
INFO  [sensors] before poll with fds 8
INFO  [cdev] sensors: px4_poll: CDev->poll(setup) 8
INFO  [cdev] fds->events before: 1
INFO  [uorb] before appears_updated
INFO  [uorb] after appears_updated: not updated
INFO  [cdev] fds->events after: 1
INFO  [cdev] before px4_sem_timedwait, ts: 2.150000000 (sensors)

And if you're wondering why px4_sem_timedwait never times out: no more timestamps (in HIL_SENSOR) are sent from jMAVSim/Gazebo, because the simulator is still waiting for the last actuator controls.

And if you're wondering why it only happens for fixedwing: this makes sense because there is only one path through the system, and if one pubsub copy fails somewhere, everything falls apart.
For multicopter, it seems there is always one of the polls ready to grab the sample (or maybe there is even more going on that I don't understand).
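That waiting behaviour is effectively a lockstep scheme: the simulator only advances simulated time and emits the next HIL_SENSOR once it has received the actuator controls for the previous step. A rough sketch of that loop, with stubbed-in helpers standing for the real jMAVSim/Gazebo plumbing (all names here are hypothetical):

#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the real simulator plumbing.
static void send_hil_sensor(uint64_t time_usec)
{
    printf("HIL_SENSOR @ %llu us\n", (unsigned long long)time_usec);
}

static bool wait_for_actuator_controls()
{
    // In the real simulator this blocks on the TCP link until PX4 answers.
    // If PX4 never publishes the actuator message (the bug in this thread),
    // this never returns and simulated time stops advancing.
    return true;  // stubbed out for the sketch
}

int main()
{
    uint64_t time_usec = 0;
    const uint64_t dt_usec = 4000;  // 250 Hz step, for illustration

    for (int step = 0; step < 3; ++step) {
        send_hil_sensor(time_usec);           // PX4 runs one iteration off this sample
        if (!wait_for_actuator_controls()) {  // ...and must answer before...
            break;
        }
        time_usec += dt_usec;                 // ...time advances to the next sample
    }
    return 0;
}

So once a single actuator message goes missing, the whole loop stalls: no actuator controls means no new HIL_SENSOR timestamps, which means the semaphore wait never times out.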

I see two things to investigate next, but I won't have time until Monday:

  1. Try to reproduce this in the unit tests. So far I've figured out how to run them with make px4_sitl_test && make px4_sitl_test test. I would assume this case is already covered by the tests and works as expected, but I haven't had time to check that.
  2. Check what else could do the orb_copy and "steal" the sample from sensors, if that's actually possible (see the sketch below).
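To make point 2 concrete, here is a toy model of the suspected mechanism. It is a simplified stand-in for uORB's per-subscriber generation counter, not the real implementation: once any code path sharing the subscription copies the data, the update counts as seen, so a later poll on that subscription no longer appears updated and blocks.

#include <cstdio>

struct Topic { int data = 0; unsigned generation = 0; };  // generation bumped on every publish

struct Subscription {
    explicit Subscription(const Topic *t) : topic(t) {}
    const Topic *topic;
    unsigned last_seen = 0;  // generation this subscriber has consumed

    bool appears_updated() const { return topic->generation != last_seen; }

    int copy()                          // analogous to orb_copy()
    {
        last_seen = topic->generation;  // marks the update as consumed
        return topic->data;
    }
};

int main()
{
    Topic gyro;
    Subscription sub(&gyro);

    gyro.data = 42; ++gyro.generation;  // publisher side: new gyro sample

    sub.copy();                         // some other code path sharing this
                                        // subscription copies the data first

    // The main loop's poll now sees no update and would block:
    printf("appears_updated: %s\n", sub.appears_updated() ? "true" : "false");  // prints "false"
    return 0;
}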

@julianoes force-pushed the debug-jenkins-fw branch 4 times, most recently from 3b06d11 to 0513726 on February 18, 2019 at 09:33
@julianoes
Contributor Author

@dagar I think I've found the cause; now I just need to make a nice fix PR:
614e69c

The problem is that an orb_copy is done on parameter change for the temperature calibration, which "steals" the pending update from sensors.
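For illustration only (this is not necessarily what the eventual fix in #11485 does): extending the toy model above, giving the parameter-change path its own subscription handle means its copy consumes its own generation counter, leaving the main loop's update visible.

#include <cstdio>

struct Topic { int data = 0; unsigned generation = 0; };

struct Subscription {
    explicit Subscription(const Topic *t) : topic(t) {}
    const Topic *topic;
    unsigned last_seen = 0;
    bool appears_updated() const { return topic->generation != last_seen; }
    int copy() { last_seen = topic->generation; return topic->data; }
};

int main()
{
    Topic gyro;
    Subscription main_loop(&gyro);
    Subscription calib_path(&gyro);    // separate handle, separate counter

    gyro.data = 7; ++gyro.generation;  // publisher: new sample

    calib_path.copy();                 // parameter-change path reads its own copy

    // The main loop still sees the update, so its poll would return immediately:
    printf("appears_updated: %s\n", main_loop.appears_updated() ? "true" : "false");  // prints "true"
    return 0;
}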

@julianoes
Contributor Author

Not needed anymore, now that #11485 is in.

@julianoes closed this Feb 18, 2019
@julianoes deleted the debug-jenkins-fw branch on February 18, 2019 at 19:03