[DON'T MERGE] debug CI #11401

Closed
wants to merge 22 commits from the debug-jenkins-fw branch

Conversation

@julianoes
Contributor

This is just to debug #11380.

@julianoes
Contributor Author

successful-fw-test.log

@julianoes
Contributor Author

Something new 😄
[screenshot from 2019-02-07 08-40-44]

@julianoes
Contributor Author

unsuccessful-fw-test.log

Yay, I got a log after a force push.

@lamping7
Member

lamping7 commented Feb 7, 2019

RE: #11401 (comment)

Started by user Julian Oes
Restarted from build #6, stage ROS Tests.
.
.
[Pipeline] { (Build)
Stage "Build" skipped due to this build restarting at stage "ROS Tests"

You can't just restart that stage, I guess; you need to restart the whole job.

@julianoes
Contributor Author

You can't just restart that stage, I guess; you need to restart the whole job.

This means the restart button should probably not be there...

@julianoes force-pushed the debug-jenkins-fw branch 2 times, most recently from eccb159 to b228790 on February 7, 2019 at 13:29
@lamping7
Member

lamping7 commented Feb 7, 2019

It's because we're using stashes to hold onto the built px4 & sitl_gazebo packages across stages, and stashes don't persist between job builds. We might be able to address this if people think it's a worthwhile feature: the solution would involve pulling the artifact, which does persist, rather than the stash. I'm not sure how this would work if you hit restart more times than artifacts are allowed to stick around (currently 5 job builds), as the build discarder would likely throw the artifact away at restart #6 and we'd have the same problem.

@julianoes
Contributor Author

It's confusing because restarting via the button used to work.

@lamping7
Member

lamping7 commented Feb 7, 2019

Interesting. I assume that was before the latest Jenkins upgrade? I can look into this.

@julianoes
Contributor Author

julianoes commented Feb 7, 2019

Something new 😕

Cannot contact ec2_docker_slave (i-01fdf8c5be63a67e3): java.lang.InterruptedException
Creating placeholder flownodes because failed loading originals.

GitHub has been notified of this commit’s build result

java.io.IOException: Tried to load head FlowNodes for execution Owner[PX4_misc/Firmware-SITL_tests/debug-jenkins-fw/15:PX4_misc/Firmware-SITL_tests/debug-jenkins-fw #15] but FlowNode was not found in storage for head id:FlowNodeId 1:23
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:678)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:715)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:664)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:526)
	at hudson.model.RunMap.retrieve(RunMap.java:225)
	at hudson.model.RunMap.retrieve(RunMap.java:57)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
	at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
	at hudson.model.RunMap.getById(RunMap.java:205)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:901)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:912)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
	at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
	at jenkins.model.Jenkins.<init>(Jenkins.java:989)
	at hudson.model.Hudson.<init>(Hudson.java:85)
	at hudson.model.Hudson.<init>(Hudson.java:81)
	at hudson.WebAppMain$3.run(WebAppMain.java:233)
Finished: FAILURE

@julianoes force-pushed the debug-jenkins-fw branch 2 times, most recently from d5fcc71 to 3a97a39 on February 7, 2019 at 18:11
@julianoes
Contributor Author

I'm looking at http://ci.px4.io:8080/blue/rest/organizations/jenkins/pipelines/PX4_misc/pipelines/Firmware-SITL_tests/branches/debug-jenkins-fw/runs/17/nodes/48/steps/134/log/?start=0
and I can see that the actuator message is published to uORB 10 times but received only 9 times on the simulator side.
I think the next step is to check whether it was successfully sent by simulator_mavlink.
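A minimal sketch of the kind of counter instrumentation this implies; the names here are made up for illustration and are not actual PX4 code. The idea is simply to count every uORB publication and every successful TCP send, then compare the totals after a failing run:

#include <atomic>
#include <cstdint>
#include <cstdio>

// Hypothetical debug counters; these symbols do not exist in PX4.
static std::atomic<uint64_t> g_actuator_published{0};
static std::atomic<uint64_t> g_actuator_sent{0};

void on_actuator_published() { ++g_actuator_published; }  // call next to the orb publish
void on_actuator_sent() { ++g_actuator_sent; }            // call after a successful TCP send

void dump_actuator_counters()
{
    printf("actuator published: %llu, sent: %llu\n",
           (unsigned long long)g_actuator_published.load(),
           (unsigned long long)g_actuator_sent.load());
}

If published and sent diverge, the loss is inside PX4; if they match, the loss is on the wire or on the simulator side.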

@lamping7
Member

lamping7 commented Feb 8, 2019

RE: #11401 (comment) 👍

Yeah... we're working on it. Non-technically: FlowNodes are used to maintain state so that Jenkins can resume builds when something bad happens. It has to do with durability settings, which unfortunately you can't disable. Anyway, hopefully changing the EC2 instance type fixes it.

It's fun when other people start to see what we bang our heads against.

@julianoes
Contributor Author

When you want it to fail, it passes; I need to force-push this again. 😄

@julianoes
Contributor Author

julianoes commented Feb 10, 2019

I can see that the actuator message is published to uORB 10 times but received only 9 times on the simulator side.

The same happened again. The message is published to uORB and successfully sent via TCP 22 times, but received only 21 times on the simulator side. Supposedly, with TCP these messages should never get lost. And we already use TCP_NODELAY to flush everything immediately.
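For context, TCP_NODELAY disables Nagle's algorithm so that small writes go out immediately instead of being coalesced into larger segments. Setting it is a single setsockopt call on the connected socket; a minimal sketch using the standard POSIX API (not the PX4 code itself):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Disable Nagle's algorithm on a connected TCP socket so that each small
// message is flushed immediately instead of waiting to be coalesced.
int enable_tcp_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}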

Actuators were published to uORB 22 times and sent 21 times via TCP. On the simulator side, the actuators were received 21 times. This means I should check whether the sensor message somehow gets lost.

And another thought: if anything goes wrong with the transmission, why would this happen for fixedwing only?

@julianoes force-pushed the debug-jenkins-fw branch 3 times, most recently from 1e81ac6 to 79a7304 on February 10, 2019 at 15:57
@dagar
Member

dagar commented Feb 11, 2019

Let's merge #11426, then grind this branch (let Jenkins run 20 consecutive builds) with the FW tests re-enabled.

@julianoes
Contributor Author

We get up to this point, but then sensors doesn't receive a gyro update:
https://github.com/PX4/Firmware/blob/6dfd526452c12d1e116951cec1389f613965cd0e/src/modules/sensors/sensors.cpp#L631-L632

@lamping7
Member

@dagar upgrading the timestamper plugin in Jenkins to v1.9 might help here. It is supposed to fix the missing timestamps in the console output for slave agents. https://plugins.jenkins.io/timestamper

@dagar
Member

dagar commented Feb 14, 2019

@dagar upgrading the timestamper plugin in Jenkins to v1.9 might help here. It is supposed to fix the missing timestamps in the console output for slave agents. https://plugins.jenkins.io/timestamper

Updated.

@lamping7
Member

Never mind... buffered stdout makes this mostly useless.

@julianoes force-pushed the debug-jenkins-fw branch 3 times, most recently from 11b82ad to f38eed5 on February 15, 2019 at 12:06
@julianoes
Contributor Author

@dagar so here is the latest I've found:
What happens is that in sensors we usually enter px4_poll before the next gyro update becomes available. In that case we correctly get the update through the semaphore and move on.

In the failing case, px4_poll in sensors is not called until after a new gyro sample has already been published. In that case we should never actually start waiting, but instead return immediately from the poll here:
https://github.com/PX4/Firmware/blob/6302452066e3629f40d85823a07939eb192b1204/src/modules/uORB/uORBDeviceNode.cpp#L616-L626

According to @bkueng this should work unless there was already some orb_copy beforehand, although I wonder whether that means an orb_copy by sensors or potentially by someone else.

What we see from the printfs is that this appears_updated() returns false in the failing case.

INFO  [gyrosim] publishing sensor accel: 2100000
INFO  [gyrosim] publishing sensor gyro: 2100000
INFO  [accelsim] publishing sensor accel: 2100000
INFO  [sensors] before poll sleep with fds 8
INFO  [sensors] before poll with fds 8
INFO  [cdev] sensors: px4_poll: CDev->poll(setup) 8
INFO  [cdev] fds->events before: 1
INFO  [uorb] before appears_updated
INFO  [uorb] after appears_updated: not updated
INFO  [cdev] fds->events after: 1
INFO  [cdev] before px4_sem_timedwait, ts: 2.150000000 (sensors)

And if you're wondering why px4_sem_timedwait never times out: no more timestamps (in HIL_SENSOR) are sent from jMAVSim/Gazebo, because the simulator is still waiting for the last actuator controls.

And if you're wondering why it only happens for fixedwing: this makes sense because there is only one path through the system, and if one pubsub copy fails somewhere, everything falls apart.
For multicopter, it seems there is always one of the polls ready to grab the sample (or maybe there is even more going on that I don't understand).
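That waiting behaviour is effectively a lockstep scheme: the simulator only advances simulated time and emits the next HIL_SENSOR once it has received the actuator controls for the previous step. A rough sketch of that loop, with stubbed-in helpers standing for the real jMAVSim/Gazebo plumbing (all names here are hypothetical):

#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the real simulator plumbing.
static void send_hil_sensor(uint64_t time_usec)
{
    printf("HIL_SENSOR @ %llu us\n", (unsigned long long)time_usec);
}

static bool wait_for_actuator_controls()
{
    // In the real simulator this blocks on the TCP link until PX4 answers.
    // If PX4 never publishes the actuator message (the bug in this thread),
    // this never returns and simulated time stops advancing.
    return true;  // stubbed out for the sketch
}

int main()
{
    uint64_t time_usec = 0;
    const uint64_t dt_usec = 4000;  // 250 Hz step, for illustration

    for (int step = 0; step < 3; ++step) {
        send_hil_sensor(time_usec);           // PX4 runs one iteration off this sample
        if (!wait_for_actuator_controls()) {  // ...and must answer before...
            break;
        }
        time_usec += dt_usec;                 // ...time advances to the next sample
    }
    return 0;
}

So once a single actuator message goes missing, the whole loop stalls: no actuator controls means no new HIL_SENSOR timestamps, which means the semaphore wait never times out.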

I see two things to investigate next, but I won't have time until Monday:

  1. Try to reproduce this in the unit tests. So far I've figured out how to run them with make px4_sitl_test && make px4_sitl_test test. I would assume this case is already covered by the tests and works as expected, but I haven't had time to check that.
  2. Check what else could do the orb_copy and "steal" the sample from sensors, if that's actually possible (see the sketch below).
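To make point 2 concrete, here is a toy model of the suspected mechanism. It is a simplified stand-in for uORB's per-subscriber generation counter, not the real implementation: once any code path sharing the subscription copies the data, the update counts as seen, so a later poll on that subscription no longer appears updated and blocks.

#include <cstdio>

struct Topic { int data = 0; unsigned generation = 0; };  // generation bumped on every publish

struct Subscription {
    explicit Subscription(const Topic *t) : topic(t) {}
    const Topic *topic;
    unsigned last_seen = 0;  // generation this subscriber has consumed

    bool appears_updated() const { return topic->generation != last_seen; }

    int copy()                          // analogous to orb_copy()
    {
        last_seen = topic->generation;  // marks the update as consumed
        return topic->data;
    }
};

int main()
{
    Topic gyro;
    Subscription sub(&gyro);

    gyro.data = 42; ++gyro.generation;  // publisher side: new gyro sample

    sub.copy();                         // some other code path sharing this
                                        // subscription copies the data first

    // The main loop's poll now sees no update and would block:
    printf("appears_updated: %s\n", sub.appears_updated() ? "true" : "false");  // prints "false"
    return 0;
}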

@julianoes force-pushed the debug-jenkins-fw branch 4 times, most recently from 3b06d11 to 0513726 on February 18, 2019 at 09:33
@julianoes
Contributor Author

@dagar I think I've found the cause; now I just need to make a nice fix PR:
614e69c

The problem is that an orb_copy is done on parameter change for the temperature calibration, which "steals" the pending update from sensors.
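For illustration only (this is not necessarily what the eventual fix in #11485 does): extending the toy model above, giving the parameter-change path its own subscription handle means its copy consumes its own generation counter, leaving the main loop's update visible.

#include <cstdio>

struct Topic { int data = 0; unsigned generation = 0; };

struct Subscription {
    explicit Subscription(const Topic *t) : topic(t) {}
    const Topic *topic;
    unsigned last_seen = 0;
    bool appears_updated() const { return topic->generation != last_seen; }
    int copy() { last_seen = topic->generation; return topic->data; }
};

int main()
{
    Topic gyro;
    Subscription main_loop(&gyro);
    Subscription calib_path(&gyro);    // separate handle, separate counter

    gyro.data = 7; ++gyro.generation;  // publisher: new sample

    calib_path.copy();                 // parameter-change path reads its own copy

    // The main loop still sees the update, so its poll would return immediately:
    printf("appears_updated: %s\n", main_loop.appears_updated() ? "true" : "false");  // prints "true"
    return 0;
}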

@julianoes
Contributor Author

Not needed anymore, now that #11485 is in.

@julianoes closed this Feb 18, 2019
@julianoes deleted the debug-jenkins-fw branch on February 18, 2019 at 19:03