ARM cluster move #611

Closed
rvagg opened this issue Feb 1, 2017 · 22 comments

@rvagg
Member

rvagg commented Feb 1, 2017

Sad news folks, I have to physically move the ARM cluster today and the internet connection where it's moving to isn't properly installed yet! I have a temporary connection ready but I'm paying by the GB for it so I can't hook up normal test runs.

There are some ARMv7 and ARMv8 machines in Jenkins (and ARMv7 in release) that aren't physically in the same place (i.e. they are hosted at Scaleway and miniNodes) so they won't be impacted.

So here's what I'm going to do:

  • ARM tests and releases via Jenkins will be unavailable from around 4am UTC / 8pm Pacific for at least a few hours, possibly longer (up to 24 hours) as I hack together the temporary connection.
  • I'll unhook the relevant machines from Jenkins so nobody should be obviously impacted, you just won't get the complete runs you normally do.
  • When reconnected, I'll only be connecting release machines until I have a proper connection or decide that it'll be too long to wait and come up with an alternative.

I'll keep this thread updated as I make progress.

@nodejs/collaborators

@mikeal
Contributor

mikeal commented Feb 1, 2017

Will we get a new picture of the cluster once it is in its new home?

@rvagg
Member Author

rvagg commented Feb 2, 2017

release machines are back online for now, armv6 and armv8, others are off, no ETA on proper connection yet

@Trott
Member

Trott commented Feb 2, 2017

So with no ETA for the return of the Raspberry Pi cluster:

For the jobs that are stalled waiting for the Raspberry Pi farm, will they kick off tomorrow or whenever the farm comes back online? Or probably not and the jobs should just be canceled now?

Land stuff without Raspberry Pi test results in CI? Or wait for the Raspberry Pi cluster to come back?

/cc @nodejs/ctc

@joaocgreis
Member

joaocgreis commented Feb 2, 2017

armv7-ubuntu1404 and armv8-ubuntu1404 were removed by @rvagg from the node-test-commit-arm job but node-test-commit-arm-fanned was left in place, possibly forgotten. I think it's better to cancel. I'll look for a way to remove the whole job.

EDIT: Just disabling the job worked; it is properly skipped by node-test-commit. Also disabled git-rpi-clean.

@rvagg
Member Author

rvagg commented Feb 2, 2017

Sorry, I thought I removed node-test-commit-arm-fanned. There shouldn't be any queued jobs, if there are then I've messed up!

@rvagg
Member Author

rvagg commented Feb 3, 2017

Thursday the 9th is the date I've been given for finalising this internet connection. Apparently there are some technical challenges (also I think some administrative incompetence but that's to be expected when dealing with large telcos!).

@rvagg
Member Author

rvagg commented Feb 9, 2017

Bad news .. I've been notified there are network problems in the area (monopoly government-provided internet infrastructure, yay) and it's been deferred for another week. If it goes through then it should be up on the 16th of this month.

@thefourtheye

We have three releases and we might get RCs out soon. Should we hold them till this setup is back up? We cannot release binaries without testing them, right?

@italoacasas

italoacasas commented Feb 9, 2017

I have two questions:

  • Is there something I (we) can do to help right now?
  • Is there something we can prepare (plan) in case this happens again in the future, for example because of a storm?

@mhdawson
Member

mhdawson commented Feb 9, 2017

The LTS releases are planned for Feb 21st, so availability on the 16th may not affect those directly. It may affect planned RCs; in that case the question would be whether the changes going in that we wanted validated through the RC are ARM-only or can be adequately covered by use on other platforms.

In terms of testing for the Current release, I wonder if the binaries could be tested manually by somebody with access to the release machine logging in and running the tests. That might take a while to run though, since it would be on the single machine instead of fanned out like it is in the regular jobs.

@Trott
Member

Trott commented Feb 9, 2017

I think it's OK to release RCs without testing in the ARM cluster in this situation. Maybe explain/apologize in the release announcement.

An actual release (as opposed to an RC) might be different....

@Fishrock123
Contributor

Same, RCs/Betas should be fine.

@rvagg
Member Author

rvagg commented Feb 16, 2017

AAAAND we're back up online again on a new stable connection that's quite a bit faster than the old one as a bonus. Working my way through everything but I'm pretty sure I've got most things in place already so it should be working as it used to before the move. Please let me know if you encounter anything that doesn't seem right.

Regarding RCs and nightlies, I think that it got screwed up after a reconnect of my temporary connection where a new dynamic IP got assigned which messed up the iptables rules on both Jenkins machines. They were working just not connecting! Ooops!

@joaocgreis
Member

joaocgreis commented Feb 16, 2017

Jobs seem to be running well! There are still 3 slaves offline and the DNS for the jump host is not updated, but this is not urgent. However, we have some tests failing:

  • test-dgram-address is failing consistently for master on RPi 1 and 2 (master test runs: 1, 2, 3)
  • v7.x-staging seems to have the same problem plus test-npm-install on all 3 RPis
  • v6.x-staging and v4.x-staging are still running at the moment but seem good

@rvagg
Member Author

rvagg commented Feb 16, 2017

Thanks to @Trott for jumping on test-dgram-address @ nodejs/node#11432, looks like that'll be addressed soon. Full green run @ https://ci.nodejs.org/job/node-test-binary-arm/6241/

I've taken three Pi's offline, suspecting corrupted filesystems or dodgy SD cards, some of the failures were because of that. I'll address them as soon as I can and bring them back online.

@rvagg
Member Author

rvagg commented Feb 16, 2017

Failures on test-requireio_arm-ubuntu1404-arm64_xgene-2 are interesting, e.g. https://ci.nodejs.org/job/node-test-commit-arm/7806/nodes=armv8-ubuntu1404/ and correlate with disconnection notifications that we keep on getting for just this machine and they date back pretty far (prior to the move). I was tinkering on that box last night trying to understand it but I have no idea what's going on. There's nothing special about it, in fact it's the least special of the 3 XGene machines (one runs the NFS for the Pi's and does release builds, another serves as a jump host for SSH, this one just runs test builds and nothing else!). Something about Jenkins keeps on disconnecting and reconnecting, perhaps it's a Java problem..

Anyone got ideas for debugging this? @joaocgreis, @jbergstroem?

@joaocgreis
Member

@rvagg It's strange that it's just that one machine. I have no solution, but perhaps you can try a different ping interval from the slave side. This is used for Windows:

java -Dhudson.remoting.Launcher.pingIntervalSec=10 -jar slave.jar -jnlpUrl https://ci.nodejs.org/computer/{{ server_id }}/slave-agent.jnlp -secret {{ server_secret }}
(the main thing that clearly fixed Windows was the ping interval from the master side, but this was left in place in all Windows slaves so at least it doesn't hurt).

@rvagg
Member Author

rvagg commented Feb 22, 2017

https://ci.nodejs.org/computer/test-requireio_arm-ubuntu1404-arm64_xgene-2/builds

I tweaked the job slightly after posting the above and you can see that it's mostly green since then. It now downloads slave.jar before starting, each time, under the theory that having an updated slave.jar would be good ... but tbh I don't know if that's been a problem at all.

Kernel logs are still full of:

[511493.450658] init: jenkins main process (8821) terminated with status 255
[511493.450681] init: jenkins main process ended, respawning
[511499.729117] init: jenkins main process (8852) terminated with status 255
[511499.729139] init: jenkins main process ended, respawning
[511505.963897] init: jenkins main process (8883) terminated with status 255
[511505.963921] init: jenkins main process ended, respawning

Failures are less frequent now, but they still happen. I've implemented the extended ping interval thing just now so let's see if that helps at all.
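For anyone wanting to reproduce the two tweaks described here (fetching a fresh slave.jar before each start, and passing the ping interval property from the earlier comment), a minimal Python sketch might look like the following. The `/jnlpJars/slave.jar` download path is an assumption about how Jenkins serves the agent jar; it is not given in this thread. The function names and the `server_id`/`secret` parameters are hypothetical stand-ins for the `{{ server_id }}` and `{{ server_secret }}` placeholders above.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import List

CI_URL = "https://ci.nodejs.org"
# Assumption: Jenkins serves the agent jar at this path; the thread does not state the URL.
SLAVE_JAR_URL = CI_URL + "/jnlpJars/slave.jar"


def download_slave_jar(dest: Path, url: str = SLAVE_JAR_URL) -> Path:
    """Fetch a fresh slave.jar, replacing the old copy only if the download succeeds."""
    tmp = dest.with_name(dest.name + ".tmp")
    with urllib.request.urlopen(url, timeout=60) as resp, open(tmp, "wb") as out:
        shutil.copyfileobj(resp, out)
    tmp.replace(dest)  # swap the new jar into place in a single rename
    return dest


def build_agent_command(jar: Path, server_id: str, secret: str,
                        ping_interval_sec: int = 10) -> List[str]:
    """Build the agent launch command with the ping interval property set."""
    return [
        "java",
        "-Dhudson.remoting.Launcher.pingIntervalSec={}".format(ping_interval_sec),
        "-jar", str(jar),
        "-jnlpUrl", "{}/computer/{}/slave-agent.jnlp".format(CI_URL, server_id),
        "-secret", secret,
    ]
```

The list returned by `build_agent_command` can be handed to `subprocess.run` from whatever supervises the agent. Downloading to a temporary name first means a failed download leaves the previous jar in place.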

@jbergstroem
Member

@rvagg do the exits correlate with anything interesting in the logs?

@rvagg
Member Author

rvagg commented Feb 22, 2017

@jbergstroem well, when I look at the actual times, it would correlate with anything that's happening on the machine:

[Wed Feb 22 13:43:04 2017] init: jenkins main process (26312) terminated with status 255
[Wed Feb 22 13:43:04 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:10 2017] init: jenkins main process (26343) terminated with status 255
[Wed Feb 22 13:43:10 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:16 2017] init: jenkins main process (26374) terminated with status 255
[Wed Feb 22 13:43:16 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:23 2017] init: jenkins main process (26405) terminated with status 255
[Wed Feb 22 13:43:23 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:29 2017] init: jenkins main process (26436) terminated with status 255
[Wed Feb 22 13:43:29 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:35 2017] init: jenkins main process (26467) terminated with status 255
[Wed Feb 22 13:43:35 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:42 2017] init: jenkins main process (26498) terminated with status 255
[Wed Feb 22 13:43:42 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:53 2017] init: jenkins main process (26529) terminated with status 255
[Wed Feb 22 13:43:53 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:59 2017] init: jenkins main process (26560) terminated with status 255
[Wed Feb 22 13:43:59 2017] init: jenkins main process ended, respawning

(who knew dmesg had a -T eh?)

basically it's constantly happening. Going to have to run this manually and see if I can get anything from it.
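To put a number on "constantly", here is a small Python sketch that parses `dmesg -T` lines in the format shown above and reports the gap between consecutive exits. The function name and the sample constant are hypothetical; the sample lines are copied from the log excerpt in this comment.

```python
import re
from datetime import datetime
from typing import List

# Matches only the "terminated" lines; the "respawning" lines carry the same timestamp.
EXIT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] init: jenkins main process \((?P<pid>\d+)\) "
    r"terminated with status (?P<status>\d+)$"
)


def respawn_gaps(log_text: str) -> List[float]:
    """Return the seconds elapsed between consecutive jenkins process exits."""
    times = []
    for line in log_text.splitlines():
        match = EXIT_RE.match(line.strip())
        if match:
            # dmesg -T prints timestamps like "Wed Feb 22 13:43:04 2017"
            times.append(datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y"))
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


SAMPLE = """\
[Wed Feb 22 13:43:04 2017] init: jenkins main process (26312) terminated with status 255
[Wed Feb 22 13:43:04 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:10 2017] init: jenkins main process (26343) terminated with status 255
[Wed Feb 22 13:43:10 2017] init: jenkins main process ended, respawning
[Wed Feb 22 13:43:16 2017] init: jenkins main process (26374) terminated with status 255
[Wed Feb 22 13:43:16 2017] init: jenkins main process ended, respawning
"""
```

Calling `respawn_gaps(SAMPLE)` gives a gap of 6 seconds between each pair of exits, which matches the respawn loop visible in the excerpt.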

@rvagg
Member Author

rvagg commented Feb 23, 2017

captured a failure, not sure if this is the failure, relevant log portions after connect are here: https://gist.github.com/rvagg/8eeb20b0fe7cf289601593ebff5bb827

There's a problem with child processes not being cleaned up properly which seems to cause Jenkins grief (never seen this before elsewhere) and then when it tries to reconnect it gets the kind of error you get when a node is already connected and it keeps on looping from there, which is similar behaviour to what I'm seeing with it running under upstart.

I'm trying out disabling the process tree killer as per https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller to see if that helps, perhaps this is an architecture thing (i.e. this thing is "native code").
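As a rough illustration of what disabling the process tree killer could look like, here is a Python sketch that adds a Java system property to a launch command. It assumes the property name `hudson.util.ProcessTreeKiller.disable` from the linked wiki page; the thread does not show the exact flag used or which JVM it was set on, and the helper name and base command are hypothetical.

```python
from typing import Dict, List


def with_system_properties(java_cmd: List[str], props: Dict[str, str]) -> List[str]:
    """Insert -Dkey=value flags directly after the java executable."""
    if not java_cmd:
        raise ValueError("java_cmd must contain at least the executable")
    flags = ["-D{}={}".format(key, value) for key, value in props.items()]
    return [java_cmd[0]] + flags + java_cmd[1:]


# Hypothetical base command; the real one carries the -jnlpUrl and -secret arguments.
base = ["java", "-jar", "slave.jar"]
# Assumption: this is the property the wiki page describes for turning the killer off.
cmd = with_system_properties(base, {"hudson.util.ProcessTreeKiller.disable": "true"})
```

System properties have to come before `-jar`, since anything after the jar name is passed to the program as an argument rather than to the JVM.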

@maclover7
Contributor

ping @rvagg -- can this be closed?

@maclover7 maclover7 added the infra label Nov 7, 2017
@rvagg rvagg closed this as completed Nov 8, 2017