ARM failures in CI #951
Comments
Failure: https://ci.nodejs.org/job/node-test-binary-arm/11211/RUN_SUBSET=3,label=pi1-raspbian-wheezy/console
Building remotely on test-requireio_chrislea-debian7-arm_pi1p-1 (pi1-raspbian-wheezy) in workspace /home/iojs/build/workspace/node-test-binary-arm
+ git clean -fdx
warning: failed to remove out/Release/.nfs00000000000f537d000002f9
I had an idea, by way of cataloging toward a HOWTO: add a "code" to each such failure, so we can count them and attach troubleshooting steps:
ps -ef | grep '\[node\] <defunct>' # To check
ps -ef | grep '\[node\] <defunct>' | awk '{print $2}' | xargs sudo kill # To clean up
Seen several times recently:
2017-10-25
2017-10-17
2017-10-29
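As a rough sketch of how the "code" idea could work in practice: a small shell function that tags a saved console log with a failure code by matching known signatures. The function name and the code names (`NFS_PLACEHOLDER`, `GIT_FETCH_HANGUP`, `UNKNOWN`) are made up for illustration; only the matched strings come from logs quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print a failure code for one saved console log.
# The codes are illustrative names; the grep patterns are strings that
# appear in the console output quoted in this issue.
classify_failure() {
  # $1: path to a saved Jenkins console log
  if grep -q 'failed to remove out/Release/\.nfs' "$1"; then
    echo NFS_PLACEHOLDER
  elif grep -q 'The remote end hung up unexpectedly' "$1"; then
    echo GIT_FETCH_HANGUP
  else
    echo UNKNOWN
  fi
}
```

With logs saved under a (hypothetical) `logs/` directory, something like `for f in logs/*.txt; do classify_failure "$f"; done | sort | uniq -c` would give a count per code.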
@refack great idea, but I'd add something else to the |
@refack you offering to start a HOWTO somewhere? |
Thinking this might be a good use for the wiki: stuff that changes often and that we don't really need source control for, basically a scratchpad for anyone to edit. Looks like Johan had a similar idea a while ago.
I'm starting to keep notes now on my maintenance of the cluster. I can drop them in here also if that's helpful. If there are repeating correlations with erroring machines even after replacing SD cards then we might be able to pinpoint ones that need to be retired. Today I r/w tested the SD cards on I've thrown one of the cards out and inserted a new one and set these 3 up from scratch and they are back in the cluster. Also, I'm hoping that by pulling back on the overclocking on these that we might have more stability. We'll see. |
I've added the "kill defunct" line to the job config (after manual testing and one mistake): |
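Going by the job log quoted later in this thread, the step pipes `ps -ef` through `grep '\[node\] <defunct>'` and `awk '{print $2}'` before handing the PIDs to `kill`. A minimal sketch of that filter as a reusable function follows; the function name is hypothetical, and this is not necessarily the exact line in the job config.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# column (field 2) of entries shown as "[node] <defunct>".
# The brackets are escaped so grep matches them literally instead of
# treating [node] as a character class.
defunct_node_pids() {
  grep '\[node\] <defunct>' | awk '{print $2}'
}
```

Usage would be along the lines of `ps -ef | defunct_node_pids | xargs -r kill`, where `-r` (as in the job log's `xargs -rl kill`) skips running `kill` when there is nothing to clean up.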
There are multiple builds where pretty much all ARM runs failed
I'm removing stale |
Also seeing multiple failures that look like this:
11:56:00 Started by upstream project "node-test-binary-arm" build number 12701
11:56:00 originally caused by:
11:56:00 Started by upstream project "node-test-commit-arm-fanned" build number 13496
11:56:00 originally caused by:
11:56:00 Started by upstream project "node-test-commit" build number 14936
11:56:00 originally caused by:
11:56:00 Started by upstream project "node-daily-master" build number 977
11:56:00 originally caused by:
11:56:00 Started by timer
11:56:00 [EnvInject] - Loading node environment variables.
11:56:01 Building remotely on test-requireio_rvagg-debian7-arm_pi2-1 (pi2-raspbian-wheezy) in workspace /home/iojs/build/workspace/node-test-binary-arm
11:56:03 [node-test-binary-arm] $ /bin/sh -xe /tmp/jenkins1159433904551122903.sh
11:56:03 + set +x
11:56:03 Tue Dec 19 16:56:03 UTC 2017
11:56:04 + pgrep node
11:56:04 7241
11:56:04 7247
11:56:04 7252
11:56:04 7253
11:56:04 7258
11:56:04 7260
11:56:04 7269
11:56:04 7274
11:56:04 7276
11:56:05 [node-test-binary-arm] $ /bin/bash -ex /tmp/jenkins199448051140569265.sh
11:56:05 + rm -rf RUN_SUBSET
11:56:05 + case $label in
11:56:05 + REF=cc-armv7
11:56:05 + REFERENCE_REFS=+refs/heads/master:refs/remotes/reference/master
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging'
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging'
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/reference/v7.x-staging'
11:56:05 + REFERENCE_REFS='+refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/reference/v7.x-staging +refs/heads/v8.x-staging:refs/remotes/reference/v8.x-staging'
11:56:05 + ORIGIN_REFS=+refs/heads/master:refs/remotes/origin/master
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging'
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/origin/v6.x-staging'
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/origin/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/origin/v7.x-staging'
11:56:05 + ORIGIN_REFS='+refs/heads/master:refs/remotes/origin/master +refs/heads/v4.x-staging:refs/remotes/origin/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/origin/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/origin/v7.x-staging +refs/heads/v8.x-staging:refs/remotes/origin/v8.x-staging'
11:56:05 + git --version
11:56:06 git version 2.15.0
11:56:06 + git init
11:56:06 Reinitialized existing Git repository in /home/iojs/build/workspace/node-test-binary-arm/.git/
11:56:06 + git fetch --no-tags file:///home/iojs/.ccache/node.shared.reference +refs/heads/master:refs/remotes/reference/master +refs/heads/v4.x-staging:refs/remotes/reference/v4.x-staging +refs/heads/v6.x-staging:refs/remotes/reference/v6.x-staging +refs/heads/v7.x-staging:refs/remotes/reference/v7.x-staging +refs/heads/v8.x-staging:refs/remotes/reference/v8.x-staging
11:56:09 fatal: Couldn't find remote ref refs/heads/v7.x-staging
11:56:09
11:56:09 real 0m3.441s
11:56:09 user 0m0.050s
11:56:09 sys 0m0.040s
11:56:09 + echo 'Problem fetching the shared reference repo.'
11:56:09 Problem fetching the shared reference repo.
11:56:09 + git fetch --no-tags file:///home/iojs/.ccache/node.shared.reference +refs/heads/jenkins-node-test-commit-arm-fanned-13496-binary-pi1p/cc-armv7:refs/remotes/jenkins_tmp
11:56:09 fatal: The remote end hung up unexpectedly
11:56:11
11:56:11 real 0m1.893s
11:56:11 user 0m0.180s
11:56:11 sys 0m0.400s
11:56:11 + ps -ef
11:56:11 + grep '\[node\] <defunct>'
11:56:11 + awk '{print $2}'
11:56:11 + xargs -rl kill
11:56:11 + rm -f ****
11:56:11 + git checkout -f refs/remotes/jenkins_tmp
11:56:22 HEAD is now at 6c29aa6896... added binaries
11:56:22
11:56:22 real 0m11.204s
11:56:22 user 0m1.480s
11:56:22 sys 0m9.850s
11:56:22 + git reset --hard
11:56:26 HEAD is now at 6c29aa6896 added binaries
11:56:26
11:56:26 real 0m3.466s
11:56:26 user 0m1.460s
11:56:26 sys 0m0.860s
11:56:26 + git clean -fdx
11:56:30 warning: failed to remove out/Release: Directory not empty
11:56:30 Removing config.gypi
11:56:30 Removing icu_config.gypi
11:56:30 Removing node
11:56:30 Removing out/Release/node
11:56:30 Removing out/Release/openssl-cli
11:56:30 Removing test.tap
11:56:30 Removing test/.tmp.0/
11:56:30 Removing test/abort/testcfg.pyc
11:56:30 Removing test/addons-napi/testcfg.pyc
11:56:30 Removing test/addons/testcfg.pyc
11:56:30 Removing test/async-hooks/testcfg.pyc
11:56:30 Removing test/doctool/testcfg.pyc
11:56:30 Removing test/es-module/testcfg.pyc
11:56:30 Removing test/gc/testcfg.pyc
11:56:30 Removing test/internet/testcfg.pyc
11:56:30 Removing test/known_issues/testcfg.pyc
11:56:30 Removing test/message/testcfg.pyc
11:56:30 Removing test/parallel/testcfg.pyc
11:56:30 Removing test/pseudo-tty/testcfg.pyc
11:56:30 Removing test/pummel/testcfg.pyc
11:56:30 Removing test/sequential/testcfg.pyc
11:56:30 Removing test/testpy/__init__.pyc
11:56:30 Removing test/tick-processor/testcfg.pyc
11:56:30 Removing test/timers/testcfg.pyc
11:56:30 Removing tools/test.pyc
11:56:30 Removing tools/utils.pyc
11:56:31 Build step 'Execute shell' marked build as failure
11:56:31 TAP Reports Processing: START
11:56:31 Looking for TAP results report in workspace using pattern: *.tap
11:56:32 Did not find any matching files. Setting build result to FAILURE.
11:56:32 Checking ^not ok
11:56:32 Jenkins Text Finder: File set '*.tap' is empty
11:56:32 Notifying upstream projects of job completion
11:56:32 Finished: FAILURE
Note that the above is the aforementioned
11:56:09 + git fetch --no-tags file:///home/iojs/.ccache/node.shared.reference +refs/heads/jenkins-node-test-commit-arm-fanned-13496-binary-pi1p/cc-armv7:refs/remotes/jenkins_tmp
11:56:09 fatal: The remote end hung up unexpectedly
11:56:11
Certainly some NFS issues showing up now:
12:11:32 + git clean -fdx
12:11:40 warning: failed to remove out/Release/.nfs00000000000b132d00000005: Device or resource busy
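For context, files named `.nfs…` are placeholders that an NFS client leaves behind when a file is deleted while some process still has it open, which is why `git clean` reports them as busy. A small sketch for spotting them in a workspace before a build; the function name and the default path argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list NFS placeholder files (.nfsXXXX) under a
# directory. Defaults to out/Release, the path seen in the logs above.
list_nfs_placeholders() {
  find "${1:-out/Release}" -type f -name '.nfs*' 2>/dev/null
}
```

Assuming `fuser` is installed on the worker, `list_nfs_placeholders | xargs -r fuser -v` should show which local processes still hold those files open.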
But it seems to be self-healing. https://ci.nodejs.org/job/node-test-binary-arm/12703/ is now 2/3 green and looking promising. All I've done is remove stale |
FWIW our (IBM) Jenkins farm used to have all the workspaces running on a shared NFS mount, but we moved to having everything local because it kept causing problems like this.
I imagine that's probably not an option with Raspberry Pi devices. :-( Still, good to know.
This is caused by the NFS server having some internal problems, twice now in a few days. I'm a bit embarrassed to admit that it's likely to do with the hot weather we've been having down here (no, I don't have a cooled datacenter in my garage, unfortunately). I've done some restarting and cleaning up and have a couple of jobs working at the moment that seem to indicate that it's all good now.
Thought it might be a good idea to have a tracker issue for ARM failures. I see a lot, but am not sure whether it's the same failures again and again or something new.
One comment per new type of failure is probably a good starting point.
cc/ @rvagg