Several release machines offline #2217
Comments
I restarted jenkins, but the failure remains:
and is true:
|
About Mar 9, looks like
|
It feels like a firewall issue, though I'm not sure whether it's on the Node.js infra side or not. AFAICT, the two machines I've looked at so far, though both "IBM", are provided by different orgs, with different networks, so that suggests it's a Node.js-side infra problem. |
I'll look at the firewall config |
Looks like at least for the 2 ibm machines, they were removed from the firewall. I do see that when I added them I had Adding those 2 back and will see if that resolves. If so I'll check the windows ones. |
release-ibm-rhel7-s390x-1 is back online. The AIX one might need a kick to get it back. |
I don't see any Windows machines in the Ansible templates under release, which is a bit strange and makes checking the IPs harder. |
It does look like the Windows ones were removed as well. I can't double-check the IPs, but I will add back the entries that match the names in the CI. |
Did you check nodejs-private/secrets:build/release/inventory.yml? I can't decrypt, but I think they might be in there. |
aix is up. |
Windows machines seem to be back up as well |
OK, let's not close until we've tracked down why it went away, or do you know? Is it possible the encrypted inventory.yml is not correct, and the firewall config got refreshed based on that? I wonder, because neither @AshCripps nor I have access to that file, so it's possible the release machine IPs didn't get added to it, and if it's used to build firewalls, perhaps that's why they lost the manual config. Or is there some other reason you can see? |
@sam-github they are in nodejs-private/secrets:build/release/inventory.yml. My confusion came from having some of the info in the other file. |
To clarify, inventory.yml in the secrets:build/release directory includes the Windows machines that were removed; it does not include the IBM ones. Right now each of the 2 files has only partial info. The key question before closing is what triggered the update to the firewall config that resulted in machines being lost. We should follow up with @rvagg, as I think he'd be the only other one who would have updated it since my last update on Feb 25. |
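Since each of the two inventories only has part of the picture, one way to see the gaps would be to compare the host names from each side. This is only a rough sketch: the function name `inventory_gaps` and the sample sets are made up for illustration, and apart from release-ibm-rhel7-s390x-1 the host names are placeholders, not real entries from either file.

```python
def inventory_gaps(first: set[str], second: set[str]) -> dict[str, list[str]]:
    """Report host names that appear in only one of two inventories."""
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
    }


# Hypothetical contents. Only release-ibm-rhel7-s390x-1 is a name from this
# thread; the other two are placeholders standing in for real entries.
ANSIBLE_INVENTORY = {"release-ibm-rhel7-s390x-1", "release-example-aix-1"}
SECRETS_INVENTORY = {"release-example-win-x64-1"}

if __name__ == "__main__":
    gaps = inventory_gaps(ANSIBLE_INVENTORY, SECRETS_INVENTORY)
    for side, hosts in gaps.items():
        print(side, hosts)
```

Anything reported on either side would be a machine whose firewall entry can only be rebuilt from one of the two files.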
So, any idea what process caused the reset of the firewall rules? This seems like it was a side-effect of the recent Jenkins server upgrade, which suggests that perhaps it was done by Ansible, and that Ansible doesn't know about these two new(ish) release machines? EDIT: crossed comments :-). OK, let's wait for @rvagg to comment. |
Right, as far as I know there are no automated processes for updating the firewall config, just manual updates. |
I don't think I've touched it for quite a while, certainly not since Feb 25. But I did restart the server, and ci.nodejs.org too, for security updates 2 days ago which was obviously a trigger for a reset to an old state. So I suppose rules.v4 wasn't updated properly? |
These are the instructions we have for how to update: https://github.com/nodejs/build/blob/master/ansible/MANUAL_STEPS.md#adding-firewall-entries-for-jenkins-workers Which is what I've always followed (and I'm guessing similarly whoever added the windows machines). If more needs to be done than that can you PR in what it is or add to this issue so that we can get it captured. |
I'm suspecting there might be something borked on this machine wrt iptables. /etc/cron.hourly/iptables-save does a Whatever happens, the iptables rules need to be saved to /etc/iptables/rules.v4, that's what the When I edit the firewall, I just do it directly:
Where I need to delete a rule, I do this:
This is quite manual and the steps in MANUAL_STEPS.md should be safer, but they don't seem to be properly saving on this machine:
The major differences with current rules are:
which I'm guessing corresponds to the machines that were offline? For now, I've just run Maybe if @sam-github gets access to this machine this can be one of his initial tasks? |
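To check whether manual edits actually reached /etc/iptables/rules.v4, the live rules and the saved file could be compared rule by rule. The following is a minimal sketch, not what is running on the server: it assumes both inputs are text in `iptables-save` format, the helper names `rule_lines` and `unsaved_rules` are invented here, and the sample rules use documentation-range addresses rather than real worker IPs.

```python
def rule_lines(dump: str) -> list[str]:
    """Return the '-A ...' rule lines of an iptables-save style dump, in order."""
    return [
        line.strip()
        for line in dump.splitlines()
        if line.strip().startswith("-A ")
    ]


def unsaved_rules(live_dump: str, saved_dump: str) -> list[str]:
    """Rules present in the live dump but absent from the saved file."""
    saved = set(rule_lines(saved_dump))
    return [rule for rule in rule_lines(live_dump) if rule not in saved]


# Invented sample data (192.0.2.0/24 is a documentation range).
LIVE = """\
*filter
:INPUT DROP [0:0]
-A INPUT -s 192.0.2.10/32 -p tcp -m tcp --dport 443 -j ACCEPT
-A INPUT -s 192.0.2.11/32 -p tcp -m tcp --dport 443 -j ACCEPT
COMMIT
"""
SAVED = """\
*filter
:INPUT DROP [0:0]
-A INPUT -s 192.0.2.10/32 -p tcp -m tcp --dport 443 -j ACCEPT
COMMIT
"""

if __name__ == "__main__":
    for rule in unsaved_rules(LIVE, SAVED):
        print("not persisted:", rule)
```

Any rule this reports exists only in memory and would disappear on the next reboot, which matches the symptom seen here after the server restart.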
just added back release-nearform-macos10.15-x64-1, that was missing too |
@rvagg those were the ones I just added because they were missing. The answer here would suggest that it's not configured correctly: https://unix.stackexchange.com/questions/125833/why-isnt-the-iptables-persistent-service-saving-my-changes. We could just update that cronjob to have |
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
This is long since stale, closing. |
I think release builds are backing up due to the following machines being offline in ci-release: