nodejs.org resilience #676

Closed
jbergstroem opened this issue Apr 5, 2017 · 9 comments

@jbergstroem
Member

jbergstroem commented Apr 5, 2017

I'd argue we learnt our lesson, but here we are. Before we begin, I just want to note for readers that most of the Node.js build team are doing this in their (our) spare time. We're always looking for new members!

Anyway, let's get to work.

  1. Let's set up unencrypted.nodejs.org to host nodejs.org as well. It's already syncing most of the content; we might as well sync it all.
  2. Let's load balance (failover) it. Both Cloudflare (our CDN) and DigitalOcean (our main host for nodejs.org at the moment) have features for load balancing. Failing over to the host above makes sense.
  3. Let's let Cloudflare cache everything: downloads, website and so on. Let's have it serve stale content if upstream fails.
  4. Let's ask Cloudflare about logs (preferably before the third action, above) so we can (still) generate download statistics.
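
As an illustration of the "serve stale content if upstream fails" idea in item 3, here is a minimal sketch of a stale-on-error cache. It is not how Cloudflare implements this; the class name `StaleOnErrorCache`, the `fetch` callable and the `ttl` parameter are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Tiny cache that serves the last good copy when the origin fails."""

    def __init__(self, fetch: Callable[[str], bytes], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical origin fetcher: path -> body
        self._ttl = ttl          # seconds a cached copy counts as fresh
        self._clock = clock      # injectable clock, handy for testing
        self._store: Dict[str, Tuple[float, bytes]] = {}

    def get(self, path: str) -> bytes:
        now = self._clock()
        cached = self._store.get(path)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # fresh hit, origin is not contacted
        try:
            body = self._fetch(path)
        except Exception:
            if cached is not None:
                return cached[1]  # origin failed: fall back to stale copy
            raise                 # nothing cached, so surface the error
        self._store[path] = (now, body)
        return body
```

The point of the sketch is the ordering: a fresh copy is preferred, the origin is tried next, and an expired copy is only served when the origin raises.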

Anything else? cc: @nodejs/build

@jasnell
Member

jasnell commented Apr 5, 2017

What multi-provider options for CDN and hosting do we currently have (or can we have)?

@jbergstroem
Member Author

jbergstroem commented Apr 5, 2017

@jasnell said:
What multi-provider options for CDN and hosting do we currently have (or can we have)?

As of now, here's a list of viable providers (excluding sponsors that probably won't cut it) that either sponsor us or might do so in the near future (Google?):

CDN:

Hosting:

  • Digital Ocean (current)
  • Joyent (suggested failover)
  • Softlayer/IBM
  • Rackspace
  • Azure/Microsoft (only for testing, but perhaps there's room to step up?)

@MylesBorins
Contributor

MylesBorins commented Apr 5, 2017 via email

@rvagg
Member

rvagg commented Apr 10, 2017

The Cloudflare log component is at #679, which gets us closer to removing those cache bypass rules in Cloudflare.

For the record @nodejs/build, I discovered that we didn't have bypass rules for *.7z and *.zip files, so metrics up to today won't be accurate for them.

@rvagg
Member

rvagg commented Apr 10, 2017

Let's have it serve stale content if upstream fails.

FWIW I did try to configure CF to do this. The "always on" option is turned off but that seems to conflict with "bypass", so it never worked. I've also never been game to troubleshoot it, because that would involve taking the server offline, or playing in a fresh zone that I'd have to set up. Thankfully we won't have to figure that out once this is all sorted out.

Also @jbergstroem has pretty much got the mirror for the site set up so we're closer to enabling load balancing. We just need to confirm that logs are in order first and then we should be good to go.

@vielmetti

You have access to resources on Packet; make sure you consider those for hosting too.

@refack
Contributor

refack commented Apr 11, 2017

Let's let Cloudflare cache everything: downloads, website and so on. Let's have it serve stale content if upstream fails.
Let's ask Cloudflare about logs (preferably before the third action, above) so we can (still) generate download statistics.

Do you need any help configuring Cloudflare? I have extensive experience with them (and some insider connections).

@rvagg
Member

rvagg commented Apr 21, 2017

[image: cf_lb]

So @jbergstroem set this up and configured a 15-minute sync to pull from the DigitalOcean primary server to our Joyent backup server for /home/dist (downloads) and /home/www (everything else). We did a little bit of testing with a temporary hostname. It's tricky to get full confidence without putting this into production, unfortunately, but we were comfortable enough to turn it on and give it a go during a low-traffic time.

So what we have now is basically the same as before, but if the DO server flakes out for whatever reason (health check is just GET / for now), then it'll start pulling from the Joyent server until DO comes back. So we don't have an SPOF now, but we're not done yet.
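
The failover behaviour described above can be sketched roughly as follows. This is only an illustration of "try the primary, fall back to the backup" with a GET / health check; it is not the Cloudflare load balancer's actual logic, and the function names `http_ok` and `pick_origin` are hypothetical.

```python
import urllib.request
from typing import Callable, Optional, Sequence


def http_ok(base_url: str, timeout: float = 5.0) -> bool:
    """Health check: GET / and treat any 2xx answer as healthy."""
    try:
        with urllib.request.urlopen(base_url + "/", timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses;
        # any of them counts as "unhealthy" in this sketch.
        return False


def pick_origin(origins: Sequence[str],
                is_healthy: Callable[[str], bool] = http_ok) -> Optional[str]:
    """Return the first healthy origin in priority order, or None."""
    for origin in origins:
        if is_healthy(origin):
            return origin
    return None
```

Calling `pick_origin([primary, backup])` with the primary listed first gives the "use DO unless it is down, then use Joyent until DO comes back" behaviour, because the primary is rechecked on every call.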

The next step is to turn off the bypass for download files. We're still forcing the Cloudflare edges to fetch from the origin server for all of the download files (tarballs, installers, etc.) so we can collect metrics. You can see in the PR I submitted that we're now fetching logs from CF for /dist/ and /download/ every hour. I still haven't properly verified that they contain exactly what we need; I'm in the process of comparing them with the raw logs to gain confidence in the process. After that we need to do a switch-over for the metrics data, so that at some specific point it switches from using the raw nginx logs to using these CF logs for the files at nodejs.org/metrics/. I have a simple converter from JSON to the nginx log format we're using as an interim step. Whether we end up keeping that intermediate transform or using the raw JSON can be decided later; for now I just want to make it work.
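
For readers wondering what a JSON-to-nginx converter of this kind might look like, here is a rough sketch. It is not the converter mentioned above: the JSON field names (`ip`, `timestamp`, `method`, `uri`, `status`, `bytes`, `referer`, `user_agent`) are made up for illustration and are not Cloudflare's schema, and the output assumes nginx's stock "combined" layout, which may differ from the format the project actually uses.

```python
import json
from datetime import datetime, timezone


def json_to_combined(line: str) -> str:
    """Turn one JSON log record into an nginx "combined"-style line.

    All field names read from the record are hypothetical.
    """
    rec = json.loads(line)
    # Assumes "timestamp" is seconds since the Unix epoch, rendered in UTC.
    when = datetime.fromtimestamp(rec["timestamp"], tz=timezone.utc)
    stamp = when.strftime("%d/%b/%Y:%H:%M:%S %z")
    request = "{} {} {}".format(
        rec["method"], rec["uri"], rec.get("protocol", "HTTP/1.1"))
    return '{ip} - - [{stamp}] "{request}" {status} {size} "{referer}" "{agent}"'.format(
        ip=rec["ip"],
        stamp=stamp,
        request=request,
        status=rec["status"],
        size=rec["bytes"],
        referer=rec.get("referer", "-"),      # nginx logs "-" when absent
        agent=rec.get("user_agent", "-"),
    )
```

Keeping the output in the existing log format means any downstream metrics scripts can stay unchanged during the switch-over, which is the appeal of an interim transform.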

A couple of interesting asides: I experienced I/O problems with the DO block device we're using for /home/ while working on there just now, which makes me concerned about filesystem stability, but at least we have this failover in place. Also, @jbergstroem introduced me to systemd timers as a replacement for crontab. They're ugly and verbose, as per systemd, but since they can apparently detect whether a previous job is still running, they'll solve some of the problems we've had with overlapping jobs on the DO server, so we'd better convert all of that stuff, I suppose.

@jbergstroem we need a playbook for the new backup server work. Is that going to be hard?
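
To make the overlapping-jobs problem concrete: the sketch below is not systemd and not what the build team runs. It shows a different, commonly used guard (an advisory lock file via `flock`) that a cron-style job could use to skip a run while the previous one is still going. It assumes a POSIX system, and the name `single_instance` is hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_instance(lock_path: str) -> Iterator[bool]:
    """Yield True if this run holds the lock, False if another run does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        os.close(fd)  # closing the descriptor also releases a held lock
```

A job would wrap its work in `with single_instance(path) as ok:` and return early when `ok` is false, so a slow sync never stacks up behind itself.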

@sam-github
Contributor

Closing as stale, but if anyone wants to take this up feel free to reopen.
