Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server build: create a self-contained JS bundle with all node_modules included #45570

Closed
wants to merge 2 commits into from

Conversation

jsnajdr
Copy link
Member

@jsnajdr jsnajdr commented Sep 11, 2020

Helps make the Calypso server+client a well-defined artifact that can be tarballed and deployed on just about any server with a good-enough version of Node.js.

Currently, one major obstacle is the node_modules folder that contains 1.5GB of modules that are mostly build tools and are not needed at runtime. But we don't know who is who.

The first thing that this PR adds is a yarn run analyze-bundles:server command that creates webpack-analyze-bundles overview of the bundle contents.

Then I enforce bundling node_modules by removing the use of the webpack-node-externals plugin.

Before:
This is how the bundle looks before this patch, with node_modules still external to the bundle:

Chunk sizes (the parsed sizes, as gzipped is not relevant for server):
Screenshot 2020-09-11 at 08 56 47

Chunk layout:
Screenshot 2020-09-11 at 09 19 03

Even the existing server bundle seems fairly bloated. It shouldn't contain the moment locales at all, as server doesn't do anything localized. Another things that are surprising to see in the bundle are:

  • the reader state. There's no SSR in Reader so the state is very unlikely to be used. I think this comes from the Reader social preview -- we talked about this few months ago with @getdave when he was working on social-preview. It would be nice to add a streamlined Reader preview component to that package, too.
  • inline-help code. Probably dragged in by the Layout component.
  • lib/plans and lib/purchases.

One thing that would be nice to have is creating separate chunks for isomorphic sections (like login or themes) and loading them dynamically (on first request for that section, just like in client) rather than synchronously. Then it becomes better visible whether a particular module is used by the base server, or one of the isomorphic section (and which one).

After:

Chunk sizes (still parsed):
Screenshot 2020-09-11 at 09 08 34

Chunk layout:
Screenshot 2020-09-11 at 09 15 54

The most important information is that the node_modules are 60% of the bundle and the remaining 40% is the original app code.

The main sources of bloat that I see here:

  • the caniuse-lite data. It's used by browserslist-useragent used by the isUAInBrowserslist function that determines whether the browser deserves the evergreen or the fallback assets.
  • the moment-locale data. What is new after this patch is that something is requiring moment-locale synchronously and it's no longer in a separate chunk.
  • it seems we're bundling lodash twice. The server should use the lodash-es replacement plugin.
  • the request library (can't find where it's used) depends on several exotic crypto librares (sshpk, tweetnacl) and also on mime-db

@matticbot
Copy link
Contributor

@jsnajdr jsnajdr requested a review from scinos September 11, 2020 08:05
@matticbot matticbot added the [Status] Needs Review The PR is ready for review. This also triggers e2e canary tests and wp-desktop tests automatically. label Sep 11, 2020
@jsnajdr jsnajdr requested review from sirreal, sgomes and a team September 11, 2020 08:05
@matticbot
Copy link
Contributor

This PR does not affect the size of JS and CSS bundles shipped to the user's browser.

Generated by performance advisor bot at iscalypsofastyet.com.

@scinos
Copy link
Contributor

scinos commented Sep 11, 2020

Should we remove node_modules form the docker image to see if it still works, or do you plan to do that in a future PR?

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 11, 2020

Should we remove node_modules form the docker image to see if it still works, or do you plan to do that in a future PR?

Good point, I believe it should be in this PR, one of the main methods how to validate it.

Does it mean adding a command into the Dockerfile, after building Calypso and before launching it?

RUN rm -rf node_modules

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 11, 2020

We could also extract the node-modules-deleting code from the distclean command and run it as

yarn run clean:packages

That would also remove the nested client/node_modules etc, not only the folder in root.

@sirreal
Copy link
Member

sirreal commented Sep 11, 2020

I think the ideal way of removing node_modules is building on #45453 by removing these lines so node_modules is never present in the application image:

COPY --from=builder --chown=nobody:nobody /calypso/node_modules /calypso/node_modules/

COPY --from=builder --chown=nobody:nobody /calypso/client/node_modules /calypso/client/node_modules/

@scinos
Copy link
Contributor

scinos commented Sep 11, 2020

To remove it for good so it doesn't contribute to the image size, we need to make sure it is never present in the final image, what @sirreal said.

To remove node_modules to validate that this change works, I think we need something like RUN find . -name node_modules | xargs rm -fr before running the app will work. The problem is that we won't know for sure if it works in all cases: if a module is required only on some URLs, or under some specific conditions, we wouldn't know unless we hit that URL/conditions in a test (either automatically or manually).

Maybe we also need some bundle analysis to see if there are calls to require() with external modules?

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 11, 2020

I added the RUN find command to the Dockerfile that removes the node_modules folders. That should be good enough for now. To be superseded by the multi-stage builds.

Maybe we also need some bundle analysis to see if there are calls to require() with external modules?

That's a good point! For every external, webpack creates a stub module that looks like this:

/* 1 */
/***/ (function(module, exports) {
  module.exports = require("lodash");
/***/ }),

By grepping the bundle.js for exports = require, I get this list:

module.exports = require("util");
module.exports = require("fs");
module.exports = require("path");
module.exports = require("crypto");
module.exports = require("url");
module.exports = require("http");
module.exports = require("stream");
module.exports = require("events");
module.exports = require("https");
module.exports = require("tty");
module.exports = require("net");
module.exports = require("buffer");
module.exports = require("querystring");
module.exports = require("zlib");
module.exports = require("assert");
module.exports = require("child_process");
module.exports = require("../client/server/devdocs/search-index.js");
module.exports = require("tls");
module.exports = require("string_decoder");
module.exports = require("punycode");
module.exports = require("os");
module.exports = require("vm");

Except the search-index.js, which is expected to be an external, they all look like Node builtin modules, don't they?

Looking at the output of require( 'module' ).builtinModules, I see them all there.

@scinos
Copy link
Contributor

scinos commented Sep 11, 2020

I check out this branch (commit 2d07ca7065) and run yarn build-server. I end up with a file build/bundle.js which is 28 megabytes. Grepping for require *\(.*)(by running cat build/bundle.js | grep -oE 'require *\(.*)' | sort -u) I got a different list:

require( "source-map-support" ).install()
require(" + JSON.stringify(moduleJsPath) + ")
require(" + result.moduleId + ")
require("../client/server/devdocs/search-index.js")
require("../client/webpack.config.js")
require("./bundle." + ({"lib-analytics-signup":"lib-analytics-signup"}[chunkId]||chunkId) + ".js")
require("assert")
require("async_hooks")
require("buffer")
require("child_process")
require("constants")
require("crypto")
require("events")
require("fs")
require("http")
require("https")
require("inspector")
require("module")
require("net")
require("os")
require("path")
require("punycode")
require("querystring")
require("stream")
require("string_decoder")
require("tls")
require("tty")
require("url")
require("util")
require("vm")
require("zlib")
require(${JSON.stringify(moduleJsPath)})(module)
require(${JSON.stringify(request)})
require(${JSON.stringify(request[0])})
require(${externalsDepsArray}, function(${externalsArguments})
require(${request})
require('ReactPropTypes')
require('braces')
require('buffer')
require('cache-base').namespace('data')
require('ejs').__express)
require('ejs').renderFile)
require('expand-brackets')
require('express')
require('extglob')
require('fs')
require('fs').readFile(filename, 'utf-8',  function(err, content)
require('http')
require('http'), https = require('https'), fs = require('fs')
require('https')
require('iconv-lite').extendNodeEncodings() is not supported in your version of Node")
require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
require('micromatch')
require('nanomatch')
require('path')
require('path').dirname(filename), filename)
require('regex-not')
require('repeat-string')
require('scheduler/unstable_mock'));\n\n" + 'For more info, visit https://fb.me/react-mock-scheduler')
require('snapdragon-node')
require('static-extend')
require('use')
require('util')
require('verror')
require('vinyl')
require('vm').runInThisContext('(function(exports, require, __dirname, __filename) {' + content + '\\n})', filename)
require()
require())
require(o, !0)
require(…, …)

When running with NODE_ENV=production I get:

require( "source-map-support" ).install()
require("../client/server/devdocs/search-index.js")
require("./bundle." + ({"0":"lib-analytics-signup"}[chunkId]||chunkId) + ".js")
require("assert")
require("buffer")
require("child_process")
require("crypto")
require("events")
require("fs")
require("http")
require("https")
require("net")
require("os")
require("path")
require("punycode")
require("querystring")
require("stream")
require("string_decoder")
require("tls")
require("tty")
require("url")
require("util")
require("vm")
require("zlib")
require('buffer')
require('ejs').__express)
require('ejs').renderFile)
require('express')
require('http')
require('http'), https = require('https'), fs = require('fs')
require('https')
require('iconv-lite').extendNodeEncodings() is not supported in your version of Node")
require('iconv-lite').undoExtendNodeEncodings(): Nothing to undo; extendNodeEncodings() is not called.")
require('util')
require('verror')
require()
require())

Probably most of those are in comments or load a file in ./client/..., but probably we should check that.

@scinos
Copy link
Contributor

scinos commented Sep 11, 2020

Thinking about the safety of this change... how do we know we don't have a module that does a dynamic require, something like:

//Just an example, it could be a dynamic require based on anything.
const platform = require('os').platform();
const parser = require('./parser-'+platform);

I don't think that will be bundled by webpack, and therefore the module will be missing in production. How do we protect ourselves against that pattern?

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 11, 2020

const parser = require('./parser-'+platform);

I don't think that will be bundled by webpack

Webpack will bundle this: it tries really hard to understand the './parser-' + platform expression and bundles all modules and files it can find that match the ./parser-* pattern, including subdirectories like ./parser-darwin/silicon.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 11, 2020

run yarn build-server. I end up with a file build/bundle.js which is 28 megabytes.

Oops, that's a good catch. Development mode should continue to use the externals. The node_modules folder is there, bundling them is not useful, and the dev server also loads entire webpack and friends to do in-process builds with dev-server and dev-middleware. Will fix.

I'll have a closer look at the production require's that you report. Thanks! 👍

@scinos
Copy link
Contributor

scinos commented Sep 11, 2020

Oh, I didn't know about webpack interpreting './parser-' + platform, that's cool. For any future readers, this is documented in https://webpack.js.org/guides/dependency-management/#require-with-expression.

However, it is rather weak. Based on an experiment (https://codesandbox.io/s/wizardly-shape-8lofs?file=/src/index.js), seems like it only works if the expression inside require contains a string. Example:

// Works, webpack bundles any file inside `./test`
const p = "a.js"
require("./test/"+p);

// Doesn't work, webpack does not bundle ./test/a.js
const p = "./test/a.js";
require(p);

So there is still a chance that webpack won't be able to bundle everything needed at runtime.

Edit: and that's only about require... any module can do eval(fs.readFileSync('file.js')) and I'm pretty sure we won't be able to bunde file.js. I'm pretty sure there are lots of different hacky/gross ways to load files that webpack can't possibly detect.

@jsnajdr jsnajdr force-pushed the build/big-server-bundle branch 3 times, most recently from d5f2c11 to e89f8c8 Compare September 14, 2020 10:43
@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 14, 2020

require( "source-map-support" ).install()

This is something real I had to fix in e89f8c8. Previously, there was a webpack.BannerPlugin that added that line, without being interpreted by webpack in any way, to the beginning of the entry chunk.

I moved the source-map-support to the beginning of the client/server/index.js file, the same place where we load polyfills before doing anything else. Now it's a regular import and the package is bundled.

The source-map-support changes the Error.stack property and uses the source maps when printing source paths.

A console.trace() call without source-map-support:

Trace
    at /Users/jsnajdr/src/wp-calypso/build/bundle.js:197687:13
    at Layer.handle [as handle_request] (/Users/jsnajdr/src/wp-calypso/build/bundle.js:182013:5)
    at next (/Users/jsnajdr/src/wp-calypso/build/bundle.js:181624:13)
    at /Users/jsnajdr/src/wp-calypso/build/bundle.js:197310:41

and the same console.trace() with it:

Trace: 
    at /Users/jsnajdr/src/wp-calypso/build/webpack:/client/server/pages/index.js:754:11
    at Layer.handle [as handle_request] (/Users/jsnajdr/src/wp-calypso/build/webpack:/node_modules/express/lib/router/layer.js:95:1)
    at next (/Users/jsnajdr/src/wp-calypso/build/webpack:/node_modules/express/lib/router/route.js:137:1)
    at /Users/jsnajdr/src/wp-calypso/build/webpack:/client/server/pages/index.js:266:16

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 14, 2020

Another bundling issue I had to fix in 7398b9a, reusing a fix I already had to do in webpack.config.desktop.js when working on the Desktop app build.

The formidable package contains this code that hijacks the require call when the gently library is present (it can do module mocking for testing purposes):

if (global.GENTLY) require = GENTLY.hijack(require);
var crypto = require('crypto');

Unfortunately, webpack translates this to something that doesn't make much sense:

var require;if (global.GENTLY) require = GENTLY.hijack(__webpack_require__(1064));
var crypto = require('crypto');

See how require becomes undefined when global.GENTLY is falsy? That guarantees runtime crashes.

We need to use DefinePlugin to make global.GENTLY statically false. Then webpack ignores the offending code:

if (false) {}
var crypto = __webpack_require__(47);

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 14, 2020

However, it is rather weak. [...] it only works if the expression inside require contains a string.

webpack will at least issue a warning if it can't figure out the require call:

Critical dependency: the request of a dependency is an expression

It's weird that it's just a warning and the compilation will proceed anyway. I don't see any way how to enforce an error here.

webpack has some options to configure these contextual requires, but still no option to make them fail the build.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 14, 2020

and that's only about require... any module can do eval(fs.readFileSync('file.js'))

That's a good point. In practice, I don't see how a package author would be motivated to load JS files in this way, but it can be plausibly used to load and parse package-internal files that require doesn't understand:

// load from disk
const template = fs.readFileSync( __dirname + '/template.html' );
// parse and use
cheerio.load( template );

webpack won't bundle that either.

In the end, I think we'll need to "know our modules", even the ones we use to build the server. On the client side, we are quite familiar with what kind of packages are in our bundle and what they do. On the server, I'm afraid we've been somewhat oblivious to that.

@scinos
Copy link
Contributor

scinos commented Sep 14, 2020

This change looks quite brittle and risky to me.

Maybe we should take a step back and better define why we want to bundle node_modules. What do we expect to get form smaller docker images? Do we expect them to build faster? Do we expect them to deploy faster? By how much?

Answering those questions would allow us to better assess risk vs benefits and explore alternative solutions.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 15, 2020

This change looks quite brittle and risky to me.

I believe we already mitigated many of the risks, so what are the remaining ones? The eval( readFile ) calls? These can be audited by looking at the bundle code and auditing the eval and fs module usages. Is there anything else?

Maybe we should take a step back and better define why we want to bundle node_modules. What do we expect to get form smaller docker images? Do we expect them to build faster? Do we expect them to deploy faster? By how much?

I started this PR after a Slack discussion about #21537 where you said: "I think the biggest problem is the humongous node_modules." That suggests that you are concerned about the big node_modules folder, too, and that you believe it has impact on the performance of our Docker pipeline. Is that still true? Do you consider shipping 1.5GB of node_modules being a problem that's worth solving? There are other methods how to reduce the node_modules, see below. We don't strictly need to bundle everything.

Answering those questions would allow us to better assess risk vs benefits and explore alternative solutions.

One alternative solution is to ship a node_modules folder that includes only the modules that are really needed. To get that, we need to somehow cherry-pick a subset of wp-calypso/node_modules that's really used by the server bundle.

It could be done by:

  1. At build time, produce a list of modules that the bundle does require(). Can be done by a webpack plugin that inspects the ExternalModule instances in the compilation.
  2. Copy these modules from wp-calypso/node_modules to build/node_modules
  3. Recursively look at these modules depenencies and copy these, too.

One thing that's good for us is that the runtime require() calls are happening in the context of the build/ directory, and therefore the root node_modules folder is the only one relevant for module lookups. We don't need to look at, e.g., client/node_modules, because the production bundle doesn't see that folder.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 15, 2020

Some stats about the Calypso production artifacts:

  • the static files in public/, available for HTTP download by browsers, are 94 MB in total. That's the size of uncompressed tarball, with evergreen/ and fallback/ builds and media assets like images/ or fonts/.
  • the server bundle in build/ is 11MB
  • bundling node_modules into the server makes the server bundle 6MB larger
  • the node_modules folder that we are now shipping to production is 1.5GB

Therefore, a production build artifact of Calypso, containing everything that's needed to run it, is approx 100MB of uncompressed stuff. Shipping node_modules makes it currently approx 15 times larger than neccessary.

@sirreal
Copy link
Member

sirreal commented Sep 16, 2020

I've been hoping to help push this forward, but things have been hectic. I probably won't be able to get to this until next week.

@scinos
Copy link
Contributor

scinos commented Sep 21, 2020

I believe we already mitigated many of the risks, so what are the remaining ones? The eval( readFile ) calls? These can be audited by looking at the bundle code and auditing the eval and fs module usages. Is there anything else?

Can we envision an automated test for that? We can't rely on someone doing that audit manually every time a dependency is updated.

In general, modules assume they will be installed in a node_modules, and we don't know how far each author has taken that assumption. I think breaking that assumption is inherently risky, and personally I wouldn't go that path unless the benefits are huge.

One alternative solution is to ship a node_modules folder that includes only the modules that are really needed.

Let's say we move the server code to a separate app (and therefore a separate node_modules). We would have a regular node.js app with its package.json and a production installation of node_modules, which I assume they would be significantly smaller. What are the cons of that solution?

It will have other benefits for development, like separating the backend server form the webpack server.

There are shared code between client and server, but after a quick review seems like some of it can be moved to independent packages (eg: config). For the rest, we have several options we could explore: extract to packages, refactor the code, or compile client as a library that the server consumes.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 21, 2020

Can we envision an automated test for that? We can't rely on someone doing that audit manually every time a dependency is updated.

That automated test would parse the output bundle, and audit all calls to require and fs.readFile, and report suspicious things? Yes, that looks like something that can be done.

In general, modules assume they will be installed in a node_modules, and we don't know how far each author has taken that assumption.

It's true that putting packages for Node.js into node_modules is the norm, and bundling them is something almost no one does.

I tried to do the less risky approach in #45774 today: intercept all externals in the webpack compilation, and use them to select a subset of node_modules for runtime. It looks promising and the resulting node_modules are only 59MB.

Let's say we move the server code to a separate app (and therefore a separate node_modules).

It seems to me that this approach is at odds with SSR. With SSR, server and client are not really separate codebases, but it's the same code compiled for and running in two different environments.

Also, most of the node_modules bloat comes from the build tooling, not from the client part of the app. We'd have to have three independent node_modules trees in the monorepo, that are yarn install-ed indepenently, for:

  • server app
  • client bundle
  • tooling

I'm open to exploring that route, but at this moment it's hard to imagine for me how exactly it would work.

@scinos
Copy link
Contributor

scinos commented Sep 21, 2020

Also, most of the node_modules bloat comes from the build tooling, not from the client part of the app

Maybe a quick solution is to wipe-out node_modules after webpacking everything, and do a simpler yarn install --production (assuming we have our dependencies vs devDependencies properly declared).

@sirreal
Copy link
Member

sirreal commented Sep 23, 2020

There's clearly some risk here, but I'm curious about something. Do we take it as a given that there are essentially three types of packages we might be using?

  • browser-specific (probably uses window globals). What's an example of this?
  • node.js-specific (relies on node environment). Example: express.
  • agnostic? (doesn't rely on anything specific or uses compat libraries or checks environment extensively to only rely on things that are supported or ?). Example: lodash, superagent, many others.

Clearly, some packages only make sense in a certain environment (e.g. express).

I ask this because it seems odd to be more concerned about the behavior of packages when bundled in the context of the server than in the context of the client. We assume that all client-focused or agnostic packages behave well when bundled or when used in Node.js. Are we less comfortable with that assumption for server-only packages?

I'm also curious about what failure would look like in some of the above cases with eval or require( expression ). The danger is that these would appear in server-only packages and break under bundling. How many server-only packages do we rely on and how critical are they? If there's a problem with something like express, I'd expect to see failures quite early and be covered by e2e tests.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 25, 2020

browser-specific (probably uses window globals). What's an example of this?

One example: in Calypso, we use the component-file-picker package which does document.createElement() at the top level and will immediately crash Node.js if used. That's why our server webpack config has to ignore certain components like theme-upload.

agnostic? (doesn't rely on anything specific or uses compat libraries or checks environment extensively to only rely on things that are supported or ?).

This category should also include most React components. A very nice-to-have feature of these is being SSR compatible. I.e., touch DOM very carefully, only in lifecycle methods and useEffect callbacks that are guaranteed to be never called by server-side React.

I ask this because it seems odd to be more concerned about the behavior of packages when bundled in the context of the server than in the context of the client.

That's because in the client, NPM packages are almost always bundled -- it's rare exception to load a module directly from browser.

In the server world, though, bundling node_modules is rare. If anything, you bundle your own code, and ship also a node_modules folder to the production environment. See, for example, the AWS Lambda documentation for Node.js deployment packages: it has instructions how to package code and node_modules in a ZIP package.

And there are packages that really fail when bundled. superagent uses formidable to parse upload forms, and one of the files looks like this:

https://unpkg.com/browse/formidable@1.2.2/lib/incoming_form.js

The require = GENTLY.hijack(require) assignment confuses webpack so much so that the require identifier gets undefined and call attempts fail at runtime.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 25, 2020

I'm also curious about what failure would look like in some of the above cases with eval or require( expression ).

The most dangerous case is where the failing require is triggered at runtime, by hitting a certain code path, serving a specific request. Then it will crash the whole Node.js server which has been running OK so far, and might have passed all sorts of tests.

@sgomes
Copy link
Contributor

sgomes commented Sep 25, 2020

superagent uses formidable to parse upload forms, and one of the files looks like this:

If superagent is the only offender, it may be worth spending some time replacing it with native fetch, like we did on the client side.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 25, 2020

If superagent is the only offender, it may be worth spending some time replacing it with native fetch, like we did on the client side.

It's the only offender we know so far. Anyway, replacing superagent with something more lightweight would be desirable. Even the native https module in Node.js should work well. We use superagent to send WP.com API REST requests (user bootstrap, loading themes during themes SSR, ...) and to send some stats to MC. That's very basic usage.

superagent depends on libraries like formidable to support multipart/* MIME responses, plenty of crypto libraries for authenticating HTTP requests with, .e.g., SSH keys, ... We don't need any of that. Replacing superagent could make the server node_modules significantly smaller.

@jsnajdr
Copy link
Member Author

jsnajdr commented Nov 2, 2020

We merged #45774 instead, closing this one.

@jsnajdr jsnajdr closed this Nov 2, 2020
@jsnajdr jsnajdr deleted the build/big-server-bundle branch November 2, 2020 07:25
@matticbot matticbot removed the [Status] Needs Review The PR is ready for review. This also triggers e2e canary tests and wp-desktop tests automatically. label Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants