Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server build: add webpack plugin that identifies and ships node_modules used by server #45774

Merged
merged 10 commits into from
Oct 26, 2020

Conversation

jsnajdr
Copy link
Member

@jsnajdr jsnajdr commented Sep 21, 2020

This is a proof-of-concept of an alternative to full node_modules bundling in #45570.

Adds a webpack plugin that at the end of compilation looks up all external modules and builds a subset of node_modules that are used by the server bundle.

The subset is composed both from modules that are required directly, and also by recursively checking these direct modules' dependencies.

Then it creates a new build/node_modules directory with that subset copied out of root node_modules. It's indented to be shipped to the runtime environment, e.g., as part of Docker container. Leaving the root node_modules behind in the build environment.

This is a very rough draft and a proof of concept, soliciting feedback.

The resulting build/node_modules is now 59MB, quite an improvement from the 1.5GB without this optimization.

@jsnajdr jsnajdr added the Build label Sep 21, 2020
@jsnajdr jsnajdr requested review from scinos and a team September 21, 2020 11:17
@jsnajdr jsnajdr self-assigned this Sep 21, 2020
@matticbot
Copy link
Contributor

@matticbot matticbot added the [Status] Needs Review The PR is ready for review. This also triggers e2e canary tests and wp-desktop tests automatically. label Sep 21, 2020
@matticbot
Copy link
Contributor

This PR does not affect the size of JS and CSS bundles shipped to the user's browser.

Generated by performance advisor bot at iscalypsofastyet.com.

contexts = [ modFolder + '/node_modules', ...contexts ];

// iterate its dependencies
for ( const dep of Object.keys( pkg.dependencies ) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to consider peerDependencies and optionalDependencies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say we don't. peerDependencies don't trigger installation -- the depedency is supposed to be installed by someone else. Although I read some news that the very latest NPM will start installing them, too, in certain cases.

optionalDependencies are very rare, and optional. But it's possible that a missing optional dependency will disable some feature that was until now enabled, due to a conditional require succeeding. Worth auditing. It's not a problem for the plugin to start checking for them, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that if a module declares peerDependencies or optionalDependencies, and they are installed, we should copy them as well because chances are the module are importing them.

const modFolder = contexts[ 0 ] + '/' + mod;

// read module's package.json
const pkg = JSON.parse( fs.readFileSync( modFolder + '/package.json' ) );
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we replace this logic with something more native, using module.createRequire(filename) and require.resolve(request[, options])?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to use require.resolve before writing my own naive resolver, but is didn't have a crucial feature I need: specifying the "context" directory from which to do the resolution. I.e., when node_module/foo depends on bar, find the node_modules/foo/node_modules/bar instance rather than node_modules/bar.

TIL about createRequire -- even after reading the docs, I still don't get what it does 🙂

The plugin should probably use the webpack internal resolver rather than anything else. Something to be done as we improve this from a proof of concept into something shippable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, it does exactly what you need: create a require function that starts the resolution in a provided directory, not in the current path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, it seems then that module.createRequire and require.resolve is exactly the combo I needed here 👍


module.exports = class NodeModuler {
apply( compiler ) {
compiler.hooks.afterEmit.tapAsync( 'NodeModuler', ( compilation ) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this run after applying webpack aliases and module replacements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, module aliases and replacements are applied very early, when building the module dependency graph (starting with the entrypoint and descending recursively into dependencies).

On the other hand, afterEmit happens very late, after all the JS assets were compiled, optimized and written to the disk.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 22, 2020

@scinos I started tracking peerDependencies and optionalDependencies, too, and the results are:

The server bundle directly requires 78 packages, and when we tracked only dependencies, it led to shipping 282 packages total. With optional and peer dependencies, the number rose slightly to 302.

Optional dependencies: They are very rare, and ones that were not shipped already (due to being a dependency of another package) come exclusively from bunyan: dtrace-provider, mv, safe-json-stringify.

Peer dependencies: Most of them are react or redux. One that stands out and wasn't shipped before is @wordpress/data, a peer dependency of composite-checkout. Omitting this one could really lead to runtime crashes due to a missing package. The @wordpress/data package is declared as peer to avoid duplicate installs -- one of the things we do to battle @wordpress/* package duplication.

While working on this, I found that composite-checkout has a dependency it shouldn't have: cache-loader. It's a dev dependency, not used at runtime, and its inclusion caused another 200+ packages to be bundled, because it has webpack as peer dependency.

@jsnajdr
Copy link
Member Author

jsnajdr commented Sep 22, 2020

There's one issue that must be resolved to make this plugin work: right now, we also externalize monorepo packages like wpcom, and then ship them to the output folder, copying the symlink to ../packages/wpcom.js. That's obviously wrong.

We should either bundle these packages (we can ensure that our own packages are bundle-able), or follow the symlinks and copy the packages/ subdirectories to the output directory.

The latter is a bit suboptimal, because not all files in the package directory needs to be shipped. There is the files field in package.json that selects files that are to be published to the NPM registry.

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 2, 2020

I believe this PR is now ready for review and deploy.

It adds a webpack plugin that finds all external modules in the (finished) compilation and writes their names into a list in build/modules.json. That's all what we really need to do in webpack.

Then there is a yarn run build-server:copy-modules task (with a script in /bin that it runs) that traverses all dependencies in modules.json, finds these modules in node_modules, transitively traverses all their dependencies, and the creates a list of all dependencies that need to be shipped. Then it copies the shipped modules to build/node_modules.

Notes:

  • @scinos suggested to use Node.js native require.resolve and module.createRequire APIs, but after trying them out, I discovered they are not a good fit and cause me more trouble than joy. require.resolve( package ) returns the full path to the JS file I should load, e.g., node_modules/foo/dist/cjs/index.js. First, we are not really interested in that path -- we only want to resolve the package's home folder and its package.json. Second, some packages fail to resolve in such a way. For example, require.resolve( '@babel/runtime' ) fails, because the package doesn't have any index-like file. The consumer is supposed to import files from specific subpaths, like require( '@babel/runtime/helpers/defineProperty' ).
  • we can't ship packages in node_modules that are really symlinks to monorepo packages. The copy-production-modules script detects that (by calling fs.realpath on the module path, and checking if it resolves to a directory outside node_modules) and will fail completely when it encounters such a module. We need to bundle all monorepo packages to get a successful build.

@@ -0,0 +1,124 @@
const fs = require( 'fs' ); // eslint-disable-line import/no-nodejs-modules
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit // eslint-disable-line import/no-nodejs-modules should be moved to ./bin/.eslintrc.js. Importing node modules should be ok to any file in ./bin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed 👍

// it is not a symlink to a monorepo packages. Such packages couldn't be shipped and
// must be bundled.
const pkgRealpath = path.relative( '.', fs.realpathSync( pkgFolder ) );
if ( ! pkgRealpath.startsWith( 'node_modules/' ) ) {
Copy link
Contributor

@scinos scinos Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if server imports something that is in client/node_modules, or in client/server/node_modules? Are those bundled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server bundle is in build/server.js, both before and after this patch, so it doesn't see modules in client/. Its resolution paths are build/node_modules, node_modules and parents.

Modules in client/ are visible only to webpack at build-time, when the resolution context are the source directories. Once a module is externalized, i.e., translated to a raw require( 'foo' ) call executed by Node.js at runtime, this context is lost.

This could be a problem, because there's no guarantee that all externalized modules are always hoisted to the top node_modules directory. But this particular PR doesn't make it better or worse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is an existing (and pretty bad) problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one more little detail involved here: the webpack-node-externals plugin builds the list of externals by listing the root node_modules directory. If the sources request a foo package and the package is in client/node_modules/foo and not in node_modules/foo, the plugin won't identify foo as external and will let webpack bundle it.

@@ -76,6 +76,8 @@
"build-static": "npx ncp static public",
"prebuild-server": "mkdirp build",
"build-server": "BROWSERSLIST_ENV=server webpack --config client/webpack.config.node.js --display errors-only",
"build-server:copy-modules": "node bin/copy-production-modules.js",
"postbuild-server": "node -e \"process.env.CALYPSO_ENV === 'production' && process.exit(1)\" || yarn run build-server:copy-modules",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this in other envs as well like stage or wpcalypso?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CALYPSO_ENV variable is a bit weird as it has two meanings.

During build, the values used in practice are development, production and desktop. There's also desktop-development, but I think that setup is not functional at this moment.

We never use stage or wpcalypso at build time. These values are used only in the second scenario: setting an environment variable at runtime. At runtime, we always run the same build produced by the production setup, and it's instructed to load the stage or wpcalypso configs by the CALYPSO_ENV environment variable.


// The polyfills module is transpiled by Babel and only the `core-js` modules that are
// needed by current Node.js are included instead of the whole package.
'@automattic/calypso-analytics',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we generate this list automatically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be great. Instead of a static list, there could be a function isMonorepoPackage that determines the monorepo-ness of a package. But I don't know how to write such a test right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could find all package.json in packages/*/package.json, read the name and compose this list dynamically.

Otherwise we risk anybody adding a new package used in client or server and breaking only in production.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice idea, implemented using globby and it seems to work beautifully 👍

@scinos
Copy link
Contributor

scinos commented Oct 15, 2020

I was playing with this PR and realized there is potentially an easier solution:

The subset is composed both from modules that are required directly, and also by recursively checking these direct modules' dependencies.

Isn't that roughly similar to this process?

  1. Ensure client/package.json has all dependencies correctly declared
  2. Tell yarn to not hoist packages from client/node_modules
  3. Install packages from client/packages.json (bonus:install in prod mode to exclude devDependencies)
  4. Copy client/node_modules to build/node_modules

With the current client/package.json, when not hoisting client/node_modules it is ~140Mb. I still have to verify it is correct and works, but looks promising.

It is not strictly the same (there may be a package declared in client/package.json that is not actually used by the server or the isomorphic code), but my gut feeling is that those unnecessary modules aren't too big

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 15, 2020

Tell yarn to not hoist packages from client/node_modules

Would that be a modification of the existing yarn install in wp-calypso? Or would it be some special install done only to create the production node_modules folder?

The current client/package.json declares many dev and test dependencies as dependencies, i.e., various webpack loaders or Jest plugins. We'd have to move these to devDependencies, is that right?

I'm not sure how the non-hoisted install in client/node_modules would work with symlinked monorepo packages. client/node_modules/@automattic/components would be a a symlink to packages/components. Would we follow the symlink and copy the packages/components folder recursively?

Another thing about monorepo packages that requires attention is: if packages used by client/ are not hoisted, it means that, e.g., react resolves to client/node_modules/react. But @automattic/components also depends on react, but modules from this package resolve react to either packages/components/node_modules/react or node_modules/react. I.e., different instances.

@scinos
Copy link
Contributor

scinos commented Oct 15, 2020

Would that be a modification of the existing yarn install in wp-calypso? Or would it be some special install done only to create the production node_modules folder?

I think it should be a special install only to create production node_modules. Technically we only need to add "calypso/*" to workspaces.nohoist in ./package.json, it should be easy to script.

We'd have to move these to devDependencies, is that right?

If we want to optimize it further, yes. After a first pass of removing some dependencies and moving others to devDependencies, I can reduce client/node_modules from 140Mb to 112Mb.

Would we follow the symlink and copy the packages/components folder recursively?

I think the symlink is fine, as long as we also copy ./packages to the Docker image (which we already do). The question is what happens with folders like ./packages/**/node_modules. I guess we'll need to "unhoist" them as well?

There are a few other things I'd like to try: yarn has a --hard-links option for install that may help here. There are also "focused workspaces".

re: multiple instances

True, we would need to do some manual work to ensure dependencies are not duplicated when they are the same version.

--

As an alternative to yarn unhoisting things, we could just delete all dependencies in the root package.json and run yarn install --production (assuming each ./package/* and ./client have its dependencies correctly declared). It will have a similar effect of unhoisting client/node_modules but will solve the problem of packages/*/node_modules and duplicated dependencies.

Which brings in another aspect of this problem: if unhoisted client/node_modules is ~140Mb and node_modules is 1.5gb, that means we have dependencies not in client that weight more than 1 Gb. Given that client is the biggest package (by far) I find that a bit suspicious. Maybe the solution of this problem is prune unused dependencies in package.json and do a correct split between dependencies and devDependencies?

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 16, 2020

I guess we also need to change Dockerfile to not copy the existing node_modules? Otherwise we won't be reducing the image size.

@scinos Could you please push a commit that does that? I'm not very familiar with how we build Docker images these days. Do we already use the multi-stage process where the build is done in one container and then the results are copied to the production container? Or do we still do everything in one container?

We could even go a bit further -- after this PR, Calypso needs only the build/ and public/ directories at runtime. We no longer need to copy the entire calypso/ dir.

Also I think it would be nice to validate that in case there is a package missing, the process ends up logging an error in the JSON logger so we'll see it Logstash.

I.e., when the require call fails at runtime, catch the error and log something nice to Logstash, instead of letting Node.js fail ungracefully? Would you mind implementing that in another commit? Looks like another opportunity to make this PR a collaboration 👍

Sounds like a huge difference compared with the 323Mb of just installing client packages. It would be nice to know where the difference is to validate we are not missing any critical package.

client/package.json has webpack and many of its friends as dependencies. I guess that's where most of the difference comes from.

Can you compare the list of packages that your experiments produce against what the copy-production-modules.js script copies to build/? The script also has a --debug option that lists everything it does to console.

@scinos
Copy link
Contributor

scinos commented Oct 21, 2020

After running this branch and generating the reduced node_modules, I tried to find any call to require() from server.js or node_modules/**/*.js, extract the module name and check that the module is present in the reduced node_modules.

There are a few "missing" modules, but after inspecting many of them, looks like they are native modules, or used in comments, tests, bin scripts or similar, so they are actually not required in production.

Script: for req in $(grep -HEoR 'require *\([^\\()+]*\)' *.js node_modules | grep ".js:" | grep -v 'test' | cut -d '(' -f 2 | cut -d ')' -f 1 | grep -v './' | tr "'" '"' | grep -e '"' | grep -v '\.' | sed 's/"//g' | sort | uniq); do (find . -type d -name $req | grep .) || echo $req; done | grep -v './node_modules'

List:

fs
LiveScript
ReactPropTypes
UnicodeBidiService
_process
acorn
assert
async
browserify
buffer
child_process
commander
createArrayFromMixed
crypto
domain
ejs
events
fetch
fs
http
https
mime-score
module
net
ngrok
os
path
querystring
react-native
react-native-dark-mode
request
stream
temp
tty
uglify-js
uglifyjs-webpack-plugin
url
vm
webpack
word-list
worker_threads
yamlparser
zlib

A notable missing package is acorn (required by build/node_modules/promise/build.js), temp (required by build/node_modules/babel-runtime/node_modules/core-js/build/index.js) and mime-scores (required by build/node_modules/send/node_modules/mime/src/build.js). Probably those requires are in a code path that never gets imported or executed in production, but maybe it is worth checking.

But given the above list, I'm more convinced that this approach is probably copying over everything required in prod.

@scinos
Copy link
Contributor

scinos commented Oct 21, 2020

re: error handling

I've merged a change to report all unhandled exceptions and rejections to logstash in #46638, it should give us visibility if this fails

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 22, 2020

Probably those requires are in a code path that never gets imported or executed in production, but maybe it is worth checking.

Yes, these are all devDependencies of the respective packages, and are used in build scripts.

@scinos
Copy link
Contributor

scinos commented Oct 22, 2020

I guess we also need to change Dockerfile to not copy the existing node_modules? Otherwise we won't be reducing the image size.

@scinos Could you please push a commit that does that?

Done. Now we only copy build, config and public to the final docker image. The final size is 537Mb

image

@scinos
Copy link
Contributor

scinos commented Oct 22, 2020

To be honest, this still looks to me like a patch over the actual problem (that our dependencies are a mess) to get a benefit not entirely clear (given how docker share layers between images, is not as simple as "this image is smaller"). But I don't want to delay this further or block it, so given that we have logs that will tell us if this breaks in prod and all tests are passing, it is good to go for me.

Something we could consider is that we can just run the plugin once, get the list of external modules, and use that list to correctly populate dependencies in client/package.json. That at least will move us closer to a well-defined dependency tree.

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 23, 2020

Thanks @scinos for your help making this happen 🤝 I'll merge it on Monday, as it's not a good candidate for a Friday deploy.

Something we could consider is that we can just run the plugin once, get the list of external modules, and use that list to correctly populate dependencies in client/package.json. That at least will move us closer to a well-defined dependency tree.

This list doesn't contain the entire list of dependencies for client -- only the ones used at runtime by the server. The server doesn't do any localization, doesn't load any components loaded with <AsyncLoad>... That removes many packages used only by the browser bundle.

In a sense, with SSR, the client and server are not two different packages, but rather two bundles built from the same sources. These sources have common dependency set defined in client/package.json, but some are used only in the Node.js server, some only in the browser bundle...

That's why I think that un-messing the dependencies will be a difficult mission, and that a dedicated mechanism to prune the server runtime dependencies will continue to be impactful for a long time. But I'll be happy to help make progress on that nevertheless.

Copy link
Collaborator

@wp-desktop wp-desktop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordPress Desktop CI Failure for job "wp-desktop-mac".

@jsnajdr please inspect this job's build steps for breaking changes at this link. For temporal failures, you may try to "Rerun Workflow from Failed".

Please also ensure this branch is rebased off latest Calypso.

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 26, 2020

I added one more commit to make the requests to /service-worker.js work. This URL was served by express.static from the client/lib/service-worker directory, which is no longer present in the production container and the request therefore 404ed. The patch copies the file to public/ and changes the /service-worker.js handler to serve it from there.

@wp-desktop wp-desktop dismissed their stale review October 26, 2020 08:39

wp-desktop ci passing, closing review

@jsnajdr
Copy link
Member Author

jsnajdr commented Oct 26, 2020

I had to disable the server/pwa module for desktop completely to make the e2e tests green again. I'm not sure why. The only change was express.static looking into another directory. But the target directory didn't exist for desktop both before and after the patch 🙁

The Desktop app doesn't need the PWA functionality anyway: push notifications are handled differently, and it also doesn't need the manifest.json file that provide information about how to add PWA app opened in browser to Home Screen or links to native apps.

FYI @nsakaimbo

@jsnajdr jsnajdr merged commit 47f6057 into master Oct 26, 2020
@jsnajdr jsnajdr deleted the try/runtime-node-modules branch October 26, 2020 10:45
@matticbot matticbot removed the [Status] Needs Review The PR is ready for review. This also triggers e2e canary tests and wp-desktop tests automatically. label Oct 26, 2020
@nsakaimbo
Copy link
Contributor

Thanks for the heads up, @jsnajdr!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants