proposal: Staged require() with lifecycle hooks #56

_This repository has been archived by the owner on Jul 31, 2018. It is now read-only._

`XXX-staged-require-with-lifecycle-hooks.md`

| Title | Staged require() with lifecycle hooks |
|--------|---------------------------------------|
| Author | @qard |
| Status | DRAFT |
| Date | 2017-05-05 |

# Proposal

There are two major purposes to this proposal:
- To provide module loading lifecycle hooks
- To split resolving, loading and compilation concerns

# Motivations

Currently some very popular userland modules depend heavily on monkey-patching
to provide their functionality. This can be unsafe, inflexible and damaging to
performance.

Some examples include:

- Tracing, debugging and monitoring
- Code coverage
- Mocking
- Transpiling

### Safety

Monkey-patching can be unsafe for many reasons.

Sometimes the behaviour of code changes based on the parameter length of a
function given as input to another, such as the `(req, res)`, `(req, res, next)`
and `(err, req, res, next)` route handler signatures in express: when a
handler's arity is 4, express treats its first argument as an error instance.
Correctly monkey-patching express route handlers therefore requires branching
instrumentation that provides a patched function matching the arity of the
original handler. This is easy to miss or get wrong, which will generally
result in a crashed app.
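A minimal sketch of this arity hazard (the `wrapHandler` name and dispatch are
hypothetical, not express internals): a wrapper must expose the same `length`
as the function it replaces, because express dispatches on handler arity.

```js
// Hypothetical instrumentation wrapper: the patched function must expose
// the same arity as the original, because express dispatches on `fn.length`.
function wrapHandler (fn) {
  switch (fn.length) {
    case 4:
      return function (err, req, res, next) { return fn(err, req, res, next) }
    case 3:
      return function (req, res, next) { return fn(req, res, next) }
    default:
      return function (req, res) { return fn(req, res) }
  }
}

const errorHandler = (err, req, res, next) => {}
const routeHandler = (req, res) => {}

// Arity is preserved, so express would still classify each handler correctly.
console.log(wrapHandler(errorHandler).length) // 4
console.log(wrapHandler(routeHandler).length) // 2
```

A single wrapper with `(...args)` would report `length === 0` for every
handler, silently turning error handlers into route handlers.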

Sometimes patches can interfere with getter/setter behaviour. An example of this
is `pg.native` mutating the rest of the postgres driver module to expect the
native bindings to be used. Dynamic instrumentation is unaware of what the app
might want to do in the future, so it needs to patch this optimistically, but
if one naively tries to just do `const oldProp = pg.native; pg.native = newProp`
it will trigger the getter behaviour, breaking the JS interface. Again, this is
easy to miss, likely resulting in a crashed app.
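A sketch of the getter hazard using a mocked-up `pg` object (the lazy `native`
getter here is a stand-in for the real driver's behaviour, not its actual
implementation): copying the property descriptor lets a patch defer to the
original getter instead of firing it.

```js
// Mock of a module with a side-effecting lazy getter, like `pg.native`.
const pg = {}
let nativeLoads = 0
Object.defineProperty(pg, 'native', {
  configurable: true,
  get () {
    nativeLoads++ // stands in for loading native bindings and mutating pg
    return { binding: true }
  }
})

// Naive patch: `const old = pg.native` would fire the getter immediately,
// triggering the side effects before the app has asked for them.

// Getter-aware patch: copy the descriptor without invoking it.
const desc = Object.getOwnPropertyDescriptor(pg, 'native')
Object.defineProperty(pg, 'native', {
  configurable: true,
  get () {
    // instrumentation could hook here before deferring to the original getter
    return desc.get.call(pg)
  }
})

console.log(nativeLoads) // 0 — the original getter has not fired yet
```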

It can be difficult to differentiate between async callbacks that need code to
maintain barrier linkage and sync callbacks which do not. Some interfaces also
accept multiple function arguments or callbacks stored in an options object
which could contain other functions, so dynamically identifying callbacks that
need to be instrumented can be unreliable or even impossible.

Any form of queueing, even simply putting a callback in an array to be called
elsewhere, will break continuation linking and requires manual intervention to
maintain that linkage which is not always possible for dynamic instrumentation.
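A minimal sketch of the queueing problem, using a hypothetical
`currentContext` variable to stand in for continuation state:

```js
// Even trivially queueing a callback breaks continuation tracking:
// by the time the callback runs, the context at call time is gone.
let currentContext = null
const queue = []

function enqueue (cb) {
  // A context-aware patch must capture the context here...
  const captured = currentContext
  queue.push(() => {
    // ...and restore it here, or the link to the initiating operation is lost.
    const previous = currentContext
    currentContext = captured
    try { cb() } finally { currentContext = previous }
  })
}

currentContext = 'request-1'
let seen
enqueue(() => { seen = currentContext })
currentContext = null // the original context is gone by flush time

queue.forEach(fn => fn())
console.log(seen) // 'request-1' — only because the patch captured it manually
```

Without the capture/restore wrapper, `seen` would be `null`; dynamic
instrumentation cannot always insert such a wrapper around arbitrary queues.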

Most APM providers wrap code in try/catch constantly because the existing
solutions for tracking continuation lose context regularly. We've done the best
we can with monkey-patching, but we've hit a wall, and have been pushing for
things like `async-hooks` to improve the situation. However, `async-hooks` only
solves part of the problem: linking the callsite to the callback. Context on
what was happening still requires monkey-patching, and there are plenty of
situations where context loss is not the hard part.

At the code text level, heuristic behaviour analysis can be applied to
understand the complex interactions of the code more deeply, enabling a much
wider range of possibilities for dynamic instrumentation.

### Performance

Monkey-patching is prone to performance issues. The vast majority of patches
require wrapping things in closures. More complex instrumentation can often
trigger deoptimization. Sometimes heavy runtime checking is required, with
constant dynamic function generation and reassignment, which can put a lot of
undue stress on the optimizer and garbage collector.

If one could intercept the module loader and apply code transformations directly
to the code text before compilation, there's a great deal that could be gained.
Most, if not all, run-time checking would simply disappear. Most closures would
disappear. Context loss would mostly disappear because the transformer could
take advantage of lexical scope to describe linkage directly at callsites.

### Flexibility

Many things are simply not possible, to any reasonable degree, relying purely
on monkey-patching. Often the internals that could provide valuable coverage
or tracing insight are just not exposed.

An example of a difficult thing to apply tracing hooks to is the timing from
when a socket first enters JS in `net.js` to when it switches to a TLS socket.
One could listen for the `secureConnection` event, but there are several
failure paths between the `TLSSocket` constructor and the event handler being
triggered, which could result in context loss. One might also think to patch
`TLSSocket.prototype._init`, however that is called by both servers and
clients, which are difficult to differentiate in this case. To correctly
identify and correlate a net socket to a corresponding TLS upgrade takes
extensive monkey-patching with a lot of run-time checks in a major hot-path.
With an AST transform, a single line of code could be inserted
[before the `TLSSocket` construction][1] in the `Server` constructor.

### ES Modules

A big issue on the horizon is that, if ES Modules are adopted, modules become
immutable at build time. This means that the current monkey-patching approach
will not work anymore. Members of TC39 have already suggested AST transforms
to deal with this issue.

Another concern with ES Modules is the difference in loader behaviour. This is
where splitting resolution and loading concerns from compilation is valuable.

# Implementation

There are two possible implementations to discuss, both of which rely on a
`ModuleRequest` object to express the module loading lifecycle. Initially it
would only include the `parent` and `inputPath` fields; the other fields would
be populated during the later lifecycle stages.

### `ModuleRequest`

##### `moduleRequest.parent`

The parent module which initiated the `require()` request.

##### `moduleRequest.inputPath`

The input string to the `require()` call or `import` statement which initiated
this module request.

##### `moduleRequest.resolvedPath`

The fully resolved file path, as determined by the `require.resolvers` handler.
This would be added to the module request object during the `resolver` stage.

##### `moduleRequest.contents`

The content string (or buffer?) of the module. This field is populated during
the `loader` stage.

##### `moduleRequest.module`

The final module object, built by the `compiler` stage.
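The lifecycle can be pictured as a plain object accumulating fields stage by
stage. A sketch of the proposed shape (the values are made up; this is not a
real API):

```js
// Hypothetical progression of a ModuleRequest through the stages.
const moduleRequest = {
  parent: { filename: '/app/index.js' }, // set at creation
  inputPath: './config'                  // set at creation
}

// resolver stage
moduleRequest.resolvedPath = '/app/config.js'

// loader stage
moduleRequest.contents = 'module.exports = { debug: true }'

// compiler stage
moduleRequest.module = {
  filename: moduleRequest.resolvedPath,
  exports: { debug: true }
}

console.log(Object.keys(moduleRequest))
// [ 'parent', 'inputPath', 'resolvedPath', 'contents', 'module' ]
```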

## Hook-based

The idea here is to go for something similar to `Module._extensions`, but with
three stages: resolving, loading and compiling.

### `require.resolvers`

It is the responsibility of `require.resolvers` extension handlers to receive
the in-process `ModuleRequest` object and use the metadata on it to figure out
how to resolve a load. For `require()` relative location loading, it'd simply
look at the `parent` and resolve the input string of the `require()` call
relative to the `parent` module.

This would roughly correspond to the current `Module._resolveLookupPaths`
implementation, but it should instead receive the module request object which
it will modify in-place.

The first resolver to return a non-null value would decide the `resolvedPath`
value in subsequent stages.

```js
const mockedPaths = {
  'foo.js': path.join(__dirname, 'mocked-foo.js')
}

require.resolvers.push(({ inputPath }) => {
  if (Object.keys(mockedPaths).includes(inputPath)) {
    return mockedPaths[inputPath]
  }
})
```

### `require.loaders`

The `require.loaders` extension handlers receive the `ModuleRequest` object
after the `resolver` stage and use the resolved path to attempt to load the
contents of the module file.

Returning `null` would skip to the next loader in the pipeline. If no loader
returns a value, a `MODULE_NOT_FOUND` error is thrown.

```js
require.loaders.push(({ resolvedPath }) => {
  if (!/\.js$/.test(resolvedPath)) return
  try {
    return fs.readFileSync(resolvedPath).toString()
  } catch (err) {
    if (err.code === 'ENOENT') return
    throw new Error(`module load error: ${err.message}`)
  }
})
```

NOTE: How native modules would be handled in the three-stage design is a bit
unclear, as they don't have a textual intermediate representation: they are
loaded directly from the path of a `.node` file into a JS object in C++ via
`process.dlopen`. It might make sense for native modules to have a `compiler`
and not a `loader`.

### `require.transformers`

The `require.transformers` extension handlers receive the `ModuleRequest`
object after the `loader` stage and alter the text content of the module.

```js
require.transformers.push(({ contents, resolvedPath }) => {
  if (!/\.js$/.test(resolvedPath)) return
  return contents.replace(
    /module\.exports\s+=/,
    // Don't forget that `exports` object now!
    'exports = module.exports ='
  )
})
```

Returning `undefined` from a transformer leaves the contents unchanged and
passes them along to the next transformer.


### `require.compilers`

The `require.compilers` extension handlers receive the `ModuleRequest` object
after the `transform` stage and pass the `contents` into a compiler function
to build the final module object.

```js
require.compilers.push(({ contents, resolvedPath }) => {
  if (!/\.yaml$/.test(resolvedPath)) return
  const module = new Module()
  module.filename = resolvedPath
  module.exports = yaml.parse(contents)
  return module
})
```

## Loader objects

Another possible implementation is an object-based design, which would make it
easier to manage dependency ordering between loaders by iterating through the
loaders array and inspecting `name`. This is helpful when the order of loaders
matters, such as ensuring code coverage instrumentation is applied after a
babel transpile.

I prefer this approach, but it's a bit more involved and may be difficult to
achieve without backwards-incompatible changes.

```js
class JSONConfigLoader {
  constructor ({ base }) {
    this.name = 'JSONConfigLoader'
    this.base = base
  }

  // Receives a `ModuleRequest` object
  // Return a value to set `resolvedPath`
  // Return null to remove loader from remainder of this request pipeline
  resolve ({ inputPath }) {
    if (!/^config:/.test(inputPath)) return
    return `${this.base}/${inputPath.slice(7)}.json`
  }

  // Return a string to set `contents` to pass along to subsequent stages
  // Return null to remove loader from remainder of this request pipeline
  // Throw to indicate module loading failure (other than not found)
  load ({ resolvedPath }) {
    try {
      return fs.readFileSync(resolvedPath).toString()
    } catch (err) {
      if (err.code === 'ENOENT') return
      throw new Error(`module load error: ${err.message}`)
    }
  }

  // Return transformed contents
  // Throw to indicate module transform failure
  transform ({ contents, resolvedPath }) {
    if (!contents.includes('testing')) {
      throw new Error('module transform error: "testing" not found')
    }

    return contents.replace(/testing/, 'test')
  }

  // Return the final compiled module object
  // Throw to indicate parse/compile failure
  compile ({ contents, resolvedPath }) {
    try {
      const module = new Module()
      module.filename = resolvedPath
      module.exports = JSON.parse(contents)
      return module
    } catch (err) {
      throw new Error(`module compile error: ${err.message}`)
    }
  }
}

// Add the custom loader to the loaders array
require.loaders.push(new JSONConfigLoader({
  base: path.join(process.cwd(), 'configs')
}))

const myConfig = require('config:some-custom-config')
```



This example is implemented as a class, but it could also just be a simple
object.

Each stage handler would receive a `ModuleRequest` object, with each handler
corresponding to an assignment to a particular property of that request: the
`resolve` method sets `resolvedPath`, the `load` method sets `contents`, the
`transform` method alters `contents`, and the `compile` method sets `module`.

Returning `null` from any of `resolve`, `load`, `transform` or `compile` will
result in that loader being removed from the remainder of the pipeline for that
load request. However, any of `load`, `transform` and `compile` can be left
unimplemented to skip just that step. Returning a value from `load` will skip
the remaining loaders in the `load` stage and proceed to the `transform` stage.
Similar to `load`, the first `compile` to return a non-null value will skip the
remainder of the compile step and that value will become the final module.
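The return-value semantics above could be sketched as a per-stage dispatch
loop (hypothetical, not the proposed implementation):

```js
// Hypothetical dispatch for one stage: `null` drops the loader from the rest
// of this request's pipeline, `undefined` skips just this stage, and the
// first real value wins and ends the stage.
function runStage (loaders, stage, request) {
  let result
  for (const loader of loaders.slice()) {
    if (typeof loader[stage] !== 'function') continue // stage unimplemented
    const value = loader[stage](request)
    if (value === null) {
      loaders.splice(loaders.indexOf(loader), 1) // removed for this request
      continue
    }
    if (value !== undefined) { result = value; break } // first value wins
  }
  return result
}

const loaders = [
  { name: 'a', resolve: () => null },        // opts out of this request
  { name: 'b', resolve: () => '/app/b.js' }, // decides resolvedPath
  { name: 'c', resolve: () => '/app/c.js' }  // never reached
]

const request = { inputPath: 'b' }
request.resolvedPath = runStage(loaders, 'resolve', request)
console.log(request.resolvedPath)     // '/app/b.js'
console.log(loaders.map(l => l.name)) // [ 'b', 'c' ]
```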

# Notes

- The `ModuleRequest` object could possibly just be a `Module` instance, but
  the current resolution stage occurs before a module instance is built, so
  backwards-incompatible changes might be required to do that; hence the safer
  approach of an entirely new object.
- For the object-based style, it might also be a good idea to, rather than have
the `transform` method, have `preCompile` and `postCompile` transformers.

[1]: https://github.com/nodejs/node/blob/master/lib/_tls_wrap.js#L829