
Yarn Plug 'N Play should generate a static manifest file, not .pnp.js #6388

Closed
elldritch opened this issue Sep 14, 2018 · 8 comments

@elldritch

elldritch commented Sep 14, 2018

This is a response to yarnpkg/rfcs#101 (comment). For context, my previous response to the PnP proposal is here.

I won't reproduce in detail the arguments against .pnp.js I've previously presented (they can be read in the context comment). To summarise:

  1. .pnp.js makes dependency analysis very difficult in environments without Node.JS. I believe this covers a non-trivial number of use cases.
  2. .pnp.js presents security concerns that are not outweighed by its advantages in implementation flexibility.

To directly address @arcanis's concerns:

please check the whitepaper, it includes a rationale explaining why we went with a JavaScript file

Correct me if I'm wrong, but I interpreted the rationale to be "the implementation is very complex and might change, so it's easier to provide an executable API".

How significant is your use case (running the Node resolution without Node) compared to the ecosystem at large?

Helping developers and enterprises use dependency analysis tools is my day job, and I see this use case again and again and again. One of the methods we use for Node.JS dependency analysis is basically "run npm ls --json and parse the output" and we've repeatedly encountered environments where Node.JS isn't available in a CI environment even for projects written in Node.JS. CI environments at large enterprises are often complex, multi-stage, and multi-image; integrating polyglot dependency analysis tools at the project-level CI stage where Node.JS is available is often infeasible.
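To make the method concrete, it is roughly the pattern below (a minimal sketch, not our actual tooling; the flattening helper is invented for illustration):

const { execSync } = require("child_process");

// Run `npm ls --json` and flatten the dependency tree into name/version
// pairs. Note: `npm ls` exits non-zero on unmet peer dependencies, which
// makes execSync throw; real tooling has to handle that.
function listDependencies(projectDir) {
  const tree = JSON.parse(
    execSync("npm ls --json", { cwd: projectDir, encoding: "utf8" })
  );
  const deps = [];
  const walk = (node) => {
    for (const [name, info] of Object.entries(node.dependencies || {})) {
      deps.push({ name, version: info.version });
      walk(info);
    }
  };
  walk(tree);
  return deps;
}

console.log(listDependencies(process.cwd()));

Even this trivial pattern requires a working Node and npm toolchain in the environment performing the analysis, which is exactly what's missing in the CI environments I'm describing.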

This isn't an off-hand edge case. We get bug reports caused by missing Node runtimes during dependency analysis on a regular (~weekly) basis.

The flexibility you talk about comes at a price, which is the flexibility for everyone using Node - consumption from third-party JS tools would be made harder since users would have to know about the data layout, and only a single implementation of the resolver could ever exist. Versioning data structures is notoriously complex, while JS APIs are easy to version and to make backward compatible regardless of the actual implementation.

This makes a lot of sense to me, and I sympathise with this concern. I think the critical question here is: how often do you think the data layout will change?

Static manifest files have a lot of advantages. They're easy to parse, easy to analyse, portable (requiring Node means the current API is not really portable), and pose significantly fewer security risks. It seems premature to give up on all of these advantages to gain flexibility that you might not end up really needing.

In what scenarios do you see the data layout changing? How often do you think this will occur?

I think the performance improvement from avoiding a copy is so great that I would be surprised if you needed constant data layout changes to improve performance. In the case where changes are not often needed, it's feasible to add a version field to the data structure and provide a resolver library that switches on the version.
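To sketch what I mean (hypothetical field names, not a proposed Yarn format):

const fs = require("fs");

// Load a static resolutions manifest and return a resolver matching its
// declared version. Supporting a new layout only means adding a case here.
function loadResolver(manifestPath) {
  const manifest = JSON.parse(fs.readFileSync(manifestPath, "utf8"));
  switch (manifest.version) {
    case 1:
      // v1: a flat map from "name@reference" to an on-disk location.
      return (name, reference) => manifest.resolutions[`${name}@${reference}`];
    default:
      throw new Error(`Unsupported manifest version: ${manifest.version}`);
  }
}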

I would be much less opposed to an executable API if it were portable (e.g. in Bash).

Given a standardized API, what prevents you from running a script after yarn install that would consume the API and generate a file in whatever format is adaptable to your use case?

I can't consume the API if I can't run Node. For reasons outside of my control, it's infeasible for me to get all integrating Node projects to install a post-install hook.

@ghost ghost assigned arcanis Sep 14, 2018
@ghost ghost added the triaged label Sep 14, 2018
@arcanis
Member

arcanis commented Sep 14, 2018

Hey! Thank you for raising this issue! It's an important point, and since it goes a bit against the current state of things (even if not entirely - cf. webpack.config.js, .eslintrc.js, etc.) I'm more than willing to explain why we went there, and which problems we think it solves.

I can't consume the API if I can't run Node. For reasons outside of my control, it's infeasible for me to get all integrating Node projects to install a post-install hook.

This seems dubious to me. If you can run Yarn, you can run Node. Even in your scenario where you have two images, you still have one image with Node that you use to run Yarn. I don't see what prevents you from doing this conversion as a secondary step. It doesn't have to be a post-install script: you could just run it yourself.

Correct me if I'm wrong, but I interpreted the rationale to be "the implementation is very complex and might change, so it's easier to provide an executable API".

Not so much that it's very complex - we just don't expect other package managers to share our idea of what the implementation should be. NPM has a history of doing things its own way, and I don't want to end up in a scenario where projects have to live with X different formats (assuming that PnP gets traction). Standardizing it as a JS API is a good way to circumvent such issues.

They're easy to parse, easy to analyse

I disagree on the "easy to analyze" part. In order to analyze them, you have to understand them. Meaning that you have to implement the exact same logic as any other loader, and any shortcut might give one of your users wrong results. An API is much easier to use and less error-prone.

In what scenarios do you see the data layout changing? How often do you think this will occur?

I don't know. What I do know is that until 24h ago, the prospect of removing node_modules wasn't something anyone would have deemed credible, but here we are. I'd like to make sure that the improvements we make won't lock us out of future evolutions.

I would be much less opposed to an executable API if it were portable.

I think Node is the most portable thing we can get for Node applications 😉

@elldritch
Author

If you can run Yarn, you can run Node.

I think Node is the most portable thing we can get for Node applications 😉

This is where we disagree. My point is that "analyze a Node project in an environment without Node.JS" is not an uncommon edge case. This is my day job, and I've seen the pattern of "the organisation-wide CI does not have the runtime" over and over again across countless organisations, including top technology companies, Fortune 100s, etc.

It might help if I give you more context. Here is how I believe you imagine the integration process looks (this is how I thought of it at first too):

  1. A developer writes Node.JS project.
  2. A developer wants to integrate Node.JS tool.
  3. A developer runs Node.JS tool in the build/test image for the Node.JS project, where Node-the-runtime is available.

Here is how this process actually looks across many companies, 90% of the time (this is the rule, not the exception):

  1. Thousands of developers write hundreds of projects, some of which are in Node.JS.
  2. Since each of these projects is different, each runs its own project CI. Each CI is managed by a developer familiar with the project. The organisation has an organisation-wide CI that does final smoke testing and automatic release. The organisation-wide CI only has the common dependencies for smoke testing, which often does not include Node.JS.
  3. An engineering decision-maker decides to integrate dependency analysis.
  4. Individual project CI maintainers push back, because they see this as more work for them.

At this point, we have two options for integrating a tool across an organisation:

  1. Go to each team responsible for a Node.JS project (there can be tens to hundreds of these teams), set up a meeting with them, figure out how to work with their project CI's quirks, and help them integrate the tool. In this path, the tool can rely on having Node.JS in exchange for having a long and painful path to integration.
  2. Integrate once across the entire project at the organisation-wide CI step. In this path, we don't necessarily have access to Node.JS in the environment, but we can integrate across many projects at once.

I cannot overemphasise how common this organisational structure and use case are, or how painful it is to integrate on a project-by-project basis. One significant advantage of using a file format is that we can do analysis at the organisation level (and in general, we can do analysis in a portable way, regardless of what environment the tool is running in).

On one hand, I can see the argument that this is a people/enterprise problem and not a tool problem. On the other hand, I believe that tools should be built around real problems felt by real users, and not supporting this use case will cause a lot of pain for a lot of users down the line.

I disagree on the "easy to analyze" part. In order to analyze them, you have to understand them. Meaning that you have to implement the exact same logic as any other loader, and any shortcut might give one of your users wrong results. An API is much easier to use and less error-prone.

There are a couple of subtle effects in practice that I think make analysis much less difficult than you believe. I'm going to break this down into a few points:

In order to analyze them, you have to understand them

I don't think the file format would be significantly harder to understand than the API. Obviously, you know better than me here; but I imagine that you can represent most of the needed state by mapping package names + revisions to file paths.
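For illustration, I'd imagine something in the spirit of the following (a purely hypothetical shape, not a proposed format):

{
  "version": 1,
  "resolutions": {
    "left-pad@1.3.0": ".yarn-cache/left-pad-1.3.0",
    "lodash@4.17.11": ".yarn-cache/lodash-4.17.11"
  }
}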

you have to implement the exact same logic as any other loader, and any shortcut might give one of your users wrong results

One thing that we've learned in practice (that was very surprising to me) is that users are generally okay with approximately correct answers. As an example, the configuration spec for Bower is absolutely crazy (you can override the bower_components folder name with environment variables, global configuration files, project configuration files, etc.) but users are very happy with the 99.9% correct answer of "just read bower_components".

At the very least, users in practice have strongly preferred "approximately correct" to "no answer available" (or even "answer available, but you have to put in a lot more effort").
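To illustrate just how small that heuristic is (a sketch; .bower.json is the resolved-metadata file Bower writes into each installed package):

const fs = require("fs");
const path = require("path");

// The "99.9% correct" answer: ignore every configuration override and just
// read bower_components, taking each package's resolved version from its
// .bower.json.
function approximateBowerDeps(projectDir) {
  const dir = path.join(projectDir, "bower_components");
  if (!fs.existsSync(dir)) return [];
  return fs.readdirSync(dir).map((name) => {
    const meta = JSON.parse(
      fs.readFileSync(path.join(dir, name, ".bower.json"), "utf8")
    );
    return { name, version: meta.version };
  });
}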


It sounds to me like the main problem right now is that I haven't convinced you that my use case is common and real. What other context can I share that would help you make that decision?

@edmorley
Contributor

Integrate once across the entire project at the organisation-wide CI step. In this path, we don't necessarily have access to Node.JS in the environment, but we can integrate across many projects at once.

If you can bundle a tool, why can't you also bundle a runtime for the tool, at the org-wide CI step?

@arcanis
Member

arcanis commented Sep 17, 2018

Go to each team responsible for a Node.JS project (there can be tens to hundreds of these teams), set up a meeting with them, figure out how to work with their project CI's quirks, and help them integrate the tool.

At least for now, Plug'n'Play will require a small integration, in that those teams will have to enable it manually (by adding "installConfig": {"pnp": true} to their package.json). What would you think of simply adding a step to your global org CI, similar to this:

# Fail the org-wide CI for projects that enable PnP without also
# checking in a static resolutions table.
if [[ -f .pnp.js && ! -f static-resolutions.json ]]; then
  echo "It seems you're trying to register a package without checking in the static resolutions table"
  echo "Check this page for more information on how to solve this issue: http://internal.wiki/pnp"
  exit 1
fi >&2

It sounds to me like the main problem right now is that I haven't convinced you that my use case is common and real

I think it's more that I'm not convinced that it's something that should be fixed at the package manager level. I'm trying to understand why there is no other layer where this can/should happen.

@elldritch
Author

@edmorley: bundling a runtime is possible, but I would rather avoid this step. It's a significant complexity cost that every tool trying to analyse Yarn PnP projects would need to pay.

@arcanis:

What would you think of simply adding a step to your global org CI similar to this?:

This sounds like a good intermediate step. The pain here is that building those static resolution tables will still require a project-by-project integration.

I think it's more that I'm not convinced that it's something that should be fixed at the package manager level. I'm trying to understand why there is no other layer where this can/should happen.

My intuition says that the package manager is the right place to fix this, but I'm still trying to gather those thoughts into words. It feels to me like the package manager is the most seamless layer to implement this functionality. As another data point, here's another comment with similar concerns (excerpt below): #2250 (comment)

It would be great, then, if there were a better way than the above to figure out what's in a Yarn lockfile, when using something other than Node as our scripting language.

@arcanis
Member

arcanis commented Sep 25, 2018

Hey folks, I put together the following project: https://github.com/arcanis/build-pnm

It consumes the PnP API in order to generate a static package-name-maps file. As you can see by checking the source, it's not a lot of code, and you could easily adapt it to output any format you want! In fact, it could likely be generalized to output data based on a template 🙂
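The gist is a small walk over the dependency graph the API exposes, roughly like this (a sketch in the same spirit, not build-pnm's actual source; getPackageInformation and the top-level {name: null, reference: null} locator are part of the PnP API):

const fs = require("fs");
const pnp = require("./.pnp.js");

// Breadth-first walk over the PnP dependency graph, collecting each
// package's on-disk location into a plain serializable object.
const resolutions = {};
const queue = [{ name: null, reference: null }]; // top-level locator
while (queue.length > 0) {
  const { name, reference } = queue.shift();
  const key = `${name}@${reference}`;
  if (key in resolutions) continue;
  const info = pnp.getPackageInformation({ name, reference });
  resolutions[key] = info.packageLocation;
  // packageDependencies is a Map from dependency name to resolved reference.
  for (const [depName, depReference] of info.packageDependencies) {
    if (depReference !== null) queue.push({ name: depName, reference: depReference });
  }
}

fs.writeFileSync("static-resolutions.json", JSON.stringify(resolutions, null, 2));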

The pain here is that building those static resolution tables will still require a project-by-project integration.

This could be alleviated if we had a plugin system. I have something in mind regarding this; I haven't progressed enough to show anything yet, but just know that it's something I'd be super interested in pursuing.

It would be great, then, if there were a better way than the above to figure out what's in a Yarn lockfile, when using something other than Node as our scripting language.

It's a somewhat different concern, in that the problems behind the lockfile format being close-to-YAML-but-not-YAML are clear, and the solution is well understood (we should have added a few characters that would have made the file a subset of YAML). The topic here is much more uncharted territory.

@elldritch
Author

The plugin system sounds interesting -- our Maven integration uses this route as well.

It's a somewhat different concern in that the problems behind the lockfile format being close-to-yaml-but-not-yaml are clear, and the solution is well understood (we should have added a few characters that would have made the file a subset of yaml).

Yes, that's correct. My point was that there exist other people out there who want to understand Yarn projects without using Node.

@arcanis
Member

arcanis commented Mar 11, 2019

Closing: Yarn v2 has an option (pnpEnableInlining) that disables the inlining and causes Yarn to generate two files instead of one: the manifest and the loader. Additionally, we'll expose a special package that you will be able to depend upon to instantiate a fresh PnP API from a serialized manifest.
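With inlining disabled, generating the split files is a single configuration flag (shown against the v2 .yarnrc.yml configuration; in released versions the serialized manifest lands next to the loader as .pnp.data.json):

# .yarnrc.yml
pnpEnableInlining: false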

Note that the manifest should be treated as an opaque file, and any assumption regarding the data it contains might lead to breakages (ideally only between major releases, but possibly minors as well).
