Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] - pnpm process hang up #840

Closed
muk-ai opened this issue Nov 17, 2024 · 19 comments
Closed

[Bug] - pnpm process hang up #840

muk-ai opened this issue Nov 17, 2024 · 19 comments
Labels
bug Something isn't working kernel

Comments

@muk-ai
Copy link

muk-ai commented Nov 17, 2024

Describe the bug

After upgrading releasever to 2023.6.20241111 certain commands hang up.

$ pnpm install
Lockfile is up to date, resolution step is skipped
Already up to date



^C^C^C^C^C^C^C^C^C^C

This pnpm process cannot be killed.

$ ps auxwww | grep pnpm
ec2-user    5353 17.1  1.0 1994412 170116 pts/0  Dl+  21:55   0:04 node /home/ec2-user/.nvm/versions/node/v20.9.0/bin/pnpm install
ec2-user    5511  0.0  0.0 222316  2068 pts/1    S+   21:56   0:00 grep --color=auto pnpm
$ sudo kill -KILL 5353
$ sudo kill -KILL 5353
$ sudo kill -KILL 5353
$ ps auxwwww | grep pnpm
ec2-user    5353 11.1  1.0 1994412 170116 pts/0  Dl+  21:55   0:04 node /home/ec2-user/.nvm/versions/node/v20.9.0/bin/pnpm install
ec2-user    5531  0.0  0.0 222316  2032 pts/1    S+   21:56   0:00 grep --color=auto pnpm

There seems to be no load. The process just stops responding.

To Reproduce

execute pnpm install
#840 (comment)

Expected behavior
Usually, this command takes only a few seconds.

Additional context

$ hostnamectl
 Static hostname: ip-172-16-1-250.ap-northeast-1.compute.internal
       Icon name: computer-vm
         Chassis: vm 🖴
      Machine ID: ec207c3edce5552d95110b9be794801c
         Boot ID: 2e69151397344fbc8a0c09aba812c8a4
  Virtualization: amazon
Operating System: Amazon Linux 2023.6.20241111
     CPE OS Name: cpe:2.3:o:amazon:amazon_linux:2023
          Kernel: Linux 6.1.115-126.197.amzn2023.x86_64
    Architecture: x86-64
 Hardware Vendor: Amazon EC2
  Hardware Model: t3a.xlarge
Firmware Version: 1.0
@elsaco
Copy link

elsaco commented Nov 18, 2024

@muk-ai look at the state of your process, the 8th column in the ps output. Your process is in Dl+ state, D meaning uninterruptible sleep, processes that cannot be killed or interrupted with a signal. Sending a SIGKILL won't kill it. Reboot your system!

From man ps:

PROCESS STATE CODES
       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S") will display
       to describe the state of a process:

               D    uninterruptible sleep (usually IO)

@muk-ai
Copy link
Author

muk-ai commented Nov 18, 2024

@elsaco
Thanks for letting me know. After rebooting, the process goes away, but the problem reproduces again.

@muk-ai
Copy link
Author

muk-ai commented Nov 18, 2024

The minimum reproduction method is this.

  1. install node.js
$ sudo dnf install nodejs20
  1. put package.json
{
  "name": "tmp.fnn0IclzCx",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "karma": "~6.4.0"
  }
}
  1. install packages
$ npm-20 install
npm warn deprecated inflight@1.0.6: This module is not supported, and leaks memory. Do not use it. Check out lru-cache if you want a good and tested way to coalesce async requests by a key value, which is much more comprehensive and powerful.
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
⠦
⠏
⠹
⠸
⠴
⠧
⠧^C^C^C^C^C^C^C^C^C

If it does not reproduce, please try deleting node_modules and run npm-20 install several times.

@NanoSector
Copy link

We're seeing the same behaviour with NPM and Yarn. Even a simple npm install -g pm2 causes hangs.

The kernel logs (sudo dmesg -T) show repeating messages like the following after a while:

[Mon Nov 18 10:00:14 2024] INFO: task iou-sqp-2900:2908 blocked for more than 491 seconds.
[Mon Nov 18 10:00:14 2024]       Not tainted 6.1.115-126.197.amzn2023.x86_64 #1
[Mon Nov 18 10:00:14 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Nov 18 10:00:14 2024] task:iou-sqp-2900    state:D stack:0     pid:2908  ppid:2502   flags:0x00004000
[Mon Nov 18 10:00:14 2024] Call Trace:
[Mon Nov 18 10:00:14 2024]  <TASK>
[Mon Nov 18 10:00:14 2024]  __schedule+0x1ad/0x530
[Mon Nov 18 10:00:14 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Mon Nov 18 10:00:14 2024]  schedule+0x5a/0xd0
[Mon Nov 18 10:00:14 2024]  schedule_preempt_disabled+0x11/0x20
[Mon Nov 18 10:00:14 2024]  __mutex_lock.constprop.0+0x372/0x6c0
[Mon Nov 18 10:00:14 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Mon Nov 18 10:00:14 2024]  ? io_submit_sqes+0x303/0x390
[Mon Nov 18 10:00:14 2024]  io_sq_thread+0x275/0x4e0
[Mon Nov 18 10:00:14 2024]  ? membarrier_register_private_expedited+0x90/0x90
[Mon Nov 18 10:00:14 2024]  ? io_sqd_handle_event+0xd0/0xd0
[Mon Nov 18 10:00:14 2024]  ret_from_fork+0x22/0x30
[Mon Nov 18 10:00:14 2024]  </TASK>

Since we rebuild our silver images daily, we can see this broke between November 14th 7:07 GMT+1 and November 15th 7:07 GMT+1. The working base AMI was ami-03ca36368dbc9cfa1, the next broken build was using AMI ami-0fcc0bef51bad3cb2.

Hope this helps.

@rrehbein
Copy link

We ran into the same issue with ami-0ff6f6818640fabe3 and the symptoms are very similar to nodejs/node#55587 which links back to a missing commit gregkh/linux@f4ce3b5 regarding io_uring ovelflow.

@kureaken
Copy link

kureaken commented Nov 19, 2024

We also encountered the same problem.

  1. kernel version
[ec2-user@amazonlinux2023 ~]$ uname -a
Linux amazonlinux2023 6.1.115-126.197.amzn2023.aarch64 #1 SMP Tue Nov  5 17:36:31 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
[ec2-user@amazonlinux2023 ~]$

2 . npm install

[ec2-user@amazonlinux2023 ~]$ npm-20 install
npm warn deprecated inflight@1.0.6: This module is not supported, and leaks memory. Do not use it. Check out lru-cache if you want a good and tested way to coalesce async requests by a key value, which is much more comprehensive and powerful.
npm warn deprecated rimraf@3.0.2: Rimraf versions prior to v4 are no longer supported
npm warn deprecated glob@7.2.3: Glob versions prior to v9 are no longer supported
⠴^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C

@andreasstieger
Copy link

Thank you reporters. This is is indeed the Linux kernel issue mentioned by Ray. Running npm with UV_USE_IO_URING=0 in the environment works around the issue, the upstream fix will be in the next release (let me grab the date...)

@andreasstieger
Copy link

Amazon Linux 2023 release 2023.6.20241111.0 with the Linux kernel 6.1.115-126.197.amzn2023 is affected by an issue in the io_uring subsystem. It can be observed when running the npm cli, and results in hung processes in uninterruptible sleep state (“D” in ps output).

To work around the issue, temporarily disable the libuv use of io_uring by setting the corresponding environment variable UV_USE_IO_URING=0. Using an earlier Linux kernel version also works around the issue. A correction will be made available by December 9, 2024.

@muk-ai
Copy link
Author

muk-ai commented Nov 20, 2024

@andreasstieger Thanks for sharing the cause and workaround. 👍

deki added a commit to aws-samples/aws-lambda-java-workshop that referenced this issue Nov 20, 2024
deki added a commit to aws-samples/java-on-aws that referenced this issue Nov 20, 2024
colincasey added a commit to heroku/heroku-buildpack-nodejs that referenced this issue Nov 20, 2024
colincasey added a commit to heroku/heroku-buildpack-nodejs that referenced this issue Nov 20, 2024
* Disable `io_uring` use in `libuv`

amazonlinux/amazon-linux-2023#840 (comment)

* Update CHANGELOG.md
@dliberatore-th
Copy link

I'm seeing this in the base npm package as well. You don't even need a package.json, you can simply try to install a new npm with npm install -g npm@<VERSION> and you'll get this behavior.

@bobziuchkovski
Copy link

@dliberatore-th That's correct. @andreasstieger 's comment indicated this will affect nodejs in general. That UV_USE_IO_URING=0 option he linked to is documented for nodejs itself.

@dliberatore-th
Copy link

Confirmed, that workaround fixes npm! Thank you!

@infakt-HNP
Copy link

It also causes that yarn install doesn't work on new AMI. Thank you for a hint.

@januszm
Copy link

januszm commented Nov 27, 2024

Could we also bump node to v20 in the next Linux release?

@waynee-tw
Copy link

Confirmed, that workaround fixes npm installing pm2! Thank you!

@sylr
Copy link

sylr commented Dec 10, 2024

Hi @andreasstieger 👋

A correction will be made available by December 9, 2024.

I'm wondering when the fix will be published ? it seems the latest kernel version available is still 6.1.115-126.197.amzn2023.

Regards.

@andreasstieger
Copy link

Thanks for asking, I checked this morning my TZ and this is released. This is kernel-6.1.119-129.201.amzn2023
For more details see https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.6.20241209.html

@squaredx
Copy link

As far as I can tell, there isn't an updated AMI for elastic beanstalk instances. Is there anyway we can get the patched kernel in our EB environments or do we just have to wait for Amazon to publish the new images?

@muk-ai
Copy link
Author

muk-ai commented Dec 19, 2024

After upgrading releasever to 2023.6.20241212, I have confirmed that the issue has been resolved.
Thanks to everyone's reports and to the amazon linux team for the fix.

@stewartsmith stewartsmith added bug Something isn't working kernel labels Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working kernel
Projects
None yet
Development

No branches or pull requests