Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF+8 encodings are broken #54543

Closed
nikhilro opened this issue Aug 24, 2024 · 18 comments
Closed

UTF+8 encodings are broken #54543

nikhilro opened this issue Aug 24, 2024 · 18 comments
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. regression Issues related to regressions. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch.

Comments

@nikhilro
Copy link

nikhilro commented Aug 24, 2024

Version

22.7.0

Platform

Linux api-deployment-694785c9f5-8dd8j 5.10.223-211.872.amzn2.x86_64 #1 SMP Mon Jul 29 19:52:29 UTC 2024 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

Hey everyone, I'm not sure how to reproduce but latest node can't parse UTF+8 anymore. It works for the first minute or two (or couple hours if I remove Datadog APM instrumentation) but then returns garbage on the same request. I'm just using postgres.js to fetch and nest.js for the HTTP server. No fancy buffer manipulation.

curl --location 'https://api.vapi.ai/assistant/205deb59-755c-489c-8879-7523b1318ed8' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer XXXXXX'
{"id":"205deb59-755c-489c-8879-7523b1318ed8","orgId":"7616920b-4696-458b-a2aa-3453fd13ace4","name":"éñüçßÆ","createdAt":"2024-08-24T08:58:16.110Z","updatedAt":"2024-08-24T08:58:16.110Z","isServerUrlSecretSet":false}%
 
## 2 minutes later      
curl --location 'https://api.vapi.ai/assistant/205deb59-755c-489c-8879-7523b1318ed8' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer XXXXXX'
{"id":"205deb59-755c-489c-8879-7523b1318ed8","orgId":"7616920b-4696-458b-a2aa-3453fd13ace4","name":"������","createdAt":"2024-08-24T08:58:16.110Z","updatedAt":"2024-08-24T08:58:16.110Z","isServerUrlSecretSet":false}%

Note how éñüçßÆ gets corrupted.

How often does it reproduce? Is there a required condition?

Restart the process, it works for sometime and then corrupts itself.

What is the expected behavior? Why is that the expected behavior?

It should keep returning the current text.

What do you see instead?

Garbage text

Additional information

No response

@nikhilro
Copy link
Author

nikhilro commented Aug 24, 2024

Update: UTF+8 is fine. It's specifically on ASCII extended set.

I tried other versions:

"name": "aa" # works fine
"name": "дмитрий" # works fine
"name": "💩" # works fine
"name": "¿" # doesn't work
"name" "éñüçßÆ" doesn't work

Our best guess is ASCII extended set UTF-8 encoded vs normal ASCII extended set are getting mixed up and corrupted

@nikhilro
Copy link
Author

cc @ronag if you have any ideas.

found 3 commits related to encoding:

@avivkeller
Copy link
Member

avivkeller commented Aug 24, 2024

Hi! v22.7.0 has a few known buffer issues, so could you provide a minimal reproduction so the issue can be narrowed down?

Additionally, could you self-moderate your comment containing curse-words, as it may be offensive to some viewers?

Edit: Thanks!


Possibly a duplicate of: #54521

@ronag
Copy link
Member

ronag commented Aug 24, 2024

Can you check if this fixes it? #54526

@nikhilro
Copy link
Author

@redyetidev

  1. Ah gotcha, likely too hard to narrow down to minimal reproduction. If you wanna get on a call, I can show you a reproduction that happens within ~8 minutes. Happy to just test when the patch comes out though.
  2. Done

@ronag

  1. Similarly, happy to test when the patch comes out. I can setup the build if you really want.

Thanks for confirming both. I'll rollback to 22.6.0 for now.

@sharpner
Copy link

sharpner commented Aug 26, 2024

We also ran into this issue, took us forever to find the culprit

We reproduced it by having a simple express http handler deployed on amazon app runner:

  res.status(200).json({
    umlaute: 'äöü',
  });
};

Further Info:
node 22.6 seems not to be affected, but 22.7 is

@eugene1g
Copy link
Contributor

eugene1g commented Aug 26, 2024

I'm seeing this show up as failed PostgreSQL queries that contain an umlaut as a parameter, with a cryptic-looking error:

Unable to execute query: "invalid byte sequence for encoding "UTF8"

Further, I couldn't reproduce it on Apple silicon, but it fails reliably on Linux.

edit: the test case provided by @blexrob fails reliably in 22.7.0 on both Apple Silicon and Linux (22.6.0 works as expected)

let i = 0;
const testStr = "jürge";
const expected = Buffer.from(testStr).toString("hex");
for(; i < 1_000_000; i++) {
  const buf = Buffer.from(testStr);
  const ashex = buf.toString("hex");
  if (ashex !== expected) {
    console.log(`Decoding changed in iteration ${i} when changing to FastWriteStringUTF8, got ${ashex}, expected ${expected}`);
    break;
  }
}

if(i<1_000_000) {
  console.error("FAILED after %d iterations",i);
} else
  console.log("PASSED after %d iterations",i);

@avivkeller
Copy link
Member

avivkeller commented Aug 26, 2024

I'd like to remind everyone that "me too" comments only add noise to this already noisy topic. Please refrain from commenting until you have something to add to the conversation

Edit: this isn't directed at any comments. This is meant to deter future "me too" comments, as they occur often with issues like this.

@avivkeller avivkeller added confirmed-bug Issues with confirmed bugs. regression Issues related to regressions. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. buffer Issues and PRs related to the buffer subsystem. labels Aug 26, 2024
@avivkeller
Copy link
Member

I feel like one of the patches in v22.8.0 (#54560), when it lands, will resolve this issue. Once that lands, please post a comment whether it resolves this issue. Given it's current state, that could be a few days.

@aapoalas
Copy link

aapoalas commented Aug 27, 2024

I assume you have already tracked this down, but I believe the issue is basically:

  1. After enough calls to some string to buffer writing API, V8 optimizes the call to use the Fast API path.
  2. Node.js' FastWriteString (I'm only guessing this is the API in question, but it is likely the one) gets the call and incorrectly assumes that the v8::FastOneByteString it is given is ASCII: This is not the case, OneByteString is Latin-1 encoded.
  3. The function then directly copies the data into the destination buffer here. The buffer is now assumed to contain the string's data as UTF-8, but instead contains the data as Latin-1.

@avivkeller
Copy link
Member

#54565 will fix this issue for the v22.8.0 release. Then, #54526 (and similar) will be evaluated for a future release.

When #54565 lands, I'll close this issue

@devronhansen
Copy link

This kept us up a couple of nights. Thank you for fixing it!

klechr added a commit to navikt/lydia-radgiver-frontend that referenced this issue Aug 28, 2024
Da chainguard kun har latest node, og som en følge av denne buggen: nodejs/node#54543 vurderer vi det dithen at vi ikke lenger vil være på chainguard.

Co-authored-by: Thomas Dufourd <thomas.dufourd@nav.no>
klechr added a commit to navikt/lydia-radgiver-frontend that referenced this issue Aug 28, 2024
Da chainguard kun har latest node, og som en følge av denne buggen: nodejs/node#54543 vurderer vi det dithen at vi ikke lenger vil være på chainguard.

Co-authored-by: Thomas Dufourd <thomas.dufourd@nav.no>
@adriano-tirloni
Copy link

I am not familiar with the internals of node, but I just lost a few days because of this.
I could not replicate efficiently because it takes a while to happen.

What gave it away was that the same request lifecycle returned intact UTF-8 string to the browser and corrupted UTF-8 to the logger service, which spins up a new worker for log transport.

@avivkeller
Copy link
Member

When #54565 lands, I'll close this issue

This PR has landed. Expect the release to follow shortly:

@SimonX200
Copy link

I hope that the test coverage gets improved with the fix in 22.8

@nicholas-long
Copy link

if this broke your data when saving to mongo, here is a script to help fix it: https://github.com/nicholas-long/mongo-node-fix-54543

tyteen4a03 added a commit to tyteen4a03/payload that referenced this issue Nov 26, 2024
An [important UTF-8 bug](nodejs/node#54543) was discovered in v22.7 and fixed in v22.8. We should only allow v22.8 to avoid this issue for end users.

Also updates Node version used by various CI tooling to be compliant with the new setting
tyteen4a03 added a commit to tyteen4a03/payload that referenced this issue Dec 2, 2024
An [important UTF-8 bug](nodejs/node#54543) was discovered in v22.7 and fixed in v22.8. We should only allow v22.8 to avoid this issue for end users.

Also updates Node version used by various CI tooling to be compliant with the new setting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. regression Issues related to regressions. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch.
Projects
None yet
Development

No branches or pull requests