Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode emoji converted to broken chars in certain extreme long string responses #288

Closed
2 tasks done
stanleyxu2005 opened this issue Mar 28, 2024 · 24 comments · Fixed by #290
Closed
2 tasks done
Labels

Comments

@stanleyxu2005
Copy link
Contributor

stanleyxu2005 commented Mar 28, 2024

Prerequisites

  • I have written a descriptive issue title
  • I have searched existing issues to ensure the bug has not already been reported

Fastify version

4.26.2

Plugin version

7.0.2

Node.js version

18.x

Operating system

Windows

Description

Download a 200kb yaml plain text file with fastify/compress. Certain lines of the yaml contains illegal chars.

image

As you can see, on the left side (without compress), icons in the small green box are same.
But on the right side, one of the icon becomes 2 strange question marks.

The content encoding is br (I tried with gzip, same error)

image

Any thoughts, what can be the root cause?

Steps to Reproduce

const fastify = new Fastify({})
fastify.register(fastifyCompress)

fastify.get('/subscription', async (req, reply) => {
  const subscription = await handler.downloadClashSubscriptionAsync(config)

  // await require('fs').promises
  //   .writeFile('/out.txt', subscription, 'utf-8')  // <--- the file content is okay

  return reply.send(subscription)  // <--- the file content is corrupt
})

Expected Behavior

Dont expect the two question marks appear in the content.

@mcollina
Copy link
Member

Thanks for reporting!

Can you provide steps to reproduce? We often need a reproducible example, e.g. some code that allows someone else to recreate your problem by just copying and pasting it. If it involves more than a couple of different file, create a new repository on GitHub and add a link to that.

@mcollina
Copy link
Member

As some additional checks: can you verify that Node.js brotli compressor is not the source of the problem?

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 28, 2024

As some additional checks: can you verify that Node.js brotli compressor is not the source of the problem?

How can I verify on Node.js side? And it's more than br, I tested with gzip and deflate.

I am happy to verify. It's quite strange behavior.

@mcollina
Copy link
Member

Take a look at the zlib module.

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 28, 2024

Attached a minimal reproducible example.

npm install
node issue-288.js

Open browser and navigate to http://127.0.0.1:6789/issue
When you see the whole file, then search for hydra.alibaba.com

image

@mcollina
Copy link
Member

Screenshot 2024-03-28 at 20 04 37

renders perferctly here.

@mcollina
Copy link
Member

I think you might need to specify the charset in the http header.

const Fastify = require('fastify')
const fastifyCompress = require('@fastify/compress')

const fastify = new Fastify({
  logger: true,
})
fastify.register(fastifyCompress)

fastify.get('/issue', async (req, reply) => {
  const subscription = await require("fs")
    .promises //
    .readFile("./subscription.txt", "utf-8");

  // await require('fs').promises
  //   .writeFile('/out.txt', subscription, 'utf-8')  // <--- the file content is okay
  //

  reply.header('Content-Type', 'text/plain; charset=utf-8')

  return reply.send(subscription)  // <--- the file content is corrupt
})

fastify.listen({ port: 6789, host: '127.0.0.1' })

@stanleyxu2005
Copy link
Contributor Author

I think you might need to specify the charset in the http header.

const Fastify = require('fastify')
const fastifyCompress = require('@fastify/compress')

const fastify = new Fastify({
  logger: true,
})
fastify.register(fastifyCompress)

fastify.get('/issue', async (req, reply) => {
  const subscription = await require("fs")
    .promises //
    .readFile("./subscription.txt", "utf-8");

  // await require('fs').promises
  //   .writeFile('/out.txt', subscription, 'utf-8')  // <--- the file content is okay
  //

  reply.header('Content-Type', 'text/plain; charset=utf-8')

  return reply.send(subscription)  // <--- the file content is corrupt
})

fastify.listen({ port: 6789, host: '127.0.0.1' })

In my initial report, the response header already contains 'Content-Type', 'text/plain; charset=utf-8'. No need to add this.
The test result you see is not at the correct place. The broken icon only appears after hydra.alibaba.com.
(However, the mock text file I shared already contains the broken char, so the text is useless)

@mcollina mcollina reopened this Mar 28, 2024
@mcollina
Copy link
Member

Screenshot 2024-03-28 at 20 19 42

I can see the same broken icons in nVIM

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 28, 2024

Screenshot 2024-03-28 at 20 19 42 I can see the same broken icons in nVIM

Yes. But how can I explain..

My route is like proxy. It downloads a file from external and then sent to my local env.

The actual corrupt raw file is received from external . But I cannot just share , because it contains my private credential.

But if I make any changes to this received corrupt content, the broken question marks will be gone.

This time you are able to see the broken question marks, because it is already in subscription.txt.

I need some time to find a way to reproduce the issue. Happy Ester for now.

@jsumners
Copy link
Member

It is impossible for anyone to assist if you are unable to provide a known good input file that results in a bad output file after being compressed.

const data = fs.readFileSync('/tmp/known_good.txt')
const compressedData = zlib.brotliCompressSync(data)
fs.writeFileSync('/tmp/compressed.txt.brotli', compressedData)
$ brotli -d /tmp/compressed.txt.brotli

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 29, 2024

It is impossible for anyone to assist if you are unable to provide a known good input file that results in a bad output file after being compressed.

const data = fs.readFileSync('/tmp/known_good.txt')
const compressedData = zlib.brotliCompressSync(data)
fs.writeFileSync('/tmp/compressed.txt.brotli', compressedData)
$ brotli -d /tmp/compressed.txt.brotli

Img 1: No broken chars appear in compressed.txt in editor.

image

Img 2: See broken chars in browser (or curl), after calling return reply.send(compressed_txt_data).

image

@jsumners I have done the steps, brotli is fine, but final result is not good.

Let me describe:

Round 1

  1. Download the file "known_good.txt" from external
  2. Open "known_good.txt" in editor (e.g. Sublime, Intellij), content is good.
  3. Use return reply.send(known_good_data) will see broken chars.

Round 2

  1. Use @jsumners code snippet to generate compressed.txt.br and then decompress it.
  2. No file differences detected with diff known_good.txt compressed.txt.
  3. Use return reply.send(compressed_txt_data) will still see broken chars.

image

Round 3

  1. Instead of fs.promises.readFile(filePath, 'utf-8'), if I dont specify the utf-8, it will not show content in browser, but prompt to download as an attachment.
  2. The attachment is correct. Maybe it is sent as a file stream directly.

Current finding:

  1. Brotli compress does not corrupt the data
  2. Not sure if invisible bytes at the beginning of the text data has some impact? (The broken char is at position 62484, which is quite close to 65525)
  3. I'm suspect the handling of reply.send(...) e.g. hook needs further analysis (see new finding 1 and new finding 2 below)

@Uzlopak
Copy link
Contributor

Uzlopak commented Mar 29, 2024

Somehow seeing 62484 seems oddly familiar?! As if i saw it in a different context.

The char is at Position 62484 but it could be byte position 65525, as you mentioned. So very close to the default high watermark of streams?

It can be, that the Buffers are not concatted properly. So basically the following classical misimplementation: toString() is called on each chunk in the stream instead of buffering the chunks into an array of Buffers, run Buffer.concat and then call .toString() once on the result.

@Uzlopak
Copy link
Contributor

Uzlopak commented Mar 29, 2024

Can you provide your mvce as a github repo, please?

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 29, 2024

Here we go
https://github.com/stanleyxu2005/fastify-compress-issue-288

Now, its a bit more closer to the root cause.

I register routes with a route factory,
https://github.com/stanleyxu2005/fastify-compress-issue-288/blob/main/index.js#L10

If I simply define route as

fastify.get(path, async (request, reply) => {
  const longStringWithEmoji = readFile(...)
  return reply.send(longStringWithEmoji) // <-- works, Emoji are shown
})

If I define route like this, it fails

fastify.register((instance, opts, done) => {
  instance.get(path, async (request, reply) => {
    const longStringWithEmoji = readFile(...)
    return reply.send(longStringWithEmoji) // <-- NOT work, sone Emoji are invalid
  })
  done()
})

@climba03003
Copy link
Member

I am sorry to bother you, but the example you shown above do not reproduce on my computer.
I checked with fastify.inject and Firefox. All comes out expected result.

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 29, 2024

I am sorry to bother you, but the example you shown above do not reproduce on my computer. I checked with fastify.inject and Firefox. All comes out expected result.

Could you try toggle the line separator and encoding?
image

Sometimes when you checkout a project, Git would do some line separate convertion for you.

@climba03003
Copy link
Member

climba03003 commented Mar 29, 2024

image
So, it is expected to use LF? I use Windows and clone it. Git will auto use CRLF.

@stanleyxu2005
Copy link
Contributor Author

stanleyxu2005 commented Mar 29, 2024

image So, it is expected to use LF? I use Windows and clone it. Git will auto use CRLF.

@climba03003 @Uzlopak Could you pull my repo again.
I have added a index2.js. This time, the string will be generated on-the-fly.

npm run serve2

image

image

@mcollina
Copy link
Member

mcollina commented Mar 29, 2024

I can reproduce. Note that /good has no compression enabled in the example, so that's why it works.

@climba03003
Copy link
Member

climba03003 commented Mar 29, 2024

I suspect the bug is actually cause by into-stream, when using Readable.from directly. It works as expected.
Do we actually need into-stream?

@climba03003
Copy link
Member

For workaround,

  1. Bypass into-stream using Readable.from directly and specify the Content-Type header.
instance.get('/issue', async (req, reply) => {
  const longStringWithEmoji = await readFile("./test.txt", "utf-8");
  reply.header('content-type', 'text/plain; charset=utf-8')
  return reply.send(Readable.from(longStringWithEmoji, { encoding: 'utf-8'}))
})
  1. Use createReadStream directly
instance.get('/issue', async (req, reply) => {
  reply.header('content-type', 'text/plain; charset=utf-8')
  return reply.send(createReadStream('./test.txt', 'utf-8'))
})

@mcollina
Copy link
Member

Here is the fix: #289

@stanleyxu2005 stanleyxu2005 changed the title Brotli compression lead to some strange encoding issue Unicode emoji converted to broken char for certain extreme long string response Mar 29, 2024
@stanleyxu2005 stanleyxu2005 changed the title Unicode emoji converted to broken char for certain extreme long string response Unicode emoji converted to broken chars in certain extreme long string responses Mar 29, 2024
@stanleyxu2005
Copy link
Contributor Author

Please reopen the issue, until #291 is tested successfully. Thanks

@gurgunday gurgunday reopened this Mar 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants