Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DeflateStream behavior for partial results #53502

Closed
shiranshalom opened this issue May 31, 2021 · 7 comments · Fixed by #53644
Closed

DeflateStream behavior for partial results #53502

shiranshalom opened this issue May 31, 2021 · 7 comments · Fixed by #53644

Comments

@shiranshalom
Copy link

I'm trying to use DeflateStream (actually GZipStream but the issue is with the DeflateStream code) to reduce the amount of data that I'm sending over the wire.

What I might be doing differently from the standard approach is that I'm using a single long running stream and expecting to read partial results from the stream.

What I observed is that when I'm using Read(), the operation hangs and when I'm using ReadAsync(), I'm able to process partial responses properly.

For reference, I have a reproduction of the problem below. This simulates a stream that has some compressed data, but not all of it.
When I'm calling Read() on the GZipStream, the underlying deflate will only return on one of the following conditions:

  • The buffer that I passed is full and no more is able to be written to it.
  • The source stream is completed and no more data is available.

On the other hand, in the case of the ReadAsync(), if we have data available from in the user's buffer, we'll return immediately to the user with whatever we have at hand.

This is the part that is responsible for this:

https://github.com/dotnet/runtime/blob/v5.0.6/src/libraries/System.IO.Compression/src/System/IO/Compression/DeflateZLib/DeflateStream.cs#L482-L486

In the Read() implementation, however, there is no such check.
My guess is that it should go here:
https://github.com/dotnet/runtime/blob/v5.0.6/src/libraries/System.IO.Compression/src/System/IO/Compression/DeflateZLib/DeflateStream.cs#L272

There is another issue here, which that on 6.0, it seems that the ReadAsync() implementation has been made identical to the Read() implementation and in both cases they would hang on partial responses.

The scenario is very important for wrapping a network connection with a GZipStream and reading potentially small values from it.

As mentioned, here is the reproduction for this issue.

@ayende

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

namespace TestGZipReadWrite
{
    public static class Program
    {
        static void Main(string[] args)
        {
            var gzip = new GZipStream(new Base64Stream(), CompressionMode.Decompress);
            var buffer = new byte[1024 * 16];
            // Hangs:
            //var r = gzip.Read(buffer, 0, buffer.Length);
            // Works
            var r = gzip.ReadAsync(buffer, 0, buffer.Length).Result;
            Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, r));
        }
        public class Base64Stream : Stream
        {
            public string[] Data = new[]
            {
                "H4sIAAAAAAAACg==",
                "qlYKqSxIVbJSCkotyMlMTizJzM/zzMssyUzMCUotLE0tLlGqBQAAAP//",
                "qoapcU8t8UksLnEtSUxX0lEKzi8tSk51SSxJTEosTvVMASowSjNITE6xSNVNMjZM0TVJNTbQtTA3TtG1NExJTjGzMDFPSzbG0OmXmAsyPBEuEVqUA+RnlJQUWOnrGxqZ6xkAoaGVhYGFAVxNCNAJVkqOcL5vYnJGZh7MLP8gVz/dAGelWgAAAAD//w=="
            };
            public int Index;
            public override bool CanRead => true;
            public override bool CanSeek => throw new NotImplementedException();
            public override bool CanWrite => throw new NotImplementedException();
            public override long Length => throw new NotImplementedException();
            public override long Position { get => throw new NotImplementedException(); set => throw new NotImplementedException(); }
            public override void Flush()
            {
                throw new NotImplementedException();
            }
            public override int Read(byte[] buffer, int offset, int count)
            {
                if (Index >= Data.Length)
                {
                    Console.WriteLine("No more data is available...");
                    Console.ReadLine(); // simulate a hang
                }
                var b = Convert.FromBase64String(Data[Index++]);
                Buffer.BlockCopy(b, 0, buffer, offset, b.Length);
                return b.Length;
            }
            public override long Seek(long offset, SeekOrigin origin)
            {
                throw new NotImplementedException();
            }
            public override void SetLength(long value)
            {
                throw new NotImplementedException();
            }
            public override void Write(byte[] buffer, int offset, int count)
            {
                throw new NotImplementedException();
            }
        }
    }
}
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label May 31, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost
Copy link

ghost commented May 31, 2021

Tagging subscribers to this area: @carlossanlop
See info in area-owners.md if you want to be subscribed.

Issue Details

I'm trying to use DeflateStream (actually GZipStream but the issue is with the DeflateStream code) to reduce the amount of data that I'm sending over the wire.

What I might be doing differently from the standard approach is that I'm using a single long running stream and expecting to read partial results from the stream.

What I observed is that when I'm using Read(), the operation hangs and when I'm using ReadAsync(), I'm able to process partial responses properly.

For reference, I have a reproduction of the problem below. This simulates a stream that has some compressed data, but not all of it.
When I'm calling Read() on the GZipStream, the underlying deflate will only return on one of the following conditions:

  • The buffer that I passed is full and no more is able to be written to it.
  • The source stream is completed and no more data is available.

On the other hand, in the case of the ReadAsync(), if we have data available from in the user's buffer, we'll return immediately to the user with whatever we have at hand.

This is the part that is responsible for this:

https://github.com/dotnet/runtime/blob/v5.0.6/src/libraries/System.IO.Compression/src/System/IO/Compression/DeflateZLib/DeflateStream.cs#L482-L486

In the Read() implementation, however, there is no such check.
My guess is that it should go here:
https://github.com/dotnet/runtime/blob/v5.0.6/src/libraries/System.IO.Compression/src/System/IO/Compression/DeflateZLib/DeflateStream.cs#L272

There is another issue here, which that on 6.0, it seems that the ReadAsync() implementation has been made identical to the Read() implementation and in both cases they would hang on partial responses.

The scenario is very important for wrapping a network connection with a GZipStream and reading potentially small values from it.

As mentioned, here is the reproduction for this issue.

@ayende

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

namespace TestGZipReadWrite
{
    public static class Program
    {
        static void Main(string[] args)
        {
            var gzip = new GZipStream(new Base64Stream(), CompressionMode.Decompress);
            var buffer = new byte[1024 * 16];
            // Hangs:
            //var r = gzip.Read(buffer, 0, buffer.Length);
            // Works
            var r = gzip.ReadAsync(buffer, 0, buffer.Length).Result;
            Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, r));
        }
        public class Base64Stream : Stream
        {
            public string[] Data = new[]
            {
                "H4sIAAAAAAAACg==",
                "qlYKqSxIVbJSCkotyMlMTizJzM/zzMssyUzMCUotLE0tLlGqBQAAAP//",
                "qoapcU8t8UksLnEtSUxX0lEKzi8tSk51SSxJTEosTvVMASowSjNITE6xSNVNMjZM0TVJNTbQtTA3TtG1NExJTjGzMDFPSzbG0OmXmAsyPBEuEVqUA+RnlJQUWOnrGxqZ6xkAoaGVhYGFAVxNCNAJVkqOcL5vYnJGZh7MLP8gVz/dAGelWgAAAAD//w=="
            };
            public int Index;
            public override bool CanRead => true;
            public override bool CanSeek => throw new NotImplementedException();
            public override bool CanWrite => throw new NotImplementedException();
            public override long Length => throw new NotImplementedException();
            public override long Position { get => throw new NotImplementedException(); set => throw new NotImplementedException(); }
            public override void Flush()
            {
                throw new NotImplementedException();
            }
            public override int Read(byte[] buffer, int offset, int count)
            {
                if (Index >= Data.Length)
                {
                    Console.WriteLine("No more data is available...");
                    Console.ReadLine(); // simulate a hang
                }
                var b = Convert.FromBase64String(Data[Index++]);
                Buffer.BlockCopy(b, 0, buffer, offset, b.Length);
                return b.Length;
            }
            public override long Seek(long offset, SeekOrigin origin)
            {
                throw new NotImplementedException();
            }
            public override void SetLength(long value)
            {
                throw new NotImplementedException();
            }
            public override void Write(byte[] buffer, int offset, int count)
            {
                throw new NotImplementedException();
            }
        }
    }
}
Author: shiranshalom
Assignees: -
Labels:

area-System.IO.Compression, untriaged

Milestone: -

@stephentoub
Copy link
Member

Can you share the code that produced the data you're trying to decompress?

@ayende
Copy link
Contributor

ayende commented Jun 1, 2021

Here is it with clean data, in the code above, replace the Data with:

            public string[] Data = new[]
            {
"H4sIAAAAAAAACg==",
"yigpKSi20tdPzyzJKE3SS87P1U/JL8lLLdEvKs0rycxN1c8sLi5NLeblAgAAAP//",
"yiCoMiWzOLm0uDgzPw+oHAAAAP//",
"gilPzdMrz8zOLEhNyUzUyy9K1wfx9H0z8zJzS3PjM/MyMpMyS/KLKuOT8/OSU/NKihJLgAbwcgEAAAD//w==",
"AwA1BEw0mQAAAA=="
            };

And the code to generate this is using:

   using var gzip = new GZipStream(base64Console, CompressionLevel.Optimal);
   using var writer = new StreamWriter(gzip);
   {
       writer.WriteLine("https://github.com/dotnet/runtime/issues");
       writer.Flush();
       writer.WriteLine("https://github.com/dotnet/runtime/discussions");
       writer.Flush(); 
       writer.WriteLine("https://en.wikipedia.org/wiki/Minimum_inhibitory_concentration");
       writer.Flush();
   }

You can observe the same behavior difference between Read() and ReadAsync()

@stephentoub stephentoub self-assigned this Jun 2, 2021
@stephentoub stephentoub added this to the 6.0.0 milestone Jun 2, 2021
@stephentoub
Copy link
Member

It appears DeflateStream.Read has always tried to fill the destination buffer. I agree that's un-Stream-like. I don't know why it was done that way, but it's that way in .NET Framework as well. ReadAsync was just a wrapper for Read until a few versions ago, when it was written to actually be async; that implementation had a few bugs, which have been fixed in .NET 6 by simplifying it and having it mirror the Read implementation, since they should behave identically, albeit with one doing sync I/O and the other doing async I/O. But, as you point out, that's now imbued ReadAsync with the same "fill the destination buffer" behavior that Read had and that ReadAsync also had for most of its existence (but didn't in .NET 5).

It's worth looking at changing both implementations in .NET 6.

cc: @geoffkizer

@stephentoub stephentoub added bug and removed untriaged New issue has not been triaged by the area owner labels Jun 2, 2021
@ayende
Copy link
Contributor

ayende commented Jun 2, 2021

I can submit a PR for that, if this is OK

@stephentoub
Copy link
Member

Thanks. I have a fix (also addressing a long-standing issue around 0-byte reads), but I want to do some perf validation to understand what the impact will be; it was almost certainly written like this initially to reduce various costs.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 2, 2021
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 11, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jul 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants