
Add guide: "Backpressuring in Streams" #1109

Merged (6 commits) on Apr 22, 2017

Conversation

@jalafel (Contributor) commented Jan 23, 2017

This was an issue opened on the nodejs/node repository before guides were transferred directly to nodejs/nodejs.org.

This PR contains a first go at a guide on backpressure, and it is missing some vital points that I do not feel equipped to answer. I have left some empty spaces for them! I would appreciate anyone's input on the content, structure, and any additions to the guide.

Some points to expand on would be the problems that streams pose which are solved by backpressure. I think I hit most of the points, but I don't think I'm versed enough to go into detail.

There is also space for an example app, but again, I am not familiar with the current best practices for building streams and implementing your own backpressure.

Thanks!

Refs: nodejs/node#10766

@lpinca (Member) commented Jan 23, 2017

cc: @nodejs/streams

@mcollina (Member) left a comment

The work is solid. I think the intro should be changed to explain what the problem is that streams solve.

data and the _consumer_.

In Node.js the source is a [`Readable`][] stream and the consumer is the
[`Writable`][] stream (both of these may be interchanged with a [`Duplex`][]
Member

or a Transform

help expand your understanding while reading this guide and for following along
with examples.

## The Problem with Streams
Member

There is not a problem with streams. There is a problem that streams solve, which is backpressure, also called flow control in other communities. There are other solutions to backpressure. Unix pipes and TCP sockets solve this as well, and they are the root of the following. Also, streams can be used for pull-based data handling.

The usual demo I do to show the problem is zipping a very large data file both in one go and with streams. While the first one crashes, the second one can work with data of any size (the original credit goes to @davidmarkclements for that example). That shows the problem.
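For readers following along, a minimal sketch of that demo might look like this (file names are placeholders; only the streaming variant keeps memory bounded):

```
const fs = require('fs');
const zlib = require('zlib');

// In one go: the whole file (and its compressed copy) must fit in memory,
// so a sufficiently large file crashes the process.
fs.readFile('REALLY_BIG_FILE.x', (err, data) => {
  if (err) throw err;
  zlib.gzip(data, (err, compressed) => {
    if (err) throw err;
    fs.writeFile('REALLY_BIG_FILE.x.gz', compressed, (err) => {
      if (err) throw err;
    });
  });
});

// With streams: data flows in small chunks and backpressure keeps memory bounded.
fs.createReadStream('REALLY_BIG_FILE.x')
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream('REALLY_BIG_FILE.x.gz'));
```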

Contributor Author

Thanks for this review @mcollina. So essentially there is no separation between streams & backpressure? As in ... data handling is tricky, but streams solve it with backpressure?

Member

As in "data handling is tricky, and streams is the solution that Node has adopted".

Contributor Author

Syntactically, I am still a bit confused. Wikipedia describes backpressure in IT as

the build-up of data behind an I/O switch if the buffers are full and incapable of receiving any more data; the transmitting device halts the sending of data packets until the buffers have been emptied and are once more capable of storing information.

In this instance .. is backpressure the problem or the I/O switch that streams implements?

Member

In this instance .. is backpressure the problem or the I/O switch that streams implements?

Backpressure itself is the problem, and the backpressure handling mechanisms in the streams classes are our solution to it. :)

+------------+ add chunk to queue |
| <--^-------------------<
+============+
```
Member

This diagram is not exactly clear. Why is everything going back to pipe()?

Contributor Author

Good question. I assumed every event emitter was being fed back into pipe, and then pipe would delegate what to do next. Is that incorrect?

Member

@jessicaquynh it is incorrect. pipe sets up some closures, which keep working, but the method is not invoked anymore. The diagram should represent what pipe sets up, but not show pipe at all: https://github.com/nodejs/node/blob/master/lib/_stream_readable.js#L473-L613


* If the write queue is busy, [`.write()`][] will return false.
* If the data chunk is too large, [`.write()`][] will return false (the limit
is indicated by the variable, [`highWaterMark`][]).
Member

If we are building a Writable directly, this is not needed, as it is handled by the stream state machine. However, it is needed when using a Writable directly; see my example in the Node docs: https://github.com/nodejs/node/blob/master/doc/api/stream.md#writablewritechunk-encoding-callback
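For illustration, a hedged sketch of writing to a Writable directly while respecting backpressure, along the lines of the stream docs example (the destination and data are placeholders):

```
const fs = require('fs');

const writable = fs.createWriteStream('dest.txt'); // placeholder destination
let i = 0;

function writeSome() {
  let ok = true;
  while (i < 1000000 && ok) {
    // write() returns false once the internal buffer passes highWaterMark
    ok = writable.write(`line ${i++}\n`);
  }
  if (i < 1000000) {
    // Back off and resume only after the buffer has drained
    writable.once('drain', writeSome);
  } else {
    writable.end();
  }
}

writeSome();
```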

@eljefedelrodeodeljefe (Contributor)

@jessicaquynh can you incorporate Matteo's comments?

@jalafel (Contributor Author) commented Feb 20, 2017

@eljefedelrodeodeljefe yes definitely! sorry I've been mia on this issue .. will submit changes tomorrow night!

@TimothyGu (Member) left a comment

Nice work, and I'm excited to read the currently unfinished sections.

One thing I noticed is the emphasis on writable streams. Writing readable streams that stop when asked to do so is just as important as teaching the writable stream to say stop when needed IMO.


Once the queue is finished, backpressure will allow data to be sent again.
The space in memory that was being used will free itself up and prepare for the
next glob of data.
Member

glob → blob / batch


# Backpressuring in Streams

The purpose of this guide will describe what backpressure is and the
Member

will → is to


This effectively allows a fixed amount of memory to be used at any given
time for a [`.pipe()`][] function. There will be no memory leakage, no
indefinite buffering, and the garbage collector will only have to deal with
Member

indefinite → infinite


## Build a WritableStream

Let's extend the prototypical function of [`.write()`][]:
Member

We are not extending .write() here, but extending stream.Writable with a custom ._write() method, which is then used internally by .write().
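A minimal sketch of that distinction, using a hypothetical MyWritable class:

```
const { Writable } = require('stream');

// Subclass stream.Writable and implement _write(); the public .write()
// method calls _write() internally and manages the backpressure state.
class MyWritable extends Writable {
  _write(chunk, encoding, callback) {
    process.stdout.write(chunk); // do something with the chunk
    callback();                  // pass an Error here to signal a failed write
  }
}

const ws = new MyWritable();
ws.write('hello ');
ws.end('world\n');
```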

value for what system is running the application. In instances where you might
want to raise that value, go for it, but do so with caution!

## Build a WritableStream
Member

"Writable Stream" (note the space) for consistency

var fs = require('fs');

var inputFile = fs.createReadStream('REALLY_BIG_FILE.x');
var outputFile = fs.createWriteStream('REALLY_BIG_FILE_DEST.x');
Member

const

var inputFile = fs.createReadStream('REALLY_BIG_FILE.x');
var outputFile = fs.createWriteStream('REALLY_BIG_FILE_DEST.x');

// Secretly the stream is saying: "whoa, whoa! hang on, this is way too much!"
Member

It's not immediately clear that writing to the disk is slower than reading. Either mention it explicitly ("on a computer where writing is slower than reading"), or use a slower writable stream (http? zlib?)

Member

Writing to the disk is slower than reading from it. It might not be clear for a reader though, so I'm 👍 in adding a zlib compression before writing down.

one area in memory!

So, if backpressure is so important, why have you (probably) not heard of it?
Well the answer is simple: Node.js does all of this automatically for you.
Member

Make it be known somewhere that only .pipe() does this automatically (readable.on('data', buf => writable.write(buf)); doesn't use backpressure, for example). I know in the next section you are looking more closely at .pipe(), but it seems to be assumed that the reader knows all this is talking about .pipe().
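The contrast described above, sketched out (stream names are placeholders; the anti-pattern is left commented out):

```
const fs = require('fs');

const readable = fs.createReadStream('source.txt');       // placeholder source
const writable = fs.createWriteStream('destination.txt'); // placeholder destination

// No backpressure: the return value of write() is ignored, so chunks pile up
// in the Writable's internal buffer whenever the destination is slower.
// readable.on('data', (buf) => {
//   writable.write(buf);
// });

// With backpressure: .pipe() pauses the Readable when write() returns false
// and resumes it once 'drain' is emitted.
readable.pipe(writable);
```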

an isolate application, both [`Readable`][] and [`Writable`][] streams
should be present. If you are writing an application meant to accept a
[`Readable`][] stream, or pipes to a [`Writable`][] stream from another app,
you may omit this detail.
Member

Not sure what "application" / "app" means here. A standalone program? A library? A function?

## Example App

Since [Node.js v0.10][], the [`Streams`][] class has offered the ability to
overwrite the functionality of [`.read()`][] or [`.write()`][] by using the
Member

I wouldn't say "overwrite", but rather "modify the behavior"

@mcollina (Member)

One thing I noticed is the emphasis on writable streams. Writing readable streams that stop when asked to do so is just as important as teaching the writable stream to say stop when needed IMO.

As we have a backpressure (or flow-control) mechanism in place, the maximum amount of data that can flow is controlled by downstream, hence the focus on Writable is correct. However this is not explicit. We should also note that calling readable.push whenever data is ready is as bad as readable.on('data', data => writable.write(data)).

@jalafel (Contributor Author) commented Feb 26, 2017

Sorry it took so long to update the guide. Hope I interpreted the change requests correctly. Ultimately, I am still not 100% sure what constitutes best practice for building streams.

Would it be fair to say that best practice is also non-contradictory practice? As in, "you have a good stream if you did not do A, B, C"?

@mcollina (Member)

Sorry it took so long to update the guide. Hope I interpreted the change requests correctly. Ultimately, I am still not 100% sure what constitutes best practice for building streams.

I am almost always dropping 50-70% of the "best practices" for Streams. There are certain rules to follow; all the rest can be overruled. However, I would not recommend that anyone try writing something like https://github.com/mcollina/syncthrough without a great deal of experience in the internals.

Would it be fair to say that best practice is also non-contradictory practice? As in, "you have a good stream if you did not do A, B, C"?

Yes! The golden rule of streams is to always respect backpressure: A) never push() if you are not asked, and B) never call write() after it returns false but wait for 'drain' instead. The latter holds more power for me and helps me reason about things better.

The third rule: C) streams change between different Node versions and the libraries you use. Be careful and test things.

@mcollina (Member) left a comment

Wow, this is great work. We are getting close, thanks so much for the effort in this. There is one thing that is a must to fix, and the rest is bonus content.

+------------+ add chunk to queue |
| <---^---------------------<
+============+
```
Member

This diagram is making good progress! But I think you should split pipe() into its own step of the diagram, and not make it part of the "loop". The event handlers are, so we should show the distinction.

The syntax for events, e.g. drain(), is not common throughout the docs, so we should probably use on('drain', cb) instead.

also cc @lrlna who did some visualization on this in the past on paper.

Contributor Author

Wondering if by this you meant something like:

                                                     +===================+
                         +-->  Piping functions  +--->   src.pipe(dest)  |
                         x     are set up during     |===================|
                         x     the .pipe method.     |  Event callbacks  |
+===============+        x                           |-------------------|
|   Your Data   |        x     They exist outside    | .on('close', cb)  |
+=======+=======+        x     the data flow, but    | .on('data', cb)   |
        |                x     importantly attach    | .on('drain', cb)  |
        |                x     events, and their     | .on('unpipe', cb) |
+-------v-----------+    x     respective callbacks. | .on('error', cb)  |
|  Readable Stream  +----+                           | .on('finish', cb) |
+-^-------^-------^-+    |                           | .on('end', cb)    |
  ^       |       ^      |                           +-------------------+
  |       |       |      |                           
  |       ^       |      |                                              
  ^       ^       ^      |    +-------------------+         +=================+
  ^       |       ^      +---->  Writable Stream  +--------->  .write(chunk)  |
  |       |       |           +-------------------+         +=======+=========+

Or even have more explicit steps come in?

Member

Yes, something like that is way better.


hi! this is so so good, i love it!

in terms of the diagram, i find the loop a little weird to get around and understand, although i definitely see where you're coming from.

What about making it a bit more linear? I am not good with replicating the fancy thing you have going, but something like this on paper?

[image: stream-backpressure sketch]

an isolate application, both [`Readable`][] and [`Writable`][] streams
should be present. If you are writing an application meant to accept a
[`Readable`][] stream, or pipes to a [`Writable`][] stream from another
component, you may omit this detail.
Member

We should add a paragraph or a section here to discuss streams pipelines. What happens when a transform is in place?

@jalafel (Contributor Author) commented Feb 28, 2017

Heh, this might be a rhetorical question, but I will answer just in case ... when Transform is in the pipeline, it takes in the Readable output, does the ._transform(), and pipes that into the destination Writable? Do you mean to explicate this process in the guide?

Edit: Jw, do Transform Streams or Duplex Streams have special conditions to respect backpressure as well?

Member

A Transform has both an outgoing and an incoming highWaterMark. Most people write a Transform to set up pipelines, so it's important to explain how it all plays out when you chain multiple streams.

Contributor Author

So, these are the true facts to say about a Transform stream:

It is important because it's often used to set up pipelines. In it, there are _write() and _read() calls that must be implemented with the same care as with Readable and Writable (with respect to the return values of .push() and .write()). And also the HWM may change along the pipeline.

Member

It is not correct. Transform implementors should add _transform, as it is a Duplex that implements both _read and _write, i.e. Transform has those methods implemented.
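A hedged sketch of how that plays out when chaining a custom Transform into a pipeline (file names and the transform itself are placeholders):

```
const { Transform } = require('stream');
const fs = require('fs');

// Only _transform() is supplied; the Duplex machinery underneath already
// implements _read() and _write(), and each side has its own highWaterMark.
const upperCase = new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  }
});

fs.createReadStream('source.txt')          // placeholder file names
  .pipe(upperCase)
  .pipe(fs.createWriteStream('dest.txt'));
```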


}
}
```
Member

We need an equivalent section on 'Write a Readable Stream'.

We should mention that [readable-stream](https://github.com/nodejs/readable-stream) is the best way to build a stream application across multiple versions of Node.js, and the browser too! This will make your application have a stable base, and an implementation that changes with its own lifecycle compared to node core.
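For illustration, the switch amounts to changing the require() target (a sketch, assuming the readable-stream package is installed from npm):

```
// npm install readable-stream
const { Readable, Writable, Transform } = require('readable-stream');

// Streams are built exactly as with require('stream'), but the implementation
// is now pinned by package.json instead of by the Node.js version in use.
```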

Member

As a general question, is the primarily Node.js-based Readable stream still the best option in the browser? In my experience, the most common cause of using Node.js Readable in the browser is as a shim when creating a Browserify bundle. At the same time, readable-stream IIRC depends on additional polyfills like one for the events module. Plus, I know WHATWG has been working on their version of streams.

Regarding using the userland module in Node.js, definitely. You may also want to link to @rvagg's blog post

Member

readable-stream is the way to go in the browser. readable-stream exports Readable & co, so all that is written here is valid there as well.

By using streams in the browser, I mean mostly to leverage code already written elsewhere for the purpose, i.e. the streams ecosystem.

const inp = fs.createReadStream('The.Matrix.1080p.mkv');
const out = fs.createWriteStream('The.Matrix.1080p.mkv.gz');

inp.pipe(gzip).pipe(out);
Member

we should add a note here and refer to pump. It's a must-have, as this example does not handle errors correctly.

Contributor Author

Does the vital difference between chaining pipes together vs. pump have to do with error handling? As in, chaining pipes together can be erroneous because there is no error handling? Heh, sorry for my ignorance!

Member

Yes. The problem is that in case of errors the underlying file descriptor on the other side might not get closed.
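A hedged sketch of the pump suggestion, reusing the file names from the example above (assuming the pump package is installed from npm):

```
const fs = require('fs');
const zlib = require('zlib');
const pump = require('pump'); // npm install pump

// pump() forwards errors and destroys every stream in the chain when one
// fails, so no file descriptor is left open on the other side.
pump(
  fs.createReadStream('The.Matrix.1080p.mkv'),
  zlib.createGzip(),
  fs.createWriteStream('The.Matrix.1080p.mkv.gz'),
  (err) => {
    if (err) {
      console.error('Pipeline failed', err);
    } else {
      console.log('Pipeline succeeded');
    }
  }
);
```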

@TimothyGu (Member) left a comment

Great to see how this is coming along. Apologies for the number of nits, just want to make sure those don't get forgotten when this is applied.

```

While that will take a few minutes to compress, in another shell we may run
a script that takes Node.js' module [`Zlib`][], that wraps around another compression tool, [`gzip(1)`][].
Member

"zlib" is how the documentation refers to itself

inp.pipe(gzip).pipe(out);
```

While the first `zip` function will ultimately fail, Node will be able to handle
Member

Not sure what the "zip function" is referring to. The command run in the shell?

Contributor Author

Yes! I was referencing the example mcollina was describing. I hope I did it right, should this be more explicit?

Member

The "function" was throwing me off a bit. And also why would it fail?

Contributor Author

I am not entirely sure. My layman's understanding of its failure is that it happens because of the lack of a backpressure mechanism. There is a lot of data being compressed and the .zip file gets corrupted. But if @mcollina might be able to shed a technical light on this, it'd be much appreciated!

Member

The reason why I asked is that I'm fairly certain zip will not encounter this problem. Unlike Node.js, zip and other UNIX utilities operate synchronously. What that means is that they only read() as much as they need at one time, and immediately write() it out afterwards. Since these two system calls are synchronous, there would not be any backpressure problems since by definition they only appear in an asynchronous system.

Contributor Author

Interesting. So to be clear, synchronous systems do not encounter backpressure buildup like async does?

I also did test out the zip tool and it does, in fact, corrupt the large file, though I am not sure why, if those are the mechanics behind data flow. (Though gzip on my UNIX command line does not corrupt the large file.)

Any thoughts? My only guess could be because the filetype I am using is video!

* A very overworked garbage collector
* Memory exhaustion

## Overall Dip in System Performance
Member

Remove this heading maybe. The following paragraph seems to directly follow the list above.

`1.52 gb`.

Without streams in place to delegate the backpressure, there is a memory space
with an entire degree of magnitude greater that is being allocated. That is a
Member

I think "order of magnitude" is the more common expression

Contributor Author

Tbh I went back to edit this many times and couldn't figure out why it sounded so off .. hehe thanks!

```

The maximum byte size occupied by virtual memory turns out to be approximately
`1.52 gb`.
Member

Ditto


## Example App

Since [Node.js v0.10][], the [`Streams`][] class has offered the ability to
Member

Stream class



colleagues and friends.

Be sure to read up more on [`Streams`][] for other [`EventEmitters`][] to help
improve your knowledge when building applications with Node.js.
Member

Not sure what "for other EventEmitters" is trying to say...

// stream is ready or not.
readable.on('data', data =>
writable.write(data);
);
Member

Might want to add braces around writable.write(data) just to make it clear that we don't care about the return value of write().

is full (which will vary across different machines). Node.js allows you to set
your own custom [`highWaterMark`][], but commonly, the default is the optimal
value for what system is running the application. In instances where you might
want to raise that value, go for it, but do so with caution!
Member

This section only deals with rules for writing to a Writable stream from outside the stream. We also need rules for writing to a Readable stream from inside that Readable stream (by respecting the return value of this.push()).
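A sketch of that rule from inside a hypothetical Readable implementation:

```
const { Readable } = require('stream');

class Counter extends Readable {
  constructor(options) {
    super(options);
    this._n = 0;
  }

  _read() {
    // Keep pushing only while downstream has room: push() returning false
    // means "stop until _read() is called again".
    while (this._n < 1000) {
      if (!this.push(`${this._n++}\n`)) {
        return;
      }
    }
    this.push(null); // signal end of stream
  }
}

new Counter().pipe(process.stdout);
```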

@jalafel (Contributor Author) commented Apr 7, 2017

Sorry for the static on this! I took a look back on this, realized some grammar was off and fixed some formatting.

Added in a few pointers mentioned earlier in the thread, though they didn't necessarily close off all the reviews.

Skimmed out the example application because I felt that was redundant, since there is a rich supply of examples found in the documentation, and I didn't want to reinvent the wheel.

I only have one question and that's in regards to the implementation of a Writable stream. Are there any mistakes one can make in _write() or _writev() that would damage backpressure? They seem like pretty foolproof functions to me.

Should the guide extend its scope and talk about creating a custom writable stream from scratch then? As in:

MyWritable.prototype.write = function(chunk, enc, cb) {
  // code that respects return value here
}

util.inherits(MyWritable, Stream);

Thanks for all the help! Hope to land this soon!

@mcollina (Member) commented Apr 7, 2017

@jessicaquynh _writev goes in combination with .cork(), and it is especially useful if we are writing to a stream directly. The biggest mistake is this pattern:

ws.cork()
ws.write('hello ')
ws.write('world ')
ws.uncork()

ws.cork()
ws.write('from ')
ws.write('Matteo')
ws.uncork()

This will cause two transfers into the C++ realm, making the use of cork() a bit useless. The correct pattern is:

ws.cork()
ws.write('hello ')
ws.write('world ')
process.nextTick(doUncork, ws)

ws.cork()
ws.write('from ')
ws.write('Matteo')
process.nextTick(doUncork, ws)

// as a global function
function doUncork (stream) {
  stream.uncork()
}

cork() can be called as many times as we want; we just need to be careful to call uncork() the same number of times to make it flow again.

@jalafel (Contributor Author) commented Apr 7, 2017

@mcollina Thank you for your input and help! I've added the changes to the writable section!

@mcollina (Member) left a comment

LGTM with a nit

is full (which will vary across different machines). Node.js allows you to set
your own custom [`highWaterMark`][], but commonly, the default is the optimal
value for what system is running the application. In instances where you might
want to raise that value, go for it, but do so with caution!
Member

I think you should write the defaults here; it usually helps people.
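For reference, the defaults being asked for are 16384 bytes (16 KB) for stream.Readable and stream.Writable in byte mode, and 16 objects in objectMode (fs.createReadStream uses a larger 64 KB default). A sketch of overriding the value for a single stream:

```
const { Writable } = require('stream');

// Raise the threshold at which write() starts returning false from the
// default 16 KB to 1 MB for this particular stream.
const ws = new Writable({
  highWaterMark: 1024 * 1024,
  write(chunk, encoding, callback) {
    callback();
  }
});
```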

@mcollina (Member)

LGTM

@fhemberger fhemberger merged commit 8491693 into nodejs:master Apr 22, 2017
@fhemberger (Contributor)

Great, thank you!
