Add size hint for byte iterator over file #81044

Closed
Xavientois wants to merge 6 commits

Xavientois
Contributor

I noticed that the Bytes iterator over the bytes of a File returned the default value when calling size_hint. Since files are one of the cases where the number of items to iterate over is fixed, I felt that it made sense to take advantage of the available information.

Now, the Bytes Iterator returned by calling bytes() on a File will provide more accurate bounds with its size_hint.
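For context, the "default value" mentioned above is `(0, None)`, which is what `Iterator::size_hint` returns when an implementation does not override it. A minimal sketch, using two hypothetical iterators (`Countdown` and `HintedCountdown` are illustrative names, not part of std or of this PR):

```rust
/// Hypothetical iterator that yields `remaining` down to 1 and does not
/// override `size_hint`, so it inherits the trait's default.
struct Countdown {
    remaining: u32,
}

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            return None;
        }
        let current = self.remaining;
        self.remaining -= 1;
        Some(current)
    }
}

/// The same iterator with an exact `size_hint`, for comparison.
struct HintedCountdown {
    remaining: u32,
}

impl Iterator for HintedCountdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            return None;
        }
        let current = self.remaining;
        self.remaining -= 1;
        Some(current)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // The number of remaining items is known exactly.
        let n = self.remaining as usize;
        (n, Some(n))
    }
}
```

Calling `size_hint()` on `Countdown { remaining: 3 }` gives the uninformative default, while the hinted variant reports both bounds as 3.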

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

rust-highfive added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) Jan 15, 2021
@rust-log-analyzer
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
    Checking miniz_oxide v0.4.0
    Checking object v0.22.0
    Checking hashbrown v0.9.0
    Checking addr2line v0.14.0
error[E0520]: `Item` specializes an item from a parent `impl`, but that item is not marked `default`
     |
2452 | / impl<R: Read> Iterator for Bytes<R> {
2453 | |     type Item = Result<u8>;
2454 | |
2455 | |     fn next(&mut self) -> Option<Result<u8>> {
2465 | |     }
2466 | | }
     | |_- parent `impl` is here
...
2469 |       type Item = Result<u8>;
     |       ^^^^^^^^^^^^^^^^^^^^^^^ cannot specialize default item `Item`
     |
     = note: to specialize, `Item` in the parent `impl` must be marked `default`

error[E0520]: `size_hint` specializes an item from a parent `impl`, but that item is not marked `default`
     |
2452 | / impl<R: Read> Iterator for Bytes<R> {
2453 | |     type Item = Result<u8>;
2454 | |
2455 | |     fn next(&mut self) -> Option<Result<u8>> {
2465 | |     }
2466 | | }
     | |_- parent `impl` is here
...
2471 | /     fn size_hint(&self) -> (usize, Option<usize>) {
2472 | |         match self.inner.metadata() {
2473 | |             Ok(metadata) => {
2474 | |                 let file_length = metadata.len() as usize;
2478 | |         }
2479 | |     }
     | |_____^ cannot specialize default item `size_hint`
     |
     = note: to specialize, `size_hint` in the parent `impl` must be marked `default`
error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0520`.
error: could not compile `std`

To learn more, run the command again with --verbose.
command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "check" "--target" "x86_64-unknown-linux-gnu" "-Zbinary-dep-depinfo" "-j" "16" "--release" "--color" "always" "--features" "panic-unwind backtrace compiler-builtins-c" "--manifest-path" "/checkout/library/test/Cargo.toml" "--message-format" "json-render-diagnostics"
failed to run: /checkout/obj/build/bootstrap/debug/bootstrap check
Build completed unsuccessfully in 0:01:45

@rust-log-analyzer
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
    Checking addr2line v0.14.0
error[E0308]: mismatched types
    --> library/std/src/io/mod.rs:2467:36
     |
2467 |     default fn size_hint(&self) -> (usize, Option<usize>) {}
     |                ---------           ^^^^^^^^^^^^^^^^^^^^^^ expected tuple, found `()`
     |                |
     |                implicitly returns `()` as its body has no tail or `return` expression
     |
     = note:  expected tuple `(usize, core::option::Option<usize>)`

error: aborting due to previous error

For more information about this error, try `rustc --explain E0308`.
error: could not compile `std`

To learn more, run the command again with --verbose.
command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "check" "--target" "x86_64-unknown-linux-gnu" "-Zbinary-dep-depinfo" "-j" "16" "--release" "--color" "always" "--features" "panic-unwind backtrace compiler-builtins-c" "--manifest-path" "/checkout/library/test/Cargo.toml" "--message-format" "json-render-diagnostics"
failed to run: /checkout/obj/build/bootstrap/debug/bootstrap check
Build completed unsuccessfully in 0:01:31

@rust-log-analyzer
Collaborator

The job x86_64-gnu-llvm-9 failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
   Compiling addr2line v0.14.0
error: implementation has missing stability attribute
    --> library/std/src/io/mod.rs:2472:1
     |
2472 | / impl Iterator for Bytes<fs::File> {
2473 | |     fn size_hint(&self) -> (usize, Option<usize>) {
2474 | |         match self.inner.metadata() {
2475 | |             Ok(metadata) => {
2481 | |     }
2482 | | }
     | |_^

@Xavientois
Contributor Author

@dtolnay Since this is not modifying the interface, just the implementation, what is the correct way to mark it with a stability attribute?

This is my first time contributing, and I am unsure based on what is said here.

@rust-log-analyzer
Collaborator

The job x86_64-gnu-llvm-9 failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
   Compiling hashbrown v0.9.0
   Compiling object v0.22.0
   Compiling miniz_oxide v0.4.0
   Compiling addr2line v0.14.0
error[E0547]: missing 'issue'
     |
2472 | #[unstable(feature = "file_size_hint", reason = "New implementation optimization")]

error: aborting due to previous error

error: could not compile `std`

@rust-log-analyzer
Collaborator

The job x86_64-gnu-llvm-9 failed! Check out the build log: (web) (plain)

Click to see the possible cause of the failure (guessed by this bot)
   Compiling hashbrown v0.9.0
   Compiling miniz_oxide v0.4.0
   Compiling object v0.22.0
   Compiling addr2line v0.14.0
error: malformed `unstable` attribute input
     |
2472 | #[unstable]
     | ^^^^^^^^^^^ help: must be of the form: `#[unstable(feature = "name", reason = "...", issue = "N")]`
error: aborting due to previous error

error: could not compile `std`

@the8472
Member

the8472 commented Jan 15, 2021

This suffers from TOCTOU (time-of-check to time-of-use) races, as files can be changed between any call to metadata and reads, something that generally can't happen to other iterators due to ownership. Additionally it is inefficient as it would perform a syscall every time size_hint is called; some iterator consumers call size_hint in at least some of their loop iterations.
It's also incorrect since metadata doesn't always return the right size for files, e.g. the files in /proc say they're zero-sized even though you can read from them.
On top of that it probably also is futile, as size_hint is used for optimizations and Bytes<File> is probably horrendously slow anyways since it has to perform a syscall for each byte read; it only makes sense to use that on very slow byte sources, e.g. pipes or character devices sourced from something really slow.
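As a rough illustration of the staleness concern, the sketch below records the length reported by `metadata()`, lets a second handle append to the file, and only then consumes the bytes. The function name `stale_length_demo` and the file contents are assumptions made for this example; the file is created and removed by the function itself.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// Returns (length reported by metadata before iteration, bytes actually read).
/// The file grows between the two steps, so the numbers disagree.
fn stale_length_demo(path: &Path) -> io::Result<(u64, usize)> {
    fs::write(path, b"hello")?;

    let reader = File::open(path)?;
    let reported = reader.metadata()?.len(); // the "check"

    // A second handle appends data before the bytes are consumed.
    let mut appender = OpenOptions::new().append(true).open(path)?;
    appender.write_all(b", world")?;
    drop(appender);

    let actual = reader.bytes().count(); // the "use"
    fs::remove_file(path)?;
    Ok((reported, actual))
}
```

An upper bound derived from `reported` would already be wrong by the time the iterator is drained.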

If you're trying to deal with performance issues of a Bytes<File> iterator then you probably want Bytes<BufReader<File>> instead.
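A minimal sketch of that suggestion: wrapping the reader in a `BufReader` before calling `bytes()` means most `next()` calls are served from memory instead of issuing one read syscall per byte. The helper `count_byte` is a hypothetical name used only for illustration.

```rust
use std::io::{self, BufReader, Read};

/// Counts occurrences of `needle`, reading byte by byte through a buffer.
fn count_byte<R: Read>(source: R, needle: u8) -> io::Result<usize> {
    let mut count = 0;
    for byte in BufReader::new(source).bytes() {
        if byte? == needle {
            count += 1;
        }
    }
    Ok(count)
}
```

With a file this would be called as `count_byte(File::open(path)?, b'\n')`, where `path` is whatever file is being scanned; any other `Read` implementation, such as a byte slice, works the same way.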

@Xavientois
Contributor Author

> If you're trying to deal with performance issues of a Bytes<File> iterator then you probably want Bytes<BufReader<File>> instead.

Do you think it would be worthwhile to implement it for Bytes<BufReader<File>> then, or would the TOCTOU issue still be a blocker? The documentation for size_hint says:

> It is not enforced that an iterator implementation yields the declared number of elements. A buggy iterator may yield less than the lower bound or more than the upper bound of elements.
>
> size_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the iterator, but must not be trusted to e.g., omit bounds checks in unsafe code. An incorrect implementation of size_hint() should not lead to memory safety violations.
>
> That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait's protocol.

I would argue that the estimation provided by the current implementation is still correct.

Also, regarding:

> On top of that it probably also is futile, as size_hint is used for optimizations and Bytes<File> is probably horrendously slow anyways since it has to perform a syscall for each byte read; it only makes sense to use that on very slow byte sources, e.g. pipes or character devices sourced from something really slow.

In my (admittedly limited) experience, size_hint is usually used ahead of iteration to reserve space for something. It is not a function that I would expect to be called repeatedly. For good measure, I could add a comment in the docs to mention that repeatedly calling it would be slow?

@the8472
Member

the8472 commented Jan 15, 2021

> > If you're trying to deal with performance issues of a Bytes<File> iterator then you probably want Bytes<BufReader<File>> instead.
>
> Do you think it would be worthwhile to implement it for Bytes<BufReader<File>> then, or would the TOCTOU issue still be a blocker?

For a BufReader<T> (it doesn't have to be a File) it could at least provide a lower bound that represents the currently buffered data. This would be a very rough estimate, but still a bit better than the current implementation. But I'm not sure whether that's worth it.
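A sketch of what such a lower bound could look like, assuming a hypothetical iterator type `BufferedBytes` (this is not the std `Bytes` type, and not necessarily how std would implement it). It relies on `BufReader::buffer()`, which returns the bytes currently held in the buffer without performing I/O.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical byte iterator over a `BufReader`; not the std `Bytes` type.
struct BufferedBytes<R: Read> {
    reader: BufReader<R>,
}

impl<R: Read> Iterator for BufferedBytes<R> {
    type Item = io::Result<u8>;

    fn next(&mut self) -> Option<io::Result<u8>> {
        let mut byte = [0u8; 1];
        loop {
            return match self.reader.read(&mut byte) {
                Ok(0) => None,
                Ok(_) => Some(Ok(byte[0])),
                // Retry reads that were interrupted by a signal.
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => Some(Err(e)),
            };
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // Bytes already buffered can be yielded without further I/O, so they
        // form a safe lower bound. No syscall is needed to compute this.
        (self.reader.buffer().len(), None)
    }
}

/// Returns the lower bound before and after explicitly filling the buffer.
fn lower_bounds(data: &[u8]) -> io::Result<(usize, usize)> {
    let mut bytes = BufferedBytes { reader: BufReader::new(data) };
    let before = bytes.size_hint().0;
    bytes.reader.fill_buf()?;
    let after = bytes.size_hint().0;
    Ok((before, after))
}
```

The hint is only as good as the buffer's current contents: a freshly created reader has nothing buffered, so the lower bound stays at zero until the first fill.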

> The documentation for size_hint says:
>
> > It is not enforced that an iterator implementation yields the declared number of elements. A buggy iterator may yield less than the lower bound or more than the upper bound of elements.
> > size_hint() is primarily intended to be used for optimizations such as reserving space for the elements of the iterator, but must not be trusted to e.g., omit bounds checks in unsafe code. An incorrect implementation of size_hint() should not lead to memory safety violations.
> > That said, the implementation should provide a correct estimation, because otherwise it would be a violation of the trait's protocol.
>
> I would argue that the estimation provided by the current implementation is still correct.

I think the standard library should do better than providing an implementation that qualifies as "buggy" and "violation of the trait's protocol" under its own definitions.

> Also, regarding:
>
> > On top of that it probably also is futile, as size_hint is used for optimizations and Bytes<File> is probably horrendously slow anyways since it has to perform a syscall for each byte read; it only makes sense to use that on very slow byte sources, e.g. pipes or character devices sourced from something really slow.
>
> In my (admittedly limited) experience, size_hint is usually used ahead of iteration to reserve space for something. It is not a function that I would expect to be called repeatedly. For good measure, I could add a comment in the docs to mention that repeatedly calling it would be slow?

A comment wouldn't help because generic code doesn't look at comments, it just consumes iterators and will apply the same logic to all iterators.

Here's an example where size_hint() is called inside a loop.

    while let Some(element) = iterator.next() {
        let len = self.len();
        if len == self.capacity() {
            let (lower, _) = iterator.size_hint();
            self.reserve(lower.saturating_add(1));
        }
        unsafe {
            ptr::write(self.as_mut_ptr().add(len), element);
            // NB can't overflow since we would have had to alloc the address space
            self.set_len(len + 1);
        }
    }
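One way to observe this from the outside is an iterator that counts how often its `size_hint` is queried while being collected. This is a sketch with a hypothetical `Probe` type; the exact number of calls depends on the standard library version and on `Vec`'s growth strategy, so the code only reports the count rather than assuming a value.

```rust
use std::cell::Cell;

/// Hypothetical iterator that counts how often `size_hint` is queried.
struct Probe {
    remaining: usize,
    hint_calls: Cell<usize>,
}

impl Iterator for Probe {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        if self.remaining == 0 {
            return None;
        }
        self.remaining -= 1;
        Some(0)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        // `size_hint` takes `&self`, so the counter needs interior mutability.
        self.hint_calls.set(self.hint_calls.get() + 1);
        (0, None) // deliberately uninformative, like the default
    }
}

/// Collects `n` items into a `Vec` and reports (items collected, size_hint calls).
fn count_hint_calls(n: usize) -> (usize, usize) {
    let mut probe = Probe { remaining: n, hint_calls: Cell::new(0) };
    // `by_ref` lets us inspect the counter after `collect` has finished.
    let collected: Vec<u8> = probe.by_ref().collect();
    (collected.len(), probe.hint_calls.get())
}
```

If each of those calls performed a `metadata()` syscall, the cost would be paid every time the vector runs out of capacity, not just once up front.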

        match self.inner.metadata() {
            Ok(metadata) => {
                let file_length = metadata.len() as usize;
                (0, Some(file_length))
Member

This implementation will never actually help, because in practice only the .0 of the size_hint() is ever used: https://internals.rust-lang.org/t/is-size-hint-1-ever-used/8187?u=scottmcm
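To make the point concrete, here is a sketch of the common reservation pattern, in which only the lower bound feeds into the allocation. `WithHint` and `reserve_from_hint` are hypothetical names introduced for this example; `WithHint` simply overrides the hint of an inner iterator so the two cases can be compared.

```rust
/// Hypothetical wrapper that reports a caller-chosen `size_hint`.
struct WithHint<I> {
    inner: I,
    hint: (usize, Option<usize>),
}

impl<I: Iterator> Iterator for WithHint<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.hint
    }
}

/// Reserves space using only the lower bound, as many consumers do, and
/// returns how much was requested up front together with the collected items.
fn reserve_from_hint<I: Iterator>(iter: I) -> (usize, Vec<I::Item>) {
    let (lower, _upper) = iter.size_hint(); // the upper bound is ignored
    let mut out = Vec::with_capacity(lower);
    out.extend(iter);
    (lower, out)
}
```

With a hint of `(0, Some(100))` nothing is reserved in advance, even though the upper bound is exact; a hint of `(100, Some(100))` reserves everything in one step.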

Contributor Author

That's a shame. Then this implementation would basically just add system call overhead while providing very little benefit :(

@scottmcm
Member

> For a BufReader<T> (it doesn't have to be a File) it could at least provide a lower bound that represents the currently buffered data. This would be a very rough estimate, but still a bit better than the current implementation. But I'm not sure whether that's worth it.

This one sounds potentially interesting. It'd at least help the first reallocation of a vector when collecting, and should be pretty cheap.

@Xavientois
Contributor Author

> For a BufReader<T> (it doesn't have to be a File) it could at least provide a lower bound that represents the currently buffered data. This would be a very rough estimate, but still a bit better than the current implementation. But I'm not sure whether that's worth it.

I agree that this is much better. Should I change the implementation here or open a new PR with the change (since it is different)?

@the8472
Member

the8472 commented Jan 15, 2021

I think a new PR makes sense so others don't have to read the unrelated discussion.

@Xavientois
Contributor Author

For this PR (and the new one I will make), would I mark the code as #[stable] or #[unstable(feature = "...", issue = "...", reason = "...")]?
If unstable, what would I put as reason or issue?

@the8472
Member

the8472 commented Jan 15, 2021

Trait implementations are insta-stable.

@Xavientois
Contributor Author

Thanks for the guidance!

Xavientois closed this Jan 15, 2021
Xavientois deleted the file-bytes-size-hint branch January 15, 2021 18:11
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 3, 2021
…ramertj

Improved IO Bytes Size Hint

After trying to implement better `size_hint()` return values for `File` in [this PR](rust-lang#81044) and changing to implementing it for `BufReader` in [this PR](rust-lang#81052), I have arrived at this implementation that provides tighter bounds for the `Bytes` iterator of various readers including `BufReader`, `Empty`, and `Chain`.

Unfortunately, for `BufReader`, the `size_hint` only improves after calling `fill_buf`, because it uses the contents of the buffer for the hint. Nevertheless, the tighter bounds should result in better pre-allocation of space to handle the contents of the `Bytes` iterator.

Closes rust-lang#81052
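
The idea of per-reader bounds can be sketched with a helper trait. Everything named here (`ReadHint`, `read_hint`, `chained_hint`) is hypothetical and written for illustration; it is not the mechanism used in the referenced commit, which this page does not show. The sketch covers an empty reader, a byte slice, and a chain of two readers whose bounds are added together.

```rust
use std::io::{self, Read};

/// Hypothetical helper trait; the names here are illustrative, not std API.
trait ReadHint {
    /// (lower, upper) bounds on the number of bytes still available.
    fn read_hint(&self) -> (usize, Option<usize>);
}

impl ReadHint for io::Empty {
    fn read_hint(&self) -> (usize, Option<usize>) {
        (0, Some(0)) // an empty reader never yields anything
    }
}

impl ReadHint for &[u8] {
    fn read_hint(&self) -> (usize, Option<usize>) {
        (self.len(), Some(self.len())) // a slice knows its exact length
    }
}

impl<A: ReadHint, B: ReadHint> ReadHint for io::Chain<A, B> {
    fn read_hint(&self) -> (usize, Option<usize>) {
        let (first, second) = self.get_ref();
        let (lo_a, hi_a) = first.read_hint();
        let (lo_b, hi_b) = second.read_hint();
        let lower = lo_a.saturating_add(lo_b);
        // The upper bound is known only if both parts know theirs
        // and the sum does not overflow.
        let upper = match (hi_a, hi_b) {
            (Some(a), Some(b)) => a.checked_add(b),
            _ => None,
        };
        (lower, upper)
    }
}

/// Chains two slices and reports the combined bounds.
fn chained_hint(a: &[u8], b: &[u8]) -> (usize, Option<usize>) {
    a.chain(b).read_hint()
}
```

A `Bytes`-like iterator could then forward its `size_hint` to the reader's `read_hint`, which costs no I/O for any of the three readers shown.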
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 4, 2021
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 5, 2021
JohnTitor added a commit to JohnTitor/rust that referenced this pull request Mar 5, 2021
m-ou-se added a commit to m-ou-se/rust that referenced this pull request Mar 5, 2021
m-ou-se added a commit to m-ou-se/rust that referenced this pull request Mar 5, 2021
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
6 participants