Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reader<R: BufRead> unnecessarily panics on 32-bit targets when the document size exceeds 4 GiB #751

Closed
tniessen opened this issue Jun 2, 2024 · 15 comments · Fixed by #769

Comments

@tniessen
Copy link

tniessen commented Jun 2, 2024

quick_xml::reader::Reader<R: BufRead> allows reading arbitrarily large files without allocating much memory. I am using this in combination with flate2::read::GzDecoder in order to decompress and simultaneously parse very large XML files (several gigabytes per document) without having to store the decompressed file (either in memory or on disk).

Unfortunately, because offsets within the document are computed as usize internally, this causes quick-xml to panic when the document size exceeds 4 GiB on 32-bit targets.

Notably, buffer_position() returns usize, which is not necessarily sufficient for representing the correct position if the input document size exceeds 4 GiB.

Rust's built-in std::io types typically use u64 for file sizes (and i64 for relative offsets within files). Hence, I believe it would be appropriate to use u64 for internal computations (and to return it from buffer_position() etc.) as well.

@Mingun
Copy link
Collaborator

Mingun commented Jun 2, 2024

Seems reasonable. Mind to crate a PR? I plan to release 0.32 next few days so this change can be included there. error_position() also should be changed and do not forgot to add changelog entry.

@tniessen
Copy link
Author

tniessen commented Jun 2, 2024

@Mingun I'll give it a try! :)

@tniessen
Copy link
Author

While trying to approach this issue, I noticed a much bigger issue: all internal offsets are computed using usize, which breaks on 32-bit systems when processing more than 4 GiB of data in a single document.


Here's a very short reproduction of the issue, beginning with main.rs (with only quick-xml as a dependency):

fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
    }
    println!("Done.");
}

You can create such a file large.xml, for example, like this:

(echo -n '<root>' ; perl -e 'print "<child>text</child>"x300000000' ; echo -n '</root>') > large.xml

Then run the code in 32-bit mode:

$ rustup target add i686-unknown-linux-gnu
$ cargo run --release --target i686-unknown-linux-gnu
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `release` profile [optimized] target(s) in 0.27s
     Running `target/i686-unknown-linux-gnu/release/test-quick-xml-4gb`
thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

As you can see, and as expected from the internal use of usize for file offsets, parsing fails once the offset within the file exceeds the range of usize.


Since quick_xml::reader::Reader<R: BufRead> should not be limited to files that fit into memory, I believe that fixing these related issues would require changing all internal offset computations to use u64.

@tniessen tniessen changed the title buffer_position() should return u64 Reader<R: BufRead> unnecessarily panics on 32-bit targets when the document size exceeds 4 GiB Jun 22, 2024
@pronebird
Copy link
Contributor

pronebird commented Jun 22, 2024

Looking at the same thing right now. usize is platform dependent and I have read that it's 128-bit on RISC-V but so far by looking at rust docs, usize cannot be larger than 64-bit. So I wonder if having a helper to simply convert all usize to u64 internally would do the trick, i.e:

fn to_u64(value: usize) -> u64 {
    u64::try_from(value).expect("failed to convert usize to u64")
}

Unless I missed something, it looks like we could get away with keeping everything memory related in usize but then convert to u64 when storing last_error_offset and offset.

@Mingun
Copy link
Collaborator

Mingun commented Jun 22, 2024

Yes, internal offsets should be stored in u64. I'm not sure, is it possible to Vec to store arrays which length exceeds usize? If yes, then we should at least have a special handling to break out long Text / CData events into multiple ones. Actually, tag name also could be such absurdly big, but not in a real life. Such improvements, however, can be done in different PRs, but need at least mention such limitations in the documentation.

I did a labor work of prototyping in u64-offsets branch.

@tniessen
Copy link
Author

is it possible to Vec to store arrays which length exceeds usize?

If I'm understanding the question correctly, then the answer is no. A Vec<T> with capacity cap maintains a heap allocation of size mem::size_of::<T>() * cap, so if cap exceeded usize, this would be impossible.

@pronebird
Copy link
Contributor

@Mingun cherry pick fixes for tests from pronebird@a393ffb

@tniessen
Copy link
Author

@pronebird @Mingun I apologize, my previous example did not clear the buffer in-between events properly. This is the correct reproduction:

fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
        buf.clear();  // This was missing above.
    }
    println!("Done.");
}

Now we see the expected panic:

$ cargo run --target i686-unknown-linux-gnu
   Compiling cfg-if v1.0.0
   Compiling memchr v2.7.4
   Compiling encoding_rs v0.8.34
   Compiling quick-xml v0.33.0
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.13s
     Running `target/i686-unknown-linux-gnu/debug/test-quick-xml-4gb`
thread 'main' panicked at /home/tniessen/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quick-xml-0.33.0/src/reader/buffered_reader.rs:286:5:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@Mingun
Copy link
Collaborator

Mingun commented Jun 23, 2024

@tniessen, could you test that with u64-offsets branch?

@tniessen
Copy link
Author

@Mingun Neither debug nor release mode panic with that branch (using the minimal repro from my last comment).

@Mingun
Copy link
Collaborator

Mingun commented Jun 24, 2024

Cool! I would be good to finish that work by thinking how we can test it. Of course, having 4GB+ file in the repository is not an option 😄. We could create a special BufRead implementation to stream data until its size exceeded usize::MAX. Or find are more smart way to test that (set internal attributes to a high value and stream some data?). Maybe I would find time in the nearest future to do that myself, but if someone would like to implement that, it would be great. How much time has the 4GB document been read?

@pronebird
Copy link
Contributor

We could create a special BufRead implementation to stream data until its size exceeded usize::MAX

You probably mean exceeds u32::MAX because on 64-bit platforms usize::MAX would be the same as u64::MAX which would result into the same panic.

Graceful solution would be to use checked arithmetics all over the place, catch overflows and return an Err back but I am not sure if realistically anyone is ever going to hit that limit, so I suppose no need to complicate code more than needed.

@Mingun
Copy link
Collaborator

Mingun commented Jun 24, 2024

I made a some checks and here is the results:

Test Time
With checks + small text 438.37s (7m 18.37s)
With checks + long text 15.97s
Without checks + small text 411.09s (6m 51.09s)
Without checks + long text 15.40s
Without checks + small text + --release 28.29s
Without checks + long text + --release 0.45s

"small text" is the

<child>text</child>

"long text" is the Lorem Ipsum with 100 paragraphs generated by https://ru.lipsum.com/feed/html, 61475 bytes long in the following frame:

<content description="Text generated by https://ru.lipsum.com/feed/html: 100 paragraphs">
Lorem Ipsum...
</content>

"With checks" is the assert_eq! of the names returned in the Start and End events, "without checks" -- without them.

The source of XML:

struct Fountain<'a> {
    /// That piece of data repeated infinitely...
    chunk: &'a [u8],
    /// Part of `chunk` that was consumed by BufRead impl
    consumed: usize,
    /// The overall count of read bytes
    overall_read: u64,
}
impl<'a> io::Read for Fountain<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let available = &self.chunk[self.consumed..];
        let len = buf.len().min(available.len());
        let (portion, _) = available.split_at(len);

        buf.copy_from_slice(portion);
        Ok(len)
    }
}
impl<'a> io::BufRead for Fountain<'a> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        Ok(&self.chunk[self.consumed..])
    }

    fn consume(&mut self, amt: usize) {
        self.consumed += amt;
        if self.consumed == self.chunk.len() {
            self.consumed = 0;
        }
        self.overall_read += amt as u64;
    }
}

Test is ended when overall_read >= u32::MAX.

Because I ran tests on x86_64, I did not run into the problems, so the only question remains how to test that on the correct target to hit the error. I think that should be possible to ensure that we have such target on CI.

@tniessen
Copy link
Author

Thank you @Mingun and @pronebird for your work on this!

@Mingun
Copy link
Collaborator

Mingun commented Jun 24, 2024

I also just released 0.34.0 with this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants