Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add decoder fuzzing #34

Closed
wants to merge 14 commits into from
Closed

Conversation

Shnatsel
Copy link
Contributor

@Shnatsel Shnatsel commented Apr 16, 2020

Pull Request Overview

  • Add fuzzing targets for decoding each format
  • Ignore all xz checksums when fuzzing because the fuzzer cannot create files with valid checksums
  • Run all fuzzing targets overnight to compile a corpus of interesting inputs
    • xz was seeded with official xz test suite (in public domain)
  • Add fuzzing target for comparing xz decoding to liblzma (via xz2 wrapper crate)
  • Add README with an overview of fuzzing

Testing Strategy

This entire pull request is tests. They seem to detect failures because they've discovered #32.

They haven't found a single bug in decoding, which is completely unexpected. This is the first time I see a codebase that has not been extensively fuzzed previously not fail under fuzzing. But the coverage figures seem to be about right, and failure detection clearly works.

There are some mismatches discovered between liblzma and lzma-rs (see #35).

Supporting Documentation and References

https://rust-fuzz.github.io/book/introduction.html

TODO or Help Wanted

For some reason fuzzing lzma1 decoding is 4x slower than lzma2.

I didn't compare lzma1 decoding to xz2 crate because I'm not sure which variant it xz2 crate uses. It should be trivial to add later if desired.

Continuous fuzzing via e.g. https://fuzzit.dev/ would be very welcome, but can only be set up by maintainers.

It would be nice to advertise in the README that the decoding has been tested to produce identical results to liblzma once the mismatches are fixed (#35).

@Shnatsel
Copy link
Contributor Author

I've found a bug in comparison between liblzma and lzma-rs. I've pushed a fix, it does detect some behavior mismatches now.

Copy link
Owner

@gendx gendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the corpus from this pull request (including the git history), or make a separate pull request without the corpus, to make it easier to review and to keep the git repository clean?


Pull Request Overview

  • Add fuzzing targets for decoding each format

  • Ignore all xz checksums when fuzzing because the fuzzer cannot create files with valid checksums

  • Run all fuzzing targets overnight to compile a corpus of interesting inputs

    • xz was seeded with official xz test suite (in public domain)

Note that xz is a container format around LZMA2, itself containing raw LZMA streams. My guess is that random bit flips and such mutations applied on xz files would often test the outer layer (container layer), without exercising a lot of the LZMA logic (because most mutations would just produce invalid data at the xz layer).

As a future fuzzing project, it could be interesting to extract an LZMA2 corpus and an LZMA corpus from the xz corpus, to fuzz more specifically the lower layers of the logic.

Testing Strategy

This entire pull request is tests. They seem to detect failures because they've discovered #32.

Good catch!

They haven't found a single bug in decoding, which is completely unexpected. This is the first time I see a codebase that has not been extensively fuzzed previously not fail under fuzzing. But the coverage figures seem to be about right, and failure detection clearly works.

What do you mean by "bug"? I see at least a crash (#32) and a mismatch (#35).

Other than that, my brief explanation would be to run:

  • git grep unsafe
  • git grep unwrap

The goal of this project is to prioritize clarity, safety and correctness over performance.

Another possible explanation: a "dumb" fuzzer without domain knowledge doesn't exercise deep interesting code paths. An example that comes to mind (caught during a code review) would be #22 (comment). It would be interesting to know if the fuzzer would catch such a bug if it was introduced in the code base.

I'm wondering if the coverage numbers you mention can pin-point which lines of the source code have been covered by fuzzing - and how often?

Another interesting fuzzing scenario would be to generate "more valid" inputs (e.g. with #24) to test deeper layers of the logic.

There are some mismatches discovered between liblzma and lzma-rs (see #35).

Good catch!

Supporting Documentation and References

https://rust-fuzz.github.io/book/introduction.html

TODO or Help Wanted

For some reason fuzzing lzma1 decoding is 4x slower than lzma2.

My gut feeling would be that random inputs generated by the fuzzer are more likely to be invalid for lzma2 than for lzma1 - and that the invalidity is detected earlier before any allocation happens, etc.

Some possible explanations:

  • An LZMA stream (what you call "lzma1") has a short header (5 bytes), with the constraint that the first byte must be < 225. After that, it's a range coder (which is rather slow). Once an input satisfies these constraints (at least 5 bytes, first byte < 225), the decoder likely goes quite far in parsing the input (even though the end result is invalid - e.g. an end-of-stream marker is missing).
  • LZMA2 is a wrapper around LZMA chunks. The first byte of a chunk is more constrained (0, 1, 2 or >= 128). Then 4 bytes of size. Then in 3/4 cases a byte < 255. This makes it more likely that a random input is rejected by lzma2 than by lzma1.
  • An LZMA2 stream can contain uncompressed streams (which are copied as-is without processing - which is quite fast).

I didn't compare lzma1 decoding to xz2 crate because I'm not sure which variant it xz2 crate uses. It should be trivial to add later if desired.

It would be interesting to add a scenario that generates LZMA streams that are then wrapped into a valid XZ/LZMA2 container, for differential fuzzing of the LZMA layer, without bothering about the API available in xz2.

Continuous fuzzing via e.g. https://fuzzit.dev/ would be very welcome, but can only be set up by maintainers.

It would be nice to advertise in the README that the decoding has been tested to produce identical results to liblzma once the mismatches are fixed (#35).

I'd be cautious not to over-advertise what's been tested until more complex fuzzing scenarios have been tried as well.

@@ -1,4 +1,3 @@

target
corpus
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional:

  • a pull request with 4574 files is intractable to review,
  • this adds many unnecessary files to the main git repository.

Although I agree that the corpus can have some value, I would keep that outside of the master branch (for example as a separate open pull request).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe having corpus in a PR is a good idea. I'd aim for continuous fuzzing via either fuzzit.dev or Google's oss-fuzz, so I'd suggest looking at what those platforms require and working with that. I've seen repos that use fuzzit.dev keep the corpus out of the repo (e.g. https://github.com/image-rs/image-png/) but I don't know how exactly that's accomplished.

@Shnatsel Shnatsel marked this pull request as draft April 16, 2020 21:56
@Shnatsel
Copy link
Contributor Author

Can you remove the corpus from this pull request (including the git history), or make a separate pull request without the corpus, to make it easier to review and to keep the git repository clean?

Fair point, will do!

As a future fuzzing project, it could be interesting to extract an LZMA2 corpus and an LZMA corpus from the xz corpus, to fuzz more specifically the lower layers of the logic.

Indeed. I'm not familiar with the XZ format, so any help with that would be appreciated.

What do you mean by "bug"? I see at least a crash (#32) and a mismatch (#35).

The panic was already triggered by XZ test files, so it's not really discovered by the fuzzer. I expected to see more panics, or an OOM, or an infinite loop, or something along those lines. I found nothing that was exploitable in any appreciable way, not even a DoS!

An example that comes to mind (caught during a code review) would be #22 (comment). It would be interesting to know if the fuzzer would catch such a bug if it was introduced in the code base.

If it would result in a panic - then most likely yes! It would also detect returned error or incorrect output if comparison against liblzma is used. Instrumentation-guided fuzzers are pretty sophisticated when it comes to binary formats. Combine that with 100,000,000 executions per day on a $100 desktop CPU and you might even stumble into that edge case by sheer brute force; I've had a fuzzer expose an interesting edge case and stumble upon a valid CRC16 for it overnight.

I'm wondering if the coverage numbers you mention can pin-point which lines of the source code have been covered by fuzzing - and how often?

This is based on LLVM sanitizer-coverage, so you can use any tool that works with it to visualize coverage data. Or just run the entire corpus through a very simple decompression program using the coverage tool of your choice, like tarpaulin.

@Shnatsel Shnatsel mentioned this pull request Apr 16, 2020
@Shnatsel
Copy link
Contributor Author

Superseded by #36

@Shnatsel Shnatsel closed this Apr 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants