reorder readme and include it in the crate doc
KillingSpark committed Dec 16, 2024
1 parent 5f1710f commit 1ed5fbc
Showing 2 changed files with 36 additions and 46 deletions.
80 changes: 34 additions & 46 deletions Readme.md
@@ -27,55 +27,11 @@ On the compression side:
- [ ] Checksums
- [ ] Dictionaries

## Speed

In terms of speed this library is behind the original C implementation, which has a Rust binding located [here](https://github.com/gyscos/zstd-rs).

Measuring with the 'time' utility, with the original zstd and my decoder both decoding the same enwik9.zst file from a ramfs, my decoder is about 3.5 times slower. Enwik9 is highly compressible; for less compressible data (like an Ubuntu installation .iso) my decoder comes close to being only 1.4 times slower.

## Can do:

1. Parse all files in /decodecorpus_files. These were generated with [decodecorpus](https://github.com/facebook/zstd/tree/dev/tests) by the original zstd developers
1. Decode all of them correctly into the output buffer
1. Decode all the decode_corpus files (1000+) I created locally
1. Calculate checksums
1. Act as a `zstd -c -d` drop-in replacement
1. Can be compiled in a no-std environment that provides alloc

## Cannot do

This decoder is pretty much feature complete. If there are any wishes for new APIs or bug reports, please file an issue; I will gladly take a look!

# How can you use it?

@@ -128,6 +84,38 @@ recommended approach.
For an example see the src/bin/zstd.rs file. Basically, you can decode the frame until either a given block count has been decoded or the decodebuffer has reached a certain size. Then you can collect the bytes that are no longer needed from the buffer, do something with them or discard them, and resume decoding the frame in a loop until the frame has been decoded completely. A sketch of this loop follows.
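
Below is a minimal sketch of that loop, modeled on src/bin/zstd.rs. The item paths and method names here (`FrameDecoder::init`, `decode_blocks`, `BlockDecodingStrategy::UptoBytes`, `collect`, `is_finished`) are my reading of the crate's API and may differ between versions — treat them as assumptions and consult src/bin/zstd.rs for the authoritative version.

```rust
// Sketch: decode a frame block-by-block, draining the decode buffer as we go.
// Method names are assumptions based on src/bin/zstd.rs; check the crate docs.
use std::fs::File;
use std::io::Write;

use ruzstd::{BlockDecodingStrategy, FrameDecoder};

fn main() {
    let mut input = File::open("data.zst").unwrap();
    let mut output = std::io::stdout();

    let mut frame_dec = FrameDecoder::new();
    frame_dec.init(&mut input).unwrap();

    while !frame_dec.is_finished() {
        // Decode until roughly 1 MB is buffered...
        frame_dec
            .decode_blocks(&mut input, BlockDecodingStrategy::UptoBytes(1024 * 1024))
            .unwrap();
        // ...then take the decoded bytes out of the buffer and use them.
        if let Some(bytes) = frame_dec.collect() {
            output.write_all(&bytes).unwrap();
        }
    }
}
```

This keeps memory usage bounded by the batch size (plus the window the decoder has to keep for back-references) instead of buffering the whole decompressed frame.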

## Roadmap

1. More performance optimizations
   1. sequence_decoding and reverse_bitreader::get_bits. Those account for about 50% of the whole time used in decoding
   2. Matching suffixes. This accounts for >60% of the whole time used in encoding
2. Implement encoder features
   1. More levels
   2. Dictionaries
   3. Checksums

## Testing

Tests take two forms.

1. Tests using well-formed files that have to decode correctly and are checked against their originals (a sketch of such a test follows this list)
1. Tests using malformed input that has been generated by the fuzzer. These don't have to decode (they are garbage), but they must not make the decoder panic
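
As a concrete illustration of the first kind, a test of this shape would do (the file names here are hypothetical placeholders, and `StreamingDecoder` as the entry point is an assumption):

```rust
// Sketch of a roundtrip test: decode a well-formed file and compare the
// result against its original. File names are hypothetical placeholders.
use std::io::Read;

#[test]
fn well_formed_file_decodes_to_original() {
    let compressed = std::fs::read("decodecorpus_files/case_0001.zst").unwrap();
    let original = std::fs::read("decodecorpus_files/case_0001").unwrap();

    let mut decoder = ruzstd::StreamingDecoder::new(compressed.as_slice()).unwrap();
    let mut decoded = Vec::new();
    decoder.read_to_end(&mut decoded).unwrap();

    assert_eq!(original, decoded);
}
```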

## Fuzzing

Fuzzing has been done on

1. Random input with no initial corpus
2. The \*.zst files in /fuzz_decodecorpus

### You want to help fuzz?

Use `cargo +nightly fuzz run decode` or some other fuzz target to run the fuzzer. It is seeded with files created with decodecorpus. A sketch of such a target is shown below.

If the fuzzer finds a crash, it will be saved to the artifacts dir by the fuzzer. Run `cargo test artifacts` to run the artifacts tests; this will tell you exactly where the decoder panics. If you are able to fix the issue, please feel free to open a pull request. If not, please still submit the offending input and I will see how to fix it myself.
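
For orientation, a decode fuzz target with cargo-fuzz has roughly this shape (a sketch — the repo's actual target may differ in detail, and `StreamingDecoder` as the entry point is an assumption):

```rust
// fuzz/fuzz_targets/decode.rs (sketch). The invariant under test: malformed
// input may fail to decode, but it must never make the decoder panic.
#![no_main]
use libfuzzer_sys::fuzz_target;
use std::io::Read;

fuzz_target!(|data: &[u8]| {
    // Construction may fail on garbage input — an Err here is fine.
    if let Ok(mut decoder) = ruzstd::StreamingDecoder::new(data) {
        let mut out = Vec::new();
        // Decoding errors are expected; only a panic counts as a bug.
        let _ = decoder.read_to_end(&mut out);
    }
});
```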

# Contributing

Contributions will be published under the same MIT license as this project. Please make an entry in the Changelog.md file when you make a PR.
2 changes: 2 additions & 0 deletions src/lib.rs
@@ -9,6 +9,8 @@
//! The [encoding] module contains the code for compression.
//! Compression can be achieved by using the [`encoding::compress`]/[`encoding::compress_to_vec`]
//! functions or the [`encoding::frame_compressor::FrameCompressor`]
//!
#![doc = include_str!("../Readme.md")]
#![no_std]
#![deny(trivial_casts, trivial_numeric_casts, rust_2018_idioms)]

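For reference, the encoding entry points named in that doc comment can be exercised roughly like this (a sketch — `CompressionLevel::Fastest` and the exact signature of `compress_to_vec` are assumptions; check the crate docs for the real API):

```rust
// Sketch of the compression API referenced in the doc comment above.
// The level name and function signature are assumptions, not verified API.
use ruzstd::encoding::{compress_to_vec, CompressionLevel};

fn main() {
    let data = b"Hello, zstd!";
    let compressed = compress_to_vec(data.as_slice(), CompressionLevel::Fastest);
    println!("{} bytes -> {} bytes", data.len(), compressed.len());
}
```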
