Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 0.3 #26

Merged
merged 20 commits into from
Sep 12, 2019
Merged

Version 0.3 #26

merged 20 commits into from
Sep 12, 2019

Conversation

bovee
Copy link
Contributor

@bovee bovee commented Aug 22, 2019

Preliminary change list:

  • better error reporting (i.e. parse failure now gives the record it failed on)
  • simplification of fastx_bytes, fastx_stream into parse_sequence_reader and fastx_cli into parse_sequence_path
  • SeqRecord is now SequenceRecord and many of its methods have been spun out into a Sequence trait that allows working on e.g. byte slices
  • the .kmers method has been simplified and a new .canonical_kmers method has been introduced with much of the originals functionality (and an takes an explicit reverse_complement to allow its reference to be & instead of &mut)
  • FASTQ parser tracks second ID field
  • automatic decompression now takes Read instead of Read + Seek so we can handle e.g. gzip files piped in through stdin
  • removal of single-file zip handling to support above (zip requires Seek) 😞
  • lots of cleanup (cargo clippyd)
  • addition of fuzzing targets

@bovee
Copy link
Contributor Author

bovee commented Aug 22, 2019

Benchmarks

Current benchmarks are within range of error, although perhaps slightly slower (this may be expected since we're now doing more tracking of parse state to allow better error messages):

New

test bench_bitkmer_speed ... bench:  89,666,605 ns/iter (+/- 1,059,245)
test bench_fasta_bytes   ... bench:  16,579,247 ns/iter (+/- 1,685,079)
test bench_fasta_file    ... bench:  16,507,717 ns/iter (+/- 1,163,518)
test bench_fastq_bytes   ... bench:   4,438,460 ns/iter (+/- 866,800)
test bench_fastq_file    ... bench:   4,408,823 ns/iter (+/- 316,858)
test bench_kmer_speed    ... bench: 200,107,045 ns/iter (+/- 7,185,221)

Old

test bench_bitkmer_speed ... bench:  90,731,881 ns/iter (+/- 9,133,680)
test bench_fasta_bytes   ... bench:  16,563,511 ns/iter (+/- 1,104,243)
test bench_fasta_file    ... bench:  16,537,416 ns/iter (+/- 646,898)
test bench_fastq_bytes   ... bench:   4,274,139 ns/iter (+/- 159,441)
test bench_fastq_file    ... bench:   4,298,571 ns/iter (+/- 171,011)
test bench_kmer_speed    ... bench: 196,687,450 ns/iter (+/- 9,193,177)

Fuzzing

I set up fuzzing and created test cases for parsing FASTA and FASTQ files. These fairly rapidly caught a small bug with FASTQ files having different lines endings (\r\n vs \n) between their ID and sequence lines causing panics. I then ran cargo fuzz run parse_fasta out past 300,000 iterations and cargo fuzz run parse_fastq out past 500,000 iterations without finding any more issues.

@bovee bovee force-pushed the v0.3 branch 3 times, most recently from 0587cb3 to 09f629c Compare August 22, 2019 21:47
src/util.rs Show resolved Hide resolved
src/util.rs Show resolved Hide resolved
src/formats/fasta.rs Outdated Show resolved Hide resolved
@bovee bovee force-pushed the v0.3 branch 2 times, most recently from 0ea488e to 042e21c Compare August 30, 2019 22:33
@bovee bovee force-pushed the v0.3 branch 7 times, most recently from 715e53d to 66ebe03 Compare September 7, 2019 04:46
@bovee bovee force-pushed the v0.3 branch 2 times, most recently from d721205 to bc6dc45 Compare September 7, 2019 17:13
@codecov
Copy link

codecov bot commented Sep 7, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@a192a17). Click here to learn what that means.
The diff coverage is 90%.

Impacted file tree graph

@@          Coverage Diff           @@
##             master   #26   +/-   ##
======================================
  Coverage          ?   90%           
======================================
  Files             ?     5           
  Lines             ?   590           
  Branches          ?     0           
======================================
  Hits              ?   531           
  Misses            ?    59           
  Partials          ?     0
Impacted Files Coverage Δ
src/formats/mod.rs 86.27% <86.27%> (ø)
src/formats/fasta.rs 88.7% <88.7%> (ø)
src/formats/fastq.rs 90.12% <90.12%> (ø)
src/formats/buffer.rs 94.73% <94.73%> (ø)
tests/format_specimens.rs 97.91% <97.91%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a192a17...b4bf635. Read the comment docs.

@bovee bovee force-pushed the v0.3 branch 2 times, most recently from fb9a8a8 to 368da9e Compare September 11, 2019 15:53
@bovee
Copy link
Contributor Author

bovee commented Sep 12, 2019

This is a breaking update. As a rough updating guide based off porting both finch and some of our internal pipelines to v0.3, the following changes (at a minimum) are going to be necessary to keep code working:

Change

for (ix, kmer, is_canonical) in read.kmers(k, true) {
    ...
}

to

let rc = read.reverse_complement();
for (ix, kmer, is_canonical) in read.canonical_kmers(k, &rc) {
    ...
}

Change

use needletail::seq::SeqRecord;

to

use needletail::SequenceRecord;

Change

use needletail::fastx::fastx_cli;

to

use needletail::parse_sequence_path;

Also, if anyone's using needletail v0.2 and needs help/advice on porting to v0.3, I'm happy to look over code and answer questions.

@bovee bovee merged commit 6e92dc9 into master Sep 12, 2019
@bovee bovee deleted the v0.3 branch September 12, 2019 20:01
@bovee bovee restored the v0.3 branch September 12, 2019 20:01
@boydgreenfield boydgreenfield changed the title [WIP] Version 0.3 Version 0.3 Sep 12, 2019
@boydgreenfield
Copy link
Contributor

👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants