Version 0.3 #26

bovee · 2019-08-22T19:23:42Z

Preliminary change list:

better error reporting (i.e. parse failure now gives the record it failed on)
simplification of fastx_bytes, fastx_stream into parse_sequence_reader and fastx_cli into parse_sequence_path
SeqRecord is now SequenceRecord and many of its methods have been spun out into a Sequence trait that allows working on e.g. byte slices
the .kmers method has been simplified and a new .canonical_kmers method has been introduced with much of the original's functionality (it takes an explicit reverse_complement so that its reference can be & instead of &mut)
FASTQ parser tracks second ID field
automatic decompression now takes Read instead of Read + Seek so we can handle e.g. gzip files piped in through stdin
removal of single-file zip handling to support above (zip requires Seek) 😞
lots of cleanup (cargo clippy'd)
addition of fuzzing targets
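
For readers wondering how format detection can work with only Read (no Seek), here is a minimal sketch using just the standard library. It is not needletail's actual implementation; the `Compression` enum and the `sniff` function are hypothetical names made up for illustration. The idea is to consume the leading magic bytes and then chain them back in front of the rest of the stream, so nothing ever has to rewind:

```rust
use std::io::{self, Cursor, Read};

/// Compression formats this sketch recognises from leading magic bytes.
/// (Hypothetical type, not part of needletail.)
#[derive(Debug, PartialEq)]
pub enum Compression {
    Gzip,
    Bzip2,
    None,
}

/// Reads up to two bytes from `reader`, classifies them, and returns a new
/// reader that replays those bytes before continuing with the rest of the
/// stream. Only `Read` is required, so this also works on pipes like stdin.
pub fn sniff<R: Read>(mut reader: R) -> io::Result<(Compression, impl Read)> {
    let mut magic = [0u8; 2];
    let mut filled = 0;
    // `read` may return fewer bytes than requested, so loop until we have
    // two bytes or reach end of input.
    while filled < magic.len() {
        let n = reader.read(&mut magic[filled..])?;
        if n == 0 {
            break;
        }
        filled += n;
    }
    let kind = match &magic[..filled] {
        [0x1f, 0x8b] => Compression::Gzip, // gzip magic number
        [b'B', b'Z'] => Compression::Bzip2, // bzip2 streams start with "BZ"
        _ => Compression::None,
    };
    // Put the consumed prefix back in front of the remaining stream.
    let replay = Cursor::new(magic[..filled].to_vec()).chain(reader);
    Ok((kind, replay))
}
```

The returned reader can then be handed to whichever decoder matches `kind`. This also hints at why single-file zip support had to go: zip readers need to seek to the central directory at the end of the file, which a prefix-replay trick like this cannot provide.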

bovee · 2019-08-22T19:25:17Z

Benchmarks

Current benchmarks are within the range of error of the old ones, although perhaps slightly slower (this may be expected, since we're now tracking more parse state to allow better error messages):

New

test bench_bitkmer_speed ... bench:  89,666,605 ns/iter (+/- 1,059,245)
test bench_fasta_bytes   ... bench:  16,579,247 ns/iter (+/- 1,685,079)
test bench_fasta_file    ... bench:  16,507,717 ns/iter (+/- 1,163,518)
test bench_fastq_bytes   ... bench:   4,438,460 ns/iter (+/- 866,800)
test bench_fastq_file    ... bench:   4,408,823 ns/iter (+/- 316,858)
test bench_kmer_speed    ... bench: 200,107,045 ns/iter (+/- 7,185,221)

Old

test bench_bitkmer_speed ... bench:  90,731,881 ns/iter (+/- 9,133,680)
test bench_fasta_bytes   ... bench:  16,563,511 ns/iter (+/- 1,104,243)
test bench_fasta_file    ... bench:  16,537,416 ns/iter (+/- 646,898)
test bench_fastq_bytes   ... bench:   4,274,139 ns/iter (+/- 159,441)
test bench_fastq_file    ... bench:   4,298,571 ns/iter (+/- 171,011)
test bench_kmer_speed    ... bench: 196,687,450 ns/iter (+/- 9,193,177)
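
As a quick sanity check on the "within range of error" reading, the sketch below treats each reported +/- value as a symmetric spread around the ns/iter figure and asks whether the new and old ranges overlap. That interpretation is an assumption made for illustration (the bench harness's +/- is not a formal confidence interval), and the `Bench` type and helper names are hypothetical. The numbers are copied from the output above:

```rust
/// One benchmark line: reported time per iteration and the reported +/- spread.
/// (Hypothetical type for this sketch.)
#[derive(Clone, Copy)]
pub struct Bench {
    pub ns_per_iter: u64,
    pub spread: u64,
}

impl Bench {
    /// Lower and upper bound of the reported range, in nanoseconds.
    pub fn range(&self) -> (u64, u64) {
        (
            self.ns_per_iter.saturating_sub(self.spread),
            self.ns_per_iter + self.spread,
        )
    }
}

/// True when the two reported ranges share at least one value.
pub fn ranges_overlap(a: Bench, b: Bench) -> bool {
    let (a_lo, a_hi) = a.range();
    let (b_lo, b_hi) = b.range();
    a_lo <= b_hi && b_lo <= a_hi
}

/// (name, new, old) triples copied from the benchmark output above.
pub fn results() -> Vec<(&'static str, Bench, Bench)> {
    let b = |ns_per_iter, spread| Bench { ns_per_iter, spread };
    vec![
        ("bench_bitkmer_speed", b(89_666_605, 1_059_245), b(90_731_881, 9_133_680)),
        ("bench_fasta_bytes", b(16_579_247, 1_685_079), b(16_563_511, 1_104_243)),
        ("bench_fasta_file", b(16_507_717, 1_163_518), b(16_537_416, 646_898)),
        ("bench_fastq_bytes", b(4_438_460, 866_800), b(4_274_139, 159_441)),
        ("bench_fastq_file", b(4_408_823, 316_858), b(4_298_571, 171_011)),
        ("bench_kmer_speed", b(200_107_045, 7_185_221), b(196_687_450, 9_193_177)),
    ]
}
```

Under that reading, every new/old pair overlaps, which is consistent with the summary above; the FASTQ and k-mer figures are where the new numbers sit slightly higher.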

Fuzzing

I set up fuzzing and created test cases for parsing FASTA and FASTQ files. These fairly rapidly caught a small bug where FASTQ files with different line endings (\r\n vs \n) between their ID and sequence lines caused panics. I then ran cargo fuzz run parse_fasta out past 300,000 iterations and cargo fuzz run parse_fastq out past 500,000 iterations without finding any more issues.
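
To make the line-ending case concrete, here is a small standalone sketch (not the parser in this PR; `Record`, `trim_line_ending`, and `parse_fastq` are hypothetical names) of a four-line FASTQ reader that strips each line's ending independently, so a record may mix \r\n and \n, and that returns an error naming the record it failed on instead of panicking:

```rust
/// A borrowed four-line FASTQ record (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct Record<'a> {
    pub id: &'a [u8],
    pub seq: &'a [u8],
    pub qual: &'a [u8],
}

/// Strips a trailing `\r\n`, `\n`, or lone `\r` from a line, if present.
fn trim_line_ending(line: &[u8]) -> &[u8] {
    match line {
        [rest @ .., b'\r', b'\n'] => rest,
        [rest @ .., b'\n'] => rest,
        [rest @ .., b'\r'] => rest,
        _ => line,
    }
}

/// Parses simple (non-wrapped) FASTQ. Each line's ending is handled on its
/// own, so mixed endings within one record are accepted.
pub fn parse_fastq(data: &[u8]) -> Result<Vec<Record<'_>>, String> {
    let mut lines = data.split_inclusive(|&b| b == b'\n').map(trim_line_ending);
    let mut records = Vec::new();
    while let Some(header) = lines.next() {
        if header.is_empty() {
            continue; // tolerate blank lines between records
        }
        let n = records.len() + 1; // 1-based record number for error messages
        let id = header
            .strip_prefix(b"@")
            .ok_or_else(|| format!("record {}: header does not start with '@'", n))?;
        let seq = lines
            .next()
            .ok_or_else(|| format!("record {}: missing sequence line", n))?;
        let sep = lines
            .next()
            .ok_or_else(|| format!("record {}: missing '+' line", n))?;
        if !sep.starts_with(b"+") {
            return Err(format!("record {}: separator does not start with '+'", n));
        }
        let qual = lines
            .next()
            .ok_or_else(|| format!("record {}: missing quality line", n))?;
        if qual.len() != seq.len() {
            return Err(format!("record {}: sequence and quality lengths differ", n));
        }
        records.push(Record { id, seq, qual });
    }
    Ok(records)
}
```

Inputs like these (mixed endings, truncated records, mismatched lengths) are the kind of thing a fuzzer tends to produce quickly, which is why returning a Result on every malformed path matters.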

codecov · 2019-09-07T18:08:55Z

Codecov Report

❗ No coverage uploaded for pull request base (master@a192a17).
The diff coverage is 90%.

@@          Coverage Diff           @@
##             master   #26   +/-   ##
======================================
  Coverage          ?   90%           
======================================
  Files             ?     5           
  Lines             ?   590           
  Branches          ?     0           
======================================
  Hits              ?   531           
  Misses            ?    59           
  Partials          ?     0

Impacted Files	Coverage Δ
src/formats/mod.rs	`86.27% <86.27%> (ø)`
src/formats/fasta.rs	`88.7% <88.7%> (ø)`
src/formats/fastq.rs	`90.12% <90.12%> (ø)`
src/formats/buffer.rs	`94.73% <94.73%> (ø)`
tests/format_specimens.rs	`97.91% <97.91%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a192a17...b4bf635.

bovee · 2019-09-12T20:00:34Z

This is a breaking update. As a rough upgrade guide based on porting both finch and some of our internal pipelines to v0.3, the following changes (at a minimum) will be necessary to keep code working:

Change

for (ix, kmer, is_canonical) in read.kmers(k, true) {
    ...
}

to

let rc = read.reverse_complement();
for (ix, kmer, is_canonical) in read.canonical_kmers(k, &rc) {
    ...
}
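
To illustrate why passing the reverse complement in explicitly lets the iterator borrow with & rather than &mut, here is a standalone sketch. It is not needletail's implementation: the free functions below and the meaning of the boolean (true when the reverse-complement strand was chosen) are assumptions made for this example.

```rust
/// Reverse complement of a DNA sequence; bytes other than ACGT/acgt pass through.
pub fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'T' => b'A',
            b'C' => b'G',
            b'G' => b'C',
            b'a' => b't',
            b't' => b'a',
            b'c' => b'g',
            b'g' => b'c',
            other => other,
        })
        .collect()
}

/// Yields `(position, kmer, used_rc)` for every k-mer of `seq`, choosing the
/// lexicographically smaller of the forward k-mer and its reverse complement.
/// Both inputs are only read, so shared references are enough.
pub fn canonical_kmers<'a>(
    seq: &'a [u8],
    rc: &'a [u8],
    k: usize,
) -> impl Iterator<Item = (usize, &'a [u8], bool)> + 'a {
    assert_eq!(seq.len(), rc.len(), "rc must be the reverse complement of seq");
    let n = seq.len();
    let count = if k == 0 || k > n { 0 } else { n - k + 1 };
    (0..count).map(move |i| {
        let fwd = &seq[i..i + k];
        // The k-mer starting at forward position i occupies
        // [n - k - i, n - i) in the reverse-complemented sequence.
        let rev = &rc[n - k - i..n - i];
        if rev < fwd {
            (i, rev, true)
        } else {
            (i, fwd, false)
        }
    })
}
```

Because the caller owns the reverse-complement buffer, the iterator never has to allocate or mutate anything, and several k-mer passes can share the same two borrowed slices.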

Change

use needletail::seq::SeqRecord;

to

use needletail::SequenceRecord;

Change

use needletail::fastx::fastx_cli;

to

use needletail::parse_sequence_path;

Also, if anyone's using needletail v0.2 and needs help/advice on porting to v0.3, I'm happy to look over code and answer questions.

boydgreenfield · 2019-09-12T20:03:08Z

👏

Roderick Bovee added 2 commits August 20, 2019 21:32

Improve error messaging

8733957

Large refactor, track id2 field on FASTQ, and tweak some fn signatures

4559181

bovee force-pushed the v0.3 branch 3 times, most recently from 0587cb3 to 09f629c · August 22, 2019 21:47

Roderick Bovee added 2 commits August 22, 2019 14:52

Unify FASTX parsing functions & allow streaming gz/bz/etc

fe92b71

Fuzz (and fix fuzzing issue)

942a760

bovee force-pushed the v0.3 branch from 09f629c to 942a760 · August 22, 2019 21:52

Keats reviewed Aug 27, 2019

src/util.rs (resolved)

Keats reviewed Aug 27, 2019

src/util.rs (resolved)

Refactor format code in module, relax UTF8 id requirement, and add writers. Closes #29, closes #13

4aef928

bovee force-pushed the v0.3 branch from 3011dc7 to 4aef928 · August 27, 2019 21:03

Keats reviewed Aug 28, 2019

src/formats/fasta.rs (outdated, resolved)

bovee force-pushed the v0.3 branch from 572bd74 to e42ea32 · August 29, 2019 03:27

Rearrange tests & simplify buffer API

042e21c

bovee force-pushed the v0.3 branch 2 times, most recently from 0ea488e to 042e21c · August 30, 2019 22:33

Rewrite buffer interface to allow headers

5bb2360

bovee force-pushed the v0.3 branch from 7b35ce8 to e3a9888 · September 5, 2019 18:39

Use criterion and otherwise cleanup/simplify benchmarking

23bd08a

bovee force-pushed the v0.3 branch from e3a9888 to 23bd08a · September 5, 2019 18:58

Adjustments to better support sequence validation

c4ae3d9

bovee force-pushed the v0.3 branch 7 times, most recently from 715e53d to 66ebe03 · September 7, 2019 04:46

bovee force-pushed the v0.3 branch 2 times, most recently from d721205 to bc6dc45 · September 7, 2019 17:13

Add test against Specimens.jl, improve buffer perf & error msging

1f5ac3a

bovee force-pushed the v0.3 branch from bc6dc45 to 1f5ac3a · September 7, 2019 18:08

Keats added 4 commits September 10, 2019 13:43

Add missing newline to Fastq::write

fdc2a22

Rename fasta/fastq to {?}Record

9208776

Move tests to test modules

ddbc60a

Add samples from FormatSpecimens.jl repo

4621ab0

bovee force-pushed the v0.3 branch from 6b4249a to f849cab · September 10, 2019 16:14

Refactor sequence handling into trait

6c032c4

bovee force-pushed the v0.3 branch from f849cab to 6c032c4 · September 10, 2019 16:20

Roderick Bovee added 2 commits September 10, 2019 13:42

Rewrite buffer creation and split up sequence code

a8c61f4

Change method name to parse_sequence_reader

d11aa4b

bovee force-pushed the v0.3 branch 2 times, most recently from fb9a8a8 to 368da9e · September 11, 2019 15:53

Improve error messaging

c97e487

bovee force-pushed the v0.3 branch from 368da9e to c97e487 · September 11, 2019 16:35

Roderick Bovee added 2 commits September 12, 2019 09:39

More docs and add parse_sequence_path method

85f90e5

Even more docs to finalize v0.3

b4bf635

bovee merged commit 6e92dc9 into master Sep 12, 2019

bovee deleted the v0.3 branch September 12, 2019 20:01

bovee restored the v0.3 branch September 12, 2019 20:01

boydgreenfield changed the title ~~[WIP] Version 0.3~~ Version 0.3 Sep 12, 2019

luizirber mentioned this pull request Dec 7, 2019

API improvements luizirber/niffler#1

Closed
