This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

Split libraries rework #1657

Closed
wants to merge 67 commits into from

Conversation

@wasade (Member) commented Aug 13, 2014

NOT READY FOR MERGE YET

I apologize in advance for the size of this pull request, and for the size of the associated comment.

This pull request covers an attempt at redoing the core concept behind split libraries (incl. fastq) with a lot of input from @walterst early on. This is not a complete rewrite of split_libraries.py or split_libraries_fastq.py, though it covers most of the latter. The intent was to handle some of the core and primary functionality, support fasta/fastq/whatever, develop a framework that is easy to extend and test, and then merge in additional support and options as we proceed forward, assuming the concept used here makes sense as a replacement. At the end of this long description is an example of the command and the type of data that is yielded by the Workflow. The use of the Workflow object allows for:

  • execution of methods can be done based on runtime options (e.g., barcode-type)
  • execution of methods can be done based on runtime state (e.g., only execute qual code if there is qual data)
  • grouping of like methods (e.g., all quality oriented methods are part of wf_quality)
  • updating state as processing on a single item proceeds
  • tracking of stats
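To make the gating idea concrete, here is a minimal, hypothetical sketch of the concept (a requires decorator, grouped methods, and stats tracking). Names like MiniWorkflow, has_qual, and the option keys are invented for illustration; this is not the actual qiime Workflow code:

```python
# Hypothetical sketch: methods are gated on runtime options and per-item
# state, grouped under a wf_* method, and stats are tracked as we go.
from collections import Counter
from functools import wraps

def requires(option=None, state=None):
    """Only run the wrapped method if the option is set and/or the state
    predicate holds for the current item."""
    def decorator(func):
        @wraps(func)
        def wrapper(self):
            if option is not None and not self.options.get(option):
                return
            if state is not None and not state(self.state):
                return
            return func(self)
        return wrapper
    return decorator

def has_qual(state):
    return state.get('Qual') is not None

class MiniWorkflow:
    def __init__(self, options):
        self.options = options
        self.stats = Counter()
        self.state = {}
        self.failed = False

    def __call__(self, items):
        for item in items:
            self.state = dict(item)
            self.failed = False
            self.wf_quality()
            self.stats['sequence_count'] += 1
            yield self.state, self.failed

    def wf_quality(self):
        # group of quality-oriented methods
        self._quality_min_length()
        self._quality_min_score()

    @requires(option='min_seq_len')
    def _quality_min_length(self):
        if len(self.state['Sequence']) < self.options['min_seq_len']:
            self.stats['min_seq_len'] += 1
            self.failed = True

    @requires(option='phred_quality_threshold', state=has_qual)
    def _quality_min_score(self):
        # only reached when qual data exist for the item
        if min(self.state['Qual']) < self.options['phred_quality_threshold']:
            self.stats['low_quality'] += 1
            self.failed = True

wf = MiniWorkflow({'min_seq_len': 5, 'phred_quality_threshold': 20})
items = [{'Sequence': 'ACGT', 'Qual': None},              # too short
         {'Sequence': 'ACGTACGT', 'Qual': [30, 30, 10]}]  # low quality
results = list(wf(items))
print(wf.stats['min_seq_len'], wf.stats['low_quality'])  # → 1 1
```

The second yielded element stands in for the failure signal that the real Workflow routes through a failure callback.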

If the concept does make sense, then there are a few areas that I'd like help with for guiding this to completion. Specifically:

  • Many of the option names were more or less carried over. Do we want to retain these?
  • Currently quite a few stats are tracked by the Workflow, are there others that would be nice to have?
  • The log file is not currently being written, but most of the necessary data are there. Should it be written out in a form similar to the existing one, or do we want to use a Python or IPython logger?
  • fastq writing is not supported yet and fasta writing is using an inline formatter. I didn't see a fasta formatter that could take a single sequence in skbio (0.1.4, I think, is the qiime dep)
  • Actual 454 support. Primer handling is not yet exposed to the command line. There is an indirection in the Workflow for instrument-specific checks that I wasn't sure we wanted, which blocked actually exposing this.
  • Some of the logic of split_libraries_fastq.py surrounds manipulating inputs (e.g., rc of mapping file barcodes) or manipulating outputs (e.g., rc of reads being written). These are transforms that are external to the Workflow or sequence iterators. Can some of this logic be managed elsewhere?
  • The use of the failure callback in slib when iterating the Workflow here is somewhat awkward, but that is currently the best way to handle sequences that fail.

A few deviations:

  • Using --sample_ids is not currently easy to support as the Workflow does not know what file a given sequence originated from at this time. Is it critical to support this option? If so, it can be done external to the Workflow.
  • --retain_unassigned_reads is just on all the time, much easier to do it this way. This currently includes all failed sequences as well. Is it necessary for the unassigned reads to be raw or should they be processed, and should this include failed reads (e.g., poor quality)?
  • --store_qual_scores is done only if --to-fastq is used
  • only a single mapping file can be passed in. This was done as there isn't a distinction between sequence and source file. Just like --sample_ids, this can be done external to the Workflow if this is critical.
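For the reverse-complement deviations above, transforms kept external to the Workflow might look like this hypothetical sketch (function names are invented for illustration; this is not the PR's code): one function rewrites mapping-file barcodes on the input side, and a generator wraps the output stream.

```python
# Hypothetical sketch of transforms external to the Workflow / iterators.
_COMP = str.maketrans('ACGTacgt', 'TGCAtgca')

def revcomp(seq):
    """Reverse complement a nucleotide string."""
    return seq.translate(_COMP)[::-1]

def rc_mapping_barcodes(barcode_to_sample):
    """Input-side transform (cf. --rev-comp-mapping-barcodes)."""
    return {revcomp(bc): sample for bc, sample in barcode_to_sample.items()}

def rc_output_reads(records):
    """Output-side transform (cf. --rev-comp): rc reads as they are written."""
    for rec in records:
        rec = dict(rec)
        rec['Sequence'] = revcomp(rec['Sequence'])
        yield rec

mapping = {'AGAGTAGCTAAG': 's173'}
print(rc_mapping_barcodes(mapping))  # → {'CTTAGCTACTCT': 's173'}
```

Because both are plain functions over the mapping dict or the record stream, the Workflow itself never needs to know which orientation was requested.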

In addition to the split libraries work, this PR also includes a new command line interface via click, and is stubbed out for a possible structure that can be used for other scripts. This is meant as a possible starting point for using click, and I'm very curious about what other developers think here. The primary point of execution is scripts/qiime. This file simply imports the click command group (qiime_cli) and any other click commands, and additionally, makes the qiime config readily available to any other command via ctx. Though the commands aren't used directly within scripts/qiime, the import makes them part of qiime_cli. The next piece is qiime/cli.py, which provides the qiime command group, and which we likely want to expand out as these are bits that will be common to qiime. Last, there is a qiime/click_commands.py script that contains the slib command. There were a few choices made here that should be explained:

  • imports are done within the slib method. The benefit is that those imports are only executed if the slib command is actually used. If they were at the script level, they would be imported on import of slib thereby adding overhead.
  • ctx is passed in, which makes it trivial to access a common qiime config
  • the Workflow takes kwargs, but not all of the kwargs are relevant for the workflow. As a result, the unused ones are popped off.
  • I didn't see a way to "group" click options on the command line. Right now, there are comments that denote some structure to the options but this is not reflected on the command line.
  • There are various methods defined within slib; the motivation is to minimize import overhead if slib is not actually executed.
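The layout described above can be sketched with a minimal, hypothetical click program. The names qiime_cli and slib follow the description, but this is not the PR's actual code, and the config contents are made up:

```python
# Hypothetical sketch of the click structure: a command group whose context
# carries a shared config, and a subcommand that defers imports until run.
import click

@click.group()
@click.pass_context
def qiime_cli(ctx):
    """QIIME, canonically pronounced 'chime'"""
    # make the qiime config available to every subcommand via ctx.obj
    ctx.obj = {'qiime_config': {'temp_dir': '/tmp'}}

@qiime_cli.command()
@click.option('--min-seq-len', type=int, default=0,
              help='The minimum sequence length')
@click.pass_context
def slib(ctx, min_seq_len):
    """Quality filter and demultiplex sequences"""
    # imports inside the command body only execute if slib is invoked
    from collections import Counter  # stand-in for heavier imports
    stats = Counter()
    click.echo('temp_dir=%s min_seq_len=%d'
               % (ctx.obj['qiime_config']['temp_dir'], min_seq_len))
```

click's CliRunner (click.testing) can invoke the group directly, e.g. `CliRunner().invoke(qiime_cli, ['slib', '--min-seq-len', '5'])`, which is what makes the usage-example testing mentioned later in this thread possible.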

Example output:

$ ./qiime
Usage: qiime [OPTIONS] COMMAND [ARGS]...

  QIIME, canonically pronounced 'chime'

Options:
  --version
  --help     Show this message and exit.

Commands:
  slib  Quality filter and demultiplex sequences
$ ./qiime slib --help
Usage: qiime slib [OPTIONS]

  Quality filter and demultiplex sequences

Options:
  -i, --sequence-read-fp PATH     Input sequence reads  [required]
  -o, --output-dir PATH           [required]
  -m, --mapping_fp FILENAME       Mapping file  [required]
  -b, --barcode-read-fp PATH      Barcode read files
  --rev-comp / --no-rev-comp      Reverse complement sequences on output
  --start-seq-id INTEGER          The starting unique ID for sequences
  --to-fastq                      Write out in fastq
  --rev-comp-barcodes / --no-rev-comp-barcodes
                                  Reverse complement barcode reads
  --phred-offset [33|64]          The ASCII offset used to decode PHRED scores
  -q, --phred-quality-threshold INTEGER
                                  Minimum PHRED quality score
  --barcode-type [golay_12|hamming_8|not-barcoded]
                                  The type of barcode used
  --max-barcode-error FLOAT       The maximum number of barcode errors allowed
  --retain-primer / --no-retain-primer
                                  Whether to retain the primers or not (if
                                  applicable)
  --max-primer-mismatch INTEGER   Maximum mismatches allowed within the
                                  primers
  --min-seq-len INTEGER           The minimum sequence length
  --max-ambig-count INTEGER       Maximum ambiguous bases allowed
  --rev-comp-mapping-barcodes / --no-rev-comp-mapping-barcodes
                                  Reverse complement the mapping barcodes
  --help                          Show this message and exit.

# Notice that multiple input files and barcodes can be specified without using commas
$ ./qiime slib -i ../qiime_test_data/split_libraries_fastq/lane1_read1.fastq.gz -i ../qiime_test_data/split_libraries_fastq/lane2_read1.fastq.gz -m ../qiime_test_data/split_libraries_fastq/map.txt --barcode-type=golay_12 -b ../qiime_test_data/split_libraries_fastq/lane1_barcode.fastq.gz -b ../qiime_test_data/split_libraries_fastq/lane2_barcode.fastq.gz --rev-comp-mapping-barcodes -o slibtest

And finally, here is what one of the yielded items from the Workflow looks like (pprint'd) and the resulting stats tracked for processing the test data:

# resulting state for a sequence that was successfully processed
{'Barcode': 'AGAGTAGCTAAG',
 'Barcode Qual': None,
 'Barcode errors': 0,
 'BarcodeID': 'M00176:17:000000000-A0CNA:1:1:18006:1821 1:N:0:0',
 'BarcodeQual': array([28, 33, 25, 32, 35, 35, 37, 37, 39, 39, 38, 35], dtype=int8),
 'BarcodeQualID': 'M00176:17:000000000-A0CNA:1:1:18006:1821 1:N:0:0',
 'Final barcode': 'AGAGTAGCTAAG',
 'Forward primer': None,
 'Original barcode': 'AGAGTAGCTAAG',
 'Qual': array([33, 31, 31, 37, 37, 37, 37, 37, 39, 39, 38, 39, 39, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 40, 41, 41, 41,
       40, 41, 38, 38, 39, 40, 41, 41, 41, 41, 41, 40, 40, 41, 39, 38, 40,
       39, 27, 32, 36, 33, 35, 12, 22, 27, 30, 33, 30, 29, 25, 25, 34, 34,
       34, 35, 25, 31, 34, 18, 29, 34, 35, 31, 32, 27, 25, 31, 34, 31, 29,
       31, 34, 25, 29, 34, 34, 34, 33, 34, 34, 31, 35, 35, 35, 35, 19, 25,
       29,  7, 10, 19, 24, 24, 29,  7, 17, 30, 33, 29, 33, 27, 33, 24, 18,
       18, 25, 25, 25, 29, 29, 34, 34, 32, 25, 19, 29, 16, 15, 20,  8, 15,
       25, 19, 10, 25,  7,  7, 10, 25, 32, 32,  7,  7, 19, 29, 34], dtype=int8),
 'QualID': 'M00176:17:000000000-A0CNA:1:1:18006:1821 1:N:0:0',
 'Reverse primer': None,
 'Sample': 's173',
 'Sequence': 'AACATAGGTCACGAGCATTATCCGGATTTATTGGGCGTAAAGGAAGCGTAGGCGGGGAGGTTGATCCATTGTTAAAAGCATTTGCTTAACAAATGTGTGCATTGGAGATCGCCTCCCTAGAGTTAATCAGGGGGTACTGGAATTCAATGTG',
 'SequenceID': 'M00176:17:000000000-A0CNA:1:1:18006:1821 1:N:0:0'}

# resulting stats from processing split_libraries_fastq test data (lane1 and 2)
{'barcode_corrected': 262,
 'exceed_barcode_error': 252,
 'exceeds_max_primer_mismatch': 0,
 'index_ambig_count': 4,
 'max_ambig_count': 2,
 'min_per_read_length_fraction': 0,
 'min_seq_len': 0,
 'quality_max_bad_run_length': 0,
 'sample_counts': Counter({'s41': 8, 's173': 6, 's72': 6, 's120': 6, 's121': 6, 's32': 6, 's141': 6, 's53': 4, 's51': 4, 's101': 4, 's183': 4, 's189': 4, 's17': 4, 's99': 4, 's124': 4, 's133': 4, 's132': 4, 's167': 4, 's179': 2, 's178': 2, 's172': 2, 's174': 2, 's177': 2, 's176': 2, 's55': 2, 's59': 2, 's58': 2, 's135': 2, 's109': 2, 's105': 2, 's107': 2, 's102': 2, 's184': 2, 's185': 2, 's186': 2, 's83': 2, 's181': 2, 's76': 2, 's188': 2, 's87': 2, 's44': 2, 's45': 2, 's117': 2, 's115': 2, 's111': 2, 's110': 2, 's47': 2, 's191': 2, 's70': 2, 's75': 2, 's74': 2, 's77': 2, 's90': 2, 's97': 2, 's98': 2, 's128': 2, 's129': 2, 's122': 2, 's126': 2, 's127': 2, 's68': 2, 's66': 2, 's64': 2, 's62': 2, 's63': 2, 's80': 2, 's81': 2, 's136': 2, 's130': 2, 's88': 2, 's139': 2, 's38': 2, 's35': 2, 's30': 2, 's33': 2, 's140': 2, 's142': 2, 's145': 2, 's146': 2, 's148': 2, 's149': 2, 's151': 2, 's150': 2, 's157': 2, 's154': 2, 's159': 2, 's2': 2, 's7': 2, 's169': 2, 's166': 2, 's165': 2, 's161': 2, 's22': 2, 's26': 2, 's27': 2}),
 'sequence_count': 500,
 'sequence_lengths': Counter({151: 242}),
 'unknown_barcode': 234,
 'unknown_primer_barcode_pair': 0}

wasade added 30 commits December 9, 2013 10:36
* improved docstrings
* removed required methods for subclasses to implement
* an iterator is now passed in instead of constructed
* workflow is determined up front, so functions dependent on options are
    now actually completely avoided
* function priority can be assigned
* simplified get_workflow
* updated license and unittest import
* added all_wf_methods test
* no_requirements
* wf_ functions now need to be tagged
@requires(state=has_barcode)
def _quality_index_ambiguity(self):
    barcode_characters = set(self.state['Barcode'])
    valid_characters = set(['A', 'T', 'G', 'C', 'a', 't', 'g', 'c'])
should source from skbio

@wasade (Member Author) commented Aug 13, 2014

@josenavas brought up an excellent point which is that this lacks usage tests. I'll get this pushed in relatively soon. I haven't used it yet, but click provides a test runner which should be awesome.

On an aside, the coverage on qiime/process_seqs.py was at 98 or 99%, though I know we don't currently track coverage in qiime

@wasade (Member Author) commented Aug 13, 2014

Okay, this just went to a whole new level. Click dominates all. If you look at anything in this PR, look at how the usage examples are run.

@ghost commented Aug 14, 2014

Build results will soon be (or already are) available at: http://ci.qiime.org/job/qiime-github-pr/1163/

-b $PWD/lane1_barcode.fastq.gz -b $PWD/lane2_barcode.fastq.gz \
-m $PWD/map.txt -o slout_q20 --rev-comp-mapping-barcodes -q 20
"""
print kwargs
woops...

@ghost commented Aug 14, 2014

Build results will soon be (or already are) available at: http://ci.qiime.org/job/qiime-github-pr/1165/

@gregcaporaso (Contributor)

@wasade, given the size of this, would it make sense for us to sit down together at ISME next week and discuss?

@walterst (Contributor)

Want me to skype in for this talk?


@wasade (Member Author) commented Aug 18, 2014

@gregcaporaso agree; @walterst, that would be fantastic if you could join. This is not a small PR, and it definitely needs to be guided to a mergeable state.

@gregcaporaso (Contributor)

Yes, that would be great. I believe ~9am in Korea will be around 5pm in
Boulder - we'll just need to figure those details out. Tuesday or Thursday
would probably work well for me for that.


@wasade (Member Author) commented Aug 18, 2014

Same


@walterst (Contributor)

I can do either, but I'd prefer Monday. I think it's actually 6:00 PM MDT
for 9:00 AM in Korea, but I might be off.


@gregcaporaso (Contributor)

Monday is going to be hard for me, as it's the first day of the conference,
I'm getting there Sunday night, and I'm on a panel discussion in the
afternoon. Given all of that, I'd rather not plan on a 9am call on Monday,
just in case I need to be sleeping at that time.


@walterst (Contributor)

Ah, to clarify, that would be Monday for me (6 PM), Tuesday for you guys
(9:00 AM).


@gregcaporaso (Contributor)

Oh, right, sorry. That would work for me. @wasade?


@wasade (Member Author) commented Aug 18, 2014

Not aware of any issues, let's do it

@gregcaporaso (Contributor)

Perfect, thanks!


@gregcaporaso (Contributor)

@wasade, based on recent discussions I think we're not merging this, but planning to use it as the basis for demultiplexing in QIIME 2, and will start with this as a model of new QIIME 2 code - is that right? If so, should we close this PR while you port to the metoo repo?

@wasade (Member Author) commented Oct 29, 2014

That's my impression as well. It would of course be nice to preserve the commit history, but I don't think we can easily do that, in which case we should just forget it and move forward


@gregcaporaso (Contributor)

Ok, thanks. I agree that preserving the commit history would be ideal, but not sure how we can do that.
