Single-end Support #50
Conversation
Let me know if you have any questions about the PR! It's quite large!
```diff
@@ -5,5 +5,5 @@ show_missing = True
 branch = True
 omit =
     */yeat/_version.py
-    */yeat/Snakefile
+    */yeat/workflows/snakefiles/*
```
Removed the coverage warning that alerts the user when Snakemake files are unparseable: `CoverageWarning: Couldn't parse Python file`.
```
filterwarnings =
    ignore::DeprecationWarning:ratelimiter.*
```
```diff
     def batch(self):
-        self.paired_assemblers = []
-        self.pacbio_assemblers = []
-        self.oxford_assemblers = []
+        self.paired_sample_labels = set()
+        self.paired_assemblers = set()
+        self.single_sample_labels = set()
+        self.single_assemblers = set()
+        self.pacbio_sample_labels = set()
+        self.pacbio_assemblers = set()
+        self.oxford_sample_labels = set()
+        self.oxford_assemblers = set()
         for assembler in self.assemblers:
             self.determine_assembler_workflow(assembler)
         self.batch = {
             "paired": {
-                "samples": self.get_batch_samples(self.paired_assemblers),
+                "samples": self.get_samples(self.paired_sample_labels),
                 "assemblers": self.paired_assemblers,
             },
+            "single": {
+                "samples": self.get_samples(self.single_sample_labels),
+                "assemblers": self.single_assemblers,
+            },
             "pacbio": {
-                "samples": self.get_batch_samples(self.pacbio_assemblers),
+                "samples": self.get_samples(self.pacbio_sample_labels),
                 "assemblers": self.pacbio_assemblers,
             },
             "oxford": {
-                "samples": self.get_batch_samples(self.oxford_assemblers),
+                "samples": self.get_samples(self.oxford_sample_labels),
                 "assemblers": self.oxford_assemblers,
             },
         }
```
Updated the batch function. In this function, samples and assemblers are grouped by "read type". Originally, lists of assemblers were kept and then saved in the `self.batch` dictionary. Now, we save sets of sample labels and assembler objects.
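To make the new shape concrete, here is a stripped-down sketch of assembling the batch dictionary from per-readtype sets; the function and parameter names (`build_batch`, `groups`) are invented for illustration and are not the PR's exact code.

```python
def build_batch(groups, samples):
    """Assemble a batch dict from per-readtype sample-label and assembler sets.

    `groups` maps a read-type key to a (sample_labels, assemblers) pair;
    `samples` maps a sample label to its sample object.
    """
    return {
        readtype: {
            # Resolve each label in this group back to its sample object.
            "samples": {label: samples[label] for label in labels},
            "assemblers": assemblers,
        }
        for readtype, (labels, assemblers) in groups.items()
    }
```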
```diff
     def determine_assembler_workflow(self, assembler):
-        readtypes = set()
         for sample in assembler.samples:
-            readtypes.update({self.samples[sample].readtype})
-        if assembler.algorithm in PAIRED and readtypes.intersection(ILLUMINA_READS):
-            self.paired_assemblers.append(assembler)
-        elif assembler.algorithm in PACBIO and readtypes.intersection(PACBIO_READS):
-            self.pacbio_assemblers.append(assembler)
-        elif assembler.algorithm in OXFORD and readtypes.intersection(OXFORD_READS):
-            self.oxford_assemblers.append(assembler)
-
-    def get_batch_samples(self, assemblers):
-        samples = {}
-        for assembler in assemblers:
-            samples = samples | dict(
-                (key, self.samples[key]) for key in assembler.samples if key in self.samples
-            )
-        return samples
+            readtype = self.samples[sample].readtype
+            if readtype == "paired":
+                self.paired_sample_labels.add(sample)
+                self.paired_assemblers.add(assembler)
+            elif readtype == "single":
+                self.single_sample_labels.add(sample)
+                self.single_assemblers.add(assembler)
+            elif readtype in PACBIO_READS:
+                self.pacbio_sample_labels.add(sample)
+                self.pacbio_assemblers.add(assembler)
+            elif readtype in OXFORD_READS:
+                self.oxford_sample_labels.add(sample)
+                self.oxford_assemblers.add(assembler)
```
Instead of determining which workflow an assembler object belongs to with a series of complex conditionals, we now dispatch on each sample's "read type".
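The dispatch idea can be sketched in isolation; this is a hypothetical stand-alone helper, and the concrete read-type strings in `PACBIO_READS`/`OXFORD_READS` are assumptions, not taken from the PR.

```python
# Assumed read-type vocabularies; the real constants live in the yeat codebase.
PACBIO_READS = {"pacbio-raw", "pacbio-corr", "pacbio-hifi"}
OXFORD_READS = {"nano-raw", "nano-corr", "nano-hq"}

def group_by_readtype(samples):
    """Sort sample labels into workflow buckets based on each sample's read type."""
    buckets = {"paired": set(), "single": set(), "pacbio": set(), "oxford": set()}
    for label, readtype in samples.items():
        if readtype == "paired":
            buckets["paired"].add(label)
        elif readtype == "single":
            buckets["single"].add(label)
        elif readtype in PACBIO_READS:
            buckets["pacbio"].add(label)
        elif readtype in OXFORD_READS:
            buckets["oxford"].add(label)
    return buckets
```

The key simplification is that the branch condition depends only on the sample, not on the assembler's algorithm.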
```diff
+        label_to_samples = {}
+        for assembler in assemblers:
+            label_to_samples[assembler.label] = [
+                sample for sample in assembler.samples if sample in samples
+            ]
         return dict(
             samples={label: sample.to_string() for label, sample in samples.items()},
             labels=[assembler.label for assembler in assemblers],
             assemblers={assembler.label: assembler.algorithm for assembler in assemblers},
             extra_args={assembler.label: assembler.extra_args for assembler in assemblers},
-            label_to_samples={assembler.label: assembler.samples for assembler in assemblers},
+            label_to_samples=label_to_samples,
```
Originally, `label_to_samples` was a dictionary mapping `assembler.label` to `assembler.samples`. This parameter had to change because not all of an assembler's samples are used in a given workflow. For example, suppose a basic `canu` algorithm (no extra parameters) is called on two samples: 1) `sample1`, which has PacBio HiFi reads, and 2) `sample2`, which has Oxford Nanopore reads. When running `to_dict(args, readtype='pacbio')`, only `sample1` will be available; when running `to_dict(args, readtype='Oxford')`, only `sample2` will be available. Because of this, the Snakemake rules need additional filtering to get the exact list of samples per label.
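The filtering step described above amounts to intersecting each assembler's sample list with the samples present in the current batch; a minimal sketch follows, using plain dicts and an invented helper name rather than the PR's actual classes.

```python
def filter_label_to_samples(assemblers, batch_samples):
    """Keep, per assembler label, only the samples present in this workflow's batch.

    `assemblers` is a list of dicts with "label" and "samples" keys (a stand-in
    for the real assembler objects); `batch_samples` is the set of sample labels
    available to the current read-type workflow.
    """
    return {
        assembler["label"]: [s for s in assembler["samples"] if s in batch_samples]
        for assembler in assemblers
    }
```

In the canu example above, the pacbio workflow's batch would contain only `sample1`, so the canu label maps to `[sample1]` there and to `[sample2]` in the Oxford workflow.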
```diff
-def get_expected_files(short_qc=False, pacbio_qc=False, nanopore=False):
+def get_expected_files(paired_qc=False, combined_qc=False, nanopore=False):
     inputlist = []
     for label in config["labels"]:
         assembler = config["assemblers"][label]
         for sample in config["label_to_samples"][label]:
             inputlist.append(f"analysis/{sample}/{label}/{assembler}/quast/{sample}_report.html")
-    if short_qc:
+    if paired_qc:
         inputlist += expand("seq/fastqc/{sample}/{sample}_{reads}_fastqc.html", sample=[*config["samples"]], reads=["R1", "R2"])
-    if pacbio_qc:
+    if combined_qc:
         inputlist += expand("seq/fastqc/{sample}/combined-reads_fastqc.html", sample=[*config["samples"]])
```
Updated the parameters of the function that creates the list of expected files for `rule all`. Instead of `short_qc` and `pacbio_qc`, they are now `paired_qc` and `combined_qc`. `paired_qc` is set to True when dealing with Illumina paired-end reads; `combined_qc` is set to True when dealing with non-paired-end reads, e.g., single-end, PacBio, and Nanopore reads.
```diff
-- Support for Nanopore-reads assembly with Flye and Canu (#47)
+- Oxford Nanopore raw, corrected, and HAC reads (#47)
+- Single-end Illumina reads (#50)
+- Support for Nanopore-reads assembly with Flye and Canu (#47, #50)
```
In this PR, I fixed a few Nanopore bugs that were introduced in #47.

Bugs:
- The `Nanoplot` rule in the `Oxford` snakefile would crash due to an incorrect output path.
- The tests in `test_oxford.py::test_oxford_nanopore_read_assemblers` didn't actually exercise any Oxford assemblers.
```python
        sample_readtype={label: sample.readtype for label, sample in samples.items()},
        threads=args.threads,
        downsample=args.downsample,
        coverage=args.coverage,
        genomesize=args.genome_size,
        seed=args.seed,
        length_required=args.length_required,
```
`length_required` is introduced here for the snakefiles to use during the `fastp` rules.
yeat/cli/illumina.py
```python
def fastp_options(parser):
    illumina = parser.add_argument_group("fastp arguments")
    illumina.add_argument(
        "-l",
        "--length-required",
        type=int,
        metavar="L",
        default=50,
        help="discard reads shorter than the required L length during fastp; by default L=50",
    )
```
Users can now adjust `-l`/`--length-required` for the `fastp` step through the command-line flags.
I'd suggest replacing "during fastp" with "after pre-processing" or something like that.
```diff
     shell:
         """
         fastp -i {input.read1} -I {input.read2} -o {output.out1} -O {output.out2} \
+        -l {params.length_required} --detect_adapter_for_pe \
         --html seq/fastp/{wildcards.sample}/fastp.html --json seq/fastp/{wildcards.sample}/fastp.json \
         2> seq/fastp/{wildcards.sample}/report.txt
```
Added new parameters to `fastp` for #48.
```diff
@@ -105,6 +105,6 @@ rule hifiasm_meta:
     shell:
         """
         hifiasm_meta -o {params.prefix} -t {threads} {params.extra_args} {input[0]}
-        awk '/^S/{{print ">"$2;print $3}}' {params.prefix}.p_ctg.gfa > {params.prefix}.p_ctg.fa
+        gfatools gfa2fa {params.prefix}.p_ctg.gfa > {params.prefix}.p_ctg.fa
```
Found a cool tool called `gfatools` from Heng Li. It can do more with the rGFA format and supports several conversion formats.
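For reference, both the replaced awk one-liner and `gfatools gfa2fa` extract FASTA records from GFA segment (`S`) lines. A rough Python equivalent of that conversion (a sketch for illustration, not part of the PR) looks like:

```python
def gfa_to_fasta(gfa_lines):
    """Convert GFA segment (S) lines to FASTA records.

    Each S line has the tab-separated form: S <name> <sequence> [tags...]
    All other record types (H, L, P, ...) are ignored.
    """
    records = []
    for line in gfa_lines:
        if line.startswith("S\t"):
            fields = line.rstrip("\n").split("\t")
            name, sequence = fields[1], fields[2]
            records.append(f">{name}\n{sequence}")
    return "\n".join(records)
```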
This PR is ready for review! Running the tests now. They should all be passing!
I've tried to reduce the test suite run time as much as I could by consolidating them. It still takes a long time but it should be slightly quicker!
It's a pretty big PR. Lots of changes. I'm going to cut a release once this is merged!
EDIT: Tests are passing!
This is indeed a big update. No big concerns. I'm still waiting for some tests to run, but in the meantime I've got a bit of feedback. See my comments below, but I'll also note that `make testnano` is failing for me at the following step.
```
$ gunzip -c seq/input/Ecoli_K12_MG1655_R10.3_HAC/combined-reads.fq.gz | Nanofilt -q 10 | gzip > seq/nanofilt/Ecoli_K12_MG1655_R10.3_HAC/highQuality-reads.fq.gz
Traceback (most recent call last):
  File "/Users/daniel.standage/anaconda3/envs/yeat/bin/Nanofilt", line 10, in <module>
    sys.exit(main())
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/nanofilt/NanoFilt.py", line 44, in main
    filter_stream(args.input, args)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/nanofilt/NanoFilt.py", line 72, in filter_stream
    for rec in SeqIO.parse(fq, "fastq"):
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/QualityIO.py", line 1123, in iterate
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/QualityIO.py", line 926, in FastqGeneralIterator
    line = next(handle)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
```
Any ideas?
yeat/cli/config.py
```python
def __members(self):
    return (self.label,)

def __eq__(self, other):
    if type(other) is type(self):
        return self.__members() == other.__members()
    else:
        return False

def __hash__(self):
    return hash(self.__members())
```
Roger that. It looks like you based this code on an example incorporating multiple values into the hash function. Since you're only hashing a single value, you could probably simplify with something like this.
```python
def __eq__(self, other):
    return hash(self) == hash(other)

def __hash__(self):
    return hash(self.label)
```
The type check is subsumed in the astronomically small possibility that an object of a different data type will hash to the same numerical value.
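As a quick illustration of the suggestion, here is a toy stand-in class (hypothetical, not from the PR) showing that label-equal objects deduplicate in sets once `__eq__` and `__hash__` agree:

```python
class Assembler:
    """Toy stand-in for the PR's assembler class; only `label` matters here."""

    def __init__(self, label):
        self.label = label

    def __eq__(self, other):
        # Equality is delegated entirely to the hash, per the review suggestion.
        return hash(self) == hash(other)

    def __hash__(self):
        return hash(self.label)
```

With these methods, `{Assembler("spades"), Assembler("spades")}` collapses to a single element, which is what the batch code's sets rely on.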
Okay, I added all the suggested changes to the branch. I think if you pull from this branch the
The purpose of this PR is to resolve #48, integrate single-end read support, fix some bugs introduced in #47, and clean up the test suite.

In issue thread #48, a request was made to strengthen the filtering process with `fastp`. By default, `fastp` discards reads that are 15 bp or less; we increased that cutoff from 15 to 50. In addition to the new cutoff, when paired-end reads are passed into the `fastp` rule, we call `--detect_adapter_for_pe` to detect and remove adapters.

Because #48 mentioned not to call `--detect_adapter_for_pe` for single-end reads, I went ahead and integrated single-end read support with `spades`, `megahit`, and `Unicycler`.

In #47, Oxford read support was introduced. In the test suite, I noticed that some tests did not exercise the new feature. While fixing that, I got slightly frustrated with the test suite and decided to clean it up, removing unnecessary, redundant configuration test files. In this PR, correct Oxford tests were introduced, and test data were consolidated and cleaned up.