Single-end Support #50

Merged: 21 commits into main on Aug 23, 2023
Conversation

@danejo3 (Collaborator) commented Aug 11, 2023

The purpose of this PR is to resolve #48, integrate single-end read support, fix some bugs introduced in #47, and clean up the test suite.

In issue thread #48, a request was made to tighten the read filtering performed by fastp. By default, fastp discards reads shorter than 15 bp; we raised this cutoff (fastp's -l/--length_required) from 15 to 50. In addition, when paired-end reads are passed into the fastp rule, we now call --detect_adapter_for_pe to detect and remove adapters.

Because #48 mentioned that --detect_adapter_for_pe should not be called for single-end reads, I went ahead and integrated single-end read support for spades, megahit, and Unicycler.

In #47, Oxford Read Support was introduced. In the test suite, I noticed that some tests did not actually exercise the new feature. While fixing them, I got slightly frustrated with the test suite and decided to clean it up, removing unnecessary, redundant configuration test files. In this PR, correct Oxford tests were introduced, and test data were consolidated and cleaned up.
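The effect of the new length cutoff can be sketched in plain Python. This is a simplified stand-in for what fastp does internally (not its actual implementation): reads shorter than the cutoff are discarded.

```python
def length_filter(reads, length_required=50):
    """Keep only reads at least `length_required` bp long, mirroring fastp -l."""
    return [read for read in reads if len(read) >= length_required]

reads = ["A" * 10, "C" * 30, "G" * 50, "T" * 120]
# Old default cutoff (15) keeps the 30 bp read; new cutoff (50) drops it.
assert length_filter(reads, length_required=15) == ["C" * 30, "G" * 50, "T" * 120]
assert length_filter(reads, length_required=50) == ["G" * 50, "T" * 120]
```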

@danejo3 (Collaborator, Author) left a comment:

Let me know if you have any questions about the PR! It's quite large!

@@ -5,5 +5,5 @@ show_missing = True
branch = True
omit =
*/yeat/_version.py
*/yeat/Snakefile
*/yeat/workflows/snakefiles/*
danejo3 (Collaborator, Author):

Removed the coverage warning that alerts the user when Snakemake files are unparseable: CoverageWarning: Couldn't parse Python file.

Comment on lines +7 to +8
filterwarnings =
ignore::DeprecationWarning:ratelimiter.*
danejo3 (Collaborator, Author):

I figured out how to remove the deprecation warning: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead.

Now, when you run the test suite, you'll get a green pass bar instead of yellow!


Comment on lines 87 to 115
 def batch(self):
-    self.paired_assemblers = []
-    self.pacbio_assemblers = []
-    self.oxford_assemblers = []
+    self.paired_sample_labels = set()
+    self.paired_assemblers = set()
+    self.single_sample_labels = set()
+    self.single_assemblers = set()
+    self.pacbio_sample_labels = set()
+    self.pacbio_assemblers = set()
+    self.oxford_sample_labels = set()
+    self.oxford_assemblers = set()
     for assembler in self.assemblers:
         self.determine_assembler_workflow(assembler)
     self.batch = {
         "paired": {
-            "samples": self.get_batch_samples(self.paired_assemblers),
+            "samples": self.get_samples(self.paired_sample_labels),
             "assemblers": self.paired_assemblers,
         },
+        "single": {
+            "samples": self.get_samples(self.single_sample_labels),
+            "assemblers": self.single_assemblers,
+        },
         "pacbio": {
-            "samples": self.get_batch_samples(self.pacbio_assemblers),
+            "samples": self.get_samples(self.pacbio_sample_labels),
             "assemblers": self.pacbio_assemblers,
         },
         "oxford": {
-            "samples": self.get_batch_samples(self.oxford_assemblers),
+            "samples": self.get_samples(self.oxford_sample_labels),
             "assemblers": self.oxford_assemblers,
         },
     }
danejo3 (Collaborator, Author):

Updated the batch function. In this function, samples and assemblers are sorted by "read type". Originally, lists of assemblers were kept and then saved in the self.batch dictionary. Now, we save sets of sample labels and assembler objects.
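The batching-by-read-type idea can be sketched roughly as follows. This is a hypothetical, simplified illustration, not YEAT's actual classes: samples are a label-to-readtype mapping, and assemblers map a name to the sample labels they run on.

```python
from collections import defaultdict

def batch_by_readtype(samples, assemblers):
    """samples: {label: readtype}; assemblers: {name: [sample labels]}."""
    sample_sets = defaultdict(set)
    assembler_sets = defaultdict(set)
    for name, sample_labels in assemblers.items():
        for label in sample_labels:
            readtype = samples[label]
            sample_sets[readtype].add(label)       # sets deduplicate shared samples
            assembler_sets[readtype].add(name)     # ...and shared assemblers
    return {
        rt: {"samples": sorted(sample_sets[rt]), "assemblers": sorted(assembler_sets[rt])}
        for rt in sample_sets
    }

samples = {"s1": "paired", "s2": "single"}
assemblers = {"spades": ["s1", "s2"]}
print(batch_by_readtype(samples, assemblers))
```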

Comment on lines 117 to +131
 def determine_assembler_workflow(self, assembler):
-    readtypes = set()
     for sample in assembler.samples:
-        readtypes.update({self.samples[sample].readtype})
-    if assembler.algorithm in PAIRED and readtypes.intersection(ILLUMINA_READS):
-        self.paired_assemblers.append(assembler)
-    elif assembler.algorithm in PACBIO and readtypes.intersection(PACBIO_READS):
-        self.pacbio_assemblers.append(assembler)
-    elif assembler.algorithm in OXFORD and readtypes.intersection(OXFORD_READS):
-        self.oxford_assemblers.append(assembler)
-
-def get_batch_samples(self, assemblers):
-    samples = {}
-    for assembler in assemblers:
-        samples = samples | dict(
-            (key, self.samples[key]) for key in assembler.samples if key in self.samples
-        )
-    return samples
+        readtype = self.samples[sample].readtype
+        if readtype == "paired":
+            self.paired_sample_labels.add(sample)
+            self.paired_assemblers.add(assembler)
+        elif readtype == "single":
+            self.single_sample_labels.add(sample)
+            self.single_assemblers.add(assembler)
+        elif readtype in PACBIO_READS:
+            self.pacbio_sample_labels.add(sample)
+            self.pacbio_assemblers.add(assembler)
+        elif readtype in OXFORD_READS:
+            self.oxford_sample_labels.add(sample)
+            self.oxford_assemblers.add(assembler)
danejo3 (Collaborator, Author):

Instead of determining which workflow an assembler object belongs to with a pile of complex conditional statements, we now sort by "read type".

Comment on lines +143 to +153
+label_to_samples = {}
+for assembler in assemblers:
+    label_to_samples[assembler.label] = [
+        sample for sample in assembler.samples if sample in samples
+    ]
 return dict(
     samples={label: sample.to_string() for label, sample in samples.items()},
     labels=[assembler.label for assembler in assemblers],
     assemblers={assembler.label: assembler.algorithm for assembler in assemblers},
     extra_args={assembler.label: assembler.extra_args for assembler in assemblers},
-    label_to_samples={assembler.label: assembler.samples for assembler in assemblers},
+    label_to_samples=label_to_samples,
danejo3 (Collaborator, Author):

Originally, label_to_samples was a dictionary mapping assembler.label to assembler.samples.

This parameter had to change because not all of an assembler's samples are used in a given workflow.

For example, suppose a basic canu algorithm (no extra parameters) is called on two samples: 1) sample1, which has PacBio HiFi reads, and 2) sample2, which has Oxford Nanopore reads. When running to_dict(args, readtype='pacbio'), only sample1 will be available; when running to_dict(args, readtype='oxford'), only sample2 will be available. Because of this, the Snakemake rules need extra filtering to get the exact list of samples per label.
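The per-workflow filtering described above boils down to intersecting each assembler's configured sample list with the samples actually present in the current workflow's batch. A minimal sketch (names here are illustrative, not YEAT's actual helpers):

```python
def label_to_samples(assemblers, batch_samples):
    """assemblers: {label: [sample labels]}; batch_samples: labels in this workflow."""
    return {
        label: [s for s in sample_list if s in batch_samples]
        for label, sample_list in assemblers.items()
    }

# canu is configured on both samples, but only sample1 (pacbio-hifi)
# belongs to the "pacbio" workflow batch.
assemblers = {"canu": ["sample1", "sample2"]}
print(label_to_samples(assemblers, {"sample1"}))  # {'canu': ['sample1']}
```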

Comment on lines -10 to 19
-def get_expected_files(short_qc=False, pacbio_qc=False, nanopore=False):
+def get_expected_files(paired_qc=False, combined_qc=False, nanopore=False):
     inputlist = []
     for label in config["labels"]:
         assembler = config["assemblers"][label]
         for sample in config["label_to_samples"][label]:
             inputlist.append(f"analysis/{sample}/{label}/{assembler}/quast/{sample}_report.html")
-    if short_qc:
+    if paired_qc:
         inputlist += expand("seq/fastqc/{sample}/{sample}_{reads}_fastqc.html", sample=[*config["samples"]], reads=["R1", "R2"])
-    if pacbio_qc:
+    if combined_qc:
         inputlist += expand("seq/fastqc/{sample}/combined-reads_fastqc.html", sample=[*config["samples"]])
danejo3 (Collaborator, Author):

Updated the parameters of the function that creates the list of expected files for rule all.

short_qc and pacbio_qc have been renamed to paired_qc and combined_qc.

paired_qc is set to True when dealing with Illumina paired-end reads.

combined_qc is set to True when dealing with non-paired-end reads, e.g. single-end, PacBio, and Nanopore reads.
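Outside Snakemake, the effect of the two flags can be sketched like this (a simplified stand-in for the rule above; the real function uses Snakemake's expand()):

```python
def expected_qc_files(samples, paired_qc=False, combined_qc=False):
    files = []
    if paired_qc:  # Illumina paired-end: one FastQC report per mate (R1/R2)
        files += [f"seq/fastqc/{s}/{s}_{r}_fastqc.html" for s in samples for r in ("R1", "R2")]
    if combined_qc:  # single-end / PacBio / Nanopore: one combined report per sample
        files += [f"seq/fastqc/{s}/combined-reads_fastqc.html" for s in samples]
    return files

print(expected_qc_files(["sampleA"], paired_qc=True))
```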

-- Support for Nanopore-reads assembly with Flye and Canu (#47)
 - Oxford Nanopore raw, corrected, and HAC reads (#47)
+- Single-end Illumina reads (#50)
+- Support for Nanopore-reads assembly with Flye and Canu (#47, #50)
danejo3 (Collaborator, Author):

In this PR, I fixed a few Nanopore bugs that were introduced in #47.

Bugs:

  • The Nanoplot rule in the Oxford snakefile would crash due to an incorrect output path.
  • The tests in test_oxford.py::test_oxford_nanopore_read_assemblers didn't actually run any Oxford assemblies.

sample_readtype={label: sample.readtype for label, sample in samples.items()},
threads=args.threads,
downsample=args.downsample,
coverage=args.coverage,
genomesize=args.genome_size,
seed=args.seed,
length_required=args.length_required,
danejo3 (Collaborator, Author):

length_required is introduced here so the snakefiles can use it in the fastp rules.

Comment on lines 13 to 22
def fastp_options(parser):
illumina = parser.add_argument_group("fastp arguments")
illumina.add_argument(
"-l",
"--length-required",
type=int,
metavar="L",
default=50,
help="discard reads shorter than the required L length during fastp; by default L=50",
)
danejo3 (Collaborator, Author):

Users can now adjust -l / --length-required for the fastp step through a command-line flag.
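A minimal stand-alone check of the new flag, mirroring the fastp_options() argument group shown above (the parser here is built from scratch rather than through YEAT's CLI):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group("fastp arguments")
group.add_argument("-l", "--length-required", type=int, metavar="L", default=50)

# argparse converts --length-required to the attribute name length_required
assert parser.parse_args([]).length_required == 50            # default
assert parser.parse_args(["-l", "75"]).length_required == 75  # user override
```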

standage (Member):

I'd suggest replacing "during fastp" with "after pre-processing" or something like that.

 shell:
     """
     fastp -i {input.read1} -I {input.read2} -o {output.out1} -O {output.out2} \
+    -l {params.length_required} --detect_adapter_for_pe \
     --html seq/fastp/{wildcards.sample}/fastp.html --json seq/fastp/{wildcards.sample}/fastp.json \
     2> seq/fastp/{wildcards.sample}/report.txt
danejo3 (Collaborator, Author):

Added new params to fastp for #48.

@@ -105,6 +105,6 @@ rule hifiasm_meta:
 shell:
     """
     hifiasm_meta -o {params.prefix} -t {threads} {params.extra_args} {input[0]}
-    awk '/^S/{{print ">"$2;print $3}}' {params.prefix}.p_ctg.gfa > {params.prefix}.p_ctg.fa
+    gfatools gfa2fa {params.prefix}.p_ctg.gfa > {params.prefix}.p_ctg.fa
danejo3 (Collaborator, Author):

Found a cool tool from Heng Li called gfatools. It can do more with the rGFA format and supports several conversion formats.
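For context, the conversion the old awk one-liner performed (and which gfatools gfa2fa now handles) is simple: each GFA S-line carries a segment name and its sequence, which become a FASTA record. A rough Python equivalent, for illustration only:

```python
def gfa_to_fasta(gfa_lines):
    """Emit a FASTA record for every S-line in a GFA stream."""
    out = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":  # S <name> <sequence> [optional tags...]
            out.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(out)

gfa = ["H\tVN:Z:1.0\n", "S\tctg1\tACGT\tLN:i:4\n"]
print(gfa_to_fasta(gfa))
```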

@danejo3 (Collaborator, Author) left a comment:

This PR is ready for review! Running the tests now. They should all be passing!

I've tried to reduce the test suite run time as much as I could by consolidating tests. It still takes a long time, but it should be slightly quicker!

It's a pretty big PR. Lots of changes. I'm going to cut a release once this is merged!

EDIT: Tests are passing!

@standage (Member) left a review:

This is indeed a big update. No big concerns. I'm still waiting for some tests to run, but in the meantime I've got a bit of feedback. See my comments below, but I'll also note that make testnano is failing for me at the following step.

$ gunzip -c seq/input/Ecoli_K12_MG1655_R10.3_HAC/combined-reads.fq.gz | Nanofilt -q 10 | gzip > seq/nanofilt/Ecoli_K12_MG1655_R10.3_HAC/highQuality-reads.fq.gz
Traceback (most recent call last):
  File "/Users/daniel.standage/anaconda3/envs/yeat/bin/Nanofilt", line 10, in <module>
    sys.exit(main())
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/nanofilt/NanoFilt.py", line 44, in main
    filter_stream(args.input, args)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/nanofilt/NanoFilt.py", line 72, in filter_stream
    for rec in SeqIO.parse(fq, "fastq"):
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/QualityIO.py", line 1123, in iterate
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/QualityIO.py", line 926, in FastqGeneralIterator
    line = next(handle)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any ideas?
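One observation on the traceback: 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), which suggests NanoFilt received still-compressed bytes rather than decompressed FASTQ text. A quick, hypothetical sanity check:

```python
import gzip

# Simulated gzipped FASTQ data; real data would come from the pipeline.
data = gzip.compress(b"@read1\nACGT\n+\nIIII\n")

# The gzip magic number is the two-byte prefix 0x1f 0x8b.
assert data[:2] == b"\x1f\x8b"  # still gzipped: decompress before parsing
assert gzip.decompress(data).startswith(b"@read1")
```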

Comment on lines 202 to 213

def __members(self):
return (self.label,)

def __eq__(self, other):
if type(other) is type(self):
return self.__members() == other.__members()
else:
return False

def __hash__(self):
return hash(self.__members())
standage (Member):

Roger that. It looks like you based this code on an example incorporating multiple values into the hash function. Since you're only hashing a single value, you could probably simplify with something like this.

def __eq__(self, other):
    return hash(self) == hash(other)

def __hash__(self):
    return hash(self.label)

The type check is subsumed by the hash comparison: the chance that an object of a different data type hashes to the same numerical value is astronomically small.
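To see the simplification in action, here is a minimal stand-in class (not the actual YEAT Assembler) using label-based hashing; equal-labeled objects compare equal and deduplicate cleanly in sets:

```python
class Assembler:
    def __init__(self, label):
        self.label = label

    def __eq__(self, other):
        return hash(self) == hash(other)

    def __hash__(self):
        return hash(self.label)

a, b = Assembler("spades"), Assembler("spades")
assert a == b            # equal because the labels hash identically
assert len({a, b}) == 1  # sets treat them as one object
```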


@danejo3 (Collaborator, Author) commented Aug 23, 2023

Okay, I added all the suggested changes to the branch. I think if you pull from this branch, the NanoFilt problem should be fixed. I ran into this problem as well and somehow fixed it in this branch.

@standage standage merged commit cccbe5f into main Aug 23, 2023
@standage standage deleted the single-end branch August 23, 2023 17:51
Successfully merging this pull request may close these issues.

Update fastp parameters