Single-end Support #50

Merged: 21 commits, Aug 23, 2023
Changes from 17 commits
2 changes: 1 addition & 1 deletion .coveragerc
@@ -5,5 +5,5 @@ show_missing = True
branch = True
omit =
*/yeat/_version.py
*/yeat/Snakefile
*/yeat/workflows/snakefiles/*
Collaborator Author:

Removed the coverage warning that alerts the user when Snakemake files are unparseable: CoverageWarning: Couldn't parse Python file.

*/yeat/tests/*
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -10,8 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- Input fastq read support:
- Pacbio raw and corrected reads (#47)
- Oxford Nanopore raw, corrected, and HAC (#47)
- Support for Nanopore-reads assembly with Flye and Canu (#47)
- Oxford Nanopore raw, corrected, and HAC reads (#47)
- Single-end Illumina reads (#50)
- Support for Nanopore-reads assembly with Flye and Canu (#47, #50)
Collaborator Author:

In this PR, I fixed a few Nanopore bugs that were introduced in #47.

Bugs:

  • The NanoPlot rule in the Oxford snakefile would crash due to an incorrect output path.
  • The tests in test_oxford.py::test_oxford_nanopore_read_assemblers didn't actually exercise any Oxford Nanopore assemblies.

- New assemblers:
- Hifiasm and Hifiasm-meta (#49)

9 changes: 5 additions & 4 deletions README.md
@@ -51,6 +51,7 @@ $ yeat --outdir {path} {config}
| Readtype | Algorithm |
| ------------- | ------------- |
| paired | spades, megahit, unicycler |
| single | spades, megahit, unicycler |
| pacbio-raw | flye, canu |
| pacbio-corr | flye, canu |
| pacbio-hifi | flye, canu, hifiasm, hifiasm-meta |
@@ -72,8 +73,8 @@ $ yeat --outdir {path} {config}
},
"sample2": {
"paired": [
"yeat/tests/data/Animal_289_R1.fastq.gz",
"yeat/tests/data/Animal_289_R2.fastq.gz"
"yeat/tests/data/Animal_289_R1.fq.gz",
"yeat/tests/data/Animal_289_R2.fq.gz"
]
},
"sample3": {
@@ -89,7 +90,7 @@ $ yeat --outdir {path} {config}
},
"assemblers": [
{
"label": "default-spades",
"label": "spades-default",
"algorithm": "spades",
"extra_args": "",
"samples": [
@@ -106,7 +107,7 @@ $ yeat --outdir {path} {config}
]
},
{
"label": "nanoflye",
"label": "flye_ONT",
"algorithm": "flye",
"extra_args": "",
"samples": [
1 change: 1 addition & 0 deletions environment.yml
@@ -26,3 +26,4 @@ dependencies:
- nanoplot>=1.20
- hifiasm>=0.19
- hifiasm_meta>=hamtv0.3
- gfatools>=0.5
2 changes: 2 additions & 0 deletions pytest.ini
@@ -4,3 +4,5 @@ markers =
long: long running tests
hifi: pacbio hifi-reads tests
nano: oxford nanopore-reads tests
filterwarnings =
ignore::DeprecationWarning:ratelimiter.*
Comment on lines +7 to +8
Collaborator Author:

I figured out how to remove the deprecation warning: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead.

Now, when you run the test suite, you'll get a green pass bar instead of yellow!
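
For reference, a programmatic equivalent of the new ini entry (illustration only; the project itself relies on the pytest.ini setting above) is to register the same filter with the warnings module, ignoring DeprecationWarning raised from the ratelimiter package:

import warnings

# Ignore DeprecationWarning emitted by modules whose name matches "ratelimiter.*"
warnings.filterwarnings("ignore", category=DeprecationWarning, module=r"ratelimiter.*")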


13 changes: 7 additions & 6 deletions yeat/cli/cli.py
@@ -7,7 +7,7 @@
# Development Center.
# -------------------------------------------------------------------------------------------------

from . import downsample
from . import illumina
from .config import AssemblerConfig
from argparse import Action, ArgumentParser
import json
@@ -26,16 +26,16 @@
},
"sample2": {
"paired": [
"yeat/tests/data/Animal_289_R1.fastq.gz",
"yeat/tests/data/Animal_289_R2.fastq.gz",
"yeat/tests/data/Animal_289_R1.fq.gz",
"yeat/tests/data/Animal_289_R2.fq.gz",
]
},
"sample3": {"pacbio-hifi": ["yeat/tests/data/ecoli.fastq.gz"]},
"sample4": {"nano-hq": ["yeat/tests/data/ecolk12mg1655_R10_3_guppy_345_HAC.fastq.gz"]},
},
"assemblers": [
{
"label": "default-spades",
"label": "spades-default",
"algorithm": "spades",
"extra_args": "",
"samples": ["sample1", "sample2"],
@@ -46,7 +46,7 @@
"extra_args": "genomeSize=4.8m",
"samples": ["sample3"],
},
{"label": "nanoflye", "algorithm": "flye", "extra_args": "", "samples": ["sample4"]},
{"label": "flye_ONT", "algorithm": "flye", "extra_args": "", "samples": ["sample4"]},
],
}

@@ -93,7 +93,8 @@ def options(parser):
def get_parser(exit_on_error=True):
parser = ArgumentParser(exit_on_error=exit_on_error)
options(parser)
downsample.options(parser)
illumina.fastp_options(parser)
illumina.downsample_options(parser)
parser.add_argument("config", type=str, help="config file")
return parser

78 changes: 53 additions & 25 deletions yeat/cli/config.py
@@ -13,11 +13,12 @@


PAIRED = ("spades", "megahit", "unicycler")
SINGLE = ("spades", "megahit", "unicycler")
PACBIO = ("canu", "flye", "hifiasm", "hifiasm-meta")
OXFORD = ("canu", "flye")
ALGORITHMS = set(PAIRED + PACBIO + OXFORD)
ALGORITHMS = set(PAIRED + SINGLE + PACBIO + OXFORD)

ILLUMINA_READS = ("paired",)
ILLUMINA_READS = ("paired", "single")
PACBIO_READS = ("pacbio-raw", "pacbio-corr", "pacbio-hifi")
OXFORD_READS = ("nano-raw", "nano-corr", "nano-hq")
LONG_READS = PACBIO_READS + OXFORD_READS
@@ -84,44 +85,53 @@ def create_sample_and_assembler_objects(self):
]

def batch(self):
self.paired_assemblers = []
self.pacbio_assemblers = []
self.oxford_assemblers = []
self.paired_sample_labels = set()
self.paired_assemblers = set()
self.single_sample_labels = set()
self.single_assemblers = set()
self.pacbio_sample_labels = set()
self.pacbio_assemblers = set()
self.oxford_sample_labels = set()
self.oxford_assemblers = set()
for assembler in self.assemblers:
self.determine_assembler_workflow(assembler)
self.batch = {
"paired": {
"samples": self.get_batch_samples(self.paired_assemblers),
"samples": self.get_samples(self.paired_sample_labels),
"assemblers": self.paired_assemblers,
},
"single": {
"samples": self.get_samples(self.single_sample_labels),
"assemblers": self.single_assemblers,
},
"pacbio": {
"samples": self.get_batch_samples(self.pacbio_assemblers),
"samples": self.get_samples(self.pacbio_sample_labels),
"assemblers": self.pacbio_assemblers,
},
"oxford": {
"samples": self.get_batch_samples(self.oxford_assemblers),
"samples": self.get_samples(self.oxford_sample_labels),
"assemblers": self.oxford_assemblers,
},
}
Comment on lines 87 to 115
Collaborator Author:

Updated the batch function. In this function, samples and assemblers are grouped by read type. Originally, a list of assemblers was kept and then saved in the self.batch dictionary. Now, we save sets of sample labels and assembler objects.
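
A minimal, standalone sketch of the grouping idea, using plain dictionaries and strings rather than the actual Sample and Assembler classes (the sample names and labels are taken from the example config; everything else is illustrative):

samples = {
    "sample1": {"readtype": "paired"},
    "sample4": {"readtype": "nano-hq"},
}
assemblers = [
    {"label": "spades-default", "algorithm": "spades", "samples": ["sample1"]},
    {"label": "flye_ONT", "algorithm": "flye", "samples": ["sample4"]},
]

PACBIO_READS = {"pacbio-raw", "pacbio-corr", "pacbio-hifi"}
OXFORD_READS = {"nano-raw", "nano-corr", "nano-hq"}
batch = {key: {"samples": set(), "assemblers": set()} for key in ("paired", "single", "pacbio", "oxford")}
for assembler in assemblers:
    for name in assembler["samples"]:
        readtype = samples[name]["readtype"]
        if readtype == "paired":
            group = "paired"
        elif readtype == "single":
            group = "single"
        elif readtype in PACBIO_READS:
            group = "pacbio"
        elif readtype in OXFORD_READS:
            group = "oxford"
        batch[group]["samples"].add(name)
        batch[group]["assemblers"].add(assembler["label"])

print(batch["paired"])  # {'samples': {'sample1'}, 'assemblers': {'spades-default'}}
print(batch["oxford"])  # {'samples': {'sample4'}, 'assemblers': {'flye_ONT'}}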


def determine_assembler_workflow(self, assembler):
readtypes = set()
for sample in assembler.samples:
readtypes.update({self.samples[sample].readtype})
if assembler.algorithm in PAIRED and readtypes.intersection(ILLUMINA_READS):
self.paired_assemblers.append(assembler)
elif assembler.algorithm in PACBIO and readtypes.intersection(PACBIO_READS):
self.pacbio_assemblers.append(assembler)
elif assembler.algorithm in OXFORD and readtypes.intersection(OXFORD_READS):
self.oxford_assemblers.append(assembler)

def get_batch_samples(self, assemblers):
samples = {}
for assembler in assemblers:
samples = samples | dict(
(key, self.samples[key]) for key in assembler.samples if key in self.samples
)
return samples
readtype = self.samples[sample].readtype
if readtype == "paired":
self.paired_sample_labels.add(sample)
self.paired_assemblers.add(assembler)
elif readtype == "single":
self.single_sample_labels.add(sample)
self.single_assemblers.add(assembler)
elif readtype in PACBIO_READS:
self.pacbio_sample_labels.add(sample)
self.pacbio_assemblers.add(assembler)
elif readtype in OXFORD_READS:
self.oxford_sample_labels.add(sample)
self.oxford_assemblers.add(assembler)
Comment on lines 117 to +131
Collaborator Author:

Instead of determining which workflow an assembler object belongs to with a bunch of complex conditional statements, we now go by "read type".


def get_samples(self, labels):
return {label: self.samples[label] for label in labels}

def to_dict(self, args, readtype="all"):
if readtype == "all":
@@ -130,18 +140,24 @@ def to_dict(self, args, readtype="all"):
else:
samples = self.batch[readtype]["samples"]
assemblers = self.batch[readtype]["assemblers"]
label_to_samples = {}
for assembler in assemblers:
label_to_samples[assembler.label] = [
sample for sample in assembler.samples if sample in samples
]
return dict(
samples={label: sample.to_string() for label, sample in samples.items()},
labels=[assembler.label for assembler in assemblers],
assemblers={assembler.label: assembler.algorithm for assembler in assemblers},
extra_args={assembler.label: assembler.extra_args for assembler in assemblers},
label_to_samples={assembler.label: assembler.samples for assembler in assemblers},
label_to_samples=label_to_samples,
Comment on lines +143 to +153
Collaborator Author:

Originally, label_to_samples would be a dictionary mapping assembler.label to assembler.samples.

This parameter had to be changed because not all of an assembler's samples are used in a given workflow.

For example, a basic canu assembler (no extra parameters) is called on two samples: 1) sample1, which has PacBio HiFi reads, and 2) sample2, which has Oxford Nanopore reads. When running to_dict(args, readtype='pacbio'), only sample1 is available; when running to_dict(args, readtype='oxford'), only sample2 is available. Because of this, the Snakemake rules need additional filtering to get the exact list of samples per label.
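
A small sketch of that filtering, with an invented "canu-default" label and plain strings standing in for the real objects:

canu_samples = ["sample3", "sample4"]  # samples listed on the canu assembler entry
pacbio_samples = {"sample3"}           # samples in the "pacbio" batch
oxford_samples = {"sample4"}           # samples in the "oxford" batch

label_to_samples_pacbio = {"canu-default": [s for s in canu_samples if s in pacbio_samples]}
label_to_samples_oxford = {"canu-default": [s for s in canu_samples if s in oxford_samples]}

print(label_to_samples_pacbio)  # {'canu-default': ['sample3']}
print(label_to_samples_oxford)  # {'canu-default': ['sample4']}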

sample_readtype={label: sample.readtype for label, sample in samples.items()},
threads=args.threads,
downsample=args.downsample,
coverage=args.coverage,
genomesize=args.genome_size,
seed=args.seed,
length_required=args.length_required,
Collaborator Author:

length_required is introduced here so that the snakefiles can use it in the fastp rules.
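
As a rough sketch of where the value ends up (the command below is illustrative, not the actual Snakefile rule; the input and output file names are invented):

length_required = 50  # populated from args.length_required via the config dict

# fastp's long option for the minimum read length is --length_required
fastp_cmd = (
    "fastp"
    " --in1 sample_R1.fq.gz --in2 sample_R2.fq.gz"
    " --out1 trimmed_R1.fq.gz --out2 trimmed_R2.fq.gz"
    f" --length_required {length_required}"
)
print(fastp_cmd)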

)


@@ -184,3 +200,15 @@ def check_canu_required_params(self):
raise ValueError(
"Canu requires at least 4 avaliable cores; increase `--threads` to 4 or more"
)

def __members(self):
return (self.label,)

def __eq__(self, other):
if type(other) is type(self):
return self.__members() == other.__members()
else:
return False

def __hash__(self):
return hash(self.__members())
Collaborator Author:

In batch(), we have sets of labels and assembler objects. To store objects in a set, the __hash__ and __eq__ methods need to be implemented. Because self.label will always be unique, I have the hash function hash that unique string.

Member:

Roger that. It looks like you based this code on an example incorporating multiple values into the hash function. Since you're only hashing a single value, you could probably simplify with something like this.

def __eq__(self, other):
    return hash(self) == hash(other)

def __hash__(self):
    return hash(self.label)

The type check is subsumed in the astronomically small possibility that an object of a different data type will hash to the same numerical value.
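
A toy demonstration of the behavior both versions give (class name invented for the example): two objects with the same label collapse to a single set member.

class Assembler:
    def __init__(self, label):
        self.label = label

    def __eq__(self, other):
        return hash(self) == hash(other)

    def __hash__(self):
        return hash(self.label)

print(len({Assembler("spades-default"), Assembler("spades-default"), Assembler("flye_ONT")}))  # 2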

24 changes: 18 additions & 6 deletions yeat/cli/downsample.py → yeat/cli/illumina.py
@@ -10,6 +10,18 @@
from argparse import ArgumentTypeError


def fastp_options(parser):
illumina = parser.add_argument_group("fastp arguments")
illumina.add_argument(
"-l",
"--length-required",
type=int,
metavar="L",
default=50,
help="discard reads shorter than the required L length during fastp; by default L=50",
)
Collaborator Author:

Users can now adjust -l / --length-required for the fastp step through a command-line flag.

Member:

I'd suggest replacing "during fastp" with "after pre-processing" or something like that.
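
For example, a hypothetical invocation (paths invented) that raises the minimum retained read length from the default of 50 bp to 75 bp:

$ yeat --length-required 75 --outdir analysis config.cfg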



def check_positive(value):
try:
value = int(value)
@@ -20,33 +32,33 @@ def check_positive(value):
return value


def options(parser):
short = parser.add_argument_group("downsample arguments")
short.add_argument(
def downsample_options(parser):
illumina = parser.add_argument_group("downsample arguments")
illumina.add_argument(
"-c",
"--coverage",
type=check_positive,
metavar="C",
default=150,
help="target an average depth of coverage Cx when auto-downsampling; by default, C=150",
)
short.add_argument(
illumina.add_argument(
"-d",
"--downsample",
type=int,
metavar="D",
default=0,
help="randomly sample D reads from the input rather than assembling the full set; set D=0 to perform auto-downsampling to a desired level of coverage (see --coverage); set D=-1 to disable downsampling; by default D=0",
)
short.add_argument(
illumina.add_argument(
"-g",
"--genome-size",
type=int,
metavar="G",
default=0,
help="provide known genome size in base pairs (bp); by default, G=0",
)
short.add_argument(
illumina.add_argument(
"--seed",
type=int,
metavar="S",
22 changes: 22 additions & 0 deletions yeat/tests/__init__.py
@@ -7,8 +7,10 @@
# Development Center.
# -------------------------------------------------------------------------------------------------

import json
import multiprocessing
import os
from pathlib import Path
from pkg_resources import resource_filename


@@ -20,3 +22,23 @@ def data_file(path):

def get_core_count():
return multiprocessing.cpu_count()


def write_config(labels, wd, cfg):
data = json.load(open(data_file(f"configs/{cfg}")))
assemblers = []
for assembler in data["assemblers"]:
if assembler["label"] in labels:
assemblers.append(assembler)
data["assemblers"] = assemblers
json.dump(data, open(Path(wd) / cfg, "w"))
return assemblers


def files_exists(wd, assemblers, expected):
analysis_dir = Path(wd).resolve() / "analysis"
for assembler in assemblers:
for sample in assembler["samples"]:
label = assembler["label"]
algorithm = assembler["algorithm"]
assert (analysis_dir / sample / label / algorithm / expected).exists()
Comment on lines +27 to +44
Collaborator Author:

The purpose of these two functions (write_config() and files_exists()) is to consolidate and share setup and assertion code for the following tests (a usage sketch follows the list):

  • test_oxford.py::test_oxford_nanopore_read_assemblers
  • test_pacbio.py::test_pacbio_hifi_read_assemblers
  • test_pacbio.py::test_pacbio_hifi_read_metagenomic_assemblers
  • test_paired.py::test_paired_end_assemblers
  • test_single.py::test_single_end_assemblers
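
A minimal sketch of how one of those tests might call the helpers (the config file name and expected output file are invented for illustration):

from yeat.tests import files_exists, write_config

def test_single_end_assemblers(tmp_path):
    labels = ["spades-default"]
    # keep only the assembler entries under test and write the trimmed config into tmp_path
    assemblers = write_config(labels, tmp_path, "single.cfg")
    # ... run the yeat workflow against tmp_path / "single.cfg" here ...
    # assert that each sample/label/algorithm combination produced the expected file
    files_exists(tmp_path, assemblers, "contigs.fasta")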

Binary file added yeat/tests/data/chr11-2M.fq.gz
Binary file not shown.
19 changes: 0 additions & 19 deletions yeat/tests/data/configs/canu.cfg

This file was deleted.
