Single-end Support #50

Merged: 21 commits into main on Aug 23, 2023
Conversation

@danejo3 (Collaborator) commented Aug 11, 2023

The purpose of this PR is to resolve #48, integrate single-end read support, fix some bugs introduced in #47, and clean up the test suite.

In issue thread #48, a request was made to tighten the read filtering performed by fastp. By default, fastp discards reads shorter than 15 bp; we raised this cutoff (fastp's -l/--length_required) from 15 to 50. In addition, when paired-end reads are passed into the fastp rule, we now call --detect_adapter_for_pe to detect and remove adapters.

Because #48 mentioned that --detect_adapter_for_pe should not be called for single-end reads, I went ahead and integrated single-end read support for spades, megahit, and Unicycler.

In #47, Oxford Read Support was introduced. In the test suite, I noticed that some tests did not actually exercise the new feature. While fixing them, I got slightly frustrated with the test suite and decided to clean it up, removing unnecessary, redundant configuration test files. In this PR, correct Oxford tests were introduced, and test data were consolidated and cleaned up.
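The effect of the new length cutoff can be sketched in plain Python. This is a simplified stand-in for what fastp does internally (not its actual implementation): reads shorter than the cutoff are discarded.

```python
def length_filter(reads, length_required=50):
    """Keep only reads at least `length_required` bp long, mirroring fastp -l."""
    return [read for read in reads if len(read) >= length_required]

reads = ["A" * 10, "C" * 30, "G" * 50, "T" * 120]
# Old default cutoff (15) keeps the 30 bp read; new cutoff (50) drops it.
assert length_filter(reads, length_required=15) == ["C" * 30, "G" * 50, "T" * 120]
assert length_filter(reads, length_required=50) == ["G" * 50, "T" * 120]
```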

@danejo3 (Collaborator, Author) left a comment:

Let me know if you have any questions about the PR! It's quite large!

@@ -5,5 +5,5 @@ show_missing = True
branch = True
omit =
*/yeat/_version.py
*/yeat/Snakefile
*/yeat/workflows/snakefiles/*
danejo3 (Collaborator, Author):

Removed the coverage warning that alerts the user when Snakemake files are unparseable: CoverageWarning: Couldn't parse Python file.

Comment on lines +7 to +8
filterwarnings =
ignore::DeprecationWarning:ratelimiter.*
danejo3 (Collaborator, Author):

I figured out how to remove the deprecation warning: DeprecationWarning: "@coroutine" decorator is deprecated since Python 3.8, use "async def" instead.

Now, when you run the test suite, you'll get a green pass bar instead of yellow!


Comment on lines 87 to 115
 def batch(self):
-    self.paired_assemblers = []
-    self.pacbio_assemblers = []
-    self.oxford_assemblers = []
+    self.paired_sample_labels = set()
+    self.paired_assemblers = set()
+    self.single_sample_labels = set()
+    self.single_assemblers = set()
+    self.pacbio_sample_labels = set()
+    self.pacbio_assemblers = set()
+    self.oxford_sample_labels = set()
+    self.oxford_assemblers = set()
     for assembler in self.assemblers:
         self.determine_assembler_workflow(assembler)
     self.batch = {
         "paired": {
-            "samples": self.get_batch_samples(self.paired_assemblers),
+            "samples": self.get_samples(self.paired_sample_labels),
             "assemblers": self.paired_assemblers,
         },
+        "single": {
+            "samples": self.get_samples(self.single_sample_labels),
+            "assemblers": self.single_assemblers,
+        },
         "pacbio": {
-            "samples": self.get_batch_samples(self.pacbio_assemblers),
+            "samples": self.get_samples(self.pacbio_sample_labels),
             "assemblers": self.pacbio_assemblers,
         },
         "oxford": {
-            "samples": self.get_batch_samples(self.oxford_assemblers),
+            "samples": self.get_samples(self.oxford_sample_labels),
             "assemblers": self.oxford_assemblers,
         },
     }
danejo3 (Collaborator, Author):

Updated the batch function. In this function, samples and assemblers are sorted by "read type". Originally, lists of assemblers were kept and then saved in the self.batch dictionary. Now, we save sets of sample labels and assembler objects.
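The batching-by-read-type idea can be sketched roughly as follows. This is a hypothetical, simplified illustration, not YEAT's actual classes: samples are a label-to-readtype mapping, and assemblers map a name to the sample labels they run on.

```python
from collections import defaultdict

def batch_by_readtype(samples, assemblers):
    """samples: {label: readtype}; assemblers: {name: [sample labels]}."""
    sample_sets = defaultdict(set)
    assembler_sets = defaultdict(set)
    for name, sample_labels in assemblers.items():
        for label in sample_labels:
            readtype = samples[label]
            sample_sets[readtype].add(label)       # sets deduplicate shared samples
            assembler_sets[readtype].add(name)     # ...and shared assemblers
    return {
        rt: {"samples": sorted(sample_sets[rt]), "assemblers": sorted(assembler_sets[rt])}
        for rt in sample_sets
    }

samples = {"s1": "paired", "s2": "single"}
assemblers = {"spades": ["s1", "s2"]}
print(batch_by_readtype(samples, assemblers))
```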

Comment on lines 117 to +131
 def determine_assembler_workflow(self, assembler):
-    readtypes = set()
     for sample in assembler.samples:
-        readtypes.update({self.samples[sample].readtype})
-    if assembler.algorithm in PAIRED and readtypes.intersection(ILLUMINA_READS):
-        self.paired_assemblers.append(assembler)
-    elif assembler.algorithm in PACBIO and readtypes.intersection(PACBIO_READS):
-        self.pacbio_assemblers.append(assembler)
-    elif assembler.algorithm in OXFORD and readtypes.intersection(OXFORD_READS):
-        self.oxford_assemblers.append(assembler)
-
-def get_batch_samples(self, assemblers):
-    samples = {}
-    for assembler in assemblers:
-        samples = samples | dict(
-            (key, self.samples[key]) for key in assembler.samples if key in self.samples
-        )
-    return samples
+        readtype = self.samples[sample].readtype
+        if readtype == "paired":
+            self.paired_sample_labels.add(sample)
+            self.paired_assemblers.add(assembler)
+        elif readtype == "single":
+            self.single_sample_labels.add(sample)
+            self.single_assemblers.add(assembler)
+        elif readtype in PACBIO_READS:
+            self.pacbio_sample_labels.add(sample)
+            self.pacbio_assemblers.add(assembler)
+        elif readtype in OXFORD_READS:
+            self.oxford_sample_labels.add(sample)
+            self.oxford_assemblers.add(assembler)
danejo3 (Collaborator, Author):

Instead of determining which workflow an assembler object belongs to with a pile of complex conditional statements, we now sort by "read type".

Comment on lines +143 to +153
+label_to_samples = {}
+for assembler in assemblers:
+    label_to_samples[assembler.label] = [
+        sample for sample in assembler.samples if sample in samples
+    ]
 return dict(
     samples={label: sample.to_string() for label, sample in samples.items()},
     labels=[assembler.label for assembler in assemblers],
     assemblers={assembler.label: assembler.algorithm for assembler in assemblers},
     extra_args={assembler.label: assembler.extra_args for assembler in assemblers},
-    label_to_samples={assembler.label: assembler.samples for assembler in assemblers},
+    label_to_samples=label_to_samples,
danejo3 (Collaborator, Author):

Originally, label_to_samples was a dictionary mapping assembler.label to assembler.samples.

This parameter had to change because not all of an assembler's samples are used in a given workflow.

For example, suppose a basic canu algorithm (no extra parameters) is called on two samples: 1) sample1, which has PacBio HiFi reads, and 2) sample2, which has Oxford Nanopore reads. When running to_dict(args, readtype='pacbio'), only sample1 will be available; when running to_dict(args, readtype='oxford'), only sample2 will be available. Because of this, the Snakemake rules need extra filtering to get the exact list of samples per label.
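The per-workflow filtering described above boils down to intersecting each assembler's configured sample list with the samples actually present in the current workflow's batch. A minimal sketch (names here are illustrative, not YEAT's actual helpers):

```python
def label_to_samples(assemblers, batch_samples):
    """assemblers: {label: [sample labels]}; batch_samples: labels in this workflow."""
    return {
        label: [s for s in sample_list if s in batch_samples]
        for label, sample_list in assemblers.items()
    }

# canu is configured on both samples, but only sample1 (pacbio-hifi)
# belongs to the "pacbio" workflow batch.
assemblers = {"canu": ["sample1", "sample2"]}
print(label_to_samples(assemblers, {"sample1"}))  # {'canu': ['sample1']}
```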

Comment on lines -10 to 19
-def get_expected_files(short_qc=False, pacbio_qc=False, nanopore=False):
+def get_expected_files(paired_qc=False, combined_qc=False, nanopore=False):
     inputlist = []
     for label in config["labels"]:
         assembler = config["assemblers"][label]
         for sample in config["label_to_samples"][label]:
             inputlist.append(f"analysis/{sample}/{label}/{assembler}/quast/{sample}_report.html")
-    if short_qc:
+    if paired_qc:
         inputlist += expand("seq/fastqc/{sample}/{sample}_{reads}_fastqc.html", sample=[*config["samples"]], reads=["R1", "R2"])
-    if pacbio_qc:
+    if combined_qc:
         inputlist += expand("seq/fastqc/{sample}/combined-reads_fastqc.html", sample=[*config["samples"]])
danejo3 (Collaborator, Author):

Updated the parameters of the function that creates the list of expected files for rule all.

short_qc and pacbio_qc have been renamed to paired_qc and combined_qc.

paired_qc is set to True when dealing with Illumina paired-end reads.

combined_qc is set to True when dealing with non-paired-end reads, e.g. single-end, PacBio, and Nanopore reads.
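Outside Snakemake, the effect of the two flags can be sketched like this (a simplified stand-in for the rule above; the real function uses Snakemake's expand()):

```python
def expected_qc_files(samples, paired_qc=False, combined_qc=False):
    files = []
    if paired_qc:  # Illumina paired-end: one FastQC report per mate (R1/R2)
        files += [f"seq/fastqc/{s}/{s}_{r}_fastqc.html" for s in samples for r in ("R1", "R2")]
    if combined_qc:  # single-end / PacBio / Nanopore: one combined report per sample
        files += [f"seq/fastqc/{s}/combined-reads_fastqc.html" for s in samples]
    return files

print(expected_qc_files(["sampleA"], paired_qc=True))
```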

-- Support for Nanopore-reads assembly with Flye and Canu (#47)
 - Oxford Nanopore raw, corrected, and HAC reads (#47)
+- Single-end Illumina reads (#50)
+- Support for Nanopore-reads assembly with Flye and Canu (#47, #50)
danejo3 (Collaborator, Author):

In this PR, I fixed a few Nanopore bugs that were introduced in #47.

Bugs:

  • The Nanoplot rule in the Oxford snakefile would crash due to an incorrect output path.
  • The tests in test_oxford.py::test_oxford_nanopore_read_assemblers didn't actually run any Oxford assemblies.

sample_readtype={label: sample.readtype for label, sample in samples.items()},
threads=args.threads,
downsample=args.downsample,
coverage=args.coverage,
genomesize=args.genome_size,
seed=args.seed,
length_required=args.length_required,
danejo3 (Collaborator, Author):

length_required is introduced here so the snakefiles can use it in the fastp rules.

Comment on lines 13 to 22
def fastp_options(parser):
illumina = parser.add_argument_group("fastp arguments")
illumina.add_argument(
"-l",
"--length-required",
type=int,
metavar="L",
default=50,
help="discard reads shorter than the required L length during fastp; by default L=50",
)
danejo3 (Collaborator, Author):

Users can now adjust -l / --length-required for the fastp step through a command-line flag.
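A minimal stand-alone check of the new flag, mirroring the fastp_options() argument group shown above (the parser here is built from scratch rather than through YEAT's CLI):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group("fastp arguments")
group.add_argument("-l", "--length-required", type=int, metavar="L", default=50)

# argparse converts --length-required to the attribute name length_required
assert parser.parse_args([]).length_required == 50            # default
assert parser.parse_args(["-l", "75"]).length_required == 75  # user override
```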

standage (Member):

I'd suggest replacing "during fastp" with "after pre-processing" or something like that.

 shell:
     """
     fastp -i {input.read1} -I {input.read2} -o {output.out1} -O {output.out2} \
+    -l {params.length_required} --detect_adapter_for_pe \
     --html seq/fastp/{wildcards.sample}/fastp.html --json seq/fastp/{wildcards.sample}/fastp.json \
     2> seq/fastp/{wildcards.sample}/report.txt
danejo3 (Collaborator, Author):

Added new params to fastp for #48.

@@ -105,6 +105,6 @@ rule hifiasm_meta:
 shell:
     """
     hifiasm_meta -o {params.prefix} -t {threads} {params.extra_args} {input[0]}
-    awk '/^S/{{print ">"$2;print $3}}' {params.prefix}.p_ctg.gfa > {params.prefix}.p_ctg.fa
+    gfatools gfa2fa {params.prefix}.p_ctg.gfa > {params.prefix}.p_ctg.fa
danejo3 (Collaborator, Author):

Found a cool tool from Heng Li called gfatools. It can do more with the rGFA format and supports several conversion formats.
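For context, the conversion the old awk one-liner performed (and which gfatools gfa2fa now handles) is simple: each GFA S-line carries a segment name and its sequence, which become a FASTA record. A rough Python equivalent, for illustration only:

```python
def gfa_to_fasta(gfa_lines):
    """Emit a FASTA record for every S-line in a GFA stream."""
    out = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":  # S <name> <sequence> [optional tags...]
            out.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(out)

gfa = ["H\tVN:Z:1.0\n", "S\tctg1\tACGT\tLN:i:4\n"]
print(gfa_to_fasta(gfa))
```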

@danejo3 (Collaborator, Author) left a comment:

This PR is ready for review! Running the tests now. They should all be passing!

I've tried to reduce the test suite run time as much as I could by consolidating tests. It still takes a long time, but it should be slightly quicker!

It's a pretty big PR. Lots of changes. I'm going to cut a release once this is merged!

EDIT: Tests are passing!

@standage (Member) left a review:

This is indeed a big update. No big concerns. I'm still waiting for some tests to run, but in the meantime I've got a bit of feedback. See my comments below, but I'll also note that make testnano is failing for me at the following step.

$ gunzip -c seq/input/Ecoli_K12_MG1655_R10.3_HAC/combined-reads.fq.gz | Nanofilt -q 10 | gzip > seq/nanofilt/Ecoli_K12_MG1655_R10.3_HAC/highQuality-reads.fq.gz
Traceback (most recent call last):
  File "/Users/daniel.standage/anaconda3/envs/yeat/bin/Nanofilt", line 10, in <module>
    sys.exit(main())
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/nanofilt/NanoFilt.py", line 44, in main
    filter_stream(args.input, args)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/nanofilt/NanoFilt.py", line 72, in filter_stream
    for rec in SeqIO.parse(fq, "fastq"):
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/Interfaces.py", line 72, in __next__
    return next(self.records)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/QualityIO.py", line 1123, in iterate
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/site-packages/Bio/SeqIO/QualityIO.py", line 926, in FastqGeneralIterator
    line = next(handle)
  File "/Users/daniel.standage/anaconda3/envs/yeat/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any ideas?
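One observation on the traceback: 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), which suggests NanoFilt received still-compressed bytes rather than decompressed FASTQ text. A quick, hypothetical sanity check:

```python
import gzip

# Simulated gzipped FASTQ data; real data would come from the pipeline.
data = gzip.compress(b"@read1\nACGT\n+\nIIII\n")

# The gzip magic number is the two-byte prefix 0x1f 0x8b.
assert data[:2] == b"\x1f\x8b"  # still gzipped: decompress before parsing
assert gzip.decompress(data).startswith(b"@read1")
```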

Comment on lines 202 to 213

def __members(self):
return (self.label,)

def __eq__(self, other):
if type(other) is type(self):
return self.__members() == other.__members()
else:
return False

def __hash__(self):
return hash(self.__members())
standage (Member):

Roger that. It looks like you based this code on an example incorporating multiple values into the hash function. Since you're only hashing a single value, you could probably simplify with something like this.

def __eq__(self, other):
    return hash(self) == hash(other)

def __hash__(self):
    return hash(self.label)

The type check is subsumed by the hash comparison: the chance that an object of a different data type hashes to the same numerical value is astronomically small.
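To see the simplification in action, here is a minimal stand-in class (not the actual YEAT Assembler) using label-based hashing; equal-labeled objects compare equal and deduplicate cleanly in sets:

```python
class Assembler:
    def __init__(self, label):
        self.label = label

    def __eq__(self, other):
        return hash(self) == hash(other)

    def __hash__(self):
        return hash(self.label)

a, b = Assembler("spades"), Assembler("spades")
assert a == b            # equal because the labels hash identically
assert len({a, b}) == 1  # sets treat them as one object
```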


@danejo3 (Collaborator, Author) commented Aug 23, 2023

Okay, I added all the suggested changes to the branch. I think if you pull from this branch, the NanoFilt problem should be fixed. I ran into this problem as well and somehow fixed it in this branch.

@standage standage merged commit cccbe5f into main Aug 23, 2023
@standage standage deleted the single-end branch August 23, 2023 17:51
Successfully merging this pull request may close these issues.

Update fastp parameters