Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip removing small fastq.gz files after the length filter with --list or --sample #279

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

MarieLataretu
Copy link
Collaborator

Hi folks,

We have been using --fastq_pass --samples as input configuration, but we will switch to --fastq --list at some point. While testing, I noticed that the negative control didn't appear in the reports.

Now, I'm not sure if I understand the reasoning behind these lines completely: https://github.com/replikation/poreCov/blob/master/modules/filter_fastq_by_length.nf#L34-L36

However, we'd still like to have the negative control in the report, so with this PR, the removal of small fastq.gz files after the length filter is skipped with --list or --sample.

@replikation
Copy link
Owner

When using the e.g. 96 barcode kit you get tons of "false" barcodes. thats why we remove the small ones.

@hoelzer
Copy link
Collaborator

hoelzer commented Dec 20, 2024

True, but I see that this is a problem when a negative control is included (which should have few reads, but people want to see it in the report)

@MarieLataretu
Copy link
Collaborator Author

When using the e.g. 96 barcode kit you get tons of "false" barcodes. thats why we remove the small ones.

I see! Do you start poreCov without --list and without --sample then?

Because when the user specifies --list or --sample, the user explicitly lists the expected samples/barcodes 🤔

@MarieLataretu
Copy link
Collaborator Author

(CI fail could be related to nextflow-io/nextflow#5456, which should be fixed in nextflow version 24.11.0-edge; CI uses version 24.10.3)

@DataSpott
Copy link
Collaborator

DataSpott commented Jan 8, 2025

We use --samples in combination with --fastq_pass for our general analyses.
When we reanalyze files (fastq or fasta) we feed them directly to the according flag, like: --fastq '/path/to/fastq/dir/*.fastq' (single quotation marks necessary).
--list is never used on our end. Christian meant it is probably an artifact from earlier implementations, as there is no documentation.

But I also agree with you that it then should omit the filtering step, if you feed specific files to the workflow.
E.g.:

  • If you feed files generally like --fastq '/path/to/*.fastq' have the filter active and sort out files that are to small.
  • If you instead use --fastq '/path/to/list.csv' --list you specify directly which files should be analyzed and the filtering is omitted.

Yet we should add how --list works to the documentation and that it actively omits the filtering step.

@DataSpott
Copy link
Collaborator

Please correct me if I'm wrong here, but to my knowledge, --list cannot rename samples.
In my test with fasta-files, the samples were not renamed by the provided csv-file. Instead, poreCov used the fasta-headers to name the samples.

@replikation
Copy link
Owner

--list i think is something we never use. so feel free to adapt or adjust this.

@MarieLataretu
Copy link
Collaborator Author

Please correct me if I'm wrong here, but to my knowledge, --list cannot rename samples. In my test with fasta-files, the samples were not renamed by the provided csv-file. Instead, poreCov used the fasta-headers to name the samples.

Ah, I haven't checked out fasts input yet! Can you try maybe it with the latest change?

I think the file size filter was not documented anywhere; please feel free to add/edit

@DataSpott
Copy link
Collaborator

I tested your changes with fastas now. The basic function is fine, but there are two issues I ran into:

  1. In the final html-report the samples are named by the fasta-header, yet all files in the output dir are named like in the csv-file I fed for --list
  2. There is a problem with multi-fasta files breaking the html-report generation. The causative error comes from summary_report.py: "ValueError: cannot reindex from a duplicate axis"
    -> we usually solved this by splitting multi- into single-fastas, each holding exactly one sequence
    -> yet it would mean we can not use --list with multi-fasta files, unless we find a way to split and then rename them properly

@MarieLataretu
Copy link
Collaborator Author

uff, the multifastas add another dimension here 😅

Should we disable the combination of --fasta and --list?

@MarieLataretu
Copy link
Collaborator Author

MarieLataretu commented Jan 15, 2025

Alrighty, I disabled --list in combination with --fasta.

--fastq samples.csv --list works for our use-case.

I run the test profiles (-profile test_fast[a|q|5]) locally without errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants