1357 grouping strategy applied by counting number of FASTQ files generated by FASTP #1364
Changes:
- The grouping strategy for sharded data has been improved.
- The number of BAM files per sample is calculated by grouping the samples by ID after splitting the FASTQ files, then counting the total number of FASTQ files created.
- This has to wait for all FASTQ files to be produced by FASTP, but is more reliable.
- After alignment, the number of FASTQ files is used to determine the expected number of BAM files used by groupBy.

Fixes #1357
Changes:
- FASTP uses blocks of 250 reads when splitting a FASTQ file.
- This update makes 250 the minimum block size for splitting a FASTQ file.
- Updates the help text accordingly.

Fixes #1363
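For readers following the thread, the grouping strategy described for #1357 can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the sample IDs, file names, and channel names (`ch_shards`, `ch_counts`, `ch_bams`) are hypothetical, and the alignment step is replaced by a simple rename. It assumes the standard Nextflow operators `groupTuple`, `combine`, and the `groupKey` helper.

```nextflow
workflow {
    // Hypothetical FASTQ shards as they might leave the splitting step: [ meta, shard ]
    ch_shards = Channel.of(
        [ [ id: 'sampleA' ], 'sampleA_L1.0001.fastq.gz' ],
        [ [ id: 'sampleA' ], 'sampleA_L1.0002.fastq.gz' ],
        [ [ id: 'sampleA' ], 'sampleA_L2.0001.fastq.gz' ],
        [ [ id: 'sampleB' ], 'sampleB_L1.0001.fastq.gz' ]
    )

    // Count the shards per sample ID. groupTuple() without a known size
    // only emits once the upstream channel is complete, which is why this
    // step has to wait for every shard to be produced.
    ch_counts = ch_shards
        .map { meta, fastq -> [ meta.id, fastq ] }
        .groupTuple()
        .map { id, fastqs -> [ id, fastqs.size() ] }

    // Stand-in for alignment (assumption): one BAM name per FASTQ shard
    ch_bams = ch_shards
        .map { meta, fastq -> [ meta.id, meta, fastq.replace('.fastq.gz', '.bam') ] }

    // Attach the expected shard count to each BAM, then group with a sized key
    // so each sample is emitted as soon as all of its BAMs have arrived.
    ch_bams
        .combine(ch_counts, by: 0)
        .map { id, meta, bam, n_fastq -> [ groupKey(meta, n_fastq), bam ] }
        .groupTuple()
        .view()
}
```

The trade-off matches the one noted above: counting real shards delays the grouping until splitting has finished, but the expected group size no longer depends on an estimate that can be wrong for very small lanes.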
@FriederikeHanssen @maxulysse I created a test data set that includes only 60 reads for lane 2 to recreate this problem. It is attached here; you'll have to modify the path in the input samplesheet. There are no tests in Sarek right now. What shall we do? It's easy to do, but we'd need to add some more data to test-datasets (such as those FASTQ files).
LGTM
Can you update the changelog too?
Testing might be good, but that data probably can't be added to the modules repo, right?
I don't see why not. I just sliced 60 reads from the existing data. Alternatively we could generate it on the fly? Here is the mini workflow to generate a channel with a slice of the reads:

    workflow UNEVEN_FASTQ {
        take:
        csv

        main:
        ch_csv = Channel.fromPath(csv, checkIfExists: true)
            .splitCsv(header: true)
            .map { row ->
                [
                    [
                        patient: row.patient,
                        sex: row.sex,
                        status: row.status,
                        sample: row.sample,
                        lane: "small_lane"
                    ],
                    file(row.fastq_1),
                    file(row.fastq_2)
                ]
            }
            .first()

        ch_csv
            .splitFastq(by: 60, file: true, pe: true)
            .map { meta, read1, read2 -> [ meta, [ read1, read2 ] ] }
            .first()
            .mix(ch_csv)
            .set { fastq }

        emit:
        fastq
    }
🚀 😍
    // Group
    .groupTuple()

    bai_mapped = FASTQ_ALIGN_BWAMEM_MEM2_DRAGMAP_SENTIEON.out.bai
The BAI is only produced/tested with Sentieon, but since it is the same, it should work.
PR checklist
- nf-core lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).