Check for invalid BC tags in fastq output. #1518
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
I don't know how best to deal with such situations, so rather than a hard error we treat the barcode component the same as if no BC tag was stored - with "0". It also whinges, and you'll probably get this message many times, but take that as "encouragement" to fix the input data!
Also fixed the barcode to be uppercase, incase of e.g. "ac+gt". I'm not sure if it's required, but we may as well.
Fixes samtools/samtools#1728
Note it's still clear as mud as to what this field really should contain. The Illumina documentation is pretty atrocious here.
The old (deprecated) bcl2fastq had the most information: https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2-v2-20-software-guide-15051736-03.pdf but it says "index sequence or sample number", so a numeric is permitted here but not in our source BC tag (as per SAMtags spec). It doesn't say anything about the mark up of sequence, or even mention the ability of +, - or other separators in multi-barcodes.
The replacement tool bcl_convert is even more opaque:https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl_convert/bcl-convert-v3-7-5-software-guide-1000000163594-00.pdf. That doesn't even have a single example of what fastq looks like or acknowledge the existance of the data after the identifier.
Finally the latest is the DRAGEN bcl conversion, but that too is opaque: https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/ToolsiBCL_fDG.htm. I'm guessing it's because DRAGEN doesn't generate fastq as it goes direct from BCL to alignments.
None of these mention "casava", which was the original software and how our options and documentation is named, so a huge shrug on trying to figure out what we're meant to do.