Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for invalid BC tags in fastq output. #1518

Merged
merged 1 commit into from
Oct 25, 2022

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Oct 25, 2022

I don't know how best to deal with such situations, so rather than a hard error we treat the barcode component the same as if no BC tag was stored - with "0". It also whinges, and you'll probably get this message many times, but take that as "encouragement" to fix the input data!

Also fixed the barcode to be uppercase, incase of e.g. "ac+gt". I'm not sure if it's required, but we may as well.

Fixes samtools/samtools#1728

Note it's still clear as mud as to what this field really should contain. The Illumina documentation is pretty atrocious here.

The old (deprecated) bcl2fastq had the most information: https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl2fastq/bcl2fastq2-v2-20-software-guide-15051736-03.pdf but it says "index sequence or sample number", so a numeric is permitted here but not in our source BC tag (as per SAMtags spec). It doesn't say anything about the mark up of sequence, or even mention the ability of +, - or other separators in multi-barcodes.

The replacement tool bcl_convert is even more opaque:https://emea.support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/bcl_convert/bcl-convert-v3-7-5-software-guide-1000000163594-00.pdf. That doesn't even have a single example of what fastq looks like or acknowledge the existance of the data after the identifier.

Finally the latest is the DRAGEN bcl conversion, but that too is opaque: https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/ToolsiBCL_fDG.htm. I'm guessing it's because DRAGEN doesn't generate fastq as it goes direct from BCL to alignments.

None of these mention "casava", which was the original software and how our options and documentation is named, so a huge shrug on trying to figure out what we're meant to do.

I don't know how best to deal with such situations, so rather than a
hard error we treat the barcode component the same as if no BC tag was
stored - with "0".  It also whinges, and you'll probably get this
message many times, but take that as "encouragement" to fix the input
data!

Also fixed the barcode to be uppercase, incase of e.g. "ac+gt".  I'm
not sure if it's required, but we may as well.

Fixes samtools/samtools#1728
@daviesrob
Copy link
Member

The check for valid sequence bases is a bit generous, but I guess we can tighten it up later if we need to.

@daviesrob daviesrob merged commit 2ff4b13 into samtools:develop Oct 25, 2022
@jkbonfield
Copy link
Contributor Author

I was struggling to work out how far to go, because I don't know what is permitted in BC tags. The spec claims "sequence", but does that include ambiguity codes? In theory it could, if we have degenerate tags and use multiple barcodes mixed together. So then it comes down to whether to check for the non-IUPAC codes, such as "J" and "Z". In the end I figured it's likely just not worth it as the problem in reality is caused by numerics and/or punctuation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

samtools fastq losing data in single end input
2 participants