Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Permit fastq output to create empty FASTQ records for seq "*". #1576

Merged
merged 1 commit into from
Mar 8, 2023

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Mar 2, 2023

This is rather questionable, but htslib can read in empty fastq records and generates "*" for SEQ and QUAL. The same is true vice versa now. Eg:

name	4	*	0	0	*	*	0	0	*	*

becomes

@name

+

Bwa mem and minimap2 can both read these fastq entries too, although the SAM output is bugged as it outputs an empty field instead of "*" for SEQ. (Note this is still readable by htslib and interpreted the same).

Potential reasons for accepting this:

  • When dealing with paired data, we don't want to output a differing number of records from samtools fastq if read1 has seq and read2 has "*".

    Note as this filtered at the htslib layer, it's not considered as a singleton so fastq -s won't rescue this.

  • At least some aligners apparently support this format. Although inevitably they just produce unmapped data.

  • Arguably this is a case of silly input => silly output!

  • Users can manually elect to "samtools view -e 'length(seq) > 0'" before using samtools fastq, which then fixes the not-a-singleton problem.

  • It converts samtools fastq output back to how it was in pre 1.13 era, where we rewrote it to use htslib's interfaces.

Potential reason to reject:

  • It may yield output which trips up some poorly written tools.

Fixes samtools/samtools#1799

This is rather questionable, but htslib can now output empty SAM
records as empty fastq records.  Eg:

    name	4	*	0	0	*	*	0	0	*	*

becomes

    @name

    +

Htslib is happy to read this back in and produces the original SAM
once more.

Bwa mem and minimap2 can both read these fastq entries too, although
the SAM output is bugged as they output an empty field instead of "*"
for SEQ.

Potential reasons for accepting this:

- When dealing with paired data, we don't want to output a differing
  number of records from samtools fastq if read1 has seq and read2 has
  "*".

  Note as this filtered at the htslib layer, it's not considered as a
  singleton so fastq -s won't rescue this.

- At least some aligners apparently support this format.  Although
  inevitably they just produce unmapped data.

- Arguably this is a case of silly input => silly output!

- Users can manually elect to "samtools view -e 'length(seq) > 0'"
  before using samtools fastq, which then fixes the not-a-singleton
  problem.

- It converts samtools fastq output back to how it was in pre 1.13
  era, where we rewrote it to use htslib's interfaces.

Potential reason to reject:

- It may yield output which trips up some poorly written tools.

Fixes samtools/samtools#1799
@daviesrob daviesrob self-assigned this Mar 7, 2023
@daviesrob
Copy link
Member

On discussion, we decided that avoiding mis-paired reads was preferable to avoiding empty sequences. I suspect this rarely happens, but I guess we may find out if anyone notices...

@daviesrob daviesrob merged commit dcd20d9 into samtools:develop Mar 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

samtools fastq skips over reads with no bases/qualities
2 participants