Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM flag filter specification strings and parsing & filtering API functions #1361

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 67 additions & 0 deletions htslib/sam.h
Original file line number Diff line number Diff line change
Expand Up @@ -1090,6 +1090,73 @@ hts_pos_t bam_cigar2rlen(int n_cigar, const uint32_t *cigar);
HTSLIB_EXPORT
hts_pos_t bam_endpos(const bam1_t *b);

/*! @typedef
* @abstract Flag filtering settings structure
*/
typedef struct sam_flag_filter_s {
uint32_t has_all:1, has_none:1, has_any:1, has_omit_any:1, has_exact:1, :27;
// To pass this filter, a read must have...
unsigned int all; // all of these flags
unsigned int none; // none of these
unsigned int any; // at least one of these
unsigned int omit_any; // OMIT at least one of these
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a while to figure out what this meant on the first read through. I think not_all is a clearer term than omit_any, or maybe any_unset.

That gives groupings (reordered for clarity) of all/not-all and any/not-any, or all-set/all-unset and any-set/any-unset.

I don't mind if these are perhaps just added in comments for the sake of clarity, but mixing up different terms ("none", "omit") doesn't aid understanding of what is effectively a trivial bit check.

Copy link
Member Author

@jmarshall jmarshall Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been strenuously avoiding not_all because it is ambiguous between “it must be that not all of these are 1“ and “it must be that all of these are not 1” — which are very different.

any_unset is probably better though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note {any,all}-{set,unset} is what we ended up picking in bcftools too, albeit that was "skip" rather than "keep", as it was being used as a filter.

unsigned int exact; // exactly these flags (and no others)
} sam_flag_filter_t;

/// Check whether the flags pass the collected flag filters
/** @param flag Bitwise SAM flags
@param filt Pointer to the flag filtering settings
@return 1 if @a flag passes all the active filters, 0 if not
*/
static inline
int sam_flag_passes_flag_filter(unsigned int flag, const sam_flag_filter_t *filt) {
if (filt->has_all && (flag & filt->all) != filt->all) return 0;
if (filt->has_none && (flag & filt->none) != 0) return 0;
if (filt->has_any && (flag & filt->any) == 0) return 0;
if (filt->has_omit_any && (flag & filt->omit_any) == filt->omit_any) return 0;
if (filt->has_exact && flag != filt->exact) return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set filt->has_all = filt->has_exact and filt->has_none = ~filt->has_exact then I think this line (and ultimately exact variables) are redundant. That's an implementation optimisation, and doesn't need to impact the public API and expected command line usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've been thinking about reworking it so exact: is implemented in terms of those two as well. The downside is if someone specifies all three it will discard some of their settings, but no-one should be using exact: in conjunction with the others anyway…

return 1;
}

/// Check whether the alignment record's flags pass the collected flag filters
/** @param b Pointer to the BAM record to be checked
@param filt Pointer to the flag filtering settings
@return 1 if the record passes all the active filters, 0 if not
*/
static inline
int sam_passes_flag_filter(const bam1_t *b, const sam_flag_filter_t *filt) {
return sam_flag_passes_flag_filter(b->core.flag, filt);
}

/// Update flag filtering settings from a flag specification string
/** @param s Flag specification string
@param mode Initial punctuation mode character (one of [-&!^+|~=])
@param filt Pointer to the flag filtering settings to update
@return >= 0 on success, -1 if the string syntax was invalid

Flag specification strings are of the form `#flag,...,flag,#flag,...,flag,...`,
i.e., a comma-separated list of flag items (each of which is either a numeric
flag bitset, a symbolic (/[pPuUrR12sxdS_]+/) flag bitset, or a single flag
name (eg "PAIRED")), each of which may be prefixed by a punctuation character
that controls which filter the subsequent items contribute to (until the next
punctuation character):

Chr To pass filtering and be kept, the read must...
- & ...have ALL flags prefixed by either - or &
! ^ ...have NONE of the flags prefixed by either ! or ^
+ | ...have ANY (at least one) of the flags prefixed by either + or |
~ ...NOT have at least one of the flags prefixed by ~
= ...have EXACTLY the flags prefixed by = (and no others)

Flag items before the first punctuation character are controlled by @a mode.
The settings are updated via `|=` so that sam_parse_flag_filter() can be used
repeatedly to record filters from several specification strings. Hence
@a filt should be zero-initialised before the first call.
*/
HTSLIB_EXPORT
int sam_parse_flag_filter(const char *s, char mode, sam_flag_filter_t *filt);


HTSLIB_EXPORT
int bam_str2flag(const char *str); /** returns negative value on error */

Expand Down
2 changes: 1 addition & 1 deletion sam.5
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ lcbl.
0x0040 1 the read is the first read in a pair
0x0080 2 the read is the second read in a pair
0x0100 s the alignment is not primary
0x0200 f the read fails platform/vendor quality checks
0x0200 x filtered out, e.g., by platform/vendor quality checks
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text change here reflects samtools/hts-specs#85.

0x0400 d the read is either a PCR or an optical duplicate
0x0800 S the alignment is supplementary
.TE
Expand Down
158 changes: 158 additions & 0 deletions sam.c
Original file line number Diff line number Diff line change
Expand Up @@ -4955,6 +4955,164 @@ char *sam_open_mode_opts(const char *fn,
return mode_opts;
}

// Parses one item, which may be a set of flags specified numerically
// or via symbolic letter codes, or a single flag via its long name.
static unsigned long sam_parse_flag1(const char *s, char **endptr)
{
if (isdigit_c(*s)) return strtoul(s, endptr, 0);

unsigned long flag = 0;
char name[24];
int i = 0, short_ok = 1;

for (; isalpha_c(*s) || *s == '_'; s++) {
if (i < sizeof name) name[i++] = toupper_c(*s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason for copying / uppercasing from s to name character by character before checking to avoid the POSIX strncasecmp function? Note MSVC has _strnicmp which is identical (albeit renamed, because... Microsoft). Plus the library already uses this function, so it being missing is likely only true on old MSVC releases given recent discussions.

It would seem simpler to just check s direct for the long names. There's nothing specifically wrong with the code though so this isn't a hard "thou shalt rewrite" demand. :) Just curious why the decision to do it this way, and noting if it's to avoid POSIX-only things then there's no need.

See also bam_str2flag. I wonder if we should be naming the function in a similar fashion, or at least referencing it in the comments for both as it may be unclear to people why there are two functions with apparently identical purposes. The lack of any comment above bam_str2flag doesn't help (I know you'll find it if you hunt in the header file, but that doesn't help a casual reason of the .c file).

Maybe bam_str2flag needs rewriting to be a loop around successive calls to sam_parse_flag1 so it's obvious how the two relate to each other (ie single flag vs multiple flags).

Copy link
Member Author

@jmarshall jmarshall Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have an irrational prejudice against strcasecmp and friends. The claimed rational basis is that in this checking against a table of keywords scenario, using strncasecmp implies O(N²) locale-aware character case conversions (really O(MN) where M is the number of entries in the table and N is the average word length) while canonicalising the input costs only O(N) case conversions.

Also they require you to add #include <strings.h> because POSIX does not say that they are implied by <string.h>. Blarg.

But admittedly computers are fast nowadays and strncasecmp is indeed used elsewhere in htslib… 🤷

I have a followup commit that rewrites bam_str2flag in terms of sam_parse_flag1 but I didn't want to muddy the waters with it just yet :-).

Copy link
Contributor

@jkbonfield jkbonfield Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: bam_str2flag: we can do it in two stages fine.

Regarding strncasecmp, I don't follow the notion of higher complexity. With a length limit it's just the same deal, and it can be used just as before in a switch statement to restrict the search space. At least I'd always assumed the only difference between strncasecmp and strncmp is the initial case permutation of each character (or word) before comparing. But anyway, as I say it's not a major concern. I was just curious why the extra temporary buffer was used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do M strncasecmp(input, keyword[i]) checks (for M valid keywords and with 0 <= i < M), you have to canonicalise each character of input M times and each character of each of the M keyword strings once apiece — this calculation is complicated by only having to canonicalise the pairs of characters at the start of the strings that you actually need to compare, but is nonetheless O(MN).

OTOH if you write your keyword table pre-canonicalised, then you only need to canonicalise each character of input exactly once, which is O(N). In fact, it's exactly N character case conversions.

I said it was irrational! I may rewrite this using the combination of the existing switch (firstchar) style with strncasecmp. As you say, the extra fixed length buffer is not great either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, for M strings being compared of length N, I'd see both strncasecmp and strcasecmp being O(MN), but with different constant scaling factors.

Personally for printable strings I'd do something as trivial as if ((c1 & ~0x20) != (c2 & ~0x20)) break (case insensitive) vs if (c1 != c2) break (case sensitive), but looking at the code I see it's a bit more flexible and has a charmap[u1] != charmap[u2] lookup instead. That basically means the canonicalisation is simply a constant slow down on any comparison.

Agreed it's perhaps slower than doing it once upfront to avoid the canonicalisation bit being repeated, but the complexity itself is the same order. Anyway, it's arguing over something that frankly irrelevant. You've got to be pretty special to have argument parsing being the bottleneck of your work. :-) I don't really mind either way.

switch (*s) {
case 'p': flag |= BAM_FPAIRED; break;
case 'P': flag |= BAM_FPROPER_PAIR; break;
case 'u': flag |= BAM_FUNMAP; break;
case 'U': flag |= BAM_FMUNMAP; break;
case 'r': flag |= BAM_FREVERSE; break;
case 'R': flag |= BAM_FMREVERSE; break;
case '1': flag |= BAM_FREAD1; break;
case '2': flag |= BAM_FREAD2; break;
case 's': flag |= BAM_FSECONDARY; break;
case 'x': flag |= BAM_FQCFAIL; break;
case 'd': flag |= BAM_FDUP; break;
case 'S': flag |= BAM_FSUPPLEMENTARY; break;
case '_': break;
// TODO In future, add an option to allow [mMfF] as no-ops

default:
short_ok = 0;
break;
}
}

*endptr = (char *) s;

if (i < sizeof name) {
name[i] = '\0';
switch (name[0]) {
case 'D':
if (strcmp(name, "DUP") == 0) return BAM_FDUP;
break;

case 'M':
if (strcmp(name, "MUNMAP") == 0) return BAM_FMUNMAP;
else if (strcmp(name, "MREVERSE") == 0) return BAM_FMREVERSE;
break;

case 'P':
if (strcmp(name, "PAIRED") == 0) return BAM_FPAIRED;
else if (strcmp(name, "PROPER_PAIR") == 0) return BAM_FPROPER_PAIR;
break;

case 'Q':
if (strcmp(name, "QCFAIL") == 0) return BAM_FQCFAIL;
break;

case 'R':
if (strcmp(name, "REVERSE") == 0) return BAM_FREVERSE;
else if (strcmp(name, "READ1") == 0) return BAM_FREAD1;
else if (strcmp(name, "READ2") == 0) return BAM_FREAD2;
break;

case 'S':
if (strcmp(name, "SECONDARY") == 0) return BAM_FSECONDARY;
else if (strcmp(name, "SUPPLEMENTARY") == 0) return BAM_FSUPPLEMENTARY;
break;

case 'U':
if (strcmp(name, "UNMAP") == 0) return BAM_FUNMAP;
break;
}
}

return short_ok? flag : 0;
}

// Parses one colon-terminated mode prefix, if present.
static char parse_optional_mode_prefix(const char *s, char **endptr)
{
char prefix[8];
int i = 0;

for (; isalpha_c(*s) && i < sizeof prefix; s++)
prefix[i++] = tolower_c(*s);

if (*s != ':') return '\0';

*endptr = (char *) &s[1];

if (i < sizeof prefix) {
prefix[i] = '\0';
switch (prefix[0]) {
case 'a':
if (strcmp(prefix, "all") == 0) return '&';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above regarding strncasecmp possible alternative,

Although I note one benefit(?) of this is "fubar:" will return '\0' while also marching on endptr to skip past the unknown token. Whether skipping it is helpful or not I don't know, given it's failed.

else if (strcmp(prefix, "any") == 0) return '|';
else if (strcmp(prefix, "anynot") == 0) return '~';
break;

case 'e':
if (strcmp(prefix, "exact") == 0) return '=';
break;

case 'n':
if (strcmp(prefix, "none") == 0) return '^';
else if (strcmp(prefix, "notall") == 0) return '~';
break;
}
}

return '\0';
}

int sam_parse_flag_filter(const char *s, char mode, sam_flag_filter_t *filt)
{
while (*s) {
if (isalnum_c(*s) || *s == '_') {
char *end;
char newmode = parse_optional_mode_prefix(s, &end);
if (newmode) mode = newmode, s = end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that "anynot:10,exatc:20" is silently parsed as "anynot:10,anynot:20"? It looks it. I'd think it would be best for an error to be returned when the prefix is an unrecognised term. It also permits potential future updates while ensuring the newer terms being passed to older libraries will be noticed rather than giving incorrect results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

unsigned long flag = sam_parse_flag1(s, &end);
if (!flag) return -1;
s = end;

switch (mode) {
case '-': case '&':
filt->has_all = 1;
filt->all |= flag;
break;
case '!': case '^':
filt->has_none = 1;
filt->none |= flag;
break;
case '+': case '|':
filt->has_any = 1;
filt->any |= flag;
break;
case '~':
filt->has_omit_any = 1;
filt->omit_any |= flag;
break;
case '=':
filt->has_exact = 1;
filt->exact |= flag;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment - this is where I think it can simply set the all and none variables instead.

break;
default:
return -1;
}
}
else if (*s == ',')
s++;
else
mode = *s++;
}

return 0;
}

#define STRNCMP(a,b,n) (strncasecmp((a),(b),(n)) || strlen(a)!=(n))
int bam_str2flag(const char *str)
{
Expand Down