Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SAM flag filter specification strings and parsing & filtering API functions #1361

Closed
wants to merge 1 commit into from

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Nov 25, 2021

In samtools/samtools#1442 (comment) I wrote:

(I can never remember which option I want to use and have to look it up each time. What I would really like to have for this is to come up with some [+-=].* syntax something like chmod(1) symbolic permissions so that all these options could be unified into a single -f/--flags option — at which point these legacy options would become somewhat moot.)

It turns out I was thinking of find … -perm [+-]mode syntax probably more than symbolic Unix permissions per se, but the point stands: that perhaps some other notation would solve the ‑f/ ‑‑ff/ ‑‑rf/etc plethora of confusing flag filtering options problem better than trying to come up with multiple more memorable overlapping option names.

Motivated by the renewed attempt to come up with a new set of option names in samtools/bcftools#1611, here is a specific proposal for that:

Flag specification strings are of the form #flag,…,flag,#flag,…,flag,…, i.e., a comma-separated list of flag items (each of which is either a numeric flag bitset, a symbolic flag bitset (/[pPuUrR12sxdS_]+/, as in sam.5 and samtools/htsjdk#232), or a single flag name (eg PAIRED)), each of which may be prefixed by a punctuation character (represented by # in the previous template) that controls which filter the subsequent items contribute to (until the next punctuation character):

Chr To pass filtering and be kept, the read must…
- & …have ALL of the flags prefixed by either - or &
! ^ …have NONE of the flags prefixed by either ! or ^
+ | …have ANY (at least one) of the flags prefixed by either + or |
~ …NOT have at least one of the flags prefixed by ~
= …have EXACTLY the flags prefixed by = (and no others)

(- and + are the same characters as used by find … -perm for similar tests, and & and | represent AND (all of these flags) and OR (any of these flags). ! and ^ are as per regexp and glob character classes. ~ suggests negative, but is a more tentative suggestion than the others.)

The various samtools/bcftools commands would then use sam_parse_flag_filter() with different initial mode arguments to parse their various flag filtering command line options. Hence for example the following would be equivalent:

samtools view -f '3|12^48~192' …

samtools view -f '&1,2' -f '|4,8' -f '^16,32' -f '~64,128' …

samtools view -f '&1,2,|4,8,^16,32,~64,128' …

samtools view -f 1 -f 2 --rf 4 --rf 8 -F 16 -F 32 -G 64 -G 128 …

samtools view --require-flags 3 --include-flags 12 --exclude-flags 48 -G 192 …

Is something like this better or worse than trying to come up with mnemonic long option names and mnemonic short option letters for all these possibilities?

IMHO such punctuation characters certainly have more mnemonic value than a plethora of even more cryptic ‑f/‑F/‑‑ff/‑‑rf/-G short options. Whether they are more mnemonic than ‑‑require‑flags/ ‑‑excl[ude]‑flags/ ‑‑incl[ude]‑flags/ ‑‑something‑flags long options is more of an open question: the long option names are quite distinctive from each other but as we've seen whether e.g. require means & or | is not quite as obvious as would be ideal.

@@ -56,7 +56,7 @@ lcbl.
0x0040 1 the read is the first read in a pair
0x0080 2 the read is the second read in a pair
0x0100 s the alignment is not primary
0x0200 f the read fails platform/vendor quality checks
0x0200 x filtered out, e.g., by platform/vendor quality checks
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text change here reflects samtools/hts-specs#85.

@jkbonfield
Copy link
Contributor

Right now I think we have simple cases (albeit hard to rember) or if we want utmost flexibility we have the expression language to be very precise what bits we want / don't want.

My gut tells me I'd prefer use of those over a new mini-language. (Although this may well be faster to execute.) I'll need to think more on this though.

@jmarshall
Copy link
Member Author

jmarshall commented Nov 25, 2021

Agreed that now that there is an arbitrary expression language via view -e et al, ideally most of these bespoke flag options are obsolete and wouldn't need to exist.

But to the extent that they do… ‑f/‑F/‑‑ff/‑‑rf/-G and -f -&!^+|~= are both mini-languages for the simple cases. The question is which one is more mnemonic.

@daviesrob
Copy link
Member

I think the basic idea is reasonable, although as noted you can get the same result with the existing expression language. I'm not at all sure about the punctuation symbols though, which reminds me of "write-only" languages like regular expression syntaxes or APL derivatives. They also have a high risk of shell catastrophe if you forget to quote them properly. Something like -f all[<list>],any[<list>],none[<list>],not[<list>] where <list> is a list of flag specifiers as you already use might be a bit easier to understand and remember.

@jmarshall
Copy link
Member Author

By Jove, I think you've cracked it!

I think typing the close square bracket would be a bit of a pain and still at risk of quoting disasters, but

-f all:<list>,any:<list>,none:<list>,not:<list>

would work very nicely, perhaps keeping the punctuation characters as even shorter synonyms of these word specifiers that would be the canonical spellings of these specifiers.

I think not:… needs a bit more workshopping, as the distinction between the meanings of none: and not: will not be clear in isolation. (This is where the punctuation characters have an advantage, as ~ doesn't particularly have a misleading inherent meaning…)

@jkbonfield
Copy link
Contributor

It's wordy, but the ambiguous nature of English words such as none vs not is why I suggested things like all-on, all-off, any-on, any-off in samtools/bcftools#1611 (comment) (well actually I had set/unset instead of on/off, but the latter is shorter).

We also need to consider what "filter" means. My gut instinct when I see a filter is removing something - eg in the way that filter paper works, a fuel filter, a filter tip, etc. Admittedly in some contexts it's the reverse, such as a notch-filter for signals. Whatever we use it must be consistent, and this is where I feel the current terms are a bit of a mess as they go against (my) expectation of what a filter does. If it was "--require-flags none:<>,all:<> etc then it'd be explicit as to which direction it's applying.

@jmarshall
Copy link
Member Author

Indeed; perhaps notall:<list> or anynot:<list> (or accept either) would be clearer words to use than not: for the hard-to-name entry.

As for “filter”, this PR adds flag_filter routines to the htslib API that are modelled on the naming of hts_expr.h's filter routines. How it would be best described in the user-facing samtools.1 documentation is a separate question.

IMHO the f in samtools view -f stands for “flag” 😄

Add sam_flag_filter_t structure that encapsulates all the basic flag
filtering checks; routines to evaluate whether a flag bitset or a bam1_t
passes the checks; and a routine to create a sam_flag_filter_t by parsing
a flag specification string.

Flag specification strings provide unified syntax for specifying flag
bitsets numerically, symbolically (using the sam.5 / samtools/htsjdk#232
syntax), or via bam_str2flag()'s flag names, and for choosing all/any/none
etc filtering via mnemonic option argument prefixes rather than via a
plethora of separate command line options.
@jmarshall
Copy link
Member Author

I've updated this to use the word prefixes as workshopped between Rob and me, thus:

samtools view -f all:3,any:12,none:48,anynot:192 …

# or equivalently…

samtools view -f 3 -f any:12 -f none:48 -f anynot:192 …

I'm torn between removing the punctuation codes and just having these, and keeping them for brevity in quick command line experimenting. I suspect trying to write the documentation will push me over to favouring just the word prefixes… 😄

@jmarshall
Copy link
Member Author

jmarshall commented Jan 18, 2022

To be clear, the purpose of this PR is to provide infrastructure that samtools and bcftools can use instead of needing their own code and multiple command line options to implement flag-based filtering. So if the maintainers like the approach, there would be followup samtools and bcftools PRs to use the new centralised facility.

This allows usage such as

samtools view -f all:3 -f any:12 …
bcftools mpileup --ff all:3,any:12 …
bcftools mpileup --ff all:3 --ff any:12 …

which I believe, and I hope the maintainers agree, is much more user-friendly and memorable than e.g. the --ls/--ns/--lu/--nu command line options proposed in PR samtools/bcftools#1611.

For this PR to proceed, it would be useful for the maintainers to comment on the direction: in particular, I am now leaning towards dropping the punctuation characters entirely in favour of the word prefixes as per #1361 (comment). If the maintainers are in favour of that approach, I will rework sam_parse_flag_filter()'s initial mode parameter etc accordingly and finish up the proposed documentation.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jan 18, 2022

Sorry, this was languishing as while in Draft we were assuming it was still in a state of flux. I assume by your comment you're happy with it now and want a review again. (Ie, I'll take another look soon.)

if (filt->has_none && (flag & filt->none) != 0) return 0;
if (filt->has_any && (flag & filt->any) == 0) return 0;
if (filt->has_omit_any && (flag & filt->omit_any) == filt->omit_any) return 0;
if (filt->has_exact && flag != filt->exact) return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set filt->has_all = filt->has_exact and filt->has_none = ~filt->has_exact then I think this line (and ultimately exact variables) are redundant. That's an implementation optimisation, and doesn't need to impact the public API and expected command line usage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've been thinking about reworking it so exact: is implemented in terms of those two as well. The downside is if someone specifies all three it will discard some of their settings, but no-one should be using exact: in conjunction with the others anyway…

unsigned int all; // all of these flags
unsigned int none; // none of these
unsigned int any; // at least one of these
unsigned int omit_any; // OMIT at least one of these
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a while to figure out what this meant on the first read through. I think not_all is a clearer term than omit_any, or maybe any_unset.

That gives groupings (reordered for clarity) of all/not-all and any/not-any, or all-set/all-unset and any-set/any-unset.

I don't mind if these are perhaps just added in comments for the sake of clarity, but mixing up different terms ("none", "omit") doesn't aid understanding of what is effectively a trivial bit check.

Copy link
Member Author

@jmarshall jmarshall Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been strenuously avoiding not_all because it is ambiguous between “it must be that not all of these are 1“ and “it must be that all of these are not 1” — which are very different.

any_unset is probably better though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note {any,all}-{set,unset} is what we ended up picking in bcftools too, albeit that was "skip" rather than "keep", as it was being used as a filter.

int i = 0, short_ok = 1;

for (; isalpha_c(*s) || *s == '_'; s++) {
if (i < sizeof name) name[i++] = toupper_c(*s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason for copying / uppercasing from s to name character by character before checking to avoid the POSIX strncasecmp function? Note MSVC has _strnicmp which is identical (albeit renamed, because... Microsoft). Plus the library already uses this function, so it being missing is likely only true on old MSVC releases given recent discussions.

It would seem simpler to just check s direct for the long names. There's nothing specifically wrong with the code though so this isn't a hard "thou shalt rewrite" demand. :) Just curious why the decision to do it this way, and noting if it's to avoid POSIX-only things then there's no need.

See also bam_str2flag. I wonder if we should be naming the function in a similar fashion, or at least referencing it in the comments for both as it may be unclear to people why there are two functions with apparently identical purposes. The lack of any comment above bam_str2flag doesn't help (I know you'll find it if you hunt in the header file, but that doesn't help a casual reason of the .c file).

Maybe bam_str2flag needs rewriting to be a loop around successive calls to sam_parse_flag1 so it's obvious how the two relate to each other (ie single flag vs multiple flags).

Copy link
Member Author

@jmarshall jmarshall Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have an irrational prejudice against strcasecmp and friends. The claimed rational basis is that in this checking against a table of keywords scenario, using strncasecmp implies O(N²) locale-aware character case conversions (really O(MN) where M is the number of entries in the table and N is the average word length) while canonicalising the input costs only O(N) case conversions.

Also they require you to add #include <strings.h> because POSIX does not say that they are implied by <string.h>. Blarg.

But admittedly computers are fast nowadays and strncasecmp is indeed used elsewhere in htslib… 🤷

I have a followup commit that rewrites bam_str2flag in terms of sam_parse_flag1 but I didn't want to muddy the waters with it just yet :-).

Copy link
Contributor

@jkbonfield jkbonfield Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: bam_str2flag: we can do it in two stages fine.

Regarding strncasecmp, I don't follow the notion of higher complexity. With a length limit it's just the same deal, and it can be used just as before in a switch statement to restrict the search space. At least I'd always assumed the only difference between strncasecmp and strncmp is the initial case permutation of each character (or word) before comparing. But anyway, as I say it's not a major concern. I was just curious why the extra temporary buffer was used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do M strncasecmp(input, keyword[i]) checks (for M valid keywords and with 0 <= i < M), you have to canonicalise each character of input M times and each character of each of the M keyword strings once apiece — this calculation is complicated by only having to canonicalise the pairs of characters at the start of the strings that you actually need to compare, but is nonetheless O(MN).

OTOH if you write your keyword table pre-canonicalised, then you only need to canonicalise each character of input exactly once, which is O(N). In fact, it's exactly N character case conversions.

I said it was irrational! I may rewrite this using the combination of the existing switch (firstchar) style with strncasecmp. As you say, the extra fixed length buffer is not great either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, for M strings being compared of length N, I'd see both strncasecmp and strcasecmp being O(MN), but with different constant scaling factors.

Personally for printable strings I'd do something as trivial as if ((c1 & ~0x20) != (c2 & ~0x20)) break (case insensitive) vs if (c1 != c2) break (case sensitive), but looking at the code I see it's a bit more flexible and has a charmap[u1] != charmap[u2] lookup instead. That basically means the canonicalisation is simply a constant slow down on any comparison.

Agreed it's perhaps slower than doing it once upfront to avoid the canonicalisation bit being repeated, but the complexity itself is the same order. Anyway, it's arguing over something that frankly irrelevant. You've got to be pretty special to have argument parsing being the bottleneck of your work. :-) I don't really mind either way.

prefix[i] = '\0';
switch (prefix[0]) {
case 'a':
if (strcmp(prefix, "all") == 0) return '&';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above regarding strncasecmp possible alternative,

Although I note one benefit(?) of this is "fubar:" will return '\0' while also marching on endptr to skip past the unknown token. Whether skipping it is helpful or not I don't know, given it's failed.

if (isalnum_c(*s) || *s == '_') {
char *end;
char newmode = parse_optional_mode_prefix(s, &end);
if (newmode) mode = newmode, s = end;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that "anynot:10,exatc:20" is silently parsed as "anynot:10,anynot:20"? It looks it. I'd think it would be best for an error to be returned when the prefix is an unrecognised term. It also permits potential future updates while ensuring the newer terms being passed to older libraries will be noticed rather than giving incorrect results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

break;
case '=':
filt->has_exact = 1;
filt->exact |= flag;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above comment - this is where I think it can simply set the all and none variables instead.

@jkbonfield
Copy link
Contributor

For this PR to proceed, it would be useful for the maintainers to comment on the direction: in particular, I am now leaning towards dropping the punctuation characters entirely in favour of the word prefixes as per #1361 (comment). If the maintainers are in favour of that approach, I will rework sam_parse_flag_filter()'s initial mode parameter etc accordingly and finish up the proposed documentation.

That gets my thumbs up. I think I'd find the single character codes somewhat tricky to remember. It's like a nearly-regexp but not enough to be mnemonically unambiguous.

@jkbonfield
Copy link
Contributor

jkbonfield commented Apr 26, 2022

What's the status of this, other that "draft"?

I had a play again this morning and think there are still some open questions. Eg:

  • Why does the external struct fields such as omit_any, but the parse API uses string "notall". We should be consistent.

  • Lack of testing. Could we get a test tool?

  • Lack of documentation. Eg "notall" string occurs nowhere in the header file, only in the source. Some of it needs more improvement too. Eg what does #flag mean? It didn't accept a literal # and it's not clear it means numeric given we mix real numbers with symbolic flag numbers (1 and 2). Some examples would help.

I wrote a tiny demo program just to explore, but I'm still struggling to understand some of it.

#include <stdio.h>                                                                                                      
#include "htslib/sam.h"                                                         
                                                                                
int main(int argc, char **argv) {                                               
    sam_flag_filter_t filt = {0};                                               
                                                                                
    printf("Parse=%d\n", sam_parse_flag_filter(argv[2], *argv[1], &filt));      
    printf("%d    all\t%x\n",     filt.has_all,      filt.all);                 
    printf("%d    none\t%x\n",    filt.has_none,     filt.none);                
    printf("%d    any\t%x\n",     filt.has_any,      filt.any);                 
    printf("%d    omitany\t%x\n", filt.has_omit_any, filt.omit_any);            
    printf("%d    exact\t%x\n",   filt.has_exact,    filt.exact);               
                                                                                
    return 0;                                                                   
}                                                                               

This does seem to imply though that there are issues with the parsing. Eg:

$ ./a.out + 'd,R,2'
Parse=0
0    all	0
0    none	0
1    any	422
0    omitany	0
0    exact	0

d, R and 2 are all listed in the documentation as symbolic names:

Flag specification strings are of the form #flag,...,flag,#flag,...,flag,...,
i.e., a comma-separated list of flag items (each of which is either a numeric
flag bitset, a symbolic (/[pPuUrR12sxdS_]+/) flag bitset, or a single flag
name (eg "PAIRED")), each of which may be prefixed by a punctuation character
that controls which filter the subsequent items contribute to (until the next
punctuation character):

However 2 has been taken as the number 2 and not symbol 2. How to distinguish? It seems impossible.

Edit: I can use "REVERSE", "DUP" etc, but "READ2" fails. I think this is because the token ends after the D due to the isalpha or '_' check, but I haven't verified that yet.

@jkbonfield
Copy link
Contributor

As an attempt to fix the READ1/READ2 bug, my first stab was this:

@@ -5014,8 +5018,16 @@ static unsigned long sam_parse_flag1(const char *s, char **endptr)
 
         case 'R':
             if (strcmp(name, "REVERSE") == 0) return BAM_FREVERSE;
-            else if (strcmp(name, "READ1") == 0) return BAM_FREAD1;
-            else if (strcmp(name, "READ2") == 0) return BAM_FREAD2;
+            else if (strcmp(name, "READ") == 0) {
+                if (*s == '1') {
+                    (*endptr)++;
+                    return BAM_FREAD1;
+                }
+                if (*s == '2') {
+                    (*endptr)++;
+                    return BAM_FREAD2;
+                }
+            }
             break;

It's a bit hacky, but it works. As expected, the issue is the first loop of for (; isalpha_c(*s) || *s == '_'; s++) truncates on "READ", leaving name and endptr incorrect.

My fix does cure this, but I'm thinking it's not the right one. Specifically if we do something like "UNMAP2REVERSE" then it's parsed as if it was "UNMAP,2,REVERSE" and gives us flag 0x16. I'm thinking it should be a parse error. Likely an isalnum above (starting from an isalpha for first char) instead would make these an error while also fixing the READ1/READ2 parsing.

@jmarshall
Copy link
Member Author

The status of this is draft, and as of back in January it is being refactored as discussed at the time. So don't bother reading the current draft's code too closely, as I'll push something rather different when I get back to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants