-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SAM flag filter specification strings and parsing & filtering API functions #1361
Conversation
@@ -56,7 +56,7 @@ lcbl. | |||
0x0040 1 the read is the first read in a pair | |||
0x0080 2 the read is the second read in a pair | |||
0x0100 s the alignment is not primary | |||
0x0200 f the read fails platform/vendor quality checks | |||
0x0200 x filtered out, e.g., by platform/vendor quality checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text change here reflects samtools/hts-specs#85.
Right now I think we have simple cases (albeit hard to rember) or if we want utmost flexibility we have the expression language to be very precise what bits we want / don't want. My gut tells me I'd prefer use of those over a new mini-language. (Although this may well be faster to execute.) I'll need to think more on this though. |
Agreed that now that there is an arbitrary expression language via But to the extent that they do… |
I think the basic idea is reasonable, although as noted you can get the same result with the existing expression language. I'm not at all sure about the punctuation symbols though, which reminds me of "write-only" languages like regular expression syntaxes or APL derivatives. They also have a high risk of shell catastrophe if you forget to quote them properly. Something like |
By Jove, I think you've cracked it! I think typing the close square bracket would be a bit of a pain and still at risk of quoting disasters, but
would work very nicely, perhaps keeping the punctuation characters as even shorter synonyms of these word specifiers that would be the canonical spellings of these specifiers. I think |
It's wordy, but the ambiguous nature of English words such as none vs not is why I suggested things like all-on, all-off, any-on, any-off in samtools/bcftools#1611 (comment) (well actually I had set/unset instead of on/off, but the latter is shorter). We also need to consider what "filter" means. My gut instinct when I see a filter is removing something - eg in the way that filter paper works, a fuel filter, a filter tip, etc. Admittedly in some contexts it's the reverse, such as a notch-filter for signals. Whatever we use it must be consistent, and this is where I feel the current terms are a bit of a mess as they go against (my) expectation of what a filter does. If it was "--require-flags none:<>,all:<> etc then it'd be explicit as to which direction it's applying. |
Indeed; perhaps As for “filter”, this PR adds IMHO the |
Add sam_flag_filter_t structure that encapsulates all the basic flag filtering checks; routines to evaluate whether a flag bitset or a bam1_t passes the checks; and a routine to create a sam_flag_filter_t by parsing a flag specification string. Flag specification strings provide unified syntax for specifying flag bitsets numerically, symbolically (using the sam.5 / samtools/htsjdk#232 syntax), or via bam_str2flag()'s flag names, and for choosing all/any/none etc filtering via mnemonic option argument prefixes rather than via a plethora of separate command line options.
e3bbf46
to
46ebc5b
Compare
I've updated this to use the word prefixes as workshopped between Rob and me, thus:
I'm torn between removing the punctuation codes and just having these, and keeping them for brevity in quick command line experimenting. I suspect trying to write the documentation will push me over to favouring just the word prefixes… 😄 |
To be clear, the purpose of this PR is to provide infrastructure that samtools and bcftools can use instead of needing their own code and multiple command line options to implement flag-based filtering. So if the maintainers like the approach, there would be followup samtools and bcftools PRs to use the new centralised facility. This allows usage such as
which I believe, and I hope the maintainers agree, is much more user-friendly and memorable than e.g. the For this PR to proceed, it would be useful for the maintainers to comment on the direction: in particular, I am now leaning towards dropping the punctuation characters entirely in favour of the word prefixes as per #1361 (comment). If the maintainers are in favour of that approach, I will rework |
Sorry, this was languishing as while in Draft we were assuming it was still in a state of flux. I assume by your comment you're happy with it now and want a review again. (Ie, I'll take another look soon.) |
if (filt->has_none && (flag & filt->none) != 0) return 0; | ||
if (filt->has_any && (flag & filt->any) == 0) return 0; | ||
if (filt->has_omit_any && (flag & filt->omit_any) == filt->omit_any) return 0; | ||
if (filt->has_exact && flag != filt->exact) return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we set filt->has_all = filt->has_exact
and filt->has_none = ~filt->has_exact
then I think this line (and ultimately exact
variables) are redundant. That's an implementation optimisation, and doesn't need to impact the public API and expected command line usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've been thinking about reworking it so exact:
is implemented in terms of those two as well. The downside is if someone specifies all three it will discard some of their settings, but no-one should be using exact:
in conjunction with the others anyway…
unsigned int all; // all of these flags | ||
unsigned int none; // none of these | ||
unsigned int any; // at least one of these | ||
unsigned int omit_any; // OMIT at least one of these |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took a while to figure out what this meant on the first read through. I think not_all
is a clearer term than omit_any
, or maybe any_unset
.
That gives groupings (reordered for clarity) of all/not-all and any/not-any, or all-set/all-unset and any-set/any-unset.
I don't mind if these are perhaps just added in comments for the sake of clarity, but mixing up different terms ("none", "omit") doesn't aid understanding of what is effectively a trivial bit check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been strenuously avoiding not_all
because it is ambiguous between “it must be that not all of these are 1“ and “it must be that all of these are not 1” — which are very different.
any_unset
is probably better though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note {any,all}-{set,unset} is what we ended up picking in bcftools too, albeit that was "skip" rather than "keep", as it was being used as a filter.
int i = 0, short_ok = 1; | ||
|
||
for (; isalpha_c(*s) || *s == '_'; s++) { | ||
if (i < sizeof name) name[i++] = toupper_c(*s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason for copying / uppercasing from s to name character by character before checking to avoid the POSIX strncasecmp
function? Note MSVC has _strnicmp
which is identical (albeit renamed, because... Microsoft). Plus the library already uses this function, so it being missing is likely only true on old MSVC releases given recent discussions.
It would seem simpler to just check s
direct for the long names. There's nothing specifically wrong with the code though so this isn't a hard "thou shalt rewrite" demand. :) Just curious why the decision to do it this way, and noting if it's to avoid POSIX-only things then there's no need.
See also bam_str2flag
. I wonder if we should be naming the function in a similar fashion, or at least referencing it in the comments for both as it may be unclear to people why there are two functions with apparently identical purposes. The lack of any comment above bam_str2flag
doesn't help (I know you'll find it if you hunt in the header file, but that doesn't help a casual reason of the .c file).
Maybe bam_str2flag
needs rewriting to be a loop around successive calls to sam_parse_flag1
so it's obvious how the two relate to each other (ie single flag vs multiple flags).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I have an irrational prejudice against strcasecmp
and friends. The claimed rational basis is that in this checking against a table of keywords scenario, using strncasecmp
implies O(N²) locale-aware character case conversions (really O(MN) where M is the number of entries in the table and N is the average word length) while canonicalising the input costs only O(N) case conversions.
Also they require you to add #include <strings.h>
because POSIX does not say that they are implied by <string.h>
. Blarg.
But admittedly computers are fast nowadays and strncasecmp
is indeed used elsewhere in htslib… 🤷
I have a followup commit that rewrites bam_str2flag
in terms of sam_parse_flag1
but I didn't want to muddy the waters with it just yet :-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: bam_str2flag
: we can do it in two stages fine.
Regarding strncasecmp
, I don't follow the notion of higher complexity. With a length limit it's just the same deal, and it can be used just as before in a switch statement to restrict the search space. At least I'd always assumed the only difference between strncasecmp and strncmp is the initial case permutation of each character (or word) before comparing. But anyway, as I say it's not a major concern. I was just curious why the extra temporary buffer was used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do M strncasecmp(input, keyword[i])
checks (for M valid keywords and with 0 <= i < M
), you have to canonicalise each character of input
M times and each character of each of the M keyword
strings once apiece — this calculation is complicated by only having to canonicalise the pairs of characters at the start of the strings that you actually need to compare, but is nonetheless O(MN).
OTOH if you write your keyword
table pre-canonicalised, then you only need to canonicalise each character of input
exactly once, which is O(N). In fact, it's exactly N character case conversions.
I said it was irrational! I may rewrite this using the combination of the existing switch (firstchar)
style with strncasecmp
. As you say, the extra fixed length buffer is not great either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, for M strings being compared of length N, I'd see both strncasecmp
and strcasecmp
being O(MN), but with different constant scaling factors.
Personally for printable strings I'd do something as trivial as if ((c1 & ~0x20) != (c2 & ~0x20)) break
(case insensitive) vs if (c1 != c2) break
(case sensitive), but looking at the code I see it's a bit more flexible and has a charmap[u1] != charmap[u2]
lookup instead. That basically means the canonicalisation is simply a constant slow down on any comparison.
Agreed it's perhaps slower than doing it once upfront to avoid the canonicalisation bit being repeated, but the complexity itself is the same order. Anyway, it's arguing over something that frankly irrelevant. You've got to be pretty special to have argument parsing being the bottleneck of your work. :-) I don't really mind either way.
prefix[i] = '\0'; | ||
switch (prefix[0]) { | ||
case 'a': | ||
if (strcmp(prefix, "all") == 0) return '&'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above regarding strncasecmp
possible alternative,
Although I note one benefit(?) of this is "fubar:" will return '\0' while also marching on endptr to skip past the unknown token. Whether skipping it is helpful or not I don't know, given it's failed.
if (isalnum_c(*s) || *s == '_') { | ||
char *end; | ||
char newmode = parse_optional_mode_prefix(s, &end); | ||
if (newmode) mode = newmode, s = end; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that "anynot:10,exatc:20" is silently parsed as "anynot:10,anynot:20"? It looks it. I'd think it would be best for an error to be returned when the prefix is an unrecognised term. It also permits potential future updates while ensuring the newer terms being passed to older libraries will be noticed rather than giving incorrect results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
break; | ||
case '=': | ||
filt->has_exact = 1; | ||
filt->exact |= flag; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment - this is where I think it can simply set the all
and none
variables instead.
That gets my thumbs up. I think I'd find the single character codes somewhat tricky to remember. It's like a nearly-regexp but not enough to be mnemonically unambiguous. |
What's the status of this, other that "draft"? I had a play again this morning and think there are still some open questions. Eg:
I wrote a tiny demo program just to explore, but I'm still struggling to understand some of it.
This does seem to imply though that there are issues with the parsing. Eg:
However 2 has been taken as the number 2 and not symbol 2. How to distinguish? It seems impossible. Edit: I can use "REVERSE", "DUP" etc, but "READ2" fails. I think this is because the token ends after the D due to the isalpha or '_' check, but I haven't verified that yet. |
As an attempt to fix the READ1/READ2 bug, my first stab was this:
It's a bit hacky, but it works. As expected, the issue is the first loop of My fix does cure this, but I'm thinking it's not the right one. Specifically if we do something like "UNMAP2REVERSE" then it's parsed as if it was "UNMAP,2,REVERSE" and gives us flag 0x16. I'm thinking it should be a parse error. Likely an |
The status of this is draft, and as of back in January it is being refactored as discussed at the time. So don't bother reading the current draft's code too closely, as I'll push something rather different when I get back to it. |
In samtools/samtools#1442 (comment) I wrote:
It turns out I was thinking of
find … -perm [+-]mode
syntax probably more than symbolic Unix permissions per se, but the point stands: that perhaps some other notation would solve the‑f
/‑‑ff
/‑‑rf
/etc plethora of confusing flag filtering options problem better than trying to come up with multiple more memorable overlapping option names.Motivated by the renewed attempt to come up with a new set of option names in samtools/bcftools#1611, here is a specific proposal for that:
Flag specification strings are of the form
#flag,…,flag,#flag,…,flag,…
, i.e., a comma-separated list of flag items (each of which is either a numeric flag bitset, a symbolic flag bitset (/[pPuUrR12sxdS_]+/
, as in sam.5 and samtools/htsjdk#232), or a single flag name (egPAIRED
)), each of which may be prefixed by a punctuation character (represented by#
in the previous template) that controls which filter the subsequent items contribute to (until the next punctuation character):- &
-
or&
! ^
!
or^
+ |
+
or|
~
~
=
=
(and no others)(
-
and+
are the same characters as used byfind … -perm
for similar tests, and&
and|
represent AND (all of these flags) and OR (any of these flags).!
and^
are as per regexp and glob character classes.~
suggests negative, but is a more tentative suggestion than the others.)The various samtools/bcftools commands would then use
sam_parse_flag_filter()
with different initialmode
arguments to parse their various flag filtering command line options. Hence for example the following would be equivalent:Is something like this better or worse than trying to come up with mnemonic long option names and mnemonic short option letters for all these possibilities?
IMHO such punctuation characters certainly have more mnemonic value than a plethora of even more cryptic
‑f
/‑F
/‑‑ff
/‑‑rf
/-G
short options. Whether they are more mnemonic than‑‑require‑flags
/‑‑excl[ude]‑flags
/‑‑incl[ude]‑flags
/‑‑something‑flags
long options is more of an open question: the long option names are quite distinctive from each other but as we've seen whether e.g.require
means&
or|
is not quite as obvious as would be ideal.