-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Readability vs Context-Sensitive Validity #29
Comments
I think it's reasonable and possible to "escape everything" without impairing readability too much. Numeric literals are the most "surprising", but they would only need be escaped if they are the first character in the string. Similarly, I believe only : ! also only need to be escaped at the beginning of the string. Being able to use the output in character classes and after '(' in groups seems definitely worth it, in terms of making this useful to programmers without surprising corner cases. The output is not meant to be human readable. |
What is the perceived value in readability of the escaped string? It seems to me that the primary usage is going to be along the lines of Surely the final regex will be potentially less readable, but again, a dynamic regex from user input? I imaging that will be immediately used for matches, a test, or a replacement, and the intermediary regex discarded. |
I often had to debug the resulting string of a compound regular expression, having a less readable string is a real pain and the discussions in Python and Perl prove that I'm not the only one with that problem as they've opted out of that strategy.
Yes, numeric literals would need escaping at the beginning of the string only as would
I tend to think so too. What do you think about the
Well, I'd like to do my best to keep it as readable as possible given the changes and trends in other languages. I hold Python in high regard and they've reached a reverse conclusion (moving from "escape everything" to "escape only a small set"). (cc @ljharb ) |
I have no problem with escaping So basically: put me in the camp that says we should escape everything that might be a gotcha in a reasonable use case, including |
Is I also don’t see how |
@minitech good point, the :! Escaping is for looks lookahead/behind |
Sure, escape leading a-f. |
Wouldn't it need to escape any hex character then? eg, |
@benjamingr, could you give an example of where escaping |
Sure var a = "(?" + RegExp.escape(":") + ")";
var b = RegExp(a); // should error, but doesn't |
Escaping :? example: new RegExp("("+RegExp.escape(x)+")")
And in response to the other question, yes, we were escaping [0-9] already
for octal escapes and back references. So we're just proposing to add
[a-fA-F], and only as the initial character.
|
To clarify, does this mean we'd rather go in the "escape on the really safe side" way and not the "escape without context assumptions" way? |
Yes. Encode the fewest characters consistent with safety. I expect that
|
@benjamingr, I don’t think it’s important to do that; |
Right, you'd sometimes get a syntax error, and sometimes not - depending on On Fri, Jun 26, 2015 at 1:27 AM, Ryan O’Hara notifications@github.com
|
I don't think hex need escaping as they're different from the capture group number reference because |
Consider `new RegExp('\ufff'+RegExp.escape(x))`
It would be better (more deterministic) if that were always an error.
|
Naw, you're getting into completing partially escaped things which is a rabbit hole. What if they did RegExp('\\' + RegExp.escape(whatever)) I think it's reasonable to scope it to completed escape sequences that would have their meaning changed if an unescape character were added. |
I agree it's a corner case. It would be nice to draw a bright line here. But if you'd like to say that the first character (whatever it is) should |
The reason I brought it up is that |
I agree that : and ! are the same issue as [a-f] and should be treated
consistently. ? is different, of course.
|
I should add that what I'd really like to see is a concise and accurate
statement of the goal. It occurs to me that something like, "the behavior
should be identical to the behavior you'd get if every character were
replaced with its octal escape" would give us a firm basis. Then we could
discuss "optimizations" that would preserve extra "readability" while
preserving the same behavior.
That approach seems more sound than selecting an ad hoc set of escapes.
|
@cscott the first guarantee I'd like to give is that passing the result of A safety guarantee would look like "the behavior should be identical to the behavior you'd get if the escaped sequence fulfills the first guarantee and in addition it is wrapped in a non-capturing group". An octal escape guarantee is also OK I guess. That said, an ad-hoc set of escapes is perfectly fine, there are only very few special cases in RegExp in JavaScript with context since RegExp syntax is pretty simple after all. You're either inside a character set or not, and you might be added as part of an escape sequence. Those are pretty much all the special cases that need to be addresse. |
@minitech, @benjamingr: Consider:
versus:
and then what happens with the latter code if It would also be nice to be able to do:
that is, to be able to easily get a version of |
IMO it would suffice to just have two modes: escape every character ever, and escape the characters that are necessary in |
@cscott interesting, would you mind coming up with a use case where we'd want to escape every character, or escape every number? Also, what would |
@benjamingr The use case to escape every character is to accomodate users like the I think allowing a flexible "extra escape" mode would eliminate the need for some of the oddball escapes, such as @allenwb's // "escapeAll" -- encode all characters as shortest-possible escape sequence
var escapeHelper = function(x) {
var cp = x.codePointAt(0), len = cp < 0x10000 ? 1 : 2;
var rest = x.length > len ? escapeHelper(x.slice(len)) : '';
if (cp == 0) {
return '\\0' + rest; // XXX is this desirable?
} else if (cp <= 26) {
return '\\c' + String.fromCodePoint(64 + cp) + rest; // XXX is this desirable?
} else if (cp <= 0x7F && !/^[DdWwSstrnvf0CcXxUuBb0-9]/.test(x)) {
return '\\' + String.fromCodePoint(cp) + rest; // XXX is this desirable?
} else {
var result = x.toString(16);
switch(result.length) {
case 1: return '\\x0' + result + rest;
case 2: return '\\x' + result + rest;
case 3: return '\\u0' + result + rest;
case 4: return '\\u' + result + rest;
default: return '\\u{' + result + '}' + rest; // requires u flag to RegExp
}
}
}
RegExp.escape = function(str, regex) {
// XXX: type check str and regexp?
if (typeof regex == 'string') {
// optional, but it illustrates an interesting use case for RegExp.escape
regex = new RegExp('[' + RegExp.escape(regex) + ']', 'gu');
}
var re = /[-\\^$*+?.()[\]{}]|^[0-9A-Fa-f]/gu;
if (regex!==undefined) re = re.alt(regex); // RegExp combinator
// XXX should we force 'g' and 'u' flags on re?
return String(s).replace(re, escapeHelper);
} I'm inclined to think it would be better if This uses RegExp.prototype.alt = function(re) {
return new RegExp('(?:'+this.source+')|(?:'+RegExp(re).source+')', this.flags + re.flags);
} I think there are some questions about how the flags should be combined here, which generalize to questions about the second parameter of |
@cscott why is |
@benjamingr As noted above, we need to protect leading hex digits, in case of:
The first one was uncontroversial, I think. The second follows from @bergus' formalization of the goal. |
Oh, you are absolutely right. I mistook what the ^ meant with the gu flags. That makes sense. So, you're proposing:
I'm leaning towards this idea at the moment minus the string as the second parameter. Why are |
@benjamingr I'm using @bergus' statement of the goal, which means that You can always do I could be persuaded otherwise, of course. I think if you use my goal (something like "the semantics should be the same as if every character was replaced by its unicode escape") you'd have to add those characters, and also tweak Oh -- and I think r = new RegExp('[' + RegExp.escape(chars, /-/g) + ']', 'g'); instead. Maybe we should tweak @bergus' goal to mention character classes. |
@cscott Ah, an implementation!
We consider usage of |
Don't you think earlier errors are worth it though? I can look into finding data to see if people actually concat strings this way. |
class SafeRegExp extends RegExp { ... }
SafeRegExp.escape = function(s, re) {
return super.escape(s, re ? /x/g.alt(re) : /x/g); // case-sensitive
}
// what should this print? Is the `i` infectious? ignored?
console.log(SafeRegExp.escape('XY', /y/i));
// ideally that would print "X\\u0059" but the implementation is tricky. |
Astonishing. Confusing. OK, it seems to somehow work in regex, but not in string literals.
I get
I can see :-) Either output should be acceptable imo.
Yup, though that seems to be solved OK. Maybe we should just loop the string manually and test the standard |
Hm -- is there any reason we should add And I think we should use the standard See https://mathiasbynens.be/notes/es6-unicode-regex for more details. So I think the above implementation is okay, but I'm mentioning the issues here so others can sanity-check my logic. (Note that with the EDIT: I added |
@bergus yep, you're right about the
I think that implementation would be reasonable for a polyfill. It takes a little care, since without the But as a matter of spec, it wouldn't be great if the Nth subclass of RegExp had to scan through the string N times looking for each component regexp. And you really want to be able to combine the regexps using something like I think this is one of the things it would be easier to write correctly in the spec than implement as a javascript polyfill, since in the spec you have access to the RegExp internals and aren't limited to the methods exposed to user code. |
Everyone, please see https://github.com/benjamingr/RegExp.escape/blob/master/EscapedChars.md on what characters we're escaping in each proposal. Comments appreciated here (or in separate issues about specific characters). LGTMs (and DLGTMs) also appreciated. |
From this comment (#35):
Notes:
|
To be completely honest I'm not sure what _ControlLetter_s are for and how On Mon, Jul 6, 2015 at 5:44 PM, André Bargull notifications@github.com
|
Control letter -> ASCII characters below U+0020. |
@anba thanks. If you see anything else we missed I'd love to hear it. I was unable to find usages of this on GH. |
Not that these codepoints are not 'letters', which is why the wiki refers to them as control characters ;-) |
According to http://ecma-international.org/ecma-262/6.0/#sec-forbidden-extensions if |
This issue is to be decided in the next TC meeting. Will update. |
Currently I see two opposites:
RegExp.escape
to support passing in a context-free way to theRegExp
constructor. This means dropping]
and}
from the escaped set and telling people that they should be aware of context. This would provide a very readable output./
to allow eval** as @allenwb has suggested and escaping capturing group identifiers:
and!
.It sounds like more people tend to prefer the latter, I'm not convinced because of lacking usage data indicating that the problems it solves happen in real code (data indicates otherwise) but a single counter-example would go a long way to convince me that this is a problem we need to address. I'm very tempted by the safety guarantees it provides for context sensitivity.
** but not whitespace as there is no ״ignore whitespace mode" in JS.
The text was updated successfully, but these errors were encountered: