Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make S"..." and "..." throw errors identically #107

Closed
StefanKarpinski opened this issue Jul 11, 2011 · 25 comments
Closed

make S"..." and "..." throw errors identically #107

StefanKarpinski opened this issue Jul 11, 2011 · 25 comments
Assignees

Comments

@StefanKarpinski
Copy link
Sponsor Member

We're close to this, but the error messaging of these two forms is still quite different — the cases in which errors occur is (nearly) the same, but the way errors are shown is quite different. For maximum seamlessness between the two forms, all errors should appear the same.

@ghost ghost assigned StefanKarpinski Jul 11, 2011
@vtjnash
Copy link
Sponsor Member

vtjnash commented Feb 1, 2013

@StefanKarpinski when do these throw an error? and what is an example of the output in each case?

@JeffBezanson
Copy link
Sponsor Member

I guess this is an example:

julia> "\xz"
ERROR: syntax: invalid escape sequence

julia> E"\xz"
ERROR: \x used with no following hex digits

I almost wonder if escape sequences and interpolation should just be features of all strings, period, and custom string literals are just a way to invoke a macro on that parser output. For interpolation, you'd get multiple arguments, some of which are strings. This has the advantage of being really predictable --- the escaping rules never change --- plus custom string literals can support interpolation, and you don't have to know about calling unescape or interp_parse. \ and " are unavoidably metacharacters (for example L"\" is never going to work), so why not just add $ to that list.
Or if there were no interpolation, it would be even better, but let's not gild the lilly :)

cc @nolta

@StefanKarpinski
Copy link
Sponsor Member Author

Regular expressions would not fare well under such a plan.

@JeffBezanson
Copy link
Sponsor Member

All you have to do is escape $ and . I'm not even sure it's unequivocally good that regexes don't support escape sequences right now.

@pao
Copy link
Member

pao commented Mar 8, 2013

Not sure what you're getting at Jeff. PCRE supports escapes just fine:

julia> m = match(r"\012", "\012")
RegexMatch("\n")

@JeffBezanson
Copy link
Sponsor Member

Oh, I saw this:

julia> r"\u2200"
ERROR: compile: PCRE does not support \L, \l, \N{name}, \U, or \u at position 2 in "\\u2200"

So we are dealing with 2 escape syntaxes, which is also not great.

@pao
Copy link
Member

pao commented Mar 8, 2013

Ahh, fair enough.

@StefanKarpinski
Copy link
Sponsor Member Author

Ah, yeah, that's a good point. Slashes, however still present a problem as far as I can tell. How can you possibly make r"\d" work?

@StefanKarpinski
Copy link
Sponsor Member Author

Backslashes, rather.

@JeffBezanson
Copy link
Sponsor Member

Under this design r"<string>" would always be equivalent to Regex("<string>") with the exact same text in both cases. So you would write r"\\d".

@StefanKarpinski
Copy link
Sponsor Member Author

Honestly, I feel that's unacceptable.

@JeffBezanson
Copy link
Sponsor Member

In that case it might help just to get rid of the str macros that produce strings, and only use custom string literals for different types of objects. The idea then is that custom string literals are different enough not to want the same features that normal string literals have.

Meanwhile, b_str and I_str are now special cases in the parser since they support interpolation, so they are no longer legit string macros. But maybe b_str isn`t as useful as we thought; we haven't used it anywhere (though maybe somebody has).

So you see I find this messy and unsatisfying. True, r"\\d" is very annoying, but there are no deep weird fiddly bugs involved. However we could improve the situation without changing the language by removing all str macros except r_str and v_str.

One telling flaw is that a str macro can't end with a single backslash. Another one, AFAICT, is that there is no way to put a " in a regex literal without also inserting a backslash. Granted those cases are rarer than \d, but they are why I prefer to have one universal escaping scheme and stick with it. Nobody knows what r"\"" means.

Come to think of it, maybe this is a regression?

julia> r"\""
r"\\""

If these things don't support escapes, then we should say you just can't have embedded quote characters, and you get whatever text is before the next " no matter what.

@pao
Copy link
Member

pao commented Mar 8, 2013

Another one, AFAICT, is that there is no way to put a " in a regex literal without also inserting a backslash.

That could be fixed with allowing triplequoting for regexes, which would be useful with the m suffix anyways.

@StefanKarpinski
Copy link
Sponsor Member Author

There are a few separate issues at play here for non-standard string literals:

  1. The parser handling quote escaping (good).
  2. The parser handling general escaping (bad).
  3. The parser handling interpolation (mixed).

The reason the L"\"" thing is weird is the same reason you can't put a single " into a regex literal without a backslash: the parser knows to ignore escaped quotes for parsing but doesn't remove them. This is not really right and there's no fully correct way around it; the solution is that the parser should handle a layer of escaping. This is slightly tricky to make so it interacts well with additional layers of escaping, but it can be done.

The issue of how regular expressions handle things like r"\u2200" is just a bug. I passed things through to PCRE as-is because it seemed to work and I didn't really know what else to do at the time. The correct behavior is to provide a full, correct translation from surface escapes to escapes and/or raw unicode data that PCRE understands. This is actually an argument against the parser handling escaping, rather than for it, since otherwise we'll never have the flexibility and power to get this right.

We've come around to the idea that the parser should handle interpolation in standard string literals, which I think is good, and has nothing to do with general escaping in standard or non-standard string literals. When x"\a$b\c" is encountered, this could be parsed as a call to @x_str L"\a" b L"\c" so that the parser handles everything dealing with quoting but the varargs @x_str macro still gets to handle escaping (typically using the escape_string utility function). This would even allow interpolation into regular expressions, although, of course, that would entail recompilation on every execution. Non-standard string literals that don't handle interpolation at all could simply emit a clear error message if you try to do interpolation.

The main downside of doing interpolation for all string literals is that you can't have non-standard literals where $ as just a character. That's especially problematic for regular expressions where $ already has a meaning and is extremely common. Perl handles this in classic fashion by cleverly disambiguating the two cases for you in the parser. We could try to do something like that, but it would get pretty insane.

I want to point out that my original proposal was that the parser handle string interpolation for standard string literals while no interpolation would be done for non-standard string literals. I'm not clear on why that's not an acceptable solution. The only counterargument I've seen anywhere is b"" strings, which do binary interpolation in their current form, but as @JeffBezanson pointed out, are not used anywhere.

@nolta
Copy link
Member

nolta commented Mar 8, 2013

This might be a band-aid, but it appears we can fix the r"\u2200" problem by adding JAVASCRIPT_COMPAT to the list of DEFAULT_OPTS in regex.jl:

julia> s = "\u2200"
"∀"

julia> ismatch(r"\u2200", s)
true

@StefanKarpinski
Copy link
Sponsor Member Author

Well, that's certainly better than failing :-)

nolta added a commit that referenced this issue Mar 8, 2013
@JeffBezanson
Copy link
Sponsor Member

Thanks mike for fixing that.

I don't see why it's bad for the parser to handle escaping in normal strings. It's part of the lexical syntax. There is also escaping in char literals, which there aren't macros for.

In custom string literals, the options are (1) have no parser escaping of quotes at all, or (2) make the sequences \\ and \" special, for inserting a single backslash or a quote. So you could still have r"\d", but r"\\" would be interpreted by the parser.

@StefanKarpinski
Copy link
Sponsor Member Author

I have no problem with the parser handling escaping in normal strings or characters. That's completely sensible. I was exclusively talking about non-standard strings, and I was arguing for option (2), making sure that it plays well with subsequent layers of unescaping. That still doesn't address the interpolation issue.

@JeffBezanson
Copy link
Sponsor Member

Ok, so r"\\" will be a regex of a single backslash (which gives a PCRE error, which is fine)? This is fine with me.

For interpolation, let's just get rid of b_str and friends. Or we could keep only b_str (since it yields an array, not a string), but not have it support interpolation. The less interpolation the better :)

@StefanKarpinski
Copy link
Sponsor Member Author

Sure, that seems fine. As long as r"\d" works. I haven't entirely thought through how the parser unescaping needs to interact with the macro unescaping though. Seems like you have a handle on it.

@StefanKarpinski
Copy link
Sponsor Member Author

Sorry.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 9, 2013

so, how are these supposed to be written now:

    const path_separator    = "\\"
    const path_separator_re = r"[/\\]+"
    const path_absolute_re  = r"^(?:\w+:)[/\\]"
    const path_directory_re = r"(?:^|[/\\])\.{0,2}$"
    const path_dir_splitter = r"^(.*?)([/\\]+)([^/\\]*)$"
    const path_ext_splitter = r"^((?:.*[/\\])?(?:\.|[^/\\\.])[^/\\]*?)(\.[^/\\\.]*|)$"

    function splitdrive(path::String)
        m = match(r"^(\w+:|\\\\\w+\\\w+|\\\\\?\\UNC\\\w+\\\w+|\\\\\?\\\w+:|)(.*)$", path)
        m.captures[1], m.captures[2]
    end

@JeffBezanson
Copy link
Sponsor Member

There are two options: replace \\ with \\\\, or change the parser again so that only \" is special in custom string literals. The second option is better for the code here, and its only cost is that a regex literal can't end in a backslash.

@JeffBezanson
Copy link
Sponsor Member

Ok for now I have put it so those regexes will continue to work as before.

@mgkuhn
Copy link
Contributor

mgkuhn commented Aug 3, 2022

The “band-aid” fixed the match(r"\u2200", "\u2200") example simply because that is where JavaScript and Julia syntax overlap, but it fails already at minor variations where they don't, such as match(r"\U102200", "\U102200"). It just partially masks the fact that in the first string the \ is a PCRE escape character, whereas in the second string it is a Julia string-literal escape character. On the other hand, it now confuses users of the \x syntax in PCRE, which was disabled by adding the PCRE.JAVASCRIPT_COMPAT regex compile flag (or today PCRE.ALT_BSUX as it's called in PCRE2). See #46137.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants