Making # free #3498

Closed
msadeqhe opened this issue Dec 12, 2023 · 14 comments
Labels
leads question A question for the leads team

Comments

@msadeqhe

msadeqhe commented Dec 12, 2023

Summary of issue:

Carbon can use '"text"' notation instead of #"text"# to make # available for other purposes.

Details:

Currently # is used in string literals. What if we drop # and use ' instead?

var m: String = #"text"#;
var n: String = '"text"';

' is already used for character literals in many programming languages. If we require " to be escaped in character literals, we can use notation '"..."' for string literals:

// Character literal
var a: String = '\"';

// String literal
var b: String = '"text"';

// ERROR! '" is the start of string, but it doesn't have the ending quote "'
var c: String = '"';

So Carbon reuses ', and # will be free for other uses in the language.

//            ##"First line\##nNext line"##;
let a: auto = ''"First line\''nNext line"'';
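A minimal sketch of the disambiguation rule described above, written in Python purely for illustration. The function name `classify_quote_token` and its return labels are hypothetical, and this is not Carbon's lexer; it only assumes that a character literal holding " must be spelled '\"', so that '" always opens a string literal.

```python
def classify_quote_token(src: str) -> str:
    # Classify a token that starts with ' under the proposed rule.
    # Assumption: a " inside a character literal must be escaped,
    # so the two characters '" always open a string literal.
    if not src.startswith("'"):
        raise ValueError("token must start with a single quote")
    if src.startswith("'\""):
        # The string must be closed by "' somewhere after the opening '".
        end = src.find("\"'", 2)
        return "string" if end != -1 else "error: unterminated string"
    # Anything else is treated as a character literal, e.g. 'a' or '\"'.
    return "char"
```

Under these assumptions the three declarations above would be classified as a character literal, a string literal, and an unterminated string respectively.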

Alternative Syntax

Instead of repeating 's, Carbon may use a notation similar to C++ raw string literals. We can put other symbols (except ' and ") between ' and ", as long as they appear in reversed order in the closing quote (if we put one of the [({< characters in the opening quote, we have to use the corresponding >})] in the closing quote). Also, an escape sequence starting with \ must include those extra characters. For example:

// Before
###"First line\###nNext line"###

// After
'[("First line\[(n)]Next line")]'
'#^"First line\#^n^#Next line"^#'

IMO, having delimiters around escape sequences makes them more readable:

// Before
##"first item\##nnext item"##

// After
'<"first item\<n>next item">'

\<n>next is more readable than \##nnext. Thanks.
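The mirroring rule in this alternative syntax can be sketched in a few lines of Python. The helper names `closing_delimiter` and `escape_sequence` are hypothetical and only illustrate the suggestion; they assume the extra characters are single symbols, reversed in the closing quote, with [({< mapped to their matching closers.

```python
MIRROR = {"(": ")", "[": "]", "{": "}", "<": ">"}


def closing_delimiter(opening: str) -> str:
    # Reverse the extra characters and swap each bracket for its partner.
    return "".join(MIRROR.get(ch, ch) for ch in reversed(opening))


def escape_sequence(opening: str, code: str) -> str:
    # An escape is \ + opening characters + escape code + mirrored characters.
    return "\\" + opening + code + closing_delimiter(opening)
```

For the opening characters `[(`, `#^` and `<`, this reproduces the closing quotes and the `\[(n)]`, `\#^n^#` and `\<n>` escapes shown in the examples above.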

Any other information that you want to share?

I had suggested a similar notation (alternative syntax) for string literals in Cpp2 (Cppfront) in this discussion (please read it for an explanation of why '"text"' was chosen). It seems Cpp2 won't get the new notation (maybe because of compatibility with Cpp1, similar to how TypeScript is compatible with JavaScript).

@msadeqhe msadeqhe added the leads question A question for the leads team label Dec 12, 2023
@msadeqhe
Author

msadeqhe commented Dec 12, 2023

I have to mention that '"text"' is easier to type than #"text"#, because ' and " are on the same key on the keyboard. Also, ' is already associated with characters and strings, whereas # is a sign for numbers, values (like pound (weight)) and tags.

@geoffromer
Contributor

See here for the rationale for using # this way, and for not using a C++-style raw literal syntax. I can't tell why ' isn't listed as one of the characters considered as an alternative to #.

@jonmeow
Contributor

jonmeow commented Dec 12, 2023

It's worth bearing in mind here that proposal #1360 adopted ''' over """ for block string literals in order to reduce lexing ambiguity (as described in the proposal).

To note a few risks of ' instead of #:

  • The code '"' would be ambiguous as either a character literal of " or the start of a raw string literal beginning with '.
    • The ambiguity could be resolved by requiring either the character literal to be written as '\"' or the raw string literal to be written as '"\'' (the latter option would probably have lower impact).
    • For comparison, #"# is a raw string literal beginning with #; there is no ambiguity.
  • ' for raw string literals conflicts with ''' for block string literals.
    • Ambiguity would require changing block string literal syntax.
    • [note, the proposal seems to be anchored on C++ R() syntax: Carbon treats raw and block string literals as separate and combinable]
  • ' is used in popular languages such as Python for regular string literals, so using it for raw string literals poses a risk for understandability.
    • As noted in proposal String literals #199, # has precedent for raw string literals, which should yield at least some familiarity.
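To illustrate why a # prefix composes cleanly with both literal forms, here is a rough Python sketch. The function `string_literal_kind` is hypothetical and not taken from the Carbon toolchain; it only assumes what this thread states, namely that # marks raw literals, ''' opens a block literal, and " opens a simple one.

```python
from typing import Tuple


def string_literal_kind(src: str) -> Tuple[int, str]:
    # Count the leading # characters, then look at what follows them.
    hashes = len(src) - len(src.lstrip("#"))
    rest = src[hashes:]
    if rest.startswith("'''"):
        return hashes, "block"
    if rest.startswith('"'):
        return hashes, "simple"
    raise ValueError("not a string literal")
```

Because the # count and the quote style are read independently, `#"#` is classified without backtracking, whereas a ' prefix would have to be told apart from both character literals and '''.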

@msadeqhe
Author

msadeqhe commented Dec 12, 2023

' for raw string literals conflicts with ''' for block string literals.

  • Ambiguity would require changing block string literal syntax.
  • [note, the proposal seems to be anchored on C++ R() syntax: Carbon treats raw and block string literals as separate and combinable]

If Carbon uses '"text"', does it still need block string literals?

  • ' is used in popular languages such as Python for regular string literals, so using it for raw string literals poses a risk for understandability.

In other well-known languages such as shell script, $(...) is not expanded within ' quotes, so ' already somewhat resembles raw string literals.

@msadeqhe
Author

msadeqhe commented Dec 13, 2023

What advantages does the '"text"' syntax have over #"text"#?

  • '" and "' are easier/faster to type with keyboard than #" and "#.
  • ' is widely known as the single quote in many programming languages (and even in natural languages such as English). But # is not universally known as a quote symbol in ASCII, Unicode or any other international standard.
  • ' is already taken in Carbon. # may be used for other purposes.

What advantages do C++-style raw string literals have over repeated quotes?

  • Programmers are free to use their own characters to form complex quotes. They are not forced to repeat many #s. It helps when the string contains # characters:
    // Before
    let a: auto = #####"text "####" text"#####;
    
    // After
    let b: auto =    '~"text "####" text"~';
    EDIT: Thanks to @jonmeow, I've corrected this example.
  • The simplest form of it looks similar to regular string literals:
    // Before
    let a: auto = #"text"#;
    
    // After
    let b: auto = '"text"';
  • It improves escape sequences readability:
    // Before
    let a: auto = ##"item\##nnext"##;
    
    // After
    let b: auto = '~"item~\n~next"~';
  • Repeated quotes are a limited syntax of C++-style raw string literals:
    // Before
    let a: auto = ####"text"####;
    
    // After
    let b: auto = '~~~"text"~~~';

Carbon can use a modified/enhanced form of C++-style raw string literals for readability. We can put symbols and characters between ' and " in '"text"':

  • Every symbol or character that we put in the opening quote must appear in reversed order in the closing quote (in C++ they are not in reversed order):
    let x: auto = '#^*"text"*^#';
  • If we put characters that are valid in identifiers/numbers within the opening quote, we must put exactly the same characters within the closing quote:
    let x: auto = 'ab100c"text"ab100c';
  • If we put <, {, [ or ( within the opening quote, we have to put the matching closing symbol within the closing quote, that is >, }, ] or ) respectively (in C++ they are not matching closing symbols):
    let x: auto = '<{[("text")]}>';
  • All of the previous rules can be combined:
    let x: auto = 'a2b#100("text")100#a2b';
  • Escape sequences must be enclosed within the extra characters of the quotes (in C++, escape sequences do not work within raw string literals):
    let x: auto = '(*"item(*\n*)next"*)';
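The quote-matching rules in this list can be checked mechanically. The following Python sketch is only an illustration of the suggestion: `expected_closing` and `is_well_formed` are hypothetical names, runs of letters/digits are assumed to count as one unit that is repeated unchanged, and escape sequences inside the text are not examined.

```python
import re

MIRROR = {"(": ")", "[": "]", "{": "}", "<": ">"}


def expected_closing(opening: str) -> str:
    # Split into identifier/number runs and single symbols.
    tokens = re.findall(r"[A-Za-z0-9_]+|.", opening)
    # Reverse the unit order; mirror brackets, keep everything else as is.
    return "".join(MIRROR.get(tok, tok) for tok in reversed(tokens))


def is_well_formed(literal: str) -> bool:
    # Expect ' OPENING " text " CLOSING ' with CLOSING derived from OPENING.
    if len(literal) < 2 or literal[0] != "'" or literal[-1] != "'":
        return False
    inner = literal[1:-1]
    first_quote = inner.find('"')
    if first_quote == -1:
        return False
    opening = inner[:first_quote]
    closing = expected_closing(opening)
    long_enough = len(inner) >= len(opening) + 2 + len(closing)
    return long_enough and inner.endswith('"' + closing)
```

With these assumptions, an opening quote of `a2b#100(` yields the closing quote `)100#a2b`, and a literal whose closing symbols are not reversed is rejected.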

These rules may seem overcomplicated, so Carbon may simply ban numbers and letters from them:

// ERROR! opening and closing quotes cannot contain letters and numbers.
let x: auto = 'ab"text"ab';

// OK.
let x: auto = '(*"text"*)';

Alternatively, \ in escape sequences can be before extra characters of quotes:

let x: auto = '(*"item\(*n*)next"*)';

But IMO they look better when \ is near the escape character:

let x: auto = '(*"item(*\n*)next"*)';

BTW, it depends on the compiler implementation, and it is somewhat opinion-based.

@msadeqhe
Author

msadeqhe commented Dec 13, 2023

These rules may seem overcomplicated, so Carbon may simply ban numbers and letters from them:

// ERROR! opening and closing quotes cannot contain letters and numbers.
let x: auto = 'ab"text"ab';

// OK.
let x: auto = '(*"text"*)';

This limitation has a great advantage: it makes it syntactically possible to put keywords within the opening quote. For example:

let x: auto = 'multiline"
    This is a multiline text.
    It's the next line.
    "';

multiline is a keyword that allows multi-line string literals without any additional syntax. Keywords can be combined with symbols, but the keyword must come first:

let x: auto = 'multiline ~~~~"
    This is a multiline text.
    It's the next line.
    "~~~~';

let y: auto = 'multiline (*"
    This is a multiline text.
    It's the next line.
    "*)';

White-space after multiline is optional. Any other feature for string literals can be introduced in this way.

Another example for the keyword in string quotes, is if we want to copy/paste code:

var x: auto = 'code ***"python
    def name():
        pass
    "***';

In this example, code is a string keyword. When a keyword is used, the first line of the string literal acts like a comment for the Carbon compiler to use. Here, python is a documentary comment.

Another possible feature is template strings:

var x: auto = 'template *(*"FunctionName(Abc)
    \Name text \Another
    text.
    "*)*';

In this example, template is a string keyword. Again, the first line is a comment because a keyword is used within the string literal. FunctionName is a function that takes the string and replaces \Name and \Another with the corresponding members of object Abc. The compiler validates FunctionName(Abc) as Carbon code.

In general:

var x: auto = 'KEYWORD OPENING"DATA
    CONTENT
    "CLOSING';

KEYWORD is a string keyword to determine the behavior of string literal. OPENING is a combination of symbols that are used to make a complex opening quote. DATA format depends on the KEYWORD. CONTENT is the content of string literal. The compiler uses KEYWORD to find what to do with DATA and CONTENT. CLOSING is a combination of symbols that has reversed order of OPENING to end the string literal. Optionally there may be white-space between KEYWORD and OPENING.
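One way to picture the general form is as a small parser that splits a literal into its parts. The Python sketch below is hypothetical throughout (`parse_keyword_literal`, `ParsedLiteral` and the `HEAD` pattern are invented for illustration); it assumes KEYWORD consists of letters only, OPENING of symbols only, and that DATA is the rest of the first line when a keyword is present.

```python
import re
from typing import NamedTuple

MIRROR = {"(": ")", "[": "]", "{": "}", "<": ">"}

# ' KEYWORD, optional blanks, OPENING symbols, then the " that starts the body.
HEAD = re.compile(r"'(?P<keyword>[A-Za-z]*)[ \t]*(?P<opening>[^A-Za-z0-9\s'\"]*)\"")


class ParsedLiteral(NamedTuple):
    keyword: str
    opening: str
    data: str
    content: str


def parse_keyword_literal(src: str) -> ParsedLiteral:
    m = HEAD.match(src)
    if m is None:
        raise ValueError("not a quoted literal")
    opening = m["opening"]
    # CLOSING is OPENING reversed, with brackets mirrored.
    closing = "".join(MIRROR.get(ch, ch) for ch in reversed(opening))
    tail = '"' + closing + "'"
    if len(src) < m.end() + len(tail) or not src.endswith(tail):
        raise ValueError("closing quote does not mirror the opening quote")
    body = src[m.end():len(src) - len(tail)]
    if m["keyword"]:
        # With a keyword, the first line is DATA and the rest is CONTENT.
        data, _, content = body.partition("\n")
    else:
        data, content = "", body
    return ParsedLiteral(m["keyword"], opening, data, content)
```

Applied to the code example above, this would report the keyword `code`, the opening `***`, the data `python`, and the indented Python lines as content.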

These examples demonstrate how this feature can be expanded in the future. Thanks.

@msadeqhe
Author

msadeqhe commented Dec 13, 2023

Alternative Syntax for Multiline String Literals

Another way to get multiline strings is to use ''"..."'' instead of '"..."'. The extra ' around the string literal means it must be a multiline string literal:

var x: auto = ''"comment
    text
    "'';

var y: auto = ''^(*"comment
    text
    "*)^'';

var z: auto = ''OPENING"comment
    text
    "CLOSING'';

This works because '' (an empty character literal) is already an error (unlike string literals that can be empty with ""). OPENING and CLOSING are explained in previous comment.

The advantage of this approach is that it's simple. It doesn't introduce keywords inside string literals. And if it's necessary, string keywords can be introduced any time without any syntax limitation or compatibility problem (because numbers and letters may be banned in OPENING and CLOSING quotes for future use).

@msadeqhe
Author

msadeqhe commented Dec 13, 2023

Visually comparing '-style and #-style

This is how we use string literals with #:

// Carbon
let a: auto = "text";
let b: auto = #"text"#;
let c: auto = ##"text"##;
let d: auto = '''comment
    text
    ''';
let e: auto = #'''comment
    text
    '''#;

This is how we would use string literals with ' (I've used * in quotes; I think * fits well between ' and ". We also use * to mark bold and italic text in Markdown):

// Carbon
let a: auto = "text";
let b: auto = '"text"';
let c: auto = '*"text"*';
let d: auto = ''"comment
    text
    "'';
let e: auto = ''*"comment
    text
    "*'';

They look similar to each other.

In #-style string literals, ''' is used for multi line and " is used for single line strings. On the other hand, in '-style string literals, " is used consistently for both multi line ''"..."'' and single line '"..."' strings. Interestingly, '"..."' has single (one) ' so it means single line string, and ''"..."'' has multiple (two) '' so it means multi line string.

@jonmeow
Copy link
Contributor

jonmeow commented Dec 13, 2023

If Carbon uses '"text"', does it still need block string literals?

Yes; proposal #199 that geoffromer linked to offers rationale for block string literals. Let me try to give you an example of why block string literals are helpful:

// Before
let a: auto = #"""carbon
  It uses #""" """\## delimiters.
  An escape sequence looks like \\##t.
  Here's an actual tab: '#\t'.
  """#

This yields a string of content type carbon with the value:

It uses #""" """# delimiters.
An escape sequence looks like \#.
Here's an actual tab: '<tab>'.

Indentation is removed as part of block string literal syntax.
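As a rough illustration of that indentation removal, here is a Python sketch. The name `strip_block_indent` is hypothetical, and it assumes the indentation to remove is whatever whitespace precedes the closing delimiter; that is one reading of the string literal design, not something stated in this thread.

```python
from typing import List


def strip_block_indent(lines: List[str], closing_line: str) -> List[str]:
    # Assumption: the whitespace before the closing delimiter is the indent.
    indent_len = len(closing_line) - len(closing_line.lstrip(" \t"))
    indent = closing_line[:indent_len]
    result = []
    for line in lines:
        if line.strip() == "":
            # Blank lines carry no indentation requirement in this sketch.
            result.append("")
        elif line.startswith(indent):
            result.append(line[indent_len:])
        else:
            raise ValueError("line is indented less than the closing delimiter")
    return result
```

Lines indented further than the closing delimiter keep their extra indentation, which is what makes the form convenient for embedding code.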

Content type (carbon) is an important feature, for example used by clang-format. I'd recommend in your suggested syntax considering what it would look like to write a similar string, one that contains syntax conflicts with the proposed close delimiter and escape syntax -- I think the results might look pretty poor, such as needing to write a character as a hex-escape in order to prevent the full string from being interpreted as an escape itself (carbon\0x63carbonarbon -> carbon). I think the current syntax offers a very good solution to the issue.

Also, a couple syntax corrections where I think you might've misunderstood the syntax:

// Before
let a: auto = #####"text #### text"#####;

Similar to other raw string literal syntax, it's looking for the close token, which is "#. A single # is not a close token. As a consequence, this would be written:

let a: auto = #"text #### text"#;
  • It improves escape sequences readability:
    // Before
    let a: auto = ##"item\##nnext"##;

Similar to the above, I'd expect the first example to be written with one #:

let a: auto = #"item\#nnext"#;
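The close-token rule described here can be sketched as follows. This is a simplified Python illustration with a hypothetical name, `decode_raw_literal`; it assumes a literal opened with N # characters closes at the first " followed by N #, that escapes start with \ followed by N #, and it only decodes the n and t escapes (an escaped quote is not handled).

```python
def decode_raw_literal(src: str) -> str:
    # Count the leading # characters to build the three tokens.
    hashes = len(src) - len(src.lstrip("#"))
    marks = "#" * hashes
    open_tok, close_tok, escape = marks + '"', '"' + marks, "\\" + marks
    if not src.startswith(open_tok):
        raise ValueError("missing opening quote")
    # The literal ends at the first close token after the opening quote.
    end = src.find(close_tok, len(open_tok))
    if end == -1 or end + len(close_tok) != len(src):
        raise ValueError("missing or misplaced close token")
    body = src[len(open_tok):end]
    # Only \n and \t (with the matching number of #) are decoded here.
    return body.replace(escape + "n", "\n").replace(escape + "t", "\t")
```

A run of # inside the text is harmless unless it directly follows a ", which is why one # is enough for `text #### text`.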

Note, I'm going to leave more comment to others... other than the syntax misunderstanding, it sounds like you're mostly arguing readability benefits, which I would disagree with.

@msadeqhe
Author

msadeqhe commented Dec 13, 2023

Also, a couple syntax corrections where I think you might've misunderstood the syntax:

// Before
let a: auto = #####"text #### text"#####;

Similar to other raw string literal syntax, it's looking for the close token, which is "#. A single # is not a close token. As a consequence, this would be written:

let a: auto = #"text #### text"#;

Thanks. I forgot to write " inside literal, the correct code is:

// Before
let a: auto = #####"text "####" text"#####;

And it requires 5 #s (considering we want the text to be copy/pastable without modification and escape sequences).
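The count of 5 can be derived mechanically. In this hedged Python sketch, `min_hashes` is a hypothetical helper that looks only at the closing delimiter: the literal needs one more # than the longest run of # that directly follows a " in the text. Backslashes and escape introducers are deliberately ignored.

```python
import re


def min_hashes(content: str) -> int:
    # Length of every run of # that directly follows a " in the content.
    runs = [len(m.group(1)) for m in re.finditer(r'"(#*)', content)]
    # One more # than the longest run, or none if there is no " at all.
    return max(runs) + 1 if runs else 0
```

For the text `text "####" text` the longest such run is 4, so 5 #s are needed, while `text #### text` contains no " and imposes no requirement from the closing delimiter.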

Similar to the above, I'd expect the first example to be written with one #:

let a: auto = #"item\#nnext"#;

In fact, I wanted to show how unreadable escape sequences such as \##nnext can be.

@msadeqhe
Author

msadeqhe commented Dec 13, 2023

I'd recommend in your suggested syntax considering what it would look like to write a similar string, one that contains syntax conflicts with the proposed close delimiter and escape syntax -- I think the results might look pretty poor, such as needing to write a character as a hex-escape in order to prevent the full string from being interpreted as an escape itself (carbon\0x63carbonarbon -> carbon). I think the current syntax offers a very good solution to the issue.

Sorry, I don't understand this part. But if I understand correctly, I suggest banning numbers and letters between ' and ", so we cannot have a carbon\ncarbon escape sequence:

// ERROR!
var x: auto = 'carbon"carbon\ncarbon"carbon';

// OK.
var y: auto = '~<*"~<*\n*>~"*>~';
// y == "\n"

~<*\...*>~ (escape sequence) and "*>~' (closing quote) are complex enough that we can assume they will not be part of normal text. But \####... and "#### have a higher chance of being part of normal text.

Note, I'm going to leave more comment to others... other than the syntax misunderstanding, it sounds like you're mostly arguing readability benefits, which I would disagree with.

Additionally, I'm arguing to make # free, and to reserve it for other features.

@msadeqhe
Author

I have to mention that '"..."' can have other rules. The rules that I have explained are not really a part of my suggestion. If you agree to change the syntax of string literals, I can prepare a proposal in detail to improve/enhance the rules.

@chandlerc
Contributor

While we understand the desire to use # in other parts of the grammar, the suggestion to use '" and "' instead introduces a more complex lexical structure that we don't think pulls its weight. We can consider # for other use cases, and if one of them is sufficiently compelling, find another way to write raw string literals. But ideally, that other way will work to preserve the simplicity of the current design -- both lexically, visually, when teaching or learning, and in composing cleanly with block string literals.

There are a number of other issues somewhat raised here, but it doesn't seem like we can do much with them in this issue. We remain interested in having block string literals in the language, as well as raw block string literals. Other ideas (interpolated strings, etc) should be explored in discussions and eventually draft proposal documents that surface the motivating problem and proposed solution.

Combined, the decision here is "no change at this point".

For reference, leads issues are better suited to resolving a specific and fairly narrow question of how to move forward; proposal documents should be used to outline a significant change holistically and work through the rationale for it and alternatives.

@chandlerc chandlerc closed this as not planned Dec 22, 2023
@chandlerc chandlerc reopened this Dec 22, 2023

4 participants