(cpp) syntax should be split into C and C++ syntaxes #2146

Closed
mortie opened this issue Oct 3, 2019 · 18 comments
Labels
enhancement An enhancement or new feature help welcome Could use help from community language

Comments

@mortie
Contributor

mortie commented Oct 3, 2019

Currently, C and C++ are both highlighted the same way, but that's not ideal. For example, the words string and vector are highlighted as built-ins, which makes no sense in a C context.

@marcoscaceres
Contributor

hhmm... yeah, might be worth splitting into two.

@joshgoebel joshgoebel changed the title C and C++ probably shouldn't be identical cpp syntax should be split into C and C++ syntaxes Oct 7, 2019
@joshgoebel joshgoebel changed the title cpp syntax should be split into C and C++ syntaxes [cpp] syntax should be split into C and C++ syntaxes Oct 7, 2019
@joshgoebel joshgoebel added the help welcome Could use help from community label Oct 7, 2019
@joshgoebel
Member

It's also possible we should start with a "C-like" like Prism does. My recent work on CPP/Arduino might make sense to look at as a potential pattern.

@joshgoebel joshgoebel changed the title [cpp] syntax should be split into C and C++ syntaxes (cpp) syntax should be split into C and C++ syntaxes Oct 13, 2019
@joshgoebel
Member

> Currently, C and C++ are both highlighted the same way, but that's not ideal. For example, the words string and vector are highlighted as built-ins, which makes no sense in a C context.

But those things wouldn't really be used in a C context either, would they?

@joshgoebel joshgoebel added language enhancement An enhancement or new feature and removed new language labels Oct 17, 2019
@mortie
Contributor Author

mortie commented Oct 17, 2019

@yyyc514 In a language where the words "vector" and "string" have no special meaning, they're pretty reasonable variable names.

The context I encountered this in was with tagged unions; something like

struct whatever {
    enum {
        WHATEVER_NUMBER,
        WHATEVER_STRING
    } tag;
    union {
        double number;
        char *string;
    } val;
};

@joshgoebel
Member

Now just having separate keyword lists would be pretty easy.
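
A minimal sketch of what separate keyword lists could look like. The word lists are abbreviated and the `classify` helper is a made-up name for illustration; neither is highlight.js's actual keyword table or API.

```javascript
// Hypothetical, abbreviated word lists; not the real highlight.js tables.
const C_KEYWORDS = new Set(["int", "char", "double", "struct", "union", "enum", "typedef", "return"]);
const C_BUILT_INS = new Set(["printf", "malloc", "free", "strlen"]);

// C++ starts from the C lists and adds its own words.
const CPP_KEYWORDS = new Set([...C_KEYWORDS, "class", "template", "namespace"]);
const CPP_BUILT_INS = new Set([...C_BUILT_INS, "string", "vector", "cout"]);

const LISTS = {
  c: { keyword: C_KEYWORDS, built_in: C_BUILT_INS },
  cpp: { keyword: CPP_KEYWORDS, built_in: CPP_BUILT_INS },
};

// Returns "keyword", "built_in", or "identifier" for a word in a given language.
function classify(language, word) {
  const lists = LISTS[language];
  if (!lists) throw new Error(`unknown language: ${language}`);
  if (lists.keyword.has(word)) return "keyword";
  if (lists.built_in.has(word)) return "built_in";
  return "identifier";
}

console.log(classify("c", "string"));   // plain identifier in C
console.log(classify("cpp", "string")); // built-in in C++
```

With lists like these, the `char *string;` member in the tagged union above would be treated as an ordinary identifier when the block is highlighted as C.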

@joshgoebel
Member

@egor-rogov How would you feel if we prepared for this by simply making C a requirement of C++ but for v10 effectively they were still the exact same thing? Ie a patch would only:

  • Create a new c language file
  • Move the content to c
  • cpp depends on c
  • Add some comments to explain this.

This would get the "breaking" part of the change out of the way and avoid us having to release v11 just to do this later... Thoughts?
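
The steps in that list could be sketched roughly as follows. The `register` helper and the grammar objects are assumptions made for illustration; highlight.js's real registration API and grammar format differ.

```javascript
// Hypothetical mini registry; not the highlight.js API.
const languages = new Map();

// Registers a language, failing if any required language is missing.
function register(name, requires, build) {
  const deps = requires.map((dep) => {
    if (!languages.has(dep)) {
      throw new Error(`"${name}" requires "${dep}" to be registered first`);
    }
    return languages.get(dep);
  });
  languages.set(name, build(...deps));
}

// Steps 1 and 2: the existing grammar content moves into "c" (abbreviated here).
register("c", [], () => ({
  name: "C",
  keywords: ["int", "char", "struct", "class", "template"],
  builtIns: ["printf", "string", "vector"],
}));

// Step 3: "cpp" depends on "c" and, for now, only changes the name,
// so both languages still highlight identically.
register("cpp", ["c"], (c) => ({ ...c, name: "C++" }));

console.log(languages.get("cpp").name);
```

The only user-visible change in this arrangement is that `cpp` can no longer be loaded without `c`, which is the "breaking" part the plan wants out of the way early.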

@joshgoebel
Member

In my perfect world v10 would break a lot of things, then we'd go back to stable releases for a year or so... I don't want to get into the habit of breaking something with every release. So I think that requires a little planning ahead.

@egor-rogov
Collaborator

I'm okay with it. Having separated keywords for these two languages sounds like a good idea.

joshgoebel added a commit that referenced this issue Jan 1, 2020
- chore(parser): effectively rename `cpp.js` to `c-like.js` [Josh Goebel][]
- chore(parser): create new `c.js` (C), depends on `c-like` now [Josh Goebel][]
- chore(parser): create new `cpp.js` (C++), depends on `c-like` now [Josh Goebel][]
- This will allow us to clean up C/C++ in the future without another breaking change
  by getting this require change out of the way early.
  (#2146)
@ztane

ztane commented Oct 4, 2020

It doesn't really make sense to have C and C++ share the same code. There are almost more differences than there are similarities, and it is not just about keywords. The preprocessor, the basic literal tokens, and comments could use the same structure, but that's about it. For example, C does not have templates, so there is no point in trying to figure out if x<y>z is a template or not.

@jleffler

jleffler commented Oct 4, 2020

> > Currently, C and C++ are both highlighted the same way, but that's not ideal. For example, the words string and vector are highlighted as built-ins, which makes no sense in a C context.
>
> But those things wouldn't really be used in a C context either, would they?

The name string is used by the Harvard CS50 course (typedef char *string;) so you see a lot of uses of string in C code for the CS50 course. Questions about that are asked on both StackOverflow (tag [cs50]) and presumably on CS50 Stack Exchange (https://cs50.stackexchange.com/) — see https://cs50.stackexchange.com/questions/37312/cs50-cipher-text-not-printing for one example. The name vector is much less frequently used in the C that I observe, but it shouldn't be mistreated when the source is C code — it is a regular identifier in user space and in no way special (not a keyword, not a builtin, not used by the standard C library, etc).

@joshgoebel
Member

joshgoebel commented Oct 4, 2020

> For example C does not have templates, so there is no point in trying to figure out if x<y>z is a template or not.

Yep that's definitely C++ only... and if we had a common foundation that wouldn't go in the foundation... ie it really doesn't belong in c-like, it'd go in cpp... c-like is just getting everything now since that's all we have because this item hasn't been fully completed.


But I could be persuaded that the grammars should be entirely separate if the C stand-alone grammar is simple enough... If someone knowledgeable in C wants to take the c-like grammar and whittle it down to just a workable c foundation, I'd be happy to review that PR. :-) We'd of course also need some pure C test cases in tests/markup, etc...

Any takers?

Also: Is the C language still evolving these days or is it pretty much static?

@joshgoebel joshgoebel added the good first issue Should be easier for first time contributors label Oct 5, 2020
@jleffler

jleffler commented Oct 5, 2020

> Also: Is the C language still evolving these days or is it pretty much static?

For the most part, the C language is static, but not wholly static.

The C2x proposal includes new syntax for attributes such as [[deprecated]] with doubled square brackets. It also introduces :: as an operator associated with attributes. These are borrowed from C++, I believe. (See: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2479.pdf p116ff).
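
For highlighting purposes, the doubled-bracket attribute syntax could be recognized with a fairly small pattern. A rough sketch, assuming attribute lists never contain nested square brackets; `ATTRIBUTE`, `findAttributes`, and the `vendor::hint` attribute are made-up names for illustration.

```javascript
// Assumption: no "[" or "]" appears inside the attribute list itself.
const ATTRIBUTE = /\[\[[^\[\]]*\]\]/g;

// Returns every [[...]] attribute specifier found in a source string.
function findAttributes(source) {
  return source.match(ATTRIBUTE) || [];
}

const sample = [
  '[[deprecated("use g")]] void f(void);',
  "[[vendor::hint]] int g(void);", // hypothetical prefixed attribute
  "int x = a[b[0]];",              // nested indexing, not an attribute
].join("\n");

console.log(findAttributes(sample));
```

Because the pattern requires two adjacent opening brackets, ordinary nested array indexing such as `a[b[0]]` is left alone.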

@joshgoebel
Member

@klmr Not sure if you might have any interest in helping out here... I can't imagine this would be super hard to accomplish for someone who knows C well. (although perhaps you're only a C++ guru) :-)

@jleffler

jleffler commented Oct 6, 2020

I've not studied the documentation (so I may just need pointing at the right bit), but what's the attitude of highlight.js to code that is not precisely syntactically correct, but that is mostly correct?

The reason for asking is that in the c-like code, there are expressions for C++ numbers. These can use single quotes between digits (e.g. 0b0010'0110 for a binary constant), but the matching regexes allow the single quotes at numerous places that would be invalid (e.g. 0b'0'). Is it worth spending time making the patterns more rigorous, or is it better to leave them flexible? Is highlight.js intended to handle valid code properly and not necessarily handle invalid code sensibly? Or should it try to do something semi-sane with slightly invalid code?
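
To make the trade-off concrete, here is a small sketch comparing a permissive pattern with a stricter one for binary constants. Both regexes are written for this illustration (the loose one is an assumed simplification, not copied from the c-like grammar), and `checkBinaryLiteral` is a made-up helper name.

```javascript
// Loose: any mix of binary digits and single quotes after the prefix.
const LOOSE_BINARY = /^0[bB][01']+$/;

// Strict: a single quote may only appear between two digits.
const STRICT_BINARY = /^0[bB][01]+(?:'[01]+)*$/;

// Reports whether each pattern accepts the given literal.
function checkBinaryLiteral(literal) {
  return {
    loose: LOOSE_BINARY.test(literal),
    strict: STRICT_BINARY.test(literal),
  };
}

for (const literal of ["0b0010'0110", "0b'0'", "0b01''10"]) {
  console.log(literal, checkBinaryLiteral(literal));
}
```

The strict version is only slightly longer here, but the same tightening would have to be repeated for the decimal, hexadecimal, and floating-point forms.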

On a related point, some C compilers allow the binary constant notation, but standard C does not (they're a part of C++). And I'm not sure whether those C compilers allow the single quotes feature — again, that's primarily C++. Should the (currently hypothetical) C scanner handle such extensions?

@joshgoebel
Member

joshgoebel commented Oct 6, 2020

> but the matching regexes allow the single quotes at numerous places that would be invalid (e.g. 0b'0')

This is probably fine (ie, it doesn't necessarily need to be fixed)... this should never exist in the wild (since it's invalid) so it doesn't matter if we'd highlight it wrong.

> or is it better to leave them flexible

Depends. Often it can be better to leave them more readable UNLESS it's super easy (and not too complex) to be more precise... if it's just expanding the regex slightly, OK. If it adds 20 lines of code or a whole bunch of impossible to read regex, then we'd really want to consider the value. Precision can improve our auto-detection abilities.

> On a related point, some C compilers allow the binary constant notation, but standard C does not (they're a part of C++). And I'm not sure whether those C compilers allow the single quotes feature — again, that's primarily C++. Should the (currently hypothetical) C scanner handle such extensions?

I really don't know, but I'd say we could go minimal for starters and then add things back later if it proves to be an issue? Although if we think it's common AND it's not really going to "break" anything I also don't see the harm in including say single quotes for C...

@joshgoebel joshgoebel removed the good first issue Should be easier for first time contributors label Dec 10, 2020
@joshgoebel
Member

#2954

I'm fully splitting the grammars and killing c-like... but simplifying the C grammar will still need to happen later.

@joshgoebel
Member

Closing this issue after the split. I assume the two will diverge over time, not see a single larger rework (although that's also welcome).

@joshgoebel
Member

@klmr C is ripe for simplification now if you're still interested. :-)

7 participants
@joshgoebel @ztane @marcoscaceres @mortie @jleffler @egor-rogov and others