Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

properly support lookahead expresions in begin matches #2135

Merged
merged 10 commits into from
Oct 19, 2019

Conversation

joshgoebel
Copy link
Member

  • passes the FULL remaining buffer to processLexeme so
    that when it attemps to detemine the next mode it can
    run the ruleset against the FULL buffer, not just the
    terminator which resulted in a match (which will not
    be sufficient enough since lookaheads aren't included
    in the actual match)

@joshgoebel
Copy link
Member Author

I think we may need to do this same thing for endOfMode to support end matches, can anyone confirm? The use case I cared about was begin matches, and that's what I've tested.

If this looks good there are a few syntaxes that can be updated now that we have PROPER lookahead support - since a few current have HACKS to deal with the prior broken support such as:

XML.js

        /*
        The lookahead pattern (?=...) ensures that 'begin' only matches
        '<style' as a single word, followed by a whitespace or an
        ending braket. The '$' is needed for the lexeme to be recognized
        by hljs.subMode() that tests lexemes outside the stream.
        */
        begin: '<style(?=\\s|>|$)', end: '>',

@joshgoebel
Copy link
Member Author

joshgoebel commented Sep 23, 2019

Can someone tel me the correct way to run tests from console? npm test seems to be confused about where ../../build is supposed to be.

Ok got it now, will get the tests passing. :-)

@joshgoebel
Copy link
Member Author

Ha, now the issue is we have "broken"? rulesets that will actually match over and over if they can see the full data, such as Fortran:

/(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/im

This can actually return a 0 length match, which won't push the position forward, then it'll find AGAIN the same 0 length match in an infinite loop. :-)

I'm thinking perhaps we should ignore 0 length matches.

@egor-rogov
Copy link
Collaborator

Hey @yyyc514, how is it going with this PR? Haven't heard from you in a while.

@marcoscaceres
Copy link
Contributor

Heh, so this is how it all started 😂

@joshgoebel
Copy link
Member Author

I got a little distracted, it's true. But I haven't forgotten this. :-)

@joshgoebel joshgoebel added enhancement An enhancement or new feature on hold labels Oct 7, 2019
@joshgoebel joshgoebel changed the title properly support lookahead expresions in matches [WIP] properly support lookahead expresions in matches Oct 10, 2019
@joshgoebel
Copy link
Member Author

Diff gets really confused around line 490 or so... probably easier to read lines 539 to 644 outside the diff. They should be fairly easy to follow.

I shrunk processLexeme and moved a lot of the function into doBeginMatch and doEndMatch.

The end of compileMode moves into buildModeRegex which builds a 'super matcher' or whatever you'd like to call it. :-) It pretends to be a regex (supporting exec) but also returns metadata reporting WHICH mode matched, etc... so we don't have to recheck.

@joshgoebel
Copy link
Member Author

Broke the browser build somehow, no idea how.

@egor-rogov
Copy link
Collaborator

I'm gonna look into this, but do not expect a quick feedback (:

@joshgoebel
Copy link
Member Author

No problem. Nothing too advanced here I don’t think. It’s really just a purer implementation of the goals of the original system. Maybe it didn’t occur to them at the time you could just use capture groups to track which rule matched without the need to re-run the rule sets.

@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 10, 2019

Not saying it's worth the time but so many places where we use returnBegin right now could be replaced with look-aheads with no negative issues, and perhaps even resulting in simplifying the grammar, like in the abnf case. (since most of the time you're replacing nesting with just a single, simple regex)

@joshgoebel joshgoebel changed the title [WIP] properly support lookahead expresions in matches properly support lookahead expresions in matches Oct 11, 2019
@joshgoebel joshgoebel self-assigned this Oct 11, 2019
@joshgoebel joshgoebel changed the title properly support lookahead expresions in matches properly support lookahead expresions in begin matches Oct 11, 2019
@egor-rogov
Copy link
Collaborator

I wish I had more time to grok this...
Anyway, what I see:

  • all tests are passing
  • speed remains roughly the same (around 9 sec on my laptop for npm test)
  • lookaheads work!

So I can see no reason to delay this change. However I suggest to split it in 2 (maybe more) PRs: one for highlight.js itself, and the other(s) for improving grammar of languages.

@marcoscaceres You may want to have a look either, 'cause it changes the core of the system.

@egor-rogov
Copy link
Collaborator

I think we may need to do this same thing for endOfMode to support end matches, can anyone confirm? The use case I cared about was begin matches, and that's what I've tested.

I think it makes perfect sense.
@yyyc514 it would be great if you can add a section to the developer's docs on what regexp features are supported (like lookaheads (thanks to this PR) and backreferences (thanks to #1897)) and what are not (lookbehinds?).

@joshgoebel
Copy link
Member Author

Well, look behinds technically aren't in JS yet, are they?... but when they are added this PR should allow them to work just fine with begin matchers, and we should keep that in mind for end matchers also.

When you say "developer docs" which docs are you referring to exactly?

@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 12, 2019

However I suggest to split it in 2 (maybe more) PRs: one for highlight.js itself, and the other(s) for improving grammar of languages.

I can definitely do that. If we didn't "Always squash" this could stand-alone. I'm not sure I'm a fan of this always squashing. I see how it makes it easier for people new to git, but when you have something like this that's already packaged in nice small commits it's kind of annoying to have to split it out AGAIN.

I see a dropdown, perhaps it's possible to change it to non-squash... in which case I'd just clean up the one weird commit and then leave the history as is, rebase onto master, and then do a clean fast-forward merge.

@joshgoebel joshgoebel added this to the 9.15.11 milestone Oct 12, 2019
@egor-rogov
Copy link
Collaborator

When you say "developer docs" which docs are you referring to exactly?

The one in docs/ folder (which goes to https://highlightjs.readthedocs.io/en/latest/)

@joshgoebel
Copy link
Member Author

Would you suggest a new document to talk about regex features?

@egor-rogov
Copy link
Collaborator

language-guide looks like a proper place for it.

@joshgoebel
Copy link
Member Author

I think I might keep it small and focus on what we DON'T support... I think the idea is its regex... it should just work... If someone wants to learn about regex there are all sorts of resources.

So right now the caveats (after this PR is merged):

  • "look-ahead" doesn't work properly in end expressions.
  • "look-behind" (when JS supports it) also doesn't work properly in end expressions.

@egor-rogov
Copy link
Collaborator

I thought that what we do support (and since what version) may be also important for those who previously stumbled on something unsupported.

@joshgoebel
Copy link
Member Author

Does that addition to docs help?

@joshgoebel
Copy link
Member Author

@egor-rogov We good here yet?

docs/language-guide.rst Outdated Show resolved Hide resolved
@egor-rogov
Copy link
Collaborator

Ok, here we go!
I think we shouldn’t squash to keep separate commits.

@joshgoebel
Copy link
Member Author

joshgoebel commented Oct 17, 2019

Agree. I'll rebase and fixup the last commit and the plain jain merge when ready.

- begin matches are matched a single time
  (they no longer need to be rematched after found)
- look-ahead should now work properly for begin matches
  because of this change
- should be a tiny bit faster

Before

The old parser would build a list of regexes per mode and then
combine that into a large regex.  This is what was used to scan
your code for matches.  But after a match was found it had no way
on known WHICH match - so it would then have to re-run all the rules
sequentially on the bit of match text trying to figure out which
rule had matched.

The problem is while the original matcher was running agianst the
full code this "rematch" was only running aginst the matched text.
So look-ahead matches would naturally fail becasue the content they
were tryign to look-ahead to was no longer present.

After

We take the list of regexes per mode and combine then into a larger
regex, but with match groups.  We keep track of which match group
position correspond to which rule.  Now when we hit a match we can
check which match group was matched and find the associated rule/mode
that was matched withotu having to double check.

Look-ahead begin matches now "just work" because the rules are always
running against the full body of text and not just a subset.

Caveats

This doesn't solve look-ahead for end matching so naturally it also
does nothing for endSameAsBegin. IE, don't expect look-aheads to work
properly in those situations yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement or new feature parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants