Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow GFM spec on EM and STRONG delimiters #1686

Merged
merged 26 commits into from
Jul 13, 2020

Conversation

calculuschild
Copy link
Contributor

@calculuschild calculuschild commented May 21, 2020

Starting toward better adherence to the GFM spec on Emphasis, specifically Left-flanking-delimiter-runs.

Marked version: 1.1.0

Markdown flavor: CommonMark|GitHub Flavored Markdown

Description

  • Fixes Examples:
    (em) 341, 367, 368, 371, 372, 379, 390, 406, 417, 441, 444
    (strong) 391, 397, 399, 400, 401, 431, 443, 471, 475, 476, 479, 480

What was attempted

Applying the GFM spec for Left-flanking-delimiter-runs and right-flanking-delimiter runs more accurately for EM tags and STRONG tags.

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.

  • Draft GitHub release notes have been updated.
  • CI is green (no forced merge required).
  • Merge PR

@vercel
Copy link

vercel bot commented May 21, 2020

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/markedjs/markedjs/qv2lelel1
✅ Preview: https://markedjs-git-fork-calculuschild-emphasisfixes.markedjs.vercel.app

@calculuschild calculuschild marked this pull request as draft May 22, 2020 13:45
@UziTech
Copy link
Member

UziTech commented May 22, 2020

I don't have a lot of time to look at this right now but it could be a conflict with lists since * could be the start of a list as well as em and strong.

Added a check for the previous character to the  *em* Tokenizer. Needed to pass any tests where the em block starts with a punctuation character (e.g. commonmark example 368)
@calculuschild calculuschild marked this pull request as ready for review May 29, 2020 20:28
@calculuschild
Copy link
Contributor Author

calculuschild commented May 29, 2020

OK, so I figured out what the issue was. I now have it working quite well except for two test cases I could use some help on:

  • Original strong_and_em_together
    This test passes fine normally, but when run with the 'pedantic' tag it breaks. I'm not totally sure what the pedantic tag is actually doing so I would like some feedback. Edit: Fixed!

  • CommonMark Links example 519
    I have a pattern in there to skip any * inside \[ *stuff* \] to avoid conflict with links. This last example breaks on things between brackets that aren't actually links. I'm not sure how to handle this because it could be a reflink in certain cases but I can't tell without looking behind?

Note, I removed a line from the New em_2char test that was invalid according to the spec. I don't handle it properly here either, but my code isn't "breaking" it because it wasn't correct to begin with.

Also note, I haven't applied these changes to the _ version of em tags, but I expect several more tests will begin passing with that as well once these two issues are worked out.

@calculuschild calculuschild changed the title Follow GFM spec on Left-flanking-delimiter-runs Follow GFM spec on EM tags May 29, 2020
@UziTech
Copy link
Member

UziTech commented May 29, 2020

The pedantic option uses the pedantic rules to tokenize the markdown.

Basically making marked use the original spec instead of CommonMark.

@calculuschild
Copy link
Contributor Author

calculuschild commented May 29, 2020

OK gotcha. My change to the em tokenizer confused it. I can fix this pretty easily.
Edit: This has been fixed. Just this last bit I'm struggling on:

Any insight on detecting whether or not a set of square brackets is a reflink or not?

@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 1, 2020

Can anyone help me out with detecting when square brackets are part of a reflink or not for commonmark Example 519?

@UziTech Would you have any suggestion?

@UziTech
Copy link
Member

UziTech commented Jun 10, 2020

Can anyone help me out with detecting when square brackets are part of a reflink or not for commonmark Example 519?

You could try a lookahead or you might need to do some parsing in the em tokenizer.

Modifies the em rule after the block tokens are generated to detect known reflinks and skip over them so they don't get mistakenly italicized.
@calculuschild
Copy link
Contributor Author

calculuschild commented Jun 12, 2020

Tada!! My solution was to inject known reflink labels into the em rules right after the block sequence in the lexer is finished. This way the em rule can properly skip over any links that might contain *, but still allow emphasis inside of [...] if it is not a link.

I'm hoping this tweaking of the lexer is alright. I'm not sure if there's some way people could inject malicious regex this way by giving their link labels some weird names, but I assume it would just require some further character escaping on the label names before injection.

If this looks good, I can move forward with the _ variant of the em tags as well as **strong** which uses a lot of the same logic.

Now fixes three more cases
@calculuschild
Copy link
Contributor Author

Underscore em rules added. Fixes 3 more examples (371, 372, 406)

I would love some feedback on this PR!

Copy link
Member

@UziTech UziTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work!

test/specs/commonmark/commonmark.0.29.json Outdated Show resolved Hide resolved
src/Lexer.js Outdated Show resolved Hide resolved
src/Lexer.js Outdated Show resolved Hide resolved
@calculuschild
Copy link
Contributor Author

What do you guys think? Is this baby ready to go?

Copy link
Contributor

@davisjam davisjam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for ReDoS

Copy link
Member

@UziTech UziTech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks for all your hard work. 💯

@UziTech UziTech merged commit dddf9ae into markedjs:master Jul 13, 2020
@UziTech UziTech mentioned this pull request Jul 13, 2020
12 tasks
@brainchild0
Copy link

brainchild0 commented Jul 14, 2020

So 424 and 425 are considered out of scope of this change set?

Or more simply:

$ git show -s
commit 6b729ed8cdb98ea75d4031f6218a1f58b9f02d8a (HEAD, EmphasisFixes)
Merge: e27e6f9 ad720c1
Author: Trevor Buckner <calculuschild@gmail.com>
Date:   Thu Jul 9 19:37:22 2020 -0400

    Merge branch 'EmphasisFixes' of https://github.com/calculuschild/marked into EmphasisFixes
$ 
$ echo '**a **b** c**' |  bin/marked 
<p><strong>a **b</strong> c**</p>

(Should be <p><strong>a <strong>b</strong> c</strong></p>.)

@UziTech
Copy link
Member

UziTech commented Jul 14, 2020

@brainchild0 if you want to fix it I would be happy to review a PR 😁

@brainchild0
Copy link

For me the starting point is understanding what makes this particular case more challenging than the ones that have been resolved in this merge. Intuitively they all seem roughly equally complex. Obviously, I make the remark abstractly, having no familiarity with the design.

Or another way to ask the question, starting from the top and moving down: The general rule is to collect a stack of emphasis delimiters, which may be any of *, _, **, or __. As long as this rule may be somehow fully implemented, along with also dropping those same items from the stack, because of close (right flanking) delimiters, then all cases would be supported. What is missing, currently, that leaves out support for the remaining cases?

@calculuschild
Copy link
Contributor Author

calculuschild commented Jul 14, 2020

What is missing, currently, that leaves out support for the remaining cases?

The current implementation does not use a stack at all. It simply checks for existence of a left delimiter, then if found, find the first available matching right delimiter. Finally, ensure the text between the two is valid, meaning ignore any delimiters found inside links or code spans etc., and any other delimiters inside must occur in even pairs. If not valid, get the next possible end delimiter and check the middle again, until you run out of matching delimiters or you get a valid middle.

We already have regex for the left and right delimiters, so it would just be the extra effort of building up a stack in the tokenizer.

stevenjoezhang added a commit to stevenjoezhang/mmp-build that referenced this pull request Sep 2, 2020
@fredck
Copy link

fredck commented Sep 3, 2020

There is a good chance that this PR introduced #1754.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants