Extended unicode characters discarded from auto heading IDs #56

Closed
jkboxomine opened this issue Dec 5, 2019 · 5 comments

@jkboxomine

Goldmark 1.1.8 only takes one-byte (ASCII) code points into account when generating auto heading IDs, simply discarding extended Latin characters (2 bytes) and other international characters (3 bytes).

https://github.com/yuin/goldmark/blob/master/parser/parser.go#L83-L85

On multilingual sites, this produces incomplete heading IDs.
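
To make the effect concrete, here is a minimal standard-library sketch that approximates the behaviour described above. It is not goldmark's actual code: the function name `asciiOnlyID` and the exact character rules are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// asciiOnlyID approximates the reported behaviour: only single-byte (ASCII)
// characters contribute to the ID, and multi-byte characters are skipped.
// Hypothetical helper for illustration, not goldmark's implementation.
func asciiOnlyID(heading string) string {
	out := make([]byte, 0, len(heading))
	for i := 0; i < len(heading); {
		_, size := utf8.DecodeRuneInString(heading[i:])
		c := heading[i]
		i += size
		if size != 1 {
			continue // 2-, 3- and 4-byte characters are dropped
		}
		switch {
		case 'A' <= c && c <= 'Z':
			out = append(out, c+('a'-'A')) // lowercase ASCII letters
		case ('a' <= c && c <= 'z') || ('0' <= c && c <= '9'):
			out = append(out, c)
		case c == ' ' || c == '-' || c == '_':
			out = append(out, '-')
		}
	}
	return string(out)
}

func main() {
	fmt.Println(asciiOnlyID("Café au lait")) // caf-au-lait
	fmt.Println(asciiOnlyID("Über uns"))     // ber-uns
}
```

With this sketch, "Café au lait" loses its "é" and "Über uns" loses its first letter entirely, which is the kind of gap the issue describes.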

@yuin
Owner

yuin commented Dec 5, 2019

As I commented on #57, I think that's enough for the default implementation. If you need heading IDs that suit you better, you can use the WithIDs option. And if you think your auto heading ID generation logic may be useful to other people, you can create it as an extension and publish it on GitHub, etc.

Because I'm Japanese (and use CJK characters), I agree with your point. But I think that's enough for the default implementation. I would like to keep the default implementation as simple as possible.

Again, goldmark is an extensible library. You are welcome to publish your auto heading ID generation logic as an extension :)

@yuin yuin closed this as completed Dec 5, 2019
@yuin yuin added the proposal label Dec 6, 2019
@jkboxomine
Author

jkboxomine commented Dec 8, 2019

Hello @yuin, I've been reading the Goldmark code to see if I can develop an extension for more complete auto heading ID generation. Here are a few questions about this:

  1. Is it possible to generate the auto heading ID at the rendering step (by writing a renderer that implements renderer.NodeRenderer)? Generating the ID from the full heading text at that step would be much more configurable, if it is possible at all (for example, using regex pattern matching and replacement). Also, it appears that RegisterFuncs is provided by Renderer only, not by Parser.
  2. If the above is not possible, should I write a parser that implements parser.BlockParser? And would I have to re-implement the whole atxHeadingParser and setextHeadingParser? generateAutoHeadingID is part of the implementation of those two parsers, and I'm afraid I cannot fully override it.

@yuin
Owner

yuin commented Dec 8, 2019

@jkboxomine, you are overthinking it. All you have to do is implement parser.IDs.

Users who want to use your auto heading id generation logic will use your library like the following:

ctx := parser.NewContext(parser.WithIDs(yourlib.NewYourAutoHeadingGenerator()))
markdown := goldmark.New(goldmark.WithParserOptions(parser.WithAutoHeadingID()))
err := markdown.Convert(source, &b, parser.WithContext(ctx))
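
For illustration, a generator of this kind could look roughly like the sketch below. It uses only the standard library so it runs on its own; the type name `unicodeIDs`, the "heading" fallback, and the numeric suffix scheme are all assumptions. As far as I understand it, goldmark's parser.IDs interface expects `Generate(value []byte, kind ast.NodeKind) []byte` and `Put(value []byte)`, so the `kind` parameter would need to be added before this could be passed to parser.WithIDs.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// unicodeIDs is a hypothetical generator that keeps letters and digits from
// any script and remembers issued IDs so that duplicates get a suffix.
type unicodeIDs struct {
	used map[string]bool
}

func newUnicodeIDs() *unicodeIDs {
	return &unicodeIDs{used: map[string]bool{}}
}

// Generate builds an ID from heading text. The ast.NodeKind argument that
// goldmark's interface is understood to require is omitted in this sketch.
func (g *unicodeIDs) Generate(value []byte) []byte {
	var b strings.Builder
	for _, r := range strings.TrimSpace(string(value)) {
		switch {
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			b.WriteRune(unicode.ToLower(r))
		case unicode.IsSpace(r) || r == '-' || r == '_':
			b.WriteByte('-')
		}
	}
	base := b.String()
	if base == "" {
		base = "heading" // assumed fallback for headings with no usable characters
	}
	id := base
	for n := 1; g.used[id]; n++ {
		id = fmt.Sprintf("%s-%d", base, n)
	}
	g.used[id] = true
	return []byte(id)
}

// Put records an explicitly assigned ID so Generate will not hand it out again.
func (g *unicodeIDs) Put(value []byte) {
	g.used[string(value)] = true
}

func main() {
	g := newUnicodeIDs()
	fmt.Printf("%s\n", g.Generate([]byte("Café au lait"))) // café-au-lait
	fmt.Printf("%s\n", g.Generate([]byte("Café au lait"))) // café-au-lait-1
}
```

Keeping the de-duplication state inside the generator mirrors how the context above owns the IDs for a single Convert call, so a fresh generator per document seems the safer choice.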

@yuin
Owner

yuin commented Dec 8, 2019

@jkboxomine Please let me know when you have published your library by opening a PR that adds it to the README :)

@inwardmovement

The "minimal defaults" approach is legitimate, but could it at least not strip accented characters, and instead "slugify" them by removing the accents? é > e, œ > oe, etc.

For now we find ourselves with missing letters in words, and the URLs are gibberish, which hurts readability as well as SEO.
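
As a rough sketch of what such accent folding could look like, the standard-library example below maps a handful of accented characters to ASCII before an ID is built. The table and the name `foldAccents` are illustrative assumptions and far from exhaustive; a more complete approach would probably rely on Unicode normalization, for example via golang.org/x/text/unicode/norm.

```go
package main

import (
	"fmt"
	"strings"
)

// foldTable is a small, illustrative mapping of accented characters and
// ligatures to ASCII replacements. It is deliberately incomplete.
var foldTable = map[rune]string{
	'à': "a", 'â': "a", 'ä': "a",
	'ç': "c",
	'é': "e", 'è': "e", 'ê': "e", 'ë': "e",
	'î': "i", 'ï': "i",
	'ô': "o", 'ö': "o",
	'ù': "u", 'û': "u", 'ü': "u",
	'œ': "oe", 'æ': "ae", 'ß': "ss",
}

// foldAccents lowercases s and replaces every rune found in foldTable with
// its ASCII replacement, leaving all other runes unchanged.
func foldAccents(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if repl, ok := foldTable[r]; ok {
			b.WriteString(repl)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(foldAccents("Cœur brisé")) // coeur brise
	fmt.Println(foldAccents("Été"))        // ete
}
```

Running the folded text through an ASCII-only ID generator would then keep every letter of the word instead of dropping the accented ones.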
