Extended unicode characters discarded from auto heading IDs #56

jkboxomine · 2019-12-05T15:56:04Z

Goldmark 1.1.8 only takes single-byte (ASCII) code points into account when generating auto heading IDs, and simply discards extended Latin characters (2 bytes in UTF-8) and other international characters (3 bytes).

https://github.com/yuin/goldmark/blob/master/parser/parser.go#L83-L85

On multilingual sites, this produces incomplete heading IDs.
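To make the symptom concrete, here is a minimal sketch of a byte-wise, ASCII-only ID generator. The function `asciiOnlyID` is hypothetical and is not goldmark's code; it only imitates the behavior described above, assuming that letters are lowercased and spaces become hyphens.

```go
package main

import "fmt"

// asciiOnlyID is a hypothetical stand-in for the behavior described above,
// not goldmark's actual implementation. It walks the heading byte by byte,
// keeps ASCII letters and digits, maps spaces to hyphens, and drops the rest.
func asciiOnlyID(heading string) string {
	out := make([]byte, 0, len(heading))
	for i := 0; i < len(heading); i++ {
		c := heading[i]
		switch {
		case c >= 'a' && c <= 'z', c >= '0' && c <= '9':
			out = append(out, c)
		case c >= 'A' && c <= 'Z':
			out = append(out, c+('a'-'A')) // lowercase ASCII letters
		case c == ' ':
			out = append(out, '-')
		}
		// Every byte of a multi-byte UTF-8 sequence is >= 0x80, matches no
		// case above, and is therefore silently lost.
	}
	return string(out)
}

func main() {
	fmt.Println(asciiOnlyID("Résumé français")) // prints "rsum-franais"
	fmt.Println(asciiOnlyID("日本語"))             // prints an empty line
}
```

The accented letters vanish from the first heading, and nothing at all survives from the second one.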

yuin · 2019-12-05T16:53:38Z

As commented on #57, I think that's enough as a default implementation. If you need heading IDs that suit you better, you can use the WithIDs option. And if you think your auto heading ID generation logic may be useful to other people, you can create it as an extension and publish it on GitHub etc.

Because I'm Japanese (and use CJK characters), I agree with your point. But I think that's enough as a default implementation. I would like to keep the default implementation as simple as possible.

Again, goldmark is an extensible library. You are welcome to publish your auto heading ID generation logic as an extension :)

jkboxomine · 2019-12-08T14:34:30Z

Hello @yuin, I've been reading the Goldmark code to see whether I can develop an extension for more complete auto heading ID generation. Here are a few questions regarding this:

Is it possible to generate the auto heading ID at the rendering step (by writing a renderer that implements renderer.NodeRenderer)? Generating the ID from the full heading text at that step would be much more configurable, if at all possible (for example, by using regex pattern matching and replacement). Also, it appears that RegisterFuncs is provided only by Renderer, not by Parser.
If the above is not possible, should I write a parser that implements parser.BlockParser? And would I have to re-implement the whole atxHeadingParser and setextHeadingParser? generateAutoHeadingID is part of the implementation of those two parsers, and I'm afraid I cannot fully override it.

yuin · 2019-12-08T16:02:31Z

@jkboxomine, you are overthinking it. All you have to do is implement parser.IDs.

Users who want to use your auto heading id generation logic will use your library like the following:

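// Hand the custom ID generator to the parser through a parse context.
ctx := parser.NewContext(parser.WithIDs(yourlib.NewYourAutoHeadingGenerator()))
// Auto heading IDs still have to be enabled on the parser itself.
markdown := goldmark.New(goldmark.WithParserOptions(parser.WithAutoHeadingID()))
err := markdown.Convert(source, &b, parser.WithContext(ctx))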
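As a rough idea of what such a generator might contain, the sketch below keeps letters and digits from any script instead of only ASCII. It uses the standard library only, so it works on plain strings rather than implementing goldmark's parser.IDs directly. As far as I understand, that interface has a Generate method taking the heading bytes and the node kind, plus a Put method, and a real extension would wrap this logic in those methods. The names `unicodeSlug` and `slugIDs`, and the "heading" fallback, are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// unicodeSlug is a hypothetical helper: it lowercases the heading, keeps
// letters and digits of any script, and collapses every other run of
// characters into a single hyphen (with no leading or trailing hyphen).
func unicodeSlug(heading string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range heading {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(unicode.ToLower(r))
		} else {
			pendingHyphen = true
		}
	}
	return b.String()
}

// slugIDs tracks the IDs already handed out so that duplicates get a numeric
// suffix. Its method set loosely mirrors parser.IDs but uses strings; the
// exact goldmark signatures are an assumption and are not reproduced here.
type slugIDs struct {
	used map[string]bool
}

// Generate returns a unique ID for the given heading text.
func (s *slugIDs) Generate(heading string) string {
	base := unicodeSlug(heading)
	if base == "" {
		base = "heading" // assumed fallback for headings with no usable text
	}
	id := base
	for i := 1; s.used[id]; i++ {
		id = fmt.Sprintf("%s-%d", base, i)
	}
	s.used[id] = true
	return id
}

// Put records an ID that was set explicitly, so Generate will not reuse it.
func (s *slugIDs) Put(id string) {
	s.used[id] = true
}

func main() {
	ids := &slugIDs{used: map[string]bool{}}
	fmt.Println(ids.Generate("Größe und Maße")) // prints "größe-und-maße"
	fmt.Println(ids.Generate("Größe und Maße")) // prints "größe-und-maße-1"
	fmt.Println(ids.Generate("Hello, World!"))  // prints "hello-world"
}
```

Working on runes rather than bytes is the whole point here: `range` over a Go string decodes UTF-8, so multi-byte characters are classified as letters instead of being dropped.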

yuin · 2019-12-08T16:05:50Z

@jkboxomine Please let me know when you have published your library, by sending a PR that adds it to the README :)

inwardmovement · 2019-12-11T13:08:44Z

The "minimal defaults" approach is legitimate, but could it at least not strip accented characters, and instead "slugify" them by removing the accents? é > e, œ > oe, etc.

For now we find ourselves with missing letters in words, and the URLs are gibberish, which hurts readability as well as SEO.
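The accent-folding idea can be sketched with a small replacement table. This is only an illustration: `foldAccents` and its table are hypothetical, cover just a handful of Western European letters, and are not part of goldmark. A fuller version would more likely build on Unicode normalization (for example via golang.org/x/text), although ligatures such as œ have no canonical decomposition and would still need explicit entries.

```go
package main

import (
	"fmt"
	"strings"
)

// accentFolder is a hypothetical and deliberately tiny transliteration table.
// Arguments are (old, new) pairs; anything not listed passes through as-is.
var accentFolder = strings.NewReplacer(
	"à", "a", "â", "a", "ä", "a",
	"ç", "c",
	"é", "e", "è", "e", "ê", "e", "ë", "e",
	"î", "i", "ï", "i",
	"ô", "o", "ö", "o",
	"ù", "u", "û", "u", "ü", "u",
	"œ", "oe", "æ", "ae", "ß", "ss",
)

// foldAccents lowercases s and replaces the accented letters and ligatures
// listed in accentFolder with plain ASCII equivalents.
func foldAccents(s string) string {
	return accentFolder.Replace(strings.ToLower(s))
}

func main() {
	fmt.Println(foldAccents("Œuvre complète")) // prints "oeuvre complete"
	fmt.Println(foldAccents("Straße"))         // prints "strasse"
}
```

Running the folded text through an ASCII-only ID generator afterwards would then keep every letter of the word instead of leaving gaps.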

jkboxomine mentioned this issue Dec 5, 2019

Improve automatic heading ID generation #57

Closed

yuin closed this as completed Dec 5, 2019

yuin added the proposal label Dec 6, 2019

jkboxomine mentioned this issue Dec 14, 2019

Incomplete auto heading ID for extended Latin and CJK characters gohugoio/hugo#6616

Closed

Lemmingh mentioned this issue Jul 26, 2020

Add gitea slugification yzhang-gh/vscode-markdown#763

Merged
