Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

  triggers "hidden Unicode characters" warning #18324

Closed
silverwind opened this issue Jan 18, 2022 · 9 comments · Fixed by #18333
Closed

  triggers "hidden Unicode characters" warning #18324

silverwind opened this issue Jan 18, 2022 · 9 comments · Fixed by #18333
Labels

Comments

@silverwind
Copy link
Member

silverwind commented Jan 18, 2022

Description

https://try.gitea.io/silverwind/symlink-test/src/branch/master/bidi.md
https://github.com/silverwind/symlink-test/blob/master/bidi.md

The file contains a b and triggers the "hidden Unicode characters" warning on Gitea while it doesn't on GitHub. I think this character may need to be excluded from the warning as it's pretty commonly used in Markdown for non-nefarious purposes.

@wxiaoguang
Copy link
Contributor

wxiaoguang commented Jan 19, 2022

U+0020 SPACE foo bar Depends on font, typically 1/4 em, often adjusted
U+00A0 NO-BREAK SPACE foo bar As a space, but often not adjusted
U+1680 OGHAM SPACE MARK foo bar Unspecified; usually not really a space but a dash
U+180E MONGOLIAN VOWEL SEPARATOR foo᠎bar 0
U+2000 EN QUAD foo bar 1 en (= 1/2 em)
U+2001 EM QUAD foo bar 1 em (nominally, the height of the font)
U+2002 EN SPACE (nut) foo bar 1 en (= 1/2 em)
U+2003 EM SPACE (mutton) foo bar 1 em
U+2004 THREE-PER-EM SPACE (thick space) foo bar 1/3 em
U+2005 FOUR-PER-EM SPACE (mid space) foo bar 1/4 em
U+2006 SIX-PER-EM SPACE foo bar 1/6 em
U+2007 FIGURE SPACE foo bar “Tabular width”, the width of digits
U+2008 PUNCTUATION SPACE foo bar The width of a period “.”
U+2009 THIN SPACE foo bar 1/5 em (or sometimes 1/6 em)
U+200A HAIR SPACE foo bar Narrower than THIN SPACE
U+200B ZERO WIDTH SPACE foo​bar 0
U+202F NARROW NO-BREAK SPACE foo bar Narrower than NO-BREAK SPACE (or SPACE), “typically the width of a thin space or a mid space”
U+205F MEDIUM MATHEMATICAL SPACE foo bar 4/18 em
U+3000 IDEOGRAPHIC SPACE foo bar The width of ideographic (CJK) characters.
U+FEFF ZERO WIDTH NO-BREAK SPACE foobar 0

image

@zeripath
Copy link
Contributor

th‍is is not the same as thi‍s.

nor is ä the same as ä!

@wxiaoguang
Copy link
Contributor

Yep, unicode is very tricky. The question is what should we do or should we be too strict?

	s := string([]byte{0xE2, 0x84, 0xAA})
	log.Print(s)
	log.Print("K" == s)
	log.Print(strings.EqualFold("K", s))

2022/01/19 17:31:11 K
2022/01/19 17:31:11 false
2022/01/19 17:31:11 true

@zeripath
Copy link
Contributor

I tried to just warn for bad bidi and was forced to add warning for every use of bidi. At that point I simply added escaping for every non-visible character. Genuine homoglyph detection stalled because the original PR completely stalled.

The code has the ability to mark the escape with the type of character that is escaping. We can tune where and when the warning occurs instead of just for any escaping. Adding in a classifier to the type of escaping would also be easy and so would then be escaping only for certain types of characters etc.

@silverwind
Copy link
Member Author

Isn't it sufficient to only highlight the bidi characters?

What's the purpose of the "invisible characters" warning? I mean any space-type character is "invisible" by definition and not really a concern, or is it?

@zeripath
Copy link
Contributor

th‍is is not the same as thi‍s.

nor is ä the same as ä!

@silverwind silverwind changed the title   incorrectly triggers "hidden Unicode characters" warning   triggers "hidden Unicode characters" warning Jan 19, 2022
@zeripath
Copy link
Contributor

diff --git a/options/locale/locale_en-US.ini b/options/locale/locale_en-US.ini
index d8398f6d9..036a1a983 100644
--- a/options/locale/locale_en-US.ini
+++ b/options/locale/locale_en-US.ini
@@ -1015,6 +1015,9 @@ bidi_bad_description_escaped = `This file contains unexpected Bidirectional Unic
 unicode_header = `This file contains hidden Unicode characters!`
 unicode_description = `This file contains hidden Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.`
 unicode_description_escaped = `This file contains hidden Unicode characters. Hidden unicode characters are escaped below. Use the Unescape button to show how they render.`
+bidi_header = `This file contains Bidirectional Unicode characters!`
+bidi_description = `This file contains Bidirectional Unicode characters that may be processed differently from what appears below. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to reveal hidden characters.`
+bidi_description_escaped = `This file contains Bidirectional Unicode characters. Hidden unicode characters are escaped below. Use the Unescape button to show how they render.`
 line_unicode = `This line has hidden unicode characters`
 
 escape_control_characters = Escape
diff --git a/templates/repo/unicode_escape_prompt.tmpl b/templates/repo/unicode_escape_prompt.tmpl
index 855d7866a..171752498 100644
--- a/templates/repo/unicode_escape_prompt.tmpl
+++ b/templates/repo/unicode_escape_prompt.tmpl
@@ -7,13 +7,13 @@
 			</div>
 			<p>{{$.root.i18n.Tr "repo.bidi_bad_description" | Str2html}}</p>
 		</div>
-	{{else if .EscapeStatus.Escaped}}
+	{{else if .EscapeStatus.HasBIDI}}
 		<div class="ui warning message unicode-escape-prompt">
 			<span class="close icon hide-panel button" data-panel-closest=".message">{{svg "octicon-x" 16 "close inside"}}</span>
 			<div class="header">
-				{{$.root.i18n.Tr "repo.unicode_header"}}
+				{{$.root.i18n.Tr "repo.bidi_header"}}
 			</div>
-			<p>{{$.root.i18n.Tr "repo.unicode_description" | Str2html}}</p>
+			<p>{{$.root.i18n.Tr "repo.bidi_description" | Str2html}}</p>
 		</div>
 	{{end}}
 {{end}}

Would simply drop the header and only show it if there are bidi characters.

The ⚠️ indicators would still be present on the lines that had unicode spaces etc. on them - I'd note that monaco has highlights nbsp characters.

@silverwind
Copy link
Member Author

silverwind commented Jan 19, 2022

Yes, non-ASCII spaces could still be highlighted, they are usually an error in code. But I'd certainly get rid of the obnoxious yellow box unless there are BIDI characters.

&nbsp; in rendered markdown should also not show a ⚠️, as an exception.

@zeripath
Copy link
Contributor

zeripath commented Jan 19, 2022

&nbsp; in rendered markdown should also not show a ⚠️, as an exception.

Without the "obnoxious" box there is nothing shown in rendered markdown except that the "escape" button is present.

zeripath added a commit to zeripath/gitea that referenced this issue Jan 19, 2022
Fix go-gitea#18324

Signed-off-by: Andrew Thornton <art27@cantab.net>
zeripath added a commit that referenced this issue Jan 19, 2022
Fix #18324

Signed-off-by: Andrew Thornton <art27@cantab.net>
Chianina pushed a commit to Chianina/gitea that referenced this issue Mar 28, 2022
Fix go-gitea#18324

Signed-off-by: Andrew Thornton <art27@cantab.net>
@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants