[Youtube] Convert title text from surprising look-alike Unicode glyphs #31216

s1sw4nto · 2022-08-31T19:41:43Z

Checklist

[✓] I'm asking a question
[✓] I've looked through the README and FAQ for similar questions
[✓] I've searched the bugtracker for similar questions including closed ones

Question

Example:
youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )

My out file: 𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 - 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 - 𝘿𝙟 𝙋𝙤 - Radio Dangdut 24 Jam.mp3

The mp3 plays fine, no problem, but the filename looks like that.
Help: how do I convert that title to default text in Linux (bash script)?
Thanks.

dirkf · 2022-09-01T13:18:01Z

The title is produced using Unicode characters from the Unicode block of Mathematical Alphanumeric Symbols.

~~There is no direct way to convert this abstruse encoding to semantically valid characters~~. You'd have to create a translation table or rely on each variant of the symbols being a sequence A-Za-z starting at the code point for the glyph that resembles A.

The POSIX tr program is the tool to use in a shell script.
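
As a rough illustration of the "sequence A-Za-z" idea, here is a minimal Python 3 sketch rather than a tr invocation. It assumes that the letter styles of the Mathematical Alphanumeric Symbols block are laid out as consecutive 52-character runs (A-Z, then a-z) from U+1D400 to U+1D6A3; the function name fold_math_letters is made up for this example, and digits and Greek letters in the block are not handled.

```python
import string

# A-Z followed by a-z, the order used within each letter style of the block
LETTERS = string.ascii_uppercase + string.ascii_lowercase
# Assumed range of the Latin letter styles (13 styles of 52 code points each)
FIRST, LAST = 0x1D400, 0x1D6A3


def fold_math_letters(text):
    """Map styled Latin letters in the block back to plain ASCII letters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if FIRST <= cp <= LAST:
            # The position inside the 52-character run selects the letter
            ch = LETTERS[(cp - FIRST) % len(LETTERS)]
        out.append(ch)
    return ''.join(out)


print(fold_math_letters('\U0001D63F\U0001D65F \U0001D64A\U0001D665\U0001D664'))
```

A few code points in that range are reserved (the corresponding letters live elsewhere in Unicode), so this offset table is only an approximation of a proper mapping.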

The --restrict-filenames option does handle this title, but elides any run of symbol characters to a single _, which is probably not what you want.

s1sw4nto · 2022-09-01T21:49:22Z

I found a simple solution: iconv -f utf-8 -t ascii//translit

Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS - Dike Sabrina - Dj Popo - Radio Dangdut 24 Jam.mp3
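
For comparison, a roughly similar effect can be sketched in Python 3 with the standard library alone. This is not what iconv does internally: it decomposes the text with NFKD and then simply drops everything that is still non-ASCII, so characters iconv would transliterate or replace with a placeholder just vanish. The helper name ascii_translit is made up for this example.

```python
import unicodedata


def ascii_translit(text):
    """Approximate ASCII transliteration: decompose, then drop non-ASCII."""
    # NFKD turns look-alike glyphs into plain letters and splits accented
    # letters into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Combining marks and symbols without an ASCII form are discarded
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(ascii_translit('\U0001D63F\U0001D65F fran\u00e7ais'))
```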

s1sw4nto · 2022-09-01T22:35:04Z

OK, thank you.

dirkf · 2022-09-03T07:27:55Z

Good to know that iconv implements this conversion. There is this wrapper that implements codecs using iconv, GPL3 and Py3.6+.

pukkandan · 2022-09-03T10:26:30Z

Since --restrict-filenames already attempts to clean up accents and the like, I wouldn't say this is out of scope. Especially since there is no need for us to maintain a mapping: Python already does that for us. All we need to do is pass the filename through unicodedata.normalize. It probably wouldn't work to everyone's liking, but is good enough imo.

Relevant yt-dlp code: https://github.com/yt-dlp/yt-dlp/blob/adba24d2079d350fc03226adff3cae919d7a11db/yt_dlp/utils.py#L676-L677

dirkf · 2022-09-03T11:41:01Z

So that's practical:

--- old/youtube_dl/utils.py
+++ new/youtube_dl/utils.py
@@ -33,6 +33,7 @@ import sys
 import tempfile
 import time
 import traceback
+import unicodedata
 import xml.etree.ElementTree
 import zlib
 
@@ -2118,6 +2119,9 @@ def sanitize_filename(s, restricted=False, is_id=False):
             return '_'
         return char
 
+    # Replace look-alike Unicode glyphs
+    if restricted and not is_id:
+        s = unicodedata.normalize('NFKC', s)
     # Handle timestamps
     s = re.sub(r'[0-9]+(?::[0-9]+)+', lambda m: m.group(0).replace(':', '_'), s)
     result = ''.join(map(replace_insane, s))

Then:

$ python 
Python 2.7.17 (default, Jul 28 2022, 20:17:29) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize('NFKC',u'𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )')
u'Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " \u2757( Dj Popo )'
>>> 
$ python -m youtube_dl --get-title 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )
$ python -m youtube_dl --get-filename 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 ' 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 ' ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )-uHbbM4_Y-m8.mp4
$ python -m youtube_dl --get-filename --restrict-filenames 'https://youtu.be/uHbbM4_Y-m8'
Dj_Opo_Iseh_Ono_THAILAND_STYLE_x_SLOW_BASS_Dike_Sabrina_Dj_Popo-uHbbM4_Y-m8.mp4
$

... There is no direct way to convert this abstruse encoding to semantically valid characters ...

... unless you know about the unicodedata module (I suppose that the iconv package adds more functionality)!

One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; eg: --unicode-normalize all|metadata-list|none.

pukkandan · 2022-09-03T12:33:29Z

One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; eg: --unicode-normalize all|metadata-list|none.

yt-dlp has Unicode normalization built into the output template syntax for --print/-o. Not sure if you want to expand/complicate the output template syntax like that.

❯ yt-dlp -O %(title)+U https://youtu.be/uHbbM4_Y-m8
Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " ❗( Dj Popo )

dirkf · 2022-09-03T19:09:26Z

That doesn't affect the unrestricted filename, though?

Anyway, absent a compelling PR offered by someone else, I wouldn't want to implement the formatting syntax from yt-dlp for the moment, but at least --[no-]unicode-normalization might be possible, or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.

I'd expect that --unicode-normalization would apply to these free-text non-ID fields:

    title
    alt_title
    description
    uploader
    creator
    channel
    comments[n]['author']
    comments[n]['text']
    categories[n]
    tags[n]
    chapters[n]['title']
    chapter
    series
    episode
    track
    artist
    album
    album_artist
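
To make the field list concrete, here is a minimal Python 3 sketch of how such an option might walk an info dict. Everything in it is hypothetical: no --unicode-normalization option exists in the thread's code, the function name normalize_info and the constants are invented for this example, and NFKC is assumed as the default form only because the patch above uses it.

```python
import unicodedata

# Free-text, non-ID fields taken from the list above
SCALAR_FIELDS = (
    'title', 'alt_title', 'description', 'uploader', 'creator', 'channel',
    'chapter', 'series', 'episode', 'track', 'artist', 'album', 'album_artist',
)
LIST_FIELDS = ('categories', 'tags')
NESTED_FIELDS = {'comments': ('author', 'text'), 'chapters': ('title',)}


def _norm(value, form):
    # Leave non-string values (None, numbers) untouched
    return unicodedata.normalize(form, value) if isinstance(value, str) else value


def normalize_info(info, form='NFKC'):
    """Return a copy of info with the free-text fields normalized."""
    result = dict(info)
    for key in SCALAR_FIELDS:
        if key in result:
            result[key] = _norm(result[key], form)
    for key in LIST_FIELDS:
        if result.get(key):
            result[key] = [_norm(item, form) for item in result[key]]
    for key, subkeys in NESTED_FIELDS.items():
        if result.get(key):
            # Copy each entry, normalizing only the named text sub-fields
            result[key] = [
                dict(entry, **{k: _norm(entry[k], form) for k in subkeys if k in entry})
                for entry in result[key]
            ]
    return result
```

Fields such as id are deliberately absent from the tables, so they pass through unchanged.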

pukkandan · 2022-09-04T02:05:41Z

That doesn't affect the unrestricted filename, though?

It does, if you use +U in -o

or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.

I don't think this should be done. First off, while some users may prefer the normalized metadata, the original Unicode text is the correct one. Normalization should only be done with a user-facing option. Also, letting extractors do this creates inconsistencies, which will get harder and harder to standardize over time.

s1sw4nto · 2022-09-08T11:52:56Z

In the next youtube-dl update, please add the option --unicode-normalization.
I can't use yt-dlp; my Python is 2.6.

dirkf · 2022-10-20T10:32:16Z

The original YT video is no longer available. If someone has a current URL that generates a filename with Unicode look-alike characters, we can demonstrate the result of the above commit using --restrict-filenames.

Presumably bots that search for potentially copyright-infringing material also know about the transformation in use, so the practice of using such characters may wither away.

mansourmoufid · 2023-05-02T23:53:42Z

Since macOS 13.3.1, it is very difficult to open files with names encoded in Unicode normal form C (NFC); only normal form D (NFD) is supported. You can create, read, write, etc., files in either encoding just fine with the Unix API (where file names are just bytes), but AppKit will refuse to open such files now.

To reproduce the issue:

echo 'Bonjour!' > français.txt
open -a TextEdit français.txt

and TextEdit will just hang. Same for QuickTime with Unicode-titled videos downloaded by youtube-dl.

Ideally, Python would handle this using os.fsencode but this function is implemented as a simple string.encode call.

So I guess the right place is in utils.sanitize_filename, like above. It would be nice if this function could include something like:

if sys.platform == 'darwin':
    s = unicodedata.normalize('NFKD', s)
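
To illustrate what differs between the two forms, here is a small Python 3 sketch using the file name from the reproduction above. It only compares the strings and their UTF-8 bytes; it does not touch the file system or reproduce the AppKit behaviour.

```python
import unicodedata

name = 'fran\u00e7ais.txt'  # ç written as the single code point U+00E7
nfc = unicodedata.normalize('NFC', name)
nfd = unicodedata.normalize('NFD', name)

# The names look identical when displayed but are different strings
print(nfc == nfd)
print(len(nfc), len(nfd))

# At the byte level, NFD stores 'c' followed by COMBINING CEDILLA (U+0327)
print(nfc.encode('utf-8'))
print(nfd.encode('utf-8'))

# NFD leaves look-alike glyphs alone; NFKD also folds them to plain letters
print(unicodedata.normalize('NFD', '\U0001D63F') == '\U0001D63F')
print(unicodedata.normalize('NFKD', '\U0001D63F'))
```

The last two lines are why the choice between NFD and NFKD matters here: NFKD would combine the decomposed form with the look-alike folding discussed earlier in the thread.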

dirkf · 2023-05-03T01:14:50Z

Are you really saying that Apple has built programs that don't understand U+00E7 (LATIN SMALL LETTER C WITH CEDILLA), which AIUI is the result of NFKC processing for anything that looks like ç?

Isn't that an Apple bug?

Also, NFD_rename.py?:

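import os
import sys
import unicodedata


def main():
    # Rename the file given on the command line to its NFD-normalized name
    filename = sys.argv[1]
    dirname, base = os.path.split(filename)
    if not base:
        # Path ends in a separator: there is no file name to normalize
        return
    base = unicodedata.normalize('NFD', base)
    nfd_filename = os.path.join(dirname, base)
    if filename != nfd_filename:
        os.rename(filename, nfd_filename)


if __name__ == '__main__':
    main()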

(or NFKD, etc, as required)

mansourmoufid · 2023-05-03T01:41:33Z

Yes and yes. It's the craziest Apple bug I ever saw. And os.rename() from NFC to NFD form works.

Update: I just checked on the bug tracker, and there's an update from 7 hours ago:

macOS Ventura 13.4 Beta 4 Release Notes
Fixed a regression in macOS Ventura 13.3 where a security check causes bookmark resolution to fail when the path contains Unicode characters stored with composed normalization. As an example, this prevented files in Finder from opening when double-clicked. (107550080)

Sorry, I should have checked that first before commenting.

dirkf · 2023-05-03T08:40:28Z

Completed in c94a459.

s1sw4nto added the question label Aug 31, 2022

s1sw4nto changed the title ~~[ask] Youtube get title: Italic bold to default text~~ [ask] Youtube get title: Font style to default font Aug 31, 2022

s1sw4nto closed this as completed Sep 1, 2022

s1sw4nto reopened this Sep 1, 2022

dirkf closed this as completed Sep 3, 2022

dirkf reopened this Sep 3, 2022

dirkf added the patch-available label Sep 21, 2022

dirkf added a commit that referenced this issue Oct 11, 2022

[utils] Sanitize look-alike Unicode glyphs in non-ID filename fields …

c94a459

…when --restrict-filenames Implements #31216 (comment), which has a test.

alxlive pushed a commit to alxlive/youtube-dl that referenced this issue Feb 27, 2023

[utils] Sanitize look-alike Unicode glyphs in non-ID filename fields …

46de920

…when --restrict-filenames Implements ytdl-org#31216 (comment), which has a test.

dirkf changed the title ~~[ask] Youtube get title: Font style to default font~~ [Youtube] Convert title text from surprising look-alike Unicode glyhs May 3, 2023

dirkf closed this as completed May 3, 2023

dirkf changed the title ~~[Youtube] Convert title text from surprising look-alike Unicode glyhs~~ [Youtube] Convert title text from surprising look-alike Unicode glyphs Jan 9, 2024
