[Youtube] Convert title text from surprising look-alike Unicode glyphs #31216

Closed
s1sw4nto opened this issue Aug 31, 2022 · 18 comments

@s1sw4nto

s1sw4nto commented Aug 31, 2022

Checklist

  • [✓] I'm asking a question
  • [✓] I've looked through the README and FAQ for similar questions
  • [✓] I've searched the bugtracker for similar questions including closed ones

Question

Example:
youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )

My out file: 𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 - 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 - 𝘿𝙟 𝙋𝙤 - Radio Dangdut 24 Jam.mp3

The mp3 plays fine, no problem, but the filename looks like that.
Please help: how can I convert that title to default (plain) text in Linux (bash script)?
Thanks.

Screenshot_20220901-053542_AndFTP

@s1sw4nto s1sw4nto changed the title [ask] Youtube get title: Italic bold to default text [ask] Youtube get title: Font style to default font Aug 31, 2022
@dirkf
Contributor

dirkf commented Sep 1, 2022

The title is produced using Unicode characters from the Unicode block of Mathematical Alphanumeric Symbols.

There is no direct way to convert this abstruse encoding to semantically valid characters. You'd have to create a translation table or rely on each variant of the symbols being a sequence A-Za-z starting at the code point for the glyph that resembles A.

The POSIX tr program is the tool to use in a shell script.

The --restrict-filenames option does handle this title, but elides any run of symbol characters to a single _, which is probably not what you want.
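The translation-table approach described above can be sketched in Python 3 as follows. This is only an illustration, not anything youtube-dl ships: the helper names `lookalike_table` and `to_plain` are hypothetical, and the sketch assumes the title uses the Mathematical Sans-Serif Bold Italic letters, which start at U+1D63C for capitals and are followed directly by the small letters. Some other alphabets in the block have gaps (a few letters live elsewhere in Unicode), so a simple offset would not cover them all.

```python
def lookalike_table(upper_a, lower_a=None):
    """Build a str.translate() table for one look-alike alphabet.

    upper_a is the code point of the glyph resembling 'A'; lower_a is the
    one resembling 'a' (assumed to follow the 26 capitals if not given).
    """
    if lower_a is None:
        lower_a = upper_a + 26
    table = {}
    for i in range(26):
        table[upper_a + i] = ord('A') + i
        table[lower_a + i] = ord('a') + i
    return table


# Mathematical Sans-Serif Bold Italic: capitals start at U+1D63C
SANS_BOLD_ITALIC = lookalike_table(0x1D63C)


def to_plain(text, table=SANS_BOLD_ITALIC):
    """Replace mapped look-alike letters; leave everything else untouched."""
    return text.translate(table)


if __name__ == '__main__':
    # U+1D63F and U+1D65F are the glyphs resembling 'D' and 'j'
    print(to_plain(chr(0x1D63F) + chr(0x1D65F) + ' Popo'))
```

A table per alphabet is needed with this approach, which is why it does not scale well to titles that mix several styles.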

@s1sw4nto
Author

s1sw4nto commented Sep 1, 2022

I found a simple solution: iconv -f utf-8 -t ascii//translit

youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8 | iconv -f utf-8 -t ascii//translit | sed -E 's/[^[:alnum:][:blank:]]+/-/g' | sed 's/- -/-/g' | sed 's/ -*$//g' | sed 's/-*$//g' | sed 's/_*$//g' | sed 's/$/ - Radio Dangdut 24 Jam.mp3/g'

Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS - Dike Sabrina - Dj Popo - Radio Dangdut 24 Jam.mp3
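For comparison, roughly the same clean-up can be sketched in Python 3 without iconv. This is an approximation under stated assumptions, not a port: `tidy_title` is a hypothetical helper, NFKC normalization stands in for `ascii//translit` (the two do not agree on every character), and characters with no ASCII equivalent are dropped rather than transliterated.

```python
import re
import unicodedata


def tidy_title(title, suffix=''):
    """Approximate the iconv + sed pipeline above."""
    # Fold compatibility characters (e.g. look-alike letters) to plain ones
    text = unicodedata.normalize('NFKC', title)
    # Drop whatever still is not ASCII (iconv would transliterate instead)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Runs of characters other than letters, digits, space, tab become '-'
    text = re.sub(r'[^A-Za-z0-9 \t]+', '-', text)
    # Merge '- -' left behind by adjacent symbol runs
    text = text.replace('- -', '-')
    # Trim trailing spaces, dashes and underscores
    text = re.sub(r'[ _-]+$', '', text)
    return text + suffix


if __name__ == '__main__':
    fancy_dj = chr(0x1D63F) + chr(0x1D65F)  # look-alike 'Dj'
    print(tidy_title(fancy_dj + ' " A " \u2757( B )', ' - Radio.mp3'))
```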

@s1sw4nto s1sw4nto closed this as completed Sep 1, 2022
@s1sw4nto s1sw4nto reopened this Sep 1, 2022
@s1sw4nto
Author

s1sw4nto commented Sep 1, 2022

OK, thank you.

@dirkf
Contributor

dirkf commented Sep 3, 2022

Good to know that iconv implements this conversion. There is this wrapper that implements codecs using iconv, GPL3 and Py3.6+.

@dirkf dirkf closed this as completed Sep 3, 2022
@pukkandan
Contributor

pukkandan commented Sep 3, 2022

Since --restrict-filenames already attempts to clean up accents and the like, I wouldn't say this is out of scope, especially since there is no need for us to maintain a mapping: Python already does that for us. All we need to do is pass the filename through unicodedata.normalize. It probably wouldn't work to everyone's liking, but it is good enough imo.

Relevant yt-dlp code: https://github.com/yt-dlp/yt-dlp/blob/adba24d2079d350fc03226adff3cae919d7a11db/yt_dlp/utils.py#L676-L677

@dirkf
Contributor

dirkf commented Sep 3, 2022

So that's practical:

--- old/youtube_dl/utils.py
+++ new/youtube_dl/utils.py
@@ -33,6 +33,7 @@ import sys
 import tempfile
 import time
 import traceback
+import unicodedata
 import xml.etree.ElementTree
 import zlib
 
@@ -2118,6 +2119,9 @@ def sanitize_filename(s, restricted=False, is_id=False):
             return '_'
         return char
 
+    # Replace look-alike Unicode glyphs
+    if restricted and not is_id:
+        s = unicodedata.normalize('NFKC', s)
     # Handle timestamps
     s = re.sub(r'[0-9]+(?::[0-9]+)+', lambda m: m.group(0).replace(':', '_'), s)
     result = ''.join(map(replace_insane, s))

Then:

$ python 
Python 2.7.17 (default, Jul 28 2022, 20:17:29) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize('NFKC',u'𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )')
u'Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " \u2757( Dj Popo )'
>>> 
$ python -m youtube_dl --get-title 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )
$ python -m youtube_dl --get-filename 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 ' 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 ' ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )-uHbbM4_Y-m8.mp4
$ python -m youtube_dl --get-filename --restrict-filenames 'https://youtu.be/uHbbM4_Y-m8'
Dj_Opo_Iseh_Ono_THAILAND_STYLE_x_SLOW_BASS_Dike_Sabrina_Dj_Popo-uHbbM4_Y-m8.mp4
$

... There is no direct way to convert this abstruse encoding to semantically valid characters ...

... unless you know about the unicodedata module (I suppose that the iconv package adds more functionality)!

One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; e.g.: --unicode-normalize all|metadata-list|none.

@dirkf dirkf reopened this Sep 3, 2022
@pukkandan
Contributor

One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; e.g.: --unicode-normalize all|metadata-list|none.

yt-dlp has Unicode normalization built into --print/-o. I'm not sure whether you want to expand or complicate the output template syntax like that.

❯ yt-dlp -O %(title)+U https://youtu.be/uHbbM4_Y-m8
Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " ❗( Dj Popo )

@dirkf
Contributor

dirkf commented Sep 3, 2022

That doesn't affect the unrestricted filename, though?

Anyway, absent a compelling PR offered by someone else, I wouldn't want to implement the formatting syntax from yt-dlp for the moment, but at least --[no-]unicode-normalization might be possible, or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.

I'd expect that --unicode-normalization would apply to these free-text non-ID fields:

    title
    alt_title
    description
    uploader
    creator
    channel
    comments[n]['author']
    comments[n]['text']
    categories[n]
    tags[n]
    chapters[n]['title']
    chapter
    series
    episode
    track
    artist
    album
    album_artist
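As a rough illustration of what such an option might do, the sketch below applies NFKC normalization to the fields listed above in an info dict and leaves everything else (such as `id`) alone. It is only a sketch under assumptions: `normalize_info` and the three field tables are hypothetical names, not youtube-dl code, and the info dict is assumed to follow the shape implied by the list (plain strings, lists of strings, lists of dicts).

```python
import unicodedata

# Field groups taken from the list above; the grouping is an assumption
TEXT_FIELDS = (
    'title', 'alt_title', 'description', 'uploader', 'creator', 'channel',
    'chapter', 'series', 'episode', 'track', 'artist', 'album', 'album_artist',
)
LIST_FIELDS = ('categories', 'tags')
DICT_LIST_FIELDS = {'comments': ('author', 'text'), 'chapters': ('title',)}


def _norm(value, form):
    # Only text is normalized; None, numbers etc. pass through unchanged
    return unicodedata.normalize(form, value) if isinstance(value, str) else value


def normalize_info(info, form='NFKC'):
    """Return a copy of info with the free-text fields normalized."""
    out = dict(info)
    for key in TEXT_FIELDS:
        if key in out:
            out[key] = _norm(out[key], form)
    for key in LIST_FIELDS:
        if isinstance(out.get(key), list):
            out[key] = [_norm(v, form) for v in out[key]]
    for key, subkeys in DICT_LIST_FIELDS.items():
        if isinstance(out.get(key), list):
            items = []
            for item in out[key]:
                if isinstance(item, dict):
                    item = dict(item)
                    for sub in subkeys:
                        if sub in item:
                            item[sub] = _norm(item[sub], form)
                items.append(item)
            out[key] = items
    return out
```

Returning a copy rather than editing in place keeps the original metadata available, which fits the view that the unnormalized text is the authoritative one.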

@pukkandan
Contributor

That doesn't affect the unrestricted filename, though?

It does, if you use +U in -o

or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.

I don't think this should be done. First off, while some users may prefer the normalized metadata, the original Unicode is the correct one. Normalization should only be done with a user-facing option. Also, letting extractors do this creates inconsistencies, which will get harder and harder to standardize over time.

@s1sw4nto
Author

s1sw4nto commented Sep 8, 2022

In the next youtube-dl update, please add a --unicode-normalization option.
I can't use yt-dlp because my Python is 2.6.

@rautamiekka

This comment was marked as off-topic.

@s1sw4nto

This comment was marked as off-topic.

@rautamiekka

This comment was marked as off-topic.

dirkf added a commit that referenced this issue Oct 11, 2022
…when --restrict-filenames

Implements #31216 (comment), which has a test.
@dirkf
Contributor

dirkf commented Oct 20, 2022

The original YT video is no longer available. If someone has a current URL that generates a filename with Unicode look-alike characters, we can demonstrate the result of the above commit using --restrict-filenames.

Presumably bots that search for potentially copyright-infringing material also know about the transformation in use, so the practice of using such characters may wither away.

alxlive pushed a commit to alxlive/youtube-dl that referenced this issue Feb 27, 2023
@mansourmoufid

Since macOS 13.3.1, it is very difficult to open files with names encoded in Unicode normal form C (NFC); only normal form D (NFD) is supported. You can create, read, write, etc., files in either encoding just fine with the Unix API (where file names are just bytes), but AppKit will refuse to open such files now.

To reproduce the issue:

echo 'Bonjour!' > français.txt
open -a TextEdit français.txt 

and TextEdit will just hang. Same for QuickTime with Unicode-titled videos downloaded by youtube-dl.
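To make the NFC/NFD distinction concrete, here is a small Python 3 sketch showing that the two forms of the same visible name are different code point sequences, and therefore different bytes on disk. The `code_points` helper is just an illustrative name, and the sample string is written with escapes so the result does not depend on how this text itself is encoded.

```python
import unicodedata


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return ' '.join('U+%04X' % ord(ch) for ch in text)


name = 'fran\u00e7ais'  # 'c with cedilla' as one precomposed character

nfc = unicodedata.normalize('NFC', name)  # composed form
nfd = unicodedata.normalize('NFD', name)  # 'c' followed by U+0327

if __name__ == '__main__':
    print(code_points(nfc))
    print(code_points(nfd))
    print(nfc == nfd)  # the strings compare unequal despite looking alike
```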

Ideally, Python would handle this using os.fsencode but this function is implemented as a simple string.encode call.

So I guess the right place is in utils.sanitize_filename, like above. It would be nice if this function could include something like:

if sys.platform == 'darwin':
    s = unicodedata.normalize('NFKD', s)

@dirkf
Contributor

dirkf commented May 3, 2023

Are you really saying that Apple has built programs that don't understand U+00E7 (LATIN SMALL LETTER C WITH CEDILLA), which AIUI is the result of NFKC processing for anything that looks like ç?

Isn't that an Apple bug?

Also, NFD_rename.py?:

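import os
import sys
import unicodedata


def main():
    # Rename the file given as the first argument to its NFD-normalized name
    filename = sys.argv[1]
    dirname, base = os.path.split(filename)
    if not base:
        # The path ends in a separator, so there is no file name to normalize
        return
    base = unicodedata.normalize('NFD', base)
    nfd_filename = os.path.join(dirname, base)
    if filename != nfd_filename:
        os.rename(filename, nfd_filename)


if __name__ == '__main__':
    main()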

(or NFKD, etc, as required)
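The choice between NFD and NFKD matters here, and a short Python 3 sketch may help: the canonical forms (NFC, NFD) only compose or decompose accents, while the compatibility forms (NFKC, NFKD) additionally fold look-alike glyphs to plain letters. The `all_forms` helper is a hypothetical name used for illustration.

```python
import unicodedata


def all_forms(text):
    """Return the four Unicode normalization forms of text, keyed by name."""
    return dict(
        (form, unicodedata.normalize(form, text))
        for form in ('NFC', 'NFD', 'NFKC', 'NFKD'))


if __name__ == '__main__':
    # A look-alike 'D' (U+1D63F) followed by a precomposed c-cedilla
    sample = chr(0x1D63F) + '\u00e7'
    for form, value in sorted(all_forms(sample).items()):
        print(form, ' '.join('U+%04X' % ord(ch) for ch in value))
```

So NFD alone would work around the file-opening problem without altering styled titles, whereas NFKD would also rewrite them.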

@mansourmoufid

Yes and yes. It's the craziest Apple bug I ever saw. And os.rename() from NFC to NFD form works.

Update: I just checked on the bug tracker, and there's an update from 7 hours ago:

macOS Ventura 13.4 Beta 4 Release Notes
Fixed a regression in macOS Ventura 13.3 where a security check causes bookmark resolution to fail when the path contains Unicode characters stored with composed normalization. As an example, this prevented files in Finder from opening when double-clicked. (107550080)

Sorry, I should have checked that first before commenting.

@dirkf dirkf changed the title [ask] Youtube get title: Font style to default font [Youtube] Convert title text from surprising look-alike Unicode glyhs May 3, 2023
@dirkf
Contributor

dirkf commented May 3, 2023

Completed in c94a459.

@dirkf dirkf closed this as completed May 3, 2023
@dirkf dirkf changed the title [Youtube] Convert title text from surprising look-alike Unicode glyhs [Youtube] Convert title text from surprising look-alike Unicode glyphs Jan 9, 2024