[Youtube] Convert title text from surprising look-alike Unicode glyphs #31216
Comments
The title is produced using Unicode characters from the Unicode block of Mathematical Alphanumeric Symbols.
The POSIX The |
I found simple solution:
Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS - Dike Sabrina - Dj Popo - Radio Dangdut 24 Jam.mp3 |
OK, thank you. |
Good to know that iconv implements this conversion. There is this wrapper that implements codecs using iconv, GPL3 and Py3.6+. |
Since Relevant yt-dlp code: https://github.com/yt-dlp/yt-dlp/blob/adba24d2079d350fc03226adff3cae919d7a11db/yt_dlp/utils.py#L676-L677 |
So that's practical:
--- old/youtube_dl/utils.py
+++ new/youtube_dl/utils.py
@@ -33,6 +33,7 @@ import sys
 import tempfile
 import time
 import traceback
+import unicodedata
 import xml.etree.ElementTree
 import zlib

@@ -2118,6 +2119,9 @@ def sanitize_filename(s, restricted=False, is_id=False):
             return '_'
         return char

+    # Replace look-alike Unicode glyphs
+    if restricted and not is_id:
+        s = unicodedata.normalize('NFKC', s)
     # Handle timestamps
     s = re.sub(r'[0-9]+(?::[0-9]+)+', lambda m: m.group(0).replace(':', '_'), s)
     result = ''.join(map(replace_insane, s))
Then:
... unless you know about the One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without |
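To illustrate what the NFKC step in the patch does to a title like the one reported, here is a minimal sketch. The helper name `fold_lookalikes` is hypothetical and not part of youtube-dl; it only mirrors the idea of normalizing when a restricted flag is set. The styled letters are written as escapes so the example does not depend on the editor's font.

```python
import unicodedata

def fold_lookalikes(title, restricted=True):
    """Hypothetical helper: apply NFKC only in restricted mode, as the patch does."""
    if restricted:
        # NFKC replaces compatibility characters (such as Mathematical
        # Alphanumeric Symbols) with their plain equivalents.
        return unicodedata.normalize('NFKC', title)
    return title

# "Dj" written in Mathematical Sans-Serif Bold Italic (U+1D63F, U+1D65F)
styled = '\U0001D63F\U0001D65F'

print(fold_lookalikes(styled))                              # Dj
print(fold_lookalikes(styled, restricted=False) == styled)  # True
```

Note that NFKC is lossy in general: it also folds other compatibility characters, for example the ligature U+FB01 becomes the two letters "fi". That is one reason to keep it behind an option rather than applying it unconditionally.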
yt-dlp has unicode normalization built into the
|
That doesn't affect the unrestricted filename, though? Anyway, absent a compelling PR offered by someone else, I wouldn't want to implement the formatting syntax from yt-dlp for the moment, but at least I'd expect that
|
It does, if you use
I don't think this should be done. First off, while some users may prefer normalized metadata, the original Unicode text is the correct one. Normalization should only be done through a user-facing option. Also, letting extractors do this creates inconsistencies, which will get harder and harder to standardize over time. |
In the next youtube-dl update, please add a --unicode-normalization option. |
This comment was marked as off-topic.
…when --restrict-filenames Implements #31216 (comment), which has a test.
The original YT video is no longer available. If someone has a current URL that generates a filename with Unicode look-alike characters, we can demonstrate the result of the above commit using Presumably bots that search for potentially copyright-infringing material also know about the transformation in use, so the practice of using such characters may wither away. |
Since macOS 13.3.1, it is very difficult to open files whose names are encoded in Unicode normal form C (NFC); only normal form D (NFD) is supported. You can create, read, write, etc., files in either encoding just fine with the Unix API (where file names are just bytes), but AppKit will now refuse to open such files. To reproduce the issue:
and TextEdit will just hang. Same for QuickTime with Unicode-titled videos downloaded by youtube-dl. Ideally, Python would handle this using os.fsencode but this function is implemented as a simple string.encode call. So I guess the right place is in utils.sanitize_filename, like above. It would be nice if this function could include something like:
|
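A minimal sketch of platform-dependent normalization along the lines suggested in the comment above. The helper `normalize_for_platform` is hypothetical and is not code from youtube-dl; it assumes NFD is wanted on macOS (`sys.platform == 'darwin'`) and that names are left unchanged elsewhere.

```python
import sys
import unicodedata

def normalize_for_platform(name, platform=sys.platform):
    """Hypothetical helper: use NFD on macOS, leave the name unchanged elsewhere."""
    if platform == 'darwin':
        return unicodedata.normalize('NFD', name)
    return name

nfc = '\u00e9'   # 'é' as a single precomposed code point (NFC)
nfd = 'e\u0301'  # 'e' followed by COMBINING ACUTE ACCENT (NFD)

# The two forms look the same when displayed but are different strings.
print(nfc == nfd)                                              # False
print(normalize_for_platform(nfc, platform='darwin') == nfd)   # True
```

The `platform` parameter exists only so the behaviour can be exercised on any system; a real change would need to decide where in the filename pipeline this belongs.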
Are you really saying that Apple has built programs that don't understand Isn't that an Apple bug? Also,
(or |
Yes and yes. It's the craziest Apple bug I ever saw. And os.rename() from NFC to NFD form works. Update: I just checked on the bug tracker, and there's an update from 7 hours ago:
Sorry, I should have checked that first before commenting. |
Completed in c94a459. |
Checklist
Question
Example:
youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )
My output file: 𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 - 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 - 𝘿𝙟 𝙋𝙤 - Radio Dangdut 24 Jam.mp3
The MP3 plays fine, but the filename looks like that.
How can I convert that title to plain text on Linux (bash script)?
Thanks.
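For files that have already been downloaded, one possible approach is to rename them after the fact. The sketch below uses Python rather than bash; the function name `rename_lookalikes` is hypothetical, and it assumes NFKC folding is acceptable for every filename in the directory.

```python
import os
import unicodedata

def rename_lookalikes(directory):
    """Rename files whose names change under NFKC; return (old, new) pairs."""
    renamed = []
    for name in os.listdir(directory):
        new_name = unicodedata.normalize('NFKC', name)
        if new_name == name:
            continue  # nothing to fold in this name
        src = os.path.join(directory, name)
        dst = os.path.join(directory, new_name)
        if os.path.exists(dst):
            continue  # never overwrite an existing file
        os.rename(src, dst)
        renamed.append((name, new_name))
    return renamed
```

Calling `rename_lookalikes('.')` in the download directory would process the files there; trying it on a copy first is a sensible precaution, since the renaming is not reversible from the new names alone.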