ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

[Youtube] Convert title text from surprising look-alike Unicode glyphs #31216

Closed s1sw4nto closed 1 year ago

s1sw4nto commented 2 years ago

Question

Example:
youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )

My out file: 𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 - 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 - 𝘿𝙟 𝙋𝙤 - Radio Dangdut 24 Jam.mp3

The mp3 plays fine, no problem, but the filename looks like that. How can I convert the title to default (plain) text in Linux (bash script)? Thanks.


dirkf commented 2 years ago

The title is produced using Unicode characters from the Unicode block of Mathematical Alphanumeric Symbols.

There is no direct way to convert this abstruse encoding to semantically valid characters. You'd have to create a translation table or rely on each variant of the symbols being a sequence A-Za-z starting at the code point for the glyph that resembles A.
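A minimal Python 3 sketch of the translation-table idea, assuming the title uses a single variant (Mathematical Sans-Serif Bold Italic, as this title appears to) whose capitals and smalls each form a contiguous run. The helper name `build_table` is hypothetical. Some other variants, such as the plain italic and script alphabets, have gaps where a letter was already encoded elsewhere in Unicode, so a simple offset would not cover them.

```python
import string
import unicodedata


def build_table(variant):
    """Map one Mathematical Alphanumeric variant to plain A-Z and a-z.

    `variant` is the fragment of the Unicode character name, for example
    'SANS-SERIF BOLD ITALIC'. Assumes both letter runs are contiguous.
    """
    upper_a = ord(unicodedata.lookup('MATHEMATICAL %s CAPITAL A' % variant))
    lower_a = ord(unicodedata.lookup('MATHEMATICAL %s SMALL A' % variant))
    table = {}
    for offset, letter in enumerate(string.ascii_uppercase):
        table[upper_a + offset] = letter
    for offset, letter in enumerate(string.ascii_lowercase):
        table[lower_a + offset] = letter
    return table


TABLE = build_table('SANS-SERIF BOLD ITALIC')
styled = '\U0001D63F\U0001D65F'   # the first two glyphs of the title
print(styled.translate(TABLE))    # prints: Dj
```

Characters outside the table (spaces, quotes, the emoji) pass through `str.translate` unchanged.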

The POSIX tr program is the tool to use in a shell script.

The --restrict-filenames option does handle this title, but elides any run of symbol characters to a single _, which is probably not what you want.

s1sw4nto commented 2 years ago

I found a simple solution: iconv -f utf-8 -t ascii//translit

youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8 | iconv -f utf-8 -t ascii//translit | sed -E 's/[^[:alnum:][:blank:]]+/-/g' | sed 's/- -/-/g' | sed 's/ -*$//g' | sed 's/-*$//g' | sed 's/_*$//g' | sed 's/$/ - Radio Dangdut 24 Jam.mp3/g'

Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS - Dike Sabrina - Dj Popo - Radio Dangdut 24 Jam.mp3
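For anyone who would rather not chain sed calls, here is a rough Python 3 sketch of the same clean-up. It is a stand-in, not a port of iconv: it uses `unicodedata` NFKD decomposition and then drops whatever is still non-ASCII, which may differ from what `//translit` produces for other characters. The function name `ascii_filename` and its `suffix` parameter are hypothetical.

```python
import re
import unicodedata


def ascii_filename(title, suffix=' - Radio Dangdut 24 Jam.mp3'):
    # Decompose look-alike glyphs and accents, then drop anything non-ASCII.
    # (Assumption: dropping is acceptable; iconv //translit may substitute
    # a replacement character instead.)
    text = unicodedata.normalize('NFKD', title)
    text = text.encode('ascii', 'ignore').decode('ascii')
    # Collapse each run of non-alphanumeric, non-blank characters into '-'.
    text = re.sub(r'[^A-Za-z0-9 \t]+', '-', text)
    # Merge the "- -" left behind by neighbouring symbols.
    text = re.sub(r'- -', '-', text)
    # Trim trailing blanks, dashes and underscores before adding the suffix.
    text = text.rstrip(' -_')
    return text + suffix


title = '\U0001D63F\U0001D65F " x " \u2757( y )'
print(ascii_filename(title))
```

The trailing-character trimming is done with one `rstrip` call rather than three separate substitutions; for titles like this one the result should be the same.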

s1sw4nto commented 2 years ago

OK, thank you.

dirkf commented 2 years ago

Good to know that iconv implements this conversion. There is this wrapper that implements codecs using iconv, GPL3 and Py3.6+.

pukkandan commented 2 years ago

Since --restrict-filenames already attempts to clean up accents and the like, I wouldn't say this is out of scope. Especially since there is no need for us to maintain a mapping - Python already does that for us. All we need to do is pass the filename through unicodedata.normalize. It probably wouldn't work to everyone's liking, but is good enough imo

Relevant yt-dlp code: https://github.com/yt-dlp/yt-dlp/blob/adba24d2079d350fc03226adff3cae919d7a11db/yt_dlp/utils.py#L676-L677

dirkf commented 2 years ago

So that's practical:

--- old/youtube_dl/utils.py
+++ new/youtube_dl/utils.py
@@ -33,6 +33,7 @@ import sys
 import tempfile
 import time
 import traceback
+import unicodedata
 import xml.etree.ElementTree
 import zlib

@@ -2118,6 +2119,9 @@ def sanitize_filename(s, restricted=False, is_id=False):
             return '_'
         return char

+    # Replace look-alike Unicode glyphs
+    if restricted and not is_id:
+        s = unicodedata.normalize('NFKC', s)
     # Handle timestamps
     s = re.sub(r'[0-9]+(?::[0-9]+)+', lambda m: m.group(0).replace(':', '_'), s)
     result = ''.join(map(replace_insane, s))

Then:

$ python 
Python 2.7.17 (default, Jul 28 2022, 20:17:29) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize('NFKC',u'𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )')
u'Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " \u2757( Dj Popo )'
>>> 
$ python -m youtube_dl --get-title 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )
$ python -m youtube_dl --get-filename 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 ' 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 ' ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )-uHbbM4_Y-m8.mp4
$ python -m youtube_dl --get-filename --restrict-filenames 'https://youtu.be/uHbbM4_Y-m8'
Dj_Opo_Iseh_Ono_THAILAND_STYLE_x_SLOW_BASS_Dike_Sabrina_Dj_Popo-uHbbM4_Y-m8.mp4
$

... There is no direct way to convert this abstruse encoding to semantically valid characters ...

... unless you know about the unicodedata module (I suppose that the iconv package adds more functionality)!

One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; eg: --unicode-normalize all|metadata-list|none.

pukkandan commented 2 years ago

One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; eg: --unicode-normalize all|metadata-list|none.

yt-dlp has Unicode normalization built into --print/-o. Not sure if you want to expand/complicate the output template syntax like that.

❯ yt-dlp -O %(title)+U https://youtu.be/uHbbM4_Y-m8
Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " ❗( Dj Popo )

dirkf commented 2 years ago

That doesn't affect the unrestricted filename, though?

Anyway, absent a compelling PR offered by someone else, I wouldn't want to implement the formatting syntax from yt-dlp for the moment, but at least --[no-]unicode-normalization might be possible, or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.

I'd expect that --unicode-normalization would apply to these free-text non-ID fields:

    title
    alt_title
    description
    uploader
    creator
    channel
    comments[n]['author']
    comments[n]['text']
    categories[n]
    tags[n]
    chapters[n]['title']
    chapter
    series
    episode
    track
    artist
    album
    album_artist
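
As a rough illustration of what such an option might do, here is a Python 3 sketch that normalizes the fields listed above in an info dict. Everything named here (`TEXT_FIELDS`, `normalize_info_text`, the `form` parameter) is hypothetical and not part of youtube-dl; the dict layout is assumed from the field names in the list.

```python
import unicodedata

# Top-level free-text fields from the list above (hypothetical constant).
TEXT_FIELDS = (
    'title', 'alt_title', 'description', 'uploader', 'creator', 'channel',
    'chapter', 'series', 'episode', 'track', 'artist', 'album', 'album_artist',
)


def _norm(value, form):
    # Only text is normalized; None and other types pass through unchanged.
    if isinstance(value, str):
        return unicodedata.normalize(form, value)
    return value


def _norm_keys(mapping, keys, form):
    # Copy a nested dict, normalizing only the given keys that are present.
    updated = dict(mapping)
    for key in keys:
        if key in updated:
            updated[key] = _norm(updated[key], form)
    return updated


def normalize_info_text(info, form='NFKC'):
    """Return a copy of `info` with its free-text, non-ID fields normalized."""
    result = _norm_keys(info, TEXT_FIELDS, form)
    for key in ('categories', 'tags'):
        if result.get(key):
            result[key] = [_norm(item, form) for item in result[key]]
    if result.get('comments'):
        result['comments'] = [
            _norm_keys(c, ('author', 'text'), form) for c in result['comments']]
    if result.get('chapters'):
        result['chapters'] = [
            _norm_keys(c, ('title',), form) for c in result['chapters']]
    return result
```

ID-like fields such as `id` are deliberately left alone, matching the `not is_id` guard in the patch above.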

pukkandan commented 2 years ago

That doesn't affect the unrestricted filename, though?

It does, if you use +U in -o

or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.

I don't think this should be done. First off, while some users may prefer the normalized metadata, the unicode is the correct one. Normalization should only be done with a user-facing option. Also, letting extractors do this creates inconsistencies, which will get harder and harder to standardize over time

s1sw4nto commented 2 years ago

In the next youtube-dl update, please add a --unicode-normalization option. I can't use yt-dlp because my Python is 2.6.

rautamiekka commented 2 years ago

my python 2.6

Why ?

s1sw4nto commented 2 years ago

my python 2.6

Why ?

My virtual server runs CentOS 6, and I can't install Python 3.

rautamiekka commented 2 years ago

Yeah, an EOL distro version for years; there seems to be Py >=3.6 in RH-sanctioned 3rd-party repos, though. The host (and you) is playing with fire using a dead version.

dirkf commented 2 years ago

The original YT video is no longer available. If someone has a current URL that generates a filename with Unicode look-alike characters, we can demonstrate the result of the above commit using --restrict-filenames.

Presumably bots that search for potentially copyright-infringing material also know about the transformation in use, so the practice of using such characters may wither away.

mansourmoufid commented 1 year ago

Since macOS 13.3.1, it is very difficult to open files with names encoded in Unicode normal form C (NFC); only normal form D (NFD) is supported. You can create, read, write, etc., files in either encoding just fine with the Unix API (where file names are just bytes), but AppKit will refuse to open such files now.

To reproduce the issue:

echo 'Bonjour!' > français.txt
open -a TextEdit français.txt

and TextEdit will just hang. Same for QuickTime with Unicode-titled videos downloaded by youtube-dl.

Ideally, Python would handle this using os.fsencode but this function is implemented as a simple string.encode call.
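
To make the point concrete, here is a small Python 3 sketch showing that plain encoding does not normalize: the same visible name yields different byte strings depending on its normal form. UTF-8 is used here as an assumed filesystem encoding.

```python
import unicodedata

name = 'fran\u00e7ais.txt'                  # precomposed U+00E7
nfc = unicodedata.normalize('NFC', name)
nfd = unicodedata.normalize('NFD', name)    # 'c' followed by U+0327 COMBINING CEDILLA

print(len(nfc), len(nfd))                   # 12 13
print(nfc.encode('utf-8'))                  # b'fran\xc3\xa7ais.txt'
print(nfd.encode('utf-8'))                  # b'franc\xcc\xa7ais.txt'
print(nfc == nfd)                           # False
```

So any fix has to normalize the string explicitly before it is encoded into a filename.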

So I guess the right place is in utils.sanitize_filename, like above. It would be nice if this function could include something like:

if sys.platform == 'darwin':
    s = unicodedata.normalize('NFKD', s)

dirkf commented 1 year ago

Are you really saying that Apple has built programs that don't understand U+00E7 (LATIN SMALL LETTER C WITH CEDILLA), which AIUI is the result of NFKC processing for anything that looks like ç?

Isn't that an Apple bug?

Also, NFD_rename.py?:

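import os
import sys
import unicodedata


def main():
    # Path of the file to rename, given as the first command-line argument
    filename = sys.argv[1]
    dirname, base = os.path.split(filename)
    if not base:
        # Nothing to rename (the path ends with a separator)
        return
    # Decompose the base name only; the directory part is left as given
    base = unicodedata.normalize('NFD', base)
    nfd_filename = os.path.join(dirname, base)
    if filename != nfd_filename:
        os.rename(filename, nfd_filename)


if __name__ == '__main__':
    main()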

(or NFKD, etc, as required)

mansourmoufid commented 1 year ago

Yes and yes. It's the craziest Apple bug I ever saw. And os.rename() from NFC to NFD form works.

Update: I just checked on the bug tracker, and there's an update from 7 hours ago:

macOS Ventura 13.4 Beta 4 Release Notes Fixed a regression in macOS Ventura 13.3 where a security check causes bookmark resolution to fail when the path contains Unicode characters stored with composed normalization. As an example, this prevented files in Finder from opening when double-clicked. (107550080)

Sorry, I should have checked that first before commenting.

dirkf commented 1 year ago

Completed in c94a459.