Closed s1sw4nto closed 1 year ago
The title is produced using Unicode characters from the Unicode block of Mathematical Alphanumeric Symbols.
There is no direct way to convert this abstruse encoding to semantically valid characters. You'd have to create a translation table or rely on each variant of the symbols being a sequence A-Za-z starting at the code point for the glyph that resembles A.
The POSIX tr program is the tool to use in a shell script.
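The translation-table idea can also be sketched in Python instead of tr. This is only a minimal sketch, assuming Python 3 and assuming the title uses the sans-serif bold italic alphabet, whose capitals start at U+1D63C and whose lowercase letters start at U+1D656. The names `make_offset_table` and `SANS_BOLD_ITALIC` are hypothetical. Other variants need their own start points, and some (for example the italic and script alphabets) have gaps, so the "contiguous run of 26" assumption does not hold everywhere.

```python
import string


def make_offset_table(upper_a, lower_a):
    """Build a str.translate table for one styled alphabet.

    upper_a / lower_a are the code points of the glyphs that look like
    'A' and 'a'; the following 25 code points are assumed to be B-Z / b-z.
    """
    table = {}
    for offset, letter in enumerate(string.ascii_uppercase):
        table[upper_a + offset] = letter
    for offset, letter in enumerate(string.ascii_lowercase):
        table[lower_a + offset] = letter
    return table


# Assumption: Mathematical Sans-Serif Bold Italic (A at U+1D63C, a at U+1D656).
SANS_BOLD_ITALIC = make_offset_table(0x1D63C, 0x1D656)

# Styled form of "Dj Popo", written with escapes so the source stays ASCII.
styled = '\U0001D63F\U0001D65F \U0001D64B\U0001D664\U0001D665\U0001D664'
plain = styled.translate(SANS_BOLD_ITALIC)
print(plain)
```

Characters that are not in the table (spaces, punctuation, emoji) pass through unchanged, so a title mixing several styled alphabets would need one table per alphabet merged into a single dict.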
The --restrict-filenames option does handle this title, but collapses any run of symbol characters into a single _, which is probably not what you want.
I found a simple solution: iconv -f utf-8 -t ascii//translit
youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8 | iconv -f utf-8 -t ascii//translit | sed -E 's/[^[:alnum:][:blank:]]+/-/g' | sed 's/- -/-/g' | sed 's/ -*$//g' | sed 's/-*$//g' | sed 's/_*$//g' | sed 's/$/ - Radio Dangdut 24 Jam.mp3/g'
Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS - Dike Sabrina - Dj Popo - Radio Dangdut 24 Jam.mp3
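For scripts that already run Python, roughly the same clean-up can be done without iconv or sed. A minimal sketch, assuming Python 3; `ascii_title` is a hypothetical helper. It drops any character that has no ASCII decomposition, which may differ from what iconv's //translit does with the same input.

```python
import re
import unicodedata


def ascii_title(title):
    """Fold a title to ASCII and tidy the punctuation, similar to the pipeline above."""
    # NFKD splits styled and accented letters into base characters plus
    # combining marks; encoding with 'ignore' then drops the non-ASCII rest.
    folded = unicodedata.normalize('NFKD', title)
    ascii_only = folded.encode('ascii', 'ignore').decode('ascii')
    # Replace each run of characters other than letters, digits, space or tab with '-'.
    cleaned = re.sub(r'[^A-Za-z0-9 \t]+', '-', ascii_only)
    # Merge the "- -" pairs left behind by adjacent punctuation.
    cleaned = re.sub(r'- -', '-', cleaned)
    # Trim spaces, dashes and underscores from both ends
    # (the sed version only trims the end of the line).
    return cleaned.strip(' -_')


# Styled "Dj" followed by quoted text and an emoji, written with escapes.
title = '\U0001D63F\U0001D65F " A " \u2757( B )'
print(ascii_title(title) + ' - Radio Dangdut 24 Jam.mp3')
```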
OK, thank you.
Good to know that iconv implements this conversion. There is this wrapper that implements codecs using iconv, GPL3 and Py3.6+.
Since --restrict-filenames already attempts to clean up accents and the like, I wouldn't say this is out of scope, especially since there is no need for us to maintain a mapping: Python already does that for us. All we need to do is pass the filename through unicodedata.normalize. It probably wouldn't work to everyone's liking, but is good enough imo.
Relevant yt-dlp code: https://github.com/yt-dlp/yt-dlp/blob/adba24d2079d350fc03226adff3cae919d7a11db/yt_dlp/utils.py#L676-L677
So that's practical:
--- old/youtube_dl/utils.py
+++ new/youtube_dl/utils.py
@@ -33,6 +33,7 @@ import sys
 import tempfile
 import time
 import traceback
+import unicodedata
 import xml.etree.ElementTree
 import zlib
 
@@ -2118,6 +2119,9 @@ def sanitize_filename(s, restricted=False, is_id=False):
             return '_'
         return char
 
+    # Replace look-alike Unicode glyphs
+    if restricted and not is_id:
+        s = unicodedata.normalize('NFKC', s)
     # Handle timestamps
     s = re.sub(r'[0-9]+(?::[0-9]+)+', lambda m: m.group(0).replace(':', '_'), s)
     result = ''.join(map(replace_insane, s))
Then:
$ python
Python 2.7.17 (default, Jul 28 2022, 20:17:29)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.normalize('NFKC',u'𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )')
u'Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " \u2757( Dj Popo )'
>>>
$ python -m youtube_dl --get-title 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )
$ python -m youtube_dl --get-filename 'https://youtu.be/uHbbM4_Y-m8'
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 ' 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 ' ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )-uHbbM4_Y-m8.mp4
$ python -m youtube_dl --get-filename --restrict-filenames 'https://youtu.be/uHbbM4_Y-m8'
Dj_Opo_Iseh_Ono_THAILAND_STYLE_x_SLOW_BASS_Dike_Sabrina_Dj_Popo-uHbbM4_Y-m8.mp4
$
... There is no direct way to convert this abstruse encoding to semantically valid characters ...
... unless you know about the unicodedata module (I suppose that the iconv package adds more functionality)!
One might consider whether (perhaps as an option) this transformation should be applied to all textual metadata, and also to the filename without --restrict-filenames, since otherwise the metadata is probably meaningless; e.g.: --unicode-normalize all|metadata-list|none.
yt-dlp has Unicode normalization built into the --print/-o output template. Not sure if you want to expand/complicate the output template syntax like that.
❯ yt-dlp -O %(title)+U https://youtu.be/uHbbM4_Y-m8
Dj Opo Iseh Ono THAILAND STYLE x SLOW BASS " Dike Sabrina " ❗( Dj Popo )
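As I read the yt-dlp README, the U conversion defaults to NFC, the # flag switches to the decomposed form, and + selects compatibility equivalence, so +U corresponds to NFKC. The sketch below only illustrates that mapping under that reading; `form_for_flags` and `apply_u_conversion` are hypothetical names, not yt-dlp functions, and it assumes Python 3.

```python
import unicodedata


def form_for_flags(flags):
    """Map template flags ('', '#', '+', '#+') to a normalization form name."""
    compat = 'K' if '+' in flags else ''   # '+' = compatibility equivalence
    shape = 'D' if '#' in flags else 'C'   # '#' = decomposed instead of composed
    return 'NF' + compat + shape


def apply_u_conversion(value, flags=''):
    """Normalize value with the form selected by flags."""
    return unicodedata.normalize(form_for_flags(flags), value)


styled = '\U0001D63F\U0001D65F'  # sans-serif bold italic "Dj"
print(apply_u_conversion(styled, '+'))       # compatibility form folds the styled letters
print(apply_u_conversion(styled) == styled)  # canonical NFC leaves them untouched
```

The second print is the reason plain canonical normalization is not enough for this title: the styled letters only have compatibility mappings to ASCII.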
That doesn't affect the unrestricted filename, though?
Anyway, absent a compelling PR offered by someone else, I wouldn't want to implement the formatting syntax from yt-dlp for the moment, but at least --[no-]unicode-normalization might be possible, or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.
I'd expect that --unicode-normalization would apply to these free-text non-ID fields:
title
alt_title
description
uploader
creator
channel
comments[n]['author']
comments[n]['text']
categories[n]
tags[n]
chapters[n]['title']
chapter
series
episode
track
artist
album
album_artist
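To make the field list concrete, here is a rough sketch of how such an option might walk an info dict. Everything in it is an assumption for illustration: `normalize_info`, the grouping into `TEXT_FIELDS`, `LIST_FIELDS` and `NESTED_FIELDS`, and the choice of NFKC as the default are not part of youtube-dl, and it assumes Python 3 and that comments and chapters are lists of dicts.

```python
import unicodedata

# Field names taken from the list above; the grouping is an assumption.
TEXT_FIELDS = (
    'title', 'alt_title', 'description', 'uploader', 'creator', 'channel',
    'chapter', 'series', 'episode', 'track', 'artist', 'album', 'album_artist',
)
LIST_FIELDS = ('categories', 'tags')
NESTED_FIELDS = {'comments': ('author', 'text'), 'chapters': ('title',)}


def _norm(value, form):
    # Leave non-string values (None, numbers, ...) untouched.
    return unicodedata.normalize(form, value) if isinstance(value, str) else value


def normalize_info(info, form='NFKC'):
    """Return a copy of info with the free-text fields normalized."""
    result = dict(info)
    for key in TEXT_FIELDS:
        if key in result:
            result[key] = _norm(result[key], form)
    for key in LIST_FIELDS:
        if isinstance(result.get(key), list):
            result[key] = [_norm(item, form) for item in result[key]]
    for key, subkeys in NESTED_FIELDS.items():
        if isinstance(result.get(key), list):
            # Assumes every entry is a dict; only the named sub-fields change.
            result[key] = [
                {k: (_norm(v, form) if k in subkeys else v) for k, v in entry.items()}
                for entry in result[key]
            ]
    return result


styled = '\U0001D63F\U0001D65F'  # sans-serif bold italic "Dj"
info = {
    'id': 'uHbbM4_Y-m8',
    'title': styled + ' Popo',
    'tags': [styled],
    'comments': [{'id': 1, 'author': styled, 'text': 'nice'}],
}
clean = normalize_info(info)
print(clean['title'], clean['tags'], clean['comments'][0]['author'])
```

ID-like fields such as id are deliberately left alone, and the input dict is not modified, so the original metadata stays available.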
That doesn't affect the unrestricted filename, though?
It does, if you use +U in -o.
or any extractor that might regularly suffer from improperly rendered metadata could use the transformation.
I don't think this should be done. First off, while some users may prefer the normalized metadata, the original Unicode is the correct one. Normalization should only be done through a user-facing option. Also, letting extractors do this creates inconsistencies, which will get harder and harder to standardize over time.
In the next youtube-dl update, please add the option --unicode-normalization. I can't use yt-dlp; my Python is 2.6.
my python 2.6
Why ?
My virtual server uses CentOS 6; I can't install Python 3.
Yeah, an EOL distro version for years; there seems to be Py >=3.6 in RH-sanctioned 3rd-party repos, though. The host (and you) is playing with fire using a dead version.
The original YT video is no longer available. If someone has a current URL that generates a filename with Unicode look-alike characters, we can demonstrate the result of the above commit using --restrict-filenames.
Presumably bots that search for potentially copyright-infringing material also know about the transformation in use, so the practice of using such characters may wither away.
Since macOS 13.3.1, it is very difficult to open files whose names are encoded in Unicode normal form C (NFC); only normal form D (NFD) is supported. You can create, read, write, etc., files in either encoding just fine with the Unix API (where file names are just bytes), but AppKit now refuses to open such files.
To reproduce the issue:
echo 'Bonjour!' > français.txt
open -a TextEdit français.txt
and TextEdit will just hang. Same for QuickTime with Unicode-titled videos downloaded by youtube-dl.
Ideally, Python would handle this using os.fsencode but this function is implemented as a simple string.encode call.
So I guess the right place is in utils.sanitize_filename, like above. It would be nice if this function could include something like:
if sys.platform == 'darwin':
    s = unicodedata.normalize('NFKD', s)
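For anyone unsure what changes between the forms, here is a small illustration, assuming Python 3. It only shows the standard library's behaviour on two sample names (both made up for the example) and says nothing about what AppKit itself accepts. Note that NFKD, as in the snippet above, also folds compatibility characters, which NFD does not.

```python
import unicodedata

name = 'fran\u00e7ais.txt'  # precomposed c-cedilla, i.e. already NFC
nfc = unicodedata.normalize('NFC', name)
nfd = unicodedata.normalize('NFD', name)

# Same text on screen, but different code point counts and different bytes on disk.
print(len(nfc), len(nfd))
print(nfc.encode('utf-8'))
print(nfd.encode('utf-8'))

# NFD only applies canonical decomposition; NFKD also replaces
# compatibility characters such as the "fi" ligature (U+FB01).
ligature = '\ufb01le.txt'
print(unicodedata.normalize('NFD', ligature) == ligature)
print(unicodedata.normalize('NFKD', ligature))
```

So choosing NFKD for filenames on macOS would change more than the accent encoding; NFD is the narrower transformation if the goal is only to match the decomposed form.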
Are you really saying that Apple has built programs that don't understand U+00E7 (LATIN SMALL LETTER C WITH CEDILLA), which AIUI is the result of NFKC processing for anything that looks like ç?
Isn't that an Apple bug?
Also, NFD_rename.py?:
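import os
import sys
import unicodedata


def main():
    # Path of the file to rename, given as the first command-line argument
    filename = sys.argv[1]
    dirname, base = os.path.split(filename)
    if not base:
        return
    # Convert only the final path component to NFD
    base = unicodedata.normalize('NFD', base)
    nfd_filename = os.path.join(dirname, base)
    if filename != nfd_filename:
        os.rename(filename, nfd_filename)


if __name__ == '__main__':
    main()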
(or NFKD, etc., as required)
Yes and yes. It's the craziest Apple bug I ever saw. And os.rename() from NFC to NFD form works.
Update: I just checked on the bug tracker, and there's an update from 7 hours ago:
macOS Ventura 13.4 Beta 4 Release Notes Fixed a regression in macOS Ventura 13.3 where a security check causes bookmark resolution to fail when the path contains Unicode characters stored with composed normalization. As an example, this prevented files in Finder from opening when double-clicked. (107550080)
Sorry, I should have checked that first before commenting.
Completed in c94a459.
Question
Example: youtube-dl --get-title https://youtu.be/uHbbM4_Y-m8
𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 " 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 " ❗( 𝘿𝙟 𝙋𝙤𝙥𝙤 )
My output file: 𝘿𝙟 𝙊𝙥𝙤 𝙄𝙨𝙚𝙝 𝙊𝙣𝙤 𝙏𝙃𝘼𝙄𝙇𝘼𝙉𝘿 𝙎𝙏𝙔𝙇𝙀 𝙭 𝙎𝙇𝙊𝙒 𝘽𝘼𝙎𝙎 - 𝘿𝙞𝙠𝙚 𝙎𝙖𝙗𝙧𝙞𝙣𝙖 - 𝘿𝙟 𝙋𝙤 - Radio Dangdut 24 Jam.mp3
The MP3 plays fine, but the filename looks like that. How can I convert the title to plain text on Linux (in a bash script)? Thanks.