ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Retrieve JSON data in unicode (Encoding UTF-8) #11696

Open linglung opened 7 years ago

linglung commented 7 years ago

I need JSON data containing Unicode (UTF-8) from youtube-dl. Sadly, it does not seem able to output the JSON data for a YouTube video in UTF-8.

I tried printing the JSON info with -j, --dump-json, -J, --dump-single-json and --print-json, and writing it directly to a JSON file with --write-info-json. In every case the strings came out as escaped, non-Unicode data rather than the Unicode text of the original video source.

The parameters used, with and without --encoding utf-8:

youtube-dl --write-info-json --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

youtube-dl -j --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

youtube-dl -J --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

youtube-dl --print-json --encoding utf-8 -f mp4 -o "%(title)s.%(ext)s" https://www.youtube.com/watch?v=0alnhFO1B7Y -v

The log output:

[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--write-info-json', '--encoding', 'utf-8', '-f', 'mp4', '-o', '%(title)s.%(ext)s', 'https://www.youtube.com/watch?v=0alnhFO1B7Y', '-v']
[debug] Encodings: locale cp1252, fs mbcs, out cp1252, pref utf-8
[debug] youtube-dl version 2017.01.10
[debug] Python version 3.4.4 - Windows-10-10.0.14393
[debug] exe versions: ffmpeg N-82966-g6993bb4, ffprobe N-82966-g6993bb4
[debug] Proxy map: {}
[youtube] 0alnhFO1B7Y: Downloading webpage
[youtube] 0alnhFO1B7Y: Downloading video info webpage
[youtube] 0alnhFO1B7Y: Extracting video information
[youtube] 0alnhFO1B7Y: Downloading MPD manifest
[info] Writing video description metadata as JSON to: 香港麥當勞42年歷史大盤點.info.json
[debug] Invoking downloader on 'https://r4---sn-npoeen7k.googlevideo.com/videoplayback?signature=9CF8920347BA9578C4C6C1909BF07083928118A5.C4840C9CB5EE1D0233179DF1EBB58DBF74095DD8&initcwndbps=6973750&mime=video%2Fmp4&key=yt6&ei=yiJ4WKbGGcugoQOQprzACg&upn=-mYp2oMPHqQ&expire=1484289834&dur=105.581&lmt=1484189129530128&clen=8002642&gir=yes&nh=IgpwcjAyLnNpbjExKg03NC4xMjUuNTEuMTcz&ratebypass=yes&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Crequiressl%2Csource%2Cupn%2Cexpire&requiressl=yes&itag=18&source=youtube&id=o-AOL2Ym3gKDUBTYFiGyLZ6ipSYhPAoMG_7kFGBFNI5-ti&pl=18&ms=au&mt=1484268048&mv=m&mm=31&ip=128.199.217.235&mn=sn-npoeen7k&ipbits=0'
[download] Destination: 香港麥當勞42年歷史大盤點.mp4
[download] 100% of 7.63MiB

Below is part of the JSON output. It is only an excerpt, since the full JSON contains a huge amount of string data, but it shows the essence of this issue. For example, the title, tags and description:

"title": "\u9999\u6e2f\u9ea5\u7576\u52de42\u5e74\u6b77\u53f2\u5927\u76e4\u9ede", "url": "https://r4---sn-npoeen7k.googlevideo.com/videoplayback?nh=IgpwcjAyLnNpbjExKg03NC4xMjUuNTEuMTcz&mm=31&mime=video%2Fmp4&pl=18&itag=18&mv=m&mt=1484268354&ms=au&ei=iiN4WN3qBc-XoQOAkrCgAQ&requiressl=yes&gir=yes&ratebypass=yes&mn=sn-npoeen7k&clen=8002642&initcwndbps=6792500&source=youtube&id=o-AGC0tYRBdONrnPr4dLWQi5RZD33w4-n6WvsXmWUoX6-W&lmt=1484189129530128&key=yt6&ip=128.199.217.235&expire=1484290026&dur=105.581&upn=1lnlrUcKnCg&signature=D656C1E342F3EFA9F3C8D5DE181169801D2B52F9.17D6C35911182083504F92FEDE78E6010D62E3B6&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Cratebypass%2Crequiressl%2Csource%2Cupn%2Cexpire&ipbits=0", "categories": ["News & Politics"], "duration": 106, "uploader": "\u860b\u679c\u52d5\u65b0\u805e HK Apple Daily", "uploader_id": "appleactionews", "subtitles": {}, "format": "18 - 640x360 (medium)", "abr": 96, "ext": "mp4", "upload_date": "20170110", "thumbnail": "https://i.ytimg.com/vi/0alnhFO1B7Y/hqdefault.jpg", "formats": [{"height": null, "format_note": "DASH audio", "tbr": 57, "fps": null, "vcodec": "none", "url": 
 "description": "\u3010\u672c\u5831\u8a0a\u3011\u9ea5\u7576\u52de\u9003\u4e0d\u904e\u67d3\u7d05\u547d\u904b\uff0c\u6e2f\u4eba\u559c\u6b61\u53eb\u9ea5\u7576\u52de\u505a\u300c\u8001\u9ea5\u300d\u3001\u300c\u9ea5\u8a18\u300d\uff0c\u5168\u56e0\u9ea5\u7576\u52de\u5df2\u966a\u4f34\u6e2f\u4eba\u903e42\u500b\u5e74\u982d\uff0c\u9ea5\u7576\u52de\u53d4\u53d4\u3001\u958b\u751f\u65e5\u6703\u7b49\u96c6\u9ad4\u56de\u61b6\u6df1\u5165\u6c11\u5fc3\uff0c\u9ea5\u7576\u52de\u66fe\u63a8\u63db\u8cfc\u53f2\u8afe\u6bd4\u516c\u4ed4\u6380\u5168\u57ce\u6392\u968a\u71b1\u6f6e\uff0c\u4ea6\u5920\u7d93\u5178\u3002\n\n\u860b\u679c\u65e5\u5831\uff1ahttp://hk.apple.nextmedia.com\n\u5373like\u860b\u679cfb\uff1ahttp://www.facebook.com/hk.nextmedia\niPhone App\uff1ahttp://bit.ly/AppleDailyApp-iPhone\nAndroid App\uff1ahttp://bit.ly/AppleDailyApp-Android", "http_headers": {"Accept-Language": "en-us,en;q=0.5", "Accept-Encoding": "gzip, deflate", "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20150101 Firefox/47.0 (Chrome)"}, "start_time": null, "player_url": null, "playlist_index": null, "like_count": 490, "protocol": "https", "format_id": "18"}
{"tags": ["\u860b\u679c\u52d5\u65b0\u805e", "\u860b\u679c\u65e5\u5831", "appledaily", "Apple Daily(Newspaper)", "Hong Kong", "news", "\u52d5\u65b0\u805e", "\u65b0\u805e", "hk", "\u9999\u6e2f"], "uploader_url": "http://www.youtube.com/user/appleactionews", "license": "Standard YouTube License", "age_limit": 0, "resolution": "640x360", "id": "0alnhFO1B7Y",
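For reference, the `\uXXXX` sequences in the excerpt above are standard JSON escapes rather than corrupted text, so any JSON parser turns them back into the original characters. A minimal Python sketch, using the title string from the excerpt (the variable names are only illustrative):

```python
import json

# Title exactly as it appears in the escaped JSON output above,
# wrapped in quotes so that it is a complete JSON string value.
escaped = '"\\u9999\\u6e2f\\u9ea5\\u7576\\u52de42\\u5e74\\u6b77\\u53f2\\u5927\\u76e4\\u9ede"'

# json.loads decodes the \uXXXX escapes into real Unicode characters.
title = json.loads(escaped)

# The decoded title is the same text as the file name in the download log.
matches_log = (title == "香港麥當勞42年歷史大盤點")
```

So the data is intact either way; the question in this issue is only how the JSON text itself is written out.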
yan12125 commented 7 years ago

Well, this is the second time people have asked for unescaped strings (#10927). It might be worth an option.

Here's a quick hack:

diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py
index 5d654f55f..d7374e820 100755
--- a/youtube_dl/YoutubeDL.py
+++ b/youtube_dl/YoutubeDL.py
@@ -1535,7 +1535,7 @@ class YoutubeDL(object):
         if self.params.get('forceformat', False):
             self.to_stdout(info_dict['format'])
         if self.params.get('forcejson', False):
-            self.to_stdout(json.dumps(info_dict))
+            self.to_stdout(json.dumps(info_dict, ensure_ascii=False))

         # Do nothing else if in simulate mode
         if self.params.get('simulate', False):
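To illustrate what the one-line change above does, here is a small standalone sketch of `json.dumps` with and without `ensure_ascii=False`. The dictionary is a made-up stand-in for `info_dict`, reusing the uploader and duration values from the excerpt earlier in the thread:

```python
import json

# Stand-in for info_dict (assumption: only two fields, for brevity).
info = {"uploader": "蘋果動新聞 HK Apple Daily", "duration": 106}

# Default behaviour: ensure_ascii=True, non-ASCII becomes \uXXXX escapes.
escaped = json.dumps(info)

# With ensure_ascii=False the characters are kept as they are.
raw = json.dumps(info, ensure_ascii=False)

# Both strings parse back to the same dictionary.
same = json.loads(escaped) == json.loads(raw) == info
```

The two outputs are equivalent JSON; only the textual representation differs, which is why the change affects readability and not the data.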
linglung commented 7 years ago

Using the Git shell, I got this:

diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py
diff: unknown option -- git
diff: Try 'diff --help' for more information.

I tried to apply it manually: I edited YoutubeDL.py from the master zip and added your change, self.to_stdout(json.dumps(info_dict, ensure_ascii=False)), at line 1540. Then I ran it in developer mode to test it: python -m youtube_dl --write-info-json https://www.youtube.com/watch?v=of0B-ZvxYI4.

Same result. 😢

yan12125 commented 7 years ago

Well, --write-info-json uses a different function.

diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py
index 12863e74a..6ded34832 100644
--- a/youtube_dl/utils.py
+++ b/youtube_dl/utils.py
@@ -231,7 +231,7 @@ def write_json_file(obj, fn):

     try:
         with tf:
-            json.dump(obj, tf)
+            json.dump(obj, tf, ensure_ascii=False)
         if sys.platform == 'win32':
             # Need to remove existing file on Windows, else os.rename raises
             # WindowsError or FileExistsError.

On Linux/Mac/... you can use patch to apply the change. On Windows, I'm afraid you'll need to change those files by hand.
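One thing to keep in mind when writing unescaped JSON to a file is that the file object has to be able to encode the characters. The sketch below is not youtube-dl's `write_json_file`; `write_json_utf8` is a hypothetical helper that simply forces UTF-8 instead of relying on the locale encoding (cp1252 in the logs above):

```python
import json
import os
import tempfile


def write_json_utf8(obj, path):
    """Hypothetical helper: write obj as unescaped JSON, always in UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f, ensure_ascii=False)


# Write a tiny example file into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.info.json")
write_json_utf8({"title": "香港麥當勞42年歷史大盤點"}, path)

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, "rb") as f:
    data = f.read()
```

Here `data` holds the UTF-8 bytes of the title directly, with no `\u` escapes in the file.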

linglung commented 7 years ago

Great! It works as expected.

"title": "【激震】松本伊代(51)が逮捕の可能性…(画像あり)", "alt_title": null, "thumbnail": "https://i.ytimg.com/vi/of0B-ZvxYI4/hqdefault.jpg", 
"description": "これはいかんやろ\n\n【おすすめサイト】\nびっくり映像まとめ\nhttp://lifestylemovie305.club/\n癒し系感動画像まとめ\nhttp://lifestyle305.link/\n\n引用元\nまとめもりー\n\n関連動画\n【警察がガラスを割って逃走車を逮捕の大暴れの瞬間\nhttps://youtu.be/FRc_PDxdaKk\n\n【親友】草なぎ剛の逮捕後あいつだけが連絡をくれたんだ【芸能ゴシップch】\nhttps://youtu.be/F7u-eeVqvNo\n\n【逮捕】ヤマト運輸チェーンソー襲撃事件\nhttps://youtu.be/Kr4k1RXmBXk", "categories": ["Entertainment"], "tags": ["松本伊代", "逮捕", "鉄ヲタ", "侵入", "芸能ゴシップチャンネル"], "subtitles": {}, "automatic_captions": {}, "duration": 44, "age_limit": 0, "annotations": null, 
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--write-info-json', 'https://www.youtube.com/watch?v=of0B-ZvxYI4', '-v']
[debug] Encodings: locale cp1252, fs utf-8, out cp1252, pref cp1252
[debug] youtube-dl version 2017.01.10
[debug] Git HEAD: 250a6a6
[debug] Python version 3.6.0 - Windows-10-10.0.14393-SP0
[debug] exe versions: ffmpeg 2.8.4, ffprobe N-82966-g6993bb4
[debug] Proxy map: {}
[youtube] of0B-ZvxYI4: Downloading webpage
[youtube] of0B-ZvxYI4: Downloading video info webpage
[youtube] of0B-ZvxYI4: Extracting video information
[youtube] of0B-ZvxYI4: Downloading MPD manifest
[info] Writing video description metadata as JSON to: 51▒-of0B-ZvxYI4.info.json
WARNING: Requested formats are incompatible for merge and will be merged into mkv.
[debug] Invoking downloader on 'https://r1---sn-npoeene7.googlevideo.com/videoplayback/id/a1fd01f99bf1608e/itag/137/source/youtube/requiressl/yes/pl/20/ms/au/mv/m/mm/31/mn/sn-npoeene7/nh/IgpwcjAyLnNpbjExKgkxMjcuMC4wLjE/initcwndbps/5181250/ratebypass/yes/mime/video%2Fmp4/otfp/1/gir/yes/clen/15804514/lmt/1484537241771041/dur/44.010/mt/1484587873/signature/51F5F5775AFC186891468FEA3189DE2C4363AEC0.73349929948C64E626C984C19B4450A69ADFBC48/key/dg_yt0/upn/TvBQw5qcbLw/ip/128.199.120.49/ipbits/0/expire/1484609801/sparams/ip,ipbits,expire,id,itag,source,requiressl,pl,ms,mv,mm,mn,nh,initcwndbps,ratebypass,mime,otfp,gir,clen,lmt,dur/'
[dashsegments] Total fragments: 10
[download] Destination: 51▒-of0B-ZvxYI4.f137.mp4
[download] 100% of 15.07MiB in 00:10
[debug] Invoking downloader on 'https://r1---sn-npoeene7.googlevideo.com/videoplayback?keepalive=yes&ei=qAR9WLyQCqWWoAOU2LGQAg&lmt=1484537811953574&sparams=clen%2Cdur%2Cei%2Cgir%2Cid%2Cinitcwndbps%2Cip%2Cipbits%2Citag%2Ckeepalive%2Clmt%2Cmime%2Cmm%2Cmn%2Cms%2Cmv%2Cnh%2Cpl%2Crequiressl%2Csource%2Cupn%2Cexpire&gir=yes&nh=IgpwcjAyLnNpbjExKgkxMjcuMC4wLjE&signature=E0FCAFA6A26E36BBAF079871A1245E52D44F38BA.39EE2958290D2587D1F9133C72A6DD80542DCE0F&dur=44.021&initcwndbps=5181250&itag=251&clen=721390&ipbits=0&key=yt6&upn=XK0SksNiZ_k&expire=1484609800&mv=m&mt=1484587873&ms=au&id=o-AAGPZjeL-9r4CcQxIfhSH50qx54cLzbhhisXP7f74bbJ&mn=sn-npoeene7&pl=20&source=youtube&mm=31&ip=128.199.120.49&mime=audio%2Fwebm&requiressl=yes&ratebypass=yes'
[download] Destination: 51▒-of0B-ZvxYI4.f251.webm
[download] 100% of 704.48KiB in 00:01
[ffmpeg] Merging formats into "51▒-of0B-ZvxYI4.mkv"
[debug] ffmpeg command line: ffmpeg -y -i 'file:51▒-of0B-ZvxYI4.f137.mp4' -i 'file:51▒-of0B-ZvxYI4.f251.webm' -c copy -map 0:v:0 -map 1:a:0 'file:51▒-of0B-ZvxYI4.temp.mkv'
Deleting original file 51▒-of0B-ZvxYI4.f137.mp4 (pass -k to keep)
Deleting original file 51▒-of0B-ZvxYI4.f251.webm (pass -k to keep)
linglung commented 7 years ago

Sadly, when I used your first approach with JSON dumping via -j or -J (without writing a JSON file), it didn't work. FYI, I first restored the original utils.py and then changed the lines in YoutubeDL.py as in your first approach.

The log:

python -m youtube_dl -j https://www.youtube.com/watch?v=of0B-ZvxYI4 -v
Traceback (most recent call last):
  File "C:\Users\Google\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Users\Google\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "C:\Users\Google\AppData\Local\Programs\Python\Python36-32\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "C:\Users\Google\Documents\GitHub\ytdl\youtube_dl\__init__.py", line 45, in <module>
    from .YoutubeDL import YoutubeDL
  File "C:\Users\Google\Documents\GitHub\ytdl\youtube_dl\YoutubeDL.py", line 1540
    self.to_stdout(json.dumps(info_dict, ensure_ascii=False))
                                                            ^
TabError: inconsistent use of tabs and spaces in indentation
yan12125 commented 7 years ago

Most likely there are tabs - replace them all with spaces.
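If it is not obvious where the tabs are, a few lines of Python can point at them. This is only a sketch: `lines_with_tab_indent` is a made-up helper, and the 4-column tab width passed to `expandtabs` is an assumption based on the indentation visible in the diffs above.

```python
def lines_with_tab_indent(source):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        indent = line[:len(line) - len(line.lstrip())]
        if "\t" in indent:
            hits.append(number)
    return hits


# Small sample with one tab-indented line (line 3).
sample = "if x:\n    a = 1\n\tb = 2\n"
bad = lines_with_tab_indent(sample)

# str.expandtabs replaces every tab, including any inside string literals,
# so review the result before saving it over a real source file.
fixed = sample.expandtabs(4)
still_bad = lines_with_tab_indent(fixed)
```

For a real file, read its text first and pass it to the helper; most editors also have a "convert tabs to spaces" command that does the same job.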

linglung commented 7 years ago

@yan12125 Perfect. It's fixed now. Thank you so much 😄

python -m youtube_dl -j --encoding utf-8 https://www.youtube.com/watch?v=of0B-ZvxYI4 -v
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['-j', '--encoding', 'utf-8', 'https://www.youtube.com/watch?v=of0B-ZvxYI4', '-v']
[debug] Encodings: locale cp1252, fs utf-8, out cp1252, pref utf-8
[debug] youtube-dl version 2017.01.10
[debug] Git HEAD: 250a6a6
[debug] Python version 3.6.0 - Windows-10-10.0.14393-SP0
[debug] exe versions: ffmpeg 2.8.4, ffprobe N-82966-g6993bb4
[debug] Proxy map: {}
{"id": "of0B-ZvxYI4", "uploader": "芸能 ゴシップ チャンネル", "uploader_id": "UC0OUfSvMHCpn2sukdhH-5kw", "uploader_url": "http://www.youtube.com/channel/UC0OUfSvMHCpn2sukdhH-5kw", "upload_date": "20170115", "license": "Standard YouTube License", "creator": null, "title": "【激震】松本伊代(51)が逮捕の可能性…(画像あり)", "alt_title": null, "thumbnail": "https://i.ytimg.com/vi/of0B-ZvxYI4/hqdefault.jpg", "description": "これはいかんやろ\n\n【おすすめサイト】\nびっくり映像まとめ\nhttp://lifestylemovie305.club/\n癒し系感動画像まとめ\nhttp://lifestyle305.link/\n\n引用元\nまとめもりー\n\n関連動画\n【警察がガラスを割って逃走車を逮捕の大暴れの瞬間\nhttps://youtu.be/FRc_PDxdaKk\n\n【親友】草なぎ剛の逮捕後あいつだけが連絡をくれたんだ【芸能ゴシップch】\nhttps://youtu.be/F7u-eeVqvNo\n\n【逮捕】ヤマト運輸チェーンソー襲撃事件\nhttps://youtu.be/Kr4k1RXmBXk", "categories": ["Entertainment"], "tags": ["松本伊代", "逮捕", "鉄ヲタ", "侵入", "芸能ゴシップチャンネル"], "subtitles": {}, "automatic_captions": {}, "duration": 44, "age_limit": 0, "annotations": null, "webpage_url": "https://www.youtube.com/watch?v=of0B-ZvxYI4", "view_count": 285206, "like_count": 62, "dislike_count": 523, "average_rating": 1.42393159866, "formats":
one2gov commented 7 years ago

self.to_stdout(json.dumps(info_dict, ensure_ascii=False)) makes -j work, but json.dump(obj, tf, ensure_ascii=False) doesn't make a difference for --write-info-json:

youtube-dl --encoding utf-8 --write-info-json https://www.youtube.com/watch?v=VA0rAN0GRY4
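If patching utils.py has no effect in a given setup, one workaround that leaves youtube-dl untouched is to rewrite the generated .info.json afterwards. This was not proposed in the thread; it is a sketch with a hypothetical helper, `unescape_json_file`, and it assumes the file is valid JSON that fits in memory:

```python
import json
import os
import tempfile


def unescape_json_file(path):
    """Hypothetical helper: rewrite a JSON file in place without \\uXXXX escapes."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)


# Demonstration on a temporary file standing in for a real .info.json.
path = os.path.join(tempfile.mkdtemp(), "demo.info.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump({"tags": ["香港", "news"]}, f)  # escaped, like the default output

unescape_json_file(path)

with open(path, "r", encoding="utf-8") as f:
    text = f.read()
```

After the call, `text` contains the tags as readable characters, and parsing it gives the same data as before.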

linglung commented 7 years ago

Why isn't this applied as the default setting in every released version of youtube-dl?

AraHaan commented 7 years ago

Actually, @yan12125, you can apply the patch on Windows if you use Git for Windows (Git Bash). Well, at least I can. To do it on Windows, I'm afraid you have to write the diff to a file, [filename].patch, and then run git apply [filename].patch.

yan12125 commented 7 years ago

To @linglung: It may sound silly, but not all environments support raw (unescaped) UTF-8. youtube-dl aims to stay compatible with most systems, so this can't be the default.

AraHaan commented 7 years ago

Hmm, in this case you could use sys.platform and decide based on its value, @yan12125. That is how I decide to use the system opus / ffmpeg on Linux but not on Windows in one of my projects.

yan12125 commented 7 years ago

Running on Linux does not imply full UTF-8 support. If one uses LC_ALL=C or LC_ALL=POSIX, UTF-8 strings can break the console. Such a setting is common in containers like Docker. (http://bugs.python.org/issue28180) On the other hand, since Python 3.6, UTF-8 support seems quite fine on Windows. (PEP 528, PEP 529) The logic for determining UTF-8 support can be rather complicated.
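One possible way to sidestep platform guessing is to ask whether the target encoding can actually represent the output, and fall back to escapes when it cannot. This is only a sketch of that idea, not youtube-dl code; `can_encode` and `dumps_for_stream` are hypothetical names, and treating a missing encoding as ASCII is an assumption:

```python
import json
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec name."""
    try:
        text.encode(encoding or "ascii")
        return True
    except (UnicodeEncodeError, LookupError):
        return False


def dumps_for_stream(obj, encoding):
    """Hypothetical helper: prefer unescaped JSON, fall back to \\uXXXX escapes."""
    raw = json.dumps(obj, ensure_ascii=False)
    return raw if can_encode(raw, encoding) else json.dumps(obj)


info = {"title": "香港"}
utf8_out = dumps_for_stream(info, "utf-8")     # unescaped
cp1252_out = dumps_for_stream(info, "cp1252")  # falls back to escapes

# In real use the encoding would come from the output stream itself.
current = dumps_for_stream(info, getattr(sys.stdout, "encoding", None))
```

A check like this still says nothing about whether the console font or terminal can display the characters, so it only covers the encoding side of the problem.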

AraHaan commented 7 years ago

Which is why you could do it like this in both diffs:

diff --git a/youtube_dl/YoutubeDL.py b/youtube_dl/YoutubeDL.py
index 5d654f55f..d7374e820 100755
--- a/youtube_dl/YoutubeDL.py
+++ b/youtube_dl/YoutubeDL.py
@@ -1535,7 +1535,10 @@ class YoutubeDL(object):
         if self.params.get('forceformat', False):
             self.to_stdout(info_dict['format'])
         if self.params.get('forcejson', False):
-            self.to_stdout(json.dumps(info_dict))
+            if sys.platform == 'win32':
+                self.to_stdout(json.dumps(info_dict, ensure_ascii=False))
+            else:
+                self.to_stdout(json.dumps(info_dict))

         # Do nothing else if in simulate mode
         if self.params.get('simulate', False):

diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py
index 12863e74a..6ded34832 100644
--- a/youtube_dl/utils.py
+++ b/youtube_dl/utils.py
@@ -231,7 +231,10 @@ def write_json_file(obj, fn):

     try:
         with tf:
-            json.dump(obj, tf)
+            if sys.platform == 'win32':
+                json.dump(obj, tf, ensure_ascii=False)
+            else:
+                json.dump(obj, tf)
         if sys.platform == 'win32':
             # Need to remove existing file on Windows, else os.rename raises
             # WindowsError or FileExistsError.
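The duplicated if/else branches in the two diffs above can be collapsed into a single call by computing the `ensure_ascii` argument. The sketch below keeps the same platform-based behaviour as the diffs (so the concern about sys.platform not indicating UTF-8 support still applies); `dumps_platform_aware` is a hypothetical name, and the `platform` parameter exists only so the behaviour can be tried on any system:

```python
import json
import sys


def dumps_platform_aware(obj, platform=sys.platform):
    """Hypothetical helper mirroring the diffs above: unescaped on win32 only."""
    return json.dumps(obj, ensure_ascii=(platform != "win32"))


# Same input, two platform values.
win = dumps_platform_aware({"t": "香港"}, platform="win32")
other = dumps_platform_aware({"t": "香港"}, platform="linux")
```

With this shape, switching from a platform check to an option or an encoding check later only means changing the expression passed as `ensure_ascii`.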