ytdl-org / youtube-dl

Command-line program to download videos from YouTube.com and other video sites
http://ytdl-org.github.io/youtube-dl/
The Unlicense

Way to specify additional encoding for options decoding #6119

Open MrS0m30n3 opened 9 years ago

MrS0m30n3 commented 9 years ago

Hi @dstftw

In your commit you replaced the utf-8 encoding with the preferredencoding() function. I just wanted to ask whether it is possible to use an existing command line option like --encoding to specify an additional encoding.

For example, if someone wants to call youtube-dl.exe on Windows using the subprocess module (which does not support unicode on Python 2.x), they have to encode the input before passing it to subprocess. With your implementation the caller has to encode the data with locale.getpreferredencoding(), otherwise the decoding on the youtube-dl side will fail. But if the encoding returned by locale.getpreferredencoding() can't represent the input, some of the characters are lost.
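To make the data loss concrete, here is a minimal sketch of the round trip. It assumes, purely for illustration, that the preferred encoding on the caller's machine is cp1252; the title string is made up.

```python
# -*- coding: utf-8 -*-
# Illustration only: cp1252 stands in for whatever
# locale.getpreferredencoding() returns on the caller's machine.
ASSUMED_PREFERRED = 'cp1252'

title = u'\u65e5\u672c\u8a9e video'  # three Japanese characters + ' video'

# What the caller has to hand to subprocess on Python 2.
encoded = title.encode(ASSUMED_PREFERRED, 'replace')

# What youtube-dl recovers when it decodes with the same encoding.
decoded = encoded.decode(ASSUMED_PREFERRED, 'replace')

print(repr(decoded))  # the non-Latin characters have become '?'
```

The decode step itself succeeds; the characters are already gone by the time the bytes reach youtube-dl.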

I am currently working on https://github.com/rg3/youtube-dl/issues/5527 so it would be helpful if the user had the power to select the encoding both for the encoding and the decoding phase.

We can achieve this behaviour with something like this in options.py:

def compat_conf(conf):
    if sys.version_info < (3,):
+      # Use the value following --encoding if present, else the locale default
+      enc = conf[conf.index(str('--encoding')) + 1] if str('--encoding') in conf else preferredencoding()
+      return [a.decode(enc, 'replace') for a in conf]
    return conf

dstftw commented 9 years ago

The only problem with respecting --encoding is that you forget about configuration files. --encoding can also be provided in the user or system configuration, and handling this case results in rather complicated and clumsy logic. So, under Python 2, we would have to look through command_line_conf, user_conf and system_conf, find the highest-priority --encoding and decode the byte strings to unicode strings according to it. Fine, but before that we would have to take --ignore-config into account, and it would need to be adapted for the case where the confs are still lists of byte strings. Then, if we respect --encoding, on every Python version we should make _readOptions read with this encoding and not with locale.getpreferredencoding(), which is what open uses by default. But to know the encoding we must already have read the configuration file, so obviously we would need to read it with some basic encoding first (e.g. the aforementioned preferredencoding), extract --encoding, and reread the file with the extracted encoding. And it only gets more fun from there. Of course, we could revert to decoding with utf-8, but configurations are currently supposed to be in locale.getpreferredencoding(), so that would be somewhat inconsistent.
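The "read, extract --encoding, reread" step described above could look roughly like the sketch below. The helper names find_encoding and read_options_two_pass are hypothetical and are not part of youtube-dl; the sketch also assumes the configuration is tokenized with shlex, and it ignores --ignore-config and the priority ordering between the three confs.

```python
import locale
import shlex


def find_encoding(args):
    """Return the value following '--encoding', or None if absent."""
    if '--encoding' in args:
        idx = args.index('--encoding')
        if idx + 1 < len(args):
            return args[idx + 1]
    return None


def read_options_two_pass(raw, basic_encoding=None):
    """Decode raw config bytes, honouring an --encoding found inside them.

    raw is the byte content of a configuration file (hypothetical input).
    """
    if basic_encoding is None:
        basic_encoding = locale.getpreferredencoding()

    # Pass 1: decode with the basic encoding just to find --encoding.
    first = shlex.split(raw.decode(basic_encoding, 'replace'), comments=True)
    declared = find_encoding(first)
    if declared is None or declared.lower() == basic_encoding.lower():
        return first

    # Pass 2: reread everything with the declared encoding.
    return shlex.split(raw.decode(declared, 'replace'), comments=True)
```

The first pass only works when the option name and its value survive decoding with the basic encoding, which holds for ASCII-compatible encodings but not in general. That is part of why the logic gets clumsy.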

MrS0m30n3 commented 9 years ago

We could define a new command line option that sets the encoding only for decoding command_line_conf, and keep decoding user_conf and system_conf with preferredencoding().

We basically search sys.argv for the new option and, if it is present, extract the given encoding. Then we use this encoding to decode command_line_conf. If the option is not specified, we fall back to the preferredencoding() function.

Example:

+ def compat_conf(conf, encoding=preferredencoding()):
    if sys.version_info < (3,):
+      return [a.decode(encoding, 'replace') for a in conf]
    return conf

+ # Try to extract the encoding for the command_line_conf decode process;
+ # fall back to preferredencoding() (passing None to decode() would raise a TypeError)
+ enc = sys.argv[sys.argv.index(str('--new-option')) + 1] if str('--new-option') in sys.argv else preferredencoding()
+ command_line_conf = compat_conf(sys.argv[1:], enc)
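
A slightly more defensive sketch of the same idea follows. The option name --new-option is still just a placeholder, and extract_cli_encoding and decode_command_line are hypothetical helpers, not existing youtube-dl functions. It also accepts the --new-option=value form and falls back to the default when the option is the last argument, a case where the one-liner above raises an IndexError.

```python
import locale


def extract_cli_encoding(argv, option=b'--new-option', default=None):
    """Find the encoding named on a byte-string command line."""
    if default is None:
        default = locale.getpreferredencoding()
    for i, arg in enumerate(argv):
        # Form: --new-option VALUE
        if arg == option and i + 1 < len(argv):
            return argv[i + 1].decode('ascii', 'replace')
        # Form: --new-option=VALUE
        if arg.startswith(option + b'='):
            return arg[len(option) + 1:].decode('ascii', 'replace')
    return default


def decode_command_line(argv, option=b'--new-option'):
    """Decode every byte-string argument with the extracted encoding."""
    enc = extract_cli_encoding(argv, option)
    return [a.decode(enc, 'replace') for a in argv]
```

An encoding name that Python does not recognise would still raise a LookupError in decode_command_line; whether to report that or silently fall back is a design choice left open here.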