Open MrS0m30n3 opened 9 years ago
The only problem with respecting --encoding is that you forget about configuration files. --encoding can be provided in the user or system configuration, and handling this case results in rather complicated and clumsy logic.
So, under Python 2, we should look through command_line_conf, user_conf and system_conf, find the highest-priority --encoding and decode byte strings to unicode strings according to it. Fine, but first we have to take into account --ignore-config, which would have to be adapted for the case when the confs are still lists of byte strings. Then, if we respect --encoding, on every Python version we should do _readOptions with this encoding and not with locale.getpreferredencoding(), as open does by default. But to know the encoding we must already have read the configuration file, so obviously we'll need to read it with some basic encoding, e.g. the aforementioned preferred encoding, extract --encoding and reread it with the extracted encoding. And it only gets more fun from there.
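To make the "read, extract, reread" step concrete, here is a minimal Python 3 sketch of it. The helper names (find_encoding, read_options, read_options_two_pass) are made up for illustration and are not youtube-dl functions; the sketch also assumes the option is written as two tokens (--encoding NAME) and that the last occurrence wins.

```python
import codecs
import io
import locale
import shlex


def find_encoding(args):
    """Return the value following the last '--encoding' in args, or None."""
    found = None
    # Skip the final element so that args[i + 1] always exists.
    for i, arg in enumerate(args[:-1]):
        if arg == '--encoding':
            found = args[i + 1]
    return found


def read_options(path, encoding):
    """Read a config file into a flat argument list (hypothetical helper)."""
    with io.open(path, encoding=encoding, errors='replace') as f:
        return shlex.split(f.read(), comments=True)


def read_options_two_pass(path):
    # Pass 1: decode with the locale default, only to locate --encoding.
    bootstrap = locale.getpreferredencoding()
    args = read_options(path, bootstrap)
    declared = find_encoding(args)
    # Pass 2: reread only if the file declares a different codec.
    if declared and codecs.lookup(declared).name != codecs.lookup(bootstrap).name:
        args = read_options(path, declared)
    return args
```

Even this small version shows the awkwardness: the first pass can mangle non-ASCII values, an unknown encoding name raises LookupError, and the priority between several configuration sources is not handled at all.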
Of course, we can revert to decoding with utf-8, but currently configurations are supposed to be in locale.getpreferredencoding(), so that would be kind of inconsistent.
We could define a new command line option that sets the encoding only for decoding command_line_conf, and keep decoding user_conf and system_conf with preferredencoding().
We basically search sys.argv for the new option and, if it is present, extract the given encoding. Then we can use this encoding to decode command_line_conf. If the option is not specified, we can fall back to the preferredencoding() function.
Example:
+    def compat_conf(conf, encoding=preferredencoding()):
         if sys.version_info < (3,):
+            return [a.decode(encoding, 'replace') for a in conf]
         return conf

+    # Try to extract the encoding for the command_line_conf decode process.
+    # Fall back to preferredencoding() explicitly: passing None would override
+    # the default argument and make a.decode() fail.
+    enc = sys.argv[sys.argv.index(str('--new-option')) + 1] if str('--new-option') in sys.argv else preferredencoding()
+    command_line_conf = compat_conf(sys.argv[1:], enc)
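For anyone who wants to try the idea outside of options.py, the following is a standalone sketch of the same lookup with the edge cases handled (option given as the last argument, option absent). It is written against a list of byte strings so it behaves the same on Python 2 and 3; extract_option and decode_args are hypothetical names, and --new-option is only the placeholder used above, not an existing youtube-dl flag.

```python
import locale


def extract_option(argv, name):
    """Return the value following `name` in argv, or None if absent or last."""
    if name in argv:
        i = argv.index(name)
        if i + 1 < len(argv):
            return argv[i + 1]
    return None


def decode_args(argv, name=b'--new-option'):
    """Decode a list of byte-string arguments to text."""
    encoding = extract_option(argv, name)
    if encoding is None:
        # No usable option on the command line: use the locale default.
        encoding = locale.getpreferredencoding()
    elif isinstance(encoding, bytes):
        # Codec names are plain ASCII, so this decode is safe.
        encoding = encoding.decode('ascii')
    return [a.decode(encoding, 'replace') for a in argv]
```

For instance, decode_args([b'--new-option', b'utf-8', b'\xc3\xa9']) decodes the last argument as UTF-8 regardless of the locale.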
Hi @dstftw
In your commit you replaced the utf-8 encoding with the preferredencoding() function. I just wanted to ask whether it is possible to use an existing command line option like --encoding to specify an additional encoding.
For example, if someone wants to call youtube-dl.exe on Windows using the subprocess module (which does not support unicode on Python 2.x), they have to encode the input passed to the subprocess module. With your implementation the user has to use locale.getpreferredencoding() to encode the data, otherwise the decoding on the youtube-dl side will fail. But if the encoding returned by locale.getpreferredencoding() can't represent the input, some of the characters get lost.
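A small sketch of the data loss described above: it simulates the caller encoding an argument and the child process decoding it again with the same codec. cp1252 stands in for a typical Western Windows locale encoding here; that choice, the sample title and the roundtrip helper are assumptions for illustration only.

```python
def roundtrip(text, encoding):
    """Encode as the caller would for subprocess, then decode as the child would."""
    return text.encode(encoding, 'replace').decode(encoding, 'replace')


title = u'\u65e5\u672c\u8a9e video'  # three Japanese characters plus ' video'

# cp1252 cannot represent the Japanese characters, so each becomes '?'.
lossy = roundtrip(title, 'cp1252')

# UTF-8 can represent every character, so nothing is lost.
lossless = roundtrip(title, 'utf-8')

print(lossy == u'??? video')   # True
print(lossless == title)       # True
```

This is why letting the caller pick the encoding on both sides matters: the locale encoding is only safe when it can represent every character in the input.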
I am currently working on https://github.com/rg3/youtube-dl/issues/5527, so it would be helpful if the user had the power to select the encoding for both the encoding and the decoding phase.
We can achieve this behaviour using something like this in options.py