ziglang / zig

General-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
https://ziglang.org
MIT License
33.44k stars 2.44k forks source link

ArgIteratorWindows uses non-Unicode GetCommandLineA and a custom parser #2222

Closed ljmccarthy closed 3 weeks ago

ljmccarthy commented 5 years ago

std.os.ArgIteratorWindows currently uses GetCommandLineA which doesn't support Unicode and implements it's own command line parser.

I'm sure it's a good parser, but there is already a Windows function to implement this called CommandLineToArgvW which can be used with the result of GetCommandLineW. I would suggest that using this instead would make sure that Zig doesn't parse command-lines differently to regular C/C++ programs on Windows and would reduce code size.

I have implemented a proof-of-concept using these Win32 functions, returning an array of UTF-8 strings:

https://gist.github.com/ljmccarthy/4f372f060d2d2e11b594ad9188fd4720

I'm also not a big for of the ArgIterator interface, which is rather painful to use, but I will create a separate issue for that :-)

ljmccarthy commented 5 years ago

Here's ArgIteratorWindows re-written using CommandLineToArgvW:

https://gist.github.com/ljmccarthy/ef3e4e828a940812ae9d23f4d3713053

Note I've had to add a deinit() function.

ljmccarthy commented 5 years ago

Maybe this isn't the best approach. UCRT also has it's own parser, and it's not clear if it works exactly like CommandLineToArgvW. It seems this is a bit of a mess on Windows with no real standard, so perhaps the existing parser is fine. If we are linking with UCRT we could use it's parsed argv exported by the argc and argv/__wargv symbols, but it would still require a fallback for when we don't want to link to UCRT.

andrewrk commented 5 years ago

See also #534

shawnl commented 5 years ago

I'm also not a big for of the ArgIterator interface, which is rather painful to use, but I will create a separate issue for that :-)

I agree. It also consumes bytes in every binary. I did some work on it https://github.com/shawnl/zig/tree/startup

My real plan though is to have a startup function separate from main that does argument and environment parsing, and then it will return to the startup code, which will remove the environment, and then call main. If you have a Linux box type env in a terminal. every process you spawn in that terminal has that much wasted space. (plus some more, it is somewhat complicated) By clearing this we also enforce separation of code that deals with user-supplied data.

cartr commented 4 years ago

UCRT also has it's own parser, and it's not clear if it works exactly like CommandLineToArgvW.

The rules for CommandLineToArgvW are documented here, and the rules for the C main function's argv are documented here, and they look pretty similar. I investigated further by writing tests in Zig to compare ArgIteratorWindows and CommandLineToArgvW, as well as a C++ program to compare CommandLineToArgvW and C's argv.

It looks like CommandLineToArgvW has one main difference from argv. The docs say that "This function accepts command lines that contain a program name; the program name can be enclosed in quotation marks or not." But if you invoke the program with quotation marks in the middle of the program name, the argument list is parsed incorrectly:

Quote marks behave completely as expected if used anywhere else. It's just the program name (~=argv[0]) that's parsed oddly like this.

I'm not sure if this means Zig should avoid CommandLineToArgvW (since it doesn't work in the program-invoked-with-quotation-marks-in-the-middle-of-the-name case) or if it means Zig should start using CommandLineToArgvW (since the tests suggest it works correctly otherwise).

squeek502 commented 3 weeks ago

This was addressed by https://github.com/ziglang/zig/pull/18309 and https://github.com/ziglang/zig/pull/19655