yargs / yargs-parser

:muscle: the mighty option parser used by yargs
http://yargs.js.org/

Hopefully ending any and all discussion about how to handle quotes.... #300

Open Mike111177 opened 4 years ago

Mike111177 commented 4 years ago

So whether or not to remove quotes, and where, seems to have been plaguing this repo with endless issues and PRs. Sorry to wake the dead here, but see #145, #82, #93, #140, #138, #153, #180, #201, #222, and probably more I didn't find.

Overall, the goal of this issue, the discussion, and a potential future PR is to make typing a command into your favorite shell or into your online bot interface more consistent, and to make quote handling easier to test and more predictable.

I have two examples that demonstrate well what I think the main issues currently are.

I loaded this test.js into the node REPL before both of these examples to set up the command line and tokenizers.

//To grab arguments as passed by a shell if we receive args
if (process.argv.length > 2){
  console.log(JSON.stringify(process.argv.slice(2)))
  process.exit()
}

const fs = require('fs');
const {execSync} = require('child_process')
const util = require('util')
const parse = require('yargs-parser')

//Tokenizers
//extTokenizer builds a tokenizer that round-trips a command string through a
//real shell: it writes a small script that re-invokes this file, then reads
//back the JSON-encoded argv the shell produced.
function extTokenizer(exe, ext, prefix=""){
  return command=>{
    const tkfile = `tokenize.${ext}`
    fs.writeFileSync(tkfile, `${prefix}node test.js ${command}`)
    const result = execSync(`${exe}${tkfile}`)
    return JSON.parse(result)
  }
}
const basicTokenizer = str=>str.split(" ") //Simple whitespace split
const yargsTokenizer = require('./tokenize-arg-string')
const cmdTokenizer = extTokenizer('cmd /C ', 'bat', "@echo off\n")
const powershellTokenizer = extTokenizer('powershell -ExecutionPolicy Bypass -File ', 'ps1')
const bashTokenizer = extTokenizer('bash ', 'sh', "/home/mike/.nvm/versions/node/v14.8.0/bin/") //Tested with WSL

const argfile = "command.txt"
const commandline = fs.readFileSync(argfile).toString()

I also created a gist containing a quick and dirty tool for trying this out for yourself.
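
For reference, the sessions below come from driving that harness in a REPL, roughly like this (assuming test.js, tokenize-arg-string.js, and command.txt all sit in the working directory):

$ node
> .load test.js
> commandline
> yargsTokenizer(commandline)
> bashTokenizer(commandline) //needs bash on PATH (I used WSL)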

1. Inconsistent removal of quotes depending on context of token

After tokenization, the argument parser seems to pick and choose which arguments ought to be stripped of quotes.

> commandline
`"positional" -a 'flag'`
> tokens = yargsTokenizer(commandline)
[ '"positional"', '-a', "'flag'" ]
> parse(tokens)
{ _: [ '"positional"' ], a: 'flag' }

Notice how the flag argument gets its quotes removed, but the positional argument does not. This inconsistency is confusing.
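
As a stopgap, a workaround sketch (my own, not an official yargs-parser API) is to normalize the tokens yourself before parsing, so positionals and flag values are treated alike:

//Workaround sketch: strip one layer of matching surrounding quotes from
//every token before yargs-parser sees it.
const stripMatchingQuotes = token => {
  const m = /^(['"])([\s\S]*)\1$/.exec(token)
  return m ? m[2] : token
}

parse(tokens.map(stripMatchingQuotes))
// -> { _: [ 'positional' ], a: 'flag' }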

Now you may ask, "Then how do I enter a command like `domything "to this other thing"` from a string?" Well, that brings me to problem 2.

2. Tokenizer inconsistency with shell interfaces.

If tokenizeArgString is supposed to be the middleware that converts your text string into the argument (token) array when the shell you are in hasn't done it for you, then arguably its output should emulate the results you would get by reading process.argv. Since process.argv is shaped by the various shells, let's see how tokenizeArgString stacks up against the most popular ones (that I could easily test).

> commandline
`'single quote' "double quote" \\"escaped quote\\"`
> yargsTokenizer(commandline)
[ "'single quote'", '"double quote"', '\\"escaped quote\\"' ]
> cmdTokenizer(commandline)
[ "'single", "quote'", 'double quote', '"escaped', 'quote"' ]
> powershellTokenizer(commandline)
[ 'single quote', 'double quote', '\\escaped quote"' ]
> bashTokenizer(commandline)
[ 'single quote', 'double quote', '"escaped', 'quote"' ]

There are a few observations to be made here:

Which one of these methods is correct? None of them; they are all different flavors of the same end goal. However, people are used to them and generally expect similar behavior (whether or not they get it), and, more importantly, these behaviors are testable.
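
To make "emulating a shell" concrete, here is an illustrative sketch (my own, not yargs-parser code) of the core loop a bash-flavored tokenizer would need. Real bash has many more rules (expansions, $'...', line continuations, and it ignores unterminated-quote errors) that this deliberately skips:

//Illustrative only: a tiny bash-flavored tokenizer.
function bashishTokenize(str) {
  const tokens = []
  let current = null //null means "not currently inside a token"
  let i = 0
  while (i < str.length) {
    const c = str[i++]
    if (/\s/.test(c)) { //unquoted whitespace ends the current token
      if (current !== null) tokens.push(current)
      current = null
    } else if (c === "'") { //single quotes: everything literal until the next '
      const end = str.indexOf("'", i)
      current = (current ?? '') + str.slice(i, end)
      i = end + 1
    } else if (c === '"') { //double quotes: literal except for \" escapes
      let out = ''
      while (i < str.length && str[i] !== '"') {
        if (str[i] === '\\' && str[i + 1] === '"') { out += '"'; i += 2 }
        else { out += str[i]; i += 1 }
      }
      i++ //skip the closing "
      current = (current ?? '') + out
    } else if (c === '\\') { //backslash escapes the next character
      current = (current ?? '') + str[i++]
    } else {
      current = (current ?? '') + c
    }
  }
  if (current !== null) tokens.push(current)
  return tokens
}

bashishTokenize(`'single quote' "double quote" \\"escaped quote\\"`)
// -> [ 'single quote', 'double quote', '"escaped', 'quote"' ]

Note that the output matches the bash column of the comparison above, including the quirky split of the escaped quotes.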

My Proposal

  1. Do not strip quotes from arguments passed as an array, ever. Assume they have already been tokenized by a shell or tokenizer. I would argue that parse with an argument of any[] should leave every single argument unmolested, as it should assume that it is receiving the arguments directly from process.argv. Otherwise it is difficult to figure out when and where quotations need to be added or removed, so it should assume that the inputted quotes are intended to remain. This is probably a small breaking change; you could probably ease it with a deprecated option as well, but in my opinion maintaining both modes seems like a bit of a pain.
    In the case where the user has specified in options that they want quotes removed, the question becomes how many times you strip the quotes; this needs further discussion, as it seems to be a post-processing measure.

  2. Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers, flavored like cmd, powershell, and bash. This way, not only can the programmer select which style they want, there is also no longer a "but I want my quotes removed here but not here" issue, because you can respond, "you selected the powershell style and that's how powershell does it", removing any opinion on how the quotes are removed and ending this plague of issues.
    You also get the benefit of being able to unit test directly against the shells themselves. I'm not totally confident that the exact method I used above is very CI friendly, but a similar concept could be used: take a bunch of predefined argument strings and check that the relevant shell produces the same output as the flavored tokenizer (see the sketch after the API outline below).
    You could lay out the top-level API like the following:

    
    //If you know you are tokenizing, don't accept tokens
    export type Tokenizer = (arg: string) => string[];
    export type CommandLineParser = (arg: string, opts?: Partial<Options>) => Arguments;

    //Same stuff as before for backwards compatibility
    export interface Parser {
      (args: ArgsInput, opts?: Partial<Options>): Arguments; //Will now by default use tokenize.default to tokenize
      detailed(args: ArgsInput, opts?: Partial<Options>): DetailedArguments;
      camelCase(str: string): string;
      decamelize(str: string, joinString?: string): string;

      //new stuff
      //included tokenizers
      tokenize: {
        //Calling require('yargs-parser').tokenize("") should just use the default tokenizer
        (arg: string): string[];
        default: Tokenizer; //freeze the current tokenizer for backward compat
        cmd: Tokenizer;
        powershell: Tokenizer;
        bash: Tokenizer;
      };
      //Include tokenizer flavors as full parsers
      cmd: CommandLineParser;
      powershell: CommandLineParser;
      bash: CommandLineParser;
      //Allow programmers to feed it their own tokenizer? So less complaining about format
      parser: (tokenizer: Tokenizer) => CommandLineParser;
    }
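
For illustration, here is how the proposed surface might be used (hypothetical; none of these entry points exist in yargs-parser today):

const parser = require('yargs-parser')

//Array input is assumed to be pre-tokenized (e.g. straight from
//process.argv), so quotes survive untouched (proposal 1):
parser(['"positional"', '-a', "'flag'"])
// -> { _: [ '"positional"' ], a: "'flag'" }

//String input goes through a shell-flavored tokenizer (proposal 2):
parser.bash(`'single quote' "double quote"`)
// -> { _: [ 'single quote', 'double quote' ] }

//Or bring your own tokenizer:
const naiveParse = parser.parser(str => str.split(/\s+/))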


I understand this goes somewhat against previous efforts (see #153), but ultimately the tokenizer is meant to take the place of a shell that would otherwise do the tokenizing for you, so it doesn't make sense not to emulate the function of the shell in this regard.
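
Here is a sketch of the shell-conformance testing idea from proposal 2 (my own illustration, not an existing yargs-parser test; it assumes a POSIX environment and a hypothetical bash-flavored tokenizer named bashTokenize):

const fs = require('fs')
const { execSync } = require('child_process')
const assert = require('assert')

const cases = [
  `'single quote' "double quote" \\"escaped quote\\"`,
  `plain --flag="spaced value"`,
]

for (const cmd of cases) {
  //Let bash itself tokenize the string: run it as a script that
  //echoes its argv back as JSON.
  fs.writeFileSync('case.sh',
    `node -p 'JSON.stringify(process.argv.slice(1))' ${cmd}`)
  const expected = JSON.parse(execSync('bash case.sh'))
  assert.deepStrictEqual(bashTokenize(cmd), expected, cmd)
}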

This could also open up an opportunity to separate the tokenizers into their own npm package. It would be neat to have a package that is just designed to emulate how shells tokenize to argv. `yargs-tokenizer`, anybody?

Let me know what y'all's thoughts are on this; I'd be willing to do some of the groundwork. I also believe that most (not all) of this can be introduced without any breaking changes.
bcoe commented 4 years ago

Thanks for this great research you've done up front 👍

> Have the tokenizer handle removing quotes in accordance with how users would expect it from their shell. That means supporting three different tokenizers that are flavored like cmd, powershell, and bash.

Over the years with yargs we've tried really hard to find compromises that meet the widest variety of user expectations and, ideally, are identical in a wide variety of environments.

Sometimes we introduce configuration options, when folks have irreconcilable goals ... and it seems like there are enough people on both sides of the fence that it's worth doing. Ideally these configuration options are added rarely.

I would rather start by looking at your findings and seeing if there are a few additional bugs we can fix for some shell environments without breaking the existing behavior. Only if it's absolutely necessary, e.g., a situation where we need to do things one way on powershell vs. bash and it's impossible to find a non-breaking middle ground, would I suggest that we add additional configuration options.

I do not like the idea of decomposing the tokenizers into their own packages. Ideally we'd have one tokenizer that supports as wide a variety of process.argv formats as possible, and we'd maybe introduce some additional config settings for the tokenizers.

Note of clarification: along with moving towards supporting Deno/TypeScript/ESM, I've been putting work into consolidating yargs' footprint with regard to dependencies. Several popular alternatives like commander have zero dependencies ... I'm a proponent of decomposing libraries, but have definitely gotten some flak for yargs' footprint. tl;dr: I've been trying to be really selective about what additional dependencies we add or break out.