yaml / pyyaml

Canonical source repository for PyYAML
MIT License
2.54k stars 515 forks source link

Question about a "curly quotes" safe check #800

Closed giuliohome closed 4 months ago

giuliohome commented 4 months ago

Is it possible to raise an exception when parsing a yaml file that is using left double quotation mark () instead of straight double quotation marks (") as string delimiters? (also somehow related)

Thanks

giuliohome commented 4 months ago

Notice that a rudimentary approach to implement this check on Windows is to open the file without using UTF-8, thus triggering the exception mentioned here.

nitzmahone commented 4 months ago

While it seems unlikely we'd ever add the ability to fail on valid documents for arbitrary reasons (that's arguably the job of a linter), it's not hard to implement most any kind of rules like that you want by pre-inspecting or intercepting the token stream yourself:

import re
import yaml

from yaml.tokens import ScalarToken

doc = """
double: "mar"
single: 'mar'
none: mar
leftright: “mar” 
"""

# inspect pre-scan
curly_quoted_scalars: list[ScalarToken] = [tok for tok in yaml.scan(doc) if isinstance(tok, ScalarToken) and re.search('[“”]', tok.value)]

for tok in curly_quoted_scalars:
   print(f'found curly-quote scalar at {tok.start_mark}')

# bake it into a custom loader   
class NoCurlyQuotesSafeLoader(yaml.SafeLoader):
    def get_token(self):
        tok = super().get_token()

        if isinstance(tok, ScalarToken) and re.search('[“”]', tok.value):
            raise ScannerError(problem="this loader does not allow curly-quoted scalars", problem_mark=tok.start_mark)

        return tok

print(yaml.load(doc, Loader=NoCurlyQuotesSafeLoader))
giuliohome commented 4 months ago

@nitzmahone Excellent! Thank you very much!!!