Bug: yamlpath 3.4.1 API: Unexpected behavior from Deep Traversal searches when any leaf node has a null value

rbordelo commented 3 years ago

Operating System

Name/Distribution: Ubuntu
Version: 20.10

Version of Python and packages in use at the time of the issue.

Distribution: CPython
Python Version: 3.9.4
Version of yamlpath installed: 3.4.1
Version of ruamel.yaml installed: 0.17.4

Minimum sample of YAML (or compatible) data necessary to trigger the issue

Things:
  - name: first thing
    rank: 42
  - name: second thing
    rank: 5
  - name: third thing
    rank: null
  - name: fourth thing
    rank: 1

Complete steps to reproduce the issue when triggered via:

Load YAML document that has at least one leaf element explicitly set to null.
Preform a search using a YAML path that includes a deep traversal. The specified path does not necessarily have to match the leaf node(s) that are set to null. Example: /**/name or **.name.
Iterate over the returned results. An error is thrown once the first leaf node that is set to null is reached.
Command-Line Tools (yaml-get, yaml-set, or eyaml-rotate-keys): N/A
Libraries (yamlpath.*): Minimum amount of code necessary to trigger the defect.

#!/usr/bin/env python3
"""Test script."""
from os.path import dirname, join
from yamlpath.common import Parsers
from yamlpath.wrappers import ConsolePrinter
from yamlpath import Processor

class LogArgs:
    """Data object for configuring yamlpath logging."""

logargs = LogArgs()
setattr(logargs, "debug", False)
setattr(logargs, "verbose", True)
setattr(logargs, "quiet", False)

def main():
    """Print found YAML paths of a deep traversal search."""
    yaml_file = join(dirname(__file__), "bad.yaml")

    log = ConsolePrinter(logargs)
    yaml = Parsers.get_yaml_editor()
    (yaml_data, doc_loaded) = Parsers.get_yaml_data(yaml, log, yaml_file)

    if not doc_loaded:
        exit(1)

    processor = Processor(log, yaml_data)

    path = "/**/name"
    for node_coordinate in processor.get_nodes(path):
        log.info(f"{node_coordinate.path} = {str(node_coordinate.node)}")

if __name__ == "__main__":
    main()

Expected Outcome

The iteration of search results should complete successfully. If the leaf node(s) that were set to null matched the yamlpath search criteria, they should be returned with their values set to None. This was working as expected in yamlpath 3.4.0.

Actual Outcome

The following error is thrown:

$: ./test_script.py
Things[0].name = first thing
Things[1].name = second thing
Things[2].name = third thing
Traceback (most recent call last):
  File ".../test_script.py", line 38, in <module>
    main()
  File ".../test_script.py", line 33, in main
    for node_coordinate in processor.get_nodes(path):
  File ".../lib/python3.9/site-packages/yamlpath/processor.py", line 94, in get_nodes
    for opt_node in self._get_optional_nodes(
  File ".../lib/python3.9/site-packages/yamlpath/processor.py", line 1397, in _get_optional_nodes
    for node_coord in self._get_optional_nodes(
  File ".../lib/python3.9/site-packages/yamlpath/processor.py", line 1513, in _get_optional_nodes
    raise YAMLPathException(
yamlpath.exceptions.yamlpathexception.YAMLPathException: Cannot add PathSegmentTypes.KEY subreference to scalars at 'name' in '/**/name'.

yamlpath_bug.tar.gz

Screenshot(s), if Available

wwkimball commented 3 years ago

Please try telling the engine to only return nodes that already exist by changing for node_coordinate in processor.get_nodes(path): to for node_coordinate in processor.get_nodes(path, mustexist=True):. I expect this will not only resolve the issue for you but it will also ensure there are no inadvertent changes to the DOM. The default -- perhaps counterintuitively -- is mustexist=False, which is most useful only when deliberately changing the DOM (which happened to be my own most-pressing use-case for these tools when I first started writing them).

From the Processor class documentation:

        Keyword Parameters:
        * mustexist (bool) Indicate whether yaml_path must exist
          in data prior to this query (lest an Exception be raised);
          default=False

Also from the release notes for 3.4.1:

Bug Fixes:

    yaml-set (and the underlying Processor class) were unable to change nodes
    having a null (None) value to anything else. This changes how null/None
    values are handled by the Processor during node retrieval; they are no longer
    discarded, so you will receive None as the data of any retrieved NodeCoords
    for appropriate null/None leaf nodes.

Without setting mustexist=True, the processor attempts to build any parts of the DOM necessary to match the given YAML Path, where possible. When it is not possible, an exception is thrown. I evidently didn't test deep traversal with null nodes because it looks like the new null node behavior is tricking the deep traversal code into an attempt to build an unknown DOM in its place.

While setting mustexist=True is most likely what you need to be doing anyway -- since you don't appear to be deliberately change the DOM -- I will take a look at this anyway because a deep traversal shouldn't be attempting to build any DOM, no matter the setting of mustexist.

Thanks!!

rbordelo commented 3 years ago

Thank you for your analysis. I've modified the test script by adding a mustexist=True parameter to the processor.get_nodes() method and changed the output to show the repr of each nodecoords that is emitted by the generator returned by processor.get_nodes(). Adding the mustexist=True parameter did prevent a YAMLPathException from being thrown (good), but now I'm seeing spurious nodecoords being emitted for every node whose value is null, whether it matches the search criteria or not (strange.) Here's the output I am getting now (the second to last line of each trial output is the spurious noodecoords):

$: # Trial 1: path='**/name'
$: ./test_script.py
repr(node_coordinate): NodeCoords('first thing', 'ordereddict([('name', 'first thing'), ('rank', 42)])', 'name')
repr(node_coordinate): NodeCoords('second thing', 'ordereddict([('name', 'second thing'), ('rank', 5)])', 'name')
repr(node_coordinate): NodeCoords('third thing', 'ordereddict([('name', 'third thing'), ('rank', None)])', 'name')
repr(node_coordinate): NodeCoords('None', 'ordereddict([('name', 'third thing'), ('rank', None)])', 'rank')
repr(node_coordinate): NodeCoords('fourth thing', 'ordereddict([('name', 'fourth thing'), ('rank', 1)])', 'name')
$:
$: # Trial 2: path='**/rank'
$: ./test_script.py
repr(node_coordinate): NodeCoords('42', 'ordereddict([('name', 'first thing'), ('rank', 42)])', 'rank')
repr(node_coordinate): NodeCoords('5', 'ordereddict([('name', 'second thing'), ('rank', 5)])', 'rank')
repr(node_coordinate): NodeCoords('None', 'ordereddict([('name', 'third thing'), ('rank', None)])', 'rank')
repr(node_coordinate): NodeCoords('None', 'ordereddict([('name', 'third thing'), ('rank', None)])', 'rank')
repr(node_coordinate): NodeCoords('1', 'ordereddict([('name', 'fourth thing'), ('rank', 1)])', 'rank')

I can work around this issue for now by skipping all nodecoords elements emitted by the generator whose path attribute is None.

yamlpath2_bug.tar.gz

wwkimball commented 3 years ago

That is unexpected! I'll add a new unit test to trap for this in addition to spending time with the deep traversal code to give it the clearly missing update it needed for the new null node handling.

Thank you for bringing this up!

wwkimball commented 3 years ago

This was a simple fix and will appear in the upcoming 3.5.0 release. In the traversal code, I'd mistakenly added the new null node handling code before checking whether a filter (more YAML Path after the ** segment) was present. Simply moving it after the check resolves the issue. I've added a new unit test to your credit just to ensure this never regresses.

Thanks again!

wwkimball commented 3 years ago

I have published version 3.5.0, which includes this fix.

rbordelo commented 3 years ago

I've installed yampath 3.5.0 and it's working well in my application, fixing the problems I saw in version 3.4.1. Thanks for the update.

Robert On Sun, 2021-04-25 at 18:33 -0700, William W. Kimball, Jr., MBA, MSIS wrote:

I have published version 3.5.0, which includes this fix. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

wwkimball / yamlpath