weichch / system-text-json-jsondiffpatch

High-performance, low-allocating JSON object diff and patch extension for System.Text.Json. Support generating patch document in RFC 6902 JSON Patch format.
MIT License
101 stars 13 forks source link

SuppressDetectArrayMove does not work as expected #38

Open andrewslavin opened 1 year ago

andrewslavin commented 1 year ago

Steps to reproduce:

JsonDiffPatcher.DefaultOptions =
    () => new JsonDiffOptions { SuppressDetectArrayMove = true };

var x = JsonDocument.Parse(@"{""a"":[{""b"":1},{""b"":2}]}");
var y = JsonDocument.Parse(@"{""a"":[{""b"":1},{""b"":2}]}");
var z = JsonDocument.Parse(@"{""a"":[{""b"":2},{""b"":1}]}");

var test1 = x.DeepEquals(y);  //test1 is true
var test2 = x.DeepEquals(z);  //test2 is false, expected true

x.Deserialize<JsonNode>().ShouldEqual(y.Deserialize<JsonNode>());  //no exception
x.Deserialize<JsonNode>().ShouldEqual(z.Deserialize<JsonNode>());  //throws JsonEqualException, expected no exception

Either SuppressDetectArrayMove does not work, or I misunderstood what it does? I this case how can I configure JsonDiffPatcher to ignore array element order?

weichch commented 1 year ago

I don' think DeepEquals supports SuppressDetectArrayMove and ignoring element order. Can you switch to use Diff?

andrewslavin commented 1 year ago

Thanks @weichch. I tried this:

var d1 = x.RootElement.Deserialize<JsonNode>().Diff(y.Deserialize<JsonNode>());
var d2 = x.RootElement.Deserialize<JsonNode>().Diff(z.Deserialize<JsonNode>());

d1 is null as expected, but d2 is not.

weichch commented 1 year ago

Array diff is a bit complex.

The generated diff (in your case d1 and d2) reflects the operations required to make x become y, or x become z.

In your case you have two complex objects in the array: {"b": 1} and {"b": 2}. By default, the two objects are compared with DeepEquals (which compares whether they contain the same properties and whether each property has the same value etc), and the two objects aren't deeply equal to each other, so a diff is then generated to make [{"b": 1},{"b": 2}] become [{"b": 2},{"b": 1}]. In this process, you can't ignore ordering of the two elements because their ordering matters (in both DeepEquals and Diff).

You could try sort the array before calling Diff or DeepEquals.

If this isn't an option, you might need to override the array item diffing behavior by implementing your own ArrayItemMatch and force the two non-identical elements to be identical.

For example, below code would make two objects identical when they both contain a property b:

JsonDiffPatcher.DefaultOptions = () => new JsonDiffOptions
{
    ArrayItemMatcher = MatchArrayItem,
};

static bool MatchArrayItem(ref ArrayItemMatchContext context)
{
    if (context.Left.AsObject().ContainsKey("b") && context.Right.AsObject().ContainsKey("b"))
    {
        // Suppress diff of the two non-identical objects
        context.DeepEqual();
        return true;
    }

    if (context.Left.DeepEquals(context.Right))
    {
        context.DeepEqual();
        return true;
    }

    return false;
}

SuppressDetectArrayMove doesn't control diffing behaviors, it controls how a diff is generated.

For example, in these two arrays [1,2] and [2,1], to make left become right, we could simply move the element at index 0 in the left to index 1. And this is called a move. However, there is another way to achieve this, by deleting the element at index 0 in the left and inserting a new element at index 1. When you have SuppressDetectArrayMove set to false (default), the diff engine will detect if a move operation can be generated therefore minimal diff result can be generated. Otherwise, it will generate the second diff because it's a lot faster to generate, but the diff is in a larger size.

andrewslavin commented 1 year ago

Thank you for the detailed explanation. It makes sense, but would be nice to have an option that "properly" ignores array order. Two arrays are equal if:

  1. They have same length.
  2. For each element in left, there's an equal element in right.
  3. Elements in the left and right arrays can, in turn, can contain arrays too. Obviously when comparing those arrays we use rules 1 and 2 above.

Is such option possible but has not been implemented, or does it have some logical problem I'm not seeing?

weichch commented 1 year ago

would be nice to have an option that "properly" ignores array order.

From JSON definition's perspective, the order of elements in an array shouldn't be ignored.

In RFC 7159:

An array is an ordered sequence of zero or more values.

From a functional perspective, sure, it can be done but need more thinking.

Two arrays are equal if: They have same length. For each element in left, there's an equal element in right.

We will have to consider duplicates here. For example, the arrays below meet the rules 1 & 2, but the two aren't necessarily equal. It is likely we need to sort the arrays before comparing them or verify that they both contain the same elements.

left = [2, 2]
right = [2, 1]

Both approaches require the caller to pass in a hash/comparer function for complex objects.

andrewslavin commented 1 year ago

Yes I can see the problem with my logic of comparing arrays based on your above example. Sorting is probably not an option if we do not want to require passing in a hash/comparer function. When we do "For each element in left, there's an equal element in right, we need to know whether a given right element has already been matched to some left element or not. An immediate solution that comes to mind is maintaining a List<int> with positions of right elements that have already been matched to some left element.

weichch commented 1 year ago

If the approach is checking whether left contains right and then right contains left, then tracking the comparison result is going to improve the performance by avoiding comparing same set of elements twice.

There are also other approaches such as hashing:

Dictionary<JsonNode, int> appearanceOfEachElement;

In this approach, left can be appended to the dictionary with the Value of each entry being the number of appearances of the Key. Then go scan right and decrement the value by 1 for the same element found and remove the Key when its Value becomes 0. If the two arrays are equal, the dictionary should be empty by the end. This should avoid (in best case scenario) comparing each element twice so there isn't necessarily the need to track the comparison results. This is probably the most optimal solution as it has a complexity of O(n) in both time and space, compared to tracking the results where it has a complexity of O(n2) in both time and space.

However, this approach requires a hash function to work as we can't rely on the default GetHashCode function on JsonNode. Two JsonNodes would always have different hash values because one JsonNode can also be added into one parent, so they are always different objects.

Another thing we'd also need to consider is if we do provide an option to disable checking for order of elements, then it is likely we need to implement this in Diff rather than DeepEquals (because it won't be semantically correct there):

if(options.IgnoreArrayOrder && left.Length == right.Length)
{
    // do the comparison and suppress diff generation if equal 
}

We might have to keep the materialized JsonNode used in this routine and try to reuse them in the the following array diff. JsonNode isn't well designed to be used in a comparison scenario especially when its inner value is represented in JsonElement, and materializing JsonElement isn't very efficient. You will see what I mean in JsonString and JsonNumber and how they are cached in LCS.

weichch commented 1 year ago

Do you mind (me) changing the title of this issue to something like "Compare two arrays ignoring order" because I thought it is what we're discussing here and the topic is useful.

andrewslavin commented 1 year ago

Do you mind (me) changing the title of this issue

Yes definitely. I thought SuppressDetectArrayMove was not working as expected but that's not the case.

Regarding the exact left and right comparison logic, I'll leave it up to you. Though I do not understand why we need a hash function - we already have a way to compare two JsonNodes and decide if they are equal or not. That's pretty much what I'm using this nuget package for :).