zgrossbart / jdd

A semantic JSON compare tool
http://www.jsondiff.com
Apache License 2.0
1.05k stars 183 forks source link

Proper LCS #5

Closed jtheisen closed 7 years ago

jtheisen commented 7 years ago

In contrast to the typical diff implementations for text files (gnu diff anyway), JSON Diff doesn't mind giving a longer-than-necessary list of changes. For example:

[ "1", "2", "1" ] <-> [ "1", "1" ]

If the elements where lines then gnu diff would tell me that this is one insertion, and not one modification plus one insertion.

I'm sure that by giving only the smallest diff for arrays, you get a much neater diff.

The implementation requires implementing the LCS (largest common subsequence) algorithm, which is the dual problem of finding the shortest diff.

Note that I personally don't need this right now, but I thought it worth giving this bit of input.

In any case, I too think structured diffing, in particular for JSON, is a cool thing.

zgrossbart commented 7 years ago

Thank you very much for the suggestion. I'm not sure I fully understand what you're looking for. Was there a specific diff that JSONDiff didn't handle well? Can you help me find a use case where LCS would give a better answer.

Thanks, Zack

jtheisen commented 7 years ago

I just realized that there's more to it than LCS alone. Let's take the following example:

[
  ".",
  ".",
  ".",
  ".",
  { "id": 43243, "content": { "value1": "bar" } },
  ".",
  ".",
  ".",
  "."
]
[
  { "id": 13432, "value1": "foo" },
  ".",
  ".",
  ".",
  ".",
  { "id": 43243, "content": { "value1": "bar-changed" } },
  ".",
  ".",
  ".",
  "."
]

The current algorithm says that

But that's really not what a human sees. We'd much rather want:

To get there, however, one doesn't only need LCS but most importantly a reasonable notion of when elements in an array are considered equal.

Gnu diff & co work with line equality, possibly while stripping gratuitous whitespace first.

What could we use here, if anything?

That leaves objects, and the only thing I can think of in that case is that the diff algorithm needs to know more about what identifies objects - hence the "id" in my example.

If the diff algorithm knows (or is told by the user) which object values can be taken to define an identity, then, together with LCS, the above example could be made to produce the desired output.

Unfortunately, that may be outside the scope of what you want to do here, and I think without the identity bit, the LCS bit isn't of that much use either.

Maybe this obstacle is why there's so little structural diffing out there.

zgrossbart commented 7 years ago

You're right that the identity of the object is the problem. In your case all of the objects have handy ID tags, but even with those tags it is very difficult for us to determine that the object { "id": 43243, "content": { "value1": "bar" } } is the same as object { "id": 43243, "content": { "value1": "bar-changed" } }. There's no way for JSONDiff to know how to compare those objects and figure out that they are the same.

Even strings are difficult. Take this example:

[
  ".",
  "my string",
  "."
]

and

[
  ".",
  ".",
  "my changed string",
]

In this case you might want to know that the string changed instead of one was added and one was removed, but what if the change was larger? At what point do you start calling the changed string a new string? When more than 50% changes? When the beginning changes?

Whatever algorithm you come up with has to make sense to someone using the tool. Otherwise it just seems random when one string is considered a change and another is considered a deletion and addition.

Showing you more structural diffing of JSON is a great idea, but it's very difficult to make it work well in a generic case.

What do you think?

jtheisen commented 7 years ago

In you example with strings that wouldn't work well with gnu diff either - it's just the entire line that's being compared. I would argue that this is acceptable because most (?) JSON files don't have exceedingly long strings. They may well have exceedingly long and nested objects though.

The object identity determination could only work in the presence of a specific notion of an id attribute, so yes, that's a bit bad in the case of JSON that has no such notion built-in.

I just realized that it would work for various other formats such as HTML and all programming languages, as these things have the notion of names for all the things they contain.

Anyway, sorry for the unfruitful disturbance... :-)

zgrossbart commented 7 years ago

No worries at all. I'm glad you like the tool and it was a fun problem to think about for a little while.