SchemaBuilder incorrecly marking field as required for copied schema

rsteiger commented 1 month ago

Here's a reproduction of the issue:

builder_1 = genson.SchemaBuilder()
builder_1.add_object({"a": 0})
builder_1.add_object({"b": 0})

# Make a copy of builder_1.
builder_2 = genson.SchemaBuilder()
builder_2.add_schema(builder_1)

# This passes.
assert builder_1.to_schema() == builder_2.to_schema()

another_schema = {
    '$schema': 'http://json-schema.org/schema#',
    'properties': dict(c={'type': 'integer'}),
    'required': ['c'],
    'type': 'object'
}
builder_1.add_schema(another_schema)
builder_2.add_schema(another_schema)

# This fails, builder_2 marks 'c' as required.
assert builder_1.to_schema() == builder_2.to_schema()

wolverdude commented 1 month ago

Issue #25 seems to be rearing its ugly head again. You can find my explanation of the basic reason for why this is happening here. I apparently failed to add docs the README as I had said I would, but that said, I thought I had fixed the problem. Let me look a little deeper here.

wolverdude commented 1 month ago

Okay, so it looks like I fixed the problem that was specifically described in issue #25, but while yours is the same underlying problem, it wasn't fixed by that particular solution. Yay for treating symptoms, not causes! Unfortunately, I think we're stuck with this particular cause, so I'll have to stay focused on the symptoms.

What's happening is that GenSON implicitly converts a SchemaBuilder into a dict schema if it gets passed to add_schema(). This is logically simple, but it means that any information not serialized by to_schema() gets lost in the transition. This is actually good, because it forces GenSON not to stay very close to the bare JSON-Schema spec, and thus (hopefully) be more intuitive. But for the arcane reasons outlined in issue #25, the presence of an empty required field is a nonintuitive thing that gets lost.

The fix for issue #25 solved this by ensuring output schemas would always contain the required field if any input schema contained it, even if empty. But that doesn't work in this case because the input schema (builder_1) is itself a SchemaBuilder, and when serialized it "helpfully" leaves out the empty required field because no one passed it an input schema that directed it to do otherwise.

wolverdude commented 1 month ago

One way around this is just to use deepcopy to create an actual recursive duplicate of the builder object. But that will only work for fresh objects, not for adding a schema to another schema that already has its own data. In theory, SchemaBuilder itself could do something like this instead of the implicit dict conversion when another SchemaBuilder object is passed to add_schema(), but that would require adding an extra internal-only API all the way down, and I would rather not do that.

Outside of that, I unfortunately don't think this is fixable without seriously messing up the seed schema functionality. If this is a serious problem for your use-case, and you can do without seed schemas, I would suggest creating a custom object SchemaStrategy that inherits from the canonical one and sets this ivar to True instead of False. That will tell it to always include the required key in the output, so then the info that no keys are required won't be lost between builder_1 and builder_2.

rsteiger commented 1 month ago

Thank you for investigating. In my case I have a distributed workload where I'm loading a shared schema, adding objects, then updating the shared schema. Now, within a worker, instead of passing a schema around I am passing a builder around and only serializing it when updating the shared schema. It would be nice to not rely on this behavior, but it sounds like a difficult fix.

wolverdude / GenSON

SchemaBuilder incorrecly marking field as required for copied schema #78