weaviate / weaviate-python-client

A python native client for easy interaction with a Weaviate instance.
https://weaviate.io/developers/weaviate/current/client-libraries/python.html
BSD 3-Clause "New" or "Revised" License

Proposal: Models #205

Open barakalon opened 1 year ago

barakalon commented 1 year ago

This is a proposal for an ORM-style interface for weaviate-python-client.

I have implemented something very similar to this proposal in a private project. I want to make sure this makes sense for the general case before moving it here.

I am very new to Weaviate - apologies in advance for any ignorance...

More context:

Declare

Models are declared by subclassing the Model class.

from weaviate.models import Model, props

class Author(Model):
    name = props.String()
    age = props.Integer()
    born = props.Date()
    won_nobel_prize = props.Boolean()
    description = props.Text()

    # 'lambda' because this references a model that hasn't been declared yet
    writes_for = props.Reference(lambda: Publication)

class Publication(Model):
    # Schema can be customized with property arguments
    name = props.String(tokenization="word", description="The name of the Publication")

    # 'inverse' makes it easier to update bi-directional relationships
    has = props.Reference(Author, inverse=Author.writes_for)
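
To make the declaration style above more concrete, here is a minimal sketch of how property descriptors and lazy (lambda) references could work. It is not the proposed implementation: `Prop`, `String`, `Reference`, `Model` and the `target` attribute are simplified stand-ins invented for illustration, and the `inverse` argument is left out.

```python
from typing import Any, Callable, Optional, Union


class Prop:
    """Hypothetical base descriptor: values live in the instance __dict__."""

    def __set_name__(self, owner: type, name: str) -> None:
        self.owner = owner
        self.name = name

    def __get__(self, obj: Any, objtype: Optional[type] = None) -> Any:
        if obj is None:
            return self  # class-level access returns the descriptor itself
        return obj.__dict__.get(self.name)  # unset properties read as None

    def __set__(self, obj: Any, value: Any) -> None:
        obj.__dict__[self.name] = value


class String(Prop):
    pass


class Reference(Prop):
    def __init__(self, target: Union[type, Callable[[], type]]) -> None:
        self._target = target

    @property
    def target(self) -> type:
        # A lambda is only called on first use, so it may point at a
        # model that is declared further down in the module.
        if not isinstance(self._target, type):
            self._target = self._target()
        return self._target


class Model:
    def __init__(self, **values: Any) -> None:
        for key, value in values.items():
            setattr(self, key, value)


class Author(Model):
    name = String()
    writes_for = Reference(lambda: Publication)


class Publication(Model):
    name = String()
    has = Reference(Author)
```

Class-level access (`Author.writes_for`) returns the descriptor, which is what a query DSL could build on, while instance-level access returns the stored value.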

Schema

Model schemas are code-first (as opposed to schema-first).

from weaviate import Client
client = Client()

Author.schema()
#  {
#          "class": "Author",
#          "properties": [
#              {
#                  "dataType": ["string"],
#                  "moduleConfig": {
#                      "text2vec-openai": {"skip": True, "vectorizePropertyName": False}
#                  },
#                  "name": "name",
#              },
#  ...

client.schema.create([Author, Publication])

# It should also be possible to calculate diffs, and possibly to build a schema-change "migration" system on top of them
client.schema.diff([Author, Publication])
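
As a rough illustration of the code-first idea, the sketch below builds a Weaviate-style class definition from a plain mapping and compares it with a second definition. `build_schema` and `diff_schema` are hypothetical helper names, and the diff only looks at property names; a real diff would also have to compare data types and module configuration.

```python
from typing import Any, Dict, List


def build_schema(class_name: str, props: Dict[str, str]) -> Dict[str, Any]:
    """Build a class definition from {property name: data type}."""
    return {
        "class": class_name,
        "properties": [
            {"name": name, "dataType": [data_type]}
            for name, data_type in props.items()
        ],
    }


def diff_schema(declared: Dict[str, Any], remote: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare property names between the declared and the remote class."""
    declared_names = {p["name"] for p in declared["properties"]}
    remote_names = {p["name"] for p in remote["properties"]}
    return {
        "added": sorted(declared_names - remote_names),
        "removed": sorted(remote_names - declared_names),
    }


# 'remote' stands in for whatever the Weaviate instance currently reports.
declared = build_schema("Author", {"name": "string", "age": "int", "born": "date"})
remote = build_schema("Author", {"name": "string", "wonNobelPrize": "boolean"})
changes = diff_schema(declared, remote)
```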

Instantiation

Models act like plain ol' objects.

from datetime import date

author = Author(
    name="Kazuo",
    age=68,
    born=date(1954, 11, 8),
    won_nobel_prize=False,
    description="British novelist, screenwriter, musician, and short-story writer"
)
author.won_nobel_prize = True

publication = Publication(
    name="Faber and Faber",
    has=[author],
)

assert publication.has[0].name == "Kazuo"

Persistence

Persistence must go through the existing Client.

with client.batch as batch:
    batch.add_model(author)

    # This adds two cross references, including 'Author.writes_for' since 'inverse' was set
    batch.add_model(publication)
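
One way to picture the 'inverse' behaviour: each reference set on a model becomes one edge to write, and an inverse property adds the mirrored edge. The sketch below is an assumption about how that expansion might look; `RefEdge`, `expand_reference` and the placeholder UUID strings are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class RefEdge(NamedTuple):
    """One cross-reference to write: from_class/from_uuid --prop--> to_uuid."""

    from_class: str
    from_uuid: str
    prop: str
    to_uuid: str


def expand_reference(edge: RefEdge, to_class: str, inverse_prop: Optional[str]) -> List[RefEdge]:
    edges = [edge]
    if inverse_prop is not None:
        # Mirror the edge so both sides of the relationship stay in sync.
        edges.append(RefEdge(to_class, edge.to_uuid, inverse_prop, edge.from_uuid))
    return edges


edges = expand_reference(
    RefEdge("Publication", "pub-uuid", "has", "author-uuid"),
    to_class="Author",
    inverse_prop="writesFor",
)
```

Each resulting edge would then correspond to one reference write in the batch.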

Many ORMs bind model instances to a connection, allowing for something like author.save(). This can lead to complex state management, and I propose we stay out of this business for the time being.

For partial updates (i.e. via Client.data_object.update), Models will only include fields that have been set in the generated data_object (similar to Pydantic).
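
A minimal sketch of the "only fields that have been set" behaviour, assuming a hypothetical `TrackedModel` base class with a `to_data_object` method (neither name comes from the client):

```python
from typing import Any, Dict


class TrackedModel:
    """Hypothetical base class that remembers which fields were assigned."""

    def __init__(self, **values: Any) -> None:
        # Bypass our own __setattr__ so the bookkeeping set is not tracked.
        object.__setattr__(self, "_fields_set", set())
        for key, value in values.items():
            setattr(self, key, value)

    def __setattr__(self, key: str, value: Any) -> None:
        self._fields_set.add(key)
        object.__setattr__(self, key, value)

    def to_data_object(self) -> Dict[str, Any]:
        # Only explicitly assigned fields end up in the payload.
        return {key: getattr(self, key) for key in sorted(self._fields_set)}


author = TrackedModel(name="Kazuo")
author.won_nobel_prize = True
patch = author.to_data_object()
```

Fields that were never assigned (here `age`, `born`, `description`) are absent from `patch`, so a partial update would leave them untouched, while a field explicitly set to `None` would still be included.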

Query

Models can be used for query building.

# Returns a list of 'Author' instances
client.query.do(
    Author
        .get()
)

# Join relationships
client.query.do(
    Author
        .get(
            Author, 
            Author.writes_for)
)

# Join specific relationship type
client.query.do(
    Author
        .get(
            Author, 
            Author.writes_for >> Publication)
)

# Select specific properties
client.query.do(
    Author
        .get(
            Author.name, 
            Author.age, 
            Author.writes_for >> Publication.name)
)

# Filter
client.query.do(
    Author
        .get()
        .where(
            and_(
                Author.name == "Kazuo", 
                Author.writes_for >> Publication.name == "Faber and Faber"
            ))
        .near_text("UK novels")
        .limit(10)
)
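
The filter syntax above relies on operator overloading. Because `>>` binds more tightly than `==` in Python, `Author.writes_for >> Publication.name == "Faber and Faber"` is evaluated as `(path) == value`. The sketch below shows one way such expressions could be turned into a where-filter dictionary; the `Path` class and this `and_` helper are illustrative assumptions, and the `valueString` key assumes string-typed properties.

```python
from typing import Any, Dict, List


class Path:
    """Hypothetical property path: '>>' extends it, '==' builds a filter."""

    def __init__(self, segments: List[str]) -> None:
        self.segments = segments

    def __rshift__(self, other: "Path") -> "Path":
        return Path(self.segments + other.segments)

    def __eq__(self, value: Any) -> Dict[str, Any]:  # type: ignore[override]
        return {"path": self.segments, "operator": "Equal", "valueString": value}


def and_(*operands: Dict[str, Any]) -> Dict[str, Any]:
    return {"operator": "And", "operands": list(operands)}


# Stand-ins for Author.name, Author.writes_for and Publication.name.
author_name = Path(["name"])
author_writes_for = Path(["writesFor"])
publication_name = Path(["Publication", "name"])

where = and_(
    author_name == "Kazuo",
    author_writes_for >> publication_name == "Faber and Faber",
)
```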

Type safety

props are all descriptors and properly type annotated.

reveal_type(publication.has[0].name) # Union[builtins.str, None]
reveal_type(Publication.has >> Author.name)  # weaviate.models.Path[Publication]
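
For illustration, a descriptor can be made generic and given overloaded `__get__` signatures so that a type checker can distinguish class-level from instance-level access. This is only a sketch of the general pattern; the `Prop` class here is hypothetical and does not reproduce the proposal's `Path` types.

```python
from typing import Any, Generic, Optional, TypeVar, overload

T = TypeVar("T")


class Prop(Generic[T]):
    """Hypothetical typed descriptor; values default to None when unset."""

    def __set_name__(self, owner: type, name: str) -> None:
        self.name = name

    @overload
    def __get__(self, obj: None, objtype: Any = None) -> "Prop[T]": ...

    @overload
    def __get__(self, obj: object, objtype: Any = None) -> Optional[T]: ...

    def __get__(self, obj: Any, objtype: Any = None) -> Any:
        if obj is None:
            return self  # class access: the descriptor, usable for query building
        return obj.__dict__.get(self.name)  # instance access: the value or None

    def __set__(self, obj: Any, value: Optional[T]) -> None:
        obj.__dict__[self.name] = value


class Publication:
    name = Prop[str]()


publication = Publication()
publication.name = "Faber and Faber"
```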

Open questions:

dirkkul commented 1 year ago

Hi @barakalon

Sorry that it took so long to get back to it. We discussed it internally and we all really like your proposal and think it would make the UX of the client much better. However, we have many things going on, this is not our top priority right now, and we cannot invest too many resources into it.

On our side, we first want to evaluate whether we can do something similar in our Go/Java/JS clients. I'd assume so, but I think we would need to at least write up a proposal for each of them. If not, it does not necessarily mean that we would not do it for Python, but we want to make an informed decision. We have an internal deadline of the end of February to figure this out.

Moreover, we need to think carefully about the API with more examples/edge cases before committing to something. (@StefanBogdan will think about this and post a few)

I have a couple more comments/questions/points of discussion:

I don't expect you to answer all of them and there might be more coming up, but I think these are things we need to answer before committing to doing this :)

And more concretely regarding your proposal:

Weaviate doesn't have null constraints. This ultimately means that all properties should be Optional. Is this OK? Or should we add some client-side validation to implement something like "required=True"?

My feeling would be that we should not add something like this to the clients and if we would introduce required parameters.
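
For reference, the client-side validation being discussed could be as small as the following sketch. `missing_required` is a hypothetical helper, and whether such a check belongs in the client at all is exactly the open question.

```python
from typing import Any, Dict, List


def missing_required(data_object: Dict[str, Any], required: List[str]) -> List[str]:
    """Return the required property names that are absent or None."""
    return [name for name in required if data_object.get(name) is None]


problems = missing_required(
    {"name": "Kazuo", "age": None},
    required=["name", "age", "born"],
)
```

Since Weaviate itself would still accept the object, a check like this only protects writes that go through the same client code.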

barakalon commented 1 year ago
  • We cannot remove the "old" way of doing things, so we would need to keep both ways around. Are there any problems with this? Can we do the maintenance?
  • How big of a change/addition would this be? Ideally we would just add an additional API and internally use as much existing code as possible

No problems with this. The proposal above is backwards compatible and would be exactly as you mentioned - an additional API.

This looks like the perfect place for dataclasses

Do you mean Python's builtin dataclasses? I'm skeptical of that. I think we'd want our own custom Model class/metaclass to play along with the custom property descriptors. That's what enables the class property DSL.

We already have the schema package, so I'd do all schema creation through this package

That's right. This would integrate with the existing schema package.

As an additional idea, we could also (optionally) parse the response of GET etc. to return Models, so you'd get an Author class back, but that wouldn't be

I think your thought got cut off here. But yes - the models can be loaded from the GET response, and there should be some part of the interface that does that automatically. I think that's a requirement for a system like this.
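
A sketch of what loading models from a GET response could look like, assuming the usual GraphQL response shape of `{"data": {"Get": {<ClassName>: [...]}}}`. The `Author` class here is a plain stand-in (a dataclass only for brevity, not a statement about how Models should be implemented) and `load_authors` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Author:
    """Stand-in for a Model; every property is optional."""

    name: Optional[str] = None
    age: Optional[int] = None


def load_authors(response: Dict[str, Any]) -> List[Author]:
    # Missing keys fall back to empty containers, so an empty or partial
    # response yields an empty list instead of raising.
    rows = response.get("data", {}).get("Get", {}).get("Author", [])
    return [Author(name=row.get("name"), age=row.get("age")) for row in rows]


response = {"data": {"Get": {"Author": [{"name": "Kazuo", "age": 68}]}}}
authors = load_authors(response)
```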

ju-bezdek commented 1 year ago

This is awesome...

dropping some thoughts here...

  1. this proposal is introducing yet another modeling framework... there are the original python dataclasses, then there is dataclass_json, then we have pydantic that is becoming standard nowadays ... this is introducing another one and they all compete... even more, this is introducing even more non-standard features...

On the other hand it also brings a ton of magic that makes a lot of things easier...

I don't have a clear opinion here... but I'd consider maybe playing around with the idea of decorators and embracing standard python type annotations ...
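
To illustrate the annotation-based alternative, here is a small sketch that derives a class definition from standard type hints on a dataclass. `TYPE_MAP` and `schema_from_annotations` are hypothetical names, only a few plain types are handled, and `Optional[...]` or reference types would need extra work.

```python
from dataclasses import dataclass
from typing import Any, Dict, get_type_hints

# Assumed mapping from Python types to Weaviate data types.
TYPE_MAP = {str: "string", int: "int", bool: "boolean", float: "number"}


@dataclass
class Author:
    name: str
    age: int
    won_nobel_prize: bool


def schema_from_annotations(cls: type) -> Dict[str, Any]:
    """Derive a class definition from the annotations declared on cls."""
    hints = get_type_hints(cls)
    return {
        "class": cls.__name__,
        "properties": [
            {"name": name, "dataType": [TYPE_MAP[tp]]}
            for name, tp in hints.items()
        ],
    }


schema = schema_from_annotations(Author)
```

The trade-off is that plain annotations give no class-level attribute such as `Author.name` to build queries from, which is what the descriptor approach provides.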

Q: is there a reason to use the >> operator instead of a comma (,)? ... I don't think the >> will help the IDE in any way... and creating the path just as a list of properties enables us to use plain lists and thus more dynamic approaches...

and this syntax might also scare some newbies :)