Import dataset via command line

weaviate / weaviate-cli

CLI tool for Weaviate

https://weaviate.io/developers/weaviate/client-libraries/cli

BSD 3-Clause "New" or "Revised" License

Import dataset via command line #47

Closed bobvanluijt closed 3 years ago

bobvanluijt commented 4 years ago

A file formatted like this (where each data object should be a 1-to-1 copy of what the /v1/things and /v1/actions endpoints return):

{
    "things": [
        // THINGS OBJECT
    ],
    "actions": [
        // ACTIONS OBJECT
    ]
}

The feature should have at least the following functionality.

Milestone 1

[x] The data should be able to be imported in this manner: weaviate-cli data-import -f data.json
[x] Validate if the complete dataset is formatted properly and if the classes and properties are available in the schema (i.e., the file should be validated).
[x] Import based on batching.
[x] Catch an API POST timeout error and retry at increasing intervals (after 5 seconds, after 10 seconds, etc.), for up to 10 attempts.
[x] Set a flag to determine if an import should continue if an error occurs during the import on the Weaviate end (this can be determined based on the batch import response).
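
The retry requirement above could be sketched roughly as follows. This is a minimal illustration, not the CLI's or the client's actual code: the function name `retry_with_increasing_delay`, the `send` callable, and the injectable `sleep` parameter are all hypothetical, and a timeout is assumed to surface as Python's built-in `TimeoutError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_increasing_delay(
    send: Callable[[], T],
    attempts: int = 10,
    step_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds, waiting 5 s, 10 s, ... after each timeout.

    `send` is a hypothetical callable that performs one POST request and
    raises TimeoutError when the request times out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except TimeoutError:
            if attempt == attempts:
                raise  # give up after the last attempt
            sleep(step_seconds * attempt)  # 5 s, then 10 s, then 15 s, ...
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter is only a convenience so the waiting behaviour can be checked without actually pausing.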

Milestone 2

[ ] If no schema is set, use the import data to determine the schema.

fefi42 commented 4 years ago

Validate if the complete dataset is formatted properly and if the classes and properties are available

If this should be done before the schema is sent to Weaviate then we need to write a parser. I can assist with that, but it will probably take a bit of time. @etiennedi Maybe we can use Weaviate's parser so we don't have to reimplement the rules?

Both

Import based on batching.

and

Catch an API POST timeout error and retry at increasing intervals (after 5 seconds, after 10 seconds, etc.), for up to 10 attempts.

are already implemented in the client. I have long planned to replace the CLI code with the client as the back end. This might be a good moment to do that.

Set a flag to determine if an import should continue if an error occurs during the import on the Weaviate end (this can be determined based on the batch import response).

When one record in a batch fails, the rest of the batch will be loaded anyway. We can stop after the batch, but we can't stop at the specific record. If the requirement is to stop directly at the first error, we should not use batching for that.
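
A rough sketch of that "stop after the failed batch" behaviour, assuming a hypothetical `send_batch` callable that returns one entry per object (`None` on success, an error message otherwise). None of these names come from the client; they only illustrate the control flow.

```python
from typing import Callable, Iterator, List, Optional


def chunked(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def import_in_batches(
    objects: List[dict],
    send_batch: Callable[[List[dict]], List[Optional[str]]],
    batch_size: int = 100,
    continue_on_error: bool = False,
) -> List[str]:
    """Send objects batch by batch and return all error messages seen.

    A failing record does not abort its own batch; the earliest point at
    which the import can stop is after the batch that contained the error.
    """
    errors: List[str] = []
    for batch in chunked(objects, batch_size):
        results = send_batch(batch)  # hypothetical: None or message per object
        batch_errors = [message for message in results if message is not None]
        errors.extend(batch_errors)
        if batch_errors and not continue_on_error:
            break  # remaining batches are not sent
    return errors
```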

bobvanluijt commented 4 years ago

If this should be done before the schema is sent to weaviate then we need to write a parser

This is not needed, it is fine to assume that the schema is already in Weaviate and validate based on that schema.
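
As a sketch of what validating against the existing schema might look like: the shapes used here are assumptions made for illustration (a schema given as `{"classes": [{"class": ..., "properties": [{"name": ...}]}]}` and data objects carrying `"class"` and `"schema"` keys), and `validate_objects` is a hypothetical helper, not an existing function.

```python
from typing import Dict, List, Set


def validate_objects(objects: List[dict], schema: dict) -> List[str]:
    """Return a list of problems; an empty list means the data looks valid.

    Assumed shapes (illustrative only):
      schema:  {"classes": [{"class": "City", "properties": [{"name": "name"}]}]}
      object:  {"class": "City", "schema": {"name": "Amsterdam"}}
    """
    known: Dict[str, Set[str]] = {
        cls["class"]: {prop["name"] for prop in cls.get("properties", [])}
        for cls in schema.get("classes", [])
    }
    problems: List[str] = []
    for index, obj in enumerate(objects):
        class_name = obj.get("class")
        if class_name not in known:
            problems.append(f"object {index}: unknown class {class_name!r}")
            continue
        for prop in obj.get("schema", {}):
            if prop not in known[class_name]:
                problems.append(
                    f"object {index}: unknown property {prop!r} on class {class_name!r}"
                )
    return problems
```

Collecting all problems instead of failing on the first one lets the whole file be checked before any import starts.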

fefi42 commented 3 years ago

@StefanBogdan What do you think about milestone 2? This could be an interesting issue for you to get into the CLI and the python client.

Adding a command like weaviate-cli schema propose <data-file> that generates a schema based on given data. Think about what part of this functionality should be in the CLI and what in the client. I am thinking about a function like propose_property_for_data(data: list) in the client, that selects a fitting data type for a list of values and returns a property definition.

E.g. propose_property_for_data(['hello', 'world']) returns:

{
            "name": "",
            "description": "",
            "dataType": ["text"]
}

propose_property_for_data(['hello', 'world'], name='message', description='A message from a friend') returns:

{
            "name": "message",
            "description": "A message from a friend",
            "dataType": ["text"]
}

propose_property_for_data([True, False, True]) returns:

{
            "name": "",
            "description": "",
            "dataType": ["boolean"]
}

The CLI would then use that function to generate a useful schema.
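
A minimal sketch of how such a function might look, assuming a simple mapping from Python types to the Weaviate data types "boolean", "int", "number" and "text". The mapping rules and the error handling are assumptions for illustration; this is not an existing client function.

```python
from typing import Any, Dict, List


def propose_property_for_data(
    data: List[Any], name: str = "", description: str = ""
) -> Dict[str, Any]:
    """Propose a property definition whose data type fits all values in `data`."""
    if not data:
        raise ValueError("cannot propose a data type for an empty list")

    def is_number(value: Any) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if all(isinstance(value, bool) for value in data):
        data_type = "boolean"
    elif all(isinstance(value, int) and not isinstance(value, bool) for value in data):
        data_type = "int"
    elif all(is_number(value) for value in data):
        data_type = "number"  # floats, or a mix of ints and floats
    elif all(isinstance(value, str) for value in data):
        data_type = "text"
    else:
        raise ValueError("values have mixed or unsupported types")

    return {"name": name, "description": description, "dataType": [data_type]}
```

Checking for `bool` before `int` matters here, because otherwise `[True, False]` would be proposed as an integer property.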

@bobvanluijt Can you maybe elaborate a bit more on how the input data should be structured? @laura-ham Any opinions on this? I know you also thought about this feature before.

laura-ham commented 3 years ago

Very good idea, this will mean a lot for our users!

Some random thoughts, not necessarily good ones:

We should be clear to the user, and the program, what we return per function. So that the user or code knows what to expect when calling a function like propose_property_for_data(data: list). In your example it sometimes returns an object with all fields filled, and sometimes e.g. the 'name' is empty. We could choose to always return a value, even if no name is suggested, e.g. in your last example:
```
{
        "name": "undefinedBooleanPropNameOne",
        "description": "undefinedBooleanPropDescriptionOne",
        "dataType": ["boolean"]
}
```
That means that, especially for weaviate-cli schema propose <data-file>, the schema can be used directly, regardless of whether the proposed output makes sense. The upside is that we avoid errors if values are left empty by the script/user; the potential downside is that we end up with classes/properties with 'nonsense' names in Weaviate, which is not helpful for the user either. An alternative is to add some meta info to the returned result about the estimated correctness of the proposed values, empty values and errors.
What about having a fairly elaborate set of functions, including: propose_property_dataType_for_data(..), propose_property_name_for_data(..), propose_property_..._for_data(..) (function names can change, this is just to reflect the idea). I could imagine that functions like these could later help support (parts of) the schema generation process in the Console.

fefi42 commented 3 years ago

@laura-ham thanks for the response. I think it's a valid point that the schema returned by the CLI should be importable and not only halfway filled. I have not fully decided what should happen in the CLI and what in the Python client, though.

It might be better to have the Python client implement propose_property_for_data(data: list, name: str, description="") that always requires a name. The CLI would then set that name if possible, or generate one if there is none. Selecting a name might depend on the input data. I wonder what input data we are supporting. E.g. if we have a table we can just select the column name. If that is not possible, the CLI might just prompt the user for a name.

@laura-ham can you elaborate on how propose_property_name_for_data(..) would work? I think we should be careful to not go too much out of scope with such a function.

laura-ham commented 3 years ago

@fefi42 We indeed need to see what kind of input data we want to support. The name can barely be derived from just a list of data, but I can imagine that most data sources are JSON or Excel/CSV files. In those files, there could be column names defined, which the function takes as the property/class name.
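
For the CSV case, a rough sketch of taking property names from the header row could look like this. The helper name `read_columns` is hypothetical, and the example only uses Python's standard `csv` module; note that all values come back as strings, so any type guessing would still have to happen afterwards.

```python
import csv
import io
from typing import Dict, List


def read_columns(csv_text: str) -> Dict[str, List[str]]:
    """Map each column name in the CSV header to the list of its values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: Dict[str, List[str]] = {name: [] for name in (reader.fieldnames or [])}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


example = "name,population\nAmsterdam,800000\nUtrecht,1300000\n"
for column_name, values in read_columns(example).items():
    # each column name is a candidate property name; values are still strings
    print(column_name, values)
```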

Regarding your second point, propose_property_name_for_data(..): I am not sure if and how useful it would be. It was a random thought that maybe people, or a UI like the Console, could use a button/function that just generates the "name" value of a property or class from a list of data values. But this would require some NLP/ML work, and I don't think it has to be a first feature per se (it depends on what functionality we want in Milestone 2). But maybe these kinds of functions can be added in the future, so it is something to keep in mind when designing the function structures etc. As an example: if you have propose_property_name_for_data(data=["Amsterdam", "Utrecht", "Rotterdam"]), it would return:

{
            "name": "name"
}

and for propose_class_name_for_data(data={"name": ["Amsterdam", "Utrecht", "Rotterdam"], "population": [800000, 1300000, 600000], "country": ["Netherlands", "Netherlands", "Netherlands"]}) it could return

{
            "name": "City"
}

@bobvanluijt do you think this would be useful?

StefanBogdan commented 3 years ago

Weaviate now supports auto-schema, so this feature is no longer needed.