fefi42 opened this issue 3 years ago
Something regarding the gRPC interface: with v1.0.0 Weaviate will become more modular, and what is currently the contextionary will become one of the modules. This means that the contextionary interface will be gRPC based. I would prefer 3.ii in that case.
Thanks for sharing @fefi42 – very insightful
Custom contextionaries
In the past weeks we have experimented with creating custom contextionaries on the side. We were interested in a model that is provided through a Python library. Python is hugely popular and used by many data scientists and developers for their ML models. Support for such models would greatly increase the range of applications Weaviate could be used for.
BERT model spike
We used a model called BERT from the Hugging Face transformers library (https://huggingface.co/transformers/model_doc/bert.html). The library allows downloading the model as a TensorFlow or PyTorch model; we chose the latter.
BERT does not implement a global embedding, which makes it worse when searching for single words. Its strength lies in longer text, especially sentences. It would probably come in more handy in use cases like chat bots and less in classical search problems.
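Because BERT only produces contextual per-token vectors rather than one global embedding per word, a single vector for a text has to be derived from the token vectors, commonly by mean-pooling over the non-padding positions. A minimal sketch with numpy (the shapes and the toy values are assumptions for illustration; the real transformers model returns tensors of shape [batch, tokens, hidden]):

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the per-token vectors into a single text vector,
    ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(token_vectors.dtype)  # shape [tokens, 1]
    summed = (token_vectors * mask).sum(axis=0)
    return summed / mask.sum()

# toy example: 4 tokens, hidden size 3, last token is padding
tokens = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [2.0, 2.0, 2.0],
                   [9.0, 9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
print(mean_pool(tokens, mask))  # averages only the three real tokens
```

Mean-pooling is only one of several options (CLS-token or max-pooling are alternatives), and which one works best for search would need evaluation.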
TODO: Elaborate evaluation currently missing; waiting until the nil interface bug is resolved in Weaviate before continuing.
Steps to make this a feature
There have been multiple questions/requests already to use custom contextionaries from the community and customers. What steps need to be taken to allow this?
To create a Weaviate contextionary one must only implement the proto interface which allows Weaviate to make RPCs to it. As of now this contains the following list:
Here we run into the first issue. Not every model provides all the information that is needed to implement all these functions.
In the BERT case a function like
MultiNearestWordsByVector
can't be implemented because the vectors are dynamically created based on the text input. However, this function is also not required for most of Weaviate's features. If we want to support many models, we need to document exactly which functions are required for which features. Weaviate must stay robust in the sense that it remains usable even if not all features can be supported.
Recommendation 1: Document exactly which functions are needed for which features. Provide default implementations for all non-essential functions.
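One way to realize Recommendation 1 in code is a base class that gives every non-essential RPC a safe default, so a custom contextionary only has to override the essential methods. A plain-Python sketch under assumptions: the method set, signatures, and error type below are made up for illustration; only MultiNearestWordsByVector is named in this issue, and the real definitions live in the proto interface:

```python
class UnsupportedByModel(NotImplementedError):
    """Raised when a model cannot support a non-essential contextionary RPC."""

class BaseContextionary:
    # Essential: every custom contextionary must override this.
    def VectorForCorpi(self, corpi):
        raise NotImplementedError("essential method, must be implemented")

    # Non-essential: the default signals 'unsupported' instead of
    # breaking Weaviate features that never call it.
    def MultiNearestWordsByVector(self, vector, n):
        raise UnsupportedByModel("this model has no global word embedding")

class BertContextionary(BaseContextionary):
    def VectorForCorpi(self, corpi):
        # toy stand-in: a real implementation would run the BERT model here
        return [[float(len(c))] for c in corpi]
```

Weaviate could then catch the `UnsupportedByModel` error and degrade gracefully, which is exactly the robustness property described above.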
While the proto3 implementation is not terribly complicated, it is quite cumbersome and requires a fair amount of motivation to engage in. Reasons for this are: the rather obscure data structure that the data is wrapped in; the split into server and client, where testing becomes an asynchronous task; and a missing testing infrastructure that helps to quickly validate implementations.
Similar to the client libraries, it probably makes sense to provide some kind of abstraction for the interface implementation. If we provide a minimal interface layer that accepts Python-native datatypes (such as a numpy array) and converts them into the RPC vector data structure for the user, we meet users in their comfort zone and speed up development considerably.
Recommendation 2: Create abstraction layer for proto3 interface that handles language native types. Provide tests for this layer.
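The abstraction layer of Recommendation 2 could look roughly like this: the user hands over a numpy array and the layer wraps it in the RPC vector structure. The `Vector`/`VectorEntry` classes below are plain-Python stand-ins for the generated proto3 messages; their real names and fields in the Weaviate proto definition may differ:

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class VectorEntry:   # stand-in for a proto3 message holding one float
    entry: float

@dataclass
class Vector:        # stand-in for the proto3 vector message
    entries: List[VectorEntry] = field(default_factory=list)

def to_rpc_vector(array: np.ndarray) -> Vector:
    """Convert a 1-D numpy embedding into the (assumed) RPC vector structure."""
    if array.ndim != 1:
        raise ValueError("expected a 1-D embedding vector")
    return Vector(entries=[VectorEntry(entry=float(x)) for x in array])

def from_rpc_vector(vector: Vector) -> np.ndarray:
    """Inverse conversion, useful for round-trip tests of the layer."""
    return np.array([e.entry for e in vector.entries])
```

Pairing each conversion with its inverse makes the layer easy to test automatically, which addresses the missing testing infrastructure mentioned above.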
One major issue we ran into was speed. The BERT model calculates a vector for every corpus by propagating through multiple layers of operations. This slows down the calculation significantly compared to the contextionaries we have used so far. While speed is an issue, it is not an unsolvable problem.
Recommendation 3: Make Weaviate timeouts configurable. Maybe think about how to support long vectorization runs.
Recommendation 4: Provide structures and documentation, e.g. in the form of a Docker container template, that support horizontal scaling and the usage of GPUs.
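For Recommendation 3, the configurable timeout could be as simple as an environment variable read at startup; slow models like BERT on CPU would raise it, fast contextionaries keep the default. A sketch, where the variable name `CONTEXTIONARY_TIMEOUT` and the default are made up for illustration:

```python
import os

DEFAULT_TIMEOUT_SECONDS = 10.0

def vectorize_timeout() -> float:
    """Read the vectorization timeout (in seconds) from the environment,
    falling back to a default when it is unset."""
    raw = os.environ.get("CONTEXTIONARY_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    return float(raw)
```

The same value could then be passed as the deadline on each gRPC call, so a long vectorization run fails with a clear timeout error rather than a hung request.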