weaviate / contextionary

Weaviate's own language vectorizer, which allows for semantic context-based searches in Weaviate
https://weaviate.io/developers/weaviate/modules/retriever-vectorizer-modules/text2vec-contextionary
BSD 3-Clause "New" or "Revised" License

Custom contextionaries #43

Open fefi42 opened 3 years ago

fefi42 commented 3 years ago

Custom contextionaries

In the past weeks we have experimented with creating custom contextionaries on the side. We were interested in a model that is provided through a Python library. Python is hugely popular and used by many data scientists and developers for their ML models. Support for such models would greatly increase the range of applications Weaviate could be used for.

BERT model spike

We used a model called BERT from the Hugging Face transformers library: https://huggingface.co/transformers/model_doc/bert.html. The library allows downloading the model as a TensorFlow or a PyTorch model; we chose the latter.

BERT does not provide a global (static) embedding per word, which makes it worse when searching for single words. Its strength lies in longer text, especially sentences. It would probably come in more handy in use cases like chatbots and less so in classical search problems.
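
For illustration, a minimal sketch of how such a contextual vector can be computed with the transformers library, mean-pooling the last hidden layer into one fixed-size vector (one common pooling choice, not the only one):

  import torch
  from transformers import BertModel, BertTokenizer

  # Downloads the pretrained weights on first use.
  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  model = BertModel.from_pretrained("bert-base-uncased")
  model.eval()

  def embed(text: str) -> torch.Tensor:
      """Mean-pool the last hidden layer into a single fixed-size vector."""
      inputs = tokenizer(text, return_tensors="pt")
      with torch.no_grad():
          outputs = model(**inputs)
      return outputs.last_hidden_state.mean(dim=1).squeeze(0)

  # "bank" contributes a different vector in each sentence -- there is
  # no single global embedding per word, unlike in the contextionary.
  v1 = embed("She sat down on the river bank.")
  v2 = embed("He deposited the money at the bank.")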

TODO: An elaborate evaluation is currently missing; we are waiting until the nil interface bug is resolved in Weaviate before continuing.

Steps to make this a feature

There have already been multiple questions/requests from the community and from customers to use custom contextionaries. What steps need to be taken to allow this?

To create a Weaviate contextionary, one only has to implement the proto interface that allows Weaviate to make RPCs to it. As of now this consists of the following list:

  rpc IsWordStopword(Word) returns (WordStopword) {}
  rpc IsWordPresent(Word) returns (WordPresent) {}
  rpc SchemaSearch(SchemaSearchParams) returns (SchemaSearchResults) {}
  rpc SafeGetSimilarWordsWithCertainty(SimilarWordsParams) returns (SimilarWordsResults) {}
  rpc VectorForWord(Word) returns (Vector) {}
  rpc MultiVectorForWord(WordList) returns (VectorList) {}
  rpc VectorForCorpi(Corpi) returns (Vector) {}
  rpc NearestWordsByVector(VectorNNParams) returns (NearestWords) {}
  rpc MultiNearestWordsByVector(VectorNNParamsList) returns (NearestWordsList) {}
  rpc Meta(MetaParams) returns (MetaOverview) {}
  rpc AddExtension(ExtensionInput) returns (AddExtensionResult) {}
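
To make this concrete, here is a rough sketch of what serving part of this interface from Python could look like. The generated stub modules (contextionary_pb2, contextionary_pb2_grpc), the servicer and message names, the entries field, and the embed_corpus helper are assumptions for illustration, not the actual generated code:

  from concurrent import futures

  import grpc

  # Assumed names of stubs generated with `python -m grpc_tools.protoc`
  # from the contextionary proto file; the real generated names may differ.
  import contextionary_pb2
  import contextionary_pb2_grpc

  class CustomContextionary(contextionary_pb2_grpc.ContextionaryServicer):
      def VectorForCorpi(self, request, context):
          # Plug in your own model here (embed_corpus is a hypothetical helper).
          vector = embed_corpus(request.corpi)
          return contextionary_pb2.Vector(entries=vector)  # field name assumed

  def serve() -> None:
      server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
      contextionary_pb2_grpc.add_ContextionaryServicer_to_server(
          CustomContextionary(), server
      )
      server.add_insecure_port("[::]:9999")
      server.start()
      server.wait_for_termination()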

Here we run into the first issue: not every model provides all the information that is needed to implement all of these functions.

In the BERT case, a function like MultiNearestWordsByVector can't be implemented because the vectors are created dynamically based on the text input. However, this function is also not required for most of Weaviate's features.

If we want to support many models, we need to document exactly which functions are required for which features. Weaviate must stay robust in the sense that it remains usable even if not all features can be supported.

Recommendation 1: Document exactly which functions are needed for which features. Provide default implementations for all non-essential functions.
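
One possible shape for such defaults, again sketched in Python on top of the assumed stubs from above: a base class that answers every non-essential RPC with UNIMPLEMENTED, so a custom vectorizer only overrides what its model can actually provide:

  import grpc

  class DefaultContextionaryBase:
      """Hypothetical base class: non-essential RPCs report UNIMPLEMENTED
      instead of crashing, so Weaviate can degrade gracefully."""

      def _unimplemented(self, context, name):
          context.abort(
              grpc.StatusCode.UNIMPLEMENTED,
              name + " is not supported by this vectorizer",
          )

      def NearestWordsByVector(self, request, context):
          self._unimplemented(context, "NearestWordsByVector")

      def MultiNearestWordsByVector(self, request, context):
          self._unimplemented(context, "MultiNearestWordsByVector")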

While the proto3 implementation is not terribly complicated, it is quite cumbersome and requires a fair amount of motivation to engage with. Reasons for this are the rather obscure data structures that the data is wrapped in, the split into server and client, where testing becomes an asynchronous task, and the missing testing infrastructure that would help to quickly validate implementations.

Similar to the client libraries, it probably makes sense to provide some kind of abstraction for the interface implementation. If we provide a minimal interface layer that accepts Python-native datatypes (such as a numpy array) and converts them into the RPC vector data structure for the user, we meet users in their comfort zone and speed up development considerably.

Recommendation 2: Create an abstraction layer for the proto3 interface that handles language-native types. Provide tests for this layer.
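
As a rough sketch of what that layer could look like, again assuming a Vector message with a repeated float field (the field name entries is hypothetical):

  import numpy as np

  import contextionary_pb2  # assumed generated module name

  def to_rpc_vector(array):
      """Convert a numpy vector into the (assumed) RPC Vector message,
      so model authors never touch the proto types directly."""
      flat = np.asarray(array, dtype=np.float32).ravel()
      return contextionary_pb2.Vector(entries=flat.tolist())

A model author would then only ever return numpy arrays, while this layer and its tests guarantee the wire format is correct.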

One major issue we ran into was speed. The BERT model calculates a vector for every corpus by propagating the input through multiple layers of operations. This slows down the calculation significantly compared to the contextionaries we have used so far. While speed is an issue, it is not an unsolvable problem:

  1. Not every use case needs real-time speed in the first place, so slower vectorization might be acceptable in favour of a better model. If a model is too slow, Weaviate might start to time out the request, though.
  2. We are able to scale horizontally. There is no need to share data among model instances, so we can replicate them with relative ease.
  3. We don't use GPUs yet. In our spike we took the easy route and used CPUs, but both PyTorch and TensorFlow have very good GPU integration that makes use of heavy parallelization. Using GPUs/TPUs can reduce the vectorization time of a corpus significantly (see the sketch after this list).
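
Regarding point 3, moving the spike's model onto a GPU is a small change in PyTorch; a sketch, assuming the same bert-base-uncased setup as above:

  import torch
  from transformers import BertModel, BertTokenizer

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
  model = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

  def embed_batch(texts):
      # Batching plus GPU execution is where most of the speedup comes from.
      inputs = tokenizer(
          texts, padding=True, truncation=True, return_tensors="pt"
      ).to(device)
      with torch.no_grad():
          outputs = model(**inputs)
      return outputs.last_hidden_state.mean(dim=1).cpu()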

Recommendation 3: Make Weaviate timeouts configurable. Maybe think about how to support long vectorization runs.

Recommendation 4: Provide structures and documentation, e.g. in the form of a Docker container template, that support horizontal scaling and the usage of GPUs.

etiennedi commented 3 years ago

Something regarding the gRPC interface: with v1.0.0, Weaviate will become more modular, and what is currently the contextionary will become one of the modules. This means that:

  1. Weaviate without any modules will not have any vectorizers. All vectors at import and search time have to be provided by the user.
  2. The current contextionary (including this gRPC API) will be a module.
  3. This means that for a BERT vectorizer (not calling it a contextionary on purpose) there would be two options:
    1. It reuses the contextionary module, just with a different container, and therefore has to adhere to the gRPC API.
    2. It provides its own module and is therefore in control of the API and could present its own API (which wouldn't even have to be gRPC-based).

fefi42 commented 3 years ago

I would prefer 3.ii in that case.

bobvanluijt commented 3 years ago

Thanks for sharing @fefi42 – very insightful