weaviate / weaviate

Weaviate is an open-source vector database that stores both objects and vectors, combining vector search and structured filtering with the fault tolerance and scalability of a cloud-native database.
https://weaviate.io/developers/weaviate/
BSD 3-Clause "New" or "Revised" License

Allow contextual classification for transformers module #1603

Closed bobvanluijt closed 3 years ago

bobvanluijt commented 3 years ago

It would be great if this function is also supported in this module (text2vec-transformers).

etiennedi commented 3 years ago

The exact text2vec-contextionary-contextual type is completely specific to the contextionary "system", i.e. it uses occurrences to calculate a classification-specific vector on the fly, etc. This was introduced after initially disappointing results with just using the existing object vectors.

Since we don't have those means (occurrence, influence over the vector building process, etc.) available in other modules, I think we could implement a generic type contextual which is not specific to any module and therefore works with any module, including the text2vec-transformers module. It would simply pick the closest target vector for each source vector - as was the original intention.
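The "pick the closest target vector for each source vector" idea described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Weaviate's implementation: the function names, the dict-based inputs, and the choice of cosine similarity as the closeness measure are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_by_closest_target(source_vectors, target_vectors):
    """Assign each source object the label of its most similar target vector.

    Both arguments are hypothetical dicts mapping an id/label to a vector.
    No training data is involved: only the vectors the module already produced.
    """
    if not target_vectors:
        raise ValueError("at least one target vector is required")
    assignments = {}
    for source_id, source_vec in source_vectors.items():
        # Assumption: "closest" means highest cosine similarity.
        assignments[source_id] = max(
            target_vectors,
            key=lambda label: cosine_similarity(source_vec, target_vectors[label]),
        )
    return assignments


# Toy vectors standing in for what any vectorizer module might return.
targets = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
sources = {"article-1": [0.9, 0.1], "article-2": [0.2, 0.8]}
print(classify_by_closest_target(sources, targets))
```

Because the sketch only needs vectors, it does not depend on which module produced them, which is the property that makes the generic type module-agnostic.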

bobvanluijt commented 3 years ago

That would be great!

etiennedi commented 3 years ago

Sure. Priority-wise, do you think this is more urgent than the Auto-Schema feature? Or should I plan it after?

bobvanluijt commented 3 years ago

Also note: https://dsmaxpool.slack.com/archives/CCLK5F3NX/p1624699504272400 and related https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681

etiennedi commented 3 years ago

Makes me wonder if we should name the classification zeroshot. That's something people already understand, whereas contextual is something we'd need to explain to them.

bobvanluijt commented 3 years ago

Ha - good point

bobvanluijt commented 3 years ago

@laura-ham – can you please weigh in regarding the naming question?

laura-ham commented 3 years ago

I'm voting for zero-shot-classification, because people know this name, like @etiennedi mentions. I think we should stick to familiar and official terms as much as possible. Zero-shot classification is exactly what it is, see also https://en.wikipedia.org/wiki/Zero-shot_learning.

We could also reuse this term for a potential zero-shot question answering module :) https://www.arxiv-vanity.com/papers/1611.05546/ and https://www.aclweb.org/anthology/2020.emnlp-main.11/

bobvanluijt commented 3 years ago

Thanks @laura-ham – is this only a documentation and RESTful thing or does it also have an impact on the code @etiennedi ?

etiennedi commented 3 years ago

The naming? It's going to have some very, very minimal impact, yes. Mainly on object/function naming.

bobvanluijt commented 3 years ago

> The naming?

Yes :-)

> It's going to have some very, very minimal impact, yes.

@laura-ham – can you please add this (the naming) to the OPS list?

etiennedi commented 3 years ago

cc @antas-marcin: note that the new classification type will be called zeroshot (formerly planned as contextual). For now, this only affects the type we are newly adding, not the existing text2vec-contextionary-contextual. But we can think about deprecating that name as well and introducing text2vec-contextionary-zeroshot in a non-breaking fashion, i.e. allow both, but print a warning on the deprecated name.
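A non-breaking rename of the kind described above ("allow both, but print a warning on the deprecated name") might look roughly like this. It is a hedged sketch in Python rather than Weaviate's own code; `resolve_classification_type`, `KNOWN_TYPES`, and `DEPRECATED_ALIASES` are hypothetical names introduced for illustration only.

```python
import warnings

# Hypothetical registry: only the type names mentioned in this thread.
KNOWN_TYPES = {"zeroshot", "text2vec-contextionary-zeroshot"}

# Deprecated name -> replacement. The old name keeps working.
DEPRECATED_ALIASES = {
    "text2vec-contextionary-contextual": "text2vec-contextionary-zeroshot",
}


def resolve_classification_type(name):
    """Return the canonical type name, warning if a deprecated alias was used."""
    if name in DEPRECATED_ALIASES:
        replacement = DEPRECATED_ALIASES[name]
        warnings.warn(
            f"classification type '{name}' is deprecated, use '{replacement}' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return replacement
    if name in KNOWN_TYPES:
        return name
    raise ValueError(f"unknown classification type: {name!r}")
```

Existing requests that still send the old name would continue to resolve to the same behaviour, while the warning gives users time to migrate before the alias is eventually removed.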

laura-ham commented 3 years ago

> @laura-ham – can you please add this (the naming) to the OPS list?

@bobvanluijt What is the current to-do related to this? I don't see any task until changes in Weaviate are applied (new classification type or renaming text2vec-contextionary-contextual).

bobvanluijt commented 3 years ago

> I don't see any task until changes in Weaviate are applied

Certainly, but you can already create the OPS ticket with the disclaimer that this can be done when it's implemented in Weaviate ;-)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.