Closed etiennedi closed 5 years ago
Currently I am using a split function (on multiple occasions) that splits on these characters:
'-', '_', '.', ',', '"', "'", "/", "&"
Maybe even more would be helpful. The split must not remove any special characters of other languages.
The function also filters out very short tokens that might be left after splitting:
foo-a, bar's baz
=> [foo, bar, baz]
since s and a probably have no meaning by themselves. This could optionally be done for one- and two-character words.
It might also make sense to keep very short words if they are the only ones remaining:
fo-o, ba'r
=> [fo, ba]
Another very useful feature could be compound splitting. This makes sense especially in Germanic languages, where compound words are used heavily. A very basic implementation in Python can be found here. That implementation iterates over all characters and is therefore not very efficient. There is quite some research on this, so I am sure there is a quicker way.
I did a small experiment with Go's unicode package. Essentially, any character that has true in one of the last two columns (IsPunct || IsSpace) will be considered a splitting character, whereas anything in the other categories will be considered part of the word.
| Character | IsLetter | IsNumber | IsMark | IsPunct | IsSpace |
|-----------|----------|----------|--------|---------|---------|
| a         | true     | false    | false  | false   | false   |
| b         | true     | false    | false  | false   | false   |
| A         | true     | false    | false  | false   | false   |
| B         | true     | false    | false  | false   | false   |
| Ç         | true     | false    | false  | false   | false   |
| ç         | true     | false    | false  | false   | false   |
| Ö         | true     | false    | false  | false   | false   |
| ö         | true     | false    | false  | false   | false   |
| -         | false    | false    | false  | true    | false   |
| _         | false    | false    | false  | true    | false   |
| ,         | false    | false    | false  | true    | false   |
| &         | false    | false    | false  | true    | false   |
| (         | false    | false    | false  | true    | false   |
| #         | false    | false    | false  | true    | false   |
| (space)   | false    | false    | false  | false   | true    |
Any objections?
nope, looks good
Are we considering blacklisting and whitelisting certain words? Would be nice if we could explicitly keep something like R2-D2 as a word.
My intuition would be to say, let's split up "R2-D2" into ["r2", "d2"], because once we have the compound word feature it would essentially be the same as ["brad", "pitt"], i.e. two technically independent words that have special meaning when they occur next to one another.
> because once we have the compound word feature it would essentially be the same as ["brad", "pitt"]
Agreed
Released as en0.8.0-v0.3.3.
From https://github.com/semi-technologies/weaviate/issues/959
When creating the centroid for a concept class, Weaviate should not only split on whitespace, but on all non-alphabetical characters.
Examples:
foobar baz baq
=> [foobar, baz, baq]
foo-bar baz baq
=> [foo, bar, baz, baq]
foo-bar, baz, baq
=> [foo, bar, baz, baq]
foo-bar, b&z, baq
=> [foo, bar, b, z, baq]
foobar baz#(*@@baq
=> [foobar, baz, baq]
Idea: Maybe this can be combined with: https://github.com/semi-technologies/weaviate/issues/952