Closed etiennedi closed 5 years ago
Currently I am using a split function (on multiple occasions) that splits on these characters:
'-', '_', '.', ',', '"', "'", "/", "&"
Maybe even more would be helpful. The split must not remove any special characters of other languages.
The function also filters out very short tokens that might be left after splitting:
foo-a, bar's baz
=> [foo, bar, baz]
since s and a probably have no meaning by themselves. This could optionally be done for one- and two-character words.
It might also make sense to keep very short words if they are the only ones remaining:
fo-o, ba'r
=> [fo, ba]
Another very useful feature could be compound splitting. This makes sense especially in Germanic languages, where compound words are used heavily. A very basic implementation in Python can be found here. That implementation iterates over all characters and is therefore not very efficient. There is quite some research on this, so I am sure there is a quicker way.
I did a small experiment with Go's unicode package. Essentially, any character that has true in one of the last two columns (IsPunct || IsSpace) will be considered a splitting character, whereas anything in the other categories will be considered part of the word.
| Character | IsLetter | IsNumber | IsMark | IsPunct | IsSpace |
|-----------|----------|----------|--------|---------|---------|
| a         | true     | false    | false  | false   | false   |
| b         | true     | false    | false  | false   | false   |
| A         | true     | false    | false  | false   | false   |
| B         | true     | false    | false  | false   | false   |
| Ç         | true     | false    | false  | false   | false   |
| ç         | true     | false    | false  | false   | false   |
| Ö         | true     | false    | false  | false   | false   |
| ö         | true     | false    | false  | false   | false   |
| -         | false    | false    | false  | true    | false   |
| _         | false    | false    | false  | true    | false   |
| ,         | false    | false    | false  | true    | false   |
| &         | false    | false    | false  | true    | false   |
| (         | false    | false    | false  | true    | false   |
| #         | false    | false    | false  | true    | false   |
| (space)   | false    | false    | false  | false   | true    |
Any objections?
nope, looks good
Are we considering blacklisting and whitelisting certain words? Would be nice if we could explicitly keep something like R2-D2 as a word.
My intuition would be to say, let's split up "R2-D2" into ["r2", "d2"], because once we have the compound word feature it would essentially be the same as ["brad", "pitt"], i.e. two technically independent words that have special meaning when they occur next to one another.
> because once we have the compound word feature it would essentially be the same as ["brad", "pitt"]
Agreed
Released as en0.8.0-v0.3.3.
From https://github.com/semi-technologies/weaviate/issues/959
When creating the centroid for a concept class, Weaviate should not only split on whitespace, but on all non-alphabetical characters.
Examples:
foobar baz baq
=> [foobar, baz, baq]
foo-bar baz baq
=> [foo, bar, baz, baq]
foo-bar, baz, baq
=> [foo, bar, baz, baq]
foo-bar, b&z, baq
=> [foo, bar, b, z, baq]
foobar baz#(*@@baq
=> [foobar, baz, baq]
Idea: Maybe this can be combined with: https://github.com/semi-technologies/weaviate/issues/952