yaa110 / rake-rs

Multilingual implementation of RAKE algorithm for Rust
https://crates.io/crates/rake
Apache License 2.0
33 stars 8 forks

Tokenization of 's #3

Open kornelski opened 5 years ago

kornelski commented 5 years ago

The punctuation regex includes apostrophe, so it splits "foo's" as two separate phrases. I'm seeing "s something" in keywords.
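The symptom can be reproduced with a minimal sketch. It does not use the crate's actual code: `naive_phrases` is a hypothetical helper, and `char::is_ascii_punctuation` stands in for the punctuation regex. Since the apostrophe counts as punctuation, the possessive is cut off from its word:

```rust
/// Hypothetical stand-in for the phrase splitter: every ASCII punctuation
/// character, including the apostrophe, ends the current phrase.
fn naive_phrases(text: &str) -> Vec<&str> {
    text.split(|c: char| c.is_ascii_punctuation())
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}
```

Calling `naive_phrases("foo's something")` yields the two phrases `foo` and `s something`.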

I think it could be fixed by using less smart splitting:

    text.split(|c: char| match c {
        '.' | ',' | '!' | '?' | ':' | ';' | '(' | ')' | '{' | '}' => true,
        _ => false,
    })
    .filter(|s| !s.is_empty())
    .for_each(|s| {
        let mut phrase = Vec::new();
        s.split(|c: char| !c.is_alphanumeric() && c != '\'' && c != '’')
            .filter(|s| !s.is_empty())
            .for_each(|word| {
                let word = word.trim_matches(|c: char| !c.is_alphanumeric());

yaa110 commented 5 years ago

Please note that the library should be multilingual, e.g. ، and ؛ are punctuation characters in Persian, so a fixed list of ASCII punctuation would miss them. That is why \p{P} is easier to use for multilingual support. However, the apostrophe in 's should not be treated as a split point, as you mentioned.
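One way to reconcile both points, sketched with the standard library only: treat every character that is not alphanumeric, not whitespace, and not an apostrophe as a phrase delimiter. This is an assumption for illustration, not the crate's implementation; `is_phrase_delimiter` and `split_phrases` are hypothetical names. It only approximates \p{P}: it also splits on symbols and hyphens, and on joiners and combining marks that Unicode does not classify as alphabetic.

```rust
/// Hypothetical rule: a character ends a phrase unless it is a letter,
/// a digit, whitespace, or an apostrophe (ASCII ' or typographic ’).
fn is_phrase_delimiter(c: char) -> bool {
    !(c.is_alphanumeric() || c.is_whitespace() || c == '\'' || c == '’')
}

/// Splits `text` into phrases, each returned as a list of words.
fn split_phrases(text: &str) -> Vec<Vec<&str>> {
    text.split(is_phrase_delimiter)
        .map(|phrase| {
            phrase
                .split_whitespace()
                // Drop apostrophes at word edges (e.g. quote marks),
                // while keeping inner ones such as in "foo's".
                .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()))
                .filter(|w| !w.is_empty())
                .collect::<Vec<_>>()
        })
        .filter(|words| !words.is_empty())
        .collect()
}
```

With this rule, `split_phrases("foo's bar, baz")` keeps `foo's` as a single word in the first phrase, and `split_phrases("سلام، دنیا؛ خوب")` splits on the Persian comma and semicolon without listing them explicitly.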