Ngrams improvements

Euak commented 5 years ago

Hello, @yooper. Based on the Ngram Statistics Package by Ted Pedersen and Satanjeev Banerjee, I implemented some features for the ngram functionality of this library.

I fixed the separator insertion for ngrams created with a separator longer than one character.
I implemented a function that calculates the frequency of each ngram in an ngram array, together with the frequencies of its tokens. The counts follow Pedersen and Banerjee's package. For bigrams, it calculates the frequency of the bigram as a whole and the frequencies of its left and right tokens in the positions where they occur. For trigrams, it calculates the frequency of the trigram as a whole, the frequency of each token in its position, and the frequencies of the token pairs (first with second, first with third, and second with third), all in their respective positions.
Finally, I implemented calculations for statistical measures that determine the degree of association between the tokens of an ngram, also based on Pedersen and Banerjee's package.
Tests were also implemented.
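
To make the bigram counting described above concrete, here is a minimal sketch. It is written in Python for illustration only and is not the PHP code from this pull request; the function names `build_ngrams` and `bigram_frequencies` are hypothetical, and the multi-character separator `<>` is just an example value.

```python
from collections import Counter


def build_ngrams(tokens, n=2, separator=" "):
    """Join every run of n consecutive tokens with the given separator.

    The separator may be longer than one character.
    """
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_frequencies(tokens):
    """Return, for each bigram, a tuple of three counts:

    (frequency of the bigram as a whole,
     frequency of its first token in the left position,
     frequency of its second token in the right position)
    """
    pairs = list(zip(tokens, tokens[1:]))
    joint = Counter(pairs)                          # bigram as a whole
    left = Counter(first for first, _ in pairs)     # token in left position
    right = Counter(second for _, second in pairs)  # token in right position
    return {
        pair: (count, left[pair[0]], right[pair[1]])
        for pair, count in joint.items()
    }


tokens = ["a", "b", "a", "b", "c"]
ngrams = build_ngrams(tokens, n=2, separator="<>")
frequencies = bigram_frequencies(tokens)
```

Trigrams would extend the same idea with three positional counters plus one counter for each of the three token pairs.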

There is a much more detailed description of Pedersen and Banerjee's package in their paper, available at: http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf
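
As an illustration of what an association measure does with bigram counts, the sketch below computes the Dice coefficient and pointwise mutual information. This thread does not say which measures the contribution implements, so these two are chosen only as examples. The code is Python, not the library's PHP, and the parameter names (`joint`, `left`, `right`, `total`) are hypothetical.

```python
import math


def dice(joint, left, right):
    """Dice coefficient: 2 * joint / (left + right).

    joint: frequency of the bigram as a whole
    left:  frequency of the first token in the left position
    right: frequency of the second token in the right position
    """
    return 2 * joint / (left + right)


def pmi(joint, left, right, total):
    """Pointwise mutual information in bits.

    Compares the observed bigram count with the count expected if the
    two tokens were independent: expected = left * right / total.
    total: number of bigrams in the sample
    """
    expected = left * right / total
    return math.log2(joint / expected)


# A bigram seen twice, whose tokens each appear twice in their position,
# in a sample of four bigrams.
dice_score = dice(2, 2, 2)
pmi_score = pmi(2, 2, 2, 4)
```

Both scores grow when the two tokens occur together more often than their individual frequencies would suggest.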

Feel free to contact me in case of questions.

yooper commented 5 years ago

Thank you for the contribution. Before I merge, please make sure your variables are camelCase. Several of them currently use underscores.

Euak commented 5 years ago

Hi, @yooper. Sorry for that mistake. I fixed it. Thanks.

yooper / php-text-analysis

Ngrams improvements #47