yeskarthik / n-gram


Add bigrams #1

Open ronojoy opened 9 years ago

ronojoy commented 9 years ago

@yeskarthik, nice work. Can you improve the generator by including bigrams and sampling from the bigram probability distribution? Have a look at this paper for details on how to do this.

yeskarthik commented 9 years ago

@ronojoy I've just implemented bigrams; that paper looks interesting, thanks! :) Let me know your views on the bigram implementation.
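For readers following along, the idea being discussed (this is a minimal sketch, not the actual code in this repo) is to count which words follow which, then sample the next word from that conditional distribution:

```python
import random
from collections import defaultdict

def train_bigrams(tokens):
    # Map each word to a count of the words that follow it in the corpus.
    counts = defaultdict(lambda: defaultdict(int))
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1
    return counts

def sample_next(counts, word):
    # Sample from the conditional distribution P(next | word),
    # weighting each successor by how often it was observed.
    successors = counts[word]
    words = list(successors)
    return random.choices(words, weights=[successors[w] for w in words])[0]

def generate(counts, start, length=10):
    # Walk the chain until we hit the length cap or a word with no successor.
    out = [start]
    while len(out) < length and out[-1] in counts:
        out.append(sample_next(counts, out[-1]))
    return out
```

The function names here are illustrative, not taken from the repository.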

ronojoy commented 9 years ago

Yes, this looks good. Things will get cumbersome as you increase the Markov chain order with this approach. Therefore, can you now try to use the NLTK n-gram class to write this as a general n-gram model, with n = 1, 2, 3, ... given as a parameter? This code should not take more than 10 lines. Also, check out the options for smoothing the n-gram model in the NLTK class. How about trying to do this for Indian languages, using Unicode? Two pointers to help you:

  1. quick theory: https://sites.google.com/site/gothnlp/links/probability-and-n-grams
  2. NLTK ngrams: http://www.nltk.org/api/nltk.model.html
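The generalization being asked for (sketched here without NLTK, since the thread later notes that NLTK's `model` module was removed; the names are illustrative) simply replaces the single-word key with an (n-1)-word context tuple:

```python
import random
from collections import defaultdict

def train_ngrams(tokens, n=2):
    # Map each (n-1)-word context to counts of the word that follows it.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def generate(counts, seed, length=10):
    # seed is an (n-1)-tuple of starting words; the context window
    # slides forward as each new word is sampled.
    out = list(seed)
    k = len(seed)
    for _ in range(length):
        context = tuple(out[len(out) - k:])
        if context not in counts:
            break
        succ = counts[context]
        out.append(random.choices(list(succ), weights=list(succ.values()))[0])
    return out
```

With n = 1 the context is the empty tuple and this degenerates to unigram sampling, so a single parameter covers the whole family.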

yeskarthik commented 9 years ago

Thanks @ronojoy, I used the NLTK library and the implementation is now very simple: just 2-3 lines. By the way, as you suggested, I tried training on Tamil text and the results are good. I used the Thirukural and one of Bharathiyar's stories/poems. I noticed that words rarely repeat in those texts, so I am getting sentences exactly as they were in the training texts, just in a different order.

I also played around with other corpora available in the NLTK library. I'm wondering what their real-world applications would be. Maybe transliteration / translation / voice recognition engines might use them to choose the most probable next word?

One more thing I noticed: when I ran my bigram code on the Thirukural corpus and tried to generate text, it took a very long time (more than 20 minutes, at which point I stopped the script), but the same task runs in seconds using the NLTK library. There's a huge performance difference.
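A plausible cause of such a gap (an assumption on my part, not a diagnosis of this repo's code) is rescanning the whole corpus for every generated word instead of building the successor table once up front:

```python
import random
from collections import Counter, defaultdict

def next_word_naive(tokens, word):
    # Rescans the entire corpus on every call: O(corpus size) per
    # generated word, which adds up fast on a large text.
    followers = Counter(b for a, b in zip(tokens, tokens[1:]) if a == word)
    words = list(followers)
    return random.choices(words, weights=[followers[w] for w in words])[0]

def build_index(tokens):
    # Pays the scanning cost once; afterwards each lookup is a
    # cheap dictionary access.
    index = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        index[a][b] += 1
    return index

def next_word_indexed(index, word):
    followers = index[word]
    words = list(followers)
    return random.choices(words, weights=[followers[w] for w in words])[0]
```

Both functions sample from the same distribution; only the cost per generated word differs.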

ronojoy commented 9 years ago

@yeskarthik, just had a look at your code and here are some more suggestions to improve the model:

N-gram models have tons of applications in NLP. They are usually the first port of call for simple classification tasks. For instance, a naive Bayes classifier invokes an n-gram model with N = 1 to compute word probabilities. Suggestions for the next word (e.g. in a search engine) are also generated by n-gram models. Likewise for the T9 algorithm, etc.
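The naive Bayes connection mentioned above can be made concrete. The sketch below (illustrative names, add-one smoothing chosen for simplicity) classifies a document by combining a class prior with unigram, i.e. N = 1, word probabilities:

```python
import math
from collections import Counter

def train_nb(docs):
    # docs: list of (tokens, label) pairs. We keep per-class document
    # counts, per-class unigram word counts, and the shared vocabulary.
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, n_docs in class_counts.items():
        wc = word_counts[label]
        total = sum(wc.values())
        # Log prior plus add-one smoothed log likelihood of each word.
        lp = math.log(n_docs / total_docs)
        for w in tokens:
            lp += math.log((wc[w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The unigram counts here play exactly the role the comment describes: the per-class word frequencies are the N = 1 n-gram model.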

I haven't looked into the NLTK n-gram implementation. An optimised data structure, or off-loading to C code, could be the reason they get so much of a speedup.

yeskarthik commented 9 years ago

Thanks @ronojoy, I was reading about smoothing and found these:

  1. http://www.cs.jhu.edu/~jason/465/PDFSlides/lect05-smoothing.pdf
  2. http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf

While trying to implement the Witten-Bell smoother, I found that NLTK has removed 'model' (including NgramModel) from its latest version (develop), since it has a number of unresolved bugs, including the one I hit while implementing it (I got a division-by-zero error).
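For context, the Witten-Bell scheme under discussion can be written in a few lines (a sketch of the textbook formulation, not the NLTK code; the names are mine). Seen events get probability c/(N+T) and the reserved mass T/(N+T) is split evenly over the Z unseen events. Note the guard on Z: if the bin count equals the number of observed types, Z is zero, which is exactly the kind of place a division-by-zero bug can creep in:

```python
from collections import Counter

def witten_bell(counts, bins):
    # counts: Counter of observed events; bins: total number of
    # possible events (vocabulary size).
    n = sum(counts.values())   # N: total observations
    t = len(counts)            # T: distinct observed types
    z = bins - t               # Z: unseen types

    def prob(event):
        if event in counts:
            return counts[event] / (n + t)
        # Guard against Z == 0, where naive code divides by zero.
        return t / (z * (n + t)) if z else 0.0

    return prob
```

With counts {a: 2, b: 1} and bins = 4, the seen events get 0.4 and 0.2, and the two unseen events get 0.2 each, so the distribution still sums to one.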

Do you have any other library in mind?

Refs:

  1. http://stackoverflow.com/questions/15697623/training-and-evaluating-bigram-trigram-distributions-with-ngrammodel-in-nltk-us (This is the error that I got)
  2. https://github.com/nltk/nltk/issues/367
  3. http://stackoverflow.com/questions/26443084/is-there-an-alternate-for-the-now-removed-module-nltk-model-ngrammodel

ronojoy commented 9 years ago

@yeskarthik, they have removed the n-gram model from the main branch since there are bugs in parts of the code. Faizal (#valuefromdata) and I are planning to work on this over the next several weeks to fix the bugs and send a pull request to NLTK. You are welcome to help out, if you want.

The current solution is to roll back to the older version of NLTK, avoid the Lidstone family of smoothers, and generally be careful to check that all returned probabilities are between 0 and 1. You can also talk to Faizal for additional input.

The other library which has everything implemented and is generally bug-free is the SRI language modelling toolkit. It is an old-style C library, with a command line binary and millions of switches! :) Give it a spin, if only to see how much cooler working in Python is.