Reducing the problem to binary classification (identify given text english or not)

zjc0516 / language-detection

Automatically exported from code.google.com/p/language-detection

Reducing the problem to binary classification (identify given text english or not) #11

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago

Hi,

I just want to detect whether a given text is in English or not. For my problem, 
I am not really interested in identifying the exact language of the text. I 
understand that reducing the number of target languages to predict will increase 
accuracy. So, can I infer that just predicting whether the text is in English or 
not, instead of predicting the exact language, will increase accuracy? If so, 
how can I tweak this package to detect whether a given text is English or not?

Regards,
Vamsi

Original issue reported on code.google.com by zeeva...@gmail.com on 22 Mar 2011 at 2:57

GoogleCodeExporter commented 8 years ago

So, is all you need to know whether the text is in "Latin" script or not? (e.g. 
Spanish and French are written using the same character set as English.) If 
so, I can paste some quick code here (in Java) to check the character set.
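
A minimal sketch of such a character-set check, using only the standard `Character.UnicodeScript` API from the JDK. The class and method names (`ScriptCheck`, `latinRatio`) are made up for illustration and are not part of langdetect. Note that this only tells Latin script from non-Latin script, so French or Spanish text would score the same as English.

```java
import java.lang.Character.UnicodeScript;

public class ScriptCheck {
    /**
     * Returns the fraction of letters in the text that belong to the Latin script.
     * Digits, punctuation and whitespace are ignored. Returns 0.0 if there are no letters.
     */
    public static double latinRatio(String text) {
        int letters = 0;
        int latin = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
                if (UnicodeScript.of(cp) == UnicodeScript.LATIN) {
                    latin++;
                }
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return letters == 0 ? 0.0 : (double) latin / letters;
    }

    public static void main(String[] args) {
        System.out.println(latinRatio("Hello world"));      // English, Latin script
        System.out.println(latinRatio("Bonjour le monde")); // French, also Latin script
        System.out.println(latinRatio("こんにちは"));          // Japanese, not Latin script
    }
}
```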

Could you clarify your needs?

Original comment by mawa...@live.com on 23 Mar 2011 at 2:58

GoogleCodeExporter commented 8 years ago

To be precise, I want to know whether the text is English or not (not Latin or 
not). My understanding is that the langdetect module detects languages not just 
based on character sets, but also using n-gram distributions. So, I was wondering 
how we can customize the langdetect module to identify whether a given text is 
English or not, so that French or Spanish text will be identified as 
non-English text.

Original comment by zeeva...@gmail.com on 23 Mar 2011 at 8:45

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Understood. Two immediate options I can think of:
1- Remove all profiles other than English. This might have an effect when 
normalising the probabilities (unsure, so I would need to test it).
2- Just control it within your code (if "en" then ... else ...). Reducing 
the number of profiles may well increase accuracy. But by how much? 5%? Is it 
really worth spending much more time testing and exploring?
You sound pretty technical. Do you have a programming background, or do you need 
help with implementing the proposed options?
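
Option 2 could look roughly like the sketch below. To keep it runnable without the library, it takes a plain map from language code to probability; with langdetect you would presumably fill that map from the detector's output (for example from the list returned by `Detector.getProbabilities()`). The class name `EnglishOrNot` and the 0.9 cut-off are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EnglishOrNot {
    /** Hypothetical cut-off, not a langdetect default; tune it on your own data. */
    public static final double MIN_ENGLISH_PROB = 0.9;

    /**
     * Collapses a multi-language result into a yes/no answer:
     * English only if "en" is present with at least the given probability.
     */
    public static boolean isEnglish(Map<String, Double> langProbs, double minProb) {
        return langProbs.getOrDefault("en", 0.0) >= minProb;
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new LinkedHashMap<>();
        probs.put("en", 0.95);
        probs.put("fr", 0.05);
        System.out.println(isEnglish(probs, MIN_ENGLISH_PROB)); // true

        Map<String, Double> frenchProbs = new LinkedHashMap<>();
        frenchProbs.put("fr", 0.99);
        System.out.println(isEnglish(frenchProbs, MIN_ENGLISH_PROB)); // false
    }
}
```

Using a threshold rather than only checking the top language lets low-confidence "en" guesses fall into the "not English" bucket.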

Original comment by mawa...@live.com on 23 Mar 2011 at 9:56

GoogleCodeExporter commented 8 years ago

Thanks for your response. 
1) I am afraid there is a problem with the first proposed option, if 
I understood it correctly. Don't we need at least one more profile besides 
English, so that the system can predict either English or not? If we keep only 
the English profile, then the detected language will always be English, won't it? 
Correct me if my understanding is wrong. Actually, I think this option 
would work if one more profile were created from all the other non-English texts. Let 
me know if that is possible.
2) As of now, I am using the second option. I would surely love to have a 
solution that works better than this.

Yes, I do have a programming background. I appreciate your response.
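
To make the idea in 1) concrete: a pooled "non-English" profile would be built by counting n-grams over texts from all the other languages together, next to a separate count for English. The sketch below is only a conceptual illustration with plain character trigram counts. It does not use langdetect's real profile format or training tool, and `PooledProfile` and `countNgrams` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PooledProfile {
    /** Counts every character n-gram of length n across all of the given texts. */
    public static Map<String, Integer> countNgrams(List<String> texts, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : texts) {
            String s = text.toLowerCase(Locale.ROOT);
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // One profile for English, one pooled profile for everything else.
        Map<String, Integer> english =
                countNgrams(Arrays.asList("the cat", "the dog"), 3);
        Map<String, Integer> nonEnglish =
                countNgrams(Arrays.asList("le chat", "el perro", "der Hund"), 3);

        System.out.println(english.get("the"));            // 2
        System.out.println(nonEnglish.containsKey("the")); // false
    }
}
```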

Original comment by zeeva...@gmail.com on 24 Mar 2011 at 1:13

GoogleCodeExporter commented 8 years ago

1) If you have a single profile, the language identifier will compare the n-grams 
generated from your input text with that single profile (for example "en"). 
Depending on your max threshold, it will return either "en" or 
nothing (or unknown).

Right, I will try this over the weekend and will come back to you early next 
week after conducting a few tests to measure the accuracy of this option.

Original comment by mawa...@live.com on 24 Mar 2011 at 8:51

GoogleCodeExporter commented 8 years ago

The profiles bundled with langdetect are not trained to separate English from 
non-English; each one is trained on a single language.
What you want would require re-training the language profiles as English versus non-English.

Regarding Comment 4-1: if you keep only the English profile, langdetect will always 
output "en" with 100% probability...
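
A small worked example of why that happens, assuming the detector normalises the per-language scores so that they sum to 1 over the loaded profiles (a simplification of the real algorithm; `NormalizeDemo` and the score values are made up for illustration). With a single profile, any positive score divided by itself is 1, however badly the text matches.

```java
import java.util.Arrays;

public class NormalizeDemo {
    /** Scales positive scores so that they sum to 1. */
    public static double[] normalize(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two profiles loaded ("en", "fr"): the scores compete with each other.
        System.out.println(Arrays.toString(normalize(new double[] {0.125, 0.375})));
        // prints [0.25, 0.75]

        // Only "en" loaded: even a tiny raw score normalises to 1.0.
        System.out.println(Arrays.toString(normalize(new double[] {0.000001})));
        // prints [1.0]
    }
}
```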

So within the current langdetect package, I also think the best way is the 
proposal in Comment 4-2.

Original comment by nakatani.shuyo on 27 Mar 2011 at 2:19

GoogleCodeExporter commented 8 years ago

I have tried a few options. The quickest and best way is to handle it as Nakatani 
suggested (Comment 4-2).

Original comment by mawa...@live.com on 31 Mar 2011 at 9:45