weiqk / language-detection

Automatically exported from code.google.com/p/language-detection

Language Detect plugin not filtering #72

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use the following URLs in seed.txt:
http://www.epa.gov/espanol/
http://en.wikipedia.org/wiki/Infection_in_childcare
http://www.italianinelmondo.com/
http://www.urdupoint.com/
http://www.bbc.com/news/
http://www.eeas.europa.eu/delegations/cameroon/index_fr.htm

2. Use Apache Nutch 1.8, custom-built against Cloudera's Hadoop distribution 
(CDH) 5.1.2, with Solr 4.9.

3. To allow only English-language websites to be indexed in Solr, keep only the 
en profile in the profiles directory and modify the LanguageDetector class to 
give priority only to English, as described at:
https://code.google.com/p/language-detection/wiki/NutchPlugin

Follow the steps as mentioned in the above link.

What is the expected output? What do you see instead?
Only English-language websites should be indexed in Solr.
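
The expected behaviour can be sketched as a small decision function. This is only an illustration under stated assumptions: `EnglishOnlyFilter`, `shouldIndex`, and the 0.90 threshold are hypothetical and are not part of language-detection or Nutch. The point it shows is that a page must be explicitly rejected when its detected language is not English; tagging a page with a language does not by itself keep it out of the index.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of language-detection or Nutch):
 * decides whether a page should be indexed, given a detected language.
 */
public class EnglishOnlyFilter {

    // Assumed cut-off for illustration; tune against real crawl data.
    public static final double MIN_PROBABILITY = 0.90;

    /**
     * @param detectedLang language code reported by the detector,
     *                     or null if detection failed
     * @param probability  detector confidence for that language, in [0, 1]
     * @return true only for confidently detected English pages
     */
    public static boolean shouldIndex(String detectedLang, double probability) {
        if (detectedLang == null) {
            // Detection failed: drop the page rather than index it blindly.
            return false;
        }
        boolean isEnglish = detectedLang.toLowerCase(Locale.ROOT).equals("en");
        return isEnglish && probability >= MIN_PROBABILITY;
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("en", 0.99)); // confident English
        System.out.println(shouldIndex("it", 0.99)); // Italian
        System.out.println(shouldIndex("en", 0.40)); // low-confidence English
        System.out.println(shouldIndex(null, 0.0));  // detection failed
    }
}
```

In a Nutch indexing filter, a decision like this would have to be acted on where the document is built. Returning null from the filter's `filter` method should discard the document in Nutch 1.x, but treat that as an assumption to verify against the Nutch 1.8 API.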

What version of the product are you using? On what operating system?
OS: Linux, 64-bit
Apache Nutch 1.8
Apache Solr 4.9
Cloudera CDH 5.1.2

Please provide any additional information below.
The library does not filter languages consistently. In some runs only English 
appears in the Solr admin and Spanish is filtered out (when seed.txt 
had only 1 English website and 1 Spanish website).
When the seed list is larger (2 English, 1 Spanish, 1 Italian, 1 French, and 
1 Urdu URL):
Expected output: only the 2 English URLs in the Solr admin.
Actual output: the 2 English URLs plus the Italian, French, and Urdu URLs are 
indexed.
Any help or suggestions for getting the correct results would be appreciated.

Thanks,
Guru

Original issue reported on code.google.com by gururajf...@gmail.com on 10 Sep 2014 at 6:38