Open GoogleCodeExporter opened 8 years ago
I suspect the cause is not that the text has no spaces, but that it is 'unknown language' text.
langdetect tends to stop the detection process once its language probabilities
converge, so detection finishes sooner for easy text than for hard text.
Original comment by nakatani.shuyo
on 17 Oct 2011 at 8:28
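The early-stop behaviour described in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions, not langdetect's actual code: the class `ConvergenceSketch`, the method names, the threshold value, and the likelihood tables are all made up for illustration. It shows why text that clearly favours one language finishes after a few steps, while text that favours no language runs to the iteration limit.

```java
public class ConvergenceSketch {
    // Assumed threshold for illustration; not read from langdetect.
    static final double CONV_THRESHOLD = 0.99999;

    // Multiply each language probability by the likelihood of one feature,
    // renormalise so the probabilities sum to 1, and return the largest one.
    static double updateAndNormalize(double[] prob, double[] likelihood) {
        double sum = 0.0;
        for (int i = 0; i < prob.length; i++) {
            prob[i] *= likelihood[i];
            sum += prob[i];
        }
        double max = 0.0;
        for (int i = 0; i < prob.length; i++) {
            prob[i] /= sum;
            if (prob[i] > max) max = prob[i];
        }
        return max;
    }

    // Return how many features are consumed before the top probability
    // exceeds the threshold, or the limit if that never happens.
    static int stepsUntilConverged(double[][] likelihoods, int limit) {
        int languages = likelihoods[0].length;
        double[] prob = new double[languages];
        java.util.Arrays.fill(prob, 1.0 / languages); // uniform prior
        for (int step = 0; step < limit; step++) {
            double max = updateAndNormalize(prob, likelihoods[step % likelihoods.length]);
            if (max > CONV_THRESHOLD) return step + 1;
        }
        return limit;
    }

    public static void main(String[] args) {
        // "Easy" text: every feature strongly favours language 0.
        double[][] easy = { { 0.9, 0.1 } };
        // "Unknown language" text: no feature favours either language.
        double[][] unknown = { { 0.5, 0.5 } };
        System.out.println("easy:    " + stepsUntilConverged(easy, 1000));
        System.out.println("unknown: " + stepsUntilConverged(unknown, 1000));
    }
}
```

In this toy model the unknown-language input never converges, so all of the cost comes from running every iteration up to the limit.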
I changed the mail regex slightly (bounded quantifiers instead of unbounded ones) and it improved the numbers a bit:
//private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
private static final Pattern MAIL_REGEX = Pattern.compile("[-_.0-9A-Za-z]{1,64}@([-_0-9A-Za-z]){1,63}(.([-_.0-9A-Za-z]{1,63}))");
New timings:
36 ms
154 ms
19 ms
140 ms
Original comment by dan...@nuix.com
on 17 Oct 2011 at 9:41
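To make the change above concrete, here is a minimal sketch that runs the old and new mail patterns over a long run of text with no spaces. The class name `MailRegexSketch`, the helper `stripMail`, and the sample input are assumptions for illustration, not code from langdetect; the two patterns are copied verbatim from the comment above. Timings will differ by machine, so the sketch prints them rather than claiming specific numbers.

```java
import java.util.regex.Pattern;

public class MailRegexSketch {
    // Old pattern from the comment: unbounded quantifiers. On a long run of
    // word characters without '@', each start position can scan to the end
    // of the run and backtrack, which is roughly quadratic in the run length.
    static final Pattern OLD_MAIL_REGEX =
        Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

    // New pattern from the comment: bounded quantifiers cap each attempt at
    // 64 characters before '@'. Note that the '.' before the last group is
    // unescaped as posted, so it matches any character, not only a dot.
    static final Pattern NEW_MAIL_REGEX =
        Pattern.compile("[-_.0-9A-Za-z]{1,64}@([-_0-9A-Za-z]){1,63}(.([-_.0-9A-Za-z]{1,63}))");

    // Hypothetical helper: replace every match of the pattern with a space.
    static String stripMail(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Build a run of 5000 characters with no spaces and no '@'.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 5000; i++) sb.append('a');
        String longRun = sb.toString();

        for (Pattern p : new Pattern[] { OLD_MAIL_REGEX, NEW_MAIL_REGEX }) {
            long start = System.nanoTime();
            String out = stripMail(p, longRun);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println(micros + " us, output length " + out.length());
        }
    }
}
```

Both patterns remove an ordinary address such as `foo@example.com` in the same way; the difference is only in how much work is done on input that never matches.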
Wow, I verified that your code gives a big improvement! (Honestly, I couldn't
believe it...)
I'll apply the change as you proposed.
Thank you very much!
Original comment by nakatani.shuyo
on 18 Oct 2011 at 3:57
//private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");
private static final Pattern URL_REGEX = Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");
The URL regex should also allow a comma.
Original comment by markowsk...@gmail.com
on 18 Oct 2011 at 2:00
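A small sketch of what the comma changes in practice: with the first pattern the match stops at the comma, so the tail of the URL is left in the text. The class `UrlRegexSketch`, the helper `firstMatch`, and the sample URL are assumptions for illustration; the two patterns are the ones quoted in the comment above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRegexSketch {
    // Pattern without a comma in the character class.
    static final Pattern URL_NO_COMMA =
        Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,2076}");

    // Pattern with a comma added to the character class.
    static final Pattern URL_WITH_COMMA =
        Pattern.compile("https?://[-_.,?&~;+=/#0-9A-Za-z]{1,2076}");

    // Hypothetical helper: return the first match, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "see http://example.com/a,b here";
        System.out.println(firstMatch(URL_NO_COMMA, text));
        System.out.println(firstMatch(URL_WITH_COMMA, text));
    }
}
```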
Yeah, I didn't want to turn this into a debate about which characters are valid
in email addresses, because actually there are *quite a few* more than what is
mentioned here.
Original comment by dan...@nuix.com
on 19 Oct 2011 at 1:28
You started the discussion with comment 2 ;)
All URI valid characters are here: http://www.ietf.org/rfc/rfc3986.txt
Original comment by markowsk...@gmail.com
on 19 Oct 2011 at 6:29
I created Issue 27 to track the comment about URL matching, since this ticket
is about the performance issue, not correctness.
Original comment by trejkaz
on 19 Oct 2011 at 9:50
Can we get a new release out? This is a very important fix for us, and I
think there have been many other important fixes since the last release.
Thanks!
Original comment by david.si...@gmail.com
on 20 Oct 2011 at 3:24
I haven't made a release yet, but I have committed the modified source and the jar file.
http://code.google.com/p/language-detection/source/browse/
http://code.google.com/p/language-detection/source/browse/#svn%2Ftrunk%2Flib
Original comment by nakatani.shuyo
on 20 Oct 2011 at 5:40
Original issue reported on code.google.com by
dan...@nuix.com
on 17 Oct 2011 at 5:55