zelandiya / maui-standalone

Maui fails to build thesaurus with text vocabulary #1

Closed H-B-Schmidt closed 9 years ago

H-B-Schmidt commented 9 years ago

13 Okt 2014 10:44:34 INFO MauiTopicExtractor - Extracting keyphrases with options:
13 Okt 2014 10:44:34 INFO MauiTopicExtractor - -l data/docs/EconBiz_test -m data/models/stw_keyword_extraction_model_top_100 -v data/vocabulary/stw_top_100_keywords.txt -f text -e default -i en -n 5 -c 0.0 -t com.entopix.maui.stemmers.PorterStemmer -s com.entopix.maui.stopwords.StopwordsEnglish
13 Okt 2014 10:44:34 INFO MauiTopicExtractor - -- Loading the model...
13 Okt 2014 10:44:35 INFO MauiTopicExtractor - --- Loading the vocabulary...
13 Okt 2014 10:44:35 INFO Vocabulary - --- Loading Vocabulary from text files...
13 Okt 2014 10:44:35 INFO Vocabulary - -- Building the Vocabulary index
13 Okt 2014 10:44:35 ERROR MauiTopicExtractor - Failed to load thesaurus!
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1954)
    at com.entopix.maui.vocab.Vocabulary.buildTEXT(Vocabulary.java:493)
    at com.entopix.maui.vocab.Vocabulary.initializeFromTXTFiles(Vocabulary.java:401)
    at com.entopix.maui.vocab.Vocabulary.initializeVocabulary(Vocabulary.java:131)
    at com.entopix.maui.main.MauiTopicExtractor.loadVocabulary(MauiTopicExtractor.java:442)
    at com.entopix.maui.main.MauiTopicExtractor.loadModel(MauiTopicExtractor.java:575)
    at com.entopix.maui.main.MauiTopicExtractor.main(MauiTopicExtractor.java:639)
    at com.entopix.maui.StandaloneMain.main(StandaloneMain.java:107)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:53)
    at java.lang.Thread.run(Thread.java:745)

The vocabulary is a text file with 241 lines of text. Training completed without any error message.
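
The trace ends in a `String.substring` call inside `Vocabulary.buildTEXT`. One common way to get an index of -1 there is to slice a line at the position returned by `indexOf` when the expected separator is missing, since `indexOf` returns -1 in that case. The sketch below reproduces that pattern in isolation. The tab separator and the method names `naiveId` and `safeId` are assumptions made for illustration; this is not Maui's actual code.

```java
public class SubstringDemo {
    // Hypothetical separator; the one Maui really expects is not shown in this thread.
    static final char SEPARATOR = '\t';

    // Naive parse: indexOf returns -1 when the separator is absent,
    // so substring(0, -1) throws StringIndexOutOfBoundsException.
    static String naiveId(String line) {
        return line.substring(0, line.indexOf(SEPARATOR));
    }

    // Defensive parse: returns null instead of throwing when the separator is absent.
    static String safeId(String line) {
        int pos = line.indexOf(SEPARATOR);
        return pos < 0 ? null : line.substring(0, pos);
    }

    public static void main(String[] args) {
        System.out.println(safeId("42\tEconomic growth"));
        System.out.println(safeId("Economic growth"));
        try {
            naiveId("Economic growth");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive parse failed: " + e.getMessage());
        }
    }
}
```

If this is what happens in `buildTEXT`, any vocabulary line without the separator would trigger the failure, regardless of how many well-formed lines precede it.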

zelandiya commented 9 years ago

Could you please post a snippet from your stw_top_100_keywords.txt vocabulary here? Does it follow the formatting guidelines?

H-B-Schmidt commented 9 years ago

Here is the snippet. I don't know about the formatting guidelines. It is a plain text file with 0A hex as the linefeed. Btw, is there a difference between Maui standalone and Maui 1.2, or will they produce identical results? Thanks for the quick reply.

Theory United States United States of America Germany EU countries World Estimation Great Britain UK (United Kingdom) United Kingdom Economic Growth Developing countries LDC (Less Developed Countries) Less developed countries Low-income countries Underdeveloped countries China People's Republic of China India Republic of India Russia France Monetary policy Japan Economic policy Macroeconomic policy Globalisation Globalization Internationalisation Internationalization Transnationalization Direct investment Foreign direct investment Foreign investment Poland Small and medium-sized enterprises Small and medium size firms SME SMEs Economic diplomacy External sector International economic relations Australia Italy Innovation Global enterprise Global firm MNC (Multinational company) MNE (Multinational enterprise) Multinational company Multinational corporation Multinational enterprise Transnational corporation Economic development Labor market Labour market Buying behaviour Consumer behaviour Consumer research Canada Multifactor productivity Productivity TFP (Total factor productivity) Total factor productivity Comparison Austria Bank

zelandiya commented 9 years ago

While you shouldn't be getting an exception, this is not the expected structure for the vocabulary. I would recommend downloading the "text" versions of the Agrovoc or the AOD vocabulary here: http://www.nzdl.org/Kea/download.html and structuring yours accordingly. This Maui version is 1.3 and it's a beta.
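
Before rerunning Maui, it may help to check the vocabulary file for lines that do not match the expected layout. The following is a minimal sketch that lists the lines lacking a given separator. The class `VocabularyCheck`, its method, and the default tab separator are all hypothetical; the actual layout should be taken from the downloaded Agrovoc or AOD example files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class VocabularyCheck {
    // Returns the 1-based numbers of non-blank lines that do not contain the separator.
    static List<Integer> linesMissingSeparator(List<String> lines, String separator) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (!line.trim().isEmpty() && !line.contains(separator)) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java VocabularyCheck <vocabulary file> [separator]");
            return;
        }
        // Assumed default separator: a tab. Pass a second argument to override it.
        String separator = args.length > 1 ? args[1] : "\t";
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<Integer> bad = linesMissingSeparator(lines, separator);
        System.out.println(bad.size() + " of " + lines.size()
                + " lines lack the separator: " + bad);
    }
}
```

Run against a file that holds only bare terms, one per line, such a check would flag every non-blank line.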

H-B-Schmidt commented 9 years ago

Ok, thanks a lot for the support. For KEA I read that you have to remove the author keywords from the text file, but I could not find this instruction for Maui. Is it necessary to do that?

zelandiya commented 9 years ago

I think what it means is that if you are working with scientific publications, most of them contain author keywords, and if you were to run performance evaluation, these keywords would give KEA an unfair advantage.