ziqizhang / jate


Purging index between JATE calls #32

Closed eltimster closed 7 years ago

eltimster commented 7 years ago

I am trying to run JATE on different corpora, but have found that it incrementally adds to the Solr index each time it indexes a new corpus, so I get terms not just from the corpus of interest but from the union of all corpora processed up to that point. My workaround has been to delete the files under the relevant data/index directory, but this is now causing an exception:

`2016-10-25 09:24:04 INFO  AppCValue:328 - Indexing corpus from [docs/english] and perform candidate extraction ...
2016-10-25 09:24:05 INFO  AppCValue:331 -  [151996] files are scanned and will be indexed and analysed.
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:08 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading exception data for lemmatiser...
Tue Oct 25 09:24:09 AEDT 2016 loading done
Tue Oct 25 09:24:09 AEDT 2016 loading done
2016-10-25 09:24:09 ERROR SolrCore:525 - [jateCore] Solr index directory '/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/jateCore/data/index/' is locked.  Throwing exception.
2016-10-25 09:24:09 ERROR CoreContainer:740 - Error creating core [jateCore]: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
org.apache.solr.common.SolrException: Index locked for write for core 'jateCore'. Solr now longer supports forceful unlocking via 'unlockOnStartup'. Please verify locks manually!
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:820)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:659)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:727)
        at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
2016-10-25 09:24:12 ERROR SolrCore:139 - org.apache.solr.common.SolrException: Exception writing document id 112188-q to the index; possible analysis error.
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:167)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:955)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1110)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:706)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:179)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:174)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:139)
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:153)
        at uk.ac.shef.dcs.jate.util.JATEUtil.addNewDoc(JATEUtil.java:339)
        at uk.ac.shef.dcs.jate.app.App.indexJATEDocuments(App.java:374)
        at uk.ac.shef.dcs.jate.app.App.lambda$index$4(App.java:340)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at uk.ac.shef.dcs.jate.app.App.index(App.java:338)
        at uk.ac.shef.dcs.jate.app.AppCValue.main(AppCValue.java:45)
Caused by: java.lang.NullPointerException
        at opennlp.tools.util.Cache.put(Cache.java:134)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:195)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:87)
        at opennlp.tools.postag.DefaultPOSContextGenerator.getContext(DefaultPOSContextGenerator.java:32)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:102)
        at opennlp.tools.ml.BeamSearch.bestSequences(BeamSearch.java:168)
        at opennlp.tools.ml.BeamSearch.bestSequence(BeamSearch.java:173)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:194)
        at opennlp.tools.postag.POSTaggerME.tag(POSTaggerME.java:190)
        at uk.ac.shef.dcs.jate.nlp.opennlp.POSTaggerOpenNLP.tag(POSTaggerOpenNLP.java:23)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.assignPOS(OpenNLPPOSTaggerFilter.java:103)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.createTags(OpenNLPPOSTaggerFilter.java:97)
        at org.apache.lucene.analysis.jate.OpenNLPPOSTaggerFilter.incrementToken(OpenNLPPOSTaggerFilter.java:51)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.getNextToken(ComplexShingleFilter.java:335)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.shiftInputWindow(ComplexShingleFilter.java:412)
        at org.apache.lucene.analysis.jate.ComplexShingleFilter.incrementToken(ComplexShingleFilter.java:175)
        at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:45)
        at org.apache.lucene.analysis.jate.EnglishLemmatisationFilter.incrementToken(EnglishLemmatisationFilter.java:30)
        at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:51)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:613)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:344)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:300)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1475)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:239)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:163)
        ... 23 more

2016-10-25 09:24:12 ERROR TransactionLog:567 - Error: Forcing close of tlog{file=/home/tim/forum-style/jate-2.0-beta.1/testdata/solr-testbed/ACLRDTEC/data/tlog/tlog.0000000000000004167 refcount=2}`

Is there a clean way to do what I want to do?

Also, as a side note, the lack of support for concurrent processing (likewise caused by Solr allowing only one JATE indexer to run against a core at a time) is a real bottleneck ...

jerrygaoLondon commented 7 years ago

Thanks for reporting this issue. From my perspective these are not bugs; please post to our Google group for further discussion. My short answers are below.

To analyse different corpora, you can create a separate Solr core directory for each corpus, with corpus-specific settings; you do not need to purge the Solr index every time. If you want to try a different ATE algorithm, you do NOT need to re-run candidate extraction. The corpus directory is optional in both embedded mode and plugin mode. In embedded mode, if "-corpusDir" is not provided, JATE skips the term candidate extraction step and directly runs term scoring, ranking, and exporting over the provided Solr core directory. In plugin mode, there is an 'extraction' option in 'solrconfig.xml'.
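As a rough command-line sketch of the two embedded-mode invocations described above (the jar name and argument order are illustrative assumptions; `-corpusDir`, `AppCValue`, and the testbed paths are taken from the log):

```shell
# First run: index the corpus into its own core and extract term candidates
java -cp jate-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue \
     -corpusDir docs/english testdata/solr-testbed jateCore

# Later runs over the same indexed corpus: omit -corpusDir to skip
# re-extraction and only re-score, rank, and export terms
java -cp jate-with-dependencies.jar uk.ac.shef.dcs.jate.app.AppCValue \
     testdata/solr-testbed jateCore
```

Because each corpus gets its own core directory, switching corpora never mixes terms, and switching algorithms over the same corpus reuses the existing index.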

From the Solr exception you report, I suspect your Solr core index is not clean. Check whether a write.lock file exists there. This usually happens when the Solr process was not shut down or killed cleanly: verify that no JATE/Solr process is still running, then simply remove all files in the data directory. Alternatively, for this unexpected situation, you can just remove the write lock manually, provided your Solr index is not corrupted and you don't want to re-index the corpus.
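A minimal sketch of the manual-unlock recovery described above; the index path is an example (substitute your own core's data/index), and the lock file is simulated here for illustration:

```shell
# Illustrative recovery from a stale write lock after an unclean shutdown.
INDEX_DIR=solr-testbed/jateCore/data/index
mkdir -p "$INDEX_DIR" && touch "$INDEX_DIR/write.lock"   # simulate the stale lock

# First confirm no JATE/Solr process is still running (e.g. with ps/pgrep);
# only then remove the stale lock -- safe only if the index is not corrupted.
rm -f "$INDEX_DIR/write.lock"
```

If the index may be corrupted, prefer deleting everything under the core's data directory and re-indexing instead of just removing the lock.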

For concurrent processing/indexing of a large corpus, there are many ways to scale Solr up or out. At large scale, JATE's embedded mode is not a good choice; you should use plugin mode and look into setting up a SolrCloud cluster, for instance.

Note that JATE is not intended to be just an app; we made it easy to run for demo purposes. It is designed and developed as a library that works with Apache Solr, and you can extend it with your own ATE algorithms on top of the Solr framework.