sicheva commented 6 years ago

I'm trying to execute TF-IDF in Embedded Mode with jate-2.0-beta.7, but I get this error:

_2018-06-14 14:34:33 INFO  TFIDF:27 - Beginning computing TermEx values,, total terms=4148
2018-06-14 14:34:33 INFO  TFIDF:38 - Complete
2018-06-14 14:34:45 **ERROR CachingDirectoryFactory:184** - Timeout waiting for all directory ref counts to be released - gave up waiting on CachedDir<<refCount=1;path=C:\Users\Sonja\Desktop\ProjetStage\jateDemo\solr-testbed\ACLRDTEC\data\index;done=false>>
2018-06-14 14:34:45 ERROR CachingDirectoryFactory:150 - Error closing directory:org.apache.solr.common.SolrException: Timeout waiting for all directory ref counts to be released - gave up waiting on CachedDir<<refCount=1;path=C:\Users\Sonja\Desktop\ProjetStage\jateDemo\solr-testbed\ACLRDTEC\data\index;done=false>>
        at org.apache.solr.core.CachingDirectoryFactory.close(CachingDirectoryFactory.java:187)
        at org.apache.solr.core.SolrCore.close(SolrCore.java:1257)
        at org.apache.solr.core.SolrCores.close(SolrCores.java:124)
        at org.apache.solr.core.CoreContainer.shutdown(CoreContainer.java:562)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.shutdown(EmbeddedSolrServer.java:263)
        at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.close(EmbeddedSolrServer.java:268)
        at uk.ac.shef.dcs.jate.app.App.extract(App.java:317)
        at uk.ac.shef.dcs.jate.app.AppTFIDF.main(AppTFIDF.java:49)

2018-06-14 14:34:45 INFO  AppTFIDF:516 - Exporting terms to [t-terms.json]
2018-06-14 14:34:45 INFO  AppTFIDF:520 - complete._

I don't understand what the problem with CachingDirectoryFactory is. I also noticed that whether I run TF-IDF or only TF, I get the same top-scored term, which is weird. Normally they are opposite functions, so I concluded that this error must be causing it.

ziqizhang commented 6 years ago

Hi

The error seems to occur because Solr does not close the index properly, but it should not affect the results.

TF-IDF and TF can produce the same score if IDF of a term is 1.0. Typically if you have only 1 document in the corpus, that would be the case.

Can you provide your test files for us to investigate?

sicheva commented 6 years ago

Hi, thanks for your response. I'm using 26 files. Here is the zip file: corpus.zip

ziqizhang commented 6 years ago

Thank you.

I cannot reproduce your error, unfortunately. See the log file below.

It may be that we are using different schema files. I am using the ACLRDTEC config in the distribution to index your corpus, which gives about 2,400 candidate terms, but your log indicates you had more than 4,000. If you share your configuration, I can have another look.

The tfidf and ttf results are also different. Though both rank 'plant' first, it clearly gets different scores:

ttf: [{"string":"plant","score":360.0},
tfidf: [{"string":"plant","score":0.02591184465570414}
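To see where the two exports actually diverge beyond the top term, the rankings can be compared programmatically. The following is a minimal sketch, not part of JATE: the helper names `load_ranking` and `rank_changes` are hypothetical, and the only thing assumed about the export is that it is a JSON array of objects with `string` and `score` fields, as in the snippets quoted in this thread. The sample values are the top-four scores reported later in this thread for the plugin-mode run.

```python
import json


def load_ranking(json_text):
    """Return term strings ordered by descending score."""
    terms = json.loads(json_text)
    ordered = sorted(terms, key=lambda t: t["score"], reverse=True)
    return [t["string"] for t in ordered]


def rank_changes(ranking_a, ranking_b):
    """Map each shared term whose position differs to (rank_in_a, rank_in_b)."""
    position_b = {term: i for i, term in enumerate(ranking_b)}
    return {
        term: (i, position_b[term])
        for i, term in enumerate(ranking_a)
        if term in position_b and position_b[term] != i
    }


# Top-four entries per algorithm, copied from the scores quoted in this thread.
TTF_JSON = """[
  {"string": "plant", "score": 360},
  {"string": "view", "score": 340},
  {"string": "plante", "score": 279},
  {"string": "date", "score": 235}
]"""

TFIDF_JSON = """[
  {"string": "plant", "score": 0.026804505797656034},
  {"string": "plante", "score": 0.016739207424584502},
  {"string": "view", "score": 0.012504136237184318},
  {"string": "dialog", "score": 0.011585130414139116}
]"""

ttf_ranking = load_ranking(TTF_JSON)
tfidf_ranking = load_ranking(TFIDF_JSON)
changes = rank_changes(ttf_ranking, tfidf_ranking)

print("TTF ranking:  ", ttf_ranking)
print("TFIDF ranking:", tfidf_ranking)
print("Moved terms:  ", changes)
```

With real exports, the JSON text would be read from the exported files instead of the inline strings. Terms that appear in only one list (such as "date" and "dialog" here) are ignored by `rank_changes`.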

Log output: indexing

Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:56 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:04:58 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:05:00 BST 2018 loading done
Sat Jun 16 10:05:00 BST 2018 loading done
2018-06-16 10:05:00 INFO Indexing:26 - DELETING PREVIOUS INDEX
2018-06-16 10:05:01 INFO Indexing:30 - INDEXING BEGINS
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.tika.parser.ParseContext (file:/home/zz/.m2/repository/org/apache/tika/tika-core/1.15/tika-core-1.15.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of org.apache.tika.parser.ParseContext
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2018-06-16 10:05:01 INFO IndexingHandler:28 - Beginning indexing dataset, total docs=26
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (3).txt
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 WARN OpenNLPTokenizer:150 - Token start and end offsets does not match with token length. Usually you may safely ignore this as it is often because there is an HTML entity in your text. Check Issue 26 on JATE webpage to make sure.
2018-06-16 10:05:01 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (10).txt
2018-06-16 10:05:02 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (5).txt
2018-06-16 10:05:03 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (1).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (12).txt
2018-06-16 10:05:04 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (4).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (3).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (2).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (9).txt
2018-06-16 10:05:05 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (8).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (7).txt
2018-06-16 10:05:06 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (3).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (6).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (7).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (4).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (5).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (2).txt
2018-06-16 10:05:07 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (1).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/roman (4).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (8).txt
2018-06-16 10:05:08 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/nabil (6).txt
2018-06-16 10:05:09 INFO IndexingHandler:34 - Processing:/home/zz/Work/data/corpus/doc/Output0nassim (11).txt
2018-06-16 10:05:09 INFO IndexingHandler:87 - Complete indexing dataset. Total processed items = 26
2018-06-16 10:05:09 INFO Indexing:37 - INDEXING COMPLETE

tfidf

Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:51 BST 2018 loading done
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:52 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:53 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:53 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:54 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:54 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
Sat Jun 16 10:06:55 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:06:55 BST 2018 loading done
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:06:55 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:06:55 INFO TFIDF:27 - Beginning computing TFIDF values,, total terms=2488
2018-06-16 10:06:55 INFO TFIDF:38 - Complete
2018-06-16 10:06:55 INFO AppTFIDF:492 - Exporting terms to [/home/zz/Work/data/tfidf.json]
2018-06-16 10:06:56 INFO AppTFIDF:496 - complete.

ttf

Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:33 BST 2018 loading done
Sat Jun 16 10:08:33 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:34 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:35 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:37 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:37 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
Sat Jun 16 10:08:38 BST 2018 loading exception data for lemmatiser...
Sat Jun 16 10:08:38 BST 2018 loading done
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:55 - Building features using cpu cores=8, total=2488, max per worker=311
2018-06-16 10:08:38 INFO FrequencyTermBasedFBMaster:64 - Complete building features. Total=2488 success=2488
2018-06-16 10:08:38 INFO TTF:25 - Beginning computing TTF values,, total terms=2488
2018-06-16 10:08:38 INFO TTF:32 - Complete
2018-06-16 10:08:38 INFO AppTTF:492 - Exporting terms to [/home/zz/Work/data/ttf.json]
2018-06-16 10:08:38 INFO AppTTF:496 - complete.

sicheva commented 6 years ago

Hi, I installed the Plugin Mode version (jate-2.0-beta.11) on Solr-7.2.1 and I get the same results as you, so that part is fine! But I still don't understand why the same term "plant" is ranked first by both algorithms, TF and TF-IDF. If "plant" is ranked first by TF with the maximal score, ttf: [{"string":"plant","score":360.0}] (which means "plant" appears a lot in the corpus), then it should not also be ranked first with the maximal score by TF-IDF, which gives greater weight to the least frequent terms!

Here is the example .json for TF -->

JSON
0 string : "plant" score : 360
  termInfo
    offsets
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
    variants
    otherInfo
1 string : "view" score : 340
  termInfo
2 string : "plante" score : 279
  termInfo
3 string : "date" score : 235
  termInfo

TF-IDF -->

JSON
0 string : "plant" score : 0.026804505797656034
  termInfo
    offsets
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (1).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (5).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (8).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (4).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (2).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\nabil (3).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (1).txt,path=null
      id=C:\Users\Sonja\jateSolrPluginDemo-master\doc\roman (3).txt,path=null
    variants
    otherInfo
1 string : "plante" score : 0.016739207424584502
  termInfo
2 string : "view" score : 0.012504136237184318
  termInfo
3 string : "dialog" score : 0.011585130414139116
  termInfo

"plant" appeared in 11 files of 26 ...

Thank you for your help

ziqizhang commented 6 years ago

OK, I'm glad that it works for you in the end. I am closing this issue now.

To answer the question about TF and TFIDF: it is possible for both to rank the same term as #1. They are not entirely opposite. Note that this TFIDF is not the original TFIDF, which works on individual documents (i.e., you get a different TFIDF for the word 'plant' in doc1, doc2, doc3, etc.). Instead, it is computed over the whole corpus:

TFIDF = TTF x IDF

So if a term has a high TTF, it can also have a high TFIDF. In the case of 'plant' in your corpus, it may be that this word also has a high IDF, i.e., it is found only in a subset of the corpus. But within this subset, it may have a very high frequency, thus giving a high TTF in the corpus too.
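The point can be checked with a small worked example. This is a sketch under stated assumptions, not JATE's implementation: it assumes the classic IDF form ln(N / df) (the exact IDF variant and any TTF normalisation JATE applies are not given in this thread, so the absolute scores will not match the exported ones). Only the 'plant' figures (TTF 360, found in 11 of 26 documents) and the other TTF values for 'view' and 'date' come from the thread; every other document frequency, and the 'dialog' TTF, is hypothetical.

```python
import math


def idf(total_docs, doc_freq):
    """Classic inverse document frequency, ln(N / df). Assumed form, see lead-in."""
    return math.log(total_docs / doc_freq)


def rank(scores):
    """Return terms sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


TOTAL_DOCS = 26

# term -> (corpus-level TTF, number of documents containing the term)
corpus_stats = {
    "plant": (360, 11),   # from the thread
    "view": (340, 20),    # TTF from the thread, document frequency hypothetical
    "date": (235, 26),    # TTF from the thread, document frequency hypothetical
    "dialog": (120, 5),   # both values hypothetical
}

ttf_scores = {term: ttf for term, (ttf, _) in corpus_stats.items()}
tfidf_scores = {
    term: ttf * idf(TOTAL_DOCS, df) for term, (ttf, df) in corpus_stats.items()
}

ttf_ranking = rank(ttf_scores)
tfidf_ranking = rank(tfidf_scores)

print("TTF ranking:  ", ttf_ranking)
print("TFIDF ranking:", tfidf_ranking)
for term in tfidf_ranking:
    print(f"{term:>6}: ttf={ttf_scores[term]:>3} tfidf={tfidf_scores[term]:.1f}")
```

In this toy setup 'plant' stays first under both measures because its high TTF is multiplied by a non-trivial IDF, while a term that occurs in every document ('date' here) drops to a score of zero and a rarer term ('dialog') moves up.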

Hope that makes sense.

ziqizhang / jate

Timeout waiting for all directory ref counts to be released #43
