Open BruceLee66 opened 5 years ago
Hi BruceLee66. Thanks for your post. Currently, Whoosh's original author, Matt Chaput, is not in this organization and nobody has been able to contact him. We are trying to keep this software working, but unfortunately we don't have much knowledge of the internals. Everyone has to read the code and learn. I will try my best to give you an answer, but if you would like to help us, we would be very grateful.
Hi @BruceLee66, I have been reading the weighting model code and I think I may have the answer.
Currently you have two options:
1. Use whoosh.scoring.FunctionWeighting and provide a customized weighting function.
2. Write your own weighting class, following the whoosh.scoring.FunctionWeighting class. You could read scoring.py to find some examples.
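As a rough illustration of the first option, here is a minimal sketch of a scoring function with the `searcher, fieldname, text, matcher` arguments that FunctionWeighting passes to its function. The names `lm_weight` and `build_weighting` and the default `u=1000` are made up for this example, and the smoothed language-model score is only one possible choice of function.

```python
def lm_weight(searcher, fieldname, text, matcher, u=1000):
    """Hypothetical smoothed language-model score for one term in one document."""
    # Collection-wide statistics come from the top-level searcher.
    parent = searcher.get_parent()
    cf = parent.reader().term_info(fieldname, text).weight()  # term count in collection
    fl = parent.field_length(fieldname)                       # total field length

    # Per-document statistics come from the current matcher position.
    tf = matcher.weight()
    dl = searcher.doc_field_length(matcher.id(), fieldname, 1)

    # Mix of the document estimate tf/dl and the collection estimate cf/fl.
    return (tf + u * cf / fl) / (dl + u)


def build_weighting():
    # Imported lazily so lm_weight can be tried out without Whoosh installed.
    from whoosh.scoring import FunctionWeighting
    return FunctionWeighting(lm_weight)
```

The resulting object would then be passed as the `weighting` argument when opening a searcher, for example `ix.searcher(weighting=build_weighting())`, where `ix` stands for your own index.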
This is the language model code I wrote, but the results are poor.
from whoosh.scoring import WeightingModel, WeightLengthScorer, WeightScorer


def lm(tf, dl, cf, fl, u):
    # tf: number of times the term occurs in this document
    # dl: length of the document
    # cf: number of times the term occurs in the whole collection
    # fl: sum of the lengths of all documents
    # u: smoothing parameter
    # Same value as (tf/dl)*dl/(dl+u) + (1-dl/(dl+u))*(cf/fl),
    # rearranged so that dl == 0 does not divide by zero.
    return (tf + u * cf / fl) / (dl + u)


class LM(WeightingModel):
    # def __init__(self, u):
    #     self.u = 1000

    def scorer(self, searcher, fieldname, text, u=1000, qf=1):
        # Fields without length information fall back to plain term weight.
        if not searcher.schema[fieldname].scorable:
            return WeightScorer.for_(searcher, fieldname, text)
        return LMScorer(searcher, fieldname, text, u=u, qf=qf)

    # def scorer(self, searcher, fieldname, text, u=1000):
    #     # IDF is a global statistic, so get it from the top-level searcher
    #     parent = searcher.get_parent()  # Returns self if no parent
    #     self.cf = parent.weight(fieldname, text)
    #     self.fl = parent.field_length(fieldname)
    #     maxweight = searcher.term_info(fieldname, text).max_weight()
    #     return LMScorer(maxweight, idf)


class LMScorer(WeightLengthScorer):
    def __init__(self, searcher, fieldname, text, u, qf=1):
        # Collection statistics are global, so read them from the
        # top-level searcher.
        parent = searcher.get_parent()  # Returns self if no parent
        reader = parent.reader()
        term_info = reader.term_info(fieldname, text)
        self.cf = term_info.weight()
        self.fl = parent.field_length(fieldname)
        self.u = u
        # setup() stores the per-document length lookup and self._maxquality.
        self.setup(searcher, fieldname, text)

    def supports_block_quality(self):
        return True

    def _score(self, weight, length):
        return lm(weight, length, self.cf, self.fl, self.u)

    def max_quality(self):
        return self._maxquality

    def block_quality(self, matcher):
        # The original multiplied by self.idf, which this class never defines.
        # The highest weight with the shortest length bounds the block's score.
        return self._score(matcher.block_max_weight(),
                           matcher.block_min_length())
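To get a feel for what the formula does, here is a small standalone sketch that evaluates it for a few values of u. The function name `lm_score` and all the numbers (tf=3, dl=100, cf=50, fl=100000) are made up for illustration and do not come from a real index.

```python
def lm_score(tf, dl, cf, fl, u):
    # lam is the share of the score taken from the document itself;
    # the rest comes from the whole collection.
    lam = dl / (dl + u)
    doc_part = lam * (tf / dl) if dl else 0.0
    return doc_part + (1 - lam) * (cf / fl)


if __name__ == "__main__":
    # Hypothetical statistics: a term seen 3 times in a 100-token document
    # and 50 times in a 100000-token collection.
    for u in (10, 1000, 100000):
        print(u, lm_score(tf=3, dl=100, cf=50, fl=100000, u=u))
```

With a small u the score stays close to the document frequency tf/dl, and with a large u it moves toward the collection frequency cf/fl, so with u much larger than typical document lengths the per-document differences become small.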
Hi @BruceLee66, thanks for your sample code. I'm currently reading the paper on language model scoring (http://ciir.cs.umass.edu/pubfiles/ir-120.pdf). After that I will add an LM class to the master branch.
I used smoothing in the formula. The formula is in the code.
Hi @BruceLee66, could you tell me which paper your code is based on, or could you share an article with me? The general idea of your code is not very clear to me...
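Written out, the lm function in the code appears to compute the following, using the same symbols as its arguments (tf, dl, cf, fl, u). This is my reading of the code, and the second line is only an algebraic rearrangement of the first:

```latex
\begin{aligned}
P(t \mid d) &= \lambda \,\frac{tf}{dl} + (1 - \lambda)\,\frac{cf}{fl},
\qquad \lambda = \frac{dl}{dl + u} \\
&= \frac{tf + u \cdot \dfrac{cf}{fl}}{dl + u}
\end{aligned}
```

This has the shape usually called Dirichlet prior smoothing, with u playing the role of the prior weight.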
Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings
fwiw we use this weighting: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/webapps/tool_shed/search/repo_search.py#L36
I want to use a language model as the weighting model. What should I do?