whoosh-community / whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python.

If I want to change the weighting model, what should I do? #494

Open BruceLee66 opened 5 years ago

BruceLee66 commented 5 years ago

I want to use a language model as the weighting model. What should I do?

fortable1999 commented 5 years ago

Hi BruceLee66, thanks for your post. Whoosh's original author, Matt Chaput, is not in this organization, and nobody has been able to contact him. We are trying to keep this software working, but unfortunately we don't know much about its internals. Everyone has to read the code and learn from it. I will do my best to give you an answer, and if you would like to help us, we would be very grateful.

fortable1999 commented 5 years ago

Hi @BruceLee66, I have been reading the weighting model code and I think I may have an answer.

Currently you have two options:

  1. Use whoosh.scoring.FunctionWeighting and provide a custom weighting function.
  2. Implement a new weighting model by inheriting from the whoosh.scoring.WeightingModel class (the base class that FunctionWeighting itself extends).

You can read scoring.py to find some examples.
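
As a rough illustration of option 1, here is a minimal sketch. It assumes that FunctionWeighting calls the supplied function with (searcher, fieldname, text, matcher), which is how scoring.py documents it. The names relative_tf and make_weighting are hypothetical and exist only for this example. The function scores each posting as term frequency divided by field length, which is an unsmoothed estimate and not a full language model.

```python
def relative_tf(searcher, fieldname, text, matcher):
    """Score a posting as term frequency divided by document field length."""
    tf = matcher.weight()  # weight of the term in the current document
    # Length of this field in the current document; the last argument is the default
    dl = searcher.doc_field_length(matcher.id(), fieldname, 1)
    return tf / float(max(dl, 1))  # guard against a zero-length field


def make_weighting():
    # Hypothetical helper. Whoosh is imported here so that relative_tf
    # can be exercised on its own without Whoosh installed.
    from whoosh.scoring import FunctionWeighting
    return FunctionWeighting(relative_tf)
```

The resulting object would then be passed when opening a searcher, for example ix.searcher(weighting=make_weighting()), where ix is an already opened index.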

BruceLee66 commented 5 years ago

This is the language model code I wrote, but the results are poor.

from whoosh.scoring import WeightingModel, WeightLengthScorer, WeightScorer


def lm(tf, dl, cf, fl, u):
    # tf: number of times the term occurs in this document
    # dl: length of the document
    # cf: number of times the term occurs in the whole collection
    # fl: total length of all documents
    # u: smoothing parameter
    # Algebraically equal to (tf/dl)*dl/(dl+u) + (1 - dl/(dl+u))*(cf/fl),
    # but also defined when dl == 0.
    return (tf + float(u) * cf / fl) / (dl + u)


class LM(WeightingModel):
    def __init__(self, u=1000):
        # Whoosh calls scorer() without a u argument, so keep u on the model
        self.u = u

    def scorer(self, searcher, fieldname, text, qf=1):
        if not searcher.schema[fieldname].scorable:
            return WeightScorer.for_(searcher, fieldname, text)
        return LMScorer(searcher, fieldname, text, u=self.u, qf=qf)


class LMScorer(WeightLengthScorer):
    def __init__(self, searcher, fieldname, text, u, qf=1):
        # Collection statistics are global, so get them from the top-level searcher
        parent = searcher.get_parent()  # Returns self if no parent
        reader = parent.reader()
        term_info = reader.term_info(fieldname, text)
        self.cf = term_info.weight()
        self.fl = parent.field_length(fieldname)
        self.u = u
        # setup() prepares the document length lookup and sets self._maxquality
        self.setup(searcher, fieldname, text)

    def supports_block_quality(self):
        return True

    def _score(self, weight, length):
        return lm(weight, length, self.cf, self.fl, self.u)

    def max_quality(self):
        return self._maxquality

    def block_quality(self, matcher):
        # Upper bound for the block: highest weight with the shortest length
        return self._score(matcher.block_max_weight(),
                           matcher.block_min_length())

fortable1999 commented 5 years ago

Hi @BruceLee66, thanks for your sample code. I'm currently reading the paper on language model scoring (http://ciir.cs.umass.edu/pubfiles/ir-120.pdf). After that I will add an LM class to the master branch.

BruceLee66 commented 5 years ago

I used smoothing in the formula. The formula is in the code.
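
For readers following along, the interpolation in the code, (tf/dl)*dl/(dl+u) + (1 - dl/(dl+u))*(cf/fl), simplifies to (tf + u*cf/fl)/(dl+u), which is the usual Dirichlet-smoothed estimate of P(term | document). The pure-Python sketch below checks that equivalence and contrasts two ways of combining per-term values: adding the probabilities, and adding their logarithms (query likelihood). It is only a guess that the first is what happens when the scorer returns raw probabilities and per-term scores are summed; that has not been verified against this index. All function names and the toy statistics are hypothetical, and nothing here is wired into Whoosh.

```python
import math


def dirichlet_prob(tf, dl, cf, fl, mu=1000.0):
    """Dirichlet-smoothed P(term | document)."""
    return (tf + mu * cf / float(fl)) / (dl + mu)


def interpolated_prob(tf, dl, cf, fl, mu=1000.0):
    """The same estimate in interpolated form (requires dl > 0)."""
    lam = dl / float(dl + mu)
    return lam * (tf / float(dl)) + (1 - lam) * (cf / float(fl))


def summed_probs(term_stats, dl, fl, mu=1000.0):
    """Add P(term | document) over the query terms."""
    return sum(dirichlet_prob(tf, dl, cf, fl, mu) for tf, cf in term_stats)


def query_log_likelihood(term_stats, dl, fl, mu=1000.0):
    """Add log P(term | document) over the query terms (cf must be > 0)."""
    return sum(math.log(dirichlet_prob(tf, dl, cf, fl, mu))
               for tf, cf in term_stats)


# Toy two-term query; each entry is (tf in the document, cf in the collection).
FL = 1000000                            # total length of all documents
doc_both_terms = [(1, 100), (1, 100)]   # contains each query term once
doc_one_term = [(5, 100), (0, 100)]     # contains only the first term, 5 times

by_sum = summed_probs(doc_one_term, 100, FL) > summed_probs(doc_both_terms, 100, FL)
by_log = (query_log_likelihood(doc_both_terms, 100, FL)
          > query_log_likelihood(doc_one_term, 100, FL))
print(by_sum, by_log)
```

In this toy case the summed probabilities prefer the document that repeats one term, while the log-likelihood prefers the document that contains both terms. Whether that difference explains the poor results here is an open question.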

fortable1999 commented 5 years ago

Hi @BruceLee66, could you tell me which paper your code is based on, or share an article with me? I'm not very clear on the general idea behind your code...

BruceLee66 commented 5 years ago

Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings

martenson commented 5 years ago

FWIW, we use this weighting: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/webapps/tool_shed/search/repo_search.py#L36