wikimedia / search-highlighter

GitHub mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing)

how to affect fragment scoring (elasticsearch plugin) #16

Open linkwoman opened 9 years ago

linkwoman commented 9 years ago

Hi! Thanks for the plugin; I hope I can get it to work for my use case.

How do you affect fragment scoring in the elasticsearch plugin? I search my data for _organic compound_ and find that fragments (I'm using sentences) with _compound_ show up higher than fragments with _organic compound_ and that, in fact, _organic_ is not highlighted, but _compound_ is.

I'm looking for a way to get back sentences, and to ensure that both proximity and order affect fragment scoring such that the following would be the order in which the sentences would be returned:

  1. Organic compounds can also be classified or subdivided by the presence of heteroatoms.
  2. Organic chemistry is the science concerned with all aspects of organic compounds.
  3. Others state that if a molecule contains carbon―it is organic.
  4. Natural compounds refer to those that are produced by plants or animals.

(where I can choose to have organic compounds highlighted as a phrase or separate terms)
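To make the requested ordering concrete, here is a minimal sketch in plain Python. It is not the plugin's code or API: `score_fragment`, the prefix matching used as a stand-in for stemming, and all of the weights are assumptions chosen only for illustration. It rewards an in-order adjacent match of the query terms, gives earlier query terms more weight, and adds a small bonus for matches near the start of the sentence.

```python
import re


def tokenize(text):
    # Lowercased word tokens; punctuation (including "―") acts as a separator.
    return re.findall(r"\w+", text.lower())


def matches(token, term):
    # Crude stand-in for stemming (assumption): "compounds" matches "compound".
    return token.startswith(term)


def score_fragment(fragment, query):
    """Hypothetical scorer; the weights below are arbitrary illustration values."""
    tokens = tokenize(fragment)
    terms = tokenize(query)

    # Coverage: each query term counts once, earlier query terms count more.
    coverage = 0.0
    first_hit = None
    for i, term in enumerate(terms):
        positions = [p for p, tok in enumerate(tokens) if matches(tok, term)]
        if positions:
            coverage += 1.0 / (1 + i)
            if first_hit is None or positions[0] < first_hit:
                first_hit = positions[0]
    if first_hit is None:
        return 0.0

    # Order and proximity: look for all terms adjacent and in query order.
    phrase_start = None
    for start in range(len(tokens) - len(terms) + 1):
        if all(matches(tokens[start + j], t) for j, t in enumerate(terms)):
            phrase_start = start
            break
    phrase_bonus = 2.0 if phrase_start is not None and len(terms) > 1 else 0.0

    # Small bonus for a match near the beginning of the fragment.
    anchor = phrase_start if phrase_start is not None else first_hit
    return coverage + phrase_bonus + 0.1 / (1 + anchor)


sentences = [
    "Natural compounds refer to those that are produced by plants or animals.",
    "Others state that if a molecule contains carbon―it is organic.",
    "Organic chemistry is the science concerned with all aspects of organic compounds.",
    "Organic compounds can also be classified or subdivided by the presence of heteroatoms.",
]
ranked = sorted(sentences, key=lambda s: score_fragment(s, "organic compound"), reverse=True)
for sentence in ranked:
    print(sentence)
```

With these illustrative weights, the four sentences come back in the order listed above. Different weights would give a different order, so this describes the desired behaviour and not something the plugin is known to do.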

You mention fragment_weigher as a way to customize fragment scoring but I can't seem to get it to do anything (fragments returned are always the same, and in the same order). Is this the parameter I should be looking at?

If you could point me to some examples that would do what I'm trying to do, I'd sure appreciate it.

Thank you!

nik9000 commented 9 years ago

> I search my data for _organic compound_ and find that fragments (I'm using sentences) with _compound_ show up higher than fragments with _organic compound_ and that, in fact, _organic_ is not highlighted, but _compound_ is.

Ew. That sounds broken. Can you reproduce it on a clean index using curl commands?

> If you could point me to some examples that would do what I'm trying to do, I'd sure appreciate it.

The docs don't have very detailed examples, sorry. Unfortunately, it might be simpler to look at the code: https://github.com/wikimedia/search-highlighter/blob/master/experimental-highlighter-elasticsearch-plugin/src/main/java/org/elasticsearch/search/highlight/ExperimentalHighlighter.java#L471

There aren't that many free parameters, and I haven't rigged it up for extension via another Elasticsearch plugin. It is pretty simple to implement a new weigher, though. It looks like what you want is for results that are closer to the beginning of the snippet to be worth more. That'd be pretty simple to implement but doesn't exist yet. You could also do more complex NLP analysis on the sentences, and that isn't implemented either.
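As a rough illustration of the "earlier hits are worth more" idea, here is a language-neutral sketch in Python. It is not the plugin's weigher interface: the function name, the `(start_offset, weight)` hit representation, and the linear decay down to half weight are all assumptions made for the example.

```python
def position_decayed_weight(hits, fragment_start, fragment_end):
    """Sum hit weights, scaling each down the later it starts in the fragment.

    hits: iterable of (start_offset, weight) pairs (hypothetical representation).
    A hit at the very start keeps its full weight; a hit near the end keeps
    about half of it. The 0.5 decay factor is an arbitrary illustration value.
    """
    length = max(fragment_end - fragment_start, 1)
    total = 0.0
    for start_offset, weight in hits:
        if not fragment_start <= start_offset < fragment_end:
            continue  # ignore hits that fall outside this fragment
        relative = (start_offset - fragment_start) / length  # 0.0 at the start
        total += weight * (1.0 - 0.5 * relative)
    return total


early = position_decayed_weight([(0, 1.0)], 0, 100)
late = position_decayed_weight([(50, 1.0)], 0, 100)
print(early, late)
```

A real implementation would need to live inside the highlighter and use whatever hit and segment types it exposes; the sketch only shows the shape of the calculation.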

If you want early results weighed more heavily, I could implement it, but I don't know when I'd get a chance. If you'd like to give it a shot, I'll certainly review it, though you'd have to go through this first: https://www.mediawiki.org/wiki/Developer_access