weixsong / elasticlunr.js

Based on lunr.js, but more flexible and customized.
http://elasticlunr.com
MIT License

Diacritic free search #46

Open selimober opened 7 years ago

selimober commented 7 years ago

Hi,

Could you give me some pointers on indexing and searching words while ignoring diacritics? For example, I want Gödel and Godel to match, same as Şarap and Sarap.

And thanks for this great library and the documentation.

selimober commented 7 years ago

I don't know if it's the best approach, but for those who have a similar question, here is what I've come up with.

Install fold-to-ascii:

npm install fold-to-ascii --save

In your elasticlunr setup:

import elasticlunr from 'elasticlunr'
import asciiFolder from 'fold-to-ascii'

const replaceDiacritics = token => asciiFolder.fold(token)

elasticlunr.Pipeline.registerFunction(replaceDiacritics, "replaceDiacritics")

const index = elasticlunr(function() {
    // .. fields and ref setup
    this.pipeline.after(elasticlunr.trimmer, replaceDiacritics)
})

Note: if you're using lunr-languages, pass that language package's trimmer to after instead, e.g. this.pipeline.after(elasticlunr.tr.trimmer, ...)

andriichernenko commented 6 years ago

@selimober I tried your approach, but it seems incomplete. Normalizing all characters in the search query to ASCII means that, for example, searching for Gödel matches Godel, but Godel does not match Gödel. Or am I missing something?

selimober commented 6 years ago

Hi @andriichernenko. Unfortunately, I no longer have a system in place to test this, but as far as I remember, both cases you mentioned used to work.

And I would expect them to work, check here: http://elasticlunr.com/docs/pipeline.js.html

elasticlunr.Pipelines maintain an ordered list of functions to be applied to both documents tokens and query tokens.

So pipeline functions are applied to both the document and the query. At least, that's my understanding.

andriichernenko commented 6 years ago

@selimober I suppose you mean these functions are applied to the documents when the search index is built, not to an existing index, right? If so, this is probably the problem in my case.

selimober commented 6 years ago

@andriichernenko No, I didn't mean that, but that would also be a valid cause for your problem :) What I meant is that these functions are applied both to your query and to your documents. If you search for Gödel, the query is transformed into Godel. The same happens to your documents while you're adding them to the index, so if you add a document which has Gödel in it, a query with Godel should match it.

As I said, I don't have any means to test what I'm talking about, so take this with a grain of salt.
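To make the two explanations above concrete, here is a minimal sketch of the idea. It is a toy inverted index, not elasticlunr's implementation, and `foldDiacritics`, `buildIndex` and `search` are hypothetical names made up for illustration. It shows that when the same fold step runs at index time and at query time, Gödel and Godel match in both directions, and that an index built before the fold step was added still misses the accented document.

```javascript
// Toy model only: this is NOT elasticlunr's implementation.
// Hypothetical helper: decompose, drop combining marks, lowercase.
const foldDiacritics = token =>
  token.normalize('NFD').replace(/\p{Diacritic}/gu, '').toLowerCase()

// Build a tiny inverted index: processed token -> set of document ids.
function buildIndex(docs, pipeline) {
  const index = new Map()
  for (const { id, body } of docs) {
    for (const raw of body.split(/\s+/)) {
      const token = pipeline(raw)
      if (!index.has(token)) index.set(token, new Set())
      index.get(token).add(id)
    }
  }
  return index
}

// The query token goes through a pipeline before the lookup.
function search(index, query, pipeline) {
  return [...(index.get(pipeline(query)) || [])]
}

const docs = [
  { id: 1, body: 'G\u00f6del incompleteness' }, // "Gödel incompleteness"
  { id: 2, body: 'Godel numbering' },
]

// Same fold step on both sides: both spellings find both documents.
const folded = buildIndex(docs, foldDiacritics)
const resultsAccented = search(folded, 'G\u00f6del', foldDiacritics) // [1, 2]
const resultsPlain = search(folded, 'Godel', foldDiacritics)         // [1, 2]

// Index built WITHOUT the fold step, queried WITH it:
// the key "gödel" is never reached, so document 1 is missed.
const stale = buildIndex(docs, t => t.toLowerCase())
const resultsStale = search(stale, 'Godel', foldDiacritics)          // [2]
```

If this matches what happens in a real setup, the practical consequence is that an index serialized before the function was added to the pipeline has to be rebuilt.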

hdoro commented 4 years ago

Thank you for this, Selim!

In case others are trying this to no avail: my problem was that I added the function that removes diacritics (accents) after all the language-specific functions I was using for localized search. It turns out I just had to add it before anything else, and it now works wonderfully! Here's the portion of the code where I do this:

// I have a lunrPt function that takes the elasticlunr import and extends
// it with language-specific trimmers and stop words.
function lunrPt(lunr) {
  lunrStemmer(lunr)

  /* register specific locale function */
  lunr.pt = function () {
    this.pipeline.reset()
    this.pipeline.add(removeAccents) // notice this comes before the rest!
    this.pipeline.add(lunr.pt.trimmer, lunr.pt.stopWordFilter, lunr.pt.stemmer)
  }
  // ...
}

Hope this helps others 😄

chichilatte commented 9 months ago

I'm using elasticlunr@0.9.5 and was having real difficulty finding a working example of how you'd tweak the pipeline. I finally got it working like this...

import lunr from 'elasticlunr'

const indexes = {
  // I happened to need a few languages
  en: createSearchIndex("en"),
  fr: createSearchIndex("fr"),
}

function createSearchIndex(lang) {
  const index = lunr()
  lunr.Pipeline.registerFunction(removeAccents, "removeAccents")
  // The pipeline lives on the index instance
  index.pipeline.before(lunr.trimmer, removeAccents)

  // You'd populate your index here
  // index.addField(`title`)
  // index.addField(`description`)
  // ...

  return index;
}

/**
 * Take a string like "cùchullaìn" and return "cuchullain"
 */
function removeAccents(token) {
  return token.normalize("NFD").replace(/\p{Diacritic}/gu, "")
}

Important note

Put your removeAccents function before the trimmer. The trimmer has a bug where it removes accented letters if they're at the start or end of the token!
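The ordering issue can be reproduced without the library. The sketch below assumes the default trimmer strips leading and trailing `\W` characters, as lunr.js's trimmer does; `trimmer` here is a local stand-in, not the library function, and `runPipeline` is a hypothetical helper. Because `\W` only treats ASCII letters, digits and underscore as word characters, an accented first letter is removed if the trimmer runs first.

```javascript
// Local stand-in for the default trimmer (assumption: it strips
// leading and trailing \W runs, as lunr.js's trimmer does).
const trimmer = token => token.replace(/^\W+/, '').replace(/\W+$/, '')

// Same approach as removeAccents above: decompose, then drop the marks.
const removeAccents = token =>
  token.normalize('NFD').replace(/\p{Diacritic}/gu, '')

// Hypothetical helper: apply each function to the token in order.
const runPipeline = (fns, token) => fns.reduce((t, fn) => fn(t), token)

const word = '\u015earap' // "Şarap"

// Accents removed first: the trimmer only ever sees ASCII letters.
const foldFirst = runPipeline([removeAccents, trimmer], word) // "Sarap"

// Trimmer first: "Ş" counts as a non-word character and is stripped.
const trimFirst = runPipeline([trimmer, removeAccents], word) // "arap"
```

Note that this NFD-based function only handles letters that decompose into a base letter plus a combining mark. Letters such as ø or ı have no such decomposition and pass through unchanged.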