selimober opened this issue 7 years ago
I don't know if it's the best approach, but for those who have a similar question, here is what I've come up with.
Install fold-to-ascii:

```shell
npm install fold-to-ascii --save
```

In your elasticlunr setup:

```javascript
import asciiFolder from 'fold-to-ascii'

const replaceDiacritics = token => asciiFolder.fold(token)
elasticlunr.Pipeline.registerFunction(replaceDiacritics, "replaceDiacritics")

const index = elasticlunr(function() {
  // ... fields and ref setup
  this.pipeline.after(elasticlunr.trimmer, replaceDiacritics)
})
```
Note: if you're using lunr-languages, you need to pass that language package's functions to `after`, like `this.pipeline.after(elasticlunr.tr.trimmer, ...)`.
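For reference, here is a rough stand-in for what `fold()` does to Latin diacritics — a sketch built on `String.prototype.normalize`, not the fold-to-ascii package itself (the real package covers many more mappings, such as ligatures):

```javascript
// NFD decomposes "ö" into "o" + a combining diaeresis; the Unicode
// property escape then drops the combining marks, leaving plain ASCII.
const foldToAscii = token =>
  token.normalize("NFD").replace(/\p{Diacritic}/gu, "")

foldToAscii("Gödel") // "Godel"
foldToAscii("Şarap") // "Sarap"
```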
@selimober I tried your approach, but it seems incomplete. Normalizing all characters in the search query to ASCII means that, for example, searching for `Gödel` matches `Godel`, but `Godel` does not match `Gödel`. Or am I missing something?
Hi @andriichernenko. Unfortunately, I no longer have a system in place to test this, but as far as I remember, both cases you mentioned used to work.
And I would expect them to work, check here: http://elasticlunr.com/docs/pipeline.js.html
> elasticlunr.Pipelines maintain an ordered list of functions to be applied to both documents tokens and query tokens.

So pipeline functions are applied to both the documents and the query. That's my understanding, at least.
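A toy illustration of that point outside elasticlunr (the `normalize` helper here is hypothetical, not the library's API): because the same function runs on document tokens at index time and on query tokens at search time, both spellings collapse to the same stored term.

```javascript
// One normalizer, applied on both sides of the match.
const normalize = t =>
  t.normalize("NFD").replace(/\p{Diacritic}/gu, "").toLowerCase()

const indexedTerm = normalize("Gödel") // what the index stores: "godel"

normalize("Godel") === indexedTerm // true — "Godel" finds "Gödel"
normalize("Gödel") === indexedTerm // true — "Gödel" still matches too
```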
@selimober I suppose you mean these functions are applied to the documents when the search index is built, not to an existing index, right? If so, this is probably the problem in my case.
@andriichernenko No, I didn't mean that, but it would also be a valid cause for your problem :)

What I meant is that these functions are applied both to your query (if you search for `Gödel`, it is transformed into `Godel`) and to your documents as you add them to the index. So if you add a document that has `Gödel` in it, a query for `Godel` should match it.

As I said, I don't have any means to test what I'm talking about, so take it with a grain of salt.
Thank you for this, Selim!
In case others are trying this to no avail: my problem was due to adding the function that removes diacritics (accents) after all the language-specific functions I was using for localized search. It turns out I just had to add it before everything else, and it now works wonderfully! Here's the portion of the code where I do this:
```javascript
// I have a lunrPt function that takes the elasticlunr import and
// extends it with language-specific trimmers and stop words.
function lunrPt(lunr) {
  lunrStemmer(lunr)
  /* register the locale-specific function */
  lunr.pt = function () {
    this.pipeline.reset()
    this.pipeline.add(removeAccents) // notice this comes before the rest!
    this.pipeline.add(lunr.pt.trimmer, lunr.pt.stopWordFilter, lunr.pt.stemmer)
  }
  // ...
}
```
Hope this helps others 😄
I'm using elasticlunr@0.9.5 and was having real difficulty finding a working example of how to tweak the pipeline. I finally got it working like this:
```javascript
import lunr from 'elasticlunr'

const indexes = {
  // I happened to need a few languages
  en: createSearchIndex("en"),
  fr: createSearchIndex("fr"),
}

function createSearchIndex(lang) {
  const index = lunr()
  lunr.Pipeline.registerFunction(removeAccents, "removeAccents")
  index.pipeline.before(lunr.trimmer, removeAccents)
  // You'd populate your index here
  // index.addField(`title`)
  // index.addField(`description`)
  // ...
  return index
}

/**
 * Take a string like "cùchullaìn" and return "cuchullain"
 */
function removeAccents(token) {
  return token.normalize("NFD").replace(/\p{Diacritic}/gu, "")
}
```
Put your `removeAccents` function before the trimmer. The trimmer has a bug where it removes accented letters if they're at the start or end of the token!
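A quick sketch of why the order matters. The `trimmer` below mimics the `\W`-based regexes the stock trimmer uses (check your version's source), and `runPipeline` is just an illustrative reduce, not elasticlunr's API — without the `u` flag, `\W` treats "é" as a non-word character and strips it from token edges:

```javascript
const removeAccents = t => t.normalize("NFD").replace(/\p{Diacritic}/gu, "")
const trimmer = t => t.replace(/^\W+/, "").replace(/\W+$/, "")

// Apply a list of pipeline functions to a token, left to right.
const runPipeline = (fns, token) => fns.reduce((t, f) => f(t), token)

runPipeline([trimmer, removeAccents], "émile") // "mile"  — leading "é" lost
runPipeline([removeAccents, trimmer], "émile") // "emile" — accent folded first
```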
Hi,

Could you give me some pointers on indexing/searching words while ignoring diacritics? For example, I want `Gödel` and `Godel` to match, and likewise `Şarap` and `Sarap`. And thanks for this great library and the documentation.