Open flapka opened 3 years ago
@flapka Your observation is consistent with Solr's configuration, in which diacritics are not normalized. Adding an ICU tokenizer may produce the desired behavior. The smaller hit count might be due to a smaller number of fields being searched relative to Orbis/Quicksearch. Here's the current config:
All: please list more diacritics here for testing.
I wonder if this issue is still (?) on our to-do list.
Example problem (hat tip to Julian): a researcher looking for a Schöner globe in our collection (we have two) will not find them using the search terms schoner and globe, and that's less than ideal.
Does the issue remain a weeks-long Solr search engine optimization project?
Thanks for reopening this issue, @flapka and @jlee-a. I think it is still an issue. Searching on raisonne returns 17 results that are different from the results returned by searching on raisonné.
I think we do want this. But it will still take significant time to find, create, or deploy the right filter and test it.
@BaylaArietta Could you help me find terms (probably in titles) that have diacritics for P&D objects? I will add the ones I find for P&S here. @flapka @KraigBinkowski @jessquag Are there other cases in your collections that you could share too?
Dug deeper. The ASCIIFolderFilterFactory is in place now. However, it comes after the PorterStemFilter. The PorterStemFilter turns raisonne into raisonn, but leaves raisonné as is. So ultimately raisonné becomes raisonne, while raisonne becomes raisonn (without the e). "Schoner" doesn't have a stemming issue, so it works as is.
I could put the ASCIIFolderFilter higher in the chain, so that both raisonné and raisonne become raisonn. But that produces the 14 REF 3 RB result regardless of diacritic, which I don't think we want. I think we want everything to normalize to raisonne (with the e), giving 17 REF as the result.
I think the best way to do that is to remove the PorterStemFilter entirely, since the stemming just conflates things. I can see Porter stemming in a Google-like search (when we want, for example, "boxes" to find things like "box"), but for art searches "raisonne" is different from "raisonn".
@edgartdata, as a francophone, does that make sense?
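To make the filter-order effect concrete, here is a minimal Python sketch. It does not use Solr: `ascii_fold` and `toy_stem` are hypothetical stand-ins written for this illustration, and `toy_stem` only mimics the single behavior described above (dropping a trailing ASCII "e"), not the real Porter algorithm.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    """Rough stand-in for ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Toy stand-in for the stemmer (assumption): drop one trailing ASCII 'e'."""
    return token[:-1] if token.endswith("e") else token


def run_chain(token: str, filters) -> str:
    """Apply the filters to the token in order, like an analyzer chain."""
    for apply_filter in filters:
        token = apply_filter(token)
    return token


chains = {
    "stem, then fold": [toy_stem, ascii_fold],
    "fold, then stem": [ascii_fold, toy_stem],
    "fold only": [ascii_fold],
}

for name, filters in chains.items():
    results = {w: run_chain(w, filters) for w in ("raisonn\u00e9", "raisonne")}
    print(f"{name}: {results}")
```

Under these assumptions, only the "stem, then fold" order sends the two spellings to different index terms (raisonne vs. raisonn); the other two orders make them agree, and "fold only" keeps the final e.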
I agree with you Eric: we want everything to normalize to raisonne (with the e) so that searching on 'raisonné' and 'raisonne' returns the same results.
I have searched on 'Mädchen' and 'Madchen' and both return the same print. I have searched on 'aumônière' and 'aumoniere' and both return the same print. I have searched on 'l'âge' and 'l'age' and both return the same 3 prints https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=l%27age
So it seems that adding ASCIIFolderFilterFactory works for quite a few diacritics (but I encourage others to test this as well) except for 'raisonné'. Unfortunately 'raisonné' is a critical word for our search to get right because it is used so frequently in the museum field.
Another set of examples of diacritics that I found:
So here, 1 and 2 bring up the same results. However, search term 3 brings up different results (some results that 1 and 2 missed, and it misses one that 1 and 2 found). This is because search term 3, although it looks almost the same as search term 2, is actually the OCLC-formatted "ö", which adds an extra character in MARC, not the standard character with diacritic "ö" (search 2) or the regular "o" (search 1). I had to copy the 3rd example from OCLC Connexion. What I believe we want is for each of the 3 searches to bring up all the results totaled across the 3 searches, which should be 8 results in total.
My guess is that Schöner (Scho%CC%88ner) defied the asciifilter and got indexed as Scho%CC%88ner. The problem is that Schöner you put in search is always (Sch%C3%B6ner) which can get filtered to the diacritically normalized (Schoner), but will never map to (Scho%CC%88ner). You have to explicitly put "Scho%CC%88ner" in the q field of the URL to match, even putting "Scho%CC%88ner" in the query box doesn't work as % becomes %25.
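The two percent-encoded spellings above can be reproduced outside Solr. This is a small Python sketch using only the standard library; it assumes the two forms correspond to the Unicode composed (NFC) and decomposed (NFD) spellings of "Schöner", which is what the encodings quoted above suggest.

```python
import unicodedata
from urllib.parse import quote

# Composed form: "ö" is the single code point U+00F6.
composed = unicodedata.normalize("NFC", "Sch\u00f6ner")
# Decomposed form: plain "o" followed by the combining diaeresis U+0308.
decomposed = unicodedata.normalize("NFD", composed)

print(quote(composed))         # Sch%C3%B6ner
print(quote(decomposed))       # Scho%CC%88ner
print(composed == decomposed)  # False: different code point sequences

# Typing an already-encoded string into a query box encodes the "%" again.
print(quote("Scho%CC%88ner"))  # Scho%25CC%2588ner
```

The last line shows why pasting "Scho%CC%88ner" into the search box does not help: each % is itself escaped to %25.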
I cannot think of a solution to this.
But otherwise I do think the stemmer should be removed for the 'raisonne' case.
Also I'm getting different results depending on browser:
Using the Solr analyzer in Safari, Schöner always breaks down to the bytes [53 63 68 c3 b6 6e 65 72], no matter which form I enter.
In Chrome I can get both, depending on whether Schöner is derived from q=Scho%CC%88ner or q=Sch%C3%B6ner: [53 63 68 6f cc 88 6e 65 72] or [53 63 68 c3 b6 6e 65 72].
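The two byte sequences can be checked with a short Python sketch. The helper `utf8_hex` is a hypothetical name introduced for this example. The final line assumes the two spellings are related by Unicode normalization and shows that NFC-normalizing the decomposed form yields the composed one; whether applying such a step at index or query time suits this Solr setup is untested here.

```python
import unicodedata


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


composed = "Sch\u00f6ner"     # single code point U+00F6 for the o with diaeresis
decomposed = "Scho\u0308ner"  # "o" followed by combining diaeresis U+0308

print(utf8_hex(composed))     # 53 63 68 c3 b6 6e 65 72
print(utf8_hex(decomposed))   # 53 63 68 6f cc 88 6e 65 72

# After NFC normalization the two spellings compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```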
I have found more examples. Based on what I have seen, I think that it is likely that all diacritics are affected--each of them having the three forms.
I have found that raisonné also carries 3 different forms, similar to what we see with Schöner:
Another example is dédié, which I chose because I wanted to find an example with two diacritics. The first 3 examples follow the same pattern of searches as above. The fourth is an additional variant because of the double diacritics:
Another example, frères:
I find oddities when I search on terms that contain diacritics.
Take "raisonné" for example. Orbis says the term (as raisonné or raisonne) appears 192 times in YCBA library collections; Quicksearch says 302 times -- this inconsistency is itself vexing.
In Blacklight, a search on raisonné yields 13 hits: https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonn%C3%A9+
A search on raisonne yields 19 hits: https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonne
I believe there is no overlap in the two sets of results.
Questions: