Searching and diacritics

flapka commented 3 years ago

I find oddities when I search on terms that contain diacritics.

Take "raisonné" for example. Orbis says the term (as raisonné or raisonne) appears 192 times in YCBA library collections; Quicksearch says 302 times -- this inconsistency is itself vexing.

In Blacklight, a search on raisonné yields 13 hits: https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonn%C3%A9+

A search on raisonne yields 19 hits: https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonne

I believe there is no overlap in the two sets of results.

Questions:

1. We want the Blacklight search to be diacritic-flexible, right? If so, is it much trouble to implement the needed change?
2. Even when we search using the term with a diacritic (raisonné), the results are fewer than we should have. I wonder if there's an additional shortcoming. Investigate after addressing question 1?

yulgit1 commented 3 years ago

@flapka Your observation is consistent with Solr's configuration: diacritics are not normalized. Adding an ICU Tokenizer may produce the desired behavior. As for the smaller hit count, it might be due to a smaller number of fields being searched relative to Orbis/Quicksearch. Here's the current config:

author_txt^10 title_txt^4 topic_txt^4 publishDate_txt format_txt physical_txt description_txt credit_line_txt callnumber_txt type_txt collection_txt geographic_txt topic_subjectActor_txt^3 title_alt_txt publisher_txt resourceURL_txt cartographic_detail_txt marc_contents_txt form_genre_txt author_additional_txt exhibition_history_txt^2 curatorial_comment_txt^2 curatorial_comment_auth_txt^2

It's a little unwieldy to simply tweak and test. This could really become a dedicated weeks-long Solr search engine optimization project.

edgartdata commented 3 years ago

All: please list more diacritics here for testing.

flapka commented 1 year ago

I wonder if this issue is still (?) on our to-do list.

Example problem (hat tip to Julian): a researcher looking for a Schöner globe in our collection (we have two) will not find them using the search terms schoner and globe, and that's less than ideal.

Does the issue remain a weeks-long solr search engine optimization project?

edgartdata commented 1 year ago

Thanks for reopening this issue, @flapka and @jlee-a. I think it is still an issue. Searching on raisonne returns 17 results that are different from the results returned by searching on raisonné.

yulgit1 commented 1 year ago

I think we do want this. But it will still take significant time to find/create/deploy the right filter and test.

edgartdata commented 1 year ago

@BaylaArietta Could you help me find terms (probably in titles) that have diacritics for P&D objects? I will add the ones I find for P&S here. @flapka @KraigBinkowski @jessquag Are there other cases in your collections that you could share too?

yulgit1 commented 1 year ago

Dug deeper. The ASCIIFoldingFilterFactory is in place now. However, it comes after the PorterStemFilter. The PorterStemFilter turns raisonne into raisonn, but leaves raisonné as is. So ultimately raisonné becomes raisonne, while raisonne becomes raisonn (w/o the e). "Schoner" doesn't have a stemming issue, so it works as is. In any case, putting the ASCIIFoldingFilter higher in the chain makes both raisonné and raisonne become raisonn. But that results in the 14 REF 3 RB result regardless of diacritic, which I don't think we want. I think we want everything to normalize to raisonne (with the e), resulting in 17 REF as the result. I think the best way to do that is to remove the PorterStemFilter entirely, since the stemming just conflates things. I can see Porter stemming in a Google-like search (when we want, for example, "boxes" to find things like "box"), but for art searches "raisonne" is different from "raisonn".

@edgartdata, as a francophone, does this make sense?
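To make the ordering effect above concrete, here is a minimal Python sketch. It is not the Solr analysis chain: `toy_stem` is a hypothetical stand-in that only imitates the one Porter behavior described here (dropping a trailing ASCII "e"), and `fold_ascii` only approximates ASCII folding by stripping combining marks.

```python
import unicodedata


def fold_ascii(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for the stemmer: strips one trailing ASCII 'e'."""
    return token[:-1] if token.endswith("e") else token


# Three filter orders discussed in this comment.
CHAINS = {
    "stem_then_fold": [toy_stem, fold_ascii],  # order described as current
    "fold_then_stem": [fold_ascii, toy_stem],  # folding moved higher
    "fold_only": [fold_ascii],                 # stemmer removed
}


def analyze(term: str) -> dict:
    """Run one term through each chain and return the final token per chain."""
    results = {}
    for name, filters in CHAINS.items():
        token = term
        for apply_filter in filters:
            token = apply_filter(token)
        results[name] = token
    return results


for term in ("raisonn\u00e9", "raisonne"):
    print(term, analyze(term))
```

In this sketch only the `stem_then_fold` order sends the two spellings to different tokens; the other two orders agree with each other but differ in whether the final "e" survives.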

edgartdata commented 1 year ago

I agree with you Eric: we want everything to normalize to raisonne (with the e) so that searching on 'raisonné' and 'raisonne' returns the same results.

I have searched on 'Mädchen' and 'Madchen' and both return the same print. I have searched on 'aumônière' and 'aumoniere' and both return the same print. I have searched on 'l'âge' and 'l'age' and both return the same 3 prints https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=l%27age

So it seems that adding ASCIIFoldingFilterFactory works for quite a few diacritics (but I encourage others to test this as well), except for 'raisonné'. Unfortunately, 'raisonné' is a critical word for our search to get right because it is used so frequently in the museum field.

jlee-a commented 1 year ago

All list here more diacritics for testing.

Another set of examples for diacritic that I found:

So here, 1 and 2 bring up the same results. However, search term 3 brings up different results (some results that 1 and 2 missed, and it misses one that 1 and 2 found). This is because search term 3, although it looks almost the same as search term 2, is actually the OCLC-formatted "ö", which adds an extra character in MARC, not the standard character with diacritic "ö" (search 2) or the regular "o" (search 1). I had to copy the 3rd example from OCLC Connexion. What I believe we want is for each of the 3 searches to bring up all the results totaled across the 3 searches, which should be 8 results total.

yulgit1 commented 1 year ago

My guess is that Schöner (Scho%CC%88ner) defied the asciifilter and got indexed as Scho%CC%88ner. The problem is that the Schöner you type into the search box is always sent as Sch%C3%B6ner, which can get filtered to the diacritically normalized Schoner, but will never map to Scho%CC%88ner. You have to explicitly put "Scho%CC%88ner" in the q field of the URL to match. Even putting "Scho%CC%88ner" in the query box doesn't work, as % becomes %25.
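For anyone who wants to reproduce the two encodings, here is a small Python sketch. It only illustrates the percent-encoding side (it does not touch Solr), and it assumes the two forms are the precomposed "ö" (U+00F6) and "o" followed by a combining diaeresis (U+0308), which is what the %C3%B6 and %CC%88 byte pairs correspond to.

```python
import unicodedata
from urllib.parse import quote

# Same visible word, two different code point sequences.
precomposed = unicodedata.normalize("NFC", "Sch\u00f6ner")  # single U+00F6
decomposed = unicodedata.normalize("NFD", "Sch\u00f6ner")   # 'o' + U+0308


def code_points(text: str) -> list:
    """List the code points of a string, e.g. ['U+0053', ...]."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(precomposed))
print(code_points(decomposed))

# Percent-encoding of each form, as it would appear in the q parameter.
print(quote(precomposed))  # Sch%C3%B6ner
print(quote(decomposed))   # Scho%CC%88ner

# Typing the already-encoded text into a form field encodes the '%' again.
typed_literally = "Scho%CC%88ner"
print(quote(typed_literally))  # Scho%25CC%2588ner
```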

I cannot think of a solution to this.

But otherwise I do think the stemmer should be removed for the 'raisonne' case.

yulgit1 commented 1 year ago

Also I'm getting different results depending on browser:

Using the Solr analyzer in Safari, Schöner breaks down to the same bytes no matter what: [53 63 68 c3 b6 6e 65 72]

In Chrome I can get both, depending on whether Schöner is derived from q=Scho%CC%88ner or q=Sch%C3%B6ner: [53 63 68 6f cc 88 6e 65 72] or [53 63 68 c3 b6 6e 65 72]
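As a quick check on those two byte sequences, the Python sketch below decodes them and compares them under Unicode normalization. It only shows that the two sequences are canonically equivalent text; whether the Solr chain or the browser applies any such normalization is not established here.

```python
import unicodedata

# Byte sequences as reported by the analyzer above.
decomposed_bytes = bytes.fromhex("53 63 68 6f cc 88 6e 65 72")
precomposed_bytes = bytes.fromhex("53 63 68 c3 b6 6e 65 72")

decomposed_text = decomposed_bytes.decode("utf-8")
precomposed_text = precomposed_bytes.decode("utf-8")

# As raw strings they differ, so an exact token match fails.
print(decomposed_text == precomposed_text)

# After normalizing both to the same form (NFC here), they compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", decomposed_text)
    == unicodedata.normalize("NFC", precomposed_text)
)
print(same_after_nfc)
```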

jlee-a commented 1 year ago

I have found more examples. Based on what I have seen, I think that it is likely that all diacritics are affected--each of them having the three forms.

I have found that raisonné also carries 3 different forms, similar to what we see with Schöner:

1. https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonn%C3%A9
2. https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonne
3. https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=raisonne%CC%81

1 brings up 17 results, 2 brings up 18 results, and 3 brings up a stunning 282 results. I think that 3 is the one that is not working with the asciifilter. It sounds like this specific word would be fixed by stemming (?) but I believe the same issue will apply for diacritics in the middle of words, such as the following examples.

Another example is dédié, which I chose because I wanted to find an example with two diacritics. The first 3 examples follow the same pattern of searches as above. The fourth is an additional variant because of the double diacritics:

Another example, frères:

1. https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=freres
2. https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=fr%C3%A8res
3. https://collections.britishart.yale.edu/?utf8=%E2%9C%93&search_field=all_fields&q=fre%CC%80res

In this case, 1 and 2 bring exactly the same results, while 3 results in mostly non-overlapping results with 1 and 2.
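Since the same three forms keep coming up, here is a small Python sketch that generates the three test URLs for any term, so others can check more words. The base URL is copied from the links in this thread; the function names are my own, and "plain" is an approximation that simply drops combining marks.

```python
import unicodedata
from urllib.parse import quote

# Base search URL copied from the links in this thread.
BASE = (
    "https://collections.britishart.yale.edu/"
    "?utf8=%E2%9C%93&search_field=all_fields&q="
)


def variants(term: str) -> dict:
    """Return the plain, precomposed (NFC) and decomposed (NFD) spellings."""
    precomposed = unicodedata.normalize("NFC", term)
    decomposed = unicodedata.normalize("NFD", term)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return {"plain": plain, "precomposed": precomposed, "decomposed": decomposed}


def variant_urls(term: str) -> dict:
    """Build one search URL per spelling, percent-encoding the query term."""
    return {name: BASE + quote(form) for name, form in variants(term).items()}


for name, url in variant_urls("fr\u00e8res").items():
    print(name, url)
```

Running it for frères should reproduce the three q values listed above (freres, fr%C3%A8res, fre%CC%80res). It does not produce mixed variants such as the fourth dédié case.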

ycba-cia / blacklight-collections2

Searching and diacritics #273