ycba-cia / blacklight-collections2

Boosting #127

Closed edgartdata closed 5 years ago

edgartdata commented 5 years ago

@yulgit1 Would it be possible to not take into consideration the places of birth, death, activity and tour of artists in the boosting? For example, many objects by Walter Sickert show up when searching on Wiltshire. Since we are not displaying artists' places of birth, death, activity and tour in the detailed records (only in the XML), users are bound to be confused by these results.

yulgit1 commented 5 years ago

@edgartdata, currently it isn't boosting; the records appear in the results because of the fullrecord_txt field, which matches against the full LIDO XML. We could remove full_record_txt from the search (searching only on author, title, topic), which I think is the better solution in general. We could also add fields beyond author, title, topic, as many fields as we want; there is no problem specifying many fields.

edgartdata commented 5 years ago

@yulgit1 I think removing full_record_txt from the search (just do it on author, title,topic) is a great idea. I think this is what we were trying to say last time. This is bound to reduce the 'noisy'/irrelevant/odd results. @flapka @KraigBinkowski are you okay with this decision?

EdwardTown1 commented 5 years ago

@yulgit1 I'm noticing that data in the Bibliography might be inflating the search results. For example, when I search for Hogarth, many pictures not by the artist William Hogarth appear because they were included in a book about Hogarth and his peers. Of course I could narrow the search by finding Hogarth in the Creator facet, but sometimes these can be difficult to locate. Any thoughts? Thanks, Ed

yulgit1 commented 5 years ago

@EdwardTown1 your observation about Hogarth is exactly an instance of fullrecord_txt contributing too heavily to the relevance score. My suggestion is the same: remove fullrecord_txt, and add any fields beyond author, title, and topic that should carry some weight.

flapka commented 5 years ago

@yulgit1 @edgartdata @KraigBinkowski @EdwardTown1

Removing full_record_txt from the search would be considered a fatal flaw by me and my RB colleagues. If a term appears anywhere in the description of the object, it must be retrieved by the search.

Put another way: if we remove full_record_txt from the search, we would need to add a slew of other fields to the search in order to facilitate desirable results.

KraigBinkowski commented 5 years ago

If the bib records that are about Hogarth show up in a search for Hogarth as creator AND the results are prioritized so that the book results come last, I don't see what the issue is. The books that show up are about Hogarth as a creator, and that is one of the fundamental reasons for having joint access to the collections.

KraigBinkowski commented 5 years ago

I believe that we all agreed that overall relevance would be trumped by medium in results? In other words: all ptgs come first, arranged by relevance, print / drawings next, arranged by relevance, etc.
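
One possible way to express this ordering in Solr, sketched under assumptions: sort first on a numeric rank for medium, then on relevance score within each medium. The field name `medium_rank_i` is hypothetical (nothing in this thread names such a field) and would have to be populated at index time, for example 1 for paintings and 2 for prints and drawings.

```xml
<!-- Sketch only: medium_rank_i is a hypothetical integer field,
     not something this thread shows in the existing schema. -->
<str name="sort">medium_rank_i asc, score desc</str>
```

With a sort like this, boosts in `qf` would only affect the order inside each medium group, not which group comes first.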

yulgit1 commented 5 years ago

@flapka, that was my suggestion: configure the search with the slew of other fields. Perhaps before trying that we could try boosting title, creator, topic to something larger than the current '4', to diminish the effect of the '1' boost on fullrecord_txt.

@KraigBinkowski What you're saying in the 2 comments above is true; not sure what others feel about this...
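
A minimal sketch of the boost adjustment suggested in this comment: raise the boosts on author, title and topic while leaving the full-record field at 1. The value 10 is only a placeholder, and the full-record field name is written as it appears in the comment.

```xml
<!-- Sketch only: 10 is an illustrative value, not a tested setting. -->
<str name="qf">author_txt^10 title_txt^10 topic_txt^10 fullrecord_txt^1</str>
```

Because `qf` boosts in the dismax/edismax parsers only weight each field's contribution to the score, records that match solely in the full record would still be retrieved; they would just tend to rank lower.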

flapka commented 5 years ago

@yulgit1 Thanks. Yes, if the search searches every field that displays in a detailed record, I'd be okay with removing full_record_txt from the search.

yulgit1 commented 5 years ago

TODO: Use all fields from https://github.com/ycba-cia/blacklight-collections2/issues/18 for solrconfig default search fields.

yulgit1 commented 5 years ago

@flapka @KraigBinkowski @EdwardTown1 @edgartdata

2 questions: 1) Should citation_txt be included in the slew of fields? (seems like you might not want to match on author/title/etc words from citations) 2) author, title,topic are boosted 4, the rest are at 1, are there any other fields you want boosted higher than 1 and how much?

In any case here are the fields:

author_txt^4 title_txt^4 publishDate_txt format_txt physical_txt description_txt credit_line_txt callnumber_txt type_txt collection_txt topic_txt^4 geographic_txt topic_subjectActor_txt citation_txt ? title_alt_txt publisher_txt resourceURL_txt cartographic_detail_txt marc_contents_txt form_genre_txt author_additional_txt

flapka commented 5 years ago

I think I have nothing to add to questions 1 and 2 (1 doesn't apply to RB).

As for the fields, I think the list you give above would cover everything we need in RB -- if we add edition. Thanks!

edgartdata commented 5 years ago

@yulgit1 Does citation_txt only carry data from the art collection bibliography?

flapka commented 5 years ago

That's my understanding. Bibliographic citations in RB descriptions will, for the time being at least, map to description_txt.

yulgit1 commented 5 years ago

Records with citation_txt: lido: 4069 marc: 315

example lido: "Rosie Dias, Exhibiting Englishness, John Boydell's Shakespeare Gallery and the formation of a national aesthetic, Yale University Press, New Haven, 2013, pp. 200-201, fig. 85, N72.N38 D53 2013 (YCBA)Rosie Dias, Exhibiting Englishness, John Boydell's Shakespeare Gallery and the formation of a national aesthetic, Yale University Press, New Haven, 2013, pp. 200-201, fig. 85, N72.N38 D53 2013 (YCBA)"

example marc: "Henrietta Matilda Crompton, North & South Devonshire. Yale Center for British Art, Paul Mellon Fund."

flapka commented 5 years ago

The MARC example "Henrietta Matilda Crompton ..." is a different kind of data. It's "Cite as" information (i.e. how to cite this object) instead of a bibliographic citation of the type given in the lido example.

The MARC example (from field 524) shouldn't map to citation_txt.

edgartdata commented 5 years ago

@flapka so RB does not want any of its data to be mapped to citation_txt?

If that's correct, then let's try to take it out of the relevance score for the art collection.

flapka commented 5 years ago

Correct; and that sounds fine to me.

yulgit1 commented 5 years ago

OK I will take it out of the relevance score.

As for the show page rendering, LIDO records show citation_txt as 'Publications'. MARC records show citation_txt as 'Cite As' for the few that have it, or else "Yale Center for British Art" (adding this was a previous issue): https://github.com/ycba-cia/blacklight-collections2/issues/65

KraigBinkowski commented 5 years ago

I don't believe a search should include the citation fields from the bibliography module - I remember receiving results that hit words from the citations -- I think this would be very confusing for a patron, I was confused until I figured out why a result was there. I recommend not including citation_txt in the range of fields searched.

yulgit1 commented 5 years ago

Made the change. Let me know if you're getting the expected results. It's a little difficult to do a definitive test, but FWIW the Hogarth search Ed was concerned about above now returns 584 results rather than 937, and many of the non-"Hogarth" creators have been removed.

snippet from new configuration:

<str name="qf">author_txt^4 title_txt^4 topic_txt^4 publishDate_txt
            format_txt physical_txt description_txt credit_line_txt
            call_number_txt type_txt collection_txt geographic_txt
            topic_subjectActor_txt title_alt_txt 
            publisher_txt resourceURL_txt cartographic_detail_txt 
            marc_contents_txt form_genre_txt author_additional_txt</str>
    <str name="bq">has_image_ss:"available"^8 OR on_view_ss:"On view"^7</str>

KraigBinkowski commented 5 years ago

This seems correct, though it does highlight the fact that Ptgs and P&D use one form of the artist's name (from TMS) and Rare and Ref use another (from MARC). When limiting by the creator facet you can see the difference. No way around that, I suppose.

William Hogarth, 1697–1764, British -- 91
Hogarth, William, 1697-1764. -- 38

edgartdata commented 5 years ago

TEST: Boost associated people to x2 or x3 so that self-portraits appear first.

edgartdata commented 5 years ago

Tried boosting creators to 10, titles to 4 and subjects to 3 so that all works on paper by Hogarth show at the top of the section for works on paper.
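
Assuming "creators", "titles" and "subjects" correspond to author_txt, title_txt and topic_txt (the thread does not say so explicitly), the boosts tried here would look roughly like this, with the remaining fields carried over unchanged from the configuration snippet earlier in the thread:

```xml
<!-- Sketch only: the mapping of creators/titles/subjects to these
     three fields is an assumption. -->
<str name="qf">author_txt^10 title_txt^4 topic_txt^3 publishDate_txt
            format_txt physical_txt description_txt credit_line_txt
            call_number_txt type_txt collection_txt geographic_txt
            topic_subjectActor_txt title_alt_txt
            publisher_txt resourceURL_txt cartographic_detail_txt
            marc_contents_txt form_genre_txt author_additional_txt</str>
```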