zepheira / librarylink_collections

Library.Link Collections

Building up the list from PBS Great American Reads #3

Closed uogbuji closed 5 years ago

uogbuji commented 5 years ago

@erimille & @informaticmonad have tried several approaches to the PBS Great American Reads. Now that I'm involved, this ticket starts largely with my own approach, and I hope we can roll in what all of us learned.

Work in progress is at https://github.com/zepheira/librarylink_collections/blob/alamw2019/lists/greatamericanread.json

See also: https://github.com/zepheira/librarylink_collections/blob/master/lists/harrypotter.json

uogbuji commented 5 years ago

I started with a scraper for the titles. The GAR page is dynamic content generated by JS, so you need Chrome's inspector to get the meat of it.

[Screenshot, 2019-01-24 10:39:52 pm]

Copy the highlighted element & paste into gar.html.

Then:

microx --html --find-text Mockingbird --show-attrs "id,class" gar.html
html/body/div[@class="gar-pagewrap"]/div[@class="gar-book-list"]/a[@class="gar-book-item"]/div[@class="gar-book-item-inner"]/div[@class="gar-book-title-author"]/div[@class="gar-book-info"]

I need to check into why I didn't get the full path to the title alone, which has a trailing /b.

microx --html --expr 'html/body/div[@class="gar-pagewrap"]/div[@class="gar-book-list"]/a[@class="gar-book-item"]/div[@class="gar-book-item-inner"]/div[@class="gar-book-title-author"]/div[@class="gar-book-info"]/b' gar.html

Here it is with an update to get just the string value of each title and to remove the " (Series)" text that some titles carry:

microx --html --expr 'html/body/div[@class="gar-pagewrap"]/div[@class="gar-book-list"]/a[@class="gar-book-item"]/div[@class="gar-book-item-inner"]/div[@class="gar-book-title-author"]/div[@class="gar-book-info"]/b' --foreach "substring-before(., ' (Series)')" gar.html > gar-titles.txt

Now we have a list of GAR titles.
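For anyone without microx handy, here is a minimal Python sketch of the same extraction using only the standard library. It is not the tool used above. It assumes gar.html has the structure shown in the path (a `b` element inside `div class="gar-book-info"`), and it keys only on that class rather than on the full path. The names `TitleParser` and `extract_titles` are made up for illustration. One thing worth checking: in XPath 1.0, `substring-before` returns an empty string when the delimiter is absent, and I have not confirmed how microx treats titles without " (Series)". The sketch strips the suffix only when it is present.

```python
from html.parser import HTMLParser

SERIES_SUFFIX = " (Series)"


class TitleParser(HTMLParser):
    """Collect the text of <b> elements found inside <div class="gar-book-info">."""

    def __init__(self):
        super().__init__()
        self.info_depth = 0  # > 0 while inside a gar-book-info div
        self.in_b = False
        self.buf = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested divs so we know when the gar-book-info div closes.
            if self.info_depth or "gar-book-info" in classes:
                self.info_depth += 1
        elif tag == "b" and self.info_depth:
            self.in_b = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag == "div" and self.info_depth:
            self.info_depth -= 1
        elif tag == "b" and self.in_b:
            self.in_b = False
            title = "".join(self.buf).strip()
            # Strip the suffix only when present; other titles pass through.
            if title.endswith(SERIES_SUFFIX):
                title = title[: -len(SERIES_SUFFIX)]
            if title:
                self.titles.append(title)

    def handle_data(self, data):
        if self.in_b:
            self.buf.append(data)


def extract_titles(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.titles


if __name__ == "__main__":
    with open("gar.html", encoding="utf-8") as f:
        print("\n".join(extract_titles(f.read())))
```

Redirecting the output to gar-titles.txt would give one title per line, the same shape as the microx run.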

uogbuji commented 5 years ago

Next step is converting titles to groups of ISBNs. I considered isbn.nu, isbndb.com ($$), the Google Books API and LibraryThing, but settled on OpenLibrary for its combination of low hassle and no cost.
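To illustrate what a title-to-ISBN lookup against OpenLibrary can look like, here is a rough sketch. It is not the code in liblink_title_report. It assumes OpenLibrary's public search endpoint (`search.json` with `title`, `fields` and `limit` parameters) and a response with a `docs` list whose entries may carry an `isbn` list. Please check those against the current OpenLibrary docs. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against OpenLibrary's current API documentation.
SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_url(title, limit=5):
    """Build a title search URL that asks only for title and ISBN fields."""
    query = urlencode({"title": title, "fields": "title,isbn", "limit": limit})
    return f"{SEARCH_URL}?{query}"


def isbn_groups(response):
    """Turn a parsed search response into a list of (title, [isbn, ...]) groups.

    Each search hit becomes one group; hits without ISBNs are skipped.
    """
    groups = []
    for doc in response.get("docs", []):
        isbns = doc.get("isbn") or []
        if isbns:
            groups.append((doc.get("title", ""), sorted(set(isbns))))
    return groups


def lookup(title):
    """Fetch and parse one title (makes a network request)."""
    with urlopen(build_search_url(title), timeout=30) as resp:
        return isbn_groups(json.load(resp))
```

Keeping the parsing separate from the network call makes it possible to test against saved responses without hitting the service.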

See also: Alternatives to Amazon API (2009), which is mostly about cover images but also has resources for resolving titles.

I implemented liblink_title_report:

liblink_title_report --title-list gar-titles.txt > greatamericanread.json

Note: I implemented this to be gentle on the OpenLibrary service.
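One common way to be gentle on a free service is to enforce a minimum gap between requests. The sketch below shows that idea only. I am not claiming it is how liblink_title_report does it, and the `Throttle` class, `fetch_all` helper and the interval value are all hypothetical.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock  # injectable for testing
        self.sleep = sleep  # injectable for testing
        self.last = None    # time of the previous request, if any

    def wait(self):
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now


def fetch_all(titles, fetch, throttle):
    """Call fetch(title) for each title, pausing between requests."""
    results = {}
    for title in titles:
        throttle.wait()
        results[title] = fetch(title)
    return results
```

A call such as `fetch_all(titles, some_lookup_function, Throttle(2.0))` would then space requests at least two seconds apart. The right interval is a judgment call.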

Gloria then did a lot of cleanup, especially culling of the spurious search results we get from OpenLibrary.

informaticmonad commented 5 years ago

Details on cleanup: Combined ISBNs for matching works, removed duplicate ISBNs, removed additional titles that weren't on the GAR list, fixed capitalization for some titles
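Parts of this cleanup could be scripted. As a rough sketch, the function below combines groups whose titles match (ignoring case and extra whitespace) and drops repeated ISBNs within each combined group. The `{"title": ..., "isbns": [...]}` shape is an assumption for illustration and may not match the real layout of greatamericanread.json. Judging which search results are spurious still needs a human.

```python
def merge_matching(groups):
    """Merge groups with matching titles; keep each ISBN once per merged group.

    `groups` is a list of {"title": str, "isbns": [str, ...]} dicts
    (a hypothetical shape, not necessarily the real file format).
    """
    merged = {}
    for group in groups:
        # Normalize case and whitespace so "DUNE " and "Dune" match.
        key = " ".join(group["title"].casefold().split())
        entry = merged.setdefault(key, {"title": group["title"], "isbns": []})
        for isbn in group["isbns"]:
            if isbn not in entry["isbns"]:
                entry["isbns"].append(isbn)
    # Dicts preserve insertion order, so the first-seen title order is kept.
    return list(merged.values())
```

The first spelling of a title that is encountered wins, so capitalization fixes would still be a separate pass.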

uogbuji commented 5 years ago

As Gloria mentions, my first pass at liblink_title_report did not cull duplicate ISBNs across group boundaries. Now it does.
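For reference, culling across group boundaries can be sketched as a single pass with a shared "seen" set. This is an illustration, not the actual liblink_title_report code. The group shape (`{"title": ..., "isbns": [...]}`) and the rule that the first group to list an ISBN keeps it are both assumptions.

```python
def cull_cross_group_duplicates(groups):
    """Drop any ISBN that already appeared in an earlier group.

    `groups` is a list of {"title": str, "isbns": [str, ...]} dicts
    (hypothetical shape). The first group to list an ISBN keeps it.
    """
    seen = set()
    result = []
    for group in groups:
        unique = []
        for isbn in group["isbns"]:
            if isbn not in seen:
                seen.add(isbn)
                unique.append(isbn)
        # Copy the group so the input is left unmodified.
        result.append({**group, "isbns": unique})
    return result
```

A group can end up with an empty ISBN list after culling. Whether to keep or drop such groups is left to the caller.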