Closed uogbuji closed 5 years ago
I started with a scraper for the titles. The Great American Read (GAR) page is dynamic content generated by JS, so it takes Chrome's Inspect tool to get at the meat of it.
Copy the highlighted element & paste it into gar.html.
Then:
microx --html --find-text Mockingbird --show-attrs "id,class" gar.html
html/body/div[@class="gar-pagewrap"]/div[@class="gar-book-list"]/a[@class="gar-book-item"]/div[@class="gar-book-item-inner"]/div[@class="gar-book-title-author"]/div[@class="gar-book-info"]
I need to check into why I didn't get the full path to the title alone, which has a trailing /b.
microx --html --expr 'html/body/div[@class="gar-pagewrap"]/div[@class="gar-book-list"]/a[@class="gar-book-item"]/div[@class="gar-book-item-inner"]/div[@class="gar-book-title-author"]/div[@class="gar-book-info"]/b' gar.html
Then an update to get just the string value of each title, and to remove the " (Series)" text that follows some titles:
microx --html --expr 'html/body/div[@class="gar-pagewrap"]/div[@class="gar-book-list"]/a[@class="gar-book-item"]/div[@class="gar-book-item-inner"]/div[@class="gar-book-title-author"]/div[@class="gar-book-info"]/b' --foreach "substring-before(., ' (Series)')" gar.html > gar-titles.txt
Now we have a list of GAR titles.
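One thing worth double-checking in gar-titles.txt: in XPath 1.0, substring-before returns an empty string when the marker text is absent, so depending on how microx applies --foreach, titles without " (Series)" might come out blank. As a fallback, here is a minimal Python sketch that strips the marker only where it occurs. The function names are hypothetical and not part of microx; the input is assumed to be one raw title per line.

```python
SERIES_SUFFIX = " (Series)"


def strip_series(title: str) -> str:
    """Return the title without a trailing ' (Series)' marker, if present."""
    title = title.strip()
    if title.endswith(SERIES_SUFFIX):
        return title[: -len(SERIES_SUFFIX)].rstrip()
    # No marker: keep the title as-is rather than returning an empty string
    return title


def clean_titles(lines):
    """Strip series markers from an iterable of lines, skipping blank lines."""
    return [strip_series(line) for line in lines if line.strip()]
```

Something like `clean_titles(open("raw-titles.txt"))` would then give the list to write out, where raw-titles.txt is a hypothetical file holding the unmodified output of the previous microx command.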
Next step is converting titles to groups of ISBNs. I considered isbn.nu, isbndb.com (paid), the Google Books API and LibraryThing, but settled on OpenLibrary for its combination of low hassle and no cost.
See also: "Alternatives to Amazon API" (2009), which is mostly about cover images but also has resources for resolving titles.
I implemented liblink_title_report and ran it over the title list:
liblink_title_report --title-list gar-titles.txt > greatamericanread.json
Note: I implemented this to be gentle on the OpenLibrary service.
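For anyone curious what "gentle" could look like, here is a rough Python sketch of a throttled title lookup against the OpenLibrary search endpoint. It is not the actual liblink_title_report code, which isn't shown in this ticket. The function names, the one-second delay and the result limit are all assumptions, as is the exact shape of the response (a "docs" list whose entries may carry an "isbn" list).

```python
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://openlibrary.org/search.json"


def build_search_url(title, limit=5):
    """Build a title search URL that asks only for the fields we need."""
    query = urllib.parse.urlencode(
        {"title": title, "fields": "title,isbn", "limit": limit}
    )
    return f"{SEARCH_URL}?{query}"


def extract_isbns(response):
    """Collect ISBNs from a parsed search response, in first-seen order."""
    seen = {}
    for doc in response.get("docs", []):
        for isbn in doc.get("isbn", []):
            seen.setdefault(isbn, None)
    return list(seen)


def lookup_titles(titles, delay=1.0):
    """Yield (title, isbns) pairs, sleeping between requests (assumed delay)."""
    for i, title in enumerate(titles):
        if i:
            time.sleep(delay)  # pause before every request after the first
        with urllib.request.urlopen(build_search_url(title)) as resp:
            yield title, extract_isbns(json.load(resp))
```

Requesting only the title and isbn fields keeps each response small, and the fixed pause keeps the request rate low. Title searches can still return unrelated works, so the results need manual review.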
Gloria then did a lot of cleanup, especially culling of the spurious search results we get from OpenLibrary.
Details on cleanup: combined ISBNs for matching works, removed duplicate ISBNs, removed additional titles that weren't on the GAR list, and fixed capitalization for some titles.
As Gloria mentions, my first pass at liblink_title_report did not cull duplicate ISBNs across group boundaries. Now it does.
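To illustrate what culling across group boundaries means, here is a small Python sketch. It assumes the groups are a plain mapping from title to a list of ISBNs, which is a simplification and not necessarily the layout of greatamericanread.json; the function name is hypothetical. An ISBN stays in the first group that lists it and is dropped from any later group.

```python
def cull_duplicate_isbns(groups):
    """Keep each ISBN only in the first group (in dict order) that lists it."""
    seen = set()
    culled = {}
    for title, isbns in groups.items():
        kept = []
        for isbn in isbns:
            if isbn not in seen:
                seen.add(isbn)
                kept.append(isbn)
        culled[title] = kept
    return culled
```

"First group wins" is just one possible policy. Deciding which work an ambiguous ISBN really belongs to still takes the kind of manual review Gloria describes.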
@erimille & @informaticmonad have tried several approaches to the PBS Great American Read list. Now that I'm involved, this ticket starts largely with my own approach, and I hope we can roll in what all of us have learned.
Work in progress is at https://github.com/zepheira/librarylink_collections/blob/alamw2019/lists/greatamericanread.json
See also: https://github.com/zepheira/librarylink_collections/blob/master/lists/harrypotter.json