zepheira / librarylink_collections

Library.Link Collections
Scrape ALA Notable Children's Books list #2

Closed uogbuji closed 5 years ago

uogbuji commented 5 years ago

Source: http://www.ala.org/alsc/awardsgrants/notalists/ncb

uogbuji commented 5 years ago

I started out by adding something I've wanted for a long time in Amara: the ability to search for a string and show XPaths to the text that contains it.

curl -o /tmp/ncb.html http://www.ala.org/alsc/awardsgrants/notalists/ncb
# "Thyra" is just a unique string that occurs in the first desired item listed on the page
microx --html --find-text=Thyra --show-attrs=id,class /tmp/ncb.html

Output:

html/body[@class="html not-front not-logged-in one-sidebar sidebar-first page-node page-node- page-node-280 node-type-page"]/div[@class="main-container with-sidebar container"]/div[@class="row container"]/section[@class="col-sm-9 well"]/div[@class="region region-content"]/section[@id="block-system-main"][@class="block block-system clearfix"]/article[@id="node-280"][@class="node node-page clearfix"]/div[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/div[@class="field-items"]/div[@class="field-item even"]
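
For anyone curious how a find-text search like this can work, here is a minimal sketch using only Python's standard library. It is not microx's implementation and will not reproduce its exact output; the names `TextPathFinder` and `find_text_paths` are made up for illustration. It keeps a stack of open elements and records the stack whenever a text node contains the search string.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextPathFinder(HTMLParser):
    """Collect a path of open elements for each text node containing `needle`."""

    def __init__(self, needle, show_attrs=("id", "class")):
        super().__init__()
        self.needle = needle
        self.show_attrs = show_attrs
        self.stack = []   # (tag, path step) for each currently open element
        self.paths = []   # one path per matching text node

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Append a predicate for each requested attribute that is present.
        step = tag + "".join(
            f'[@{name}="{attrs[name]}"]'
            for name in self.show_attrs if attrs.get(name)
        )
        self.stack.append((tag, step))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.needle in data:
            self.paths.append("/".join(step for _, step in self.stack))


def find_text_paths(html, needle):
    """Return the element paths of all text nodes in `html` containing `needle`."""
    finder = TextPathFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.paths
```

The resulting paths are only as reliable as the page's markup: `html.parser` does not repair badly nested HTML the way a browser would.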

With this in place it's trivial to separately get the titles and the rest of the info for each book, including the ISBNs:

Titles:

$ microx --html --expr='html/body[@class="html not-front not-logged-in one-sidebar sidebar-first page-node page-node- page-node-280 node-type-page"]/div[@class="main-container with-sidebar container"]/div[@class="row container"]/section[@class="col-sm-9 well"]/div[@class="region region-content"]/section[@id="block-system-main"][@class="block block-system clearfix"]/article[@id="node-280"][@class="node node-page clearfix"]/div[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/div[@class="field-items"]/div[@class="field-item even"]' --foreach='div/em/strong' /tmp/ncb.html | head
<strong>Alfie</strong>
<strong>All Around Us</strong>
<strong>All the Way to Havana</strong>
<strong>Baby Goes to Market</strong>
<strong>Big Cat, Little Cat</strong>
<strong>Blue Sky, White Stars</strong>
<strong>The Book of Mistakes</strong>
<strong>The Boy and the Whale</strong>
<strong>Charlie & Mouse</strong>
<strong>Frida Kahlo and Her Animalitos</strong>

Rest:

$ microx --html --expr='html/body[@class="html not-front not-logged-in one-sidebar sidebar-first page-node page-node- page-node-280 node-type-page"]/div[@class="main-container with-sidebar container"]/div[@class="row container"]/section[@class="col-sm-9 well"]/div[@class="region region-content"]/section[@id="block-system-main"][@class="block block-system clearfix"]/article[@id="node-280"][@class="node node-page clearfix"]/div[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/div[@class="field-items"]/div[@class="field-item even"]' --foreach='div/em/following-sibling::text()' /tmp/ncb.html | head
. By Thyra Heder. Illus. by the author. Abrams (9781419725296).
. By Xelena González. Illus. by Adriana M. Garcia. Cinco Puntos (9781941026762).
. By Margarita Engle. Illus. by Mike Curato. Henry Holt (9781627796422).
. By Atinuke. Illus. by Angela Brooksbank. Candlewick (9780763695705).
. By Elisha Cooper. Illus. by the author. Roaring Brook (9781626723719).
. By Sarvinder Naberhaus. Illus. by Kadir Nelson. Dial (9780803737006). 
. By Corinna Luyken. Illus. by the author. Dial (9780735227927).
. By Mordicai Gerstein. Illus. by the author. Roaring Brook (9781626725058).
. By Laurel Snyder. Illus. by Emily Hughes. Chronicle (9781452131535).
. By Monica Brown. Illus. by John Parra. North-South (9780735842694).

If we just go with the list format where only the ISBN is required (e.g. topshelf.json), then I can get the bare list of ISBNs I need as follows:

$ microx --html --expr='html/body[@class="html not-front not-logged-in one-sidebar sidebar-first page-node page-node- page-node-280 node-type-page"]/div[@class="main-container with-sidebar container"]/div[@class="row container"]/section[@class="col-sm-9 well"]/div[@class="region region-content"]/section[@id="block-system-main"][@class="block block-system clearfix"]/article[@id="node-280"][@class="node node-page clearfix"]/div[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/div[@class="field-items"]/div[@class="field-item even"]' --foreach='div/em/following-sibling::text()' /tmp/ncb.html | python <(cat << EOF
> import re, sys
> for l in sys.stdin:
>     m = re.search(r'\((\d+)\)', l, re.MULTILINE)
>     if m: print(f'"{m.group(1)}",')
> EOF
> ) | head
"9781419725296",
"9781941026762",
"9781627796422",
"9780763695705",
"9781626723719",
"9780803737006",
"9780735227927",
"9781626725058",
"9781452131535",
"9780735842694",

Phew! Some black-belt bash-fu there. Had to remind myself how to set up a here-doc for Python while still maintaining the stdin from the pipe.

Pastable snippet to use for real:

microx --html --expr='html/body[@class="html not-front not-logged-in one-sidebar sidebar-first page-node page-node- page-node-280 node-type-page"]/div[@class="main-container with-sidebar container"]/div[@class="row container"]/section[@class="col-sm-9 well"]/div[@class="region region-content"]/section[@id="block-system-main"][@class="block block-system clearfix"]/article[@id="node-280"][@class="node node-page clearfix"]/div[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/div[@class="field-items"]/div[@class="field-item even"]' --foreach='div/em/following-sibling::text()' /tmp/ncb.html | python <(cat << EOF
import re, sys
# Print each parenthesized ISBN as a quoted, comma-terminated list item
for l in sys.stdin:
    m = re.search(r'\((\d+)\)', l)
    if m: print(f'"{m.group(1)}",')
EOF
)
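
The process substitution is needed because a here-doc fed directly to python would itself become stdin and displace the piped microx output. Another way around that is to keep the filter in a file. The sketch below is the same logic as the snippet above, split into functions; the names `extract_isbns` and `main` are mine, not part of microx or this repo.

```python
import re
import sys

# An ISBN here is a run of digits wrapped in parentheses, e.g. "(9781419725296)".
ISBN_IN_PARENS = re.compile(r"\((\d+)\)")


def extract_isbns(lines):
    """Yield the first parenthesized digit run found on each line."""
    for line in lines:
        match = ISBN_IN_PARENS.search(line)
        if match:
            yield match.group(1)


def main(stream=sys.stdin):
    """Print each ISBN as a quoted, comma-terminated list item."""
    for isbn in extract_isbns(stream):
        print(f'"{isbn}",')
```

Saved to a file with a call to `main()` at the bottom, this can sit at the end of the microx pipeline in place of the here-doc.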

Result after adding the JSON envelope: https://github.com/zepheira/librarylink_collections/blob/master/lists/ala-notable-childrens-books.json
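
I added the envelope by hand. If it ever needs scripting, something along these lines would do, with the big caveat that the key names below are placeholders I made up for illustration. The actual layout is whatever the linked JSON file uses, so check that before relying on this.

```python
import json


def wrap_isbns(isbns, name, source_url):
    """Return a JSON string wrapping a list of ISBNs with placeholder metadata."""
    envelope = {
        "name": name,          # hypothetical key, not taken from the real list files
        "source": source_url,  # hypothetical key
        "isbns": list(isbns),  # hypothetical key
    }
    return json.dumps(envelope, indent=2)
```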

uogbuji commented 5 years ago

Just capturing for completeness of notes. If I wanted to zip up both titles & remainder of each book div together, this does the trick, with pipe delimiter:

$ microx --html --expr='html/body[@class="html not-front not-logged-in one-sidebar sidebar-first page-node page-node- page-node-280 node-type-page"]/div[@class="main-container with-sidebar container"]/div[@class="row container"]/section[@class="col-sm-9 well"]/div[@class="region region-content"]/section[@id="block-system-main"][@class="block block-system clearfix"]/article[@id="node-280"][@class="node node-page clearfix"]/div[@class="field field-name-body field-type-text-with-summary field-label-hidden"]/div[@class="field-items"]/div[@class="field-item even"]/div[em]' --foreach='concat(em/strong, "|", em/following-sibling::text())' /tmp/ncb.html | head
Alfie|. By Thyra Heder. Illus. by the author. Abrams (9781419725296).
All Around Us|. By Xelena González. Illus. by Adriana M. Garcia. Cinco Puntos (9781941026762).
All the Way to Havana|. By Margarita Engle. Illus. by Mike Curato. Henry Holt (9781627796422).
Baby Goes to Market|. By Atinuke. Illus. by Angela Brooksbank. Candlewick (9780763695705).
Big Cat, Little Cat|. By Elisha Cooper. Illus. by the author. Roaring Brook (9781626723719).
Blue Sky, White Stars|. By Sarvinder Naberhaus. Illus. by Kadir Nelson. Dial (9780803737006). 
The Book of Mistakes|. By Corinna Luyken. Illus. by the author. Dial (9780735227927).
The Boy and the Whale|. By Mordicai Gerstein. Illus. by the author. Roaring Brook (9781626725058).
Charlie & Mouse|. By Laurel Snyder. Illus. by Emily Hughes. Chronicle (9781452131535).
Frida Kahlo and Her Animalitos|. By Monica Brown. Illus. by John Parra. North-South (9780735842694).
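
If those pipe-delimited lines ever need to become records, a rough sketch of the parsing follows. `BookLine` and `parse_book_line` are illustrative names only, and it assumes titles never contain a pipe character, since everything before the first "|" is treated as the title.

```python
import re
from typing import NamedTuple, Optional


class BookLine(NamedTuple):
    title: str
    details: str
    isbn: Optional[str]


def parse_book_line(line):
    """Split a 'title|details' line and pull the parenthesized ISBN out of the details."""
    title, _, rest = line.rstrip("\n").partition("|")
    match = re.search(r"\((\d+)\)", rest)
    # Drop the leading ". " left over from the text node that follows <em>.
    details = rest.lstrip(". ").strip()
    return BookLine(title.strip(), details, match.group(1) if match else None)
```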

uogbuji commented 5 years ago

OK, @informaticmonad confirmed that this simple list format is good enough to start, so closing.

uogbuji commented 5 years ago

Just to exercise this scraping pattern, @informaticmonad asked me to extract ISBNs from January 2019 LibraryReads. I tried by tag-locating the first ISBN:

$ microx --html --find-text=9780743298070 --show-attrs=id,class january-2019-libraryreads 
html[@class="no-js"]/body/div[@class="wrapper clear"]/div[@id="blogContent"]/div[@class="entry"]/blockquote

But it turned out that trailing blockquote is a special case they use to highlight the top entry. Backtracked up one level, as confirmed by checking the second ISBN:

microx --html --find-text=9781250133731 --show-attrs=id,class january-2019-libraryreads

OK, so now we can use that. Cutting to the chase:

microx --html --expr='html[@class="no-js"]/body/div[@class="wrapper clear"]/div[@id="blogContent"]/div[@class="entry"]' january-2019-libraryreads  | python <(cat << EOF
import re, sys
# Print the first "ISBN: <digits>" match on each line as a quoted list item
for l in sys.stdin:
    m = re.search(r'ISBN: (\d+)', l)
    if m: print(f'"{m.group(1)}",')
EOF
)
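
One caveat with the line-by-line search: it only picks up the first ISBN on any given line. A variant that scans the whole text and drops repeats might look like the sketch below. The name `isbns_in_text` is made up, and loosening the single space after "ISBN:" to any whitespace is my assumption about how the page might vary, not something I checked against it.

```python
import re

# "ISBN:" followed by optional whitespace, then the digits to capture.
ISBN_LABEL = re.compile(r"ISBN:\s*(\d+)")


def isbns_in_text(text):
    """Return every labelled ISBN in `text`, in first-seen order, without duplicates."""
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(ISBN_LABEL.findall(text)))
```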

Added: https://github.com/zepheira/librarylink_collections/blob/master/lists/january-2019-libraryreads.json

uogbuji commented 5 years ago

Noting that the above requires the updated version of microx in amara3.xml 3.0.0.