wellcomecollection / catalogue-api

:crystal_ball: The API for searching the Wellcome Collection catalogue.
https://developers.wellcomecollection.org
MIT License

Christy's request regarding # of pre-1900 digitised books #818

Closed agnesgaroux closed 1 month ago

agnesgaroux commented 1 month ago

Email from Natalie after she talked to Christy

Re the question from Christy, she wants to know the number of copies (items, not bibs) of pre-1900 books we hold that are not yet digitised, and as a proportion of the total.

So is it possible to get:

- Total number of items for the bib format book printed before 1900
- Number of those that haven’t been digitised
- These will all have to be unsuppressed, I think
- And also have a bib level value of monograph (as opposed to e.g. chapt/article)

agnesgaroux commented 1 month ago

Christy's original email

I’m trying to compare how many books we’ve digitised against the entirety of the catalogued collection. This is proving quite tricky to do through the user interfaces.

Is it possible to use the API to find out how many books we have catalogued in Sierra that were published before 1900?

And following on from that, is it possible to tell which ones do not have a version that’s digitised/hosted by Wellcome (as opposed to third party subscription sites)? And then to export a list of those?

Happy to chat further if anyone has time to look into this – it’s not super urgent.

agnesgaroux commented 1 month ago

Query

// works that are books 
// that were produced before 1900 
// that are not available online
get works-indexed-2024-08-15/_count
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "filterableValues.format.id": "a"
          }
        },
        {
          "range": {
            "query.production.label": {
              "lt": 1900
            }
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "filterableValues.availabilities.id": "online"
          }
        }
      ]
    }
  }
}

agnesgaroux commented 1 month ago

Aggs to add up the items for the found works

"aggs": {
  "total_items_id": {
    "sum": {
      "script": {
        "source": "doc['query.items.id'].length"
      }
    }
  }
}
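
For reference, the count query and the aggregation above could be combined into a single search body, roughly as sketched below. This is only an illustration, not the script actually used: the helper names `build_search_body` and `undigitised_proportion` are made up here, the field names are copied from the queries in this thread, and it assumes that "not available online" is an acceptable stand-in for "not digitised". Running the body once with the `must_not` clause and once without would give the two item totals needed for the proportion.

```python
import json

# Index name taken from the query earlier in this thread.
INDEX = "works-indexed-2024-08-15"


def build_search_body(exclude_online: bool) -> dict:
    """Build a size-0 search body for pre-1900 books with an item-count aggregation.

    Hypothetical helper: the field names are copied from the thread's queries.
    """
    bool_query = {
        "must": [
            {"match": {"filterableValues.format.id": "a"}},
            {"range": {"query.production.label": {"lt": 1900}}},
        ]
    }
    if exclude_online:
        # Assumption: works without the "online" availability are undigitised.
        bool_query["must_not"] = [
            {"match": {"filterableValues.availabilities.id": "online"}}
        ]
    return {
        "size": 0,  # no hits needed, only the totals
        "track_total_hits": True,  # exact work count rather than a capped one
        "query": {"bool": bool_query},
        "aggs": {
            "total_items_id": {
                "sum": {"script": {"source": "doc['query.items.id'].length"}}
            }
        },
    }


def undigitised_proportion(not_online_items: float, all_items: float) -> float:
    """Share of items that are not online; returns 0.0 when there are no items."""
    if all_items == 0:
        return 0.0
    return not_online_items / all_items


if __name__ == "__main__":
    # Print the request in a form that can be pasted into a console.
    print(f"GET {INDEX}/_search")
    print(json.dumps(build_search_body(exclude_online=True), indent=2))
```

The two sums would come back under `aggregations.total_items_id.value` in each response, and `undigitised_proportion` simply divides one by the other.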

agnesgaroux commented 1 month ago

I have a local script to query the index and parse the hits into a csv, with identifier (bNumber), title and workId (potentially looking to format these into works url)
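
The local script itself is not attached here, but the parsing step could look roughly like the sketch below. It is written under several assumptions: that each hit carries the work id in `_id`, that the bNumber and title sit in `_source` under the hypothetical keys `bNumber` and `title` (the real index fields will differ), and that works URLs follow the `https://wellcomecollection.org/works/<workId>` pattern.

```python
import csv
import io

# Assumed public URL pattern for a work page.
WORKS_URL_PREFIX = "https://wellcomecollection.org/works/"

FIELDNAMES = ["bNumber", "title", "workId", "url"]


def hit_to_row(hit: dict) -> dict:
    """Flatten one search hit into a CSV row.

    The "_source" keys used here are hypothetical placeholders.
    """
    source = hit.get("_source", {})
    work_id = hit.get("_id", "")
    return {
        "bNumber": source.get("bNumber", ""),
        "title": source.get("title", ""),
        "workId": work_id,
        "url": WORKS_URL_PREFIX + work_id,
    }


def hits_to_csv(hits: list) -> str:
    """Serialise a list of hits to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for hit in hits:
        writer.writerow(hit_to_row(hit))
    return buffer.getvalue()
```

Writing `hits_to_csv(hits)` to a file gives the list to export; titles containing commas are quoted by the `csv` module automatically.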

agnesgaroux commented 1 month ago

Blocked: Christy is on leave right now and we need some clarifications on a few points

agnesgaroux commented 1 month ago

You can pause on this if you like. I'm checking whether adding locations would be helpful in determining sets that are suitable for digitisation (there's a lot of stuff on the list of 10k that you sent that we have already looked through, journals catalogued as books, etc., but location might help to narrow down the collections we're interested in).

Closing this. Will open a new issue if she wants the location added.