xlcnd / isbnlib

python library to validate, clean, transform and get metadata of ISBN strings (for devs).

ISBN from words throttling #119

Closed vorte closed 1 year ago

vorte commented 1 year ago

I have isbnlib deployed in an AWS Lambda function. Occasionally the Lambda needs to retrieve an approximate ISBN, and we found that the isbn_from_words() function meets this requirement. The main issue is that the requests keep getting throttled with 429 errors. The Lambda invokes isbn_from_words() only once or twice per hour, yet over half of the requests are throttled.

I realise this could be due to the public Lambda IPs being shared by all Lambda customers, but would using the Google Search API be a more robust solution? isbnlib users could generate an API key and pass it in to the isbn_from_words() function. If an API key is not provided, it could fall back to the current behaviour.

Google discourages scraping the search page, and using the API would also be a lot more robust than trying to guess where the ISBN is within the HTML.
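To make the proposal concrete, here is a minimal sketch of what an optional API key path might look like. It is not isbnlib code: `isbn_from_words_api`, `build_search_url`, `first_isbn13`, and the `api_key`, `cx` and `fallback` parameters are all hypothetical names. It assumes the Custom Search JSON API (which needs a search engine id, `cx`, in addition to the key) and assumes that an ISBN-13 shows up in the result titles or snippets, which is not guaranteed.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

# Custom Search JSON API endpoint; it needs an API key and a search engine id (cx).
SEARCH_URL = "https://www.googleapis.com/customsearch/v1"
# 978/979 prefix followed by ten more digits, each optionally preceded by "-" or " ".
ISBN13_RE = re.compile(r"\b97[89](?:[- ]?\d){10}\b")


def is_valid_isbn13(digits):
    """Check the ISBN-13 check digit (weights alternate 1, 3)."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return len(digits) == 13 and total % 10 == 0


def first_isbn13(texts):
    """Return the first checksum-valid ISBN-13 found in the given strings."""
    for text in texts:
        for match in ISBN13_RE.finditer(text):
            digits = re.sub(r"\D", "", match.group())
            if is_valid_isbn13(digits):
                return digits
    return None


def build_search_url(words, api_key, cx):
    """Build the request URL for a hypothetical 'ISBN <words>' query."""
    return SEARCH_URL + "?" + urlencode({"key": api_key, "cx": cx, "q": "ISBN " + words})


def isbn_from_words_api(words, api_key=None, cx=None, fallback=None):
    """Use the search API when credentials are given, else the current behaviour."""
    if not (api_key and cx):
        if fallback is None:
            # Imported lazily so the helpers above work without isbnlib installed.
            import isbnlib
            fallback = isbnlib.isbn_from_words
        return fallback(words)
    with urlopen(build_search_url(words, api_key, cx), timeout=10) as resp:
        payload = json.load(resp)
    # Assumption: an ISBN appears in the title or snippet of some result.
    texts = [
        item.get("title", "") + " " + item.get("snippet", "")
        for item in payload.get("items", [])
    ]
    return first_isbn13(texts)
```

Without `api_key` and `cx` the call simply delegates to the existing isbn_from_words(), so current users would see no change. Validating the check digit is a cheap way to discard 13-digit strings that merely look like ISBNs.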

Reference: https://stackoverflow.com/questions/29962902/how-do-i-get-google-search-results-from-urlfetch-in-google-apps-script/30041104#30041104

xlcnd commented 1 year ago

Thanks! I will consider your suggestion for a future version.

xlcnd commented 1 year ago

Meanwhile, I suggest you use goom("your words")[0]['ISBN-13'].

By the way, probably this is much better than a general search!
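Since indexing `[0]` directly fails on an empty result list, a slightly more defensive version of that one-liner might look like the sketch below. `isbn13_from_goom` and its `search` parameter are hypothetical names used for illustration; only goom() and the 'ISBN-13' key come from the suggestion above. Exceptions that goom() itself may raise are not handled here.

```python
def isbn13_from_goom(words, search=None):
    """Return the ISBN-13 of the first record that has one, or None."""
    if search is None:
        # Imported lazily; by default the real isbnlib.goom is used.
        from isbnlib import goom
        search = goom
    for record in search(words) or []:
        isbn = record.get("ISBN-13")
        if isbn:
            return isbn
    return None
```

Passing a stub as `search` makes the helper easy to exercise without network access, which may be handy when throttling is the very thing under investigation.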

vorte commented 1 year ago

I tried using goom as suggested and it seems to work quite well. However, piping the returned ISBN to meta() sometimes fails. For example:

>>> from isbnlib import goom, meta
>>> # goom returns the correct ISBN, 9788579308529
>>> isbn = goom("Manual de persuasão do FBI Karlins, Marvin; Shafer, Jack ")[0]['ISBN-13']
>>> meta(isbn)
{}

I realise I can just use the metadata from goom(...)[0] in this instance, but why does meta() return an empty dict, given that both functions call the same gbooks endpoint? 😕 Am I missing something?
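The workaround mentioned here, using the goom record when meta() comes back empty, could be wrapped roughly as follows. This is only a sketch: `metadata_for_words`, `search` and `lookup` are hypothetical names, and it assumes that `lookup` returns an empty dict (rather than raising) when nothing is found, as in the session above.

```python
def metadata_for_words(words, search, lookup):
    """Prefer lookup(isbn); fall back to the first search record if it is empty."""
    records = search(words)
    if not records:
        return {}
    first = records[0]
    # Assumption: lookup returns {} when the ISBN is not found.
    found = lookup(first["ISBN-13"])
    return found if found else first
```

With isbnlib installed, the intended call would be metadata_for_words("some words", goom, meta).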

xlcnd commented 1 year ago

Despite being the same service, the calls are different, and different database indexes are used to select the relevant items for each call (this is the usual procedure!). And since these databases are not in a completely consistent state, you get inconsistent results!

You can test this by entering the following in your browser:

vorte commented 1 year ago

It's strange that the same endpoint can return different data depending on which query params are used, but this is clearly an inconsistency in the Google Books API rather than in isbnlib.

My original issue has been resolved by using goom, so feel free to close this. I still think accepting an API key in isbn_from_words() could be a nice improvement for the future, though. 👍