yumetodo / google-photos-album-image-url-fetch

https://www.npmjs.com/package/google-photos-album-image-url-fetch

[Feature Request] Fetching data from 500+ image album #3

Open peterthehan opened 4 years ago

peterthehan commented 4 years ago

I tried to fetch from an album containing ~6000 images, and this API seems to cut off after 500 (probably because of the way Photos lazy-loads the images). Is there a way to fetch these in chunks at an interval to retrieve the full album?

yumetodo commented 4 years ago

This library gets image URLs by web scraping. On a Google Photos album page (shared URL), the photos are drawn from the information passed to the AF_initDataCallback function, which is embedded at the foot of the page (you can check this by fetching the HTML with curl or a similar tool). This library extracts that call with a regex. So, if you want to speed up development, please provide information about the body of the AF_initDataCallback call for an album that contains 500+ images. (NOTE: Don't paste the function's body if the album contains private photos. Instead, compare the result with https://github.com/yumetodo/google-photos-album-image-url-fetch/blob/a8fd411c90066c4005e2916437c478e05864983d/src/impl.ts#L27-L66 and tell me the difference.)
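To make the scraping idea above concrete, here is a minimal sketch of pulling the AF_initDataCallback call bodies out of a page with a regex. It is not the library's actual implementation (see the linked src/impl.ts for that). The function names `extractInitDataCallbacks` and `pickLargest` are hypothetical, and the regex assumes each call ends with `);` directly before a closing `</script>` tag, which may not hold for every page.

```typescript
// Hypothetical helper, NOT the library's real code.
// Returns the raw argument text of every AF_initDataCallback(...) call in the HTML.
function extractInitDataCallbacks(html: string): string[] {
  // Assumption: each call ends with ");" immediately before "</script>".
  const pattern = /AF_initDataCallback\(([\s\S]*?)\);\s*<\/script>/g;
  const bodies: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    bodies.push(match[1]);
  }
  return bodies;
}

// Picks the longest body, on the assumption that the photo list is the
// largest block on the page.
function pickLargest(bodies: string[]): string | undefined {
  return bodies.reduce<string | undefined>(
    (best, cur) => (best === undefined || cur.length > best.length ? cur : best),
    undefined
  );
}
```

Choosing the longest block is only a heuristic; a real implementation would more likely identify the right block by its contents.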

peterthehan commented 4 years ago

Hmm, I don't have a large enough album whose data I feel comfortable sharing...

However, I did curl my album to get the source, and searching for AF_initDataCallback yielded 4 results; 3 of them were not really meaningful, but the 4th one had a huge block of data, which seems to be what the scraper is parsing. I did a quick check on the length of this data and it was exactly 500 objects (as expected).

peterthehan commented 4 years ago

Did some quick research, maybe https://github.com/puppeteer/puppeteer to handle infinite scrolling?

yumetodo commented 4 years ago

I have no plan to depend on any browser due to the execution speed and maintenance cost.

yumetodo commented 4 years ago

I found that input[2] is a blank string when the album contains fewer than 500 images. When the album contains more than 500 images, it is a non-blank string such as AH_uQ431FY3wLG0VIxpEIeoouZFF_XXT_KGkTFY1fC89XHXgNQhhKXL1ib623N4eIK9dvpujX_V83U0WNTKV5nbHPxM_0L-6_csqRzSx07tWSxqJHtOo4I1rhRY-TeJGtpJLgpsCftZudCTE9B9X7Pa7Dwu3N7dedA0mRcwEPQJxcCk3EV-WiQE.

However, I have no idea how to use it. I could not find any suspicious request.
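The observation about input[2] can at least be used to detect a truncated album. The following is a minimal sketch under that assumption: `input` stands for the already-parsed data array from AF_initDataCallback, the index 2 comes only from the observation in this comment, and `getContinuationToken` is a hypothetical name.

```typescript
// Hypothetical helper: returns the string found at input[2], or null when it
// is blank or missing. Treating it as a "continuation token" is an assumption.
function getContinuationToken(input: unknown[]): string | null {
  const candidate = input[2];
  return typeof candidate === "string" && candidate.length > 0 ? candidate : null;
}
```

A non-null result would suggest the page held only part of the album, so a caller could warn the user instead of silently returning a partial list.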

peterthehan commented 4 years ago

https://developers.google.com/photos/library/reference/rest/v1/mediaItems/search#response-body probably a nextPageToken. Thanks for looking into this so far btw!

yumetodo commented 4 years ago

If that is a nextPageToken, there should be a request that uses it, but I cannot find one.

yumetodo commented 4 years ago

I found a suspicious request. [image]

yumetodo commented 4 years ago

I gave up trying to emulate the request.

req_body.txt

f.req=[[["snAcKc","[\"AF1QipMgHGuJRRQa_sIbtZWMnVDRRID1eogDnfC73_4oSPl_yWMqqnEES8cEVYQqs2nmyw\",\"AH_uQ431FY3wLG0VIxpEIeoouZFF_XXT_KGkTFY1fC89XHXgNQhhKXL1ib623N4eIK9dvpujX_V83U0WNTKV5nbHPxM_0L-6_csqRzSx07tWSxqJHtOo4I1rhRY-TeJGtpJLgpsCftZudCTE9B9X7Pa7Dwu3N7dedA0mRcwEPQJxcCk3EV-WiQE\",null,\"WHVZVTJwTEYxNDNiVXkwQS1zbTYxYVFYMktGSktB\",null]",null,"generic"]]]
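For anyone continuing the investigation, the body above could be assembled programmatically. This sketch only reproduces the shape of req_body.txt; the parameter names `albumKey`, `pageToken` and `authKey` are guesses at what the three strings mean, and `buildBatchExecuteBody` is a hypothetical name. The request in this thread was rejected with a 400, so nothing here is known to work.

```typescript
// Hypothetical helper mirroring the shape of req_body.txt.
// The meaning of each positional value is an assumption.
function buildBatchExecuteBody(albumKey: string, pageToken: string, authKey: string): string {
  // Inner payload: a JSON array that is itself embedded as a string.
  const inner = JSON.stringify([albumKey, pageToken, null, authKey, null]);
  // Outer envelope: "snAcKc" and "generic" are copied verbatim from the captured request.
  const envelope = JSON.stringify([[["snAcKc", inner, null, "generic"]]]);
  // Only the value is URL-encoded; the field name f.req stays literal.
  return "f.req=" + encodeURIComponent(envelope);
}
```

One detail that may be worth checking: according to the curl documentation, `--data-urlencode @file` URL-encodes the entire file content, which here would include the `f.req=` prefix, whereas the `name@file` form encodes only the value. Whether that difference explains the 400 response is not established in this thread.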

request.sh

#!/bin/bash
curl -X POST -H "Origin: https://photos.google.com" --data-urlencode @req_body.txt https://photos.google.com/_/PhotosUi/data/batchexecute -v

$ ./request.sh
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 216.58.197.206:443...
* Connected to photos.google.com (216.58.197.206) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: C:/msys64/mingw64/ssl/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=Mountain View; O=Google LLC; CN=*.google.com
*  start date: Apr  1 12:58:27 2020 GMT
*  expire date: Jun 24 12:58:27 2020 GMT
*  subjectAltName: host "photos.google.com" matched cert's "*.google.com"
*  issuer: C=US; O=Google Trust Services; CN=GTS CA 1O1
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0xf31f40)
> POST /_/PhotosUi/data/batchexecute HTTP/2
> Host: photos.google.com
> user-agent: curl/7.69.1
> accept: */*
> origin: https://photos.google.com
> content-length: 425
> content-type: application/x-www-form-urlencoded
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* We are completely uploaded and fine
< HTTP/2 400 
< content-type: application/json; charset=utf-8
< cache-control: no-cache, no-store, max-age=0, must-revalidate
< pragma: no-cache
< expires: Mon, 01 Jan 1990 00:00:00 GMT
< date: Wed, 22 Apr 2020 09:18:48 GMT
< x-content-type-options: nosniff
< p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
< strict-transport-security: max-age=31536000
< server: ESF
< x-xss-protection: 0
< x-frame-options: SAMEORIGIN
< set-cookie: NID=202=Uuq_pZxNarjRNLznKRc-fNU64P0xODlLDoJGNamc-wh8hSJnX4IriPOM8NVXPOl98tr6CirEMgDDRgpEGi21_H9oEKrX2U23PxccvKCdIHV7H1HMuLOZsVP18SRj7zid55zCEe5YIc7UuP39enfI3qBczHAvY40km6weG_M-NCw; expires=Thu, 22-Oct-2020 09:18:48 GMT; path=/; domain=.google.com; HttpOnly
< alt-svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,h3-T050=":443"; ma=2592000
< accept-ranges: none
< vary: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site,Accept-Encoding
<
)]}'

[["er",null,null,null,null,400,null,null,null,3]
,["di",18]
,["af.httprm",17,"-9208582034154782081",50]
]* Connection #0 to host photos.google.com left intact
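The response body in the log above starts with `)]}'` followed by a JSON array. A minimal sketch of reading such a body is shown below. The function names are hypothetical, and reading index 5 of the "er" row as an HTTP-like status code is an assumption based only on the 400 seen in this one response.

```typescript
// Hypothetical helper: strips the leading ")]}'" guard and parses the rest as JSON.
function parseBatchExecuteResponse(raw: string): unknown {
  const prefix = ")]}'";
  const trimmed = raw.trim();
  const text = trimmed.indexOf(prefix) === 0 ? trimmed.slice(prefix.length) : trimmed;
  return JSON.parse(text);
}

// Looks for a row whose first element is "er" and returns its sixth element
// when that is a number (assumed, not documented, to be a status code).
function findErrorStatus(parsed: unknown): number | null {
  if (!Array.isArray(parsed)) return null;
  for (const row of parsed) {
    if (Array.isArray(row) && row[0] === "er" && typeof row[5] === "number") {
      return row[5];
    }
  }
  return null;
}
```

Checking for an "er" row before looking for photo data would let a client fail with a clear message rather than a parse error.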
Josee9988 commented 3 years ago

I also have the same problem. I would like to fetch more than 500 photos.

Josee9988 commented 3 years ago

@yumetodo do you know if this could be fixed? Thanks and sorry for the insistence ❤️

yumetodo commented 3 years ago

@Josee9988 I gave up investigating how to emulate the original request. However, any information is welcome.

The current workaround is to split the album so that each part contains fewer than 500 images.
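As a rough illustration of this workaround, the results fetched from each part album can be merged into one list. The sketch below assumes each result item has a `url` field and that the same photo yields the same `url` in every part; `mergeAlbumResults` is a hypothetical helper, not part of this library.

```typescript
// Hypothetical helper: concatenates per-album results, keeping the first
// occurrence of each url and preserving the original order.
function mergeAlbumResults<T extends { url: string }>(parts: T[][]): T[] {
  const seen: { [url: string]: boolean } = {};
  const merged: T[] = [];
  for (const part of parts) {
    for (const image of part) {
      if (!seen[image.url]) {
        seen[image.url] = true;
        merged.push(image);
      }
    }
  }
  return merged;
}
```

In practice one would fetch each part album with the library, collect the resulting arrays, and pass them to `mergeAlbumResults`.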

kllmanu commented 3 years ago

Actually, the current max is 300 images. I guess they changed it. At least this is what I get when I try to scrape the page with curl.