peterthehan opened this issue 4 years ago
This library gets image URLs by web scraping. On a Google Photos album page (the shared URL), the photos are rendered from data provided to the AF_initDataCallback function embedded at the foot of the page (you can check this by fetching the HTML with curl or a similar tool). This library extracts that function call with a regex.

So, if you want to speed up development, please provide the body of the AF_initDataCallback call for an album containing 500+ images (NOTE: don't paste the body if the album contains any private photos). Compare the result against https://github.com/yumetodo/google-photos-album-image-url-fetch/blob/a8fd411c90066c4005e2916437c478e05864983d/src/impl.ts#L27-L66, and let me know the difference.
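Roughly, the extraction looks like this (a minimal sketch; the patterns below are assumptions, the real regex and parsing live in src/impl.ts):

```ts
// Minimal sketch of the scraping approach; the regexes here are assumptions,
// the real extraction lives in src/impl.ts.
import fetch from "node-fetch";

export async function extractAlbumData(sharedUrl: string): Promise<unknown> {
  const html = await (await fetch(sharedUrl)).text();
  // Grab every AF_initDataCallback({...}); call embedded in the page.
  const calls = [...html.matchAll(/AF_initDataCallback\((\{[\s\S]*?\})\);/g)];
  if (calls.length === 0) throw new Error("AF_initDataCallback not found");
  // Assumption: the last call carries the big album payload.
  const body = calls[calls.length - 1][1];
  // Pull out the data: [...] array, assuming it is JSON-compatible.
  const match = /data:(\[[\s\S]*\]),\s*sideChannel/.exec(body);
  if (!match) throw new Error("data array not found");
  return JSON.parse(match[1]);
}
```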
Hmm, I don't have a large enough album whose data I feel comfortable sharing...
However, I did curl my album to get the source, and searching for AF_initDataCallback yielded 4 results; 3 of them were not really meaningful, but the 4th had a huge block of data which seems to be what the scraper is parsing. I did a quick check on the length of this data and it was exactly 500 objects (as expected).
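A quick way to reproduce that length check (a sketch using the helper above; where the photo list sits inside the data array is an assumption):

```ts
// Sketch: count the entries in the extracted payload. The photo list position
// (data[1]) is an assumption; adjust to wherever the image entries actually live.
const data = (await extractAlbumData("https://photos.app.goo.gl/<album>")) as any[];
console.log((data[1] as unknown[]).length); // reported 500 here, exactly the cap
```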
Did some quick research; maybe https://github.com/puppeteer/puppeteer could handle the infinite scrolling?
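A rough sketch of what that could look like (untested; the scroll target and timings are guesses):

```ts
import puppeteer from "puppeteer";

// Sketch: scroll the shared album page until no new content loads, then grab
// the final HTML. Everything here (the scroll target, the delay) is a guess.
async function fetchFullAlbumHtml(sharedUrl: string): Promise<string> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(sharedUrl, { waitUntil: "networkidle2" });
  let previousHeight = 0;
  for (;;) {
    const height: number = await page.evaluate(() => document.body.scrollHeight);
    if (height === previousHeight) break; // nothing new loaded, we are done
    previousHeight = height;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((r) => setTimeout(r, 1000)); // let the lazy load settle
  }
  const html = await page.content();
  await browser.close();
  return html;
}
```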
I have no plans to depend on any browser, due to execution speed and maintenance cost.
I found that input[2] is a blank string when the album contains fewer than 500 images. When more than 500 images are contained, it is a non-blank string such as AH_uQ431FY3wLG0VIxpEIeoouZFF_XXT_KGkTFY1fC89XHXgNQhhKXL1ib623N4eIK9dvpujX_V83U0WNTKV5nbHPxM_0L-6_csqRzSx07tWSxqJHtOo4I1rhRY-TeJGtpJLgpsCftZudCTE9B9X7Pa7Dwu3N7dedA0mRcwEPQJxcCk3EV-WiQE.

However, I have no idea how to use it; no suspicious connection can be found.
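So, at minimum, that slot can be used to detect that an album was truncated (a sketch; input is the extracted data array from above):

```ts
// Sketch: an album is truncated when the continuation-token slot (input[2])
// holds a non-empty string rather than a blank one.
function isTruncated(input: unknown[]): boolean {
  const token = input[2];
  return typeof token === "string" && token.length > 0;
}
```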
That is probably a nextPageToken (https://developers.google.com/photos/library/reference/rest/v1/mediaItems/search#response-body). Thanks for looking into this so far btw!
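For comparison, the documented mediaItems.search flow pages with exactly such a token (a sketch of the official Library API; it requires OAuth, which is what the shared-album scraping avoids):

```ts
// Sketch: pagination with the official Google Photos Library API, shown only
// to illustrate the nextPageToken loop. Needs an OAuth access token.
async function listAllMediaItems(albumId: string, accessToken: string) {
  const items: unknown[] = [];
  let pageToken: string | undefined;
  do {
    const res = await fetch(
      "https://photoslibrary.googleapis.com/v1/mediaItems:search",
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${accessToken}`,
          "Content-Type": "application/json",
        },
        // pageSize is capped at 100 by the API; undefined pageToken is dropped.
        body: JSON.stringify({ albumId, pageSize: 100, pageToken }),
      }
    );
    const json = await res.json();
    items.push(...(json.mediaItems ?? []));
    pageToken = json.nextPageToken;
  } while (pageToken);
  return items;
}
```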
If that is a nextPageToken, there should be a request using it. But I cannot find it.
I found a suspicious request, but I have given up trying to emulate it. Here is what I captured and tried:
req_body.txt
f.req=[[["snAcKc","[\"AF1QipMgHGuJRRQa_sIbtZWMnVDRRID1eogDnfC73_4oSPl_yWMqqnEES8cEVYQqs2nmyw\",\"AH_uQ431FY3wLG0VIxpEIeoouZFF_XXT_KGkTFY1fC89XHXgNQhhKXL1ib623N4eIK9dvpujX_V83U0WNTKV5nbHPxM_0L-6_csqRzSx07tWSxqJHtOo4I1rhRY-TeJGtpJLgpsCftZudCTE9B9X7Pa7Dwu3N7dedA0mRcwEPQJxcCk3EV-WiQE\",null,\"WHVZVTJwTEYxNDNiVXkwQS1zbTYxYVFYMktGSktB\",null]",null,"generic"]]]
request.sh
#!/bin/bash
# Replay the captured payload against the batchexecute endpoint (verbose).
curl -X POST -H "Origin: https://photos.google.com" --data-urlencode @req_body.txt https://photos.google.com/_/PhotosUi/data/batchexecute -v
$ ./request.sh
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying 216.58.197.206:443...
* Connected to photos.google.com (216.58.197.206) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: C:/msys64/mingw64/ssl/certs/ca-bundle.crt
CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
* subject: C=US; ST=California; L=Mountain View; O=Google LLC; CN=*.google.com
* start date: Apr 1 12:58:27 2020 GMT
* expire date: Jun 24 12:58:27 2020 GMT
* subjectAltName: host "photos.google.com" matched cert's "*.google.com"
* issuer: C=US; O=Google Trust Services; CN=GTS CA 1O1
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0xf31f40)
> POST /_/PhotosUi/data/batchexecute HTTP/2
> Host: photos.google.com
> user-agent: curl/7.69.1
> accept: */*
> origin: https://photos.google.com
> content-length: 425
> content-type: application/x-www-form-urlencoded
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* We are completely uploaded and fine
< HTTP/2 400
< content-type: application/json; charset=utf-8
< cache-control: no-cache, no-store, max-age=0, must-revalidate
< pragma: no-cache
< expires: Mon, 01 Jan 1990 00:00:00 GMT
< date: Wed, 22 Apr 2020 09:18:48 GMT
< x-content-type-options: nosniff
< p3p: CP="This is not a P3P policy! See g.co/p3phelp for more info."
< strict-transport-security: max-age=31536000
< server: ESF
< x-xss-protection: 0
< x-frame-options: SAMEORIGIN
< set-cookie: NID=202=Uuq_pZxNarjRNLznKRc-fNU64P0xODlLDoJGNamc-wh8hSJnX4IriPOM8NVXPOl98tr6CirEMgDDRgpEGi21_H9oEKrX2U23PxccvKCdIHV7H1HMuLOZsVP18SRj7zid55zCEe5YIc7UuP39enfI3qBczHAvY40km6weG_M-NCw; expires=Thu, 22-Oct-2020 09:18:48 GMT; path=/; domain=.google.com; HttpOnly
< alt-svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,h3-T050=":443"; ma=2592000
< accept-ranges: none
< vary: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site,Accept-Encoding
<
)]}'
[["er",null,null,null,null,400,null,null,null,3]
,["di",18]
,["af.httprm",17,"-9208582034154782081",50]
]* Connection #0 to host photos.google.com left intact
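For the record, the equivalent request from Node (a sketch mirroring the curl above, which returned HTTP 400; my guess is that batchexecute also wants query parameters such as rpcids/f.sid/bl plus an "at" XSRF token scraped from the page, but those names are assumptions, not verified here):

```ts
// Sketch: replay the captured snAcKc call from Node. This mirrors the curl
// above (which got HTTP 400). Guess: the endpoint additionally expects query
// params (rpcids, f.sid, bl, ...) and an "at" XSRF token (unverified).
async function tryBatchExecute(albumId: string, continuationToken: string): Promise<string> {
  const inner = JSON.stringify([
    albumId,
    continuationToken,
    null,
    "WHVZVTJwTEYxNDNiVXkwQS1zbTYxYVFYMktGSktB", // opaque value from the captured body
    null,
  ]);
  const fReq = JSON.stringify([[["snAcKc", inner, null, "generic"]]]);
  const res = await fetch("https://photos.google.com/_/PhotosUi/data/batchexecute", {
    method: "POST",
    headers: {
      Origin: "https://photos.google.com",
      "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    },
    body: new URLSearchParams({ "f.req": fReq }).toString(),
  });
  // batchexecute responses start with the anti-XSSI prefix )]}' before the JSON.
  return `${res.status}\n${await res.text()}`;
}
```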
I also have the same problem. I would like to fetch more than 500 photos.
@yumetodo do you know if this could be fixed? Thanks and sorry for the insistence ❤️
@Josee9988 I gave up investigating how to emulate the original request. However, any information is welcome.
The current workaround is to split the album so each part stays under 500 images.
Actually, the current max is 300 images; I guess they changed it. At least, this is what I get when I try to scrape the page with curl.
I tried to fetch from an album containing ~6000 images, and this API seems to cut off at 500 (probably due to the way Photos lazy-loads the images). Is there a way to fetch these in chunks at an interval to retrieve the full album?