b51c7797d838c8093a84f4ce080f65313304c62f should address this. Added per-query and per-page voluntary throttling, rotation across multiple accounts, and auth cookie handling.
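Roughly, the throttling and rotation look like the sketch below. This is only illustrative: the account pool, login endpoint, and sleep intervals are placeholders, not what is actually in b51c779.

```python
import itertools
import random
import time

import requests

# Hypothetical account pool; real credentials live outside the repo.
ACCOUNTS = [("user1@example.com", "pw1"), ("user2@example.com", "pw2")]
account_cycle = itertools.cycle(ACCOUNTS)

PER_QUERY_SLEEP = (2, 5)   # seconds, illustrative
PER_PAGE_SLEEP = (5, 10)   # seconds, illustrative


def throttled_get(session, url, per_page=False):
    """GET with a voluntary sleep before each request (per-query or per-page)."""
    lo, hi = PER_PAGE_SLEEP if per_page else PER_QUERY_SLEEP
    time.sleep(random.uniform(lo, hi))
    return session.get(url)


def new_session():
    """Rotate to the next account and build a session carrying its auth cookie."""
    user, password = next(account_cycle)
    session = requests.Session()
    # Placeholder endpoint: the actual auth flow / cookie handling differs.
    session.post("https://stockx.com/api/login",
                 json={"email": user, "password": password})
    return session
```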
It appears that after one get_details is blocked, subsequent auth requests keep failing indefinitely without human intervention. Needs more investigation.
With aggressive throttling we can mostly get through the current list. When we eventually get stuck after all the AJs, manually reloading cookies appears to help.
They use a third-party solution called PerimeterX. Needs some targeted research.
There does not appear to be an easy fix for PerimeterX. We added significant self-throttling but still weren't able to get through the entire search list: after an extended run a get_details would 403, and all subsequent get_details would 403 as well, until manual intervention in a browser to click "I'm a human". The fact that our browser requests get blocked too suggests this is IP-based blocking.
Selenium was not able to help click that button: when simulating the click with Selenium, additional reCAPTCHA checks popped up.
As a start, we should make sure not to duplicate queries. Then we should consider spreading our requests out more.
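A minimal sketch of the dedup-plus-spreading idea (the gap and jitter values are made up; `do_query` stands in for the actual query call):

```python
import random
import time


def run_queries(queries, do_query, base_gap=30, jitter=15):
    """Run each unique query once, spaced out with a randomized gap."""
    seen = set()
    for q in queries:
        if q in seen:  # skip duplicate queries outright
            continue
        seen.add(q)
        do_query(q)
        # Spread requests out: base gap plus random jitter (values illustrative).
        time.sleep(base_gap + random.uniform(0, jitter))
```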
Note that the API endpoint we are using is not the one in the official repo (https://github.com/stockx/PublicAPI). The endpoint there appears to require an API key that is only available to level-4 sellers. The API listed there also seems incomplete for our use case: transaction history, for example, is not available.
Without knowing PerimeterX's mechanism, the best thing to try now could be a fleet of IP addresses, activated at different times of day.
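A rough sketch of what rotating a proxy fleet by time of day could look like (the proxy list and schedule are hypothetical):

```python
import datetime

import requests

# Hypothetical fleet; in practice these would come from a proxy provider.
PROXY_FLEET = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
    "http://proxy-c.example.com:8080",
]


def session_for_now():
    """Pick a proxy based on the current hour so each IP is only active for part of the day."""
    hour = datetime.datetime.now().hour
    proxy = PROXY_FLEET[hour % len(PROXY_FLEET)]
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    return session
```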
It would appear that throttle time and different logins don't help. Standard query keyword setup,
Each item right now is 4 requests.
We could try
This is an IP-based block: once 403'ed, other devices behind the same NAT also have to go through the captcha.
We were not blocked in the last scrape on 07/27. Presumably the block was lifted? Closing for now.
This has been observed again since feedv2 on 20191222. Presumably the new architecture could help; need to implement and test.
This is observed in both update and query modes. The current workaround is shell scripts that limit how many items we update each time.
If we breach that limit we become temporarily blocked for about 30 minutes, with no human intervention needed to recover. Otherwise it seems we can just sleep for 60s and keep going. This is not as harsh as the previous iteration.
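Something like the following retry loop captures this (the 60s sleep and ~30min cooldown are the values observed above; `fetch` is a stand-in for the actual request call):

```python
import time

SOFT_SLEEP = 60          # seconds to sleep on an ordinary 403
HARD_COOLDOWN = 30 * 60  # observed ~30 min temporary block


def fetch_with_backoff(fetch, url, max_hard_waits=1):
    """Retry a request: sleep 60s on a 403, and wait out the ~30min block if it persists."""
    hard_waits = 0
    while True:
        resp = fetch(url)
        if resp.status_code != 403:
            return resp
        time.sleep(SOFT_SLEEP)
        resp = fetch(url)
        if resp.status_code != 403:
            return resp
        # Still blocked after the soft sleep: assume we breached the limit.
        if hard_waits >= max_hard_waits:
            raise RuntimeError("still 403 after waiting out the cooldown")
        hard_waits += 1
        time.sleep(HARD_COOLDOWN)
```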
One problem is that a script may never finish updating everything, due to how the limit interacts with requests that didn't error out with 403.
The problem has since been addressed and 403 on stockx no longer seems to be a major blocker.
stockx appears to be one of those sites that constantly upgrade their anti-bot mechanism. On 06/02/19 my auth requests got through if they had User-Agent set. On 06/09/19 I had to add Referer, Origin, and Content-Type. On 06/12/19 I had to add these to get_details requests as well, and I still get 403 after the first few requests. As a short-term solution, perhaps a rate limit or multiple sources will do.
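For concreteness, this is roughly the header set that was needed as of 06/12/19 (the User-Agent string is just an example):

```python
import requests

# Headers the auth and get_details requests needed as of 06/12/19.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
    "Referer": "https://stockx.com/",
    "Origin": "https://stockx.com",
    "Content-Type": "application/json",
})
```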
The goal of this is to be able to keep scraping stockx without interruption. I can think of
I believe they ultimately want people to use their API, but what I'm doing now is probably too brutal.
@djian618