b51c7797d838c8093a84f4ce080f65313304c62f should address this. Added per-query and per-page voluntary throttling, rotation across multiple accounts, and auth cookie handling.
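Roughly, the throttling and rotation look like the sketch below. This is only illustrative: the account pool, login endpoint, and sleep intervals are placeholders, not what is actually in b51c779.

```python
import itertools
import random
import time

import requests

# Hypothetical account pool; real credentials live outside the repo.
ACCOUNTS = [("user1@example.com", "pw1"), ("user2@example.com", "pw2")]
account_cycle = itertools.cycle(ACCOUNTS)

PER_QUERY_SLEEP = (2, 5)   # seconds, illustrative
PER_PAGE_SLEEP = (5, 10)   # seconds, illustrative


def throttled_get(session, url, per_page=False):
    """GET with a voluntary sleep before each request (per-query or per-page)."""
    lo, hi = PER_PAGE_SLEEP if per_page else PER_QUERY_SLEEP
    time.sleep(random.uniform(lo, hi))
    return session.get(url)


def new_session():
    """Rotate to the next account and build a session carrying its auth cookie."""
    user, password = next(account_cycle)
    session = requests.Session()
    # Placeholder endpoint: the actual auth flow / cookie handling differs.
    session.post("https://stockx.com/api/login",
                 json={"email": user, "password": password})
    return session
```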
It appears that after one get_details is blocked, subsequent auth requests keep failing indefinitely without human intervention. Needs more investigation.
With aggressive throttling we can mostly get through the current list. When we eventually get stuck after all the AJs, manually reloading cookies appears to help.
They use a third-party solution called PerimeterX. Needs some targeted research.
There does not appear to be an easy fix for PerimeterX. We added significant self-throttling but still weren't able to get through the entire search list: after an extended run a get_details would 403, and all subsequent get_details would 403 as well, until manual intervention in a browser to click "I'm a human". The fact that our browser requests get blocked too suggests this is IP-based blocking.
Selenium was not able to help click that button: when simulating the click with Selenium, additional reCAPTCHA checks popped up.
As a start, we should make sure not to duplicate queries. Then we should consider spreading our requests out more.
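A minimal sketch of the dedup-plus-spreading idea (the gap and jitter values are made up; `do_query` stands in for the actual query call):

```python
import random
import time


def run_queries(queries, do_query, base_gap=30, jitter=15):
    """Run each unique query once, spaced out with a randomized gap."""
    seen = set()
    for q in queries:
        if q in seen:  # skip duplicate queries outright
            continue
        seen.add(q)
        do_query(q)
        # Spread requests out: base gap plus random jitter (values illustrative).
        time.sleep(base_gap + random.uniform(0, jitter))
```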
Note that the API endpoint we are using is not the one in the official repo (https://github.com/stockx/PublicAPI). The endpoint there appears to require an API key that is only available to level-4 sellers. The API listed there also seems incomplete for our use case: transaction history, for example, is not available.
Without knowing PerimeterX's mechanism, the best thing to try now could be a fleet of IP addresses, activated at different times of day.
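A rough sketch of what rotating a proxy fleet by time of day could look like (the proxy list and schedule are hypothetical):

```python
import datetime

import requests

# Hypothetical fleet; in practice these would come from a proxy provider.
PROXY_FLEET = [
    "http://proxy-a.example.com:8080",
    "http://proxy-b.example.com:8080",
    "http://proxy-c.example.com:8080",
]


def session_for_now():
    """Pick a proxy based on the current hour so each IP is only active for part of the day."""
    hour = datetime.datetime.now().hour
    proxy = PROXY_FLEET[hour % len(PROXY_FLEET)]
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    return session
```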
It would appear that throttle time and different logins don't help. Standard query keyword setup,
Each item right now is 4 requests.
We could try
This is an IP-based block: once 403'ed, other devices behind the same NAT also have to go through the captcha.
We were not blocked in the last scrape on 07/27. Presumably the block was lifted? Closing for now.
This has been observed again since feedv2 on 20191222. Presumably the new architecture could help; need to implement and test.
This is observed in both update and query modes. The current workaround is shell scripts that limit how many items we update each time.
If we breach that limit we become temporarily blocked for about 30 minutes, with no human intervention needed to recover. Otherwise it seems we can just sleep for 60s and keep going. This is not as harsh as the previous iteration.
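Something like the following retry loop captures this (the 60s sleep and ~30min cooldown are the values observed above; `fetch` is a stand-in for the actual request call):

```python
import time

SOFT_SLEEP = 60          # seconds to sleep on an ordinary 403
HARD_COOLDOWN = 30 * 60  # observed ~30 min temporary block


def fetch_with_backoff(fetch, url, max_hard_waits=1):
    """Retry a request: sleep 60s on a 403, and wait out the ~30min block if it persists."""
    hard_waits = 0
    while True:
        resp = fetch(url)
        if resp.status_code != 403:
            return resp
        time.sleep(SOFT_SLEEP)
        resp = fetch(url)
        if resp.status_code != 403:
            return resp
        # Still blocked after the soft sleep: assume we breached the limit.
        if hard_waits >= max_hard_waits:
            raise RuntimeError("still 403 after waiting out the cooldown")
        hard_waits += 1
        time.sleep(HARD_COOLDOWN)
```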
One problem is that a script may never finish updating everything, due to how the limit interacts with requests that didn't error out with 403.
The problem has since been addressed and 403 on stockx no longer seems to be a major blocker.
stockx appears to be one of those sites that constantly upgrade their anti-bot mechanism. On 06/02/19 my auth requests got through if they had User-Agent set. On 06/09/19 I had to add Referer, Origin, and Content-Type. On 06/12/19 I had to add these to get_details requests as well, and I still get 403 after the first few requests. As a short-term solution, perhaps a rate limit or multiple sources will do.
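For concreteness, this is roughly the header set that was needed as of 06/12/19 (the User-Agent string is just an example):

```python
import requests

# Headers the auth and get_details requests needed as of 06/12/19.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36",
    "Referer": "https://stockx.com/",
    "Origin": "https://stockx.com",
    "Content-Type": "application/json",
})
```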
The goal of this is to be able to keep scraping stockx without interruption. I can think of
I believe they ultimately want people to use their API, but what I'm doing now is probably too brutal.
@djian618