yasserg / crawler4j

Open Source Web Crawler for Java
Apache License 2.0

Amazon API spider #40

Open frankskywalker opened 9 years ago

frankskywalker commented 9 years ago

Hi yasserg,

crawler4j helps me get data from the Amazon API. There are a great many URLs to query, so first I add 2*numberOfCrawlers seeds and call Controller.startNonBlocking(), then I add another URL each time MyCrawler.visit() is called. But every time, crawler4j stops at a random task, maybe the 180th, maybe the 1000th. There are still other seeds left, but it seems the threads have all died.

So, what happened? Can you help me?

laxika commented 9 years ago

Didn't Amazon ban you? I had to work on an Amazon crawling project (not with crawler4j) 2-3 weeks ago at my last workplace. You should query the API with 1-second delays, otherwise it will ban you. Querying amazon.com directly is even worse. We used 50+ proxies and 30-second delays before we could finally crawl everything. It took a long time, but without the delay the whole proxy list got banned after 30-40 minutes and we had to wait 2 days to be unbanned.

https://affiliate-program.amazon.com/gp/advertising/api/detail/faq.html
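
For illustration, here is a minimal sketch of the fixed spacing between requests suggested above. It uses only the JDK and is not part of crawler4j; the class name `FixedDelayThrottle` and its `acquire()` method are hypothetical names made up for this example. The idea is that all crawler threads share one throttle, and each one waits for its slot before sending a request.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not a crawler4j class): spaces calls to acquire()
 * at least one fixed interval apart, even when many threads share it.
 */
final class FixedDelayThrottle {
    private final long minIntervalNanos;
    private long nextAllowedNanos;

    FixedDelayThrottle(long minInterval, TimeUnit unit) {
        this.minIntervalNanos = unit.toNanos(minInterval);
        this.nextAllowedNanos = System.nanoTime();
    }

    /**
     * Blocks until the caller's slot arrives and returns the nanoseconds
     * it had to wait. If the thread is interrupted while sleeping, the
     * interrupt flag is restored and the method returns early.
     */
    long acquire() {
        long waitNanos;
        synchronized (this) {
            long now = System.nanoTime();
            // Compare by subtraction, as recommended for System.nanoTime().
            long slot = (nextAllowedNanos - now > 0) ? nextAllowedNanos : now;
            nextAllowedNanos = slot + minIntervalNanos; // reserve the next slot
            waitNanos = slot - now;
        }
        if (waitNanos > 0) {
            try {
                // Sleep outside the lock so other threads can reserve slots.
                TimeUnit.NANOSECONDS.sleep(waitNanos);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return waitNanos;
    }
}
```

With a shared `new FixedDelayThrottle(1, TimeUnit.SECONDS)`, calling `acquire()` right before each API request would keep the whole crawler to roughly one request per second regardless of the number of threads. If you stay within crawler4j, its `CrawlConfig.setPolitenessDelay` setting (in milliseconds) may already cover the same need.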