wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

html-metadata (preq) being blocked by DDOS Arrest #26

Closed: akocmark closed this issue 8 years ago

akocmark commented 8 years ago

Hi guys, thank you for this wonderful module.

I'd like to ask for help with a DDoS protection issue. It seems that this module can't get through some sites that use DDoS protection (DDOS Arrest), like this Gulf News page: http://gulfnews.com/news/uae/health/health-authority-launches-campaign-for-safe-disposal-of-expired-medicines-1.1637255

Is there any way around this?

Thanks, Mark

mvolz commented 8 years ago

We don't use the request bit of this library ourselves, only the methods once the cheerio object has been loaded, but we don't seem to have a problem with that site in citoid (https://github.com/wikimedia/citoid/blob/master/lib/Scraper.js). It might be because we use cookies in citoid?

mvolz commented 8 years ago

Instead of the url as the first argument, you can also pass an options object like with the request library: https://github.com/request/request#requestoptions-callback

So in this options object you can put the url, a cookie jar, a user-agent string, etc. Some websites might block the request if it is made without a user-agent string; that could also be the issue.
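A rough sketch of what such an options object could look like, in case it helps. The helper name `buildScrapeOptions` and the user-agent value are made up for illustration; `url`, `jar` and `headers` are option names from the request library, and whether all of them are honoured depends on the version of html-metadata and its HTTP client.

```javascript
// Hypothetical helper (not part of html-metadata): builds a request-style
// options object to pass in place of the plain URL string.
function buildScrapeOptions(url, userAgent) {
  return {
    url: url,
    // In the request library, `jar: true` asks it to remember cookies.
    jar: true,
    headers: {
      // Some sites reject requests that arrive without a user-agent.
      'User-Agent': userAgent
    }
  };
}

const options = buildScrapeOptions(
  'http://gulfnews.com/news/uae/health/health-authority-launches-campaign-for-safe-disposal-of-expired-medicines-1.1637255',
  'my-metadata-scraper/1.0' // placeholder value, use your own
);

console.log(options);
```

With html-metadata installed, this object would go where the url normally goes, e.g. `scrape(options, function (error, metadata) { ... })`.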

akocmark commented 8 years ago

Hi mvolz!

Thank you for the quick response. The user-agent header did the trick! Thank you so much!