wikimedia / html-metadata

MetaData html scraper and parser for Node.js (supports Promises and callback style)
MIT License

Not working on nytimes.com #59

Closed touren closed 7 years ago

touren commented 7 years ago

Hi, I tried to parse this page: http://www.nytimes.com/2017/04/07/world/middleeast/syria-attack-trump.html

I got the following error:

(node:68496) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
(node:68496) Warning: Possible EventEmitter memory leak detected. 11 pipe listeners added. Use emitter.setMaxListeners() to increase limit
Unhandled rejection Error: Exceeded maxRedirects. Probably stuck in a redirect loop https://www.nytimes.com/glogin?URI=https%3A%2F%2Fwww.nytimes.com%2F2017%2F04%2F07%2Fworld%2Fmiddleeast%2Fsyria-attack-trump.html%3F_r%3D4
    at Redirect.onResponse (/Users/Tao/Work/Ludlow/www/ludlow-web/node_modules/request/lib/redirect.js:98:27)
    at Request.onRequestResponse (/Users/Tao/Work/Ludlow/www/ludlow-web/node_modules/request/request.js:917:22)
    at emitOne (events.js:96:13)
    at ClientRequest.emit (events.js:188:7)
    at HTTPParser.parserOnIncomingClient [as onIncoming] (_http_client.js:474:21)
    at HTTPParser.parserOnHeadersComplete (_http_common.js:99:23)
    at TLSSocket.socketOnData (_http_client.js:363:20)
    at emitOne (events.js:96:13)
    at TLSSocket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at TLSSocket.Readable.push (_stream_readable.js:134:10)
    at TLSWrap.onread (net.js:548:20)

I probably need to set some cookies to break the redirect loop.

achingbrain commented 7 years ago

You need to set request's jar parameter to true to enable cookies:

const scraper = require('html-metadata')

scraper({
  url: 'http://www.nytimes.com/2017/04/07/world/middleeast/syria-attack-trump.html',
  jar: true
}, (error, metadata) => {
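  // report a failed request (e.g. a redirect loop), otherwise use the metadata
  if (error) return console.error(error)
  console.log(metadata)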
})

mvolz commented 7 years ago

Did @achingbrain's suggestion work for you?

We let users set their own options object, since some people want a new cookie jar for every request and others want to reuse the same one, etc. See the docs under "options". Basically, we just pass the options object on to the request library: https://github.com/request/request#requestoptions-callback
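To illustrate the two cookie-jar strategies mentioned above, here is a rough sketch rather than anything built into html-metadata. The helper name `withCookies` and its `makeJar` parameter are hypothetical, made up for this example. The only thing it relies on is that the options object, including `jar`, is handed through to request; `jar: true` and `request.jar()` are request's own options.

```javascript
// Hypothetical helper, not part of html-metadata: wraps a scrape function so
// that every call carries a cookie jar in its options object.
//   scrape  - a function taking an options object, e.g. require('html-metadata')
//   makeJar - optional function returning the jar to use for each call;
//             when omitted, `jar: true` is sent (request's default cookie jar)
function withCookies(scrape, makeJar) {
  return function (url, extraOptions = {}) {
    const jar = makeJar ? makeJar() : true;
    // url and jar are placed last so they win over anything in extraOptions
    return scrape({ ...extraOptions, url, jar });
  };
}

// Assumed usage (requires the html-metadata and request packages):
//
//   const scrape = require('html-metadata');
//   const request = require('request');
//
//   // One jar reused across calls: cookies set by one page are sent to the next.
//   const sharedJar = request.jar();
//   const scrapeShared = withCookies(scrape, () => sharedJar);
//
//   // A fresh jar per call: cookies are kept only for that request's redirects.
//   const scrapeFresh = withCookies(scrape, () => request.jar());
//
//   scrapeFresh('http://www.nytimes.com/2017/04/07/world/middleeast/syria-attack-trump.html')
//     .then((metadata) => console.log(metadata))
//     .catch((error) => console.error(error));
```

A fresh jar per call keeps requests isolated from each other, while a shared jar behaves more like a single browser session; which one fits depends on the sites being scraped.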

Closing, feel free to reopen if you still have issues :).

touren commented 7 years ago

It works. Thank you guys.