Closed: Germminate closed this issue 3 years ago
Hi @s0ph1e,
Thank you for your response.
I am unable to exclude them using urlFilter, as they are downloaded from the hrefs in the .css files.
I have another question: how can the port of a specific API (website-scraper) instance be closed after it is done (say, if I have 100 parallel instances running)?
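For context, the parallel part of my script is roughly like the sketch below; the url list, output directories, and plugin options are simplified placeholders, not my real values.

```js
const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

// Placeholder list; the real script reads thousands of urls per batch.
const urls = ['https://example.com/page1', 'https://example.com/page2'];

async function scrapeAllInParallel() {
  // Every scrape() call is started at once, and each one launches
  // its own Puppeteer browser via the plugin.
  await Promise.all(urls.map((url, i) =>
    scrape({
      urls: [url],
      directory: `./output/${i}`, // website-scraper needs a directory that does not exist yet
      plugins: [new PuppeteerPlugin()],
    })
  ));
}

scrapeAllInParallel().catch(console.error);
```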
Right now, I am facing the below issue:
Extracting batch of 5000 urls ...
events.js:353
throw er; // Unhandled 'error' event
^
Error: read ENOTCONN
at tryReadStart (net.js:574:20)
at Socket._read (net.js:585:5)
at Socket.Readable.read (internal/streams/readable.js:481:10)
at Socket.read (net.js:625:39)
at new Socket (net.js:377:12)
at Object.Socket (net.js:269:41)
at createSocket (internal/child_process.js:314:14)
at ChildProcess.spawn (internal/child_process.js:435:23)
at spawn (child_process.js:577:9)
at Object.spawnWithSignal [as spawn] (child_process.js:714:17)
at BrowserRunner.start (/home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/node_modules/puppeteer/lib/Launcher.js:77:30)
at ChromeLauncher.launch (/home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/node_modules/puppeteer/lib/Launcher.js:242:12)
at async /home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/lib/index.js:21:19
at async Scraper.runActions (/home/local/KLASS/germaine.tan/Desktop/gitlab/scraper/node_modules/website-scraper/lib/scraper.js:228:14)
Emitted 'error' event on Socket instance at:
at emitErrorNT (internal/streams/destroy.js:106:8)
at emitErrorCloseNT (internal/streams/destroy.js:74:3)
at processTicksAndRejections (internal/process/task_queues.js:82:21) {
errno: -107,
code: 'ENOTCONN',
syscall: 'read'
}
My script simply passes a list of urls to your API and calls the scrape function.
If I run them one by one, this error doesn't occur.
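One thing I plan to try is capping how many scrape() calls (and therefore Chromium launches) run at the same time, for example by processing the list in small chunks. This is only a sketch and the chunk size is arbitrary.

```js
const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

// Process urls in small chunks so only a few Chromium instances run at a time.
async function scrapeInChunks(urls, chunkSize = 5) {
  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize);
    // Wait for the whole chunk to finish (and its browsers to exit)
    // before starting the next chunk.
    await Promise.allSettled(chunk.map((url, j) =>
      scrape({
        urls: [url],
        directory: `./output/${i + j}`, // placeholder; must not exist yet
        plugins: [new PuppeteerPlugin()],
      })
    ));
  }
}
```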
Hi @Germminate
urlFilter should work fine with urls from css files. If it doesn't, then it looks like a bug; please open an issue in https://github.com/website-scraper/node-website-scraper/issues

Hi Sophie,
Thanks. It works with parallel instances; it was my URLs that were problematic.
Hello, I tested the repository using the below script:
It returned the font files as well. How do I save a webpage without saving all the font files?
Edit: In fact, after testing, defining the subdirectories and sources doesn't restrict the scrape to just the stated extension types.
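For what it's worth, urlFilter (which s0ph1e pointed to above) is described as being called for every url, so something like the sketch below might skip the font downloads; the extension list and urls are just examples, not a confirmed recipe.

```js
const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

// Example pattern for common font extensions (with an optional query string).
const fontExtensions = /\.(woff2?|ttf|otf|eot)(\?.*)?$/i;

scrape({
  urls: ['https://example.com'],   // placeholder url
  directory: './output/no-fonts',  // must not exist yet
  plugins: [new PuppeteerPlugin()],
  // urlFilter is called for each url found during scraping;
  // returning false means that resource is not downloaded.
  urlFilter: (url) => !fontExtensions.test(url),
}).then(() => console.log('done')).catch(console.error);
```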