website-scraper / website-scraper-puppeteer

Plugin for website-scraper which returns html for dynamic websites using puppeteer
MIT License

Excluding font files #29

Closed: Germminate closed this issue 3 years ago

Germminate commented 3 years ago

Hello, I tested the plugin using the script below:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer'); // imports added for completeness

scrape({
    urls: [
        '/site/url'
    ],
    directory: 'path/to/dirt',
    subdirectories: [
        {
            directory: 'binaries',
            extensions: ['.jpg', '.png', '.svg', '.jpeg', '.mp3', '.mp4', '.wav']
        },
        {
            directory: 'js',
            extensions: ['.js']
        },
        {
            directory: 'css',
            extensions: ['.css']
        },
    ],
    sources: [
        {
            selector: 'img',
            attr: 'src'
        },
        {
            selector: 'audio',
            attr: 'src'
        },
        {
            selector: 'video',
            attr: 'src'
        },
        {
            selector: 'link[rel="stylesheet"]',
            attr: 'href'
        },
        {
            selector: 'script',
            attr: 'src'
        }
    ],
    plugins: [ 
        new PuppeteerPlugin({
          scrollToBottom: { timeout: 10000, viewportN: 10 }, /* optional */
          blockNavigation: true, /* optional */
        })
      ]
}).then(function (result) {
    // Outputs HTML 
    // console.log(result);
    console.log("Content succesfully downloaded");
}).catch(function (err) {
    console.log(err);
});

It returned the font files as well. How do I save a webpage without saving all the font files?

Edit: after further testing, defining subdirectories and sources does not restrict the scrape to just the listed extension types; other resource types are still downloaded.

s0ph1e commented 3 years ago

Hi @Germminate

It depends on how these fonts are loaded. I suggest trying the urlFilter option to exclude the font URLs; see the sketch below.

Hope it helps
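
[Editor's note] A minimal sketch of the urlFilter suggestion, assuming the fonts are fetched by extension-recognizable URLs. urlFilter is a documented website-scraper option (return false to skip a URL); the font extensions in the regex are an assumption, so adjust them to whatever the site actually requests:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

scrape({
    urls: ['/site/url'],
    directory: 'path/to/dir',
    // Reject any resource whose URL ends with a common font extension
    // (optionally followed by a query string).
    urlFilter: (url) => !/\.(woff2?|ttf|otf|eot)(\?.*)?$/i.test(url),
    plugins: [new PuppeteerPlugin()]
}).then(() => console.log('Done'))
  .catch((err) => console.log(err));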

Germminate commented 3 years ago

Hi @s0ph1e, thank you for your response. I am unable to exclude the fonts using urlFilter, because they are referenced from hrefs inside the downloaded .css files.

I have another question: how can the port opened by a specific API (website-scraper) instance be closed after it finishes (say, if I have 100 parallel instances running)?

Right now, I am facing the below issue:

Extracting batch of 5000 urls ...
events.js:353
      throw er; // Unhandled 'error' event
      ^

Error: read ENOTCONN
    at tryReadStart (net.js:574:20)
    at Socket._read (net.js:585:5)
    at Socket.Readable.read (internal/streams/readable.js:481:10)
    at Socket.read (net.js:625:39)
    at new Socket (net.js:377:12)
    at Object.Socket (net.js:269:41)
    at createSocket (internal/child_process.js:314:14)
    at ChildProcess.spawn (internal/child_process.js:435:23)
    at spawn (child_process.js:577:9)
    at Object.spawnWithSignal [as spawn] (child_process.js:714:17)
    at BrowserRunner.start (/home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/node_modules/puppeteer/lib/Launcher.js:77:30)
    at ChromeLauncher.launch (/home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/node_modules/puppeteer/lib/Launcher.js:242:12)
    at async /home/local/KLASS/germaine.tan/node_modules/website-scraper-puppeteer/lib/index.js:21:19
    at async Scraper.runActions (/home/local/KLASS/germaine.tan/Desktop/gitlab/scraper/node_modules/website-scraper/lib/scraper.js:228:14)
Emitted 'error' event on Socket instance at:
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  errno: -107,
  code: 'ENOTCONN',
  syscall: 'read'
}

My script simply passes a list of URLs to your API and calls the scrape function.

If I run them one by one, this error doesn't occur.
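
[Editor's note] A workaround sketch (not from the maintainers) of the "one by one" approach: awaiting each scrape call in sequence means only one Puppeteer-launched Chrome process exists at a time. The helper name and output paths are hypothetical:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

// Hypothetical helper: scrapes each URL one at a time instead of in parallel.
async function scrapeSequentially(urls) {
    for (const [i, url] of urls.entries()) {
        // website-scraper requires a fresh directory for each call.
        await scrape({
            urls: [url],
            directory: `output/site-${i}`,
            plugins: [new PuppeteerPlugin()]
        });
    }
}

scrapeSequentially(['/site/url1', '/site/url2']).catch(console.log);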

s0ph1e commented 3 years ago

Hi @Germminate

  1. I believe urlFilter should work fine with URLs from CSS files. If it doesn't, that looks like a bug; please open an issue at https://github.com/website-scraper/node-website-scraper/issues
  2. As for closing the browser: it should be closed automatically after everything is done (https://github.com/website-scraper/website-scraper-puppeteer/blob/7459dbec47a2c2cc94d9cba1b79d0dabbcb75d5f/lib/index.js#L63), but I didn't test how it works with multiple parallel instances. I assume the operating system opens only one Chrome process and reuses it instead of opening 100 Chrome processes. I can suggest trying multiple URLs in the same scraper call (https://github.com/website-scraper/node-website-scraper#urls, and see the sketch below) if that's possible for your use case.
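
[Editor's note] A minimal sketch of that last suggestion, assuming a batch of pages can share one call: passing all URLs to a single scrape invocation lets the plugin launch one browser and close it when the whole batch is finished. The URLs and directory below are placeholders:

const scrape = require('website-scraper');
const PuppeteerPlugin = require('website-scraper-puppeteer');

scrape({
    // All pages in one call; one browser serves the whole batch.
    urls: [
        'http://example.com/page1',
        'http://example.com/page2'
    ],
    directory: 'path/to/save',
    plugins: [new PuppeteerPlugin()]
}).then(() => console.log('Batch done'))
  .catch((err) => console.log(err));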
Germminate commented 3 years ago

Hi Sophie,

Thanks. It works with parallel instances; it was my URLs that were problematic.