weld-io / scraping-service

REST API for scraping dynamic websites using Node.js, headless Chrome and Cheerio.
MIT License

Scraping Service (serverless)

Scraping Service is a REST API for scraping dynamic websites using Node.js, Puppeteer and Cheerio. It works in serverless environments such as Vercel.


Made by the team at Weld (www.weldyourownapp.com), the #codefree web/app creation tool.

How to Run

Start Scraping Service in development mode:

API=dom yarn dev
# you can replace `dom` with: dom-simple (just fetch, no Chromium), image, meta, page

or in production mode:

yarn start

The server listens on http://localhost:3036 by default.

Environment variables

How to Test

yarn test

How to Use

Scrape DOM

Send an HTTP GET request:

http://localhost:3036/api/dom?url=https://news.ycombinator.com&selector=.title+a

or use the simpler variant, which relies on plain Fetch instead of Chromium:

http://localhost:3036/api/dom-simple?url=https://news.ycombinator.com&selector=.title+a

Results:

{
    "time": 792,
    "results": [
      {
          "selector": ".title a",
          "count": 61,
          "items": [
            "Ask a Female Engineer: Thoughts on the Google Memo",
            (more items...)
          ]
      }
    ]
}
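
As a rough illustration, the request and response shown above could be consumed from Node.js as sketched below. The helper names (`buildDomUrl`, `collectItems`, `scrapeDom`) are hypothetical and not part of Scraping Service; the sketch assumes the service runs on the default http://localhost:3036 and that Node.js 18 or later provides the global `fetch`.

```javascript
// Base URL of a locally running Scraping Service (the default from "How to Run").
const BASE_URL = 'http://localhost:3036';

// Build the request URL; URL/URLSearchParams take care of query encoding,
// so the selector ".title a" becomes "selector=.title+a".
// Pass endpoint = 'dom-simple' for the Fetch-only variant.
function buildDomUrl(pageUrl, selector, endpoint = 'dom') {
  const requestUrl = new URL(`/api/${endpoint}`, BASE_URL);
  requestUrl.searchParams.set('url', pageUrl);
  requestUrl.searchParams.set('selector', selector);
  return requestUrl.toString();
}

// Flatten the "items" arrays of every entry in "results" into one list.
function collectItems(responseBody) {
  return (responseBody.results || []).flatMap((result) => result.items || []);
}

// Call the service and return the matched items.
// Assumption: any non-2xx status is treated as a failure.
async function scrapeDom(pageUrl, selector, endpoint = 'dom') {
  const response = await fetch(buildDomUrl(pageUrl, selector, endpoint));
  if (!response.ok) {
    throw new Error(`Scraping Service responded with HTTP ${response.status}`);
  }
  return collectItems(await response.json());
}

// Usage (with the service running locally):
// scrapeDom('https://news.ycombinator.com', '.title a').then(console.log);
```

Building the URL with `URLSearchParams` rather than string concatenation avoids manual escaping of the target URL and the selector.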

Parameters:

Scrape page content

http://localhost:3036/api/page?url=https://www.weldyourownapp.com

Results:

{
  "url": "http://www.tomsoderlund.com",
  "length": 13560,
  "content": "<html>...</html>"
}
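
The `content` field holds the page's HTML as a string, so a client can post-process it locally. The sketch below is only an illustration: `fetchPage` and `extractTitle` are hypothetical helper names, the service is assumed to run on the default port, and the regular expression is a deliberate simplification (a parser such as Cheerio would be more robust for anything beyond the title).

```javascript
// Fetch the rendered HTML of a page through a locally running Scraping Service.
// Requires Node.js 18+ for the global fetch.
async function fetchPage(pageUrl) {
  const requestUrl = new URL('/api/page', 'http://localhost:3036');
  requestUrl.searchParams.set('url', pageUrl);
  const response = await fetch(requestUrl);
  if (!response.ok) {
    throw new Error(`Scraping Service responded with HTTP ${response.status}`);
  }
  return response.json(); // { url, length, content }
}

// Pull the <title> text out of the "content" field of a /api/page response.
// Returns null when the HTML has no <title> element.
function extractTitle(pageResponse) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(pageResponse.content || '');
  return match ? match[1].trim() : null;
}

// Usage (with the service running locally):
// fetchPage('https://www.weldyourownapp.com').then((page) => console.log(extractTitle(page)));
```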

Parameters:

Scrape metadata

http://localhost:3036/api/meta?url=https://www.weldyourownapp.com

Results:

{
  "url":"https://www.weldyourownapp.com",
  "general":{
    "appleTouchIcons":[
      {
        "href":"/images/apple-touch-icon.png"
      }
    ],
    "icons":[
      {
        "href":"/images/apple-touch-icon.png"
      }
    ],
    "canonical":"http://www.weldyourownapp.com/",
    "description":"Create visual, animated, interactive content on your existing web/e-commerce platform.",
    "title":"Weld - The Visual CMS"
  },
  "openGraph":{
    "site_name":"Weld - The Visual CMS",
    "title":"Weld - The Visual CMS",
    "description":"Create visual, animated, interactive content on your existing web/e-commerce platform.",
    "locale":"en_US",
    "url":"http://www.weldyourownapp.com/",
    "image":{
      "url":"https://s3-eu-west-1.amazonaws.com/weld-design-kit/weld-logo-square.png"
    }
  },
  "twitter":{
    "title":"Weld - The Visual CMS",
    "description":"Create visual, animated, interactive content on your existing web/e-commerce platform.",
    "card":"summary",
    "url":"http://www.weldyourownapp.com/",
    "site":"@Weld_io",
    "creator":"@Weld_io",
    "image":"https://s3-eu-west-1.amazonaws.com/weld-design-kit/weld-logo-square.png"
  }
}
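
Because the same information (title, description, image) can appear under `general`, `openGraph` and `twitter`, a client usually has to decide which one to prefer. The following sketch shows one possible fallback order for building a link preview. `pickPreview` is a hypothetical helper and the priority (Open Graph, then Twitter, then general) is an assumption made for illustration, not something Scraping Service prescribes.

```javascript
// Reduce a /api/meta response to the few fields a link preview needs.
// Missing sections are tolerated; absent values come back as null.
function pickPreview(meta) {
  const general = meta.general || {};
  const openGraph = meta.openGraph || {};
  const twitter = meta.twitter || {};
  // In the sample response, openGraph.image is an object with a "url" key,
  // while twitter.image is a plain string.
  const openGraphImage = openGraph.image && openGraph.image.url;

  return {
    title: openGraph.title || twitter.title || general.title || null,
    description: openGraph.description || twitter.description || general.description || null,
    image: openGraphImage || twitter.image || null,
    url: general.canonical || openGraph.url || meta.url || null,
  };
}

// Usage (with the service running locally, Node.js 18+):
// fetch('http://localhost:3036/api/meta?url=https://www.weldyourownapp.com')
//   .then((response) => response.json())
//   .then((meta) => console.log(pickPreview(meta)));
```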

Get image

http://localhost:3036/api/image?url=https://www.weldyourownapp.com

Implementation

Built on Node.js, Express, Puppeteer, Cheerio and html-metadata.

Deploying on Vercel

See vercel.json. The service is set up as serverless API controllers.

Older: Deploying on Heroku

Stack: Heroku-18

Buildpacks:

  1. https://buildpack-registry.s3.amazonaws.com/buildpacks/jontewks/puppeteer.tgz
  2. heroku/nodejs

Heroku set-up

Set up and configure app

heroku create MYAPPNAME
heroku config:set NODE_ENV=production

Stack and Buildpacks

heroku buildpacks:add --index 1 https://buildpack-registry.s3.amazonaws.com/buildpacks/jontewks/puppeteer.tgz