
Crawling a web site with browserless, puppeteer and Node.js


An example repository and explanation for practical crawling with browserless and puppeteer.

Learn how to get started with browserless in a few easy steps!

Source code to follow along

You can get a copy of the source code from github.com/christian-fei/browserless-example, and follow these steps:

git clone https://github.com/christian-fei/browserless-example.git
cd browserless-example
npm i

npm run start-browserless
node crawl-with-api.js https://christianfei.com

Starting the browserless backend

From inside the repository christian-fei/browserless-example, run the following to start the browserless backend (requires Docker to be running):

npm run start-browserless

FYI: behind the scenes this command is used:

docker run -e "MAX_CONCURRENT_SESSIONS=10" -e "MAX_QUEUE_LENGTH=0" -p 3000:3000 --rm -it browserless/chrome

Connecting puppeteer to browserless

Instead of spinning up a chrome instance on your own machine, you can virtualize it and run it (even on another host!) in Docker, through browserless.

This can be achieved by supplying an extra parameter to puppeteer.connect (notice that we use .connect instead of .launch), namely browserWSEndpoint:

const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
}

This way, puppeteer drives a browser instance from the browserless pool. BTW, on http://localhost:3000 you can see the currently open browser sessions.

browserless sessions
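
For example, here's a minimal sketch (the target URL is just a placeholder) that opens a page through the pooled browser, prints its title and then disconnects, handing the session back to the pool:

const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
  const page = await browser.newPage()
  await page.goto('https://example.com/') // placeholder URL, use whatever page you want to inspect
  console.log(await page.title())
  await page.close()
  await browser.disconnect() // leaves the remote browser running for other sessions
}

main().catch(console.error)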

About the browserless HTTP API 👌

When you have a browserless backend running (on port 3000), you can make HTTP requests to that API, which is further documented in the official browserless docs.

You can POST to the endpoint /scrape and instruct puppeteer to extract all elements matching a given selector, and much more:

curl -X POST \
  https://chrome.browserless.io/scrape \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
  "url": "https://example.com/",
  "elements": [{
      "selector": "a"
  }]
}'

Additionally, you can take screenshots, get the whole HTML of a page, and more with the debug property:

curl -X POST \
  http://localhost:3000/scrape \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
  "url": "https://example.com/",
  "elements": [{
      "selector": "a"
  }],
  "debug": {
    "screenshot": true,
    "console": true,
    "network": true,
    "cookies": true,
    "html": true
  }
}'

The Node.js crawler

I am going to use p-queue to set up a simple in-memory queue and got for making HTTP requests.

The queue represents all scraping jobs for a given URL.

The idea is to start from the homepage and from there look for all relative links (<a> elements with an href attribute starting with /).

create the queue

  const queue = new PQueue({ concurrency: 5, timeout: 30000 })

To kick off the crawling process, add a first URL to crawl; from there on, look for further links to crawl, and so forth:

  queue.add(() => crawl(url, { baseurl, seen: new Set(), queue }))
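
Putting those two snippets together, a minimal entry point could look roughly like this (a sketch, not the repository's exact code: baseurl is assumed to simply be the start URL, and crawl is the function shown in the next section):

const PQueue = require('p-queue') // depending on the p-queue version you may need require('p-queue').default

async function main () {
  const url = process.argv[2] || 'https://example.com' // start URL, e.g. passed on the command line
  const baseurl = url.replace(/\/$/, '') // used later to make relative links absolute
  const seen = new Set() // tracks URLs that have already been queued
  const queue = new PQueue({ concurrency: 5, timeout: 30000 })

  seen.add(url)
  queue.add(() => crawl(url, { baseurl, seen, queue }))

  await queue.onIdle() // resolves once every crawling job has finished
  console.log('done, crawled', seen.size, 'unique links')
}

main().catch(console.error)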

the crawl function

The crawl function is a recursive one, whose job is to crawl more links from a single URL and add them as crawling jobs to the queue.

It makes an HTTP POST request to http://localhost:3000/scrape, scraping for relative links on the page.

async function crawl (url, { baseurl, seen = new Set(), queue }) {
  console.log('🕸   crawling', url)
  const { body } = await got.post(`http://localhost:3000/scrape`, {
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url,
      elements: [{
        selector: `a[href^="/"]`
      }],
      debug: { html: true }
    }),
    timeout: 10000
  })
  const json = JSON.parse(body)

Now you have a json object containing the results from your scraping job, e.g.:

{
  "data": [{
    "selector": "a[href^=\"/\"]",
    "results": [{
      "html":
...
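
For orientation, each entry in results also carries the element's attributes as name/value pairs, which is what the link extraction below relies on. An abridged entry (with made-up values) looks roughly like this:

{
  "data": [{
    "selector": "a[href^=\"/\"]",
    "results": [{
      "html": "<a href=\"/posts/\">Posts</a>",
      "attributes": [
        { "name": "href", "value": "/posts/" }
      ]
    }]
  }]
}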

Extract links

From data[0].results you can map all matching elements found on the page for the selector a[href^="/"] and extract all links:

  const links = json.data[0].results
    .filter(r => r.attributes.find(a => a.name === 'href'))
    .map(r => r.attributes.find(a => a.name === 'href').value)

Add new links to queue

Filter out unwanted links, make relative links absolute by prepending the baseurl, exclude links containing anchors, and finally check against the seen Set that the link hasn't been crawled before:

  links
    .filter(link => !link.startsWith('//'))
    .map(link => link.startsWith(baseurl) ? link : `${baseurl.replace(/\/$/, '')}${link}`)
    .filter(link => !/#.*$/.test(link))
    .filter(l => !seen.has(l))
    .forEach(l => {
      seen.add(l)
      queue.add(() => crawl(l, { baseurl, seen, queue }))
    })

Each new link is first added to the seen Set, so it won't be crawled more than once, and then added to the crawling queue as a new job.


Full source code on GitHub

Fork / clone it from christian-fei/browserless-example,
with Docker installed and running:

git clone https://github.com/christian-fei/browserless-example.git
cd browserless-example
npm i

npm run start-browserless
node crawl-with-api.js https://christianfei.com

Here, have a slice of pizza 🍕