Example repository and explanation for practical crawling with browserless and puppeteer.
Learn how to get started with browserless in a few easy steps!
Source code to follow along
You can get a copy of the source code from github.com/christian-fei/browserless-example, and follow these steps:
git clone https://github.com/christian-fei/browserless-example.git
cd browserless-example
npm i
npm run start-browserless
node crawl-with-api.js https://christianfei.com
Starting the browserless backend
From inside the repository christian-fei/browserless-example, run the following to start the browserless backend (requires Docker to be running):
npm run start-browserless
FYI: behind the scenes this command is used: docker run -e "MAX_CONCURRENT_SESSIONS=10" -e "MAX_QUEUE_LENGTH=0" -p 3000:3000 --rm -it browserless/chrome
Connecting puppeteer to browserless
Instead of spinning up a Chrome instance on your own machine, you can virtualize it and run it (even on another host!) in Docker, through browserless.
This can be achieved by supplying an extra parameter to puppeteer.connect (notice that we use .connect instead of .launch), namely browserWSEndpoint:
const puppeteer = require('puppeteer')

async function main () {
  const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })
}
This way a puppeteer instance from the browserless pool is used. By the way, on http://localhost:3000 you can see the currently open browser sessions.
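As a quick sanity check, here is a minimal sketch, assuming the browserless container from above is listening on port 3000, that opens a page through the remote browser and prints its title:

const puppeteer = require('puppeteer')

async function main () {
  // connect to the browserless pool instead of launching a local Chrome
  const browser = await puppeteer.connect({ browserWSEndpoint: 'ws://localhost:3000' })

  const page = await browser.newPage()
  await page.goto('https://example.com/')
  console.log(await page.title())

  // disconnect returns the session to the pool without killing the remote browser
  await browser.disconnect()
}

main().catch(console.error)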
About the browserless HTTP API 👌
When you have a browserless backend running (on port 3000), you can make HTTP requests to its API, which is further documented in the official docs:
You can POST to the endpoint /scrape and instruct puppeteer to extract all elements matching a given selector, and much more:
curl -X POST \
  https://chrome.browserless.io/scrape \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [{
      "selector": "a"
    }]
  }'
Additionally, you can take screenshots, get the whole HTML of a page, and more with the debug property:
curl -X POST \
  http://localhost:3000/scrape \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [{
      "selector": "a"
    }],
    "debug": {
      "screenshot": true,
      "console": true,
      "network": true,
      "cookies": true,
      "html": true
    }
  }'
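For reference, the same request can be made from Node.js with got, which is also how the crawler below talks to browserless. Here is a small sketch that just prints the top-level keys of the response; the exact shape of the debug payload depends on your browserless version:

const got = require('got')

async function main () {
  const { body } = await got.post('http://localhost:3000/scrape', {
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url: 'https://example.com/',
      elements: [{ selector: 'a' }],
      debug: { screenshot: true, html: true }
    })
  })

  const json = JSON.parse(body)
  // e.g. [ 'data', 'debug' ] -- inspect the keys to see what your version returns
  console.log(Object.keys(json))
}

main().catch(console.error)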
The Node.js crawler
I am going to use p-queue to set up a simple queue (in-memory) and got for making HTTP requests.
The queue represents all scraping jobs for a given URL.
The idea is to start from the homepage and from there look for all relative links (<a> with a href attribute starting with /).
Create the queue
const queue = new PQueue({ concurrency: 5, timeout: 30000 })
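For this to run, p-queue also needs to be required at the top of the file. The exact export shape depends on the installed p-queue version, so treat this as a sketch:

// older p-queue versions export the class directly; newer ones expose it as `.default` (or are ESM-only)
const PQueue = require('p-queue')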
To start off the crawling process, add a first URL to crawl; from there on, look for further links to crawl, and so forth:
queue.add(() => crawl(url, { baseurl, seen: new Set(), queue }))
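Here is a sketch of how the entry point might tie these pieces together (the exact wiring in the repository may differ): read the start URL from the command line, derive the baseurl from it, seed the queue, and wait for it to drain:

async function main () {
  const url = process.argv[2] || 'https://christianfei.com'
  const baseurl = new URL(url).origin

  queue.add(() => crawl(url, { baseurl, seen: new Set(), queue }))

  // resolves once every queued crawl job has settled
  await queue.onIdle()
  console.log('done crawling', url)
}

main().catch(console.error)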
The crawl function
The crawl function is recursive: its job is to find more links on a single URL and add them as crawling jobs to the queue.
It makes an HTTP POST request to http://localhost:3000/scrape, scraping for relative links on the page.
async function crawl (url, { baseurl, seen = new Set(), queue }) {
  console.log('🕸 crawling', url)
  const { body } = await got.post(`http://localhost:3000/scrape`, {
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url,
      elements: [{
        selector: `a[href^="/"]`
      }],
      debug: { html: true }
    }),
    timeout: 10000
  })
  const json = JSON.parse(body)
Now you have a json object containing the results from your scraping job, e.g.:
{
  "data": [{
    "selector": "a[href^=\"/\"]",
    "results": [{
      "html":
      ...
Extract links
From data[0].results you can map all matching elements found on the page for the selector a[href^="/"] and extract all links:
const links = json.data[0].results
  .filter(r => r.attributes.find(a => a.name === 'href'))
  .map(r => r.attributes.find(a => a.name === 'href').value)
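Each entry in results carries, among other fields, the element's html and an attributes array of { name, value } pairs, which is what the find calls above rely on. Roughly (values are illustrative and most fields are omitted):

{
  "html": "<a href=\"/posts/\">Posts</a>",
  "attributes": [
    { "name": "href", "value": "/posts/" }
  ]
}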
Add new links to queue
Filter out unwanted links, make relative links absolute by prepending the baseurl, exclude links that contain an anchor (#), and finally verify through the seen Set that the link hasn't been crawled before:
links
  .filter(link => !link.startsWith('//'))
  .map(link => link.startsWith(baseurl) ? link : `${baseurl.replace(/\/$/, '')}${link}`)
  .filter(link => !/#.*$/.test(link))
  .filter(l => !seen.has(l))
  .forEach(l => {
    seen.add(l)
    queue.add(() => retry(() => crawl(l, { baseurl, seen, queue })))
  })
At the end, you add each resulting link to the seen Set to avoid crawling it more than once, and add the new links to the crawling queue.
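The retry wrapper used above isn't shown in this excerpt. A minimal hand-rolled version could look like the sketch below (the repository may well use a library such as p-retry or async-retry instead):

// re-run fn up to `times` more attempts, waiting delayMs between attempts
async function retry (fn, times = 3, delayMs = 1000) {
  try {
    return await fn()
  } catch (err) {
    if (times <= 0) throw err
    await new Promise(resolve => setTimeout(resolve, delayMs))
    return retry(fn, times - 1, delayMs)
  }
}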
Full source code on GitHub
Fork or clone it from christian-fei/browserless-example and, with Docker installed and running, run:
git clone https://github.com/christian-fei/browserless-example.git
cd browserless-example
npm i
npm run start-browserless
node crawl-with-api.js https://christianfei.com