Test driving a HackerNews scraper with Node.js


This is a short summary of my experience while writing a simple HackerNews scraper.

As a pure exercise (or kata, if you like), I tried to apply Clean Code, refactoring and testing principles to this small npm module.

The task is simple:

Get the posts on the HackerNews front page and parse them.

npm setup

Start with npm init -y in a clean repository. You can create the repo beforehand and then initialise it with npm.

This creates a package.json file that describes your npm package.

Install test dependencies

I would start with ava as test-runner and assertion library.

npm i --save-dev ava

Add the following test script to the scripts section of your package.json:

  "scripts": {
    "test": "ava"
  }

The first test: Fetching the HTML

As a first test, I would verify that I can successfully get the HTML of the HackerNews front page.

Let’s try.

Create index.test.js

Create a file called index.test.js and start by including ava:

const test = require('ava')

The first test could look something like this:

test('gets html from', async t => {
})

First assertion

Let’s assert that our code is able to get the HTML:

test('gets html from', async t => {
  const html = await getHTML()
  t.true(typeof html === 'string')
  t.true(html.startsWith('<html '))
})
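The startsWith check is strict: it fails on leading whitespace, a doctype line, or an uppercase tag. As a minimal sketch, a more tolerant check could look like the following. The helper name looksLikeHTML is hypothetical and not part of the original module.

```javascript
// Hypothetical helper, not part of the article's code: accepts leading
// whitespace, an optional doctype, and any letter case for the <html> tag.
function looksLikeHTML (body) {
  if (typeof body !== 'string') return false
  return /^\s*(<!doctype html[^>]*>\s*)?<html[\s>]/i.test(body)
}

console.log(looksLikeHTML('<html lang="en" op="news"><head>')) // → true
console.log(looksLikeHTML('<!DOCTYPE html>\n<html>')) // → true
console.log(looksLikeHTML('{"error": "rate limited"}')) // → false
```

Inside the ava test this would read t.true(looksLikeHTML(html)).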

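Now we need getHTML. I like using got for making simple HTTP requests.

Install it with npm i got. Note that require('got') only works up to got v11, since later releases are ESM-only; pin the version with npm i got@11 if you follow along with require.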

const got = require('got')
async function getHTML () {
  return got('')
    .then(res => res.body)
}

Now we have getHTML, which returns the markup of the front page.

Let’s parse it!


HTML of HackerNews

Taking a look at the source of HackerNews, the page has the following structure:

<table border="0" cellpadding="0" cellspacing="0" class="itemlist">
    <tr class="athing" id="25005567">
      <td align="right" valign="top" class="title">
        <span class="rank">1.</span>
      </td>
      <td valign="top" class="votelinks">
          <a id="up_25005567" href="vote?id=25005567&amp;how=up&amp;goto=news">
            <div class="votearrow" title="upvote"></div>
          </a>
      </td>
      <td class="title">
        <a href="" class="storylink">
          Deprecating scp
        </a>
        <span class="sitebit comhead"> (<a href="from?"><span class="sitestr"></span></a>)</span>
      </td>
    </tr>
    <tr>
      <td colspan="2"></td>
      <td class="subtext">
        <span class="score" id="score_25005567">399 points</span> by <a href="user?id=Tomte" class="hnuser">Tomte</a>
        <span class="age"><a href="item?id=25005567">8 hours ago</a></span> <span id="unv_25005567"></span> | <a
          href="hide?id=25005567&amp;goto=news">hide</a> | <a href="item?id=25005567">212&nbsp;comments</a> </td>
    </tr>
    <tr class="spacer" style="height:5px"></tr>
</table>

Apart from the “archaic” markup, the structure is quite clear:

An item’s title is in the second <td> with the class title; the first one only holds the rank.

The upvotes, author, comments etc. are in the <td> with the class subtext, in the <tr> that follows.

Then follows a <tr class="spacer">.
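To make this row pattern concrete, here is a minimal sketch in which plain objects stand in for the <tr> elements. The kind field and the groupRows name are made up for illustration; the real rows are told apart by their CSS classes.

```javascript
// Plain objects stand in for <tr> elements; `kind` is a made-up field
// that mirrors the three row types described above.
function groupRows (rows) {
  const items = []
  for (const row of rows) {
    if (row.kind === 'title') {
      // A title row starts a new item.
      items.push({ title: row.title })
    } else if (row.kind === 'subtext' && items.length > 0) {
      // A subtext row completes the item started by the previous title row.
      Object.assign(items[items.length - 1], {
        author: row.author,
        upvotes: row.upvotes
      })
    }
    // Spacer rows carry no data and are skipped.
  }
  return items
}

const rows = [
  { kind: 'title', title: 'Deprecating scp' },
  { kind: 'subtext', author: 'Tomte', upvotes: 399 },
  { kind: 'spacer' }
]

console.log(JSON.stringify(groupRows(rows)))
// → [{"title":"Deprecating scp","author":"Tomte","upvotes":399}]
```

The point of the sketch is that the data for one post is spread over two consecutive rows, so a parser has to carry state from one row to the next.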

Second test: Parsing HTML to HackerNews item

A test could look like this:

test('parses items from html', t => {
  const news = parseNews(html())
  t.true(Array.isArray(news))
  t.is(news.length, 30)
  t.is(news[0].title, 'Deprecating scp')
  t.is(news[0].url, '')
})

html() is just a function that returns a saved copy of the HackerNews HTML (you can get the string from your browser’s view-source: view of the front page).

function html () {
  return `<html lang="en" op="news"><head><meta name="referrer" content="origin">...`
}

So we need a function parseNews.

Parsing the HTML

A valid alternative to cheerio is node-html-parser.

Install it with npm install node-html-parser

parseNews could look like this:

const { parse } = require('node-html-parser')
function parseNews (html = '') {
  const doc = parse(html)
  const trs = doc.querySelectorAll('table.itemlist tr')
  return trs.reduce((acc, tr, index) => {
    const titles = tr.querySelectorAll('.title')
    // Only rows with two .title cells (rank and headline) hold a post
    if (titles.length === 2) {
      // Strip the trailing "(site)" suffix from the headline text
      const title = tr.querySelectorAll('.title')[1].text.replace(/\(.*\)$/, '').trim()
      return acc.concat([{
        title,
        url: tr.querySelector('.title a').getAttribute('href')
      }])
    }
    return acc
  }, [])
}
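One detail worth a closer look is the regular expression that removes the site suffix. It is greedy, so it cuts from the first opening parenthesis to the end of the string. The sketch below compares it with a stricter variant that removes only the last parenthesised group. The function names, the example.com suffix and the second headline are made up for illustration.

```javascript
// The approach used in parseNews: greedy, removes from the first "(" onwards.
function stripSiteGreedy (text) {
  return text.replace(/\(.*\)$/, '').trim()
}

// Hypothetical stricter variant: removes only the last parenthesised group.
function stripSite (text) {
  return text.trim().replace(/\s*\([^()]*\)$/, '')
}

console.log(stripSiteGreedy('Deprecating scp (example.com)')) // → Deprecating scp
console.log(stripSiteGreedy('Some talk (2019) (example.com)')) // → Some talk
console.log(stripSite('Some talk (2019) (example.com)')) // → Some talk (2019)
```

Neither variant can tell a site suffix from a headline that itself ends in parentheses. Reading the text of the storylink anchor directly would probably sidestep the problem, at the cost of one more selector.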

It returns an array like this (URLs elided):

[
  { "title": "Deprecating scp", "url": "..." },
  { "title": "Gron – Make JSON Greppable", "url": "..." }
]

This satisfies our second test!

  2 tests passed

Extracting more data

So far we only extract the title and URL of each HackerNews post.

We can further extract upvotes, author, link and comments.

Adapting the second test:

test('parses items from html', t => {
  const news = parseNews(html())
  console.log(JSON.stringify(news, null, 2))
  t.true(Array.isArray(news))
  t.is(news.length, 30)
  t.is(news[0].title, 'Deprecating scp')
  t.is(news[0].url, '')
  t.is(news[0].link, '')
  t.is(news[0].author, 'Tomte')
  t.is(news[0].upvotes, 435)
  t.is(news[0].comments, 231)
})

To make the test pass, let’s add more logic to parseNews:

function parseNews (html = '') {
  const doc = parse(html)
  const trs = doc.querySelectorAll('table.itemlist tr')
  return trs.reduce((acc, tr, index) => {
    const titles = tr.querySelectorAll('.title')
    if (titles.length === 2) {
      const title = tr.querySelectorAll('.title')[1].text.replace(/\(.*\)$/, '').trim()
      return acc.concat([{
        title,
        url: tr.querySelector('.title a').getAttribute('href')
      }])
    }
    const subtext = tr.querySelector('.subtext')
    if (subtext) {
      const el = subtext.querySelector('.score')
      const links = subtext.querySelectorAll('a')
      if (!el || links.length !== 4) return acc
      // The subtext row belongs to the item added by the previous title row
      acc[acc.length - 1].upvotes = +el.text.replace(' points', '').trim()
      acc[acc.length - 1].author = links[0].text
      acc[acc.length - 1].comments = +links[links.length - 1].text.replace('comments', '').trim()
      acc[acc.length - 1].link = '' + links[links.length - 1].getAttribute('href')
      return acc
    }
    return acc
  }, [])
}
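The unary plus conversion works as long as the text has exactly the expected shape, but it yields NaN for anything else. A small sketch of a more forgiving conversion follows. The parseCount helper is hypothetical, and the "discuss" label for posts without comments is an assumption about the page rather than something shown in the markup above.

```javascript
// Hypothetical helper: returns the first run of digits in the text as a
// number, or 0 when the text contains no digits at all.
function parseCount (text) {
  const match = /\d+/.exec(text)
  return match ? Number(match[0]) : 0
}

console.log(parseCount('399 points')) // → 399
console.log(parseCount('212\u00a0comments')) // → 212
console.log(parseCount('discuss')) // → 0

// For comparison, the unary plus approach on the same assumed label:
console.log(+'discuss'.replace('comments', '').trim()) // → NaN
```

Whether 0 is the right fallback is a design choice; returning null would make the missing value explicit instead.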

Super, we now get a whole HackerNews item!

Refactoring parseNews

parseNews is a messy pile of HTML parsing, full of unfamiliar selectors and special cases.

To make it a bit easier to read, I would focus on the if statements.

I’ll try to make them clearer by adding two new functions to determine if the <tr> contains the title, or contains the upvotes, comments etc.

function containsTitle (tr) {
  const titles = tr.querySelectorAll('.title')
  return titles.length === 2
}
function containsUpvotes (tr) {
  const subtext = tr.querySelector('.subtext')
  if (!subtext) return false
  const el = subtext.querySelector('.score')
  const links = subtext.querySelectorAll('a')
  if (!el || links.length !== 4) return false
  return true
}

These two functions integrated in the current parseNews function:

function parseNews (html = '') {
  const doc = parse(html)
  const trs = doc.querySelectorAll('table.itemlist tr')
  return trs.reduce((acc, tr, index) => {
    if (containsTitle(tr)) {
      const title = tr.querySelectorAll('.title')[1].text.replace(/\(.*\)$/, '').trim()
      return acc.concat([{
        title,
        url: tr.querySelector('.title a').getAttribute('href')
      }])
    }
    if (containsUpvotes(tr)) {
      const subtext = tr.querySelector('.subtext')
      const el = subtext.querySelector('.score')
      const links = subtext.querySelectorAll('a')
      if (!el || links.length !== 4) return acc
      acc[acc.length - 1].upvotes = +el.text.replace(' points', '').trim()
      acc[acc.length - 1].author = links[0].text
      acc[acc.length - 1].comments = +links[links.length - 1].text.replace('comments', '').trim()
      acc[acc.length - 1].link = '' + links[links.length - 1].getAttribute('href')
      return acc
    }
    return acc
  }, [])
}
function containsTitle (tr) {
  const titles = tr.querySelectorAll('.title')
  return titles.length === 2
}
function containsUpvotes (tr) {
  const subtext = tr.querySelector('.subtext')
  if (!subtext) return false
  const el = subtext.querySelector('.score')
  const links = subtext.querySelectorAll('a')
  if (!el || links.length !== 4) return false
  return true
}

Combining Parsing and Fetching

Now let’s write an integration test that verifies we are able to get the HTML and parse HackerNews items.

test('fetches HTML and parses items', async t => {
  const html = await getHTML()
  const news = parseNews(html)
  t.is(news.length, 30)
})

This already works, awesome!

This combination of fetching and parsing is the heart of our package, so it deserves a place in index.js.

The remaining non-test code goes into a lib folder, with each module sitting next to its tests.
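One way index.js could combine the two modules is sketched below. This is an assumption about the wiring, not the actual file from the repository: the createGetNews factory is hypothetical, and the stubs stand in for whatever lib/get-html.js and lib/parse-news.js export.

```javascript
// Hypothetical factory: the fetcher and the parser are passed in, so the
// wiring can be exercised without a network connection.
function createGetNews ({ getHTML, parseNews }) {
  return async function getNews () {
    const html = await getHTML()
    return parseNews(html)
  }
}

// Stubs standing in for the real modules in lib.
const getNews = createGetNews({
  getHTML: async () => '<html lang="en" op="news"></html>',
  parseNews: html => (html ? [{ title: 'Deprecating scp' }] : [])
})

getNews().then(news => console.log(news[0].title)) // → Deprecating scp
```

In the real module the factory would receive the functions required from the lib folder, while tests can pass stubs like the ones above and stay offline.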

The directory structure looks like this:

➜  hn git:(main) tree -I node_modules
├── index.js
├── index.test.js
├── lib
│   ├── get-html.js
│   ├── get-html.test.js
│   ├── parse-news.js
│   └── parse-news.test.js
├── package-lock.json
└── package.json

1 directory, 9 files

The full source code can be found at

Here, have a slice of pizza 🍕