Making Minimal Analytics

Web analytics and dashboards have always fascinated me.

This is my approach to a self-hosted, simple web analytics solution.

It also helped me get back into full-stack web development, using a clean and testable approach.

[Image: dashboard]

TLDR;

  • Check out the project at s.cri.dev

  • GitHub repo

  • Node.js framework-less HTTP server

    • Simple bot detection

    • Persistence on file and in-memory

    • Fast calculation of analytics

    • Configurable retention timeframe (default 6 months)

  • Frontend in Preact.js and Chart.css

    • Web component tests with enzyme

    • preact-router for frontend routing

    • parcel bundler

  • Deployment with docker-compose

Read more about the project below

Update 2022-06-22

It’s been a year since I started working on this project.

And running in production too! 🚀

At the time of writing (2022-06-22), through minimal-analytics, I have tracked anonymously

  • ~101k unique visitors
  • ~111k unique page views
  • over 71k visitors referred by Google (over two thirds!)

without cookies too!

[Image: minimal analytics update 2022-06-22]

Why?

Good question.

First of all: I am a big fan of Plausible Analytics (no affiliation at all), from several points of view:

  • Privacy oriented
  • An excellent example of an Open-Source project that MONETIZES!
  • “Look, a cool new Elixir project! That’s cleanly written too!”
  • Great content marketing, lots to learn from them

I did find a small hiccup, though.

Unfortunately, as Plausible gains recognition it ends up on more blocker lists, which throws your web analytics off quite a bit.

I am doing an experiment by tracking pageviews on this website through

  • Plausible (hosted, subscription)
  • Umami (self-hosted)
  • Minimal Analytics (self-hosted)

This is to validate my hypothesis that ~30% fewer visitors (around 2-3k) are being reported than actually visit.

Why three web analytics tools?

The hosted Plausible version provides a lower bound for the pageviews.

Umami, since it is self-hosted and should catch every pageview (it is difficult to detect and block, unless you disable all JavaScript).

Minimal Analytics to verify that I’m getting close to the pageviews tracked by Umami.

Finally I should be able to determine how many users I am “missing out on” in my web analytics.

Experiments with persistence

I was telling myself “JavaScript is fast enough” and “You probably don’t even need SQLite”.

I also wanted to experiment with an LJSON (line-delimited JSON) “database”, persisted to a file.

Data and persistence

The file database is only read once at startup; afterwards it is only used for appending new data points.

Restoring the memory

I am reading the file and restoring the in-memory state with:

const fs = require('fs')

// Read the whole LJSON file, parse each line and keep only
// the data points from the last 180 days (the retention timeframe).
memory = fs.readFileSync(options.DATA_PATH)
  .toString('utf-8')
  .split('\n')
  .filter(Boolean)
  .map(l => JSON.parse(l))
  .filter(({ d }) => +new Date(d) > (+new Date() - 1000 * 60 * 60 * 24 * 180))

After the initial read, only the memory variable is used. It contains the data points of the last six months; as the sketch below shows, this retention timeframe is easy to configure.
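
The 180 days above correspond to the default six-month retention. A minimal sketch of making it configurable, assuming a hypothetical RETENTION_DAYS environment variable (not the project's actual option name):

// Hypothetical: derive the retention cutoff from an env var,
// falling back to the 180-day default used above.
const RETENTION_DAYS = parseInt(process.env.RETENTION_DAYS, 10) || 180
const cutoff = Date.now() - 1000 * 60 * 60 * 24 * RETENTION_DAYS

memory = memory.filter(({ d }) => +new Date(d) > cutoff)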

Adding a new data point

To add a new data point, simply append a line to the LJSON file:

// Append the pageview as a single JSON line, fire-and-forget.
fs.appendFile(options.DATA_PATH, JSON.stringify(pageview) + '\n', (err) => {
  if (err) console.error('failed to persist pageview', err)
})

HTTP server

The HTTP server is built on top of the native http module.

It takes care of the following (a rough sketch follows the list):

  • CORS for the configured domain (through process.env.SITE_BASE_URL)
  • Simple bot detection
  • Tracking visitors to the site
  • Serving the dashboard
  • API for the dashboard
  • Serving the client tracker script
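
Here is that sketch: routing on top of http.createServer, with hypothetical handler stubs standing in for the project's actual modules:

const http = require('http')

// Hypothetical handler stubs; the real project splits these into modules.
const trackPageview = (req, res) => res.end('ok')
const serveTracker = (req, res) => res.end('/* client.js */')
const serveApi = (req, res) => res.end('{}')
const serveDashboard = (req, res) => res.end('<!doctype html>')

const server = http.createServer((req, res) => {
  // CORS for the configured domain.
  res.setHeader('Access-Control-Allow-Origin', process.env.SITE_BASE_URL || '*')

  const { pathname } = new URL(req.url, 'http://localhost')
  if (pathname === '/p') return trackPageview(req, res)
  if (pathname === '/client.js') return serveTracker(req, res)
  if (pathname.startsWith('/api')) return serveApi(req, res)
  return serveDashboard(req, res)
})

server.listen(process.env.PORT || 3000)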

Dashboard

The dashboard is built on top of preact, using the create-preact-app CLI.

Component tests are written with enzyme.

Every 15 seconds the live visitors are updated, and every minute the full analytics for the currently selected filters are refreshed.
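
A sketch of how this polling could look in a Preact component; the dashboard API endpoints and helper names are my assumptions, not the project's actual ones:

import { useEffect, useState } from 'preact/hooks'

// Assumed endpoints for the dashboard API.
const fetchLive = () => fetch('/api/live').then(res => res.json())
const fetchAnalytics = () => fetch('/api/analytics' + window.location.search).then(res => res.json())

export default function Dashboard () {
  const [live, setLive] = useState([])
  const [analytics, setAnalytics] = useState(null)

  useEffect(() => {
    // Live visitors every 15 seconds, full analytics every minute.
    const liveTimer = setInterval(() => fetchLive().then(setLive), 15 * 1000)
    const analyticsTimer = setInterval(() => fetchAnalytics().then(setAnalytics), 60 * 1000)
    return () => { clearInterval(liveTimer); clearInterval(analyticsTimer) }
  }, [])

  // ... render charts and lists from `live` and `analytics`
  return null
}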

The dashboard consists of a single route.

preact-router comes in handy for handling navigation changes and setting the query parameters for the current filters.
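
A sketch of the idea: preact-router passes query-string parameters to the routed component as props, and changing a filter is just a navigation (the component and filter names here are made up):

import { h } from 'preact'
import Router, { route } from 'preact-router'

// ?p=/some-page shows up as the `p` prop on the routed component.
const Dashboard = ({ p, r }) => <div>filtering by page={p}, referrer={r}</div>

// Changing a filter navigates to the same route with updated query params.
const setFilter = (key, value) => route(`/?${key}=${encodeURIComponent(value)}`)

export default () => (
  <Router>
    <Dashboard path="/" />
  </Router>
)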

I also experimented with chart.css to show the pageviews chart, and I must say it was a pleasant experience.

This is what the dashboard looks like now:

[Image: minimal-analytics-dashboard]

Live visitors

Each visitor gets an ID, without cookies; it is essentially calculated using SHA-256 and a random seed that changes every day:

createHash('sha256').update(remoteIp + userAgent + seed).copy().digest('hex').substring(0, 10)

On the frontend, the visitors are aggregated by page:

[Image: minimal-analytics-live]

Top referrers and pages

The top-performing referrers and pages are displayed like this:

[Image: minimal-analytics-top]

Filtering

Every item on the page (except for the chart, I’m working on it) is filterable:

[Image: minimal-analytics-filter]

Pageviews history

The last 25 pageviews are shown, so that you can see recent activity on the website:

[Image: minimal-analytics-history]

Client tracker script

It’s simply included as a <script> tag, with the async and defer attributes to signal its low priority to the browser.

<script async defer src="https://s.your.domain.com/client.js"></script>

Once downloaded, the JavaScript in the client script makes a GET request to /p to track a pageview.

Every 10 seconds it sends a heartbeat, to check if the client is still connected and viewing the webpage.

The information collected is the following:

{
  r: document.referrer,
  p: window.location.pathname,
  w: window.innerWidth
}

That’s it for the tracking on the client side.
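
Putting the pieces together, a minimal sketch of such a client script; the /p endpoint is from this post, while the heartbeat query parameter is an assumption:

;(function () {
  var data = {
    r: document.referrer,
    p: window.location.pathname,
    w: window.innerWidth
  }
  var qs = Object.keys(data)
    .map(function (key) { return key + '=' + encodeURIComponent(data[key]) })
    .join('&')

  // Track the pageview with a simple GET request.
  fetch('https://s.your.domain.com/p?' + qs)

  // Heartbeat every 10 seconds while the page stays open.
  setInterval(function () {
    fetch('https://s.your.domain.com/p?' + qs + '&heartbeat=1')
  }, 10 * 1000)
})()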

Privacy-friendly

Each visitor to this blog gets assigned a hash.

The hash is an encoded representation of the IP address, user-agent and a seed that changes each day.

All this without the use of cookies or any storage at all.

Below is the current implementation (its tests live in the GitHub repo):

const { createHash } = require('crypto')

// The seed defaults to the current day of the month (UTC),
// so a visitor's hash changes every day.
module.exports = function visitorFromRequest (req, seed = new Date().getUTCDate()) {
  const userAgent = req.headers['user-agent']
  // Client IP: prefer the proxy header, fall back to the socket address.
  const remoteIp = req.headers['x-forwarded-for'] ||
    req.connection.remoteAddress ||
    req.socket.remoteAddress ||
    (req.connection.socket ? req.connection.socket.remoteAddress : null)
  return createHash('sha256').update(remoteIp + userAgent + seed).copy().digest('hex').substring(0, 10)
}

This will return a 10 character long identifier for each website visitor.

This comes in handy for differentiating pageviews from unique visitors, and thus for calculating the bounce rate.

Additionally, it is useful when viewing the history of a single visitor.

This enables tracing the “path” a user took when navigating the site.
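
As an example, a sketch of how the bounce rate could be computed from the in-memory data points; the v field holding the visitor hash is my assumption about the record shape:

// Group pageviews by visitor hash (field name `v` assumed).
const byVisitor = new Map()
for (const pageview of memory) {
  const views = byVisitor.get(pageview.v) || []
  views.push(pageview)
  byVisitor.set(pageview.v, views)
}

// A "bounce" is a visitor with exactly one pageview.
const bounces = [...byVisitor.values()].filter(views => views.length === 1).length
const bounceRate = bounces / byVisitor.size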

Simulated data from Plausible

I wrote a little scraper that, for every page in my sitemap.xml, makes a request to the Plausible API and gets the pageviews for a given date.

This way I was able to approximately “reconstruct” pageviews (unfortunately without referrer information) prior to April 2021.

I have been tracking data with Minimal Analytics since April 12th 2021.

All previous data is scraped from Plausible.
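
A sketch of what that scraper boils down to, assuming Plausible's v1 stats API (the aggregate endpoint with the pageviews metric), a site_id of cri.dev and an API key in PLAUSIBLE_API_KEY:

const fetch = require('node-fetch')

// For a page from sitemap.xml, ask Plausible for its pageviews on a given date.
async function pageviewsFor (page, date) {
  const url = 'https://plausible.io/api/v1/stats/aggregate' +
    '?site_id=cri.dev' +
    '&period=day&date=' + date +
    '&metrics=pageviews' +
    '&filters=' + encodeURIComponent('event:page==' + page)
  const res = await fetch(url, {
    headers: { Authorization: 'Bearer ' + process.env.PLAUSIBLE_API_KEY }
  })
  const { results } = await res.json()
  return results.pageviews.value
}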

GitHub repo

Join me and let me know what you think (perhaps even contribute directly!) on GitHub.

The project can be found at github.com/christian-fei/minimal-analytics

Future

As said above, this all started as an experiment, so the idea is to validate my assumptions and determine how many users of this blog have an AdBlocker or similar installed.

To be more consistent with the results of the other web analytics software, I will probably need to switch to XHR instead of using fetch.
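
For reference, a pageview request like the one in the client script above, sent with XHR instead of fetch:

// Same GET to /p as before; `qs` is the query string built in the client sketch.
var xhr = new XMLHttpRequest()
xhr.open('GET', 'https://s.your.domain.com/p?' + qs, true)
xhr.send()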

I am also trying to optimize the calculation further, to handle more data points while still doing the calculations in memory.

With my current data, around 70k data points (since August 2020), the calculations for the yearly analytics take 150ms.

Here, have a slice of pizza 🍕