Update 2018-03-17: This article hasn't aged well; better information about Flajolet-Martin (known as HyperLogLog to the rest of the world) is easy to find, and browser fingerprinting is such a concern now that it's the subject of academic research and a reason to use specialized browser versions to defeat it.
Unique visitor counting using a bloom filter is currently only viable when visitors number in the thousands; larger numbers would require more bits of entropy than browser user agent strings are likely to yield. However, if we're happy with a rough estimate rather than an exact number, we can use Flajolet-Martin sketches to get one in a very similar way, one that takes next to no disk space, memory or processing power.
This is a followup to an article I wrote in early 2010. At the time, I thought I'd found a pretty smart, or at least unique, way to do simple, privacy-friendly visitor tracking.
I was wrong; there's an even simpler, much more scalable probabilistic technique that, for most applications, is a preferable solution. Googling for it suggests that it's not widely known, so it makes sense to write a followup article about it.
However, many of the same concepts needed for the original solution apply to the one explained in this document as well.
In particular, we need to extract entropy from the (semi-)unique identifiers (IP address, User-Agent string, et cetera) we can glean from the browser. Similarly, this algorithm depends on the bits in these identifiers being uniformly distributed. So applying a hash to this information is our best bet.
We have to be willing to assume that the hash function we use yields values that are evenly distributed over the range of possible values. That is, the outcome of the hash function should be about as even (and unrelated to the input) as a dice roll.
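As a rough illustration, the hashing step might look like this in Python; the function name, the field names and the choice of SHA-1 are my own assumptions, not anything the technique prescribes:

```python
import hashlib

def visitor_hash(ip_address: str, user_agent: str) -> int:
    """Hash (semi-)unique visitor identifiers into a 32-bit integer.

    The concatenated identifiers are run through SHA-1 purely as a
    'uniformizer'; any reasonably well-mixing hash function would do.
    """
    identifier = (ip_address + "|" + user_agent).encode("utf-8")
    digest = hashlib.sha1(identifier).digest()
    # Take the first four bytes of the digest as an unsigned 32-bit value.
    return int.from_bytes(digest[:4], "big")

# Example call (made-up values):
# visitor_hash("192.0.2.17", "Mozilla/5.0 (Windows NT 10.0; ...)")
```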
Imagine a dice with 2^32 sides, all equally likely, and look at the bit string of a possible outcome. What is the chance of the first bit being zero? Exactly one in two.
The chance of the first two bits both being zero is one in four. The first three bits? One in 2^3, or one in eight.
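If you want to convince yourself of that, a quick (and entirely optional) simulation in Python, with random 32-bit values standing in for hash outcomes:

```python
import random

def leading_zero_bits(value: int, width: int = 32) -> int:
    """Count how many consecutive zero bits a value starts with."""
    count = 0
    for position in range(width - 1, -1, -1):
        if value & (1 << position):
            break
        count += 1
    return count

# Roughly 1 in 2^k random values should start with at least k zero bits.
samples = [random.getrandbits(32) for _ in range(100_000)]
for k in range(1, 6):
    fraction = sum(leading_zero_bits(v) >= k for v in samples) / len(samples)
    print(f"at least {k} leading zeros: {fraction:.4f} (expected {1 / 2 ** k:.4f})")
```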
Bit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Value | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
If we take an array of bits, with all members initialized to zero at the beginning, and set bit n if we encounter a hash starting with n consecutive zeroes, eventually, after sufficient visits (that each generate a semi-unique hash key), we'll have an array that starts with a series of ones. If we have six consecutive ones in the array, then chances are that we've had about 2^6 = 64 visitors, or some other number between 2^5 and 2^7.
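Putting that together, a minimal single-sketch implementation might look like this in Python; again a sketch of the idea under my own naming, not a drop-in library:

```python
class FlajoletMartinSketch:
    """A single Flajolet-Martin sketch that fits in one 32-bit integer."""

    def __init__(self) -> None:
        self.bits = 0  # bit n is set once a hash with n leading zeros is seen

    def add(self, hash_value: int, width: int = 32) -> None:
        """Record one visit, identified by a (uniformly distributed) hash."""
        zeros = 0
        for position in range(width - 1, -1, -1):
            if hash_value & (1 << position):
                break
            zeros += 1
        self.bits |= 1 << zeros

    def consecutive_ones(self) -> int:
        """Length of the run of ones at the start of the bit array."""
        run = 0
        while self.bits & (1 << run):
            run += 1
        return run

    def estimate(self) -> int:
        """Rough estimate of the number of distinct visitors: 2^run."""
        return 2 ** self.consecutive_ones()

# Usage, with the visitor_hash helper sketched earlier:
# sketch = FlajoletMartinSketch()
# sketch.add(visitor_hash("192.0.2.17", "Mozilla/5.0 ..."))
# print(sketch.estimate())
```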
Actually, Flajolet and Martin themselves (PDF), for complicated mathematical reasons I have no hope of understanding, place the expected exponent at 0.77n for a run of n consecutive ones, and estimate its standard deviation at 1.12 bits. So six consecutive ones actually means a value between 2^3.5 ≈ 11 and 2^5.74 ≈ 53, most likely around 2^4.62 ≈ 25.
Because we're only ever storing the exponent, there's really no limit to how high we can ‘count’. A hash value that starts with thirty consecutive zeroes is astronomically unlikely; it means there will have been on the order of one billion visitors to the page. But we can store even this possibility inside a 32-bit integer value.
It should be obvious by now that this method gets less accurate as the number of visitors increases. At the beginning of the scale, the counter will be off by one or two visitors. But the difference between 2^31 and 2^32 is slightly more than a billion.
An advantage that Flajolet and Martin themselves point out is that their method of counting lends itself well to distribution: if you have three web servers, each counting unique visitors in its own Flajolet-Martin sketch, then the total number of unique visitors (without double-counting visitors who've visited web server A, then web server B, and then C) is predicted by the ORed value of the three Flajolet-Martin sketches.
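Continuing the hypothetical Python sketch from above, merging the per-server sketches really is just a bitwise OR:

```python
def merge_sketches(*sketches: FlajoletMartinSketch) -> FlajoletMartinSketch:
    """Combine per-server sketches into one sketch for the union of visitors."""
    merged = FlajoletMartinSketch()
    for sketch in sketches:
        merged.bits |= sketch.bits
    return merged

# total = merge_sketches(sketch_a, sketch_b, sketch_c)
# print(total.estimate())
```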
A few paragraphs up, we made the assumption that a hash function acts as a ‘uniformizer’: a magical function that takes a series of values that are non-uniformly distributed over the set of all possible inputs, and produces a series of values that is uniformly distributed over all possible outputs. Is this really the case? No, and yes. There is no hash function that yields a uniformly distributed output for every range of input values. But if we verify this experimentally, it becomes clear that many hash functions are ‘good enough’ for this purpose.
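One way to do such an experimental check (my own quick test, nothing rigorous) is to feed a hash function deliberately non-uniform input and see whether the leading-zero counts still halve with every extra bit:

```python
import hashlib
from collections import Counter

def leading_zeros_of_hash(text: str, width: int = 32) -> int:
    """Leading zero bits of the first 32 bits of a SHA-1 digest."""
    value = int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:4], "big")
    zeros = 0
    for position in range(width - 1, -1, -1):
        if value & (1 << position):
            break
        zeros += 1
    return zeros

# Highly non-uniform input: near-identical strings.
counts = Counter(leading_zeros_of_hash(f"user-{i}") for i in range(100_000))
for k in range(6):
    expected = 100_000 / 2 ** (k + 1)
    print(f"exactly {k} leading zeros: {counts[k]} (expected about {expected:.0f})")
```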
Another assumption is that our input values, IP addresses and browser User-Agent strings, are sufficiently unique to tell one user from another. At the time of writing, companies like Microsoft were taking action to make User-Agent information transmitted by the browser less unique, not more, so this is something that is changing rapidly.
‘Sketching’, that is, using compact data structures to record statistical properties of large, or even huge, data sets, doesn't seem to be as well known as it could be. I hope this article provides an interesting introduction to the technique, as well as a clear demonstration of a practical application.