Update 2018-03-17: This article hasn't aged well; better information about Flajolet-Martin (known as HyperLogLog to the rest of the world) is easy to find, and browser fingerprinting is such a concern now that it's the subject of academic research and a reason to use specialized browser versions to defeat it.
Unique visitor counting using a bloom filter is currently only viable when visitors number in the thousands; larger numbers would require more bits of entropy than browser user agent strings are likely to yield. However, if we're happy with a rough estimate rather than an exact number, we can use Flajolet-Martin sketches to get one in a very similar way, one that takes next to no disk space, memory or processing power.
This is a followup to an article I wrote in early 2010. At the time, I thought I'd found a pretty smart, or at least unique, way to do simple, privacy-friendly visitor tracking.
I was wrong; there's an even simpler, much more scalable probabilistic technique that, for most applications, is a preferable solution. Googling for it suggests that it's not widely known, so it makes sense to write a followup article about it.
However, many of the same concepts needed for the original solution apply to the one explained in this document as well.
In particular, we need to extract entropy from the (semi-)unique identifiers (IP address, User-Agent string, et cetera) we can glean from the browser. Similarly, this algorithm depends on the bits in these identifiers being uniformly distributed. So applying a hash to this information is our best bet.
We have to be willing to assume that the hash function we use yields values that are evenly distributed over the range of possible values. That is, the outcome of the hash function should be about as even (and unrelated to the input) as a dice roll.
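As a rough illustration, the hashing step might look like this in Python; the function name, the field names and the choice of SHA-1 are my own assumptions, not anything the technique prescribes:

```python
import hashlib

def visitor_hash(ip_address: str, user_agent: str) -> int:
    """Hash (semi-)unique visitor identifiers into a 32-bit integer.

    The concatenated identifiers are run through SHA-1 purely as a
    'uniformizer'; any reasonably well-mixing hash function would do.
    """
    identifier = (ip_address + "|" + user_agent).encode("utf-8")
    digest = hashlib.sha1(identifier).digest()
    # Take the first four bytes of the digest as an unsigned 32-bit value.
    return int.from_bytes(digest[:4], "big")

# Example call (made-up values):
# visitor_hash("192.0.2.17", "Mozilla/5.0 (Windows NT 10.0; ...)")
```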
Imagine a dice with 2^32 sides, all equally likely, and look at the bit string of a possible outcome. What is the chance of the first bit being zero? Exactly one in two.
The chance of the first two bits both being zero is one in four. The first three bits? One in 2^3, or one in eight.
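If you want to convince yourself of that, a quick (and entirely optional) simulation in Python, with random 32-bit values standing in for hash outcomes:

```python
import random

def leading_zero_bits(value: int, width: int = 32) -> int:
    """Count how many consecutive zero bits a value starts with."""
    count = 0
    for position in range(width - 1, -1, -1):
        if value & (1 << position):
            break
        count += 1
    return count

# Roughly 1 in 2^k random values should start with at least k zero bits.
samples = [random.getrandbits(32) for _ in range(100_000)]
for k in range(1, 6):
    fraction = sum(leading_zero_bits(v) >= k for v in samples) / len(samples)
    print(f"at least {k} leading zeros: {fraction:.4f} (expected {1 / 2 ** k:.4f})")
```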
Bit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Value | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
If we take an array of bits, with all members initialized to zero at the beginning, and set bit n if we encounter a hash starting with n consecutive zeroes, eventually, after sufficient visits (that each generate a semi-unique hash key), we'll have an array that starts with a series of ones. If we have six consecutive ones in the array, then chances are that we've had about 2^6 = 64 visitors, or some other number between 2^5 and 2^7.
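Putting that together, a minimal single-sketch implementation might look like this in Python; again a sketch of the idea under my own naming, not a drop-in library:

```python
class FlajoletMartinSketch:
    """A single Flajolet-Martin sketch that fits in one 32-bit integer."""

    def __init__(self) -> None:
        self.bits = 0  # bit n is set once a hash with n leading zeros is seen

    def add(self, hash_value: int, width: int = 32) -> None:
        """Record one visit, identified by a (uniformly distributed) hash."""
        zeros = 0
        for position in range(width - 1, -1, -1):
            if hash_value & (1 << position):
                break
            zeros += 1
        self.bits |= 1 << zeros

    def consecutive_ones(self) -> int:
        """Length of the run of ones at the start of the bit array."""
        run = 0
        while self.bits & (1 << run):
            run += 1
        return run

    def estimate(self) -> int:
        """Rough estimate of the number of distinct visitors: 2^run."""
        return 2 ** self.consecutive_ones()

# Usage, with the visitor_hash helper sketched earlier:
# sketch = FlajoletMartinSketch()
# sketch.add(visitor_hash("192.0.2.17", "Mozilla/5.0 ..."))
# print(sketch.estimate())
```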
Actually, Flajolet and Martin themselves (PDF), for complicated mathematical reasons I have no hope of understanding, place the expected exponent at 0.77n for a run of n consecutive ones, and estimate its standard deviation at 1.12 bits. So six consecutive ones actually means a value between 2^3.5 ≈ 11 and 2^5.74 ≈ 53, most likely around 2^4.62 ≈ 25.
Because we're only ever storing the exponent, there's really no limit to how high we can ‘count’. A hash value that starts with thirty consecutive zeroes is astronomically unlikely; it means there will have been on the order of one billion visitors to the page. But we can store even this possibility inside a 32-bit integer value.
It should be obvious by now that this method gets less accurate as the number of visitors increases. At the beginning of the scale, the counter will be off by one or two visitors. But the difference between 2^31 and 2^32 is slightly more than a billion.
An advantage that Flajolet and Martin themselves point out is that their method of counting lends itself well to distribution: if you have three web servers, each counting unique visitors in its own Flajolet-Martin sketch, then the total number of unique visitors (without double-counting visitors who've visited web server A, then web server B, and then C) is predicted by the ORed value of the three Flajolet-Martin sketches.
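Continuing the hypothetical Python sketch from above, merging the per-server sketches really is just a bitwise OR:

```python
def merge_sketches(*sketches: FlajoletMartinSketch) -> FlajoletMartinSketch:
    """Combine per-server sketches into one sketch for the union of visitors."""
    merged = FlajoletMartinSketch()
    for sketch in sketches:
        merged.bits |= sketch.bits
    return merged

# total = merge_sketches(sketch_a, sketch_b, sketch_c)
# print(total.estimate())
```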
A few paragraphs up, we made the assumption that a hash function acts as a ‘uniformizer’: a magical function that takes a series of values that are non-uniformly distributed over the set of all possible inputs, and produces a series of values that is uniformly distributed over all possible outputs. Is this really the case? No, and yes. There is no hash function that yields a uniformly distributed output for every range of input values. But if we verify this experimentally, it becomes clear that many hash functions are ‘good enough’ for this purpose.
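One way to do such an experimental check (my own quick test, nothing rigorous) is to feed a hash function deliberately non-uniform input and see whether the leading-zero counts still halve with every extra bit:

```python
import hashlib
from collections import Counter

def leading_zeros_of_hash(text: str, width: int = 32) -> int:
    """Leading zero bits of the first 32 bits of a SHA-1 digest."""
    value = int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:4], "big")
    zeros = 0
    for position in range(width - 1, -1, -1):
        if value & (1 << position):
            break
        zeros += 1
    return zeros

# Highly non-uniform input: near-identical strings.
counts = Counter(leading_zeros_of_hash(f"user-{i}") for i in range(100_000))
for k in range(6):
    expected = 100_000 / 2 ** (k + 1)
    print(f"exactly {k} leading zeros: {counts[k]} (expected about {expected:.0f})")
```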
Another assumption is that our input values, IP addresses and browser User-Agent strings, are sufficiently unique to tell one user from another. At the time of writing, companies like Microsoft were taking action to make User-Agent information transmitted by the browser less unique, not more, so this is something that is changing rapidly.
‘Sketching’, that is, using compact data structures to record statistical properties of large, or even huge, data sets, doesn't seem to be as well known as it could be. I hope this article provides an interesting introduction to the technique, as well as a clear demonstration of a practical application.