Unique visitor counting using a bloom filter is
currently only viable when visitors number in the
thousands; Larger numbers would require more bits of
entropy than browser user agent strings are likely to
yield. However, if we're happy to get a rough estimate rather than
an exact number, we can use
*Flajolet-Martin*-sketches to get that in a very similar
way; one that takes next to no disk space, memory or processing power.

This is a followup to an article I wrote
early 2010. At the time, I thought I'd found a pretty
smart, or at least unique way for simple, privacy-friendly
visitor tracking.

I was wrong; there's an even simpler,
much more scaleable probabilistic technique that, for most
applications, is a preferable solution. Googling for it
suggests that it's not widely known, so it makes sense
to write a followup article about it.

However, much of the same concepts necessary for the original solution apply to the one explained in this document as well.

In particular, we need to extract entropy from the (semi-)unique identifiers (IP address, User-Agent string, etcetera) we can glean from from the browser. Similarly, this algorithm depends on the bits in these identifiers being uniformly distributed. So applying a hash to this information is our best bet.

We have to be willing to __assume__ that the hash function we use yields
values that are evenly distributed over the range of *possible*
values. That is, the outcome of the hash function should be about as
even (and unrelated to the input) as a dice roll.

Imagine a dice with 2^{32} sides, all equally likely, and look at the
bit string of a possible outcome. What is the chance of the first bit being
zero? Exactly one in two.

The chance of the first *two* bits both being zero is one in four. The
first three bits? One in 2^{3}, or one in eight.

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|

1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |

If we take an array of bits, with all members initialized to
*zero* at the beginning, and set bit *n* if we
encounter a hash starting with *n* consecutive
zeroes, eventually, after sufficient visits (that each
generate a semi-unique hash key), we'll have an array that
starts with a series of ones. If we have six consecutive
ones in the array, then chances are that we've had
*about* 2^{6} = 64 visitors. Or some other
number between 2^{5} and 2^{7}.

**Actually**,
Flajolet and Martin themselves (PDF), for complicated mathematical reasons
I have no hope of understanding, place the expected value at *0.77n*,
and estimate the standard deviation at 1.12 bits. So six consecutive ones
actually means a value between 2^{3.5} = 11 and 2^{5.74} = 53,
most likely to be 2^{4.62} = 21.

Because we're taking the *exponent* of the number of
bits, there's really no limit to how high we can ‘count’ A hash
value that starts with thirty consecutive zeroes is astronomically unlikely;
It means there will have been in the order of one *billion* visitors
to the page. But we can store this possibility inside a 32-bit integer value.

It should be obvious by now that this method gets less accurate as the number
of visitors increases. At the beginning of the scale, the counter will be off by
one or two visitors. But the difference between 2^{31} and 2^{32}
is slightly more than a billion.

An advantage that Flajolet and Martin themselves point out is that their method of counting lends itself well to distribution; If you have three web servers, each counting unique visitors in a Flajolet-Martin sketch, then the total number of unique visitors (discounting visitors who've visited web server A, then web server B, and then C) is predicted by the ORed value of the three Flajolet-Martin sketches.

A few paragraphs up, we've made the assumption that a hash function acts as
a ‘uniformizer’. A magical function that takes a series of values
that are non-uniformly distributed over the set of all possible inputs, and
produces a series of values that *is* uniformly distributed over all
possible outputs. Is this really the case? No, and yes. There is no hash
function that yields a uniformly distributed output for *every* range of
input values. But if we verify this
experimentally, it becomes clear that
many hash functions are ‘good enough’ for this purpose.

Another assumption is that our input values, ip addresses and browser User-Agent strings, are sufficiently unique to tell the difference between one user and the other. At the time of writing, companies like Microsoft were taking action to make User-Agent information transmitted by the browser less unique, not more, so this is something that is changing rapidly.

‘sketching’, that is, using compact data structures to record statistical properties of large, or even huge data sets, don't seem to be as well known as they could be. I hope this article provides an interesting introduction to the technique, as well as a clear demonstration of a practical application.