February 16, 2020

Explaining Benford’s Law

Filed under: Uncategorized — admin @ 10:57 am

When we look at collections of numbers drawn from real-world processes, statistical patterns emerge. Often, these patterns can be used to check that the original data was really sampled from a “naturally occurring” process, and not “faked”.

Many people have seen the Bell Curve of the Normal Distribution, which shows up when many independent random variables are averaged together. This distribution shows up even if the random variables come from totally different processes with different statistical distributions, such as grades in a class, or the students’ heights. The Central Limit Theorem explains why this happens.

Other interesting patterns include the Golden Ratio, approximated by the ratios of consecutive Fibonacci numbers, which has been claimed to appear in many places in nature, although various examples have been disputed. The ratio might show up in spirals due to rotation and scale invariance.

But perhaps more intriguing are two statistical laws that show up in empirical data, and might at first be unexpected. One is Zipf’s Law, which describes an inverse relationship between the frequency of a symbol and its rank in the frequency table. Thus, the most common word in a language appears roughly 2x as often as the second most common word, 3x as often as the third most common one, and so on. There are some interesting analyses of this from the point of view of Shannon’s and Kolmogorov’s information theories.

Plot of the Zipf CDF for N=10
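As a rough illustration of that inverse rank relationship, here is a minimal Python sketch that computes Zipf probabilities and their running total (the CDF) for N = 10 ranks. It assumes the classic exponent s = 1, and the function name `zipf_pmf` is made up for this example.

```python
def zipf_pmf(n, s=1.0):
    """Return Zipf probabilities for ranks 1..n (the exponent s is an assumption)."""
    weights = [1.0 / rank ** s for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = zipf_pmf(10)

# The running total of the probabilities gives the CDF.
cdf = []
running = 0.0
for p in pmf:
    running += p
    cdf.append(running)

for rank, (p, c) in enumerate(zip(pmf, cdf), start=1):
    print(f"rank {rank:2d}: p = {p:.3f}, cdf = {c:.3f}")
```

With s = 1 the first rank is exactly twice as likely as the second and three times as likely as the third, which is the 2x, 3x pattern described above.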

Benford’s Law

This is the other intriguing law, showing up in all kinds of numbers from stock market prices to baseball statistics. If the numbers are expressed in base 10, or indeed most other bases, the first digit does not have a uniform distribution. Rather, in base 10, the digit 1 appears about 30% of the time, 2 appears about 18% of the time, while 9 appears less than 5% of the time. Many people have wondered why this holds true across so many sets of numbers.
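For reference, the usual statement of Benford’s law gives the share of leading digit d as log10(1 + 1/d). The post does not spell out this formula, so the short sketch below is an addition, and the helper name `benford_expected` is hypothetical.

```python
import math

def benford_expected(digit, base=10):
    """Expected share of a leading digit under Benford's law in the given base."""
    return math.log(1 + 1 / digit, base)

expected = {d: benford_expected(d) for d in range(1, 10)}
for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The nine shares add up to 1, since the logarithms telescope to log10(10).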

For numbers that are generated from scale-invariant processes, such as stock market prices, the law is relatively easy to explain. When you’re at 1000, it takes 100% growth to get to 2000, then 50% growth to get to 3000, and so on, until it takes only about 11% to get from 9000 to 10,000. Then, it once again takes 100% growth to get off the first digit being 1.
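To see the growth argument in numbers, the sketch below follows a hypothetical price that starts at 1000 and grows by 1% per step, and counts how many steps it spends on each leading digit. The starting value and growth rate are arbitrary choices for illustration.

```python
def leading_digit(x):
    """First decimal digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

# Percent growth needed to move from leading digit d to the next one.
growth_needed = {d: 100 / d for d in range(1, 10)}

# Hypothetical price path: start at 1000 and grow 1% per step.
counts = {d: 0 for d in range(1, 10)}
price = 1000.0
while price < 10_000:
    counts[leading_digit(price)] += 1
    price *= 1.01

steps = sum(counts.values())
shares = {d: counts[d] / steps for d in counts}
for d in range(1, 10):
    print(f"{d}: needs {growth_needed[d]:.0f}% growth, share of steps {shares[d]:.1%}")
```

Under steady percentage growth, the time spent on each leading digit is proportional to the growth needed to leave it, so the share for 1 should land near 30% and the share for 9 near 5%.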

But what’s more interesting is that Benford’s law also often applies when the numbers come from various uniform distributions! That is to say, the real-life process is not scale invariant, but rather generates values evenly distributed between a and b. Why does the law apply then?

I wanted to write down an easy explanation that occurred to me today. Uniform distributions cannot span the entire number line, because then the total area under the curve would be infinite, violating the requirement that the total probability over the whole range equals 1. Thus, a uniform distribution must lie between some two numbers a and b.

PDF of the uniform probability distribution using the maximum convention at the transition points.

To keep things simple, let’s assume that a = 0, so we have a process that generates some non-negative numbers. It can be either a discrete process (generating whole numbers) or a continuous one (generating arbitrary real numbers in the range). There is some maximum number b that the process can generate, and the sampled results, represented by the random variable X, are evenly distributed between 0 and b.

Using basic probability, and treating the upper bound b as itself uncertain, we can calculate the chances of X starting with the digit 5 by summing over all integers N the following:

Σ P(N ≤ b < N+1) • P(first digit of X is 5 | N ≤ b < N+1)

Now, breaking up the probabilities in this way, we can see why Benford’s law applies even in the case of uniformly distributed results. When b = 499, for instance, it’s true we have an equal chance of getting 10 ≤ X < 20 as we do for 80 ≤ X < 90, so if X consists of two digits (before the decimal point), it’s just as likely to start with a 1 as with an 8. However, for three-digit numbers X, we see that none of the sampled results can start with 5, 6, 7, 8 or 9. There are, in fact, hundreds of three-digit values (before the decimal point) that can result from the process, ranging from 100 to 499. The range 100…499 is over four times larger than the range 10…99, so given the uniform distribution, X is far more likely to yield values there, with the first digit being between 1 and 4.

Similarly, if b were 399, or 350, you’d be far more likely to get a 1, 2, or 3 as the leading digit, due to that larger range 100…b being included. In fact, given any b you can calculate the exact probability of each leading digit, but in true math teacher fashion, I will leave this as an exercise to the dear reader. The main thing I wanted to convey was the intuition.
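For readers who would rather check the exercise numerically than by hand, here is a brute-force sketch for the discrete case. It assumes X is a whole number drawn uniformly from 1 to b (zero is skipped because it has no leading nonzero digit), and the function name `leading_digit_shares` is invented for this example.

```python
from collections import Counter

def leading_digit_shares(b):
    """Share of each leading digit among the whole numbers 1..b (0 is skipped)."""
    counts = Counter(int(str(n)[0]) for n in range(1, b + 1))
    return {d: counts[d] / b for d in range(1, 10)}

for b in (499, 399, 350):
    shares = leading_digit_shares(b)
    print(b, {d: round(p, 3) for d, p in shares.items()})
```

For b = 499, each of the digits 1 to 4 leads 111 of the 499 numbers (one single-digit value, ten two-digit values and a hundred three-digit values), while each of 5 to 9 leads only 11.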

Finally, remember that we are summing the probability over all N. Unless N sits right at a power of 10, the first digit of X will simply not be uniformly distributed, since the possible values between N+1 and the next power of 10 are not going to come up in the sampling. Thus, for N = 200, slightly over half of our numbers will start with 1 (that is to say, 1 itself, all the numbers 10-19, and 100-199). As N increases toward 300, the proportion of numbers starting with 2 starts to increase, until just before N = 300 it is equally likely for a number to start with 1 or 2 (and far less likely for it to start with 3 or higher).

When you sum all of this up, you see that the digit 1 gets a big boost as N goes from 100 to 200, and retains that boost as 2 starts to experience its own initial boost, and so on. By the time you get to N = 1000, the digit 1 has taken part in nine of these boosts, while the digit 9 has had just one of them. Moreover, each boost a digit received had to be shared with the previous digits: at N = 200 the digit 1 holds about 1/2 of the values, at N = 300 the digits 1 and 2 split them evenly at about 1/3 each, and so on.
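The averaging over N can also be checked directly. The sketch below assumes that every whole-number upper bound N from 100 to 999 is equally likely, which is an illustrative choice rather than something the argument pins down, and averages the leading-digit shares of 1..N over all of those bounds. The function name `average_shares` is hypothetical.

```python
def average_shares(n_min, n_max):
    """Average the leading-digit shares of 1..N over every N in n_min..n_max.

    Assumes each upper bound N is equally likely (an illustrative assumption).
    """
    counts = {d: 0 for d in range(1, 10)}
    totals = {d: 0.0 for d in range(1, 10)}
    for n in range(1, n_max + 1):
        counts[int(str(n)[0])] += 1  # counts now cover the numbers 1..n
        if n >= n_min:
            for d in counts:
                totals[d] += counts[d] / n
    how_many = n_max - n_min + 1
    return {d: totals[d] / how_many for d in totals}

averaged = average_shares(100, 999)
for d, p in averaged.items():
    print(f"{d}: {p:.3f}")
```

Because every individual N gives digit 1 at least as many values as digit 2, digit 2 at least as many as digit 3, and so on, the averaged shares fall steadily from 1 down to 9.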

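Adding up these contributions, by the time N reaches 1000 the digit 1 has come first with roughly the following weight:

1/10 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + 1/9 ≈ 1.93

while the digit 9 has come first with roughly the following weight:

1/10 + 1/9 ≈ 0.21

These are relative weights rather than probabilities, but their ratio is about 9 to 1: if 1 appeared roughly 50% of the time, 9 would appear only about 5% of the time. That is a cruder gap than the roughly 30% versus 5% that Benford’s law actually predicts, but the direction and rough size of the skew come out right.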
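The two rough sums can be evaluated exactly, and their ratio compared with the ratio given by the standard Benford formula log10(1 + 1/d). That formula is not derived in this post, so the comparison below is an added sanity check rather than part of the original argument.

```python
from fractions import Fraction
import math

# Rough weights from the sums above: a 1/10 baseline plus the shared boosts.
weight_1 = Fraction(1, 10) + sum(Fraction(1, k) for k in range(2, 10))
weight_9 = Fraction(1, 10) + Fraction(1, 9)
heuristic_ratio = weight_1 / weight_9

# Ratio of digit 1 to digit 9 under the standard Benford formula.
benford_ratio = math.log10(2) / math.log10(1 + 1 / 9)

print(f"weight for 1: {float(weight_1):.3f}")
print(f"weight for 9: {float(weight_9):.3f}")
print(f"heuristic ratio: {float(heuristic_ratio):.2f}")
print(f"Benford ratio:   {benford_ratio:.2f}")
```

The heuristic overshoots the Benford ratio somewhat, which is expected from such a rough count, but it lands in the same ballpark.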
