Simon Newcomb and "natural numbers" (Benford's Law)

Background

Simon Newcomb (1835-1909), the Canadian-American astronomer and mathematician, published in 1881 a "Note on the Frequency of Use of the Different Digits in Natural Numbers." For Newcomb, natural numbers were those occurring "in nature," i.e. the kind of numbers one would run into in the course of everyday life. He discovered, for example, that not all the digits (1, 2, ..., 9) occur with the same frequency in the first place of such a number; he formulated a law (see below) and gave a rough proof which I will attempt to present. This law was rediscovered by Frank Benford ("The law of anomalous numbers," 1938) and is now somewhat unfairly known as "Benford's Law." A mathematically sound and complete proof was published by Theodore Hill in 1995.

An Experiment with the New York Times

To get some experimental feeling for the phenomenon, I looked at all the numbers given as numerals in the first 15 pages of the New York Times for Saturday, February 21, 2009. I omitted dates and advertisements, and repeats (in the same context, in captions or in tables). For each of those 213 numbers I recorded the first digit, and tabulated the data as follows:

digit	occurrences	frequency
1	56	.26
2	48	.23
3	27	.13
4	20	.09
5	30	.14
6	11	.05
7	8	.04
8	9	.04
9	4	.02

For some of the flavor of Newcomb's "natural number" concept, here are the 8 numbers from this set with initial digit 7:

page	number	reference
A3	71	age of Jane Fonda
A10	70,000	illegal gambling proceeds, Wilkes-Barre, Pa.
A12	787 billion	U.S. economic stimulus package, 2/09
A12	7,365.67	Dow Jones Industrial Average, 2/20/09
A14	7.2	magnitude of hypothetical earthquake
A14	744,000	population of San Francisco
A14	71	age of Senator Ronald W. Burris
A15	70,000	low-end starting salary for butler, New York City

Newcomb's Law

Clearly the distribution is very unsymmetrical. Newcomb tells us how he was led to his discovery: "That the ten digits do not occur with equal frequency must be evident to anyone making much use of logarithmic tables, and noticing how much faster the first pages wear out than the last ones. The first significant figure is oftener 1 than any other digit, and the frequency diminishes up to 9." The place where he noticed the phenomenon gave him a clue to its explanation, which he formulated thus:

The law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable.
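As a rough check of what this law predicts for first digits, here is a minimal Python sketch. It relies on the standard consequence that a number has first digit d exactly when the mantissa (the fractional part of its base-10 logarithm, explained below) lies between log(d) and log(d+1), so that equally probable mantissae give the digit d a frequency of log(d+1) - log(d). The names nyt_counts and predicted are illustrative only; the counts are copied from the New York Times table above.

```python
import math

# First-digit counts from the New York Times table above (213 numbers).
nyt_counts = {1: 56, 2: 48, 3: 27, 4: 20, 5: 30, 6: 11, 7: 8, 8: 9, 9: 4}
total = sum(nyt_counts.values())

def predicted(d):
    """Length of the mantissa interval [log10(d), log10(d+1)).

    If all mantissae are equally probable, this length is the
    probability that a number has first digit d.
    """
    return math.log10(d + 1) - math.log10(d)

# Print observed and predicted frequencies side by side.
for d in range(1, 10):
    observed = nyt_counts[d] / total
    print(f"{d}  observed {observed:.2f}  predicted {predicted(d):.2f}")
```

With only 213 numbers the agreement can only be approximate, but the predicted frequencies show the same steady decline from digit 1 to digit 9 that Newcomb describes.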

"Mantissae" probably seems as archaic to today's readers as a starter crank on the front of an automobile, but until 1960 or so every high-school science student was taught the lore of logarithms, and in particular how to use "common" (base-10) logarithmic tables in calculation. Their use involved the separation of a logarithm into two parts: its integer part (the characteristic) and its fractional part (the mantissa). Here is an example:

Suppose, before the days of hand-held calculators, you needed a rapid way to multiply four-digit numbers, and to divide that product by another four-digit number, with an answer accurate to three digits. Say

86.73 X 1.265 X 7607 / .3018.

Procedure: You think of each of the numbers as a power of 10 times a number between 1 and 10:

86.73 = 10¹ X 8.673
1.265 = 10⁰ X 1.265
7607 = 10³ X 7.607
.3018 = 10⁻¹ X 3.018.

When you take logarithms, since log(ab) = log a + log b,

log(86.73) = 1 + log(8.673)
log(1.265) = 0 + log(1.265)
log(7607) = 3 + log(7.607)
log(.3018) = -1 + log(3.018).

The second term in each of the logs is a number between 0 and 1: this is the mantissa; the leading term is the characteristic. To obtain the log of the product we want, log(86.73) + log(1.265) + log(7607) - log(.3018), you make two calculations. First you add or subtract the characteristics; this is an integer calculation. Users of the slide rule (an analogue device that conveniently replaced the consultation of logarithmic tables, and was common through the first half of the twentieth century) would do this part in their heads. In this case the total is 1 + 0 + 3 - (-1) = 5. Then you consult a four-place logarithmic table for the mantissae:

log(8.673) = .93817
log(1.265) = .10209
log(7.607) = .88121
log(3.018) = .47972.

The mantissae total (with signs) to 1.44175. You chop off the "1" and add it to the characteristic, leaving a mantissa of .44175. The log table gives log(2.765) = .44170 and log(2.766) = .44185. Since you only expect 3 places of accuracy, you can take 2.765 as the number corresponding to that mantissa, and calculate the product as 10⁵⁺¹ X 2.765 = 2765000. Feeding the numbers into a digital calculator gives an answer to nine places: 2765375.13; but if the factors have an indeterminacy in the fifth place, the fourth digit in the product is not reliable: the extra precision is illusory.
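The table procedure above can be retraced in a few lines of Python. This is only a sketch: math.log10 stands in for the printed logarithmic table, and the helper split_log is a name introduced here for illustration.

```python
import math

def split_log(x):
    """Return (characteristic, mantissa) of the base-10 logarithm of x > 0."""
    log = math.log10(x)
    characteristic = math.floor(log)        # integer part
    return characteristic, log - characteristic  # mantissa lies in [0, 1)

# Each factor with the sign of its logarithm: +1 to multiply, -1 to divide.
factors = [(86.73, +1), (1.265, +1), (7607, +1), (0.3018, -1)]

char_total = 0      # integer calculation, done "in the head"
mant_total = 0.0    # the part that needed the table
for value, sign in factors:
    c, m = split_log(value)
    char_total += sign * c
    mant_total += sign * m

# Chop off the integer part of the mantissa total and carry it over.
carry = math.floor(mant_total)
mantissa = mant_total - carry
exponent = char_total + carry

product = 10 ** mantissa * 10 ** exponent
print(char_total, round(mant_total, 5), exponent, round(product, 2))
```

Here char_total is the 5 found above, and carry is the "1" that gets moved over to the characteristic.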

Newcomb's argument

Newcomb first argues that all his "natural numbers" are ratios. This makes sense because most natural numbers are given in units, and the number exhibited is the ratio of some measurement to the same measurement taken on some more or less arbitrary token, e.g. the standard kilogram, the solar year. Then he argues that the set of natural numbers must be closed under further formation of ratios, i.e. under multiplication and division. This implies that the set of logarithms of natural numbers is closed under addition and subtraction; and in particular that the set of mantissae of logarithms of natural numbers is closed under addition and subtraction modulo 1, since as in the example above, when a sum of mantissae is greater than 1 the integer part is moved over to the characteristic; and similarly when it is negative. In Newcomb's words: "Since these exponents [the mantissae] are formed by casting off all the integers from a series of numbers, we may suppose them arranged around a circle ..." where we can add and subtract them like angles, except modulo 1 instead of modulo 2π.
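The correspondence between multiplying numbers and adding their mantissae modulo 1 can be checked numerically. The following is a small sketch, using factors from the worked example above; the helpers mantissa and circle_distance are names introduced here, and the comparison is made up to floating-point rounding.

```python
import math

def mantissa(x):
    """Fractional part of log10(x) for x > 0, a point on the circle [0, 1)."""
    log = math.log10(x)
    return log - math.floor(log)

def circle_distance(a, b):
    """Distance between two points on the circle of circumference 1."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

# Multiplication corresponds to addition of mantissae modulo 1.
a, b = 86.73, 7607
product_ok = circle_distance(
    mantissa(a * b), (mantissa(a) + mantissa(b)) % 1.0
) < 1e-9

# Division corresponds to subtraction of mantissae modulo 1.
c, d = 1.265, 0.3018
quotient_ok = circle_distance(
    mantissa(c / d), (mantissa(c) - mantissa(d)) % 1.0
) < 1e-9

print(product_ok, quotient_ok)
```

In the quotient case the difference of mantissae is negative, and reducing it modulo 1 is exactly the borrowing from the characteristic described above.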

Newcomb's leap

Next Newcomb asks the question (translated into our notation): Given a number of points on the circle distributed "according to any arbitrary law," choose n of them at random, say s₁, s₂, ..., s_n, and form the sum s₁ ± s₂ ± ... ± s_n (modulo 1). What is the probability that this sum will be contained in a given interval of length ds? And he answers: "It is evident that, whatever may be the original law of arrangement," the set of such sums "will approach to an equal distribution around the circle as n is increased," or, in other words, "the required probability will be equal to ds." That is: the law of probability of the occurrence of numbers is such that all mantissae of their logarithms are equally probable.
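Newcomb's claim can be probed with a small Monte Carlo experiment. The sketch below is an illustration, not a proof, and every choice in it is an assumption made here: the "arbitrary law" is taken to be the cube of a uniform random number (which crowds the points near 0), the signs are chosen by coin toss, the circle is cut into 10 equal arcs, and the seed is fixed so the run is repeatable.

```python
import random

def arbitrary_mantissa(rng):
    """A deliberately lopsided 'arbitrary law': points pile up near 0."""
    return rng.random() ** 3

def signed_sum_mod1(n, rng):
    """Form s_1 +/- s_2 +/- ... +/- s_n modulo 1 with random signs."""
    total = arbitrary_mantissa(rng)
    for _ in range(n - 1):
        sign = rng.choice((1, -1))
        total += sign * arbitrary_mantissa(rng)
    return total % 1.0

def bin_frequencies(n, samples=20000, bins=10, seed=0):
    """Fraction of the sums landing in each of `bins` equal arcs."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(samples):
        x = signed_sum_mod1(n, rng)
        # min() guards against a float result that rounds up to exactly 1.0
        counts[min(int(x * bins), bins - 1)] += 1
    return [c / samples for c in counts]

def max_deviation(freqs):
    """Largest gap between an arc's frequency and the uniform value."""
    uniform = 1.0 / len(freqs)
    return max(abs(f - uniform) for f in freqs)

for n in (1, 2, 4, 12):
    print(n, round(max_deviation(bin_frequencies(n)), 3))
```

For n = 1 the first arc holds far more than a tenth of the points; as n grows the largest deviation from 1/10 shrinks toward the level of sampling noise, which is the behavior Newcomb calls "evident."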

This is not evident, but it is plausible. The following figure shows a small simulation of the phenomenon. Here just two "mantissae" s and t, corresponding say to natural numbers m and n, are chosen; the mantissae corresponding to the products m^i n^j are plotted around the circle of numbers modulo 1, for i, j running from 0 to 8. Comparison with the logarithms of numbers starting with 1, 2, etc. suggests an explanation for the distribution of these numbers among natural numbers.

a. An illustration of the equal distribution phenomenon Newcomb refers to. Here two numbers s and t are chosen on the circle of circumference 1 (I took numbers corresponding to angles 41° and 95°); the green angles correspond to all the numbers of the form i s + j t (modulo 1), for i and j integers between 0 and 8. b. The mantissae corresponding to the integers 1, 2, ..., 9. This is the same display that occurs on a circular slide-rule (see below).
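The two panels of the figure can be combined numerically. The sketch below recomputes the 81 points i s + j t (modulo 1) for the angles 41° and 95°, and counts how many fall on the arc of the circle belonging to each leading digit d, that is, the arc from log(d) to log(d+1). The variable names are illustrative, and the angles are kept as whole degrees so that the reduction modulo 1 is exact.

```python
import math

S_DEG, T_DEG = 41, 95   # the two angles used in panel a, in degrees

# Points i*s + j*t on the circle of circumference 1, for i, j = 0..8.
points = [
    ((i * S_DEG + j * T_DEG) % 360) / 360
    for i in range(9)
    for j in range(9)
]

def leading_digit(mantissa):
    """First digit of any number whose log10 has this mantissa."""
    return int(10 ** mantissa)

# Count the points landing on each digit's arc.
counts = {d: 0 for d in range(1, 10)}
for p in points:
    counts[leading_digit(p)] += 1

for d in range(1, 10):
    arc = math.log10(d + 1) - math.log10(d)   # arc length for digit d
    print(f"{d}  share of points {counts[d] / len(points):.2f}  arc length {arc:.2f}")
```

If the points are spread roughly evenly around the circle, each digit's share of the points should be close to the length of its arc, and the arc for 1 is much the longest.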

Part of a circular slide-rule designed by John W. Mauchly. Mauchly was one of the designers of the ENIAC, the first large-scale general-purpose electronic computer. There was presumably another, smaller, paper disc with similar gradations that could rotate on top of this one, and probably a rotating pointer for keeping track of locations. Image courtesy of University of Pennsylvania Libraries.

Recent history

It took more than a hundred years for a satisfactory explanation of Newcomb's observation to appear. The main stumbling block was the lack of a precise mathematical concept corresponding to Newcomb's "natural numbers." Theodore Hill realized that base-invariance was the key property: the uniform distribution of mantissae of natural numbers in any base (not only in base 10); this had already been remarked by Newcomb. As Hill states it, "there is a unique countably-additive base-invariant probability measure on the positive reals."

References

Frank Benford, The law of anomalous numbers, Proceedings of the American Philosophical Society 78 (1938) 551-572

Theodore P. Hill, Base-invariance implies Benford's law, Proceedings of the A. M. S. 123 (1995) 887-895

Simon Newcomb, Note on the Frequency of Use of the Different Digits in Natural Numbers, American Journal of Mathematics 4 (1881) 39-40