2014-05-30

Benford's Law

The hackernewsletter today linked to the Wikipedia article on Benford’s Law, which is a rather interesting topic.

Benfords Law distribution (Wikipedia) Benford’s Law holds that in many common data sets, the distribution of the first digit of each member of the set forms a gentle curve, wherein about 30% of the numbers start with the digit 1 down to about 5% starting with 0.

I was curious to see what other data sets might fit within this distribution.

Just a few ruby scripts away!

Ruby’s Random Number Generator

Like a good PRNG, ruby’s rand method returns numbers that are fairly evenly distributed over the output range.

[15:11:42 mburke]$ ruby benford.rb random
1 : 102  : ###################################################
2 : 108  : ######################################################
3 : 106  : #####################################################
4 : 108  : ######################################################
5 : 120  : ############################################################
6 : 113  : ########################################################
7 : 108  : ######################################################
8 : 117  : ##########################################################
9 : 118  : ###########################################################
0 : 0    :

Length of English Words

I also tried the length of English words found in the Mac’s default /usr/share/dict/words file

[14:53:02 mburke]$ ruby benford.rb words
1 : 146417  : ########################################
2 : 823     :
3 : 160     :
4 : 1420    :
5 : 5272    : #
6 : 10230   : ##
7 : 17706   : ####
8 : 23869   : ######
9 : 29989   : ########
0 : 0       :

Since almost all words are less than 20 letters, its not surprising that the majority of them fall between 10 and 19 letters long.

Lines in Main Source Code Folder

I calculated the number of lines in each of the files in our main source code repository. This data set did rather closely follow the expected distribution.

[15:02:20 mburke]$ find . -type f -exec wc -l {} \; | cut -d " " -f 1 > ~/Desktop/source-lengths.txt

[15:05:35 mburke]$ ruby benford.rb file ~/Desktop/source-lengths.txt
1 : 6769   : ################
2 : 16075  : ########################################
3 : 4370   : ##########
4 : 2947   : #######
5 : 1646   : ####
6 : 1471   : ###
7 : 1026   : ##
8 : 1064   : ##
9 : 991    : ##
0 : 75     :

Twitter Stats

Using the awesome t gem, I calculated the Benford distribution for the number of followers, following, and tweets of the people I follow on twitter.

[22:16:04 mburke]$ t leaders | xargs t users --csv -l >> leaders.csv

Tweets
[22:49:21 mburke]$ ruby ~/personal/ruby/benford.rb file <(tail -n+1 leaders.csv | csvfix order -f 4 | sed 's/\"//g' )
1 : 64  : ############################################################
2 : 41  : ######################################
3 : 40  : #####################################
4 : 14  : #############
5 : 23  : #####################
6 : 15  : ##############
7 : 18  : ################
8 : 12  : ###########
9 : 13  : ############
0 : 2   : #

Following
[22:50:52 mburke]$ ruby ~/personal/ruby/benford.rb file <(tail -n+1 leaders.csv | csvfix order -f 7 | sed 's/\"//g' )
1 : 66  : ############################################################
2 : 31  : ############################
3 : 34  : ##############################
4 : 23  : ####################
5 : 18  : ################
6 : 15  : #############
7 : 9   : ########
8 : 11  : ##########
9 : 7   : ######
0 : 28  : #########################

Followers
[22:51:28 mburke]$ ruby ~/personal/ruby/benford.rb file <(tail -n+1 leaders.csv | csvfix order -f 8 | sed 's/\"//g' )
1 : 69  : ############################################################
2 : 48  : #########################################
3 : 32  : ###########################
4 : 24  : ####################
5 : 20  : #################
6 : 12  : ##########
7 : 12  : ##########
8 : 19  : ################
9 : 5   : ####
0 : 1   :

Nothing in these sets followed Benford’s Law perfectly, though they at least have the gradual drop off as the first digits grows.

Perhaps a bigger data set would converge better.

I’ll update this as I think of more sets to try.