#1016 What is Zipf’s Law?

What is Zipf’s Law? It is a statistical pattern in a ranked data set where the largest item is about twice as big as the second item, three times as big as the third item, four times as big as the fourth item, and so on. In other words, the size of an item is roughly inversely proportional to its rank.

Zipf’s Law is called a law, but it is not really a law because it fits in some cases and not in others. It appears to fit most closely when applied to languages, and it seems to fit all languages, not just a few. It has been tested on many works of literature and language corpora.

The man behind the law was a linguist called George Kingsley Zipf. He lived from 1902 to 1950 and he spent most of his professional life researching the statistical occurrence of words in different languages. This sounds like the profession of a mathematician, but he saw himself as a linguist and didn’t actually like math.

So, how does the law work? If you take an English book and count the frequency of all the words, you will find that the most common word is “the”, which will make up about 7% of the words. The second most common word is “of”, which will make up about 3.5% of the words. “The” is two times more common than “of”. The third most common word will be “and”, which makes up about 2.8% of the words. “The” is about two and a half times more common than “and”, which is close to the factor of three that the law predicts. This will carry on as you keep counting the words, until you get down to the uncommon words. This pattern makes a smooth graph where the line starts in the top left with “the” and descends steadily as the words decrease in frequency.
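The counting described above is easy to try on any text. The following is a minimal sketch, not a rigorous test: the function name `rank_words`, the simple word-splitting rule, and the made-up sample sentence are all assumptions chosen for illustration. It counts the words, ranks them, and puts each count next to the value Zipf’s Law would predict, which is the top count divided by the rank.

```python
import re
from collections import Counter


def rank_words(text, top_n=5):
    """Return (rank, word, count, zipf_prediction) for the top_n words."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top_n)
    if not ranked:
        return []
    top_count = ranked[0][1]
    # Zipf's Law predicts that the word at rank r occurs about top_count / r times.
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


# Tiny made-up sample, far too short to show the pattern reliably.
sample = "the cat and the dog and the bird saw the end of the road"
for rank, word, count, predicted in rank_words(sample):
    print(f"{rank}. {word!r}: counted {count}, Zipf predicts about {predicted:.1f}")
```

To see the pattern properly, replace `sample` with the full text of a long book. A single sentence only shows how the counting works.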

Zipf was only able to do this with a few languages and with books or corpora, because each word had to be counted manually. Now, with computers, it is possible to check Zipf’s Law across huge amounts of text on the Internet and in many languages, and broadly the same results appear. The most common word in a language is roughly twice as common as the next, three times as common as the third, and so on.

This is fascinating, but it doesn’t only apply to languages. Zipf’s Law can be seen in a whole range of different situations.

One is website popularity. A small number of websites get most of the traffic. A few websites have roughly twice as much traffic as the next tier of websites, three times as much as the next tier, and so on. Another is the number of followers on social media. A small number of people have most of the followers. The people with the most followers on Instagram and TikTok have about twice as many followers as the next tier, three times as many as the third tier, and so on. The same goes for content on YouTube. A small number of people produce most of the content, and the people that produce the most content produce about twice as much as the next tier, and so on. It also shows up in the salaries of actors, where a small number of actors earn most of the money. And it shows up in the size of cities. In many countries, the largest city is about twice as large as the second largest city, three times as large as the third largest city, and so on. There are many more examples.

The city example is interesting. If you look at the US, the largest city is New York, which has roughly 8.2 million people. Second largest is Los Angeles with 3.8 million. Third largest is Chicago with 2.6 million. Zipf’s Law would predict about 4.1 million and 2.7 million for the second and third cities, so these numbers are a close fit.
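The arithmetic behind the city comparison can be checked in a few lines. This is a small sketch using only the three population figures quoted above; the helper name `zipf_comparison` is an assumption made for illustration. It divides the largest population by each rank to get the predicted size and then shows how far the real figure is from it.

```python
# Populations in millions, as quoted in the text above.
cities = [("New York", 8.2), ("Los Angeles", 3.8), ("Chicago", 2.6)]


def zipf_comparison(ranked_sizes):
    """Pair each observed size with the Zipf prediction: largest / rank."""
    largest = ranked_sizes[0][1]
    rows = []
    for rank, (name, observed) in enumerate(ranked_sizes, start=1):
        predicted = largest / rank
        # A ratio of 1.0 would be a perfect match with the prediction.
        rows.append((name, observed, predicted, observed / predicted))
    return rows


for name, observed, predicted, ratio in zipf_comparison(cities):
    print(f"{name}: observed {observed:.1f}M, predicted {predicted:.1f}M, ratio {ratio:.2f}")
```

Los Angeles and Chicago both come out within about 10% of the predicted size, which is why the US is often used as an example. Three cities are not much evidence on their own, though.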

So, why should this happen? Is there an explanation? There is, but the explanations tend to be specific to the situation. The reason why Zipf’s Law fits language is different from the reason why it fits the size of cities in a country. One of the theories is that growth begets growth at a proportional rate. As a city gets larger, more people want to move there and it grows even more. People tend to want to move upwards, so they travel from larger city to larger city until they end up at the largest city. A lot of the results are explained like this. The actors that get paid more are more famous and get more jobs. The number of followers an Instagrammer has increases because the more followers someone has, the more new followers those followers are going to bring in.
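The “growth begets growth” idea can be tried out with a toy simulation. This is only a sketch under made-up assumptions, not a model taken from the sources: the function `simulate_growth`, the number of steps, and the 5% chance of founding a new city are all invented for illustration. Each newcomer either founds a new city or joins an existing one, and a city’s chance of being picked is proportional to its current size.

```python
import random


def simulate_growth(steps=20000, new_city_prob=0.05, seed=0):
    """Grow cities one newcomer at a time; bigger cities attract more newcomers."""
    rng = random.Random(seed)
    sizes = [1]      # population of each city
    residents = [0]  # one entry per person: the index of the city they live in
    for _ in range(steps):
        if rng.random() < new_city_prob:
            # Occasionally a newcomer founds a brand new city of size 1.
            sizes.append(1)
            residents.append(len(sizes) - 1)
        else:
            # Picking a random person picks a city in proportion to its size.
            city = rng.choice(residents)
            sizes[city] += 1
            residents.append(city)
    return sorted(sizes, reverse=True)


sizes = simulate_growth()
for rank, size in enumerate(sizes[:5], start=1):
    print(f"rank {rank}: size {size}, largest / rank = {sizes[0] / rank:.0f}")
```

A run like this produces a few very large cities and a long tail of tiny ones. The sizes will not necessarily match the largest-divided-by-rank prediction exactly, which fits the point that this kind of explanation is a theory rather than a proof.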

One thing worth pointing out is that this is not a law. There are so many things where it doesn’t apply. In fact, even with cities, there are many countries where it doesn’t apply. However, it is a very interesting statistical finding. And this is what I learned today.

Photo by Jess Bailey Designs: https://www.pexels.com/photo/open-textbook-762687/

Sources

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4176592

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5172588

https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

https://www.geeksforgeeks.org/zipfs-law

https://bootcamp.uxdesign.cc/how-zipfs-law-can-help-you-understand-the-world-around-you-b6e34c64e9d5

https://www.techtarget.com/whatis/definition/Zipfs-Law

https://en.wikipedia.org/wiki/George_Kingsley_Zipf

https://en.wikipedia.org/wiki/Zipf%27s_law