Zipf's law: a curious social and mathematical phenomenon

We use thousands of words every day, with meanings of all kinds and belonging to very varied grammatical categories. However, not all of them are used with the same frequency. Depending on how important they are to the structure of the sentence, some words recur more than others.

Zipf's law is a postulate that takes this phenomenon into account and specifies how likely a word is to be used based on its position in the ranking of all the words used in a language. Below we look at this law in more detail.

Zipf's law

George Kingsley Zipf (1902–1950) was an American linguist, born in Freeport, Illinois, who came across a curious phenomenon in his studies of comparative philology. While carrying out statistical analyses, he found that the most used words seemed to follow a pattern of appearance; this was the origin of the law that bears his surname.

According to Zipf's law, in the vast majority of cases, if not always, the words used in a written text or in a spoken conversation follow this pattern: the most used word, which occupies first position in the ranking, is used twice as often as the second most used, three times as often as the third, four times as often as the fourth, and so on.

In mathematical terms, this law would be:

$$P_n \approx \frac{1}{n^{a}}$$

where $P_n$ is the frequency of the word that occupies rank $n$ and the exponent $a$ is approximately 1.
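To make the formula concrete, the following Python sketch computes the relative frequencies the law predicts for the top-ranked words. The function name `zipf_frequencies` and the normalization over a finite vocabulary of `n_words` are assumptions made for illustration; the law itself only states a proportionality.

```python
def zipf_frequencies(n_words, a=1.0):
    """Relative frequencies predicted by P_n ~ 1 / n**a for ranks 1..n_words.

    The weights are normalized so that they sum to 1, which is an
    illustrative assumption (the law only fixes the proportions).
    """
    weights = [1.0 / rank ** a for rank in range(1, n_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]


freqs = zipf_frequencies(5)
# With a = 1, the first word is twice as frequent as the second,
# three times as frequent as the third, and so on.
ratios = [freqs[0] / f for f in freqs]
print([round(r, 6) for r in ratios])  # prints [1.0, 2.0, 3.0, 4.0, 5.0]
```

Changing `a` away from 1 makes the frequencies fall off faster or slower with rank.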

It should be said that George Zipf was not the only one to observe this regularity in the frequency of the most used words of many languages, both natural and artificial. Others are known to have done so as well, such as the stenographer Jean-Baptiste Estoup and the physicist Felix Auerbach.

Zipf studied this phenomenon in English texts and, apparently, it holds. If we take the original version of The Origin of Species by Charles Darwin (1859), we see that the most used word in the first chapter is "the", appearing about 1,050 times, while the second is "and", appearing about 400 times, and the third is "to", appearing about 300 times. The fit is not exact, but the second word appears roughly half as often as the first and the third roughly a third as often.
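A count like this is easy to reproduce. The sketch below is one possible way to rank the words of a text and compare the counts with what Zipf's law (exponent 1) would predict from the top word. The helper names `rank_words` and `compare_with_zipf` are hypothetical, and the tokenization (lowercase runs of letters and apostrophes) is a simplifying assumption. The figures used are the approximate counts quoted above.

```python
import re
from collections import Counter


def rank_words(text, top=3):
    """Return the `top` most frequent words in `text` as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)


def compare_with_zipf(ranked):
    """Pair each observed count with the count predicted from the top word,
    assuming an exponent of exactly 1 (predicted = top_count / rank)."""
    top_count = ranked[0][1]
    return [(word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Approximate counts quoted for the first chapter of The Origin of Species.
darwin = [("the", 1050), ("and", 400), ("to", 300)]
for word, observed, predicted in compare_with_zipf(darwin):
    print(f"{word}: observed {observed}, predicted {predicted:.0f}")
```

For these figures the predicted counts are 1050, 525 and 350, so the observed values of 400 and 300 fall somewhat below the ideal curve. Passing the full text of a chapter to `rank_words` would produce the ranked list directly.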

The same thing happens in Spanish. If we take the original Spanish version of this article as an example, we can see that the word "de" ("of") is the most used, appearing 85 times, while the word "la", the second most used, appears 57 times.

Seeing that this phenomenon also occurs in other languages, it is interesting to think about how the human brain processes language. Although many cultural phenomena shape the use and meaning of words, and language is itself a cultural factor, the way in which we use the most frequent words seems to be independent of culture.

Frequency of function words

Let's look at the following ten Spanish words, given here in English translation (which is why some of them repeat): 'what', 'from', 'not', 'to', 'the', 'the', 'is', 'and', 'in' and 'what'. What do they all have in common? They carry no meaning on their own and yet, ironically, they are the 10 most used words in the Spanish language.

By saying that they lack meaning, we mean that a sentence containing no noun, adjective, verb or adverb is meaningless. For example:

… and…… in…… one… of…… to… of……

On the other hand, if we replace the dots with meaningful words, we can get a sentence like the following:

Miguel and Ana have a brown table next to their bed at home.

These frequently used words are known as function words, and they are in charge of giving grammatical structure to the sentence. They are not only the 10 that we have seen; in fact there are dozens of them, and all of them are among the hundred most used words in Spanish.

Although they are meaningless on their own, they cannot be omitted from any sentence that is meant to make sense. To transmit a message efficiently, human beings need to resort to the words that make up the structure of the sentence. For this reason they are, curiously, the most used.

Research

Despite what George Zipf observed in his studies of comparative philology, until relatively recently it had not been possible to test the postulates of the law empirically. This was not because it was materially impossible to analyze all conversations or texts in English, or in any other language, but because of the titanic effort that the task implied.

Fortunately, thanks to modern computing and software, it has become possible to investigate whether this law holds in the way Zipf originally proposed it or whether there are variations.

One case is the research carried out by the Centre for Mathematical Research (CRM, in Catalan Centre de Recerca Matemàtica), linked to the Autonomous University of Barcelona. Researchers Álvaro Corral, Isabel Moreno García and Francesc Font Clos carried out a large-scale analysis of thousands of digitized texts in English to see how well Zipf's law held.

Their work, which analyzed an extensive corpus of about 30,000 volumes, yielded a law equivalent to Zipf's, in which the most used word was used twice as often as the second, and so on.
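To give an idea of how such a check can be automated, the sketch below estimates the exponent of a rank-frequency list by fitting a straight line to the logarithm of the counts against the logarithm of the ranks. This is a simple illustrative approach, not the method used by the CRM researchers, and the function name `estimate_exponent` is hypothetical.

```python
import math


def estimate_exponent(counts):
    """Estimate a in count ~ C / rank**a by least squares on log(count) vs log(rank).

    `counts` must be positive and sorted from most to least frequent.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope  # on a log-log plot the line has slope -a


# Synthetic counts that follow Zipf's law exactly, used as a sanity check.
ideal = [1200 / rank for rank in range(1, 51)]
print(round(estimate_exponent(ideal), 3))  # prints 1.0
```

On real texts the estimate will usually differ from 1, and a plain log-log regression is known to be a rough tool for this kind of data, so the result is best read as an indication rather than a precise measurement.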

Zipf's law in other contexts

Although Zipf's law was originally used to explain the frequency of the words used in each language, comparing their rank of appearance with their actual frequency in texts and conversations, it has also been extrapolated to other situations.

A rather striking case is the number of people living in US cities. According to Zipf's law, the most populous American city should be twice the size of the second most populous, and three times the size of the third.

If you look at the 2010 population census, this roughly holds. New York had a total population of 8,175,133 people; the next most populous city was Los Angeles, with 3,792,621, followed in the ranking by Chicago, Houston and Philadelphia, with 2,695,598, 2,100,263 and 1,526,006, respectively.
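The comparison can be made explicit with a few lines of Python. The sketch below takes the census figures quoted above and, for each city, prints the population that Zipf's law with exponent 1 would predict from the largest city, together with the ratio between the largest city and that city. The variable names are illustrative.

```python
# 2010 census figures as quoted in the text.
populations = {
    "New York": 8_175_133,
    "Los Angeles": 3_792_621,
    "Chicago": 2_695_598,
    "Houston": 2_100_263,
    "Philadelphia": 1_526_006,
}

ranked = sorted(populations.items(), key=lambda item: item[1], reverse=True)
largest = ranked[0][1]

for rank, (city, observed) in enumerate(ranked, start=1):
    predicted = largest / rank   # Zipf prediction with exponent 1
    ratio = largest / observed   # ideally equal to the rank
    print(f"{rank}. {city}: observed {observed:,}, "
          f"predicted {predicted:,.0f}, ratio {ratio:.2f}")
```

The ratios come out at roughly 2.2, 3.0, 3.9 and 5.4 for ranks 2 to 5, close to the ideal 2, 3, 4 and 5 but not identical.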

This can also be seen in the most populated cities in Spain. Although Zipf's law is not fully satisfied, the populations do correspond, to a greater or lesser extent, to the rank each city occupies. Madrid, with a population of 3,266,126, has twice that of Barcelona, with 1,636,762, while Valencia, with about 800,000 inhabitants, has roughly a quarter, somewhat less than the third the law would predict.

Another observable case of Zipf's law is web pages. Cyberspace is vast, with nearly 15 billion web pages created. Given that there are about 6.8 billion people in the world, if visits were spread evenly there would be about two web pages for each person to visit every day, which is not the case.

The ten most visited pages at present are: Google (60.49 million monthly visits), YouTube (24.31 million), Facebook (19.98 million), Baidu (9.77 million), Wikipedia (4.69 million), Twitter (3.92 million), Yahoo (3.74 million), Pornhub (3.36 million), Instagram (3.21 million) and Xvideos (3.19 million). Looking at these numbers, you can see that Google is visited more than twice as often as YouTube, three times as often as Facebook, more than four times as often as Baidu...

Bibliographic references:

  • Font-Clos, F., Boleda, G. and Corral, Á. (2013). A scaling law beyond Zipf's law and its relation to Heaps' law. New Journal of Physics, 15: 093033. doi.org/10.1088/1367-2630/15/9/093033.
  • Montemurro, M. A. (2001). Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications, 300: 567-578.
