Wikipedia Mining reveals hidden Revolution of Human Priorities

Wikipedia data mining may reveal changes over time in the human perception of the world, and may also serve as an independent reliable quantitative method of investigation of historical events.

Guest blog by Vladimir Shatalov.

Wikipedia can, without exaggeration, be considered a “contemporary Bible”. Unlike the Bible, however, it is written simultaneously and independently by many people. No one controls this process, and some of the knowledge and facts imprinted in Wikipedia are objective evidence of human history. The opinions of page authors have no effect on the birth and death dates of famous people. The number of cross-citations between Wikipedia pages acts as an independent “expert” assessment of the significance of the cited person. These data therefore form a reliable statistical dataset for quantitative studies of history.

A few highlights:

The most cited person of each century:

1st century – Jesus,
16th century – William Shakespeare,
18th century – George Washington,
19th century – Abraham Lincoln,
20th century – George W. Bush,
21st century – Roger Federer.

The dates of some historical cataclysms can be detected in peculiarities of the lifespan distribution. For example, the dips in lifespan (Fig. 1, top) around 1915 and 1945 are evidently caused by World War I, the 1918 flu pandemic, and World War II.

The less intense dip at 1865 possibly corresponds to the American Civil War (but also to the Austro-Prussian and Franco-Prussian wars around the same time), and the weak but statistically significant dip at 1969 to the Vietnam War.

We guess that the less pronounced features might also be related to wars: the Napoleonic Wars at 1810, the French Revolution at 1795, and so on. The top (lifespan) and bottom (standard error of the mean value) parts of Fig. 1 demonstrate strong anti-correlation.

To study this anti-correlation, mortality rates were calculated for the two war periods and for the two relatively peaceful periods preceding them. The wartime histograms appear significantly distorted and about twice as wide as the “peaceful” ones. Broadening of the distribution increases the standard deviation, which shows up as a peak in the statistical error.

Fig. 1. Lifespan (top) and standard error of the mean value (bottom) in years.
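
The link between a wider lifespan distribution and a larger statistical error can be illustrated with a minimal sketch. It assumes each biography is reduced to a (birth year, death year) pair and that lifespans are grouped by year of death; the function name `lifespan_stats` and the toy records are invented for illustration and are not taken from the article's dataset.

```python
import math
from collections import defaultdict

def lifespan_stats(records):
    """Return {death_year: (mean_lifespan, standard_error_of_mean)}.

    `records` is an iterable of (birth_year, death_year) pairs.
    Grouping by death year is an assumption made for this sketch.
    """
    by_year = defaultdict(list)
    for birth, death in records:
        by_year[death].append(death - birth)

    stats = {}
    for year, spans in by_year.items():
        n = len(spans)
        if n < 2:
            continue  # the sample standard deviation needs at least two values
        mean = sum(spans) / n
        variance = sum((s - mean) ** 2 for s in spans) / (n - 1)
        stats[year] = (mean, math.sqrt(variance / n))
    return stats

# Invented toy data: a "peaceful" year with similar lifespans and a
# "wartime" year that mixes young and old deaths.
records = [
    (1830, 1900), (1828, 1900), (1832, 1900), (1826, 1900),
    (1895, 1915), (1845, 1915), (1890, 1915), (1840, 1915),
]
stats = lifespan_stats(records)
for year in sorted(stats):
    mean, sem = stats[year]
    print(year, round(mean, 1), round(sem, 2))
```

In the toy wartime year the mean lifespan drops while the standard error rises, which is the kind of anti-correlation between the two panels of Fig. 1 described above.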

New generations of mass media change the trend in the number of biographies over time. A year of activity was assigned to every biographical page. Fig. 2 presents a histogram of the number of biographical pages versus the year of activity. The shape of the resulting distribution is quite surprising: it has a broken linear trend with a sudden change of slope near 1700 AD. This behavior was associated with the advent of the newspaper era. The hump after 1500 AD was attributed to the advent of book printing, which took place in the 15th century.


Fig. 2. Number of biographical pages per year (solid line) and its trends (dashed lines).
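
One way to locate a change of slope like the one in Fig. 2 is to fit two separate straight lines and scan for the split that minimizes the total squared error. The post does not say how the trends were fitted, so the sketch below is only an illustration of the idea; the helper names `fit_line` and `find_break` and the synthetic counts are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, squared_error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse

def find_break(xs, ys, min_points=3):
    """Try every split into two segments and keep the one with the lowest
    total squared error. Returns (first_x_of_second_segment,
    slope_before, slope_after)."""
    best = None
    for i in range(min_points, len(xs) - min_points + 1):
        left = fit_line(xs[:i], ys[:i])
        right = fit_line(xs[i:], ys[i:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, xs[i], left[0], right[0])
    return best[1], best[2], best[3]

# Synthetic counts (not real Wikipedia data): a gentle slope followed by a
# steep one, with the kink placed between the 1700 and 1710 bins.
years = list(range(1500, 1900, 10))
counts = [
    10 + 0.1 * (y - 1500) if y < 1705 else 30.5 + 1.0 * (y - 1705)
    for y in years
]
break_year, slope_before, slope_after = find_break(years, counts)
print(break_year, round(slope_before, 3), round(slope_after, 3))
```

On real histogram data the counts are noisy, so the minimum is less sharp and the estimated break year should be read as approximate.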

Areas of personal activity (and, accordingly, of human culture) were categorized into just a few broad and easily understandable terms. A keyword classification was used to plot the time dependence of the categories, which reveals an evolution of human priorities. A new index of human priorities, the Personal to Public Ratio, was then introduced. The time dependence of this ratio exhibits a kind of hidden revolution in human priorities: after staying almost constant for centuries, it has increased eightfold in the last decades due to growth in the Sport and Art group.
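
A keyword classification and the resulting ratio might look roughly like the sketch below. The category names other than Sport and Art, the keyword lists, and the assignment of Sport and Art to the "personal" side are all assumptions made for illustration; the post does not give the article's actual keywords or the exact definition of the two groups.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the article's real keywords are not given here.
CATEGORY_KEYWORDS = {
    "Sport": {"footballer", "tennis", "athlete"},
    "Art": {"painter", "actor", "singer"},
    "Politics": {"politician", "president", "king"},
    "Military": {"general", "admiral", "soldier"},
}
# Assumption: Sport and Art count as "personal", everything else as "public".
PERSONAL = {"Sport", "Art"}

def classify(description):
    """Return the first category whose keywords appear as whole words."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return None  # no keyword matched

def personal_to_public_ratio(descriptions):
    """Ratio of personal to public biographies; None if no public ones."""
    counts = Counter(classify(d) for d in descriptions)
    personal = sum(n for c, n in counts.items() if c in PERSONAL)
    public = sum(
        n for c, n in counts.items() if c is not None and c not in PERSONAL
    )
    return personal / public if public else None

# Invented one-line descriptions standing in for biographical pages.
sample = [
    "Swiss tennis player",
    "American politician and president",
    "Dutch painter",
    "French general",
    "German chemist",  # matches no keyword and is ignored
]
ratio = personal_to_public_ratio(sample)
print(ratio)
```

Computing this ratio separately for each period of activity would give a time series of the kind described above.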

So, Wikipedia biographical pages can be used as a unique self-consistent source of information for studies in historical sociology. Wikipedia data mining may reveal changes over time in the human perception of the world, and may also serve as an independent reliable quantitative method of investigation of historical events.

The dataset extracted from Wikipedia and classified in the article may be downloaded as supplementary resources from the ResearchGate account of Prof. Vladimir Shatalov.