A “wordy” internet research method

Researchers at the University of Konstanz develop guidelines for studies that utilize the Google Books Ngram Viewer

The Google Books Ngram Viewer is an online search engine that provides data on the frequency of words in a corpus of more than eight million books. In a 2011 publication, the search engine’s developers presented the idea that their tool could be used to track cultural change. They illustrated, for example, that responses to the Tian'anmen Square student protests in Beijing were largely absent in the Chinese corpus after 1989, whereas the English literature covered these incidents extensively. While others have confirmed this particular finding, general doubts have emerged in regard to the accuracy of the Google Ngram Viewer results. Professor Ulf-Dietrich Reips and his colleague Dr Nadja Younes now provide guidelines for improving the reliability of Google Ngram studies in an article published in the online journal PLOS ONE. To test and exemplify their proposed guidelines, they examined the occurrence of words with a religious connotation in the years from 1900 to 2000.

The eight million books that make up the corpus of the Google Ngram tool include British and American English literature, books in various European languages such as German, French, Italian and Spanish, as well as Chinese, Hebrew and Russian literature. The scanned literature dates from the period between 1500 and 2008 and was provided primarily by university libraries. These digital corpora are increasingly being used, for example, to illustrate how values are changing in specific cultures by investigating the use of words that promote either individualistic or collectivistic ideas. Consistent with current theories, most research indicates that cultures are becoming more individualistic, moving away from collectivist patterns of thought and behaviour. A search of the German corpus from 1800 to 2000 for collectivistic words such as “Gehorsam” (obedience), “Güte” (benevolence) and “Verehrung” (worship), for example, yields a steeply descending curve starting halfway through the 19th century, with one notable exception: the period around and during the Second World War.

In a previous study, Ulf-Dietrich Reips, whose fields of work include Psychological Methods and Assessment as well as Internet-based research, and his colleague Nadja Younes predicted and confirmed the reversal of this development around the time of the Second World War. In their latest publication, the research team explains how to increase the reliability of Google Ngram studies. “We propose that researchers use a combination of different methodological procedures in order to attain the most reliable results,” says Nadja Younes.

One suggestion advanced by the researchers from Konstanz is to compare the occurrence of corresponding words in the various language corpora. The Google Ngram Viewer shows that there was a clear increase in the frequency of religious words in the German corpus during the Second World War, contrary to the general downward trend. This finding is in line with the current theory that in times of crisis, such as wars, religion becomes more important. The researchers discovered a similar increase in the Italian corpus. The English-language literature, on the other hand, does not show such an increase in religious words during the Second World War, thus matching the overall decrease of such terms. “Our explanation for this is that the majority of English-speaking populations were not directly affected by armed conflict”, says Nadja Younes.
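The cross-corpus comparison described above can be illustrated with a minimal Python sketch. It is not the procedure from the article: the function `window_ratio`, the corpus names and all frequency values are hypothetical placeholders chosen only to show the idea of checking whether a period stands out against the years around it.

```python
from statistics import mean

def window_ratio(series, start, end):
    """Mean frequency inside [start, end] divided by the mean outside it.

    `series` maps a year to a relative word frequency. A ratio well above
    1.0 suggests the window stands out from the surrounding years.
    """
    inside = [v for year, v in series.items() if start <= year <= end]
    outside = [v for year, v in series.items() if not (start <= year <= end)]
    return mean(inside) / mean(outside)

# Invented placeholder values, NOT real Ngram data.
corpora = {
    # Hypothetical corpus with a bump during the window
    "corpus_a": {1930: 10.0, 1935: 9.0, 1940: 12.0, 1945: 13.0, 1950: 8.0, 1955: 7.0},
    # Hypothetical corpus that simply keeps declining
    "corpus_b": {1930: 10.0, 1935: 9.0, 1940: 8.0, 1945: 7.0, 1950: 6.0, 1955: 5.0},
}

ratios = {name: window_ratio(s, 1939, 1945) for name, s in corpora.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

In this toy data, the first corpus shows a ratio clearly above 1, while the steadily declining one stays at 1. As the article stresses, such a number is only a starting point and has to be interpreted against the historical context.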

Taking historical events such as the Second World War into account is crucial for the interpretation of the results. However, phenomena such as propaganda and censorship must also be taken into consideration: the Russian corpus, for example, shows a clear downturn in the frequency of words with a religious connotation during the time of the Soviet regime.

After the end of the Soviet Union, the curve begins to rise again. Does this mean that the Russian population was less religious during the Soviet era? Nadja Younes warns that only looking at the frequency of words can lead to flawed conclusions: “Research suggests that people living in the USSR secretly lived out their religion due to religious persecution”.

In addition to comparing the corpora available in different languages, the researchers also recommend procedures such as cross-checking corpora in the same language, although currently only the English texts are sub-divided into a general and an English fiction corpus. To provide additional certainty in the interpretation of results, the article also proposes using word inflections, synonyms and a standardization procedure specially developed by the researchers that accounts for both the influx of data and a disproportionate influence of single words. Nadja Younes is convinced: “Google Ngram provides great potential for research purposes. It was previously not possible to investigate changing values in different cultures so efficiently using such a large number of cultural products”.
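Why a standardization step matters when several words are combined can be shown with a short Python sketch. It uses plain z-scores, a generic technique swapped in here for illustration; it should not be read as the researchers' own procedure, whose details are given in the article. The function names and the frequency values are hypothetical.

```python
from statistics import mean, pstdev

def zscore(series):
    """Rescale one word's time series to mean 0 and standard deviation 1."""
    m = mean(series)
    s = pstdev(series)
    if s == 0:
        # A constant series carries no trend information.
        return [0.0 for _ in series]
    return [(x - m) / s for x in series]

def composite_index(freqs_by_word):
    """Average the standardized series so that every word weighs equally."""
    standardized = [zscore(s) for s in freqs_by_word.values()]
    return [mean(values) for values in zip(*standardized)]

# Invented placeholder values, NOT real Ngram data: one frequent word that
# declines and one rare word that rises by the same relative pattern.
example = {
    "common_word": [100.0, 80.0, 60.0, 40.0],
    "rare_word": [1.0, 2.0, 3.0, 4.0],
}

raw_mean = [mean(values) for values in zip(*example.values())]
index = composite_index(example)

print("raw mean:", raw_mean)
print("standardized index:", [round(x, 3) for x in index])
```

The raw average simply follows the frequent word and falls, whereas the standardized index stays flat because the two opposite trends cancel each other out. This is the kind of disproportionate influence of a single word that a standardization procedure is meant to prevent.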

Facts:

  • Publication: Nadja Younes, Ulf-Dietrich Reips (2019) Guideline for improving the reliability of Google Ngram studies: Evidence from religious terms. PLOS ONE 14(3):e0213554. DOI: https://doi.org/10.1371/journal.pone.0213554
  • Using the example of religion, researchers at the University of Konstanz develop guidelines that increase the reliability of studies that utilize the Google Books Ngram Viewer search engine.
  • Google Books Ngram Viewer has access to more than eight million books in eight languages.
  • Website of the working group: http://iscience.uni.kn