WordSwarm – Visualizing Word Trends in Periodicals
Word clouds provide an intuitive way to visualize word frequency in corpora and are easy to generate. WordSwarm is a new free tool for animating word clouds to show how buzz-words ebb and flow in chronologically ordered text such as journals, blogs, and even Google n-grams.
Written documents are an intuitive and easy way for humans to store, share, and interpret data, except when the corpora consist of the 30 million scanned books in the Google Books library, or even the 400,000+ pages in long-established journals like Science. In these cases we turn to data scientists and their natural language processing tools. The Google Books Ngram Viewer takes a query of words and generates a graph of their popularity in the Google Books library from 1500 to 2008; however, it is only capable of searching up to twelve words, and even then the graph can be difficult to interpret. Other tools like word clouds, word nets, word trees, and stream graphs can also provide insights into corpora, but they are limited by query size or by their inability to show how word relevance changes over time. With these limitations in mind, and with the desire to identify previously unknown ‘buzz-words’ without an a priori list, I developed the WordSwarm program.
WordSwarm generates dynamic word clouds in which the word size changes as the animation moves forward through the corpus. The top words from the preprocessing are colored randomly or from an assigned palette, sized according to their frequency at the first date, and then displayed in a pseudo-random location on the screen. The animation progresses into the future by growing or shrinking each word according to its frequency in the corpus at the next date. Collision detection is handled by a 2D physics engine, which also applies a ‘gravitational force’ to each word, bringing the larger words closer to the center of the screen.
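The steps described above can be illustrated with a minimal Python sketch. It is not WordSwarm's actual code: the function names, the size range, the linear interpolation between dates, and the spring-like pull toward the center are all assumptions made for illustration, and a real physics engine would also handle collisions between words.

```python
import re
from collections import Counter


def top_word_frequencies(docs_by_date, top_n=3):
    """Return sorted dates and, for the top_n words overall,
    each word's relative frequency at every date.
    (Hypothetical preprocessing step; tokenization is deliberately naive.)"""
    dates = sorted(docs_by_date)
    counts = {
        d: Counter(re.findall(r"[a-z']+", docs_by_date[d].lower()))
        for d in dates
    }
    totals = Counter()
    for c in counts.values():
        totals.update(c)
    top = [word for word, _ in totals.most_common(top_n)]
    freqs = {
        w: [counts[d][w] / max(sum(counts[d].values()), 1) for d in dates]
        for w in top
    }
    return dates, freqs


def interpolate_size(series, t, peak, min_size=10.0, max_size=100.0):
    """Word size at animation time t, where t runs from 0 to len(series) - 1.
    Frequencies are linearly interpolated between adjacent dates (an assumption)
    and scaled so that the peak frequency maps to max_size."""
    if len(series) == 1:
        value = series[0]
    else:
        i = min(int(t), len(series) - 2)
        frac = t - i
        value = series[i] * (1 - frac) + series[i + 1] * frac
    return min_size + (max_size - min_size) * value / peak


def gravity_force(pos, center, size, strength=1.0):
    """Assumed spring-like pull toward the screen center, scaled by word size,
    so that larger words are drawn to the middle more strongly."""
    dx, dy = center[0] - pos[0], center[1] - pos[1]
    return (strength * size * dx, strength * size * dy)


if __name__ == "__main__":
    docs = {2000: "gene gene rat", 2001: "gene protein protein protein"}
    dates, freqs = top_word_frequencies(docs, top_n=2)
    peak = max(max(series) for series in freqs.values())
    for step in range(5):
        t = step / 4 * (len(dates) - 1)
        sizes = {w: round(interpolate_size(s, t, peak), 1) for w, s in freqs.items()}
        print(t, sizes)
```

In this toy corpus the word "protein" grows from the minimum size to the maximum size as the animation moves from 2000 to 2001, while "gene" shrinks. Feeding the per-frame sizes and forces into a physics engine is left out of the sketch.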
WordSwarms have been used to show changes in the focus of scientific research as inferred from Science Magazine, the popularity of U.S. Presidents as inferred from their frequency in books found in the Google Books library, and the popularity of U.S. baby names from records of babies born each year. The Science Magazine WordSwarm demonstrates the tool’s power to yield insight: the 1980s maintain a biological focus that begins with animal testing in rats and shifts to the basic science of understanding genes, which remains the focus through the 1990s. The new millennium brings a more advanced understanding of proteins and genomes, but also non-health issues like energy and atmospheric carbon. Finally, the current decade brings forward new knowledge in quantum theory and methods for controlling systems. These few examples only begin to showcase a WordSwarm’s ability to easily display chronological trends in diverse corpora.
Try creating your own WordSwarms by downloading the open-source program at www.WordSwarm.com, where you will also find a tutorial on how to quickly get started. In an age where time and relevance are at a premium, dynamic visual presentation of trends through a free tool like WordSwarm can fuel new insights for data scientists and the general public alike.
Dr. Michael Kane is currently serving as a fellow at the U.S. Department of Energy’s Advanced Research Projects Agency – Energy (ARPA-E). His focus is on technologies for controlling, monitoring, and managing infrastructure systems in order to improve energy efficiency and production.