Visualizing Unstructured Analysis – Elections, Words, and Zika virus
Unstructured data has proven to be a big analytics challenge. This week in the Data Driven Digest, we’re serving up some ingenious visualizations that make unstructured data talk.
As the 2016 Presidential campaigns finish New Hampshire and move on toward “Super Tuesday” on March 1, the candidates and talking heads are still trading accusations about media bias. That got us thinking about text analysis and ways to visualize unstructured content. (Not that we’re bragging, but TechCrunch thinks we have an interesting way to measure the tenor of coverage on the candidates…)
So this week in the Data Driven Digest, we’re serving up some ingenious visualizations of unstructured data. Enjoy!
Unstructured Data Visualization in Action
We’ve been busy with our own visualization of unstructured data: namely, all the media coverage of the 2016 Presidential race. Just in time for the first-in-the-nation Iowa caucuses, OpenText released Election Tracker ’16, an online tool that lets you monitor, compare, and analyze news coverage of all the candidates. Drawing on OpenText Release 16 (Content Suite and Analytics Suite), Election Tracker ’16 automatically scans and reads hundreds of major online media publications around the world.
This data is analyzed daily to determine sentiment and extract additional information, such as people, places, and topics. It is then translated into visual summaries and embedded into the election app where it can be accessed using interactive dashboards and reports.
This kind of content analysis can reveal much more than traditional polling data: holistic insights into candidates’ approaches and whether their campaign messages are attracting coverage. And although digesting the daily coverage has long been a part of any politician’s day, OpenText Release 16 can do what no human can do: read, analyze, process, and visualize a billion words a day.
Word Crunching 9 Billion Tweets
While we’re tracking language, forensic linguist Jack Grieve of Aston University in Birmingham, England, has come up with an “on fleek” (perfect, on point) way to pinpoint how new slang words enter the language: Twitter.
Grieve studied a dataset of Tweets from 2013–14 by 7 million users all over America, containing nearly 9 billion words (collected by geography professor Diansheng Guo of the University of South Carolina). After eliminating all the regular, boring words found in the dictionary (so that he’d only be seeing “new” words), Grieve sorted all the remaining words by county, filtered out the rare outliers and obvious mistakes, and looked for the terms that showed the fastest rise in popularity, week over week.
These popular newcomers included “baeless” (single/a seemingly perpetual state), “famo” (family and friends), TFW (“that feeling when…”, e.g. TFW a much younger friend has to define the term for you; that would be chagrin), and “rekt” (short for wrecked or destroyed, not “rectitude”).
As described in the online magazine Quartz, Grieve found that some new words are popularized by social media microstars or are native to the Internet, like “faved” (to “favorite” a Tweet) or “amirite” (an intentional misspelling of “Am I right?” mocking the assumption that your audience agrees with a given point of view).
Grieve’s larger points include the insights you can get from crunching Big Data (9 billion Twitter words!), and social media’s ability to capture language as it’s actually used in real time. “If you’re talking about everyday spoken language, Twitter is going to be closer than a news interview or university lecture,” he told Quartz.
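For readers who want to try the idea themselves, the filtering steps described above (drop dictionary words, discard rare outliers, rank what remains by its week-over-week rise) can be sketched in a few lines of Python. This is only an illustrative sketch, not Grieve’s actual method: the function name `rising_words`, the `min_total` threshold, the "largest jump between consecutive weeks" score, and the toy data are all assumptions made for the example, and the per-county breakdown is left out to keep it short.

```python
from collections import Counter, defaultdict


def rising_words(tweets, dictionary, min_total=3):
    """Rank non-dictionary words by their largest week-over-week jump in count.

    tweets: iterable of (week_number, text) pairs (hypothetical input shape)
    dictionary: set of known lowercase words to exclude
    min_total: drop words seen fewer than this many times overall
    """
    weekly = defaultdict(Counter)  # word -> Counter mapping week -> count
    for week, text in tweets:
        for token in text.lower().split():
            word = token.strip(".,!?\"'()#@:;")
            # Keep only alphabetic tokens that are not in the dictionary.
            if word.isalpha() and word not in dictionary:
                weekly[word][week] += 1

    scores = {}
    for word, counts in weekly.items():
        if sum(counts.values()) < min_total:
            continue  # treat as a rare outlier or probable typo
        weeks = range(min(counts), max(counts) + 1)
        series = [counts[w] for w in weeks]  # missing weeks count as 0
        # Simplified score: largest increase between consecutive weeks.
        scores[word] = max((b - a for a, b in zip(series, series[1:])), default=0)

    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for illustration only.
sample = [
    (1, "my famo is here"),
    (2, "famo night with the famo"),
    (3, "famo famo famo famo"),
    (1, "so rekt"),
    (2, "rekt again"),
    (3, "rekt"),
    (3, "teh typo"),
]
known = {"my", "is", "here", "night", "with", "the", "so", "again", "typo"}

print(rising_words(sample, known))  # [('famo', 2), ('rekt', 0)]
```

In the toy data, “famo” climbs from one mention to two to four, so it tops the list, while the one-off misspelling “teh” is dropped by the rarity filter. A real analysis at the scale of 9 billion words would need a proper tokenizer, a full dictionary, and a more robust measure of growth.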
On a more serious subject, unstructured data in the form of news coverage helps track outbreaks of infectious diseases such as the Zika virus.
HealthMap.org is a site (and mobile app) created by a team of medical researchers and software developers at Boston Children’s Hospital. They use “online informal sources” to track emerging diseases including flu, the dengue virus, and Zika. Their tracker automatically pulls from a wide range of intelligence sources in nine languages (including Chinese and Spanish): online news stories, eyewitness accounts, official reports, and expert discussions about dangerous infectious diseases.
Drawing from unstructured data is what differentiates HealthMap.org from other infectious disease trackers, such as the federal Centers for Disease Control and Prevention’s weekly FluView report.
The CDC’s FluView provides an admirable range of data, broken out by patients’ age, region, flu strain, comparisons with previous flu seasons, and more.
The only problem is that the CDC bases its reports on flu cases reported by hospitals and public health clinics in the U.S. This means the data is both delayed and incomplete (for example, it doesn’t include flu victims who never saw a doctor, or cases not reported to the CDC), limiting its predictive value.
By contrast, the HealthMap approach captures a much broader range of data sources. So its reports convey a fuller picture of disease outbreaks, in near-real time, giving doctors and public-health planners (or nervous travelers) better insights into how Zika is likely to spread. This kind of data visualization is just what the doctor ordered.