OpenText Data Digest, Dec 18: The Data Awakens

As the world enjoys the latest instalment of the Star Wars series, we review interesting visualizations based on the movie series. Strong is the data behind the Force. Enjoy!

It is a period of data confusion. Rebel businesses, striking from a hidden NoSQL base, have assembled their first embedded application against the evil Big Data. During the battle, data scientists managed to steal secret plans to the Empire’s ultimate weapon, the SPREADSHEET, a mass of rows and columns with enough frustration and lack of scale that it could crash an entire business plan.

While that may not be the plot of the new Star Wars film (or any for that matter), the scenario may evoke a few cheers for the noble data scientists tasked with creating dashboards and visualizations to battle the dark side of Big Data.

Find out more on how to battle your own Big Data problem with Analytics


The Force is Strong with This One

light v dark side

Image source: Bloomberg Business

At its core, the Star Wars movie franchise is about the battle between the light and dark sides of the Force. But how much time do the films spend exploring that mystical power that surrounds us and penetrates us, and binds the galaxy together? Amazingly, a mere 34 minutes out of the total 805 minutes amassed in the first six films.

The screen above is one of five outstanding visualizations of the use of the Force created by a team of data reporters and visualization designers at Bloomberg Business. Creators Dashiell Bennett (@dashbot), Tait Foster (@taitfoster), Mark Glassman (@markglassman), Chandra Illick (@chandraelise), Chloe Whiteaker (@ChloeWhiteaker), and Jeremy Scott Diamond (@_jsdiamond) really draw you in. They break down not only the time spent talking about the Force but also which character uses the Force the most and which types of Force abilities are used. The team watched each movie, compiled the data by hand, and entered it into a spreadsheet. Where there were discrepancies, the team used the novelizations and screenplays of the films as references.

While the project is engaging, it also digs deep, offering secondary layers of data such as the number of times Obi-Wan Kenobi uses the Jedi Mind Trick versus Luke Skywalker or Qui-Gon Jinn.
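To put the 34-minute figure in proportion, here is a quick back-of-the-envelope calculation in Python. It uses only the two numbers quoted above; the variable names are our own, and the per-film average is a simple division, not a figure from the Bloomberg project.

```python
# Figures quoted in the article: minutes the first six films spend
# on the Force, and their combined running time.
FORCE_MINUTES = 34
TOTAL_MINUTES = 805
FILMS = 6

share = FORCE_MINUTES / TOTAL_MINUTES   # fraction of total running time
per_film = FORCE_MINUTES / FILMS        # simple average, minutes per film

print(f"Share of running time: {share:.1%}")
print(f"Average per film: {per_film:.1f} minutes")
```

In other words, the Force gets roughly 4 percent of the running time, or a little under six minutes per film on average.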

Great Shot, Kid, That Was One in a Million

Star Wars - word mapping

Image source: Gaston Sanchez 

Sometimes the technologies behind visualizations need to be acknowledged. Our second entry is an arc diagram built with the R language. The Star Wars tie-in here is a statistical text analysis of the scripts from the original trilogy (Episodes IV, V, and VI) using arc-diagram representations.

Arc diagrams are often used to visualize repetition patterns. The thickness of each arc can represent how frequently a source is linked to a target (the endpoints are usually called “nodes”). Arc diagrams are not used often, because readers may struggle to see how the different nodes relate to one another. However, they are great for showing relationships where time or numerical values aren’t involved. Here, the chart shows which characters speak to each other most often, and the words they use most. (No surprise, “sir” and “master” are C-3PO’s most common utterances, while Han Solo says “hey,” “kid,” and “get going” a lot.)

Gaston Sanchez, a data scientist and lecturer with the University of California, Berkeley and Berkeley City College, came up with this arc diagram as part of a lecture he was giving on the use of arc diagrams with R. Sanchez showed how to use R’s “tm” and “igraph” packages to extract text from the scripts and compute adjacency matrices.
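For readers curious what "computing an adjacency matrix" from dialogue might look like, here is a minimal sketch in Python using only the standard library. It is not Sanchez's R code and does not use “tm” or “igraph”. The dialogue lines are invented for illustration, the function names are our own, and treating two consecutive lines by different speakers as one exchange is a simplifying assumption.

```python
from collections import Counter, defaultdict

# Toy (speaker, line) pairs invented for illustration; NOT quoted from the scripts.
DIALOGUE = [
    ("HAN", "hey kid"),
    ("LUKE", "what"),
    ("HAN", "get going kid"),
    ("C-3PO", "sir sir"),
    ("LUKE", "yes"),
]


def top_words(dialogue, n=2):
    """Return the n most frequent words spoken by each character."""
    counts = defaultdict(Counter)
    for speaker, line in dialogue:
        counts[speaker].update(line.lower().split())
    return {s: [w for w, _ in c.most_common(n)] for s, c in counts.items()}


def adjacency_matrix(dialogue):
    """Count exchanges between characters.

    Assumption: two consecutive lines by different speakers count as one
    exchange between them. Returns the sorted speaker names and a symmetric
    matrix whose cell [i][j] holds the number of exchanges.
    """
    speakers = sorted({s for s, _ in dialogue})
    index = {s: i for i, s in enumerate(speakers)}
    matrix = [[0] * len(speakers) for _ in speakers]
    for (a, _), (b, _) in zip(dialogue, dialogue[1:]):
        if a != b:
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return speakers, matrix


speakers, matrix = adjacency_matrix(DIALOGUE)
for name, row in zip(speakers, matrix):
    print(f"{name:6} {row}")
print(top_words(DIALOGUE, n=1))
```

Each nonzero cell in the matrix would become one arc in the diagram, with the count setting the arc's thickness, and the word counts would supply the labels for each character.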

R has become embedded in the corporate world. R is an implementation of the S programming language, which was developed at Bell Labs in the 1970s; R itself dates from the early 1990s. The language has been compared to Python as a way to dive into data analysis or apply statistical techniques. While R has typically been used by academics and researchers, more businesses are embracing it because it is seen as well suited to user-friendly analysis and graphical modeling.

This is the Data You are Looking For

Star Wars days at box office

Image source: Eshan Wickrema and Lachlan James

While “Star Wars: The Force Awakens” is expected to break box office records, it faces strong challengers to rank as one of the highest-grossing films of all time. According to stats from Box Office Mojo and Variety, the 1999 release of “Star Wars Episode I: The Phantom Menace” ranks number 20 on that list. When adjusted for inflation, the 1977 release of “Star Wars” is ranked third on the all-time list, behind “Gone with the Wind” and “Avatar.”

Looking at the first 100 days of release is one key to understanding the return on investment for a given film. Writers Eshan Wickrema and Lachlan James compared the stats of the first six Star Wars films against each other. What’s significant is that each film made more in revenue than its predecessor, with the prequel films making nearly twice as much as “Return of the Jedi,” the most popular of the original trilogy.

We share our favorite data-driven observations and visualizations every week here. What topics would you like to read about? Please leave suggestions and questions in the comment area below.