How Visualization is Transforming Exploratory Data Analysis
Data analysts are dealing with bigger datasets than ever before, making interrogation difficult. Visualized Exploratory Data Analysis, supported by advanced parallel computing, promises an answer.
By Todd Mostak, CEO at OmniSci
Image by Yvette W from Pixabay
As humankind begins its coexistence with artificial intelligence, it is worth noting that one of the most important methods of advanced learning, at least in the data realm, dates back more than half a century.
Exploratory Data Analysis (EDA), a term coined by renowned statistician John Tukey, is a technique for initially understanding and developing a view of a particular set of data before deep inquiry starts. In EDA, statistical techniques are used to describe the characteristics of the data in order to generate initial hypotheses.
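To make the idea of describing a dataset's characteristics concrete, here is a minimal sketch using only the Python standard library. It computes a five-number summary and flags values outside the usual 1.5 × IQR fences. The `readings` data and the function names are invented for illustration, and this is just one of many descriptive techniques an analyst might reach for first.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a non-empty list of numbers."""
    ordered = sorted(values)
    # n=4 yields the three quartile cut points; "inclusive" treats the
    # data as the whole population rather than a sample of it.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]

# Hypothetical sensor readings with one suspicious value.
readings = [12, 14, 15, 15, 16, 18, 19, 21, 22, 95]

summary = five_number_summary(readings)
outliers = iqr_outliers(readings)
print("summary:", summary)
print("outliers:", outliers)
```

A summary like this does not answer a question so much as suggest one: why is a single reading so far from the rest?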
In the age of big data, when datasets routinely grow to petabytes in size, EDA is more important than ever. Most repositories of information today are simply too large, complex and diverse to explore by rote numerical analysis.
Moreover, with so much of the world’s data now geospatial (location-based) in nature, analysts are faced with additional challenges. These datasets can include changes over time (spatiotemporal data); the ability to characterize and quickly gain perspective on such multi-dimensional data is an essential step, particularly in critical situations such as disaster response.
Performant, intuitive visual analytics capabilities will be vital to EDA’s success amid this rising tide of information. Humans are predisposed to understand complex information in visual terms. Maps are a perfect example; the shapes, colors, lines, reference points and comparative scale are all ways in which humans take in meaning from maps quickly.
Visual presentation is where spreadsheets fall short. Performing EDA using spreadsheet software involves time-consuming formulas and filters. Spreadsheets lack the speed, in both computation and communication, to enable analysts to quickly identify relationships and generate ideas.
Visualization at scale also allows analysts to work at the speed of their natural curiosity. In most professional scenarios, data is no longer about generating a report; it’s about learning. Analysts need tools that help determine what questions should be asked—and based on the answers, uncover what the next questions should be. This natural form of inquiry, made possible by high performance data analytics, will not only bring incompatible systems together, but also stimulate insight and understanding.
Visual EDA in practice
New use cases for visualized EDA are emerging almost daily. In defense/intelligence operations, visually driven EDA can combine Intelligence, Surveillance and Reconnaissance (ISR) data with IoT sensor information, signals intelligence, cyber, logistics and even social media. Using the latest analytics software that leverages the parallel processing capabilities of GPUs and CPUs, analysts can visualize billions of records in milliseconds. This enables them to track the movement of vehicles, ships, and clusters of people to establish movement patterns over time. During the COVID-19 pandemic, this has allowed researchers to find emerging hot spots and improve directives for minimizing viral transmission.
Telecommunications is another area to benefit from early-stage visual data exploration. Network teams find immense value in first combining various Operations Support System (OSS) datasets, and then performing EDA to not only answer known questions, but also to generate new insights and opportunities. It assists in optimizing network performance, reducing customer churn, and improving customer satisfaction; on the product side, visual EDA supports the smooth rollout of new services and software system updates.
Autonomous vehicle technology will soon transform the world in untold ways. Using visually-enhanced EDA, automakers have the unique opportunity to quickly explore vehicle usage patterns after purchase. OEMs and service providers can more easily scour real-life data to better anticipate customer wants and needs, leading to more advanced products and services.
An invaluable tool
For any person or enterprise that needs to explore huge amounts of data, the ability to do so interactively and visually makes it easy for a subject matter expert to find trends, anomalies and meaning in a vast swath of data. By combining the right domain knowledge and intuition with the unparalleled pattern-matching capabilities of the human visual cortex, analysts can uncover insights that might go unnoticed with more traditional ad-hoc querying or static reporting approaches. As such, visually supercharged, massive-scale EDA forms the backbone of a broader analytics and data science pipeline, including the application of gateway technologies such as AI and machine learning.
Next-generation data analytics platforms are also beginning to heavily leverage visualization as a means of driving fast querying and interrogation of data. Rather than providing an analyst with static reports, these systems allow users to interactively click and brush on the data to cross-filter and drill down into it, creating a tight, iterative feedback loop from hypothesis to question to answer and back again.
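The cross-filtering idea can be sketched in a few lines of plain Python. This is an illustrative toy, not how any particular platform implements it: the `records`, the dimension names and the `cross_filter` function are all hypothetical. The key behavior is that a selection made on one chart filters every other chart, but not the chart it was made on.

```python
from collections import Counter

# Hypothetical event records; each key is a dimension backing one chart.
records = [
    {"region": "north", "device": "phone"},
    {"region": "north", "device": "laptop"},
    {"region": "south", "device": "phone"},
    {"region": "south", "device": "phone"},
    {"region": "east", "device": "laptop"},
    {"region": "north", "device": "phone"},
]

def cross_filter(records, selections, dimension):
    """Count the values of `dimension` among records that pass the
    selections made on every OTHER dimension."""
    counts = Counter()
    for rec in records:
        passes = all(
            rec[dim] in allowed
            for dim, allowed in selections.items()
            if dim != dimension  # a chart ignores its own brush
        )
        if passes:
            counts[rec[dimension]] += 1
    return counts

# The analyst brushes "north" on the region chart...
selections = {"region": {"north"}}
device_chart = cross_filter(records, selections, "device")
region_chart = cross_filter(records, selections, "region")

# ...then drills down by also clicking "phone" on the device chart.
selections["device"] = {"phone"}
region_after_drill = cross_filter(records, selections, "region")

print(device_chart, region_chart, region_after_drill)
```

Each click changes `selections` and every chart is recomputed, which is the feedback loop in miniature. At billions of rows, that recomputation is the part that depends on parallel hardware.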
The biggest pitfall for EDA is the limitation of the data. Blind spots, particularly missing or incomplete datasets, can lead companies to incorrect conclusions. For any project, the team needs to consider the possible limitations and outside variables. They must also heed the age-old warning that correlation does not equal causation. However, by combining visual analytics capabilities with more formal statistical testing and machine learning techniques (sometimes in the same platform), the risk of seeing patterns that are not really there can be significantly mitigated.
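As one example of pairing a visual hunch with a formal check, the sketch below runs a permutation test on a correlation an analyst might have spotted in a scatter plot. The choice of test, the data and the names `tower_load` and `dropped_calls` are assumptions made for illustration. Note that a small p-value only says the association is unlikely to be chance; it says nothing about causation.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate how often shuffled data shows a correlation at least
    as strong (in absolute value) as the observed one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: network load versus dropped calls per tower.
tower_load = [10, 20, 30, 40, 50, 60, 70, 80]
dropped_calls = [3, 5, 8, 9, 13, 14, 18, 19]

r = pearson_r(tower_load, dropped_calls)
p = permutation_p_value(tower_load, dropped_calls)
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```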
Visual EDA will be essential for taming the challenges of big data. EDA, supported and enhanced with accelerated analytics technologies, will provide business, academia, science, engineering, public safety and other sectors with unprecedented capabilities to distill answers from previously intractable datasets. It will ensure, in conjunction with the deployment of AI and machine learning methodologies, that the immense amount of data now collected can be leveraged towards making better and faster decisions.
Bio: Todd Mostak (@toddmostak) is the founder and CEO of OmniSci, the pioneer in accelerated analytics that enables businesses to uncover important insights. Mr. Mostak conceived the idea of a GPU-accelerated analytics platform while conducting graduate research at Harvard, after tiring of waiting hours or days for traditional CPU-based platforms to run analytic workflows. He later joined MIT’s CSAIL as a research fellow before founding OmniSci.