Interview: Kirk Borne, Data Scientist, GMU on Big Data in Astrophysics and Correlation vs. Causality

We discuss how to build the best data models, significance of correlation and causality in Predictive Analytics, and impact of Big Data on Astrophysics.

Kirk Borne is a Data Scientist at George Mason University. He has been at Mason since 2003, where he does research, teaches, and advises students in the graduate and undergraduate Data Science, Informatics, and Computational Science programs. He helped to create the Data Science B.S. degree program that began in 2007. Previously, he spent nearly 20 years in positions supporting NASA projects, including an assignment as NASA's Data Archive Project Scientist for the Hubble Space Telescope, and as Project Manager in NASA's Space Science Data Operations Office. He has extensive experience in big data and data science, including expertise in scientific data mining and data systems.

He has published over 200 articles and given over 200 invited talks at conferences and universities worldwide. He serves on several national and international advisory boards and journal editorial boards related to big data. In these roles, he focuses on achieving big discoveries from big data, and he promotes the use of information and data-centric experiences with big data in the STEM education pipeline at all levels. He believes in data literacy for all.

Here is my interview with him:

Anmol Rajpurohit: Q1. In your keynote, you highlighted the benefits of large datasets, how Big Data can be used as a sort-of experimentation bed. Quite often, we have Big Data, but yet not necessarily all the data that we desire (for various reasons, including when the data is hard to find or to quantify). What scientific approaches do you recommend in these situations to benefit from partial, incomplete data?

Kirk Borne: The reality here is that "partial, incomplete data" has been the norm for all of human history, and certainly for the history of science. Consequently, traditional methods of modeling and simulation are useful here -- where you build a model that represents whatever it is you are studying. The model includes parameters for things that you don't know and it includes constraints from the things you do know (i.e., from your partial, incomplete data).

The goal of modeling and simulation is to help identify the "best" parameters for the things that you don't know, and then apply those "best models" to your field of study in order to make inferences, predictions, and decisions.
Of course, a model is not perfect -- it is simply a partial representation of reality, from which you hope to make better discoveries and decisions than you could have made otherwise. As the famous statistician George Box said: "All models are wrong, but some are useful." That's precisely the point. The model is imperfect, but it is still useful. Similarly in the case of partial and incomplete data, our subsequent understanding, models, inferences, and predictions are imperfect, but they are still useful.
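
As a minimal illustration of the modeling idea above (not a method described in the interview), the sketch below fits a straight-line model to a series with gaps. The two parameters stand for the things we don't know, the observed points act as the constraints, and the fitted model is then used to predict the missing values. The function name fit_line and the sample numbers are hypothetical, chosen only for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b using only the observed points.

    Missing observations are marked with None and are simply skipped.
    """
    pts = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two observed points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("observed x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    a = sxy / sxx              # "best" slope given the observed data
    b = mean_y - a * mean_x    # "best" intercept given the observed data
    return a, b


# Hypothetical, made-up measurements; None marks a missing observation.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, None, 5.0, 7.0, None, 11.0]

a, b = fit_line(xs, ys)
# Use the fitted model to fill the gaps (an inference, not a measurement).
predicted = [y if y is not None else a * x + b for x, y in zip(xs, ys)]
```

The filled-in values are only as good as the assumed model: a straight line is a deliberate simplification here, in the spirit of "wrong, but useful".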

AR: Q2. From the perspective of Predictive Analytics, Correlations are a great discovery. But, is it good enough without a proper understanding of underlying Causality?

KB: Finding causality is good science, but in many applications it is more important to make a good decision. For example, in astronomy, scientists discovered in the 1960's that there were energetic bursts of gamma-rays coming from space. We had no idea what the cause was, but we discovered that the spatial distribution of these bursts across the sky correlated eerily well with an isotropic model (that is, the bursts were not coming from any preferred direction or location in the sky). Nevertheless, this correlation led to improved astrophysical theories, new technologically powerful scientific instruments, and further observations for several more decades before the cause was ultimately discovered in the mid-1990's. The cause was found to be from a massive star exploding (and then collapsing into a black hole), which occurs sporadically and randomly throughout the Universe. So, the correlation led to great physical models and fantastic improvements in space astronomy instrumentation, even without understanding (initially) the underlying cause.

Similarly, in online retail stores, businesses discover correlations in customer purchase patterns, thereby enabling and empowering recommender engines to present meaningful product recommendations to their customers. These engines are not only good at recommendations, but they are also very good at generating revenue for the business. There is no hint in these models as to what causes a customer to have a preference for product A and also for product Z, but if the historical purchase data reveal that the products are correlated, then it is simply smart business sense for you to act on that correlation, even without the causal understanding.
So, I would say that we definitely want to understand causality, and we should never give up our search for the underlying causes, but let us not churn on that problem (which can lead to "analysis paralysis"); instead, let us focus on using the discovered correlations in powerful predictive analytics applications.
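
The retail example above can be made concrete with a small sketch. It is an illustration under stated assumptions, not a description of any real recommender engine: the baskets are made-up data, and "lift" (how much more often two products are bought together than independence would predict) is one common co-occurrence measure swapped in here for "correlation". The names pair_lift and recommend are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(item_a, item_b): lift} for every pair bought together.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the two items
    co-occur more often than they would if purchases were independent.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates in a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


def recommend(product, lifts, min_lift=1.0):
    """List products whose lift with `product` exceeds min_lift, best first."""
    partners = []
    for (a, b), lift in lifts.items():
        if lift > min_lift and product in (a, b):
            partners.append((lift, b if a == product else a))
    return [name for _, name in sorted(partners, reverse=True)]


# Hypothetical purchase histories, one list per customer order.
baskets = [
    ["A", "Z"],
    ["A", "Z", "M"],
    ["A", "Z"],
    ["M"],
    ["M", "Q"],
    ["Q"],
]

lifts = pair_lift(baskets)
suggestions = recommend("A", lifts)
```

Nothing in this calculation says why buyers of product A also buy product Z; it only records that they do, which is exactly the kind of correlation a business can act on without a causal explanation.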

AR: Q3. How has Big Data impacted the science of Astrophysics? Can you share some discoveries that were made based on Big Data?

KB: Astronomy data collections have definitely been growing "astronomically" for many years, but the biggest and the best is yet to come, including the LSST (Large Synoptic Survey Telescope) project that will begin construction in the summer of 2014 and the future SKA (Square Kilometer Array). These petascale big data projects promise amazing discoveries. From the terascale projects of the past couple of decades, there have been many important discoveries. For example, from NASA's Kepler mission, we are discovering hundreds of new planets around distant stars via the slight variations in the time series of the stars' light emissions being tracked over several years for more than 100,000 stars. In the first large surveys of galaxies' distances in the 1980's, we found large voids in the Universe, regions that are almost devoid of massive galaxies, leading to a better understanding of the massive large-scale structure of the Universe -- it has essentially the same structure as soap bubbles: the majority of massive galaxies and clusters of galaxies reside on the surfaces of enormous bubble-like regions of space, with almost nothing within the interior regions of the bubble-like structure. We also found extremely rare ultra-luminous galaxies emitting enormous infrared radiation, caused by super-starbursts inside these galaxies -- these were discovered after studying the properties of millions of galaxies.

More recently, we used a citizen science project called Galaxy Zoo to empower the general public to help us look at and characterize images of nearly a million galaxies (which was far more than any individual scientist or team of scientists could look at), and our citizen volunteers found totally new classes of astronomical objects (e.g., light echoes from dead quasars, and also little green galaxies, which the volunteers dubbed "green peas"). In all of these cases, it was the study of very large databases of objects that led to the discovery of the surprising, interesting, new things. For me, that is the most exciting and exhilarating aspect of big data science -- discovery of the new, unexpected thing. That "novelty discovery" approach works in astronomy, but also in any big data domain! Finding the unknown unknowns is what motivates my work and it is my goal every day as a data scientist.

The second and last part of this interview: Kirk Borne on Decision Science as a Service and Data Science curriculum