Interview: Kirk Borne, Data Scientist, GMU on Big Data in Astrophysics and Correlation vs. Causality
We discuss how to build the best data models, the significance of correlation and causality in Predictive Analytics, and the impact of Big Data on Astrophysics.
He has published over 200 articles and given over 200 invited talks at conferences and universities worldwide. He serves on several national and international advisory boards and journal editorial boards related to big data. In these roles, he focuses on achieving big discoveries from big data, and he promotes the use of information and data-centric experiences with big data in the STEM education pipeline at all levels. He believes in data literacy for all.
Here is my interview with him:
Anmol Rajpurohit: Q1. In your keynote, you highlighted the benefits of large datasets, how Big Data can be used as a sort-of experimentation bed. Quite often, we have Big Data, but yet not necessarily all the data that we desire (for various reasons, including when the data is hard to find or to quantify). What scientific approaches do you recommend in these situations to benefit from partial, incomplete data?
The goal of modeling and simulation is to help identify the "best" parameters for the things that you don't know, and then apply those "best models" to your field of study in order to make inferences, predictions, and decisions. Of course, a model is not perfect -- it is simply a partial representation of reality, from which you hope to make better discoveries and decisions than you could have made otherwise. As the famous statistician George Box said: "All models are wrong, but some are useful." That's precisely the point. The model is imperfect, but it is still useful. Similarly in the case of partial and incomplete data, our subsequent understanding, models, inferences, and predictions are imperfect, but they are still useful.
AR: Q2. From the perspective of Predictive Analytics, Correlations are a great discovery. But, is it good enough without a proper understanding of underlying Causality?
KB: Finding causality is good science, but in many applications it is more important to make a good decision. For example, in astronomy, scientists discovered in the 1960s that there were energetic bursts of gamma rays coming from space. We had no idea what the cause was,
Similarly, in online retail stores, businesses discover correlations in customer purchase patterns, thereby enabling and empowering recommender engines to present meaningful product recommendations to their customers. These engines are not only good at recommendations, but they are also very good at generating revenue for the business. There is no hint in these models as to what causes a customer to have a preference for product A and also for product Z, but if the historical purchase data reveal that the products are correlated, then it is simply smart business sense for you to act on that correlation, even without the causal understanding.
So, I would say that we definitely want to understand causality, and we should never give up our search for the underlying causes. But let us not churn on that problem (which can lead to "analysis paralysis"); instead, let us focus on using the discovered correlations in powerful predictive analytics applications.
AR: Q3. How has Big Data impacted the science of Astrophysics? Can you share some discoveries that were made based on Big Data?
KB: Astronomy data collections have definitely been growing "astronomically" for many years, but the biggest and the best is yet to come, including the LSST (Large Synoptic Survey Telescope) project that will begin construction in the summer of 2014 and the future SKA (Square Kilometer Array). These petascale big data projects promise amazing discoveries. From the terascale projects of the past couple of decades, there have been many important discoveries. For example, from NASA's Kepler mission, we are discovering hundreds of new planets around distant stars via the slight variations in the time series of the stars' light emissions being tracked over several years for more than 100,000 stars. In the first large surveys of galaxies' distances in the 1980s, we found large voids in the Universe, regions that are almost devoid of massive galaxies, leading to a better understanding of the large-scale structure of the Universe -- it has essentially the same structure as soap bubbles: the majority of massive galaxies and clusters of galaxies reside on the surfaces of enormous bubble-like regions of space, with almost nothing within the interior regions of the bubble-like structure. We also found extremely rare ultra-luminous galaxies emitting enormous amounts of infrared radiation, caused by super-starbursts inside these galaxies -- these were discovered after studying the properties of millions of galaxies.
More recently, we used a citizen science project called Galaxy Zoo to empower the general public to help us look at and characterize images of nearly a million galaxies (which was far more than any individual scientist or team of scientists
The second and last part of this interview: Kirk Borne on Decision Science as a Service and Data Science curriculum