Theoretical Data Discovery: Using Physics to Understand Data Science

Data science may be a relatively recent buzzword, but the collection of tools and techniques to which it refers comes from a broad range of disciplines. Physics has a wealth of concepts to learn from, as evidenced in this piece.

By Sevak Avakians, Intelligent Artifacts.

Despite "Data Science" being a recently coined buzzword, it is not actually a new field of study. Much of the science behind it has been developed by physicists and mathematicians mostly over the past two centuries as part of work needed to solve other problems. (If you need a data scientist, hire a physicist!) Today, the need for data science has expanded beyond the physics laboratory, but the science behind it is still the same. We can continue using answers provided by physicists for questions in new arenas.

Quantum entanglement

For example, I was recently asked this very thought-provoking question regarding Intelligent Artifacts' Genie Cognitive Computer:

"...can it detect if the data included in the dataset is missing something critical? Is there data I should have gathered but didn't? For example, can it detect that I really need a contextual signal for the predictions I am looking to make that I missed to record and tell me that I need to start to measure it? Or can it model the missing information somehow with a reasonable confidence?"

My initial response was:

No, I currently believe that is not possible for any system. It's the same as the saying, "You don't know what you don't know." From an information theory point of view, I don't believe there is any way to detect whether a dataset is missing crucial information. Perhaps one way to get at it would be automated data source discovery and testing: some algorithm would search out different data sources and include them, maybe temporarily, in the data stream. Then it (i.e. Genie's cognitive processor and/or the Information Analyzer) would analyze the results. At some point, it could discover critical data. But, to the original question, I don't believe it is possible for any agent to know that critical data is missing if that data has never been presented to the system.
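The "discovery and testing" idea in that response can be illustrated with a small sketch. This is not how Genie or the Information Analyzer works; it is one assumed way to score a candidate source, by checking how much the empirical conditional entropy of the prediction target drops once the candidate is added to the existing inputs. The function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def conditional_entropy(contexts, targets):
    """Empirical H(target | context) in bits, from paired observations."""
    joint = Counter(zip(contexts, targets))
    context_counts = Counter(contexts)
    n = len(targets)
    h = 0.0
    for (ctx, _target), count in joint.items():
        # p(ctx, target) * log2 p(target | ctx)
        h -= (count / n) * math.log2(count / context_counts[ctx])
    return h

# Hypothetical toy data: the existing input says nothing about the target,
# while the candidate source determines it completely.
existing = ["x", "x", "x", "x"]
candidate = [0, 1, 0, 1]
target = ["up", "down", "up", "down"]

before = conditional_entropy(existing, target)
after = conditional_entropy(list(zip(existing, candidate)), target)
gain = before - after
print(gain)  # 1.0 bit of uncertainty removed by the candidate source
```

On finite samples this empirical measure can never get worse when a variable is added, so a real test would need held-out data or a significance threshold. Either way, the score can only be computed for a source that has actually been presented to the system.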

A few nights later, as I stared at my ceiling pondering questions in quantum mechanics, it occurred to me that this problem has come up before in physics! Specifically, the problem concerned the non-deterministic nature of quantum mechanics (QM), which led to the phenomenon of entanglement, described as "spooky action at a distance" by Albert Einstein.

That first problem (which is analogous to ours today in data science) was that the same sequence of past and current events, i.e. the present state, would bifurcate into multiple possible future outcomes. In other words, the predictions of QM are not deterministic, unlike those of Newtonian mechanics.

German physicist Max Born (point of trivia: his granddaughter is Olivia Newton-John) proposed a statistical interpretation of quantum mechanics. (This is consistent with what we do with Genie's predictions. Given the current state, we calculate the probability of the future states by reviewing the frequency of those futures that have succeeded the current state in the past.) This made a lot of people, including Albert Einstein, very uncomfortable, prompting his famous quote, "God does not play dice." Attempting to disprove this interpretation, Einstein, Podolsky, and Rosen came up with the EPR thought experiment as a paradox, which introduced the world to entanglement, i.e. a phenomenon that allows particles to instantaneously correlate their states, seemingly violating causality and the speed-of-light limit on signaling. Turns out, this isn't a paradox at all. Nature actually behaves this way!
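The frequency-based calculation described in the parenthetical above can be sketched in a few lines. This is a minimal illustration under the assumption that each "state" is a single symbol in a sequence; it is not Genie's actual implementation, and the function names and sample history are hypothetical.

```python
from collections import Counter, defaultdict

def learn_transitions(sequence):
    """Count how often each state was followed by each next state."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def predict(counts, state):
    """Return P(next state | state) as relative frequencies.

    An empty dict means the state has never been observed before.
    """
    followers = counts.get(state)
    if not followers:
        return {}
    total = sum(followers.values())
    return {following: n / total for following, n in followers.items()}

# Hypothetical history: "A" was followed by "B" three times and by "C" once.
history = ["A", "B", "A", "C", "A", "B", "A", "B"]
counts = learn_transitions(history)
probs = predict(counts, "A")
print(probs)  # {'B': 0.75, 'C': 0.25}
```

The same present state ("A") leads to more than one possible future, so the prediction is a probability distribution, not a single outcome.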

To salvage determinism, American physicist David Bohm put forth the idea that the results only appear this way because there are unknown "hidden variables" that, if accounted for, would make the predictions deterministic. John Bell (a friend of IA's Advisory Board Head, David McGoveran) came up with an inequality test that experimenters (Alain Aspect, et al.) used to show that no local hidden variables can account for the observed results.
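To make the inequality test concrete, here is a small numerical sketch using the CHSH form of Bell's inequality, a commonly used variant (the article does not say which form it has in mind). It assumes the textbook QM correlation for a spin singlet, E(a, b) = -cos(a - b), and a standard choice of measurement angles. The local hidden variable bound is found by enumerating every way of fixing the four outcomes in advance.

```python
import itertools
import math

def quantum_correlation(angle_a, angle_b):
    """QM prediction for a spin singlet measured along two angles (radians)."""
    return -math.cos(angle_a - angle_b)

def chsh(correlation, a, a_prime, b, b_prime):
    """CHSH combination S = E(a,b) - E(a,b') + E(a',b) + E(a',b')."""
    return (correlation(a, b) - correlation(a, b_prime)
            + correlation(a_prime, b) + correlation(a_prime, b_prime))

# Standard angle choice that maximizes the quantum value.
a, a_prime, b, b_prime = 0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4
s_quantum = chsh(quantum_correlation, a, a_prime, b, b_prime)

# Local hidden variables: each outcome (+1 or -1) is fixed in advance for
# each setting, independent of the distant setting. Any such model is a
# mixture of these 16 deterministic assignments, so |S| cannot exceed the max.
s_local_max = max(
    abs(A * B - A * Bp + Ap * B + Ap * Bp)
    for A, Ap, B, Bp in itertools.product((-1, 1), repeat=4)
)

print(s_local_max)                # 2
print(round(abs(s_quantum), 3))   # 2.828
```

The quantum value of about 2.828 (that is, 2√2) exceeds the bound of 2, and experiments side with the quantum prediction. Note that the comparison needs two explicit models, each making a definite prediction about the data.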

Well, the point for us is that such a test can theoretically be done for our data sets to prove or disprove the existence of unknown variables that would be important for the predictions.

The difference between quantum mechanics' predictions and Genie's is that QM relies on a highly constrained, rigorous mathematical model. Genie, instead, relies on the data that has been observed. The former allows comparison of a model against data or, as done with Bell's inequality, comparison of the QM model and the deterministic "local hidden variable" model against data. Since we designed Genie to not require a model of the data, the strength of its predictions comes from comparing new data against past data. A Bell-style test would additionally require a second, deterministic model to compare against, and no such model exists for the predictions made with Genie.

Which brings me back to the original conclusion: it isn't possible to detect data that hasn't been fed into the system.

Data Science will continue to benefit from the work done by physicists pushing the boundaries of knowledge, as long as we take the time to apply proper scientific inquiry to the new subjects that lend themselves to it.

The author would like to thank Peter Olausson and David McGoveran for their enlightening conversations on this matter.

Bio: Sevak Avakians has a background in Physics, Telecommunication, Information Theory, and Artificial Intelligence. In 2008, Sevak invented an AGI framework named GAIuS. Genie is built on top of GAIuS and provided through the company he founded, Intelligent Artifacts.