Exclusive: Cognitive Mining, Data Mining, and StatSoft – Interview with Dr. Thomas Hill

What is the relationship between Cognitive Mining and Data Mining? I discuss this, what makes StatSoft different, achieving user satisfaction, Big Data and Privacy with StatSoft VP Dr. Thomas Hill.

By Gregory Piatetsky, Oct 14, 2013.

What is the relationship between Cognitive Mining and Data Mining?

The landmark research by StatSoft CEO Paul Lewicki and his co-author Thomas Hill, VP Analytic Solutions at StatSoft, showed that the connection is deep and important.

According to Wikipedia, Lewicki and Hill showed that advanced expertise acquired by humans via experience involves the acquisition and use of patterns that can be more complex than what humans can verbalize or intuitively experience. Frequently such patterns involve high-order interactions between multiple variables, while human consciousness usually can handle only first- and second-order interactions.

Dr. Thomas Hill is VP Analytic Solutions at StatSoft Inc., where he has worked for over 20 years on the development of data analysis, data and text mining algorithms, and the delivery of analytic solutions. He was a professor at the U. of Tulsa from 1984 to 2009, where he taught data analysis and data mining courses. Dr. Hill has received numerous academic grants and awards from NSF, NIH, the Center for Innovation Management, the Electric Power Research Institute, and other institutions.

Here is my interview with Dr. Hill.

Gregory Piatetsky, Q1: Your landmark research with Paul Lewicki [and Maria Czyzewska] on "Nonconscious social information processing" showed that humans can acquire complex advanced expertise that they cannot verbalize. This suggests a limitation of expert-hypothesis-driven data analysis methods, because they rely on testing hypotheses that have to be explicitly formulated by researchers.
What are the broad implications for data mining and data science?

Thomas Hill: Lewicki and others (including some research published by Thomas Hill) have demonstrated, over a wide range of human experiences and expertise, that exposure to complex and rich stimuli, consisting of large numbers of sensory inputs and high-order interactions between the presence or absence of specific features, will stimulate the acquisition of complex procedural knowledge without the learners' conscious awareness. Hence the acquisition of such knowledge is best characterized as nonconscious information acquisition and processing.

For example, when humans look at sequences of abstract pictures or faces, or track targets over seemingly random locations on the screen, carefully calibrated measures of procedural knowledge (e.g., based on response times) will reflect the acquisition of knowledge about complex covariations and rules inferred from the rich and complex stimuli.

The conclusions from this research are highly relevant for understanding how large amounts of high-dimensional information, consisting of complex interactions between numerous parameters, can be derived efficiently through systematic exposure to relevant stimuli and exemplars. Specifically:

  • It appears that knowledge about complex interactions and relationships in rich stimuli is the result of the repeated application of simple covariation-learning algorithms that detect co-occurrences between certain stimuli and combine them into complex interactions and knowledge
  • In human experts, most of this knowledge is procedural in nature, not declarative; in short, experienced experts can be effective and efficient decision makers but are poor at verbalizing how those decisions were made
  • When the covariations and repeated patterns in the rich stimulus field change, so that previously acquired procedural knowledge is no longer applicable, experts are slow to recognize this, and are often confused and reluctant to let go of "old habits"

Human expertise and effective decision making can be remarkable in many ways:

  • It is capable of leveraging "big data," i.e., is remarkably capable with respect to the amount of information and stored knowledge that is used.
  • It is capable of coping with high-velocity data, i.e., it is very fast, with respect to the speed with which information is synthesized into effective, accurate decisions.
  • It is very efficient, with respect to how little energy our brain requires to process vast amounts of information and make near-instant decisions.

From the perspective of analytic approaches, these capabilities are accomplished through the repeated application of simple learning algorithms to rich and complex stimuli to identify repeated patterns that allow for accurate expectations and predictions regarding future events and outcomes.

It seems that big-data analytics is converging on this approach as well: applying large numbers of models, based on the application of general approximators to relevant, diverse exemplars, is in most cases the best recipe for extracting complex information from data.

GP, Q2: Your findings reminded me of Leo Breiman's famous 2001 paper "Statistical Modeling: The Two Cultures" (Statistical Science 16:3), where he writes:

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown."

Leo Breiman put himself in the second culture which he described (in 2001) as a small minority of researchers. Do your findings support the second, algorithmic, data-driven culture of data analysis, and if so how?

TH: See also the response to Q1. Obviously, there are and always will be applications for statistical hypothesis testing and modeling. In science in particular, it remains critical that evidence for theories and theoretical understanding of reality is advanced by testing a-priori hypotheses derived from theories, or by refining a-priori expectations.

There are also applications where this approach is critical: Recall that human "experts" (with highly evolved procedural knowledge in some domain) are usually not good at responding and understanding when old rules no longer apply.

If the mechanisms generating data are not understood (e.g., why a drug is effective), it can easily happen that something changes that renders old findings no longer predictive of future outcomes. In medicine, such errors can be critical.

GP, Q3: How did your research in cognitive psychology influence STATISTICA?

TH: Most importantly, it has driven the roadmap with respect to what algorithms we embraced and refined. For example, boosting of simple learners against repeated samples of diverse exemplars (e.g., stochastic gradient boosting) is one of the algorithms that in our minds "mimics" in many ways the way that humans acquire procedural learning.
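To make the idea of boosting simple learners against repeated samples concrete, here is a minimal sketch in plain Python. It is not STATISTICA's implementation: the function names (`fit_stump`, `fit_boosted`, `predict`), the parameter defaults, and the sine-curve toy data are all assumptions chosen for illustration. Each round fits a one-split "stump" to the current residuals on a random subsample and adds a small fraction of its prediction to the running model.

```python
import math
import random

def fit_stump(xs, residuals):
    """Fit a one-split regression stump (the 'simple learner') to residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # No valid split (all x identical): fall back to a constant stump.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_rounds=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared error (illustrative only)."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)            # start from the overall mean
    preds = [base] * len(xs)
    stumps = []
    k = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)      # a fresh random subsample
        sub_x = [xs[i] for i in idx]
        sub_r = [ys[i] - preds[i] for i in idx]  # residuals = negative gradient
        stump = fit_stump(sub_x, sub_r)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate.
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Toy data (an assumption for illustration): a smooth nonlinear curve.
xs = [i / 10 for i in range(100)]
ys = [math.sin(x) for x in xs]

model = fit_boosted(xs, ys)
mean_y = sum(ys) / len(ys)
mse_base = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boost = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {mse_base:.4f}, boosted MSE: {mse_boost:.4f}")
```

No single stump can describe the curve, but many weak stumps, each trained on a different sample of exemplars, add up to a model that tracks it far better than the constant baseline.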

GP, Q4: StatSoft took 1st place in user satisfaction in the 2013 Rexer Analytics Survey (followed by KNIME, SAS JMP, IBM SPSS Modeler, and RapidMiner) and had high satisfaction in other user surveys. Who are your typical users and how do you achieve such satisfaction?

TH: We have always maintained a very disciplined approach to log and "digest" customer feedback. As a result, in many ways we may well have the simplest point-and-click interfaces to build even complex models.

The other big factor, in our experience, is the fact that STATISTICA is very easy to integrate into existing IT assets, regardless of whether they depend on Excel or text files, or distributed file systems and web-based data services and schedulers. One way to look at our platform is as a development platform that is highly compliant with standard interfaces, programming and scripting languages, and so on. We know for sure that this makes deployments of our platform at our larger Enterprise clients much easier and more cost-effective: In many ways, STATISTICA will simply be just another (Windows) service running against existing standard database tables that store all data and metadata. So no new IT skills are required.

In practice, projects can fail when a platform does not integrate, or does not integrate easily, with what is already there, or fails to enable practitioners and non-data-scientists to do useful work quickly. STATISTICA is very good at that.

GP, Q5: How would you compare StatSoft STATISTICA Data Miner with other similar products? How do you compete with enterprise products like SAS and IBM SPSS Modeler on one hand, and free, open source software like R, KNIME, or RapidMiner, on the other hand? What are some new exciting features you are planning to add?

TH: Regarding our competitive advantages over products from SAS and IBM, they are, of course, tough competitors, and we understand that we will win customers only if we outperform our competitors in the areas that are most relevant to the users.

Needless to say, we are working hard to achieve that goal and in the last two years have made significant progress as indicated by market share. Where exactly are our specific strengths in relation to products from these two competitors? I would prefer users (who are the most impartial judges) to answer these questions for you...

Regarding R and other open source software: we certainly do NOT consider them to be our competitors but, rather, most welcome allies who help proliferate the use of advanced analytics in addition to making significant contributions to the science that we all rely on.

StatSoft has been one of the first commercial software companies that fully embraced (in the sense of supporting) R, by incorporating a seamless integration between R and our platform. Also, to the best of our knowledge, we are the only one among the major data mining companies that has contributed to R by enhancing its functionality (i.e., by releasing functionality to the R community under unrestricted GPL licensing).

On the other hand, StatSoft's customers depend on us for our analytics systems, platforms, and solutions that are validated, meticulously tested, follow carefully controlled software life-cycle management procedures, and are developed in close collaboration with end-users in the respective industries to meet their detailed requirements. The open-source world has been and continues to be a wonderful "Wikipedia of statistics and analytics" - a dynamic forum of ideas, new algorithms, methods, technologies.

Commercial software for mission-critical applications requires stringent software development procedures, software lifecycle management, validation, test cases, requirements documents, and so on. For example, in medical device and pharmaceutical manufacturing, analytics have to be validated, documented, and then "locked down." This means that features such as version control of analytic recipes, audit logs, and approval processes are all critical.

In our opinion, open-source code will continue to grow and provide important new ideas. At the same time, commercial and/or mission critical applications will also continue to rely on STATISTICA for its functionality that continues to be developed in direct response to real-life use cases and to the endless lists of requirements that are dictated by constant interactions with the customers who use our software for mission critical applications.

Also, unlike open source software that delivers immensely valuable ideas and implementations but that is less disciplined in its product lifecycle management aspects, the STATISTICA software is strictly validated in a highly disciplined environment, while following product lifecycle management that adheres to SOPs and, for example, maintains backwards compatibility with the previous versions. So you will never encounter a situation where some "new and improved" version of STATISTICA breaks the previous implementation of our technology at customer sites. Also, STATISTICA software is entirely free of the restrictions that some of the open-source tools and algorithms place on commercial use (something we respect and honor).

Regarding roadmap and "exciting new features": Without giving away the "punch line", suffice it to say that one of the opportunities of big data is to build, manage, and maintain large numbers of models. Again, this is something we have seen for a while in manufacturing (thousands of parameters recorded second-by-second to describe very complex processes). This means that a challenge for big-data analytics is to automate model building itself, enable effective model-sentinels that know when to recalibrate models, and do so automatically. In short, the challenge is to enable fewer data analysts and scientists to manage more models (perhaps thousands per analyst), and to fully take advantage of the data that are collected at ever increasing speed and volume. That is where a lot of our R&D has been going for a while.
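As a rough illustration of what a "model-sentinel" might look like, here is a minimal sketch in Python. Everything in it is hypothetical: the `ModelSentinel` class, its parameters (`baseline_error`, `window`, `tolerance`), the trivial mean-based "model", and the synthetic data stream are assumptions made for this example and do not describe StatSoft's software. The sentinel tracks the model's recent absolute errors and signals a recalibration once their rolling average drifts well above the error level seen at training time.

```python
from collections import deque

class ModelSentinel:
    """Watches a model's recent errors and flags when recalibration is due."""

    def __init__(self, baseline_error, window=10, tolerance=2.0):
        self.baseline_error = baseline_error  # typical error at training time
        self.tolerance = tolerance            # allowed multiple of the baseline
        self.errors = deque(maxlen=window)    # rolling window of recent errors

    def observe(self, predicted, actual):
        """Record one prediction error; return True if the model should be refit."""
        self.errors.append(abs(predicted - actual))
        if len(self.errors) < self.errors.maxlen:
            return False                      # not enough evidence yet
        recent = sum(self.errors) / len(self.errors)
        return recent > self.tolerance * self.baseline_error

    def reset(self):
        """Forget old errors after the model has been recalibrated."""
        self.errors.clear()

def fit_mean(values):
    """Stand-in 'model': predict the mean of the training values."""
    return sum(values) / len(values)

# Synthetic stream (an assumption): the process level shifts from 10 to 15.
stream = [10 + (0.5 if i % 2 == 0 else -0.5) for i in range(50)]
stream += [15 + (0.5 if i % 2 == 0 else -0.5) for i in range(50)]

model = fit_mean(stream[:10])
sentinel = ModelSentinel(baseline_error=0.5, window=10, tolerance=2.0)
recent = deque(maxlen=10)   # most recent observations, used for refitting
recalibrated_at = []

for step, actual in enumerate(stream):
    needs_refit = sentinel.observe(model, actual)
    recent.append(actual)
    if needs_refit:
        model = fit_mean(recent)   # recalibrate automatically on recent data
        sentinel.reset()
        recalibrated_at.append(step)

print("recalibrated at steps:", recalibrated_at, "final model:", model)
```

The same loop could in principle be run over thousands of models at once, which is the point of the passage above: the analyst sets the tolerance, and the sentinel decides when each model needs attention.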

GP, Q6: You have been a professor at U. of Tulsa for 25 years. How did you combine research at U. of Tulsa with work at StatSoft and what eventually caused you to leave university and work for StatSoft full-time?

TH: We never really combined the two. I left The University of Tulsa in the late nineties (and after ten years). That was an exciting time when many of the algorithms and approaches commonly applied today started to emerge. I wanted to play a role in this emerging technology, based on my understanding of at least some of the basic mechanisms responsible for the incredible data processing capabilities of the human mind.

GP: Here is the second part of the interview with Dr. Thomas Hill on Cognitive Mining, Data Mining, and StatSoft.