Exclusive: Cognitive Mining, Data Mining, and StatSoft – Part 2

Cognitive Mining and Data Mining, data scientists' mission, Big Data, privacy, and advice for beginning data scientists - Part 2 of the KDnuggets exclusive interview with StatSoft VP Dr. Thomas Hill

By Gregory Piatetsky, Oct 14, 2013.

Here is the first part of Cognitive Mining, Data Mining, and StatSoft - KDnuggets Exclusive interview with Dr. Hill.

This is the second part of the interview.

Dr. Thomas Hill is VP of Analytic Solutions at StatSoft Inc., where he has worked for over 20 years on the development of data analysis, data and text mining algorithms, and the delivery of analytic solutions. He was a professor at the University of Tulsa from 1984 to 2009, where he taught data analysis and data mining courses. Dr. Hill has received numerous academic grants and awards from NSF, NIH, the Center for Innovation Management, the Electric Power Research Institute, and other institutions.

Gregory Piatetsky, Q7: Paul Lewicki said in a recent interview:

"Our mission is to create value and to make the world a better place, and analytics contribute directly to that"

What can data scientists do to help improve the world? Can they, for example, better explain climate change to the public? Do they have a special power with which comes special responsibility, or are they like any other citizens?

Thomas Hill: Every profession and every professional has a responsibility towards the world in which they live and which, after all, enables us to pursue our passions and professions. While this might seem "corny" to some, it is also the truth. I can say that I do not know any successful statistician or data scientist who went into the profession to make money or seek power.

There are two salient parts that drive what StatSoft does every day and how we shape our products.

First, our slogan, "Making the World More Productive," means creating a higher quality of life with fewer resources. A good and concrete example of this is the work we did with fossil fuel power plants where, after making extensive R&D investments, we developed a highly efficient way to help the environment, and we released technology reviewed and covered by the [non-profit] Electric Power Research Institute of Palo Alto. We are very proud of this work. In short, by using data mining and predictive modeling algorithms and approaches, we were able to significantly reduce emissions and improve efficiency: more energy from fewer resources (lower CO2 emissions) without additional complex hardware and costs.

As you may know, our software is widely used throughout manufacturing for various things including semiconductors, solar panels, pharmaceuticals, medical devices, and so on. In all cases we can say that our software contributes to better products, produced at a lower cost, with less scrap and lower impact on the environment. For example, pharma is one of our biggest markets and we are proud that our technology helps produce more affordable medicine, and a cleaner environment.

And regardless of one's political persuasion, I think everyone can agree that this is a "good thing" and makes the world a "better place." Contributing to that goal makes all of us at StatSoft feel good.

Another key part of the functionality of our software platform is that it offers numerous options to support validated deployments, role-based access, and audit logs. In short, it has all the features to enable transparency of processes, documentation of decisions, and assignment of responsibilities via approval processes. This is important because it directly relates to the ubiquitous concerns about "privacy." What is most important is that the data about me that I do not want to be publicly available will never be made publicly available. It is also important that decisions about credit, or the best course of medical treatment, and so on are made impartially, fairly, and following a documented process that is validated to bring about optimal outcomes to the best of our knowledge.

By working for a long time with highly regulated industries, such as the pharmaceutical industry, we have incorporated into the STATISTICA platform all features that will be required to enable governance of data, analytics, and the application of results. We believe we are very much "ahead of the curve" on this aspect of analytics, in particular with respect to big data, and in critical areas of personal importance such as credit worthiness, health care, etc.

GP, Q8: What is your opinion of "Big Data" - both as a technology trend and as a buzzword? Has it reached the Hype Peak? What are the realistic expectations of what can be achieved with and from "Big Data"?

TH: There are two parts here: First, big data today will be normal data tomorrow, and analyzing such data will become routine (if it is not already, e.g., among many of our manufacturing clients).

The other part is unstructured data, high-velocity data, and continuous data streams that are becoming increasingly common. We believe that there are particular analytics and analytic workflows that are still being refined that will become very important, e.g., for automated dynamic learning and forecasting, identification of optimal steady states, optimization of system robustness, and so on. The work we have done with power plants and other continuous process manufacturers has informed our product roadmap in many ways, in particular how to approach such problems.

GP, Q9: The other side of Big Data is the increasing lack of privacy, and the increasing power of companies to understand our behavior. A recent story about Target using data mining to find women who have just become pregnant generated a lot of attention, and companies are using data to better predict our preferences, products we may buy, videos we want to see, whom we will vote for, etc. What is your opinion of the tradeoffs between Big Data and Privacy?

TH: As previously described, it is important to define what aspect of the common usage of the term "privacy" really matters. For example, if a national database of health data contains all of my data and is used and useful for making better treatments, I (and most people, probably) would not have any problems with that. But if such information is widely distributed or, worse, if wrong information gets into the database and is distributed, then reputations can be ruined.

Right now, it is extremely difficult to erase wrong information from all databases, once it is out. Also, old and irrelevant information can live a life "of its own" and never go away.

That is why we stressed governance: To inspire trust in the handling of critical data whose misuse may damage one's reputation or worse (e.g., in the case of erroneous medical treatment), it is absolutely critical that transparent, clear guidelines are created and can be enforced about what data is stored, for how long (when it will be purged), who has access to it, how it can be validated, and how it can be used.

We have seen the trend towards increased regulation of data and predictive modeling in many applications for years, including financial applications, insurance, pharma, medical device manufacturing, and, of course, health care.

GP, Q10: What advice would you give to people considering entering the field of Data Mining and Data Science? Is data scientist "the sexiest profession of the 21st century"?

TH: I think that "Data Science" is only the latest "new-career"; the most important skill is to realize that the speed of changes will only accelerate. At the risk of sounding like a cliché, a career in this field (in analytics of some sort) is a life-long, exhausting, yet exhilarating continuous learning process.

I can say honestly that I have not spent a week over the last 30 years where I have not read multiple papers related to new methods, approaches, problems, policy statements and what not. So, what is critical is the passion for learning new things, and with that there are many ways to get into the data mining and data science field, be that through statistics, engineering, medical research, genetics, cognitive science, data science, and so on.

GP, Q11: Electronic Statistics Handbook™ (statsoft.com/textbook/) is a great resource from StatSoft. Tell us who developed it, how it is typically used, and what some of its hidden gems are.

TH: The idea evolved over 15 years ago and was inspired in the same way as most of StatSoft's most significant projects - brainstorming on how we can be more useful to society (and literally "making the world more productive").

It has always been a common observation among a number of individuals at StatSoft that what used to be called statistics (and what has more generally evolved into predictive or advanced analytics these days) offers potentially one of the most efficient - in terms of ROI - and also easiest to implement ways to increase overall productivity (again, "making the world more productive"). Yet due to the ways in which this topic is often taught in colleges, as well as the near-complete lack of any high-school preparation for this topic, most people typically shy away from it or even hate it as a subject. I know about this, having taught statistics courses; for many students those courses are associated more with painful frustration than with making the world a better place.

Indeed, it seems paradoxical that so many students are "forced" to learn advanced calculus and some of its concepts in high school, which 99.99% of them will never use in their work or life; still, nobody teaches them the concept of "interactions between variables" or "the law of large numbers," without which they may well be somewhat "intellectually impaired" today and in the future, as knowledge workers or even just reflective observers of reality around them...!

Our Electronic Statistics Textbook uses a different approach, starting with the highly intuitive introduction of "elementary concepts" on which everything else is based.

Further, what makes it different, and how we provide guidance for our internal contributors, is to say: Stress what a method is good for first, then how it works, then what the most important strengths and weaknesses are, and then where to go for more technical detail. Every solution and project we have successfully completed at StatSoft started in many ways at the "end": How do we know we are done, how do we know we "won" (how do we measure success). From that comes naturally where to find the data and what methods are applicable. The Electronic Statistics Textbook is written that way: What is it good for, what problems can we solve; then it describes how these problems are solved.

There are many excellent guides and resources out there for statistics, data mining, big data, small data, machine learning, and so on. But StatSoft is in the business of "Making the World More Productive" (as stated above), so the focus of the textbook is and will remain to provide a short and outcome-driven overview of what these methods can accomplish, and how.

The approach used in our Electronic Statistics Textbook has proven to be highly effective and, as a result, EST is the most visited resource on statistics on the Internet, the only resource on statistics recommended by Encyclopedia Britannica, and there are as many as 375,000 links to it from other websites worldwide.

The StatSoft community from all our offices worldwide helps us continuously improve it, and it is one of those "public service" projects of our organization that we are very proud of.

GP, Q12: What is the last book that you read and liked? What do you like to do when you are away from a computer?

TH: And finally, a question I find difficult to answer. The truth is: I am rarely if ever away from the computer, and I mostly "process" articles, papers, book-chapters, and books - sometimes as many as hundreds a week. Also, being a quite active researcher, writer/author, and manager of new technology projects, I find it increasingly difficult to find time to read just for pleasure and to read things entirely unrelated to current projects - which, luckily, are very diversified and give me exposure to sometimes amazingly diverse areas where predictive analytics can be used to improve things...

However, and to be perfectly forthcoming and honest: I love dogs and really enjoyed reading "A Dog's Purpose" by Bruce Cameron - which is just a sweet book...