Exclusive Interview: Peter Bruce, President of Statistics.com
We discuss the mission of Statistics.com, the selection of analytics courses and certificates, the future of analytics education, MOOCs, whether statisticians are disconnected from Big Data, the role of a data scientist, and more.
At the recent INFORMS Analytics Conference in Boston (Mar 30 - Apr 1, 2014) I met Peter Bruce, President of Statistics.com, and our discussion grew into this interview, covering the mission of Statistics.com, the future of analytics education, MOOCs, whether statisticians are disconnected from Big Data, and more.
Peter Bruce is the President of The Institute for Statistics Education at Statistics.com. He is the developer of Resampling Stats software (originated by Julian Simon in the 1970s), and has taught resampling statistics at the U. of Maryland and elsewhere. He is the co-author of Data Mining for Business Intelligence (Wiley, 2006, 2nd ed. 2010), the author of Introductory Statistics: A Resampling Perspective (Wiley 2014), and has written many journal articles. He serves on the American Statistical Association Advisory Committee on Professional Development. Before taking up statistics, Bruce served in the U.S. Foreign Service; he has degrees (not in statistics) from Princeton, Harvard and the U. of Maryland.
Gregory Piatetsky: What is the mission of Statistics.com?
Peter Bruce: Our mission is to provide online education in statistics, analytics, data science and OR to a global audience. We have small classes, four weeks long, that give learners the opportunity to interact with recognized experts without requiring them to be online at specific times. In ed-tech jargon, a course of ours is a "SPOC" (small, private, online class).
GP: How do you select courses, and what certificates do you offer?
PB: I try to keep abreast of what is standard and what is new in our fields, by attending conferences, speaking with publishers, eliciting suggestions and advice from our faculty (we have over 50 instructors), participating in meetups, and talking with people like you. We offer 110+ courses and four certificates:
- Analytics for Data Science
- Programming for Data Science
- Social Science Statistics
GP: How do you compete with universities on one hand, and with big MOOCs like Coursera and Udacity on the other hand?
PB: Our topical coverage is more comprehensive than that of either MOOCs or university analytics programs; we offer 110+ courses in areas like predictive modeling, data mining, text analytics, operations research, social networks, statistical programming, and, of course, traditional statistics.
We are more affordable and offer more flexible scheduling than universities; courses last only four weeks, with 4-5 courses starting every Friday. Compared to MOOCs, we offer a very different experience: classes are small (generally fewer than 30 students), you get personal interaction with a leading expert over a period of four weeks, and we provide human help with your software and coding, as well as human review of your project work to help you understand where you went wrong. Compared to both MOOCs and universities, our courses are more focused, concentrating on one or two topics.
GP: Your site is named statistics.com, although you also have many data science courses. How do you see Statistics vs Data Science vs Big Data?
PB: The word "statistics" has different meanings. To a sports fan, it means the numbers that quantify what is going on in a game, or with a player. The word's origin is in the Latin word for state, and early kings and generals organized the first systematic collection of statistics to determine whom they could tax and draft into the army. In the 1800s it also came to mean the discipline that concerned itself with counting, measuring, analyzing and interpreting data. Later, it also came to mean, more narrowly, the study of how to draw inferences from sample data.
Interesting Fact: 2.4 million survey responses in a Literary Digest survey predicted a landslide victory for Alf Landon. Not Big Data by today's standards, but impressive in 1936. The massive error contributed to the success of George Gallup, who had published his first poll the year before.
Our job is to provide data professionals the skills, knowledge and tool proficiency they need for jobs that involve analytics, which includes statistics. I think this is pretty close to the idea of "data science," a term that is relatively new, and to Big Data as well (we offer courses in Hadoop, Python, NLP, SQL, text mining, etc.). Though I suppose Big Data, being a term of great popular interest, may mean different things to different people. However, I think the major challenges in Big Data, and 95% of the effort, lie in the sphere of extracting, transforming, cleaning and wrangling data to the point where it is amenable to analysis, and in deploying the model in a way that provides value.
Recruiters, for convenience, need shorthand terms like "data scientist," or "statistician," and job postings (rather than arguments among academicians) may be the ultimate arbiter of terminology. Beyond that, I think it does not pay to get too hung up on definitions and professional classifications in a rapidly changing area like this. Some of the biggest contributions to data science have come from people who take what is valuable from several fields and synthesize progress - for example, statisticians like Jerome Friedman and Daryl Pregibon who combine statistical backgrounds with machine learning techniques.
So, however you define the fields, we provide data analytics professionals with the capabilities and understanding they need to do their jobs.
GP: Are statisticians disconnected from Big Data? See the huge debate which arose from my post: Why statistical community is disconnected from Big Data and how to fix it?
PB: I suppose some are. Many statisticians who completed their studies more than a few years ago did not really have exposure to Big Data, and did not necessarily have to do much programming, instead using software like SPSS and SAS. And there is still a lot of statistical analysis that happens not with big data, but with data collected following a plan of sound statistical design: clinical trials, surveys, and research studies, for example. Statistics departments nowadays, though, recognize their responsibility to provide students with effective programming skills to deal with a varied range of data and analysis requirements.
"Is this a game of chance?"
"Not the way I play it."
(W. C. Fields, My Little Chickadee)
Statisticians bring to the table an appreciation for the deceptive role that random chance can play.
I think it is also relevant to ask whether Big Data is disconnected from statistics, particularly in the areas of study design and understanding the role that random chance and variability play in data. More data does not always mean more accurate data, and some familiarity with statistics can help determine when more is better. Statisticians also have a great deal to contribute to the issue of false discovery, and this is happening, to some extent, in the area of genomics.
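The point that more data is not always more accurate data can be illustrated with a small simulation in the spirit of the Literary Digest survey mentioned earlier. This is only a sketch under made-up assumptions: the true support level and the two response rates below are illustrative values, not historical figures, and the helper `poll` is hypothetical.

```python
import random

def poll(n, p_support, rng):
    """Fraction of n respondents who back the candidate,
    when each respondent does so with probability p_support."""
    return sum(rng.random() < p_support for _ in range(n)) / n

# Illustrative assumptions, not historical data.
TRUE_SUPPORT = 0.61              # share of all voters backing the candidate
RESPONSE_RATE_SUPPORTER = 0.10   # supporters rarely return the mail-in survey
RESPONSE_RATE_OPPONENT = 0.25    # opponents return it more often

# Share of supporters among those who actually respond to the biased survey.
biased_share = (TRUE_SUPPORT * RESPONSE_RATE_SUPPORTER) / (
    TRUE_SUPPORT * RESPONSE_RATE_SUPPORTER
    + (1 - TRUE_SUPPORT) * RESPONSE_RATE_OPPONENT
)

rng = random.Random(1936)  # fixed seed so the run is repeatable
big_biased = poll(200_000, biased_share, rng)   # huge but self-selected sample
small_random = poll(2_000, TRUE_SUPPORT, rng)   # small but properly random sample

print(f"true support:         {TRUE_SUPPORT:.3f}")
print(f"200,000 biased resp.: {big_biased:.3f}")
print(f"2,000 random resp.:   {small_random:.3f}")
```

Under these assumptions the large sample converges very precisely on the wrong answer, while the random sample, one hundredth the size, lands close to the truth. Adding more self-selected responses only shrinks the noise around the biased estimate.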
John Elder, founder of a highly regarded data mining consulting firm, talks about the "Massive Search Effect," in which enough searching and comparing will always turn up something "meaningful." A background in statistics really helps in distinguishing real phenomena from artifacts of chance. I heard a presentation just recently by a computer science professor who oversaw a team project to assess correlations (interactions) among 3000 drugs, meaning 9 million drug pairs. The focus was on the significant computational challenges, and it was only in passing that he mentioned that 4.5 million chi-square tests were performed. To a statistician, of course, the real challenge is in deriving meaning from such results, and judging whether the conclusions are real, or just the product of chance.
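The arithmetic behind this example is easy to check. The sketch below counts the drug pairs, estimates how many tests would look "significant" by chance alone at an assumed 0.05 threshold (the interview does not say which threshold was used), and shows two standard corrections, Bonferroni and Benjamini-Hochberg. The function `benjamini_hochberg` is a plain illustrative implementation, not the method used in the project described.

```python
from math import comb
import random

def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its threshold.
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])

n_drugs = 3000
ordered_pairs = n_drugs ** 2       # 9,000,000 (counts A-B and B-A, plus A-A)
n_tests = comb(n_drugs, 2)         # 4,498,500 distinct pairs, about 4.5 million

alpha = 0.05                       # assumed per-test significance level
expected_false_positives = n_tests * alpha   # about 225,000 if no drug pair interacts
bonferroni_threshold = alpha / n_tests       # per-test level that caps any false positive at 5%

print(f"distinct pairs tested:      {n_tests:,}")
print(f"expected chance 'findings': {expected_false_positives:,.0f}")
print(f"Bonferroni threshold:       {bonferroni_threshold:.2e}")

# Small simulation: 100,000 tests where nothing real is going on,
# so every p-value is uniform on [0, 1].
rng = random.Random(0)
null_p = [rng.random() for _ in range(100_000)]
raw_hits = sum(p < alpha for p in null_p)
bh_hits = len(benjamini_hochberg(null_p, q=alpha))
print(f"null tests with p < {alpha}: {raw_hits:,} of {len(null_p):,}")
print(f"null tests kept by Benjamini-Hochberg: {bh_hits}")
```

With no real effects at all, roughly 5% of the tests still clear the uncorrected threshold, which at this scale means hundreds of thousands of spurious "interactions". False discovery rate control is one way statisticians keep that in check.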
GP: How do you see the future of MOOCs and analytics education? Will Machine Learning play an important role in education, as it does with Coursera?
PB: Some predictions:
- Some insights may be gained, particularly at the earlier grades where risk factors for some simple metrics like dropout or failure might be teased out. Certainly, there are a number of commercial software vendors plying the K-12 market with this promise. My guess is that the contribution of machine learning will not soon grow to be a big component of educational practice. This is due partly to the complexity of the human learning process and its desired outcomes, and partly to the fact that removing human judgment from teaching is antithetical to most educators, and so will face some adoption resistance.
- The auto-collection of data will be useful in MOOCs and other large courses for purposes of improving each individual course, using the data from that course. This will involve mainly old-fashioned data analysis, not necessarily machine learning. This course-specific use will be constrained by the time and attention the instructor can devote to the analysis.
- The auto-collection of massive amounts of data in MOOCs and other venues will yield publicly available data flows, which, in turn, will fuel much general educational research. In the same way, the Harvard Nurses Study, which collected data on tens of thousands of nurses over decades and then made it publicly available, generated hundreds of epidemiological research papers. The effect will be much bigger with the MOOC data, since the data collection cost is so much smaller.
- The quality of the research will, unfortunately, be poor. The data will not be as clean as the nurses data, and the false discovery rate will be just as high as it is in epidemiological research (where a recent study found that, of 52 claims based on observational data, none replicated in controlled trials). The issue will be compounded in the field of education because so much research is already taking place. The need for professors of education is driven by the need to train over 400,000 teachers per year, more than 20 times the number of doctors. In our current system of higher education, those professors must publish research, and critique each other's work. The arrival of massive amounts of education Big Data will be met with a hungry audience.
GP: What is your opinion on the role of a data scientist? You probably saw the Data Scientist Venn Diagram from Drew Conway. Should companies try to find (and statistics.com train) versatile Data Scientists, or should companies try to fill the Data Scientist role with a team?
See discussion on /2014/01/split-on-data-science-skills-individual-vs-team-approach.html
PB: No one person can do everything and be all things to all people - there are no unicorns, as some have called the intersection of all three domains in Conway's diagram. Even though we have 110 courses at Statistics.com, we focus on the analytic sphere and bring in the other skills as they relate to it. So we teach how to query databases, but not how to design and administer them. We teach some programming, and how to use programming skills to build a data science workflow, but we are not a university computer science department.
That said, even though a company may have a team of specialists, continually broadening their knowledge beyond their individual specialties is valuable, both to the company and the individual. The Operations Research Society (INFORMS) touts the "T" model for professionals in its "CAP" certification process: the "T" means deep vertical knowledge in one specialty, and familiarity across a broad range of others. The more the database person, the programmer and the statistician know about what each other does, the faster and more efficient the development process.
This also relates to the question of whether to hire from outside, or train from within. The "T" model suggests training from within to broaden the top of the T for existing employees, and hiring from outside where an essential deep skill is missing.
In a recent survey by the Institute for Corporate Productivity, 47% of respondents said they mostly plan to train existing staff on analytics, 17% said they would mostly hire new staff, and 27% said they already have the needed staff (Training and Development, April 2014).
GP: What is a recent book you read and liked?
PB: Life After Life by Kate Atkinson and Cloud Atlas by David Mitchell. Both make masterful use of time as a non-linear variable to be manipulated.