
Exclusive Interview: Michael O’Connell, Chief Data Scientist, TIBCO on How to Lead in Big Data

We discuss Big Data vs. Fast Data, Data Visualization trends, Jaspersoft acquisition, factors differentiating future leaders of Big Data and more.

Michael O'Connell is Chief Data Scientist at TIBCO Software, developing analytic solutions across a number of industries including Financial Services, Energy, Life Sciences, Consumer Goods & Retail, and Telco, Media & Networks. He has been working on statistical software applications for the past 20 years, and has published more than 50 papers and several software packages on statistical methods. Michael did his Ph.D. work in Statistics at North Carolina State University and is an Adjunct Professor of Statistics in the department.

Here is my interview with him:

Anmol Rajpurohit: Q1. How would you define "Big Data"? When enterprises think of Big Data, what are the most important questions they should be thinking about?

Michael O'Connell: First, I distinguish "big data" as referring to data-at-rest, and "fast data" to data-in-motion. In that context, I think of big data as (a) data (at rest) that can't be handled with usual data management and analytic techniques; and/or (b) all available data (at rest) pertaining to a business problem.

Some important big data questions in the enterprise include:
  • What/how specific business problems can be addressed with big data analytics and what is the value thesis/potential.
  • What data management and analytic methods are appropriate, e.g. in-database or in-appliance analytics, Hadoop MapReduce jobs and data-on-demand; and, most importantly, how to structure the ETL, EDA, feature identification and modeling to get to the kernel of the business problem. It is far more important to construct a relevant dataset and features, and to apply appropriate algorithms, than to run inappropriate algorithms on a big, raw dataset. For example, a (generalized) linear model run on millions or billions of rows is often not as informative as a model that better identifies the relevant patterns and structures, fitted on a smaller, well-constructed dataset.
  • How to deploy insights obtained with big data-at-rest to fast data-in-motion and enable real-time data analysis.
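
The point about relevant features mattering more than row count can be illustrated with a minimal sketch. The data below are synthetic, and every name and number (`fit_line`, `r_squared`, the quadratic relationship, the sample sizes) is an assumption chosen for illustration rather than anything from the interview. A straight line fitted to 50,000 raw rows explains almost nothing, while the same simple model fitted to 200 rows with a constructed feature recovers the pattern.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical "big" raw dataset: the response depends on x squared.
big_x = [rng.uniform(-3.0, 3.0) for _ in range(50_000)]
big_y = [x * x + rng.gauss(0.0, 0.5) for x in big_x]

# Linear model on all raw rows: many rows, wrong structure.
a_raw, b_raw = fit_line(big_x, big_y)
r2_raw_big = r_squared(big_x, big_y, a_raw, b_raw)

# Same simple model on 200 rows, but with a constructed feature (x squared).
small_x = big_x[:200]
small_y = big_y[:200]
small_feature = [x * x for x in small_x]
a_feat, b_feat = fit_line(small_feature, small_y)
r2_feature_small = r_squared(small_feature, small_y, a_feat, b_feat)

print(f"R^2, raw x on 50,000 rows:      {r2_raw_big:.4f}")
print(f"R^2, x squared on 200 rows:     {r2_feature_small:.4f}")
```

The design choice here is deliberately crude: the model class is identical in both fits, so any difference in explained variance comes from the feature construction alone, not from the amount of data.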

I like to fit generalized additive models to identify (non-linear) relationships, gradient boosting machines to obtain good predictions, PCA with robust partitioning to identify segments, and association rules for affinity analysis. In addition to using a variety of supervised and unsupervised learning methods, we like to use location analytics and optimization methods; and we do a ton of time series analyses for forecasting demand and future results. We always explore the data and models with interactive EDA and visual analytics – this is a must for identifying the business-relevant structures.
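
As a rough illustration of association rules for affinity analysis, here is a minimal sketch in plain Python. The baskets, the `support` and `pair_rules` helpers and the 0.4 support threshold are all hypothetical; a real workflow would typically rely on a dedicated library and far larger data.

```python
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def pair_rules(baskets, min_support=0.4):
    """Single-item rules lhs -> rhs with support, confidence and lift."""
    items = sorted(set().union(*baskets))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support < min_support:
            continue  # the pair is too rare to be worth reporting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs}, baskets)
            lift = confidence / support({rhs}, baskets)
            rules.append((lhs, rhs, pair_support, confidence, lift))
    # Highest lift first: these are the strongest affinities.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


rules = pair_rules(transactions)
for lhs, rhs, sup, conf, lift in rules:
    print(f"{lhs} -> {rhs}: support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than independence would predict; a lift below 1 suggests the opposite, even when the pair is frequent.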

In my experience one can generate extreme value by applying analytics at the confluence of (big) data at rest and fast data in motion to solve engineering, manufacturing, R&D and sales & marketing problems.

AR: Q2. Your book "A Picture is Worth a Thousand Tables" was a very insightful and encouraging description of graphics in life sciences. What would be your top recommendations to get the most value out of graphics? What key trends do you currently observe in the data visualization arena?

MOC: I come from the graphics school of John Tukey and Ed Tufte, whom I see as godfathers of the two primary forms of graphics: (a) graphics for exploratory data analysis, and (b) graphics for pixel-perfect reporting. Exploratory analysis generates value in data discovery. Report graphics generate value in communicating the findings from data analysis with clarity.

My top recommendation is to enable exploratory graphics for creating data discovery sequences - Guided Analytics - that are immediately intuitive to a casual business user. Guided Analytics disaggregate data and provide transparency on the business, to enable impactful action.

These are graphics sequences that force the business user to refresh and explore data, and to "spot the fires" in their business.

One interesting trend is the emergence of JavaScript graphics, for example the D3 library. It's exciting to add such beautiful graphics to our Spotfire data discovery environment and to see these graphics respond to filtering, marking, brushing, coloring and layout.

AR: Q3. Where does the recent acquisition of Jaspersoft fit in TIBCO's overall strategy and product portfolio?

MOC: TIBCO Jaspersoft is a disruptive BI product suite that provides pixel-perfect interactive graphical reporting and components that are readily embeddable in other software applications. Jaspersoft has a terrific commercial open source business model including nearly 16 million downloads, more than 140,000 production deployments and 2,000 commercial customers in 100 countries. More than 400,000 registered members participate in Jaspersoft’s open source BI projects. The Jaspersoft AWS pay-by-the-hour service is particularly interesting. Jaspersoft is an exciting addition to the TIBCO product family. I'm looking forward to exploring the product suite, the community and the synergies with the rest of our stack and beyond.

AR: Q4. You come from a very strong statistical background. What challenges have you experienced while working with team members who have a not-so-good understanding of statistics? How much knowledge and experience of statistics do you consider vital for data scientists?

MOC: Pretty much everyone at TIBCO thinks like a statistician. Statistics is the math of life, a framework for understanding events. TIBCO is all about understanding and anticipating events, and enabling action. The Spotfire tag line is "first to insight, first to action". We are a fast company. We make things happen around the world every second of every day.

I think of data scientists as knowing more about statistics than computer scientists and more about computer science than statisticians.

My staff develops simple software solutions to complex problems. We enable our customers to create extreme business value with our visual analytics software.

AR: Q5. In the current competitive landscape of Big Data, what factors do you think will help differentiate the future leaders?

MOC: It's all about making complexity simple; and driving insight to action.

Future leaders:
  1. Have the analytic capabilities to understand big data at rest: connect and mash-up data; derive features; provide guided, self-service dashboards; develop in-line and predictive analytics that get to the heart of the business problem at hand.
  2. Use in-line analytics to interpret fast data in motion; to understand what is happening at the moment of truth.
  3. Have the software muscle to take corrective action; to sense and respond to issues and opportunities; to make the most of perishable inventory in the moment.
  4. Understand and effectively monetize their social networks of customers, channels, suppliers and agents.
  5. Inspire their staff with collaboration; crowd-sourcing and organizational intelligence.

Analytics software plays a huge role in all of this. We will continue to see leading companies outsmart the rest with 2-speed information architectures that create rapid value while operationalizing efficient business processes.

AR: Q6. Is "talent crunch" a real problem in Data Science? What has been your personal experience around it?

MOC: I don't think it’s quite as big a problem as it's made out to be. I have a ton of great talent interested in joining our team. But you do need to recognize and foster great talent; and create working environments and internal communities for synergies and knowledge sharing.

AR: Q7. What advice would you give to Data Science students and researchers who are just starting their career?

MOC: Spend your time on high value problems. Try out as much productive technology as you can as fast as you can - Spotfire, R, SAS, JS, Python, StreamBase for example. Get yourself on projects with collaborative customers and smart colleagues. Get in your 10,000 hours as fast as you can. Have fun with your work, your colleagues, and your customers. Be positive, passionate and observant in all aspects of your life. Be respectful and look to learn and help.

AR: Q8. On a personal note, we are curious to know what keeps you busy when you are away from work?

MOC: I'm a music and art lover and I enjoy going to art exhibits and concerts. I like musicians such as My Bloody Valentine, Caribou, Brian Eno, Wire, Ride, Pavement, Sigur Rós, Radiohead, Spiritualized, Nick Cave, Lou Reed, David Bowie, Ryan Adams, Amy Winehouse; directors like Wes Anderson, Jim Jarmusch, David Lynch, Matthew Weiner; and visual artists like Jenny Holzer, Damien Hirst, Paul Klee, Ai Weiwei, Gerhard Richter, Rebecca Horn and many more.