Interview: Beth Smith, General Manager of the IBM Analytics Platform business, on Analytics, Hadoop, Spark

We discuss coming Analytics surprises, what has changed, Open Source, Hadoop, Apache Spark, Open Data Platform, new analytics roles, IBM resources for analytics education, and more.

By Gregory Piatetsky, @kdnuggets.

Beth Smith is General Manager of the IBM Analytics Platform business. This platform spans Predictive Analytics, BI, Hadoop, Spark, Stream Computing, Databases, Warehouse, Content Management, Information Integration, Master Data Management, and Governance. Her passion is the relentless pursuit of innovation and working with clients to realize value for their businesses.

I recently had a chance to discuss Analytics with her in advance of IBM's announcement of a new big data office in San Francisco focusing on Spark.

Gregory Piatetsky, Q1. What do you think will surprise us in Analytics in 3-5 years, such that people will look back and say "that was obvious"?

Beth Smith: Analytics will no longer be something that a few very smart mathematicians and data scientists use and understand; analytics will be a part of everything we do. The surprise will be how much industries have transformed because of it - even more so than the creation and transformation fueled by the Internet and e-business.

In the next 3-5 years, transportation, manufacturing, banking, telecommunications, our home services - you name it - will all see new processes and business models formed because of analytics. In fact, there was a recent study where Nucleus Research found that for every dollar spent on analytics, the payback was just over $13.

No question in my mind - the organizations that step back and use analytics to impact all aspects of their business will gain a competitive edge. Those that don't will be in for a real surprise.

Q2. Analytics has been around for a while. What has changed?

Nothing has stayed the same. Data is more readily available. External factors, like weather, sensors and social interaction, have extended the data corpus. Plus, analytics capabilities are becoming more sophisticated, able to sift through masses of disparate data - even data streaming in motion - and extract valuable insight. Not only are they more sophisticated, the costs of the fundamental core components of compute, storage and bandwidth are declining at an incredible rate. We did a quick analysis to plot this trend (Figure 1) using publicly available data, and it's clear the barriers to entry have fallen.

Figure 1. Costs of compute, storage, and transfer of data in USD, decreasing exponentially

Add to this the advent of Cloud, Mobile, the Internet of Things and we have a new world in which a growing developer community can build analytic-driven capabilities, deploy them where they are needed, and do so at a cost that is far more affordable than ever before.

Q3. How do you see IBM keeping pace with the rate of innovation in open source for Analytics?

IBM has a long history of open source commitment. In 1999, IBM was a part of the inaugural announcement of the Apache Software Foundation (ASF) because of our commitment to open source and the collaborative community. Not only has that commitment - through both sponsorship funding and donation of millions of lines of code - continued every year since; it has grown.

The history of Linux would be radically different without IBM's involvement. IBM was the first (and for a while, the only) established enterprise IT company that publicly sponsored and legally backed the usage of Linux. Since the beginning of the Linux project, IBM has been the #3 code contributor to Linux, behind only Red Hat and Intel.

Key technologies in IBM's cloud strategy - like Cloud Foundry and OpenStack - are open source based. Cloud Foundry is the basis for Bluemix, and IBM has been a top contributor to the Cloud Foundry codebase since its inception.

It's not just about software - IBM is committed to open source, as evidenced by the OpenPOWER consortium.

In the Hadoop space, IBM is ramping up its development resources to increase code contributions. My development team is actively collaborating with members of the Hadoop committer community - some of whom are of course IBMers - to work on issues related to YARN performance, storage flexibility, and overall Hadoop security. Also, I recently opened a Spark Technology Center in San Francisco, and that team is already making significant efforts in contributing to the Spark code base.

Open source is important for technology progress and innovation. But it isn't the only aspect that enables technology innovation, advancement, and disruption. That's why my worldwide engineering team is focused on combining open source with standards and our own innovative capabilities to help our clients optimize the deployment mix and get to their business objectives faster.

An example is our work with Apache Hadoop. We contribute to the open source Hadoop projects. We understand the need for standards in business and technical solutions, so we build standards compliance into our products. Hadoop is no different - for example, our SQL solutions on Hadoop are ANSI compliant at the highest levels. We collaborated with other vendors to create the Open Data Platform initiative with the goal of establishing consistency, so customers would have a stable foundation for their own innovation.

Of course, we provide our own Hadoop distribution, free for production use, based on that consistent core technology. On top of that, we provide rich capabilities for analysts and data scientists to extract value from data stored in Hadoop, as well as data integration and management capabilities to ensure their Hadoop environment can be managed and deployed enterprise-wide.

Hadoop and Spark are great examples of disruptive technologies. But to make their value accessible to more than just application programmers, deep support for familiar interfaces (like SQL and R) is essential for adoption. Integration with established tools, like reporting, ETL, and data stores (warehouses and databases), is also important. Finally, it's when you take advantage of new technical capabilities to do things that weren't possible before that you start getting real value. This is the motivation behind our innovations like Big Match, which features large scale entity extraction, where you can, for example, learn things about your customers from call center logs, emails, or social data feeds.

We have proven out this approach of advancing and extending open source with our strong heritage in Linux and Java. Now with Hadoop and Spark, our task is to continue our pace of innovation, embrace and contribute to open source, and ensure all our constituents gain the benefit from strength across the board.