KDnuggets Interview: Michael Brodie on Data Curation, Cloud Computing, Startup Quality, Verizon (part 2)

The second part of our exclusive interview focuses on Data Curation, Cloud Computing, the Data Tamer and Jisto startups, and his experience as Chief Scientist of Verizon, and how that relates to a teenager never tidying a room for 60 years.

By Gregory Piatetsky, Apr 28, 2014.

Here is Part 1 of the interview with Michael Brodie. This is part 2. See also Part 3.

Dr. Michael L. Brodie has served as Chief Scientist of a Fortune 20 company, an Advisory Board member of leading national and international research organizations, and an invited speaker and lecturer. In his role as Chief Scientist Dr. Brodie has researched and analyzed challenges and opportunities in advanced technology, architecture, and methodologies for Information Technology strategies. He has guided advanced deployments of emergent technologies at industrial scale, most recently Cloud Computing and Big Data. In his Advisory Board roles Dr. Brodie addresses current and emergent strategic challenges and opportunities that are central to the charter and success of the organizations. As an invited speaker Dr. Brodie has presented compelling visions, challenges, and strategies for our emerging Digital Universe in over 100 keynote speeches in over 30 countries and in over 100 books and articles.

Gregory Piatetsky: Q5. Currently you are an adviser at a startup called Data Tamer, co-founded by another leading DB researcher and serial entrepreneur Michael Stonebraker. What can you tell us about Data Tamer and its product?

Michael Brodie: Consider the data universe. Since the 1980s I have said in keynotes that the database and business worlds deal with less than 10% of the world’s data, most of which is structured, discrete, and conforms to some schema. With the Web and Internet of Things in the 1990s, massive amounts of unstructured data began to emerge with a growth rate that was inconceivable, shrinking database data to less than 8%. EMC/IDC claim that our Digital Universe is 4.4 zettabytes and will double every two years until 2020, when it will be 44 zettabytes. Amazing!

[If you are constantly amazed at the growth of the Digital World, you don’t understand it yet – A profound, casual comment of my departed friend, Gerard Berry, Academie Francais.]

In 1988 or so you, Gregory, and a few others saw the potential of data with your knowledge discovery in databases – a radical idea. Little did others, including me, realize the potential of this, now named Big Data. Even though Big Data is hot in 2014, almost 30 years later, its applications, tools, and technologies are in their infancy, analogous to the emergence of the Web in the early 1990s. Just as the Web has changed and is changing the world, so too will Big Data.

Compared with database data, Big Data is crazy. It’s inconceivably massive, dirty, imprecise, incomplete, and heterogeneous beyond anything we’ve seen before, and is largely schema-less or model-less. Yet it trumps finite, precise database data in many ways, and hence is a treasure trove of value. Big Data is qualitatively different from database data, which is a small subset of Big Data. It offers far greater potential, and thus value, and requires different thinking, tools, and techniques. Database data is approached top-down. Telco billing folks know billing inside out, so they create models that they impose top-down on data. Data that does not comply is erroneous. Database data, like Telco bills, must be precise with a single version of truth, so that the billing amount is justifiable. Due in part to scale, Big Data must be approached bottom-up. More fundamentally, we should let data speak; see what models or correlations emerge from the data, e.g., to discover if adding strawberry to the popsicle line-up makes sense (a known unknown) or to discover something we never thought of (unknown unknowns). Rather than impose a preconceived, possibly biased, model on data, we should investigate what possible models, interpretations, or correlations are in the data (possibly in the phenomena) that might help us understand it.

Hence, the new paradigm is to approach Big Data bottom-up, due to the scale of the data, and to let the data speak. Big Data is a different, larger world than the database world. The database world (small data) is a small corner of the Big Data world. Correspondingly, Big Data requires new tools, e.g., Big Data Analytics, Machine Learning (the current red-haired child), Fourier transforms, statistics, visualizations, in short any model that might help elucidate the wisdom in the data. But how do you get Big Data, e.g., 100 data sources, 1,000, 100,000 or even 500,000, into these tools? How do you identify the 5,000 data sources that include Sally Blogs and consolidate them into a coherent, rational, consistent view of dear Sally? When questions arise in consolidating Sally’s data, how do you bring the relevant human expertise, if needed, to bear, at scale, on 1 million people? Many successful Big Data projects report that this data curation process takes 80% of the project resources, leaving 20% for the problem at hand. Data curation is so costly because it is largely manual and hence error-prone. That’s where Data Tamer comes to the rescue. It is a solution to curate data at scale.

We call it collaborative data curation because it optimizes the use of indispensable human experts. Data Curation is for Big Data what Data Integration is for small data. Data Curation is bottom-up and Data Integration is top-down. It took me about a year to understand that fundamental difference. I have spent over 20 years of my professional life dealing with those amazing Data Integration platforms and some of the world’s largest data integration applications. Those technologies and platforms apply beautifully to database data – small data; they simply do not apply to Big Data.
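As a rough illustration of the bottom-up, expert-in-the-loop idea described above, here is a minimal Python sketch. It is not Data Tamer's actual algorithm, which the interview does not describe. The record layout, the source names, the `curate` function, and both thresholds are hypothetical; name similarity is computed with the standard library's `difflib.SequenceMatcher`, chosen only for simplicity.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def curate(records, auto_threshold=0.9, review_threshold=0.6):
    """Cluster records bottom-up by name; queue borderline matches for an expert.

    Both thresholds are illustrative assumptions, not values from the interview.
    """
    clusters = []      # each cluster is a list of records judged to be one entity
    review_queue = []  # (record, candidate match, score) awaiting a human decision
    for rec in records:
        best_score, best_cluster = 0.0, None
        for cluster in clusters:
            # Compare against the first record of each cluster (its representative).
            score = similarity(rec["name"], cluster[0]["name"])
            if score > best_score:
                best_score, best_cluster = score, cluster
        if best_cluster is not None and best_score >= auto_threshold:
            best_cluster.append(rec)      # confident match: merge automatically
        elif best_cluster is not None and best_score >= review_threshold:
            # Uncertain match: leave the decision to a human expert.
            review_queue.append((rec, best_cluster[0], best_score))
        else:
            clusters.append([rec])        # no plausible match: treat as a new entity
    return clusters, review_queue


# Hypothetical records from four hypothetical sources.
records = [
    {"source": "billing", "name": "Sally Blogs"},
    {"source": "crm", "name": "sally blogs"},
    {"source": "web", "name": "S. Blogs"},
    {"source": "orders", "name": "Robert Smith"},
]
clusters, review_queue = curate(records)
print(len(clusters), "entities;", len(review_queue), "record(s) for expert review")
```

With these sample records, the two spellings of "Sally Blogs" merge automatically, "S. Blogs" is queued for an expert, and "Robert Smith" becomes a separate entity. The point of the sketch is the division of labor: automation handles the confident cases, and scarce human attention is spent only on the ambiguous ones.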

To emphasize what is ahead, here is a prediction. Data Integration is increasingly crucial to combining top-down data into meaningful views. Data Integration is a huge challenge and huge market that will not go away. Big Data is orders of magnitude larger than small or database data. Correspondingly, Data Curation will be orders of magnitude larger than Data Integration. The world will need Data Curation solutions like Data Tamer to let data scientists focus on analytics, the essential use and value of big data, while containing the costs of data preparation. In addition to Data Tamer, there are some very cool data curation products contributing to addressing the growing need and creating a new software market. What is also cool about data curation is that it can be used to enrich the existing information assets that are the core of most enterprises’ applications and operations. Of course, the really cool potential of data curation is that it makes Big Data analytics efficiently available, allowing users to discover things that they never knew! How cool is that?

For more on Data Curation at scale, see Stonebraker et al., "Data Curation at Scale: The Data Tamer System," in CIDR 2013 (Conference on Innovative Data Systems Research).

GP: Q6. You also advise another startup, Jisto. What can you tell us about your role at Jisto?

MB: I am having a blast with Jisto – some amazingly talented young engineers [PhDs actually] with lots of energy and a killer idea. Jisto is an exceptional example of the quality you ask about in the next question.

Cloud computing enabled by virtualization is radically changing the world by reducing the cost and increasing the availability of computing resources. Can you imagine that only 50% of the world’s servers are virtualized?

Pop quiz [do not cheat and read ahead].

What is the average CPU utilization of physical servers, worldwide?
Of virtual servers?

  Answer: Virtual machine CPU utilization is typically less than 50%, while physical server utilization is typically less than 20%, due to risk and scheduling challenges, but mostly for cultural reasons.

Jisto enables enterprises to transparently run more compute-intensive workloads on these paid-for but unused resources whether on premises or in public or private clouds, thus reducing costs by 75–93% over acquiring more hardware or cloud resources.
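To put the quiz numbers in perspective, here is a back-of-the-envelope sketch in Python. The fleet size of 1,000 servers and the function name are hypothetical. The utilization figures are the upper bounds quoted in the answer above, so the idle capacity computed here is a lower bound. How much of that idle capacity can actually be harvested is not stated in the interview.

```python
def idle_server_equivalents(servers: int, utilization_pct: float) -> float:
    """Capacity left unused, expressed as a number of fully idle servers."""
    return servers * (100 - utilization_pct) / 100


# Hypothetical fleets of 1,000 machines each; utilization is the quoted upper bound.
physical_idle = idle_server_equivalents(1000, 20)  # physical servers, under 20% busy
virtual_idle = idle_server_equivalents(1000, 50)   # virtual machines, under 50% busy
print(physical_idle, virtual_idle)
```

Under these assumptions, a fleet of 1,000 physical servers leaves the equivalent of at least 800 machines' worth of CPU unused, and 1,000 virtual machines leave at least 500.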

Jisto provides a high-performance, virtualized cloud-computing environment from underutilized enterprise or cloud computing resources (servers, laptops, etc.) without impacting the primary task of those resources. Organizations that will benefit most from Jisto are those that run parallelized compute-intensive applications in the data center or in private and public clouds (e.g., Amazon Web Services, Windows Azure, Google Cloud Platform, IBM SmartCloud).

Jisto is currently looking for early adopters for its beta program, who will gain a significant reduction in the cost of their computing, possibly avoiding costly data center expansion. So talk to us at jisto.com.

GP: Q7. You are also a Principal at First Founders Limited. What do you look for in young business ventures - how do you determine quality?

MB: There are armies of people who evaluate the potential of startups. The professional ones are called "Vulture" Capitalists (VCs). The retired ones are called Angels. Like any serious problem, there is due diligence to determine and evaluate the factors relevant to the business opportunity, the technology, the business plan, etc., as the many books, e.g., [R. Field, “Disciplined Entrepreneurship: 24 Steps to a Successful Startup by Bill Aulet”, Journal of Business & Finance Librarianship, vol. 19, no. 1, pp. 83–86, Jan. 2014.], and formulas suggest.

If you are reading a book, then you don’t know. Ultimately it comes down to good taste developed over years of successful experience. Andy Palmer, a serial entrepreneur, good friend, and very smart guy, said “Do it once really well then repeat.” Andy ought to know; Data Tamer is about his 25th startup.

At First Founders I can do some technology, Jim can do finance, Howard can do business plans. Collectively we make a judgment. But good VCs are the wizards. They have Rolodexes. When their taste says maybe, they refer the startup to the relevant folks in their network, who essentially do the due diligence for them. As at First Founders, the judgment is crowdsourced or, as we call it at Data Tamer, expert-sourced. I have a growing trust in the crowd and especially in the expert crowd.

GP: Q8. You were a Chief Scientist at Verizon for over 10 years (and before that at GTE Labs which became part of Verizon). What were some of the most interesting projects you were involved in at GTE and Verizon?

MB: The technical challenge that stays with me is that addressed by the Verizon Portal, Verizon’s solution for Enterprise Telecommunications – providing Telecommunication service to enterprise customers, such as Microsoft. Verizon, like all large Telcos, is the result of the mergers and acquisitions of 300+ smaller telcos. Each had approximately 3 billing systems; hence Verizon acquired over 1,000 billing systems. Billing is only one of over a dozen system categories, including sales, marketing, ordering, and provisioning. Providing a customer like Microsoft with a telephone bill for each Microsoft organization requires integrating data potentially from over 1,000 databases. As is the case for most enterprises, Verizon and Microsoft reorganize constantly, complicating the sources to be integrated, like Microsoft, and the targets, Verizon’s changing businesses, e.g., wireline and FiOS. Every service company faces this little-discussed, massive challenge.

Integrating 1,000s of operational systems is a backward-looking problem. The cool forward-looking problem was Verizon IT’s Standard Operating Environment (SOE). Prior to cloud platforms and cloud providers, Verizon IT (actually one team) sought to develop an SOE onto which Verizon’s major applications (over 6,000) could be migrated to be managed virtually on an internal cloud. What a fun challenge. When the team left Verizon as a group, over 60 major corporate applications, including SAP, had been migrated. Smart folks, good solution that failed in Verizon. In industry, challenges are 80-20; 80% political, 20% technical. The SOE is being reborn in the infrastructure of another major infrastructure corporation.

Finally, the next most interesting and yet unsolved industry challenge was getting over the legacy, which Mike Stonebraker and I addressed in Legacy Information Systems Migration: The Incremental Strategy, Morgan Kaufmann Publishers, San Francisco, CA (1995) ISBN 1-55860-330-1 (M. Brodie and M. Stonebraker).

How do you keep a massive system up to date in terms of the application requirements and the underlying technology, or migrate it to a modern, efficient, more cost-effective platform?

Enterprises tend to invest only in new revenue-generating opportunities, often leaving the legacy problem to grow and grow. So existing systems like billing languish and accumulate. It’s like a teenager never tidying their room for 60 years. Now where are my blue shoes? I suggested to Mike Stonebraker that we rewrite our 1995 book. He did not even respond to the email, suggesting that it is largely a political problem and not a technical one, no matter how brilliant the technical solution provided.

Lesson: If you are a CIO, clean up your goddamn room; you’re not going out until you do!