KDnuggets Home » News » 2015 » Feb » Opinions, Interviews, Reports » Interview: M.C. Srivas, MapR on Demystifying the Art of Processing Massive Data ( 15:n06 )

Interview: M.C. Srivas, MapR on Demystifying the Art of Processing Massive Data


We discuss the launch and evolution of MapR, achievements, key characteristics of MapR-DB, significance of Apache Drill, MapR use cases and more.



M.C. Srivas ran one of the major search infrastructure teams at Google, where GFS, BigTable and MapReduce were used extensively. He wanted to provide that powerful capability to everyone, and started MapR on his vision to build the next-generation platform for semi-structured big data. His strategy was to evolve Hadoop and bring simplicity of use, extreme speed and complete reliability to Hadoop users everywhere, and make it seamlessly easy for enterprises to use this powerful new way to get deep insights. That vision is shared by all at MapR. Srivas brings to MapR his experience at Google, Spinnaker Networks and Transarc in building game-changing products that advance the state of the art.

Srivas was Chief Architect at Spinnaker Networks (now NTAP), which built the industry's fastest single-box NAS filer as well as the industry's most scalable clustered filer. Previously, he managed the Andrew File System (AFS) engineering team at Transarc (now IBM); AFS is now standard classroom material in operating systems courses. When not writing code, Srivas enjoys playing tennis, badminton and volleyball. M.C. has an MS in Computer Science from the University of Delaware and a B.Tech. in electrical engineering from IIT Delhi.

Here is my interview with him:

Anmol Rajpurohit (AR): Q1. How and when did you get the idea to launch MapR?

M.C. Srivas: I ran one of the major search infrastructure teams at Google, where GFS, BigTable and MapReduce were extensively used. I wanted to provide that powerful capability to everyone, and started MapR on my vision to build the next-generation platform for semi-structured big data. The strategy was to evolve Hadoop and bring simplicity of use, extreme speed and complete reliability to Hadoop users everywhere, and make it seamlessly easy for enterprises to use this powerful new way to get deep insights.

AR: Q2. How do you reflect on the journey of MapR so far? What do you consider as the most significant achievements?

MCS: For a company that’s been out of stealth mode for about 3-1/2 years, we have done very well, with over 700 paying customers, many of whom are among the top companies in the world. One use case stands out: the Aadhaar project, an initiative of the Unique Identification Authority of India (UIDAI), runs entirely on the MapR Distribution including Hadoop. We donated our software for this noble cause, which has enabled hundreds of millions of people in India to access services like banking, pension payments, education, healthcare, fuel, etc., which were very difficult for them to access just a few years ago.
AR: Q3. What unique characteristics of MapR-DB helped it achieve the highest score for Current Offering among all reviewed NoSQL database vendors?

MCS: MapR-DB benefits from lessons gleaned from traditional databases like Oracle and Sybase, and from NoSQL systems like BigTable and Amazon DynamoDB. We improved on these ideas while side-stepping their weaknesses, and developed a database system that is the best in the world. MapR-DB can easily handle over a million tables, each with a trillion-plus rows and each row with thousands of columns in JSON format. It is unbreakable, with over 99.999% uptime. We also recently announced breakthrough performance results: using only four nodes of a ten-node MapR-DB cluster, we ingested over 100 million data points per second.
By accelerating performance by 1,000 times on such a small cluster, MapR-DB makes it possible to cost-effectively manage massive volumes of data and enable new applications such as the Internet of Things (IoT) and other real-time data analysis applications, including industrial monitoring of manufacturing facilities, predictive maintenance of distributed hardware systems, and data center monitoring.
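As a back-of-envelope check, the figures quoted above can be broken down per node and per table; this short Python sketch simply takes the stated benchmark numbers at face value:

```python
# Back-of-envelope arithmetic on the figures quoted in the interview
# (assumption: the stated benchmark numbers are taken at face value).
ingest_rate = 100_000_000           # data points per second, cluster-wide
active_nodes = 4                    # nodes actually used in the benchmark
per_node = ingest_rate // active_nodes
print(per_node)                     # 25_000_000 points per second per node

tables = 1_000_000                  # "over a million tables"
rows_per_table = 1_000_000_000_000  # "trillion-plus rows" each
print(tables * rows_per_table)      # 10**18 addressable rows in total
```

That is 25 million points per second per active node, and a nominal ceiling of 10^18 rows across the table space.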

AR: Q4. What is the significance of Apache Drill to the big data ecosystem? How does Drill compare to Hive, Impala & Shark?

MCS: Apache Drill enables users, for the first time, to run full ANSI SQL 2003 queries directly on Hadoop data. A user can now query semi-structured data with an ambiguous schema, which was not possible before. People can drill down into raw data, in place, without needing a DBA's assistance and without having to transform or migrate it first.

Apache Drill is breaking down barriers that have stood for as long as databases have existed. It is fully self-service, letting you surf your data in situ. It can handle schema changes as they happen, and process complex nested structures like JSON. Hive and Impala implement different, disjoint subsets of what Apache Drill is capable of: both use HiveQL (Hive Query Language), which is not ANSI SQL, although Impala may be evolving slowly toward ANSI SQL-92. Shark has been discontinued, mainly due to serious quality problems.
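The schema-on-read idea described above can be illustrated without Drill itself: project fields out of heterogeneous JSON records at query time, tolerating fields that some records lack. This is a minimal Python sketch of the concept (not Drill's API; the record contents are made up for illustration):

```python
import json

# Heterogeneous JSON records with no fixed schema: the second record
# carries a nested field the first lacks, as raw event logs often do.
raw = """
{"name": "alice", "visits": 3}
{"name": "bob", "visits": 7, "geo": {"city": "Pune"}}
"""

records = [json.loads(line) for line in raw.strip().splitlines()]

# Schema-on-read: fields are resolved at query time, and absences are
# tolerated, much as Drill lets SQL reach into nested JSON (t.geo.city).
result = [
    (r["name"], r.get("geo", {}).get("city"))
    for r in records
    if r["visits"] > 2
]
print(result)  # [('alice', None), ('bob', 'Pune')]
```

Drill performs the equivalent projection and filtering with ANSI SQL directly over the raw files, with no load or transform step.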

AR: Q5. What does the introduction of Apache Drill in Scala with Apache Spark mean for MapR customers?

MCS: Apache Drill in the Spark context brings the power of ANSI SQL with JSON support to Spark. A Spark programmer can now treat any RDD as a SQL table in Drill, and have Drill seamlessly combine the Spark RDD with external data in HBase, MapR-DB, MongoDB and other systems. Further, the output of Apache Drill can be treated as the input for a Spark MapReduce process. So the combination of Drill and Spark is unleashing some extraordinary and unprecedented data processing capabilities on the Hadoop platform.
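The pattern described here, registering an in-memory dataset as a SQL table and joining it with external data, can be sketched with Python's standard sqlite3 module standing in for Drill. This is purely an analogue of the idea, not MapR's or Drill's actual interface, and the table and column names are invented:

```python
import sqlite3

# An in-memory dataset playing the role of a Spark RDD, and a second
# dataset playing the role of rows from an external store (e.g. HBase).
rdd_like = [("alice", 3), ("bob", 7)]
external = [("alice", "Pune"), ("bob", "Tokyo")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (name TEXT, n INT)")
conn.execute("CREATE TABLE geo (name TEXT, city TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", rdd_like)
conn.executemany("INSERT INTO geo VALUES (?, ?)", external)

# Once registered, the in-memory data is queryable as an ordinary SQL
# table and can be joined with the external rows in one statement.
rows = conn.execute(
    "SELECT v.name, v.n, g.city FROM visits v JOIN geo g ON v.name = g.name "
    "ORDER BY v.name"
).fetchall()
print(rows)  # [('alice', 3, 'Pune'), ('bob', 7, 'Tokyo')]
```

In the Drill-on-Spark setup described above, the join output would then flow back into Spark as input for further processing.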
AR: Q6. What are your favorite client use cases for MapR technology? Did any client surprise you by using MapR in unexpected ways?

MCS: We recently had one customer say, “We just think about the data; we don’t worry about limitations of the underlying network and servers.” That comment tells me that our product architecture allows organizations to really focus on the data to improve their business. In one case, a customer started with several use cases and grew their top line by approximately US$1 billion annually. Seeing customers use big data to impact the business as it happens is very gratifying, and we’re seeing that type of result over and over again.

Second part of the interview
