Interview: M.C. Srivas, MapR on Demystifying the Art of Processing Massive Data
We discuss the launch and evolution of MapR, achievements, key characteristics of MapR-DB, significance of Apache Drill, MapR use cases and more.
Srivas was Chief Architect at Spinnaker Networks (now NTAP) which built the industry's fastest single-box NAS filer, as well as the industry's most scalable clustered filer. Previously, he managed the Andrew File System (AFS) engineering team at Transarc (now IBM). AFS is now standard classroom material in operating systems courses. When not writing code, Srivas enjoys playing tennis, badminton and volleyball. M.C. has an MS in Computer Science from the University of Delaware, and a B.Tech. in electrical engineering from IIT Delhi.
Here is my interview with him:
Anmol Rajpurohit: Q1. How and when did you get the idea to launch MapR?
M.C. Srivas: I ran one of the major search infrastructure teams at Google where GFS, BigTable and MapReduce were extensively used. I wanted to provide that
AR: Q2. How do you reflect on the journey of MapR so far? What do you consider as the most significant achievements?
MCS: For a company that’s been out of stealth mode for about 3-1/2 years, we have done very well, with over 700 paying customers, many of whom are among the top companies in the world. One use case stands out: the Aadhaar project, an initiative of the Unique Identification Authority of India (UIDAI), runs entirely on the MapR Distribution including Hadoop. We donated our software for this noble cause, which has enabled hundreds of millions of people in India to access services like banking, pension payments, education, healthcare, fuel, etc., which were very difficult for them to access just a few years ago.
MCS: MapR-DB benefits from lessons gleaned from traditional databases like Oracle and Sybase, and from NoSQL systems like BigTable and Amazon DynamoDB. We improved on these ideas while side-stepping weaknesses, and developed a database system that is the best in the world. MapR-DB can easily handle over a million tables, each with trillion-plus rows, and each row with thousands of columns in JSON format. It is unbreakable, with over 99.999% uptime. We also recently announced breakthrough performance results, achieved by using only four nodes of a ten-node MapR-DB cluster, ingesting over 100 million data points per second.
By accelerating performance by 1,000 times on such a small cluster, MapR-DB makes it possible to cost-effectively manage massive volumes of data and enable new applications such as Internet of Things (IoT) and other real-time data analysis applications, including industrial monitoring of manufacturing facilities, predictive maintenance of distributed hardware systems, and data center monitoring.
AR: Q4. What is the significance of Apache Drill to the big data ecosystem? How does Drill compare to Hive, Impala & Shark?
MCS: Apache Drill is breaking down the barriers that have existed in data analytics for as long as databases have existed. Apache Drill is fully self-service, letting you surf your data in-situ. It can handle schema changes as they happen, and process complex nested structures like JSON. Hive and Impala implement different, disjointed subsets of what Apache Drill is capable of. Hive and Impala implement HiveQL (Hive Query Language) which is not ANSI SQL, although Impala might be evolving slowly to include ANSI SQL-92. Shark has been discontinued, mainly due to serious quality problems.
AR: Q5. What does the introduction of Apache Drill in Scala with Apache Spark mean for MapR customers?
MCS: Apache Drill in the Spark context brings the power of ANSI SQL with JSON support to Spark. A Spark programmer can now treat any RDD as a SQL table in Drill, and have Drill seamlessly combine the Spark RDD with external data in HBase, MapR-DB, MongoDB, and other systems. Further, the output of Apache Drill can be treated as the input for a Spark map-reduce process. So the combination of Drill and Spark is unleashing some extraordinary and unprecedented data processing capabilities on the Hadoop platform.
AR: Q6. What are your favorite client use cases for MapR technology? Did any client surprise you by using MapR in unexpected ways?
MCS: We recently had one customer say, “We just think about the data; we don’t worry about limitations of underlying network and servers.” That comment tells me that our product architecture allows organizations to really focus on the data to improve their business. In one situation, we had a customer start with several use cases and improve their business top line to the tune of approximately US$1 billion annually. Seeing customers use big data to impact the business as it happens is very gratifying, and we’re seeing that type of result over and over again.
Second part of the interview