Big Data and Hadoop, Big Data Boot Camp LA

Big Data Boot Camp LA gave attendees a comprehensive understanding of Big Data and Hadoop technologies. Sujee Maniyam delivered a solid technical overview of Hadoop and current trends. Here are the key takeaways.

Big Data Boot Camp (Sep 19-21, 2014) was organized by Global Big Data Conference at the Los Angeles Convention Center. This 3-day event gave a fast-paced, vendor-agnostic, technical overview of the Big Data landscape. The participants included a good number of beginners who had no prior knowledge of databases or programming. The event was aimed at both technical and non-technical people who wish to understand the emerging world of Big Data, with a specific focus on Hadoop, NoSQL & Machine Learning. Along with the lectures, attendees got first-hand practical experience of setting up and using real Hadoop clusters and the latest Hadoop distributions.

Here is a brief summary of the class on “Big Data & Hadoop”:

The event kicked off with an introduction to Big Data. Sujee Maniyam from ElephantScale was the instructor. He noted that, by origin, data is either human generated or machine generated. Although both kinds are growing at a rapid pace, machine data has exploded in recent years. A great deal of intelligence lies in big data, such as user behavior, so all major online companies follow the mantra: “log everything and ask questions later”. The logs help provide users with targeted ads, recommendations, etc. Describing data as the new “gold”, he suggested collecting as much data as possible and then applying analytics to it in different ways to generate profits. He briefly described the big data challenges as volume, variety and velocity.

Hadoop is a software stack that runs on a cluster of machines, providing distributed storage and processing. Before Hadoop, parallel computing treated storage and compute clouds as separate entities. Hadoop merged the two by moving code to data.

On the cost of storage, he mentioned that it has dropped significantly and is about 5 cents/GB these days. A modern disk supports about 150 MB/sec of sequential read speed, i.e. reading 1 TB takes about 2 hours. Reading 10 TB sequentially from a single disk would therefore take about 20 hours. How about 10 TB spread across 10 disks? Reading the 10 disks in parallel takes just 2 hours, for an aggregate IO throughput of about 1.5 GB/sec. Now, if we imagine 10 machines, each with 10 disks, then we can read 100 TB in 2 hours, for an aggregate IO throughput of about 15 GB/sec. This is exactly how Hadoop scales.
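The disk arithmetic can be checked with a short back-of-the-envelope sketch. It takes the 150 MB/sec per-disk figure from the talk and assumes decimal units (1 TB = 1,000,000 MB) and perfectly even parallel reads. The function names are illustrative only. At that per-disk rate, 10 disks give about 1.5 GB/sec in aggregate and 100 disks about 15 GB/sec.

```python
# Back-of-the-envelope scan times, using the ~150 MB/sec per-disk figure from the talk.
# Assumptions: decimal units and ideal, evenly balanced parallel reads.
DISK_MB_PER_SEC = 150


def read_hours(total_tb: float, disks: int = 1) -> float:
    """Hours to scan total_tb terabytes spread evenly over `disks` disks."""
    total_mb = total_tb * 1_000_000
    seconds = total_mb / (DISK_MB_PER_SEC * disks)
    return seconds / 3600


def aggregate_gb_per_sec(disks: int) -> float:
    """Combined sequential read throughput of `disks` disks, in GB/sec."""
    return DISK_MB_PER_SEC * disks / 1000


scenarios = [
    ("1 TB on 1 disk", 1, 1),
    ("10 TB on 1 disk", 10, 1),
    ("10 TB on 10 disks", 10, 10),
    ("100 TB on 100 disks", 100, 100),
]
for label, tb, disks in scenarios:
    print(f"{label}: {read_hours(tb, disks):.1f} h "
          f"at {aggregate_gb_per_sec(disks):.1f} GB/sec")
```

The scan time stays flat as data and disks grow together, which is the scaling property the talk highlighted.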

Hadoop technology is widely used by all major online players, such as Yahoo, Facebook, Twitter, eBay, etc. Maniyam briefly discussed some of the popular Hadoop distributions: Apache (official), Cloudera, Hortonworks, MapR, etc. Hadoop in the cloud is offered by Amazon EC2, Google Compute, Rackspace Cloud, Microsoft, etc.

Big Data in the cloud is popular among startups because there is no upfront cost and they pay per use. A cluster of any size can be created with a click. However, Big Data in the cloud has several challenges, such as cost, security and getting data into the cloud. Running a permanent Hadoop cluster in the cloud can be costly, so “Hadoop on demand” emerged as a solution. The model works as follows: spin up a cluster, process the data, shut down the cluster, and pay only for the usage.
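To see why on-demand clusters can be cheaper, here is a rough cost sketch. The hourly rate, node count and job schedule are made-up numbers for illustration, not figures from the talk or any provider's price list, and the model ignores storage and data-transfer charges.

```python
# Rough comparison of a permanent cluster vs. "Hadoop on demand".
# All prices and workloads below are hypothetical, for illustration only.

def permanent_cost(nodes: int, hourly_rate: float, days: int) -> float:
    """Cost of keeping `nodes` machines running around the clock."""
    return nodes * hourly_rate * 24 * days


def on_demand_cost(nodes: int, hourly_rate: float,
                   job_hours: float, jobs: int) -> float:
    """Cost when the cluster only exists while each job runs."""
    return nodes * hourly_rate * job_hours * jobs


NODES = 10
RATE = 0.50  # hypothetical price per node-hour, in dollars

always_on = permanent_cost(NODES, RATE, days=30)
per_job = on_demand_cost(NODES, RATE, job_hours=2, jobs=30)  # one 2-hour job a day

print(f"permanent cluster: ${always_on:,.2f}")
print(f"on-demand cluster: ${per_job:,.2f}")
```

Under these assumed numbers the on-demand cluster is billed for 60 hours a month instead of 720. The gap narrows as jobs run longer or more often.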

Apache Whirr is an open source tool for easily managing clusters in the cloud.

Hadoop has a large number of use cases: analyzing click stream data to optimize ad serving, detecting fraud, finding influencers in the Twitter sphere. Enterprises are now adopting Hadoop to get new insights and complement their existing data warehouses and processing. Using a chart of access time vs. scaling, he compared various database technologies.

Talking about the Hadoop ecosystem, Maniyam briefly described the following technologies:
  • HDFS – Provides distributed storage
  • MapReduce – Provides distributed computing
  • Pig – High-level MapReduce
  • Hive – SQL layer over Hadoop
  • HBase – NoSQL storage for real-time queries
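To make the distributed computing entry in the list above more concrete, here is a minimal in-memory sketch of the MapReduce pattern, using word count as the example. It only imitates the map, shuffle and reduce phases on one machine. It is not Hadoop's API, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


print(word_count(["log everything", "ask questions later", "log later"]))
```

On a real cluster the map calls run on the machines that hold the data blocks, and the shuffle moves intermediate pairs across the network to the reducers.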

The extended Hadoop ecosystem includes the following:
  • Hadoop streaming – MapReduce in languages other than Java
  • Flume – data ingestion into HDFS
  • Sqoop – Import data from SQL databases
  • Oozie – Hadoop job scheduler
  • Mahout – Recommendation, clustering, classification

He also presented a figure classifying the technologies involved in Hadoop ETL.

Maniyam summarized the lecture by noting that Hadoop is mainstream and solves very real and difficult problems; however, it is still making inroads into enterprises.