Big Data and Hadoop, Big Data Boot Camp LA
Tags: Big Data, Bootcamp, Elephant Scale, Global Big Data Conference, Hadoop, Los Angeles-CA, Sujee Maniyam, Training
Big Data Boot Camp LA provided attendees with a comprehensive understanding of Big Data and Hadoop technologies. Sujee Maniyam gave a solid technical overview of Hadoop and current trends. We summarize the key takeaways.

Here is a brief summary of the class on “Big Data & Hadoop”:
The event kicked off with an introduction to Big Data. Sujee Maniyam from ElephantScale was the instructor. He noted that, based on its origin, data is either human-generated or machine-generated. Although we are generating both kinds of data at a rapid pace, machine-generated data has exploded in recent years. A great deal of intelligence lies in big data, such as insight into user behavior, so all major online companies follow the mantra “log everything and ask questions later”. The logs help provide users with targeted ads, recommendations, and so on. Describing data as the new “gold”, he suggested collecting as much data as possible and then applying analytics to it in different ways to generate profits. He briefly described the big data challenges as the three Vs: volume, variety, and velocity.



Big Data in the cloud is popular among startups, since there is no upfront cost and they pay per use. A cluster of any size can be created with a click. However, Big Data in the cloud has its own challenges, such as cost, security, and getting data into the cloud. Running a permanent Hadoop cluster in the cloud can be costly, so “Hadoop on demand” emerged as a solution: spin up a cluster, process the data, shut down the cluster, and pay only for the usage. Apache Whirr is an open-source tool for easily managing such clusters in the cloud; a sketch of this workflow follows.
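
To make the on-demand model concrete, here is a minimal sketch in the style of a Whirr recipe plus the launch/destroy cycle. The cluster name, instance roles, and credential environment variables are illustrative assumptions, and exact property and role names can vary by Whirr version, so treat this as a rough outline rather than a tested recipe.

```properties
# hadoop.properties -- hypothetical Whirr recipe for a small on-demand cluster
whirr.cluster-name=bootcamp-hadoop
# One master (namenode + jobtracker) and three workers (datanode + tasktracker)
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
```

```bash
# Spin up the cluster, run the jobs, then tear it down to stop paying for it
whirr launch-cluster --config hadoop.properties
# ... submit MapReduce jobs against the cluster ...
whirr destroy-cluster --config hadoop.properties
```

The key point is the lifecycle: the cluster exists only while the work runs, so you pay for compute hours rather than for idle machines.
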
Hadoop has a large number of use cases: analyzing click-stream data to optimize ad serving, detecting fraud, and finding influencers in the Twitter-sphere. Enterprises are now adopting Hadoop to gain new insights and to complement their existing data warehouses and processing. With the help of the following image, he described scaling and access time for various database technologies.

The core Hadoop ecosystem includes the following:
- HDFS – Provides distributed storage
- MapReduce – Provides distributed computing
- Pig – High-level MapReduce (see the sketch after this list)
- Hive – SQL layer over Hadoop (also sketched below)
- HBase – NoSQL storage for real-time queries
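
To give a feel for the difference between Pig (high-level MapReduce) and Hive (SQL over Hadoop), here is the same hypothetical aggregation, counting clicks per URL over comma-delimited log files in HDFS, written both ways. The table name, column names, and HDFS paths are assumptions made up for illustration.

```sql
-- HiveQL: define an external table over existing files, then query it with SQL.
-- Hive compiles the query into MapReduce jobs behind the scenes.
CREATE EXTERNAL TABLE clicks (user_id STRING, url STRING, ts BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/clicks';

SELECT url, COUNT(*) AS num_clicks
FROM clicks
GROUP BY url
ORDER BY num_clicks DESC
LIMIT 10;
```

```pig
-- Pig Latin: the same pipeline expressed as a sequence of data-flow steps.
clicks  = LOAD '/data/clicks' USING PigStorage(',')
          AS (user_id:chararray, url:chararray, ts:long);
grouped = GROUP clicks BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(clicks) AS num_clicks;
ordered = ORDER counts BY num_clicks DESC;
top10   = LIMIT ordered 10;
DUMP top10;
```

Both scripts end up running as MapReduce jobs; the choice is mostly about whether a declarative SQL style or a step-by-step data-flow style fits the team better.
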
The extended Hadoop ecosystem includes the following:
- Hadoop Streaming – MapReduce in languages other than Java (see the example after this list)
- Flume – Data ingestion into HDFS
- Sqoop – Imports data from SQL databases
- Oozie – Hadoop job scheduler
- Mahout – Recommendation, clustering, and classification
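
As an illustration of Hadoop Streaming, here is a minimal word-count sketch in Python: the mapper and reducer simply read from stdin and write to stdout, and Hadoop handles the shuffle and sort between them. The HDFS paths and the streaming-jar location are assumptions that depend on the specific Hadoop installation.

```python
#!/usr/bin/env python
"""Word count via Hadoop Streaming (a sketch; paths are hypothetical).

Submit roughly like this (the streaming jar location varies by distribution):
  hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -files wordcount.py \
      -mapper "python wordcount.py map" \
      -reducer "python wordcount.py reduce" \
      -input /data/text -output /data/wordcount
"""
import sys


def mapper():
    # Emit "word<TAB>1" for every token read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)


def reducer():
    # Streaming sorts map output by key, so all counts for a word arrive together.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

The same pipeline can be tested locally without a cluster: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.
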
Technologies involved in Hadoop ETL can be classified as shown in the following figure:
