Big Data Developer Conference, Santa Clara: Day 1 Highlights

Highlights from the presentations/tutorials by Data Science leaders from ElephantScale, SciSpike, Twitter and Informatica on day 1 of Big Data Developer Conference, Santa Clara

Big Data Developer Conference, organized by Global Big Data Conference, was held last week in Santa Clara, March 23-25. It brought together data scientists and professionals who handle data and perform analytics across many domains, all eager to learn the best and latest in the Big Data ecosystem. The conference tutorials and talks covered a wide variety of topics including Hadoop, Lambda Architecture, MapReduce, Hive, Pig, Spark, MongoDB, etc.

Highlights from Day 1 (Monday, March 23):

The first day of the conference started with an introduction to Big Data technologies and Hadoop. Mr. Sujee Maniyam from ElephantScale was the instructor for the Hadoop and MapReduce tutorials spread across the day. He started with the significance of Big Data and emphasized that machine-generated data has exploded in recent years. He also noted that unstructured data is growing far faster than structured data, so more opportunities lie in exploring unstructured data.
After giving a brief history and timeline of Hadoop, he outlined the top corporations leveraging Hadoop and shared some interesting use cases. Hadoop is now supported by most cloud service providers. Big data in the cloud is very popular among startups, as they do not need to get involved with infrastructure setup and maintenance. However, cloud-hosted data comes with challenges such as cost, security, and the difficulty of moving data into the cloud in the first place.

He shared the biggest misconceptions around Hadoop, as follows:
  1. Hadoop is a database for big data. [Reality: Hadoop is a batch processing system. No real-time queries.]
  2. Hadoop clusters contain thousands of machines. [Reality: Only a few big companies run thousand-machine clusters. Lots of startups run 10-100 nodes. We can do a lot with small clusters.]
  3. Hadoop runs on cheap hardware. [Reality: Hadoop servers are pretty beefy.]
  4. Hadoop replaces the existing data warehouse. [Reality: Hadoop is deployed alongside existing technologies. It’s complementary, not competitive.]
  5. Hadoop is easy. [Reality: It is a relatively new technology, with engineering challenges such as translating problems into the Hadoop/MapReduce domain.]

Sujee concluded the session by walking through the Hadoop ecosystem, briefly explaining its various components. He also gave a hands-on session on Hadoop while answering questions from the audience.
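The core MapReduce pattern taught in such tutorials, word count, can be illustrated without Hadoop itself. The sketch below is a plain-Python simulation of the map, shuffle, and reduce phases; real Hadoop jobs implement Mapper and Reducer classes (typically in Java) and the framework performs the shuffle, but the data flow is the same.

```python
# Framework-free sketch of the MapReduce word-count data flow:
# map -> shuffle (group by key) -> reduce. For illustration only;
# a real Hadoop job runs these phases distributed across a cluster.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big ideas", "data everywhere"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'everywhere': 1}
```

Because each map call is independent and each reduce key is independent, both phases parallelize naturally, which is what lets Hadoop scale the same logic across many machines.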

Dr. Vladimir Bacvanski, Founder, SciSpike, talked about how to survive and thrive with NoSQL in the enterprise. He described Polyglot Persistence as the key to success: the use of several different database systems, each chosen because it is the best fit for its application area. He also covered the shortcomings of relational databases and how NoSQL addresses them.