Big Data Developer Conference, Santa Clara: Day 1 Highlights
Tags: Developers, Elephant Scale, Global Big Data Conference, Highlights, Informatica, MongoDB, Parquet, SciSpike, Twitter
Highlights from the presentations and tutorials given by data science leaders from Elephant Scale, SciSpike, Twitter and Informatica on day 1 of the Big Data Developer Conference, Santa Clara.
Typical NoSQL systems are non-relational, distributed and horizontally scalable, and have dynamic schemas. He grouped NoSQL systems into four categories: column-family, document-oriented, key-value and graph databases. He briefly discussed each category along with its typical use cases.
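To make the four categories concrete, here is a rough sketch of how the same user data might be shaped under each model. It uses only plain Python structures; every name and value in it (`user:42`, `followers_of`, the sample fields) is a hypothetical illustration, not something shown in the talk or taken from any particular database.

```python
import json

# Key-value: an opaque value fetched by a single key.
key_value = {"user:42": json.dumps({"name": "Ada", "city": "London"})}

# Document-oriented: nested, self-describing records.
# Fields may differ from one document to the next (dynamic schema).
documents = [
    {"_id": 42, "name": "Ada", "city": "London", "langs": ["en", "fr"]},
    {"_id": 43, "name": "Lin"},  # no "city" field, and that is allowed
]

# Column-family: row key -> column family -> column -> value.
column_family = {
    "user:42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"logins": 7},
    }
}

# Graph: nodes plus labeled edges between them.
nodes = {42: {"name": "Ada"}, 43: {"name": "Lin"}}
edges = [(42, "FOLLOWS", 43)]


def followers_of(user_id):
    """Return the ids of nodes with a FOLLOWS edge pointing at user_id."""
    return [src for src, label, dst in edges
            if label == "FOLLOWS" and dst == user_id]
```

The key-value store can only fetch the whole value by key, while the graph model makes relationship queries such as `followers_of(43)` the natural operation.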
Next, he discussed two key components of Hadoop: the MapReduce framework and the Hadoop Distributed File System (HDFS). He described the lambda architecture and Apache Flink in detail to help the audience understand where they are useful. He concluded with the following key points:
- NoSQL addresses the weak points of relational systems
- Polyglot persistence: Use the most suitable database for your task
- Scale out to crunch Big Data
- Integrate with conventional technologies
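As a rough illustration of the MapReduce model mentioned in this talk, the sketch below runs the classic word count in a single Python process. It is a toy under stated assumptions: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real Hadoop job would distribute these phases across a cluster and read its input from HDFS.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the work can be split across many machines, which is what makes the model a fit for scaling out.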
Julien Le Dem, Analytics Data Pipeline Tech Lead, Twitter, talked about how to use Parquet as a basis for ETL and analytics. He started by introducing a typical data flow process. On storing data for analysis, he noted that producing a lot of data is easy, but it needs to be compressed to save storage.
Scanning a lot of data is easy but not necessarily fast. Since we want faster turnaround and a smaller storage footprint, we need compression, but not at the cost of read speed. Interoperability is not easy either: we need a storage format that works with all the tools we use and keeps our options open for upcoming technologies.
Apache Parquet provides three key features: interoperability, space efficiency and query efficiency. Julien described each feature in detail with examples. He also shared code snippets showing how to write to Parquet with MapReduce, Scalding and Pig, and how to query it using Pig, Scalding, Hive, Impala, Drill and SparkSQL. At the end of the session, he shared the Parquet timeline and the impressive growth in project contributors.
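The space and query efficiency claims follow from Parquet's column-oriented layout. The sketch below is a toy illustration in plain Python, not Parquet's actual file format or API: it pivots rows into columns, run-length encodes one column, and reads a single column without touching the others. The sample records and the helper names (`to_columns`, `rle_encode`, `rle_decode`) are assumptions made for the example.

```python
from itertools import groupby

# Hypothetical row-oriented records, as they might arrive from a log.
rows = [
    {"user_id": 1, "country": "US", "clicks": 3},
    {"user_id": 2, "country": "US", "clicks": 1},
    {"user_id": 3, "country": "US", "clicks": 4},
    {"user_id": 4, "country": "DE", "clicks": 1},
    {"user_id": 5, "country": "DE", "clicks": 5},
    {"user_id": 6, "country": "FR", "clicks": 9},
]


def to_columns(records):
    """Pivot row records into one list of values per field."""
    names = records[0].keys()
    return {name: [record[name] for record in records] for name in names}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


columns = to_columns(rows)

# Space efficiency: similar values sit next to each other and compress well.
country_runs = rle_encode(columns["country"])

# Query efficiency: an aggregate over one column reads only that column.
total_clicks = sum(columns["clicks"])
```

Here six country values collapse into three runs, and summing `clicks` never reads `user_id` or `country`. Real Parquet files combine ideas like these with further encodings and general-purpose compression.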
Sunil Sabat, Principal Technical Alliance Manager, Informatica, delivered a workshop on MongoDB. He first explained how MongoDB differs from other databases on the market. He said that MongoDB offers the following set of properties: ad hoc queries, real-time aggregation, rich query capabilities, strong consistency, geospatial features, support for most programming languages and a flexible schema. He gave the audience hands-on exercises to help them understand the key features of MongoDB.
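The workshop exercises themselves are not reproduced here, but a few of the listed properties can be sketched with the `pymongo` driver. The following is a minimal sketch under stated assumptions: the `demo` database, the `places` collection, the sample documents and the helper names are all hypothetical, and `run_demo` expects a MongoDB server reachable at the given URI.

```python
def nearby_filter(lon, lat, max_meters):
    """Ad hoc geospatial query: cafes within max_meters of a point."""
    return {
        "category": "cafe",
        "location": {
            "$near": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_meters,
            }
        },
    }


def average_rating_by_city():
    """Aggregation pipeline: average rating per city, highest first."""
    return [
        {"$match": {"rating": {"$exists": True}}},
        {"$group": {"_id": "$city", "avg_rating": {"$avg": "$rating"}}},
        {"$sort": {"avg_rating": -1}},
    ]


def run_demo(uri="mongodb://localhost:27017"):
    """Run both queries against a live server (assumed to be running)."""
    from pymongo import MongoClient  # third-party driver, imported lazily

    places = MongoClient(uri)["demo"]["places"]  # hypothetical names
    places.drop()
    places.create_index([("location", "2dsphere")])  # required by $near
    # Flexible schema: the second document has no "rating" field.
    places.insert_many([
        {"name": "A", "category": "cafe", "city": "Santa Clara", "rating": 4,
         "location": {"type": "Point", "coordinates": [-121.95, 37.35]}},
        {"name": "B", "category": "cafe", "city": "San Jose",
         "location": {"type": "Point", "coordinates": [-121.89, 37.34]}},
    ])
    nearby = list(places.find(nearby_filter(-121.95, 37.35, 1000)))
    averages = list(places.aggregate(average_rating_by_city()))
    return nearby, averages
```

The query and the pipeline are ordinary dictionaries and lists, so they can be built and inspected without a server; only `run_demo` needs a live connection.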
Highlights from second day