Big Data Developer Conference, Santa Clara: Day 3 Highlights

Highlights from the presentations and tutorials by data science leaders from VISA, Glassbeam, and Unravel on day 3 of the Big Data Developer Conference, Santa Clara.

The Big Data Developer Conference, organized by Global Big Data Conference, was held last week in Santa Clara, March 23-25. It brought together data scientists and professionals from a wide range of domains who handle data or perform data analytics and were eager to learn the best and latest in the Big Data ecosystem. The conference tutorials and talks covered a wide variety of topics, including Hadoop, Lambda Architecture, MapReduce, Hive, Pig, Spark, and MongoDB.

Highlights from Day 1

Highlights from Day 2

Highlights from Day 3 (Wednesday, March 25):

Ajit Gaddam, Chief Security Architect at VISA, kicked off day three of the conference with a talk on data security in Hadoop. He cited the top three reasons for securing Hadoop:
  1. Hadoop contains sensitive data
  2. Hadoop is subject to regulatory adherence
  3. Hadoop security can enable your business

He briefly explained the five pillars of the Hadoop data security framework:
  1. Data Management
  2. Identity & Access Management
  3. Data Protection at Rest
  4. Data Protection in Transit
  5. Data Leakage / Exfiltration Prevention

Data management involves data classification and prioritization, data discovery, and data tagging. For identity and access management, he recommended using Kerberos to validate nodes and client applications before admitting them into the cluster, and pointed to Apache Sentry as a great platform for role-based access control.
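As a rough sketch of what Sentry-style role-based access control looks like in practice (the role, database, and group names below are illustrative, not from the talk), Sentry enforces standard SQL GRANT/REVOKE statements issued through Hive:

```sql
-- Illustrative sketch: role-based access control enforced by Apache Sentry,
-- expressed as HiveQL. Role, database, and group names are hypothetical.
CREATE ROLE analyst;
GRANT SELECT ON DATABASE sales TO ROLE analyst;  -- read-only on one database
GRANT ROLE analyst TO GROUP analytics_team;      -- map the role to an OS/LDAP group
SHOW GRANT ROLE analyst;                         -- verify the granted privileges
```

The key design point is that privileges attach to roles, and roles attach to groups, so individual users never receive direct grants.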

Regarding data protection at rest, he advised using file- or OS-level encryption to protect against privileged users or applications with direct access to files. He also stressed the use of a central key management server to maintain the crypto keys, with separate keys for the data. For data protection in transit, he suggested using the TLS protocol to authenticate and ensure the privacy of communications between nodes, name servers, and applications.
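One concrete way to apply these ideas is HDFS transparent encryption (available since Hadoop 2.6), which pairs per-zone encryption with a central Hadoop KMS as the key server. A minimal sketch, with illustrative key and path names:

```shell
# Hedged sketch: encryption at rest with HDFS encryption zones backed by a
# central Hadoop KMS. The key name and path are illustrative.
hadoop key create pii-master-key -size 256        # create a key in the KMS
hdfs dfs -mkdir -p /secure/pii                    # directory to protect
hdfs crypto -createZone -keyName pii-master-key -path /secure/pii
hdfs crypto -listZones                            # verify the zone exists
```

Files written under the zone are encrypted transparently, so even a privileged OS user reading the raw blocks sees only ciphertext.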

Mohammed Guller, Chief Architect at Glassbeam, gave a hands-on training session on Spark Core, Spark Streaming, Scala, DataFrames, and more. He encouraged the audience to start by learning the important ideas behind the technology. He gave a quick introduction to Hadoop under the hood and discussed key concepts such as data serialization and file formats.

He introduced Spark as an in-memory cluster computing system for processing and analyzing large datasets, with APIs available in Scala, Java, and Python. Spark is not only faster than Hadoop MapReduce but also more expressive, since it is not limited to map and reduce operations. Its speed comes from in-memory computation and an advanced directed acyclic graph (DAG) execution engine that optimizes stages and minimizes shuffles.
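These points can be seen in a small Scala sketch (Scala being the language used in the session). It assumes a local Spark installation, and the input path is illustrative; nothing executes until the final action, at which point the DAG engine plans the whole pipeline at once:

```scala
// Hedged sketch: a Spark word count in Scala, runnable as a standalone app
// or pasted into spark-shell. The input path is illustrative.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/logs.txt") // lazily builds the DAG
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)              // operations beyond plain map/reduce
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()                         // keep the result in memory for reuse

    counts.take(10).foreach(println)   // the DAG executes only on this action
    sc.stop()
  }
}
```

Because `counts` is cached, a second action over it reads from memory instead of recomputing from disk, which is the core of Spark's speed advantage over MapReduce.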