Taming the Elephant: Advice to Director, Big Data Architect

Every other day, a new piece of big data software is released. Which one is the right one to build your product on? Understand how to resolve this conundrum and the role decision makers play.

By Shan Subbu.


The Big Data ecosystem has grown bigger and better, thanks to the big tech companies that are releasing many of their in-house software projects as open source through the Apache Software Foundation. A bonus for them: when they want to scale, there will be a lot of people who already know the software.

Every other day, new software becomes available, and there is a large pool to choose from when building the desired product. So which is the right one to choose for your product? You may take the following into consideration when making the decision.

Building blocks for your enterprise application:

When an enterprise-level data product is being built, there are a lot of moving parts and a lot of decision makers. All the stakeholders, decision makers and the development team should work in harmony to make the product a success. So what role does each one play in making decisions and delivering the product?

Role of a Director or VP – Big Data:

As a Director or VP of Big Data, the bigger responsibility rests on your shoulders: picking the right team and products.

From forming the team to delivery, you face many knowns and unknowns, as in any other application development project. But this time the elephant is bigger, and taming it despite the many unknowns is key to the success of the project!

Here are some factors to consider when choosing the team, products and services:

  • Support from Open Source Community – Does the Apache project have many contributors? How was it built (did it come through the Incubator)? What is the road map? Is the product buggy, with a lot of open issues (check JIRA or Bugzilla)?
  • Support from Third Party Vendors – Is there a vendor who is providing training, support services and other professional services?
    • Hadoop – Cloudera, Hortonworks, MapR, Pivotal, IBM.
    • Spark – Databricks.
    • Cassandra – DataStax.
    • MongoDB – MongoDB.
    • Kafka – Confluent.
  • Support from Big Data Consulting firms – IBM, Accenture, Capgemini, Silicon Valley Data Science, Perficient, KPI Partners, and the list goes on and on. It is best to talk to your primary vendor first to see if they can provide these capabilities.
  • Cross-train existing resources – These are the people who already know the application, the company's processes and policies, and so on. All the third-party vendors mentioned above provide corporate training at your office, or you can negotiate with them to do so.
  • Training Camps – Get instant access to a pool of people who are already trained: Insight Data Science, Insight Data Engineering, Zipfian Academy, NYC Data Science Academy, The Data Incubator, and several others I have missed here.
  • Hiring – Check the availability of resources in the region. This can be gauged easily with a bit of research on LinkedIn.
  • Attend a conference or two – Attend one or two conferences related to Big Data. See what people are working on, their environments, their day-to-day work and the challenges they face. The ones coming up right away are Strata + Hadoop World and Kafka Summit.
  • Get mentored – Find somebody who has already implemented a Big Data project within the company, in the region, or among your contacts. Ask them to be a mentor.
  • Support the Team – Be ready to start from scratch.
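
One way to make the factors above comparable across candidate products is a simple weighted score. The following is a minimal Python sketch, not a prescribed method: the criteria names, the weights and the 1-to-5 scores are all hypothetical placeholders to be replaced with your own assessment.

```python
# Hypothetical weights: how much each factor matters to your organization.
weights = {
    "community": 3,       # open source community support
    "vendor_support": 2,  # third-party training and support
    "hiring_pool": 2,     # availability of people in the region
    "training": 1,        # ease of cross-training existing staff
}

# Hypothetical 1-to-5 scores for two made-up candidates.
candidates = {
    "option_a": {"community": 4, "vendor_support": 5, "hiring_pool": 3, "training": 4},
    "option_b": {"community": 5, "vendor_support": 2, "hiring_pool": 4, "training": 3},
}


def weighted_score(scores, weights):
    """Return the weighted average of the scores, on the same 1-to-5 scale."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Rank the candidates from highest to lowest weighted score.
ranking = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)

for name in ranking:
    print(name, weighted_score(candidates[name], weights))
```

The value of such a matrix is less in the final number than in forcing the team to agree on the weights before looking at any particular product.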

Role of an Enterprise Architect and Solutions Architect:

As an Enterprise Architect or a Solutions Architect, you need to pick the right tools for the enterprise application you are building, not just the tools you already know. It requires some unlearning and learning.

From data ingestion to visualization, there are many solutions you could propose for building the data pipeline as well as the analytics.

Deciding and designing the building blocks of the data pipeline is the key!

Here are a few things to consider when deciding what goes into your data pipeline:

  • Data Ingestion – What kind of data needs to be ingested? Will ingestion happen from multiple source systems? How many messages need to be handled (how about a trillion messages a day)?
  • Architecture – Batch processing, real-time processing, or a combination of both (Lambda Architecture).
  • Data Storage – Mainly the database layer. Get to know the CAP theorem! What does the application need? You cannot get all three of consistency, availability and partition tolerance – you get CA, CP or AP.
  • Visualization / Application – Will the data be used to generate reports that are sent to executives, or will an application consume it? Do the data scientists and analysts have a particular need?
  • “-ity” Qualities – Reliability, Stability, Scalability, High Availability and the other “-ities” that the product needs.
  • Defend your selection of tools with strong backing from data.
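
To put a number behind a requirement such as "a trillion messages a day", a back-of-the-envelope calculation helps. The sketch below is illustrative only: the 1 KB average message size and the 2x peak factor are assumptions made for the example, not figures from any particular system.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput(messages_per_day, avg_message_bytes, peak_factor=1.0):
    """Convert a daily message volume into per-second rates.

    peak_factor is a hypothetical multiplier for traffic bursts
    above the daily average.
    """
    avg_rate = messages_per_day / SECONDS_PER_DAY
    peak_rate = avg_rate * peak_factor
    return {
        "avg_msgs_per_sec": avg_rate,
        "peak_msgs_per_sec": peak_rate,
        "peak_bytes_per_sec": peak_rate * avg_message_bytes,
    }


# Assumed inputs: one trillion messages a day, 1 KB each, 2x peak.
estimate = required_throughput(10**12, avg_message_bytes=1_000, peak_factor=2.0)

for key, value in estimate.items():
    print(f"{key}: {value:,.0f}")
```

A trillion messages a day averages out to roughly 11.6 million messages per second, which is the kind of figure that makes a tool selection easier to defend.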

How are you taming the elephant?