Building Data Systems: What Do You Need?

This post shares some insight gained through years of building data-powered products, and discusses the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure.



As with testing and build, automated deployments and release management processes are crucial. Below are some things to consider:

  • The use of resource managers in modern data systems means that whatever deployment process is in place must account for resource requests or capacity scheduling, as well as related feedback to the development team.
  • The cost of maintaining multiple large clusters is often prohibitive to fully duplicating the prod environment. Therefore, you will have to deal with mismatch-sized clusters or logically separate multiple “environments” on the same cluster.
  • Performance and capacity testing will be iterative. It’s often difficult to deterministically judge ahead of time the exact resource configuration needed to optimize performance. Therefore, even in prod the roll out is will require multiple steps.
  • It’s appropriate for packaged artifacts that maintain jobs or workflows to live on edge nodes. Source code should not.


Commercial data infrastructure technologies (Hadoop distros and the like) are often complex integrations of multiple distributed systems. Operations processes must account for a multitude of configurations and monitoring of distributed processes. Some additional points:

  • Upgrade strategy (i.e. whether or not to stay on the latest supported versions of technologies) is important since in a production environment we have to verify regression tests are passing and validate that developers are in sync regarding the new versioning preferably through explicit dependency management automation. Complicating things is the fact that many platforms are multi-system integrations makes it more important to test and stay on supported version sets. In general, the bias should be toward the latest supported version set, since the feature set expands/evolves quickly, but the most important thing is to establish a verifiable, versionable process, that can be rolled back if necessary.
  • With Hadoop distros or other frameworks, operations infrastructure often incorporates execution time reporting (e.g. Spark dashboard), in order to validate and diagnose applications.
  • Many platforms, especially commercial offerings, provide UIs for configuration management. While they are useful to get information about the system, configurations should be managed through an automated, versionable process.


Having an explicit strategy for both resource management and dependency management are critical for reconciling developer productivity, performance at scale, and operational ease. Below you’ll find some of the key points to consider:

  • The feature set of specific distributions and component versions affect functionality. Modern data infrastructure development and Big Data are an emerging practice so the feature set of specific frameworks expands/evolves quickly but commercially supported distributions move at a slower pace. It’s important to explicitly define and communicate the versions of software you will be using, so that developers understand the capabilities that they can leverage from certain frameworks.
  • Never begin development using a newer version of a framework or API unless you have a plan for rolling out the update in production. Since you should be deploying to production as early as possible then this would mean this would happen almost concurrently.
  • Data platforms often aim for an org-wide, cross-functional audience set. This means that you will need to have a way to manage authorization and performance constraints/bounds for different groups. Resource managers and schedulers typically provide the mechanisms for doing this, but it will be up to your organization to make the decision as to who gets what (Have fun!).
  • Using resource managers will be your best bet from an operations standpoint to align your business needs to the performance profile of specific jobs/apps. The simplest model for this is essentially a queue, but can take, and most times likely should, take the form of group-based capacity scheduling. The trade-off here is that organization coordination/prioritization is a must.


Having available developers with deep knowledge of your technology stack is critical to success with data efforts. In addition, it’s necessary to provide those developers with the tools to do their job:

  • Desktop virtualization/containerization tools like VirtualBox, Vagrant, or Docker will allow developers to more easily “virtualize” the production environment locally.
  • Developers should be able to duplicate the execution environments locally as closely as possible. Meaning that if the execution machines are linux based and your workstations are windows, by necessity, tools like VirtualBox, Vagrant, Docker will need to be used.
  • You must align training plans with roadmap of platform/product. For example, if you plan on using modern data infrastructure, it is essential for developers to understand distributed systems principles.
  • You must allocate the appropriate development and operation personnel (or equivalent managed services) for the lifespan of the product/platform.

Product Management

The platform aspect of data infrastructure development calls for strong coordination of consumer needs, developer enablement, and operational clarity. As such, it’s important to engage all relevant parties as early as possible in the process. Validate results against the consumer early and begin establishing the deployment process as soon as the first iteration of work begins. Other tips:

  • Address operations and maintenance concerns as early as possible by beginning to deploy ASAP during development in order to iterate and refine any issues.
  • Much of data work is inherently iterative, reinforcing the need for feedback support systems such as monitoring tools at operational level and bug/issue trackers at the organizational level.
  • Like most modern software systems, getting feedback from end users is essential.
  • Establishing a roadmap for the data platform is imperative. Data platforms often aim for an organization-wide, cross-functional audience set. It is imperative to have a strategy on prioritizing consumer needs and onboarding new users/LoB/groups. The feature requests of the consumers must be treated as a prioritized set.
  • Availability, performance, etc. requirements have a large impact on design due to the complexities of distributed systems in general and current Big Data technologies specially, so these things should be thought about as early as possible.
  • Any technology platforms or distribution used in production must have licensed support.

There are a lot of capabilities to think about, and it can seem daunting. Remember that at the highest level, you are aiming to give your teams as much visibility as possible into what’s happening through instrumentation, implementing processes that provide validation at every step, and automating the tasks that make sense. Of course, we are here to help.

Hopefully, the points highlighted above will be useful as you are establishing your development and operations practices. Anything you think we missed?

Bio: Fausto Inestroza is a Senior Data Engineer at Silicon Valley Data Science. He has extensive experience with data platforms, analytical processes and distributed systems.

Original. Reposted with permission.