Building Data Systems: What Do You Need?

This post shares some insight gained through years of building data-powered products, and discusses the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure.

By Fausto Inestroza, Silicon Valley Data Science.

Eiffel Tower

In previous posts, we’ve looked at why organizations transforming themselves with data should care about good development practices, and the characteristics unique to data infrastructure development. In this post, I’m going to share what we’ve learned at SVDS over our years of helping clients build data-powered products—specifically, the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure. If you haven’t looked at the previous posts, I would encourage you to do so before reading this post, as they’ll provide a lot of context as to why we care about the capabilities discussed below. Please view this post as a guide, laid out in easily-visible bullet points for quick scanning.

A few key points before we start:

  • To reiterate what was discussed in earlier blog posts, the points discussed in the sections below are shaped by automation, validation, and instrumentation: the concepts that drive successful development architecture.
  • Much of what is covered in this post comes from continuous delivery concepts. For a more detailed overview of continuous delivery, please take a look at existing sources.
  • Whether you hold these capabilities explicitly in house or not depends on the constraints and aspirations of your organization. They can (and often do) manifest themselves as managed services, PaaS, or otherwise externally provided.
  • The implementation specifics of these capabilities here are going to be determined by the constraints and goals unique to your organizations. My hope is that the points below highlight what you should focus on.


Data engineers must be as conscious of the specifics of the physical infrastructure as that of the applications themselves. Though modern frameworks and platforms make the process of writing code faster and more accessible, the scale in terms of data volume, velocity, and variety of modern data processing means that conceptually abstracting away the scheduling and distribution of computation is difficult.

Put another way, engineers need to understand the mechanics of how the data will processed, even when using frameworks and platforms. SSD vs. disk, attached storage or not, how much memory, how many cores, etc. are decisions that data engineers have to make in order to design the best solution for the targeted data and workloads. All of this means reducing friction between developer and infrastructure deployment is imperative. Below are some important things to remember when thinking about how to enable this:

  • Infrastructure monitoring and log aggregation are imperative as the number of nodes used increases throughout your architecture.
  • The focus should be on repeatable, automated deployments.
  • Infrastructure-as-code allows for configuration management through a similar toolchain as application code and provides consistency across environments.
  • For a number of reasons, many organizations will not allow developers to directly provision infrastructure and will typically have a more specialized operations/network team to handle those responsibilities. If this is the case then providing clear, direct infrastructure deployment request processes for the developer with the ability to validate is necessary.


As the level of target scope increases in the testing sequence, testing for data infrastructure applications begins to deviate from traditional applications. While unit testing and basic sanity checks will look the same, the distributed nature of many data applications make traditional testing methodologies difficult to fully replicate. Below are specific issues:

  • As with any application development, code reviews, code coverage checks, code quality checks, and unit tests are imperative.
  • Using sample data and sample schemas becomes more important since pipelines mean the data becomes the integration point.
  • Having as much information about the data and associated metadata is critical for developers to rationalize fully about how to build and test the application.
  • Developers should be able to access a subset, a sample, or at worst a schema sample (in order to generate representative fake data).
  • Metadata validation is as important as data validation.
  • Duplicating environments in order to establish an appropriate code promotion process is equally important but harder: locally duplicating distributed systems takes multiple processes and actually replicating the distributed nature of the cluster setup is tough without a lot of work. Typically, this issue can be mitigated by investing in infrastructure automation, to be able to deploy the underlying platforms for testing in multiple environments rapidly.
  • The scale of the data often prohibits complete integration tests outside of performing smoke tests in the production cluster as, for example, testing a 10 hour batch job is not practical.
  • Performance testing will be iterative. The cost of duplicating the entire production environment is often prohibitive, so performance tuning will need to take place in something close to a prod environment. An alternative to this is having push-button/automated system to spin up instances just for performance tuning.
  • Further complicating performance testing is the fact that resource schedulers are often involved.
  • Running distributed applications often means multiple processes are creating logs. It’s important to enable your developers to diagnose issues by implementing log aggregators and search tools for logs.
  • Since certain issues only manifest themselves when concurrency is introduced, testing in concurrent mode should be done as early as possible. This means making sure developers are able to test concurrency on their local environments (some frameworks allow for this, e.g using Spark’s local mode with multiple threads).


Dependency management for distributed applications is HARD. It’s necessary not only to maintain consistency across promotion environments (dev, test, QA, prod), but within the clustered machines within each of those environments. The distributed nature of many of the base technologies coupled with the prevalence of frameworks in the Big Data ecosystem means that when it comes to dependency management organizations have to make a decision to either 1) cede management of shared libraries to the platform (usually the operations team) and make sure that developers can maintain version parity; or 2) cede control to developers to manage their own dependencies. Some more specifics below:

  • While the polyglot nature of data infrastructure development will tempt developers toward manual packaging and manual deployment (e.g. on an edge node), packaging standards should be enforced regardless of language or runtime. Choose a packaging strategy for the set of technologies at hand and establish an automated build process.
  • Understand the impact that maintaining multiple languages and runtimes has on your build process.
  • Pipelines themselves need to be either managed using something like Oozie (in hadoop ecosystem) or reliably managed through automated scripting (e.g. using cron).
  • For a traditional application you can version all configurable elements (source, scripts, libs, os configs, patch levels, etc), but with the current state of the technology in enterprise BD, multiple applications are running on a specific software stack (e.g. CDH distro). This means the change set for configurations for an app can not lie entirely within that apps repo. At best, the configurations are across separate repos, but with managed stacks like CDH, configuration versioning is typically handled internally to the platform software itself.