Connecting Data Systems and DevOps

This post will explain why anyone transforming their company into a data-driven organization should care about software development best practices, even if they don’t consider themselves a software company.

By Fausto Inestroza, Silicon Valley Data Science.


We’ve discussed it before, but it’s worth mentioning again: developing data systems is hard. Really hard. Companies spend a lot of time and effort talking to vendors and carefully selecting the software components for their enterprise data infrastructure. However, they often overlook the tools and processes that will enable them to build and operate the applications that make those component choices valuable. The consequences are dire: unmaintainable systems, unmet goals, unhappy users.

Paralleling the rise of data in the collective consciousness of businesses has been another trend: the emergence in the software development world of a number of interrelated approaches and practices (continuous delivery, continuous integration, and DevOps) meant to improve the quality of software delivery. In particular, DevOps has emerged as a way to put into practice some of the same organizational principles espoused by agile methodology (collaboration, enablement, and a focus on iterative development cycles). The goal is to have operations, development, and QA capabilities work closely together (often using the same tools) throughout the entire software lifecycle.

In this post, I will explain why anyone transforming their company into a data-driven organization should care about software development best practices, even if they don’t consider themselves a software company.

In future posts, I’ll delve into the impact of data and distributed systems on development and operations, and the capabilities and practices that will help your data systems development succeed.

Why should I care about DevOps?

The open source projects that power many of the data systems built today were originally created as infrastructure: software that provided generalized functionality for multiple use cases. These technologies were typically created to reap the benefits that detailed insights and expanded processing capabilities provided to specific organizations. These companies understood that to effectively build such infrastructure you have to enable both the developers creating the software and the consumers of the infrastructure. As both creators and consumers of these systems and their dependent applications, they were able to both build (and extend) the general capabilities of their data infrastructure and provide feedback on necessary functionality as the goals of the organization evolved.

As data has become increasingly relevant across many industries, an ever-expanding number of companies (that otherwise would not consider themselves “software” companies) have begun building new data systems in order to transform themselves into data-driven organizations. These companies will sensibly make their first move toward this goal by using packaged software or commercially supported platforms built around open source data infrastructure projects. And, yes, many of those platforms are great pieces of software and will provide a great deal of needed functionality.

However, it pays to be blunt here: If you are on the path to being a data-driven company, you have to be on the path to being a development-enabled company.

At some point, the specifics of your business will demand software development. The features you want to provide to your users, be they internal business users, developers, or external customers, will outmatch the pre-built functionality of packaged software solutions. Simply integrating a number of components or disparate systems together will not fulfill your ambitions.

Like the companies that created the original data infrastructure software, you will need to be able to create the processes within your organization to effectively build and provide feedback on the software you are creating.

Development effort will be one of your most precious resources. Development time is an oft-overlooked input into total cost of ownership when implementing a new project. Providing tools and establishing processes that maximize the potential of your developers and operators while minimizing the amount of rework or bug-hunting they will have to do is imperative.

In addition, you need to get feedback and validation from your systems and consumers as early as possible. What this typically means is deploying to production as early as possible in a repeatable fashion. In this way you can adjust your development accordingly and can focus on providing as much tangible benefit as fast as possible.

If you don’t consider yourself a software or tech company, I understand what you might be thinking: that software companies have development in their DNA, and that it would be difficult to change your organization into one that functions the same way. However, software development best practices have now moved beyond the walls of traditional software companies. They can (and often do) manifest themselves as consulting services, managed services, PaaS, or otherwise externally provided capabilities. Whether you hold these development capabilities in-house or not will be predicated on your organization’s specific goals and constraints. What’s most important to remember is that these are concerns you must address to give your data system development the best chance to succeed.

Where to start

In subsequent posts I’ll dive deeper into how to use modern software development principles to tackle some of the difficulties presented by data systems. For now, at a high level, there are three concepts that will enable your goals:

Automation: Many companies struggle with building, testing, and operating data systems because of a lack of automation. The most obvious benefit of automation is the time saved, but it also enables us to create repeatable, verifiable processes for the build, testing, and deployment of both applications and underlying infrastructure. While there will always be constraints based on your timelines, cost, and resources, there should be a strong bias toward automation. Good candidates for automation are test execution, build scripting, dependency management, deployment scripting, and infrastructure configuration management.

Validation: The basic concept here is performing important steps like automated deployments and automated testing as often (and as early) as possible in development (e.g., on every code check-in). As Jez Humble and David Farley would say, you want to “bring the pain forward.” By implementing processes like testing and releases in this manner, you can address pain points before they have more serious consequences. Automated testing can, for example, be enabled by employing a continuous integration pipeline. Explicit user feedback can be facilitated with issue/bug tracking tools and a product owner who can advocate for the user. All of this ensures that the development and/or operations teams can react quickly and prevent expensive rework.

Instrumentation: As with all software development, providing visibility to all team members throughout the entire lifecycle is crucial. Such instrumentation will take the form of proper logging standards and practices, along with infrastructure and application monitoring. The point is to give your teams visibility into the end-to-end development process so they can continually improve your software and your development processes.
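To make these three concepts a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a recommended implementation: the record fields, the logger name, and the function names are all hypothetical. It shows one pipeline step that logs what it does (instrumentation) alongside a small test that a continuous integration server could run automatically on every check-in (automation and validation).

```python
import logging

# Hypothetical logger name; a real project would follow its own logging standards.
logger = logging.getLogger("pipeline.validate")

# Hypothetical schema: the fields every incoming record is assumed to need.
REQUIRED_FIELDS = ("user_id", "event_time")


def validate_records(records):
    """Split records into (valid, rejected) and log counts for monitoring."""
    valid, rejected = [], []
    for record in records:
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            logger.warning("rejected record: missing %s", ", ".join(missing))
            rejected.append(record)
        else:
            valid.append(record)
    logger.info("validated %d records, rejected %d", len(valid), len(rejected))
    return valid, rejected


def test_validate_records():
    """A check that a CI server could run on every code check-in."""
    good = {"user_id": 1, "event_time": "2015-06-01T00:00:00Z"}
    bad = {"user_id": 2}  # no event_time, so it should be rejected
    valid, rejected = validate_records([good, bad])
    assert valid == [good]
    assert rejected == [bad]
```

A test runner such as pytest would pick up `test_validate_records` by its name, so the same check runs identically on a developer laptop and in a build pipeline. The log lines give operators a basic signal to monitor once the step is deployed.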

In the next post, I’ll describe the unique characteristics of data systems that impact how you effectively test, build, and deploy. In the meantime, please share your experiences with data systems and the development tools and practices that have worked for you.

Bio: Fausto Inestroza is a Senior Data Engineer at Silicon Valley Data Science. He has extensive experience with data platforms, analytical processes and distributed systems.

Original. Reposted with permission.