How (& Why) Data Scientists and Data Engineers Should Share a Platform
Sharing one platform has some obvious benefits for Data Science and Data Engineering teams, but technical, language and process differences often make this difficult. Learn how one company implemented a single cloud platform for R, Python and other workloads – and some of the unexpected benefits they discovered along the way.
By Lovan Chetty, Cazena.
Attending analytic conferences really exposes the range and sophistication of analytic techniques that people with the right skills can apply to data. An example of this is the EARL Boston conference, which focuses on how best to use the R programming language to produce analytic outcomes. At the most recent conference, I led a session based on many conversations my industry colleagues and I have had with companies trying to accelerate their analytics programs, and on a specific type of problem that arises over and over again: how two distinct groups – Data Engineers and Data Scientists – can work more collaboratively despite the stark differences between their skills.
- The Data Engineering teams typically understand the most efficient ways of curating data, storing data and operationalizing processes.
- The Data Science teams have a good grasp of math, statistics and how to derive insight from the data.
These teams often work independently. The Data Engineering team typically works on a shared platform, with a toolset and associated processes that optimize their flows, while the Data Science teams tend to have their own, separate set of tools and processes and generally work locally on their laptops. This creates inefficiencies, which I heard about firsthand from the Data Scientists at the EARL conference.
Many of them described extracting data from central systems with varying degrees of pain and compliance and then spending time refactoring that data to fit the analysis they wanted to do – and then (only then) starting the analysis process. This works, but it is not the most efficient overall flow: multiple users may extract the same data and waste time repeating the same refactoring or transformations, which leads to costly duplication of effort. For these reasons, it’s not surprising that more organizations are trying to have both teams work on a single platform.
The Challenges and Benefits of a Single Platform for Data Engineering & Data Science
While the concept of a single platform is a familiar topic in data strategy discussions, the flexibility of the cloud now makes it possible, though not necessarily easy. Ideally, everyone should be able to use their own tools and a variety of languages and be supported by a common underlying data and compute platform. Some of the reasons that this is challenging are related to delivering secure access to data across a variety of teams and locations, as well as having a common governance model across the disparate set of tools and processes.
Consider this real-world example from a relatively advanced Data Science team that I work with at a large corporation. The Data Engineering team predominantly uses Python for their data wrangling processes, while the Data Science team predominantly prefers R. Using our software, they deployed a single cloud platform, with a centralized HDFS data store, fast Spark processing engines and support for both Python and R, as well as a variety of other languages (SQL, Java, Scala, etc.). Theoretically, anyone can do this by stitching together a broad range of open source projects on cloud infrastructure, but it’s complex to make it work efficiently and to manage the different components. Our company has built software automation, security and an integrated experience that makes this easier in secure enterprise settings – but for this article, we’ll skip the pitch and focus more conceptually on why companies are moving to this model.
With the new platform, the R users read data directly from the central data store (HDFS), which means that they no longer spend time sub-setting or sampling datasets and copying data to their local machines. In addition, the platform gives the Data Science team more choices for their analytics. It includes a Spark cluster, which the Data Science team can use through R packages like sparklyr. The new functions are particularly helpful for producing models that consume larger amounts of data or for applying models to full datasets for downstream consumption. The Data Science team was very happy with the benefits of the new platform.
The Data Engineering team also saw significant yet different benefits. The platform change introduced a Spark cluster, which allowed them to enhance some of their Python processes for data management. An example of this was a data curation workflow, in which log files from an operational application are processed. Because each section of the log is unrelated to the next, the process is inherently parallelizable, so with Spark the team no longer had to manage it as a sequential workflow.
The new platform also included some PySpark libraries, which allowed the Data Engineers to refactor the workflow into a process that was significantly more efficient. This dramatically improved data ingestion, which meant the Data Science team could run more time-sensitive, near-real-time analyses. This was not something they were able to do in the past, as the data was typically a few hours old. A variety of new features like this enabled changes to operational processes that added up to significant time savings and efficiency.
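To make the "inherently parallelizable" point concrete, here is a minimal sketch in plain Python. It is not the team's actual workflow: the log format, the sample lines and the function names are all hypothetical, and a standard-library thread pool stands in for the Spark cluster. What it illustrates is the property that matters, namely that each line is parsed without reference to any other line, so the sequential and parallel versions produce the same result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log format, assumed for illustration only:
# "<timestamp> <level> <key>=<value> <key>=<value> ..."
SAMPLE_LOG = [
    "2018-01-05T10:00:00 INFO user=1 action=login",
    "2018-01-05T10:00:02 ERROR user=2 action=checkout",
    "2018-01-05T10:00:03 INFO user=1 action=logout",
    "2018-01-05T10:00:07 WARN user=3 action=search",
]


def parse_line(line):
    """Parse one log line into a dict. Uses no state from other lines."""
    timestamp, level, message = line.split(" ", 2)
    fields = dict(pair.split("=", 1) for pair in message.split())
    return {"timestamp": timestamp, "level": level, **fields}


def parse_sequential(lines):
    """Baseline: process the log one line after another."""
    return [parse_line(line) for line in lines]


def parse_parallel(lines, workers=4):
    """Same work spread over a pool; a stand-in for a Spark map step.

    Executor.map returns results in input order, so the output matches
    the sequential version exactly.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_line, lines))


if __name__ == "__main__":
    for record in parse_parallel(SAMPLE_LOG):
        print(record)
```

The thread pool here only shows the structure of the computation; it is not meant as a performance benchmark. On a Spark cluster, a per-line function like `parse_line` could be applied with something along the lines of `spark.sparkContext.textFile(path).map(parse_line)`, where `path` would point at the log files in HDFS (an assumption about how such a job might be written, not a description of the team's code).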
The longest-lasting effect, however, was that the Data Engineering team now had more visibility into the datasets that the Data Science team was producing. The Data Engineers realized that some of the ways they prepared and manipulated data were not optimal for the Data Science team's processes. This kicked off a more collaborative effort between the two groups – and now the datasets curated and managed by the Data Engineering team are better structured for the Data Science team.
That means the company can produce analytic results more quickly, thanks to both better processes and newer tools, and can now also tackle a much wider range of analytic problems. It’s a single platform success story.