Data Preparation Tips, Tricks, and Tools: An Interview with the Insiders

Data preparation and preprocessing tasks account for a large share of the work in any data-centric operation. To provide some insight, we have asked a few experts to answer questions on the subject.

Matthew Mayo: Claudia Perlich recently stated the following regarding data preparation in a Quora answer:

If you are not good at data preparation, you are NOT a good data scientist. [...] The validity of any analysis is resting almost completely on the preparation.

To what extent would you agree or disagree with this statement, and why?

Clare Bernard: 100% agree. Data scientists have both the opportunity and the challenge of taming the extreme heterogeneity and volume of information sources that are the natural result of enterprise growth (M&A), business process automation (idiosyncratic systems) and the proliferation of external data. The opportunity of variety comes from integrating data from many different sources to create accurate and insightful analyses.

But data scientists know even the best analytics methods and tools are only as good as the data that fuels them and their ability to execute them repeatably and at scale. Unifying, standardizing, and ensuring the quality of data across silos is the fundamental challenge for data scientists who need to enable decision makers to ask incisive questions -- and to trust that the answers are inclusive, accurate, and up-to-date enough to inform their decisions.

To date, manual, homegrown and self-service preparation approaches have proven incapable of sustaining enterprise-scale curation -- leaving data scientists without a means of ensuring that their analytics were being fed continuous, clean, unified data from relevant sources.

For data scientists, Tamr is a sustainable, scalable solution to this problem -- capable of automating the collection, organization and preparation of enterprise-wide data (supplier, customer, product, financial, etc.) for the full range of spend, cost service and revenue analytics.

Tamr’s machine-driven, expert-guided approach radically reduces the cost, time and effort of preparing data for analysis - enabling CDOs to fuel previously unattainable enterprise-wide spend, cost and revenue analytics.

Joe Boutros: I’m relatively new to data science, but from my hands-on experience and observation I both agree with Claudia and applaud the emphatic nature of her statement. I never cease to be amazed at the number and diversity of open source tools that can be used to implement analytical methods. Using these tools means the hard parts of the job become:

  • properly preparing the data for analysis
  • choosing analytical tools and techniques appropriate for the problem
  • interpreting the outcome

I realize this may be a controversial stance, but the same trend has emerged in the world of software development. Much of a modern software developer’s time is spent stitching together software packages and relying on a skillset biased towards architecture to determine which frameworks, structures, patterns, and algorithms are appropriate for the job at hand. It’s the rare software developer who implements these algorithms or frameworks by hand outside of open source development or whiteboard interviews.

An analogy I like to use is cooking. One big difference between an amateur cook and a professional chef is the selection and preparation of ingredients -- the mise en place. Anthony Bourdain famously said, "Mise-en-place is the religion of all good line cooks." Perhaps this attitude could serve the modern data scientist as well!

Sebastian Raschka: Claudia Perlich's statement really, really resonates with me! Of course, we do not always have to worry so much about data acquisition. Sometimes, relevant datasets may already be available to us -- for example, from previous projects, colleagues, or various platforms. However, as data scientists, *we* have to decide how to deal with missing values and how to prepare or extract the meaningful information we want to work with to solve the problem at hand. In other words, we need to make the call and decide "how" to work with this data.
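The decision Sebastian describes -- "how" to deal with missing values -- can be sketched with pandas (the library he mentions favoring later in the interview). The dataset and column names here are hypothetical, and these are just two of many possible strategies:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing measurements.
df = pd.DataFrame({"height": [1.72, np.nan, 1.65],
                   "weight": [68.0, 74.5, np.nan]})

# Option 1: drop rows with any missing value -- simple, but discards data.
dropped = df.dropna()

# Option 2: impute with a column statistic -- keeps all rows,
# but bakes an assumption into the analysis.
imputed = df.fillna(df.mean())
```

Neither choice is automatically right; which one is appropriate depends on why the values are missing and what the downstream analysis can tolerate.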


Matthew Mayo: As specifically as is possible to state, what is the ecosystem or toolchain in which you spend most of your time? What is one example of a data preparation tip, trick, or tool that you favor related to said environment?

Joe Boutros: As an engineer, I’ve spent a lot of time building applications in the Python ecosystem. Naturally, when I became interested in data science, the Python-based tools were my first stop. I’ve become quite fond of the power of Pandas and the ease it brings to learning about, preparing, and manipulating data. One helpful trick for those new to Pandas but with SQL experience: you can load a DataFrame directly from the results of a SQL query using pandas.read_sql. Joins, aggregations, and the like can be done in SQL and then moved bit by bit into Pandas as a learning exercise.
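A minimal sketch of that migration path, using an in-memory SQLite database and a made-up orders table as a stand-in for a real source:

```python
import sqlite3

import pandas as pd

# Hypothetical example data: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('alice', 15.0), ('bob', 7.5);
""")

# Load the results of a SQL query directly into a DataFrame --
# the aggregation lives in SQL for now ...
sql_totals = pd.read_sql("SELECT customer, SUM(amount) AS total "
                         "FROM orders GROUP BY customer", conn)

# ... and here the same aggregation, moved into Pandas
# as the learning exercise Joe describes.
raw = pd.read_sql("SELECT * FROM orders", conn)
pandas_totals = raw.groupby("customer", as_index=False)["amount"].sum()
```

Both `sql_totals` and `pandas_totals` contain the same per-customer sums, so each SQL clause can be ported to its Pandas equivalent and checked against the original query.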

Sebastian Raschka: I am mostly working within a Python environment and the scipy-stack these days. For the data preparation stage, I mostly rely on pandas' DataFrames, the Blaze ecosystem, SQLite via sqlite3, and HDF5 via h5py.
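For the HDF5 part of that toolchain, a minimal h5py sketch looks like the following; the file name and dataset name are made up for illustration:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical prepared data: a small numeric array.
data = np.arange(12, dtype=np.float64).reshape(3, 4)

path = os.path.join(tempfile.mkdtemp(), "prep_demo.h5")

# Write the array to an HDF5 dataset; gzip compression keeps
# large preprocessed arrays compact on disk.
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=data, compression="gzip")

# Read it back as a NumPy array in a later session.
with h5py.File(path, "r") as f:
    loaded = f["measurements"][:]
```

HDF5 is handy at the preparation stage precisely because intermediate arrays like this can be written once and reloaded cheaply across analysis sessions.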

On a side note, I recently dug up an older, academic project of mine for a professorial assistant who's going to continue the work -- I abandoned it around 2012 because I was busy with other projects. Funny enough, I wrote all the data preparation pipelines and scripts in Python, interleaved with some R scripts for analysis and plotting. Why am I mentioning this? Related to sharing a "preparation tip, trick, or tool," I have to admit that I had a pretty messy setup back then. When I was looking at it the other week, it really took some effort to understand what exactly I did there in 2013. These days, I am a big fan of having everything organized in Jupyter Notebooks when possible.

These notebooks help me to keep track of my interactive analyses and allow me to augment the data analysis with plots, equations, thoughts, and comments. These notebooks don't need to be perfectly polished, but having everything in one sequential outline often helps me to keep track of what exactly has been done and when -- and it makes writing reports so much easier! However, I also want to highlight that I use Jupyter Notebook just as notebooks, and they are very useful during the data preparation and analysis stages; I still write scripts and packages in "plug-in rich" text editors/IDEs.

Clare Bernard: Well, we spend pretty much all of our time on the Tamr ecosystem, of course.

In terms of data preparation tips, we’d point toward an aspect of data preparation that is often overlooked in our tech-oriented ecosystem: organizational and cultural alignment.

Organizationally, CDOs and their teams need to use data as a means of surfacing and delivering highly valuable institutional knowledge to drive decision making. But vast amounts of domain expertise are trapped in the heads of people in the organization who are not technical -- who can’t write SQL. In short, CDOs need processes and tools built for liberating institutional knowledge across the organization and delivering it through high-quality data to key decision makers.

The CDO’s fundamental challenge here is cultivating enterprise-wide trust in the data and analytics fueling key business decisions. This means avoiding two curses of analytically-driven decision making: 1) data that is poor quality (“garbage in, garbage out”) or incomplete for the questions being asked of it; and 2) “tech for tech sake,” which is behind many data engineering initiatives.

Existing data engineering approaches such as data warehousing and master data management have been disasters of cost and complexity. Relying on the same vendors who got CEOs into this mess will result in another round of costly multi-decade projects which, in the end, will be proprietary and inflexible. Fortunately, emerging innovations such as expert sourcing, machine learning and the cloud provide new opportunities to change the game both technically and organizationally.

With its enterprise-scale data preparation solution for producing consistently clean, complete, unified data from across the organization, Tamr has harnessed these forces in a vendor-independent platform that maximizes the contributions of both machines and human experts.

Matthew Mayo: If you were pressed for a single statement of data preparation advice for newcomers to data science, or related fields, what would it be?

Clare Bernard: Look at your data. Actually look at it. Reading data might seem about as interesting as reading the dictionary, but if you take the time to do it you’ll save so much pain later on. Before you do any machine learning work, look at the data and ask yourself, could a human understand how this goes together?

Joe Boutros: Data preparation is not a one-time thing -- it’s continuous throughout the process. If you’re trying out competing techniques, each will come with its own unique preparation needs -- even for different iterations of the same model.

Sebastian Raschka: Short and sweet: Be curious! As much as data preparation is an art in itself, I also think that it is hard to teach -- that doesn't mean it is hard to learn, though! Every project is unusual and requires distinct steps, maybe even completely different tools. Since, as we established, typical data-related projects require you to spend the bulk of your time on preparation, you'll naturally get better at it over time through practice. Be curious, look at your data, work with your data, and talk about your work to get useful tips and feedback. Through experimentation, you may find that it's worth swapping out your favorite tools on certain projects and looking at your data from a different angle, and it's often helpful to ask other people for their opinions on different approaches.

I would like to thank the above contributors for taking time out of their busy schedules to provide us with some helpful insight.