KDnuggets Home » News » 2016 » Sep » Software » Behind the Dream of Data Work as it Could Be ( 16:n33 )

Behind the Dream of Data Work as it Could Be

This post is an insider's overview of data.world, and their attempt to build the most meaningful, collaborative, and abundant data resource in the world.

By Joe Boutros, data.world.

Before Wired co-founder John Battelle wrote that our company “lays the traceroutes for a data revolution,” before we were planning conference panels with NASA, before we had our first user who wasn’t already a friend, there was Alicia. Alicia will never know how important she is, because she never actually existed. That minor detail won’t keep me from introducing her to you. But first, some context.

Data dream

My name is Joe Boutros, and I do exist. I’m the Director of Product Engineering at data.world. We are building the most meaningful, collaborative, and abundant data resource in the world. Our platform helps data practitioners work together to solve problems faster by creating new ways to discover, prepare, collaborate, and share.

We know that no data platform will succeed in trying to be everything to everyone. On the other end of the spectrum, there is limited value in trying to solve ultra-niche problems. With that in mind, the central ambition of our product development research is to identify the most common and needlessly frustrating, time-consuming, and difficult stages of data work.

Over the course of this ongoing research, we’ve spoken at length to hundreds of data practitioners from academia, the public sector, non-profits, and the private sector. If you’re reading this, you’re probably a data practitioner in one of those groups.

We learned that the majority of your time is spent on things you would rather not be doing: finding, cleaning, prepping, and joining data, the “first mile” of data work. By the time you’re ready to start the analysis, you’re overworked and frustrated, thinking, “Why can’t this be easier?”

Time and time again, the ad hoc and exhausting nature of this “first mile” of data work came up in conversation. If you could leverage the preparation others have done on the same data, and if you could easily access their knowledge about the data, you would be able to get to the interesting stuff more quickly. You could solve problems faster.

As Eric Schwartz, a University of Michigan professor who works on real-world problems like Flint’s water crisis, told us:

“The first mile problem you describe is actually also the last mile for a lot of people too, because that is where they quit.”

How many critical efforts have been abandoned because the first mile was too long or difficult?

Let's imagine data work as it could be.

The story you’re about to read never happened. Alicia is fictional: she is a composite, not any one person we know. During those hundreds of hours of user research, my colleague Ian Greenleigh and I sat down and wrote a series of stories about the community members we are lucky enough to serve, and how our nascent platform could help them solve problems faster.

If you work with data, you know Alicia, or parts of her story. Some of her aspirations, frustrations, the mundane and the exciting parts of her job, may resonate with you. Good! That means you are part of the community we have dedicated ourselves to. Developing Alicia’s story helped us channel and empathize with our users before we had them. And now that we do, Alicia’s story helps us reflect on where we were, where we are now, and where we’re going next.

We’ve already deployed many of the features Alicia encounters along her journey, but not all of them. Some functionality may take shape in other ways, and some features in this story may even be shelved as our understanding of your needs deepens. This story is a piece of a vision, not a product roadmap.

And now, Alicia’s story with my “behind the scenes” commentary.

I work for Outcomes Worldwide (OW) as a data scientist. Basically, people who are designing aid programs come to me with questions and I answer them with data. Or, they come to me with an existing program that isn’t achieving its goals, and I create models and test their theories to try to figure out what’s not working. I wish I could do more for these programs, but in this organization, there are far more people with problems than there are people with data science skills to help solve them.

“Data scientist” and “data practitioner” are broad labels. We interviewed folks with backgrounds that range from statistics to engineering to finance. They have varied skillsets and depth of experience. Some are just getting started in their field and some remember a time before data science became “Data Science.”

But here’s what many of them had in common:

Their skills are in increasing demand, but in short supply within their organizations. They are dedicated and overworked. They want to be more productive with their limited time so they can solve more problems and help more people. By far the most common theme: they spend too much of their valuable time finding, cleaning, and normalizing data, “the ETL dance.”

Let’s see how Alicia can breeze through parts of her work that previously resulted in frustrating roadblocks.


Here’s a hypothesis I was asked to test recently: OW urban aid programs contribute to alcoholism. Someone in our press office heard through his backchannels that a journalist has reached this conclusion about another global aid org that we’re frequently compared to, and he wants to make sure we’re prepared to answer any related inquiries once the article hits.

If my work helps us avoid a media storm, that’s a good outcome. But if I share my findings with the senior policy wonks I know, my work could help us redesign our programs to maximize positive impact while minimizing any unintended negative effects. That’s an excellent outcome.

People want to apply their data science skills to what they consider meaningful. They want their work to be shared widely, and not just within their organizational silos.