KDnuggets Home » News » 2016 » Sep » Software » Behind the Dream of Data Work as it Could Be ( 16:n33 )

Behind the Dream of Data Work as it Could Be

This post is an insider's overview of data.world, and their attempt to build the most meaningful, collaborative, and abundant data resource in the world.

A fraction of the notes in our Product War Room.

I have internal data about the service areas of these programs. I’d like to see if there’s a correlation between proximity to program coverage areas and alcohol intake.

So I search Google for “alcohol intake city” and I find a dataset on a site called data.world.

I see that the dataset was updated a month ago, so it’s even fresher than my own OW aid data. A badge tells me that the user who uploaded it is associated with the University of Chicago, and I also see that this dataset has been starred 102 times and there are 4 people deep into a conversation about collection methodology.

Contrast this with your typical data portal experience, which might allow you to download raw data files, but little else. It’s on you to chase down the essential context–where the data comes from, how it was collected, what others have used it for, what that weird header means.

data.world is not just a data portal. One of our core goals is to store and convey critical context (metadata, social signals, discussions, provenance, etc.) with the data so you can quickly decide if the data you’re browsing is worth your time and your trust–before downloading and wading through its intricacies on your own.

If I were to go back and rewrite Alicia’s story today, I’d expand this section to play up the utility of working knowledge–the kind of context that comes directly from dataset owners sharing their notes, hunches, intended next steps, etc.


A list of recommended and related datasets catches my eye. One of them will give me the same information, but it’s also enhanced with the coordinates of the cities. That’s perfect, because my data uses coordinates for locations instead of city names.

data.world is built on the powerful foundation of linked data. Not only do we connect people who work with data, we connect the datasets themselves and the knowledge that surrounds them.

Every layer is “data aware,” and this allows datasets to easily join and interoperate with each other. In the future, this will enable the platform to instantly suggest enhancements and connections to your data as soon as it’s uploaded, as well as related and complementary datasets, even if they have nothing obvious in common.

For example, if you’re looking for data about my hometown of Youngstown, Ohio, a search for “Youngstown” on a typical data portal would only return datasets that contain the literal search term “Youngstown.” An advanced “data aware” platform, however, could return datasets that contain any ZIP code, coordinate, or landmark within Youngstown city limits, even if those datasets don’t contain any reference to Youngstown by name.
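To make the contrast concrete, here is a minimal Python sketch of the two kinds of search. It is an illustration only, not how data.world implements anything: the `Dataset` record, the `YOUNGSTOWN` place description, its ZIP codes, and its bounding box are all hypothetical stand-ins, and a real platform would use proper city boundaries rather than a rough rectangle.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset record: free text plus any place-like values it contains."""
    name: str
    text: str = ""
    zip_codes: set = field(default_factory=set)
    points: list = field(default_factory=list)  # (lat, lon) pairs


# Assumption: an illustrative, approximate description of the place.
# The ZIP codes and bounding box are placeholders, not authoritative boundaries.
YOUNGSTOWN = {
    "name": "Youngstown",
    "zip_codes": {"44502", "44503", "44504"},
    "bbox": (41.05, -80.72, 41.14, -80.57),  # (min_lat, min_lon, max_lat, max_lon)
}


def in_bbox(point, bbox):
    """Return True if a (lat, lon) point falls inside the bounding box."""
    lat, lon = point
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def literal_search(datasets, term):
    """Typical portal behavior: match only the literal search term."""
    term = term.lower()
    return [d.name for d in datasets if term in d.text.lower()]


def place_aware_search(datasets, place):
    """Match by name, by ZIP code, or by any coordinate inside the place."""
    hits = []
    for d in datasets:
        by_name = place["name"].lower() in d.text.lower()
        by_zip = bool(d.zip_codes & place["zip_codes"])
        by_point = any(in_bbox(p, place["bbox"]) for p in d.points)
        if by_name or by_zip or by_point:
            hits.append(d.name)
    return hits


# Made-up catalog for demonstration.
DATASETS = [
    Dataset("city-budget", text="Youngstown municipal budget"),
    Dataset("clinic-visits", text="Clinic visits by ZIP", zip_codes={"44503", "44113"}),
    Dataset("air-sensors", text="Sensor readings", points=[(41.10, -80.65)]),
    Dataset("other-parks", text="Parks in another city", points=[(41.50, -81.69)]),
]

print(literal_search(DATASETS, "Youngstown"))    # ['city-budget']
print(place_aware_search(DATASETS, YOUNGSTOWN))  # ['city-budget', 'clinic-visits', 'air-sensors']
```

The literal search finds one dataset; the place-aware search also surfaces the two that never mention the city by name.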

This is by far the most ambitious part of our vision, and one of the areas where we still have a lot of work to do. But by treating datasets as extensible objects from day one, we’re quickly delivering this type of functionality in phases.


OK, now I need to grab the joined dataset, so I hit the Download button. A message box pops up so I can select which format I need. It’s awesome that it’s not making me register to download the file, but the message box makes a pretty compelling pitch as to why I should sign up anyway. It says that registered users can merge their own data with open data in private datasets, run queries within the app, and–this is the most intriguing to me–connect to API endpoints.

It’s worth registering, just to see if this will speed up my work. The process is just a few fields, and then I’m presented with a question: “What do you want to do first?” I can choose from:

  • Upload a dataset
  • Find a dataset
  • Query a dataset
  • Combine datasets

I want to grab the API endpoint so I can call the data with my familiar tools, so I select “Query a dataset.”

Both datasets I’ve looked at are right there, even though I was looking at them before I registered. That’s a nice, convenient touch. I select the joined version with the coordinates, and it takes me to the Query tab. The tab is overlaid with some highlights and guidance about SPARQL querying. I’m really after that API endpoint, so I follow the overlays to grab it.

I leave the platform to pipe the data into RStudio via the API, combine the data with the OW program data, and find that in urban areas where OW programs operate, alcohol consumption is significantly higher. Now I’m well on my way to answering the question.
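Alicia does this step in R. As a rough illustration of the same idea in Python, the sketch below joins city records to program sites by coordinates and checks whether intake rises as distance to the nearest site falls. Every value here is synthetic and invented for the example, and the names `CITIES` and `PROGRAM_SITES` are hypothetical; nothing comes from a real dataset or from the data.world API.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic city records: coordinates plus a made-up alcohol intake figure.
CITIES = [
    {"city": "city_a", "coords": (41.10, -80.65), "intake": 9.1},
    {"city": "city_b", "coords": (41.50, -81.69), "intake": 8.4},
    {"city": "city_c", "coords": (39.96, -83.00), "intake": 6.0},
    {"city": "city_d", "coords": (39.10, -84.51), "intake": 5.2},
]

# Synthetic program locations standing in for the internal service-area data.
PROGRAM_SITES = [(41.10, -80.65), (41.45, -81.70)]

# Join step: attach each city's distance to its nearest program site.
distances = [
    min(haversine_km(c["coords"], site) for site in PROGRAM_SITES)
    for c in CITIES
]
intakes = [c["intake"] for c in CITIES]

# A negative r means intake is higher closer to program sites.
r = pearson(distances, intakes)
print(f"correlation between distance to nearest site and intake: {r:.2f}")
```

Joining on coordinates rather than city names is what makes the enhanced dataset useful here: the two sources never need to agree on how a city is spelled.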

There are as many techniques and toolchains as there are skillsets, backgrounds, and data domains. Data practitioners have no interest in replacing tools they know, trust, and regularly use. It’s important to us to integrate with and enhance the ways you solve problems with data. data.world cannot be a walled garden if our community is to thrive.


I need to find more data to help explain that effect. I log back into the platform and I’m greeted by the same tutorial options, but this time “Query a dataset” has a green check mark next to it. I select “Find a dataset,” and I get an overlaid search screen. Besides the search bar, it has a few suggestions based on what I’ve looked for previously, and I can browse by selecting tags like “Alcohol” and “Cities.”

I wonder...

What if these OW programs are being deployed in cities where unemployment is high, and unemployment is actually more predictive of alcoholism?

I search “unemployment city” and sort by “Updated,” since I need it to match the freshness of my data. I find a recent dataset that fits the bill. This time I want to try downloading it. I grab the file in the format I need, load it into RStudio, and what do you know? High unemployment is correlated with high alcohol intake, and the presence of an OW program is not correlated with alcohol intake in cities with low unemployment.
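The reasoning in this step is a check for a confounding variable: compare cities with and without a program inside each unemployment group, rather than across all cities at once. A minimal Python sketch of that comparison follows. The rows are synthetic, constructed purely to show the pattern, and `program_gap` is a hypothetical helper, not part of any library.

```python
from statistics import mean

# Synthetic rows: one per city, all values invented for illustration.
ROWS = [
    {"unemployment": "high", "program": True, "intake": 9.0},
    {"unemployment": "high", "program": True, "intake": 9.2},
    {"unemployment": "high", "program": True, "intake": 8.8},
    {"unemployment": "high", "program": False, "intake": 9.1},
    {"unemployment": "low", "program": True, "intake": 5.0},
    {"unemployment": "low", "program": False, "intake": 5.1},
    {"unemployment": "low", "program": False, "intake": 4.9},
    {"unemployment": "low", "program": False, "intake": 5.0},
]


def program_gap(rows):
    """Mean intake in cities with a program minus mean intake in cities without."""
    with_program = [r["intake"] for r in rows if r["program"]]
    without_program = [r["intake"] for r in rows if not r["program"]]
    return mean(with_program) - mean(without_program)


# Pooled comparison: programs look strongly associated with higher intake.
overall_gap = program_gap(ROWS)

# Stratified comparison: the same gap computed within each unemployment group.
gap_by_stratum = {
    level: program_gap([r for r in ROWS if r["unemployment"] == level])
    for level in ("high", "low")
}

print(f"overall gap: {overall_gap:.2f}")
for level, gap in gap_by_stratum.items():
    print(f"gap within {level}-unemployment cities: {gap:.2f}")
```

In this toy data the pooled gap is large only because programs are concentrated in high-unemployment cities; within each group the gap nearly vanishes, which is the pattern Alicia describes.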

I have a feeling I’m going to get a lot more out of data.world. In the next few days, I’ll get emails that prompt me to learn more about Uploading Datasets and Combining Datasets. I think I’ll upload some new OW data and see what people do with it. I’m excited to share my findings internally, and I’ll definitely tell the other data scientists about this time-saving platform.

I hope you enjoyed meeting Alicia. We’re having a blast shaping data.world for the community she represents, and we’re getting closer to the vision every day. If you want to solve problems faster with data, and if you’re as excited as we are about the dream of data work as it could be, join the data.world preview! You’ll be an early part of what will become the most meaningful, collaborative, and abundant data resource in the world.

Bio: Joe Boutros is Director of Product Engineering at data.world, where he spends his time crafting the perfect blend of technology, design, and user research to delight the data community.