Five Principles for Applying Data Science for Social Good

Well-meaning data scientists often fail to reach their full potential when working for social good. The following five principles can help improve this situation.

By Jake Porway, DataKind.

Jake Porway also talked about these ideas in his Strata + Hadoop World NYC 2015 keynote address, "What does it take to apply data science for social good?"
"We're making the world a better place." That line echoes from the parody of the Disrupt conference in the opening episode of HBO's "Silicon Valley." It's a satirical take on our sector's occasional tendency to equate narrow tech solutions like "software-designed data centers for cloud computing" with historical improvements to the human condition.

Whether you take it as parody or not, there is a very real swell in organizations hoping to use "data for good." Every week, a data or technology company declares that it wants to "do good," and there are countless workshops hosted by major foundations musing on what "big data can do for society." Add to that a growing number of data-for-good programs, from Data Science for Social Good's fantastic summer program to Bayes Impact's data science fellowships to DrivenData's data-science-for-good competitions, and you can see how quickly this idea of "data for good" is growing.

Yes, it's an exciting time to be exploring the ways new datasets, new techniques, and new scientists could be deployed to "make the world a better place." We've already seen deep learning applied to ocean health, satellite imagery used to estimate poverty levels, and cellphone data used to elucidate Nairobi's hidden public transportation routes. And yet, for all this excitement about the potential of this "data for good movement," we are still desperately far from creating lasting impact. Many efforts will not only fall short of lasting impact - they will make no change at all.

At DataKind, we've spent the last three years teaming data scientists with social change organizations, bringing the same algorithms that companies use to boost profits to mission-driven organizations so they can boost their impact. It has become clear that using data science in the service of humanity requires much more than free software, free labor, and good intentions.

So how can these well-intentioned efforts reach their full potential for real impact? Embracing the following five principles can drastically accelerate a world in which we truly use data to serve humanity.

1. "Statistics" is so much more than "percentages"

We must convey what constitutes data, what it can be used for, and why it's valuable.

There was a packed house for the March 2015 release of the No Ceilings Full Participation Report. Hillary Clinton, Melinda Gates, and Chelsea Clinton stood on stage and lauded the report, the culmination of a year-long effort to aggregate and analyze new and existing global data, as the biggest, most comprehensive data collection effort about women and gender ever attempted. One of the most trumpeted parts of the effort was the release of the data in an open and easily accessible way.

I ran home and excitedly pulled up the data from the No Ceilings GitHub, giddy to use it for our DataKind projects. As I downloaded each file, my heart sank. The 6MB size of the entire global dataset told me what I would find inside before I even opened the first file. Like a familiar ache, the first row of the spreadsheet said it all: "USA, 2009, 84.4%."

What I'd encountered was a common situation when it comes to data in the social sector: the prevalence of inert, aggregate data. Huge tomes of indicators, averages, and percentages fill the landscape of international development data. These datasets are sometimes cutely referred to as "massive passive" data, because they are large, backward-looking, exceedingly coarse, and nearly impossible to make decisions from, much less actually perform any real statistical analysis upon.

The promise of a data-driven society lies in the sudden availability of more real-time, granular data, accessible as a resource for looking forward, not just a fossil record to look back upon. Mobile phone data, satellite data, even simple social media data or digitized documents can yield mountains of rich, insightful data from which we can build statistical models, create smarter systems, and adjust course to provide the most successful social interventions.
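The gap between an aggregate indicator and granular records can be made concrete with a small sketch in Python. Everything in it is an assumption for illustration: the records, the field layout (region, year, outcome flag), and the `rate` helper are invented, and none of it comes from the No Ceilings data. It shows only that record-level data can be regrouped along any field that was kept, while a single published percentage cannot be taken apart again.

```python
from collections import defaultdict

# Hypothetical record-level data: (region, year, outcome flag).
# These rows are invented purely for illustration.
records = [
    ("north", 2009, True),
    ("north", 2009, True),
    ("north", 2009, False),
    ("south", 2009, True),
    ("south", 2009, False),
    ("south", 2009, False),
]


def rate(rows):
    """Return the percentage of rows whose outcome flag is True."""
    return 100.0 * sum(1 for _, _, flag in rows if flag) / len(rows)


# Aggregate view: one number, which is all a summary row provides.
overall = rate(records)

# Granular view: the same records regrouped by a field we kept.
by_region = defaultdict(list)
for row in records:
    by_region[row[0]].append(row)
regional = {region: rate(rows) for region, rows in by_region.items()}

print(f"overall: {overall:.1f}%")
for region, pct in sorted(regional.items()):
    print(f"{region}: {pct:.1f}%")
```

With only the overall figure in hand, the difference between the two regions is invisible; with the records, it is one grouping step away, and so is any other breakdown the fields support.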

To effect social change, we must spread the idea beyond technologists that data is more than "spreadsheets" or "indicators." We must consider any digital information, of any kind, as a potential data source that could yield new information.

2. Finding problems can be harder than finding solutions

We must scale the process of problem discovery through deeper collaboration between the problem holders, the data holders, and the skills holders.

In the immortal words of Henry Ford, "If I'd asked people what they wanted, they would have said a faster horse." Right now, the field of data science is in a similar position. Framing data solutions for organizations that don't realize how much is now possible can be a frustrating search for faster horses. If data cleaning is 80% of the hard work in data science, then problem discovery makes up nearly the remaining 20% when doing data science for good.

The plague here is one of education. Without a clear understanding that it is even possible to predict something from data, how can we expect someone to be able to articulate that need? Moreover, knowing what to optimize for is a crucial first step before even addressing how prediction could help you optimize it. This means that the organizations that can most easily take advantage of the data science fellowship programs and project-based work are those that are already fairly data savvy - they already understand what is possible, but may not have the skill set or resources to do the work on their own. As Nancy Lublin, founder of the very data-savvy Crisis Text Line, put it so well at Data on Purpose: "data science is not overhead."

But there are many organizations doing tremendous work that still think of data science as overhead or don't think of it at all, yet their expertise is critical to moving the entire field forward. As data scientists, we need to find ways of illustrating the power and potential of data science to address social sector issues, so that organizations and their funders see this powerful, untapped resource for what it is. Similarly, social actors need to find ways to expose themselves to this new technology so that they can become familiar with it.

We also need to create more opportunities for good old-fashioned conversation between issue-area experts and data experts. It's in the very human process of rubbing elbows and getting to know each other that our individual expertise and skills can collide, uncovering the data challenges with the potential to create real impact in the world.