Should Data Science Really Do That?

Data science’s amazing progress in prediction and analysis is raising important ethical questions: Should that data be collected? Should the collected data be used for that application? Should you be involved?

By Yanir Seroussi.

Over the past few years, Kaggle ex-president Jeremy Howard and Kaggle founder Anthony Goldbloom have been delivering a talk titled “Can data science really do that?”. The answer is a resounding yes for a growing number of applications that were traditionally thought to be beyond the ability of machines. Given this new reality, one of the key questions we need to ask ourselves is “Should data science really do that?”. In the words of DJ Patil, “just because you can it doesn’t mean you should”.


This article discusses three “should” questions. I don’t have clear answers to any of them, and would love to hear your opinion in the comments or privately. The questions are:

  • Should that data be collected?
  • Should the collected data be used for that application?
  • Should you be involved?

Should that data be collected?

As all data scientists know, any algorithm is only as good as the data it’s fed. When it comes to data collection, one of the key issues is balancing people’s right to privacy with the benefits arising from the use of personal data. Regulatory constraints often influence what data can and cannot be collected. However, with companies working across jurisdictions and advances in technology far outpacing the technical capabilities of lawmakers and enforcers, relying on regulators to protect things like individual privacy is unrealistic. Further, these same regulators are often the greediest data collectors of all.

Advances in tracking technology open up options that are both terrifying and exhilarating. For example, Larry Page claims that if Google were allowed to mine healthcare data, they’d “probably save 100,000 lives next year”. The cynical interpretation of this statement is that Google, which generates 90% of its revenue from advertising, simply wants access to more data to enable better ad targeting. This appetite for data collection is in no way limited to Google, and people are increasingly uncomfortable with the vast scope and magnitude of corporate surveillance. In response to this discomfort, many companies have introduced tools to opt out of personalised advertising. This is a step in the right direction, but I believe that people should be given the choice to opt in to having their data collected, rather than having to opt out of every single tracker.

Should the collected data be used for that application?

Imagine you work at a financial institution and you’re tasked with improving the performance of a credit scoring model. During your initial feature exploration you find that considering gender and race would yield a more accurate credit score when combined with other features. Should you use these features? What about other features that can be seen as proxies for gender and race, such as place of residence?

In highly regulated industries such as finance, the answer in many jurisdictions is that you can’t legally use these features: you may not discriminate based on gender or race. Other industries are less restrictive, which means it is often up to companies to self-regulate. This self-regulation may be motivated by the ethical concerns of individual stakeholders, or by fear of a public backlash if the company were found to systematically discriminate against people.
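One rough way to probe the proxy question raised above is to ask how much a feature helps in guessing the protected attribute. The sketch below is only an illustration under stated assumptions: the function name, the toy postcodes, and the group labels are all hypothetical, and a real audit would need far more care than this.

```python
from collections import Counter, defaultdict


def proxy_strength(feature_values, protected_values):
    """Compare two ways of guessing a protected attribute.

    Returns (baseline, with_feature):
      baseline     - accuracy of always guessing the most common group
      with_feature - accuracy of guessing the most common group
                     within each value of the feature
    A large gap suggests the feature may act as a proxy.
    """
    n = len(protected_values)
    baseline = Counter(protected_values).most_common(1)[0][1] / n

    # Count protected-attribute values separately for each feature value.
    by_value = defaultdict(Counter)
    for feature, protected in zip(feature_values, protected_values):
        by_value[feature][protected] += 1

    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / n


# Hypothetical toy data: place of residence and a protected group label.
postcodes = ["A", "A", "A", "B", "B", "B", "C", "C"]
groups = ["x", "x", "x", "y", "y", "x", "y", "y"]

baseline, with_feature = proxy_strength(postcodes, groups)
print(f"baseline accuracy: {baseline:.3f}")
print(f"accuracy using postcode: {with_feature:.3f}")
```

In this toy data, knowing the postcode makes the group much easier to guess, which is the kind of signal that would call for a closer look before using the feature. Whether a given gap is acceptable is a judgment call, not something the code can decide.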

The line between unjust discrimination and legitimate data-driven modelling becomes blurrier with more complex data. For example, healthcare providers and insurers may have access to DNA samples of their clients. Models built on genetic data may be used to set insurance premiums, and to grant or prohibit access to certain services. It’s not hard to imagine scenarios where similar models are used to determine suitability for certain jobs, effectively setting the course of people’s lives. This may not be a desirable outcome, as all models are likely to have some systematic biases due to data incompleteness and the curse of dimensionality.

Discovering systematic biases is very tricky when thousands to millions of features are used to train nonlinear models. It could be that the data scientist has nothing but the best intentions in mind, but their model ends up discriminating against certain groups of the population. How to best handle such situations without unnecessarily inhibiting progress is an open question.
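A simple first check, though by no means a complete answer, is to compare a model’s decisions across groups after the fact. The sketch below assumes binary approve/reject decisions and a known group label for each person. The function names and the toy data are hypothetical, and a low ratio is a prompt for investigation rather than proof of unjust discrimination.

```python
from collections import defaultdict


def approval_rates(decisions, groups):
    """Return the fraction of positive decisions for each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        approved[group] += int(decision)
    return {group: approved[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Ratio of the lowest group approval rate to the highest.

    A value of 1.0 means all groups are approved at the same rate.
    """
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved, so the rates are identical
    return min(rates.values()) / highest


# Hypothetical toy data: 1 = approved, 0 = rejected.
decisions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["x", "x", "x", "x", "y", "y", "y", "y"]

rates = approval_rates(decisions, groups)
ratio = rate_ratio(rates)
print(rates)
print(f"lowest-to-highest approval ratio: {ratio:.2f}")
```

A check like this only looks at outcomes for groups you already know about. It says nothing about why the rates differ, and it cannot find biases against groups that are not labelled in the data.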

Should you be involved?

Would you help a nonprofit use machine learning to increase the donation yield of its campaigns? What if this nonprofit is an evangelical church? What if it’s a pro-choice organisation?

Data scientists are in high demand. This means that we can often choose to work only on projects that are aligned with our personal values, or at least avoid projects that clash with our beliefs. Companies know this, which is why many of them have mission statements that show how they are making the world a better place. It is up to you to choose, so make sure you choose well.

Kaggle provides us with another example of this kind of dilemma. Around the end of 2013, the company announced its plans to shift focus to providing data science services to the oil and gas industry (they changed direction again recently). This was seen as a disappointing move by those of us who like Kaggle’s platform and believe the scientific consensus that most known fossil fuel reserves should be kept in the ground to avoid catastrophic climate change. Kaggle’s pivot led to the departure of Jeremy Howard from the company, due to his desire “to use machine learning to change the world - ideally, for the better”. The use of the word “ideally” is interesting: would you really want to use machine learning to change the world for the worse?

Bio: Yanir Seroussi is a data scientist from Sydney, Australia. He has a PhD from Monash University and a BSc from Technion, and has worked as a software engineer and data scientist with companies in various industries. Yanir offers data science consulting services, while working on his own side projects. He’s always happy to discuss exciting ethical applications for his data science superpowers.