Data Science Challenges

This post collects thoughts for a talk given at the UN Global Pulse lab in Kampala, and covers some of the challenges in data science.

Quantifying the Value of Data

Most of the data we generate is not directly usable. To make it usable we first need to identify it, then curate it and finally make it accessible. Historically this work was done because data was actively collected, and collection was such a burden in itself that curation added relatively little overhead. Now we find that there’s a sea of data, but most of it is undrinkable. We require data-desalination before it can be consumed.


  • How do we measure value in the data economy?
  • How do we encourage data workers in curation and management?
    • Incentivization
    • Quantifying the value in their contribution.

This relates also to the questions of measurement above. Direct work on data generates an enormous amount of ‘value’ in the data economy. Yet the credit for this work is not properly allocated. There are many organizations where the curation of data is treated almost as a dirty afterthought. You might be shown simulations of cities of questionable value (in the real world), but highlighting work at the data-face is rare. Until this work is properly quantified the true worth of an organisation will not be understood. This may be because such work is difficult to ‘embody’.

By embodiment here we mean the delivery of something tangible to (perhaps) a decision maker who has an interest in the data. Data is diffusive and pervasive by its nature. This means that, for non-experts, its potential is sometimes difficult to realize. Similarly, for data experts who are not domain experts, there are challenges in understanding what aspect of the domain requires implementation. [1]

Another important action is to encourage greater interaction between application domains and data scientists. This means embedding data scientists within application teams and better educating domain experts in the possibilities and limitations of the data. This is particularly important in data science education: when projects are proposed they should be undertaken through close interaction with the application domain, for example through project placements.

Visualization of data seems very important as an intermediary between the data scientist and the application domain. A visualization embodies the data set and generates questions around it. Getting into the habit of visualizing data also forces the data generators to perform some basic quality control. Critical analysis of visualized data should be widely taught. It doesn’t require such strong technical understanding as full analysis, but it acts as an important quality control on the data set close to where it is collected.

A final possibility would be the adoption of ‘data readiness levels’ for describing the nature of data collected and its potential usability. Data readiness levels would mirror ‘technology readiness levels’, which assess the deployability of technology in application. Technology is a similarly diffuse idea to ‘data’. Readiness levels ensure that reports and discussions have some way of accounting for deployability, even for non-experts.

Better quantifying the value of data also has important implications for incentivization markets, which encourage users to provide data in exchange for incentives. The extent to which this economy will become separate from the standard economy of monetary exchange for services is also unclear. And as it does separate, it compounds the problem of measurement highlighted above.

Privacy, Loss of Control and Marginalization

While society is perhaps becoming harder to monitor, the individual is becoming easier. Our behavior can now be tracked to a far greater extent than ever before in our history. What would have been considered surveillance in our past is now standard practice.

We are to a great extent compounding this problem. For example, social media monitoring for ‘hate speech’ as an early warning system of potential inter-tribal tensions could easily evolve into monitoring social media for ‘political dissent’.

Even less nefarious purposes, such as marketing, become more sinister when the target of the marketing is well understood and the (digital) environment of the target is also so well controlled.

As computers collect more data about us they will characterize us better and, given a particular scenario, they are likely to be able to predict our own actions better than we can predict them ourselves. If a system external to ourselves can predict our actions better than us, does this have implications for our free will?

Marginalization and Discrimination

This also has the potential for powerful discrimination against the disadvantaged. When automated decision making is taking place, there is the possibility of significant discrimination on the basis of race, religion, sexuality or health status. This is all prohibited under European law, but it can pass unawares because it can be implicit in our processes.

Applications such as credit scoring, insurance and medical treatment will all suffer if particular sections of society are under-represented in the data sets collected for those applications. Predictions made for those sections of society would be less accurate. This has particular consequences for developing economies if these applications are developed mainly in developed economies (or even more specifically in Silicon Valley).
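The link between under-representation and accuracy can be made concrete with a small sketch. It assumes the simplest possible case, estimating a group's average from independent samples, where the standard error of the estimate is sigma / sqrt(n). The group sizes (10,000 and 100) and the unit spread are hypothetical numbers chosen for illustration, not figures from any real data set.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean from n independent observations.

    sigma is the standard deviation of a single observation.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return sigma / math.sqrt(n)


# Hypothetical group sizes, for illustration only.
well_represented = standard_error(sigma=1.0, n=10_000)
under_represented = standard_error(sigma=1.0, n=100)

# How much larger the uncertainty is for the smaller group.
print(under_represented / well_represented)
```

With one hundred times fewer observations, the uncertainty in the estimate for the smaller group is ten times larger. Real predictive models are more complicated than a sample mean, but the same effect is likely to be present.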

To ameliorate the downsides of these outcomes we should be working to ensure that the individual retains control of their own data. This is the concept of privacy. We accept, in our real lives, the principle that we should be able to express ourselves differently according to the nature of our social relationship. I share more with my doctor than my students. This control of self needs to be replicated in the digital world. Technologies like differential privacy are the key here.
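To give a flavour of what differential privacy involves, here is a minimal sketch of one standard approach, the Laplace mechanism, applied to a counting query. It is an illustration under stated assumptions rather than a production implementation: the function names, the toy records and the choice of epsilon are all hypothetical, and a real deployment would also need to track the privacy budget across queries.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy `predicate`.

    Adding or removing one individual changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Toy example with made-up records: how many people have a condition?
rng = random.Random(0)
patients = [{"has_condition": i % 4 == 0} for i in range(100)]
noisy = private_count(patients, lambda p: p["has_condition"], 0.5, rng)
print(round(noisy, 2))
```

The analyst sees only the noisy answer, which is still useful in aggregate, while the presence or absence of any single individual has a limited effect on what is released.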

With regard to discrimination, we need to increase awareness of the pitfalls among researchers and ensure that technological solutions are being delivered not merely for the set of #FirstWorldProblems but for the wider set of challenges that the greater part of the world’s population is facing. That involves increasing the capability to meet those challenges within the populations that are facing them.


Data science offers a great deal of promise in resolving our challenges in health, wealth and well-being, but it is also associated with a set of potential pitfalls. As data scientists it is particularly incumbent upon us to avoid these pitfalls and ensure that our community takes steps to resolve challenges as rapidly and equitably as possible. The nature of data is changing and will continue to change our societies. We need to work to ensure that those changes are carried out in a manner that narrows inequality and preserves the individual freedoms we have come to expect.

  1. The importance of embodiment is reflected in a mini-industry of simulation that seems to pervade complex systems. But often this is simulation without motivation. For example, it seems impressive to build a large-scale simulation of a city like Sheffield, and it seems like such a simulation should be useful for decision makers. In practice though, most decision makers focus (at any given time) on a particular aspect of the city and a full complex simulation is not required. Such simulations do, however, impress non-domain experts.

Bio: Neil Lawrence is a professor of machine learning and computational biology at the University of Sheffield. He leads the ML@SITraN group. His research interests are in probabilistic models with applications in computational biology and personalized health. He blogs regularly.

Original. Reposted with permission.