Data Science Challenges
This post is thoughts for a talk given at the UN Global Pulse lab in Kampala, and covers the challenges in data science.
By Neil Lawrence, University of Sheffield.
The talk was given as part of the second Data Science in Africa Workshop, held at the UN Global Pulse Lab in Kampala, Uganda.
Data is a pervasive phenomenon. It affects all aspects of our activities. This diffusiveness is both a challenge and an opportunity. It is a challenge because our expertise is spread thinly: like raisins in a fruitcake, or nuggets in a gold mine. It is an opportunity because, if we can resolve the challenges of diffusion, we can foster multi-faceted benefits across the entire University.
What Got Us Here
The old world of data was formulated around the relationship between human and data. Data was expensive to collect, and the focus was on minimising subjectivity through randomised trials and hypothesis testing.
Historically, the interaction between human and data was necessarily restricted by our capability to absorb its implications and the laborious tasks of collection, collation and validation. The bandwidth of communication between human and computer was limited (perhaps at best hundreds of bits per second).
This status quo has been significantly affected by the coming of the digital age and the development of fast computers with extremely high communication bandwidth. In particular, today, our computing power is widely distributed and communication occurs at gigabits per second. Data is now often collected through happenstance. Its collation can be automated. The cost per bit has dropped dramatically, but the care with which it is collected has significantly decreased.
Traditional data analyses focused on the interaction between data and human. Sometimes these data may have been processed by computer, but often only after human-driven data entry.
Today, massively interconnected processing power combined with widely deployed sensorics has led to a manyfold increase in the bandwidth of the channel between data and computer. This leads to two effects:
- automated decision making within the computer based only on the data.
- a requirement to better understand our own subjective biases to ensure that the human to computer interface formulates the correct conclusions from the data.
This process has already revolutionised biology, leading to computational biology and a closer interaction between computational, mathematical and wet lab scientists. Now we are seeing new challenges in health and computational social sciences. The area has been widely touted as ‘big data’ in the media and the sensorics side has been referred to as the ‘internet of things’. In some academic fields overuse of these terms has already caused them to be viewed with some trepidation. However, the phenomena to which they refer are very real. With this in mind we choose the term ‘data science’ to refer to the wider domain of studying these effects and developing new methodologies and practices for dealing with them.
The main shift in dynamic we’d like to highlight is from the direct pathway between human and data (the traditional domain of statistics) to the indirect pathway between human and data via the computer. This change of dynamics gives us the modern and emerging domain of data science.
The field of data science is rapidly evolving. Different practitioners from different domains have their own perspectives. In this post we identify three broad challenges that are emerging and that have not been addressed in the traditional sub-domains of data science. The challenges have social implications but require technological advance for their solutions.
Paradoxes of the Data Society
The first challenge we’d like to highlight is the unusual paradoxes of the data society. It is too early to determine whether these paradoxes are fundamental or transient. Evidence for them is still somewhat anecdotal, but they seem worthy of further attention.
The Paradox of Measurement
The first paradox is the paradox of measurement in the data society. We are now able to quantify to a greater and greater degree the actions of individuals in society, and this might lead us to believe that social science, politics and economics are becoming quantifiable. We are able to get a far richer characterization of the world around us. Paradoxically, it seems that as we measure more, we understand less.
How could this be possible? It may be that the greater preponderance of data is making society itself more complex. Traditional approaches to measurement (e.g. polling by random sub-sampling) are therefore becoming harder. For example, batch effects become more complex, and in a more stratified society it is more difficult to weight the various sub-populations correctly.
The end result is that we have a Curate’s egg of a society: it is only ‘measured in parts’. Whether by examination of social media or through polling, we no longer obtain the overall picture needed for the depth of understanding we require.
One example of this phenomenon is the 2015 UK election, which polls had as a tie and yet in practice was won by the Conservative party with a seven point advantage. A post-election poll which was truly randomized suggested that this lead was measurable, but pre-election polls are conducted online and via phone. These approaches can under-represent certain sectors. The challenge is that the truly randomized poll is expensive and time consuming. In practice, online and phone polls are usually weighted to reflect the fact that they are not truly randomized, but in a rapidly evolving society the correct weights may move faster than they can be tracked.
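To make the weighting idea concrete, here is a minimal sketch of one common correction, post-stratification, in which each group's observed support is weighted by that group's share of the population. All numbers, group names and function names below are hypothetical and chosen only for illustration; they are not taken from any real poll, and real pollsters use considerably more elaborate schemes.

```python
def raw_share(sample):
    """Unweighted share of supporters across all respondents.

    sample maps a group name to (respondents, supporters).
    """
    total = sum(n for n, _ in sample.values())
    supporters = sum(s for _, s in sample.values())
    return supporters / total


def weighted_share(sample, population_shares):
    """Post-stratified estimate: each group's support rate is
    weighted by that group's assumed share of the population."""
    estimate = 0.0
    for group, (respondents, supporters) in sample.items():
        estimate += population_shares[group] * (supporters / respondents)
    return estimate


# Hypothetical online panel in which older voters are under-represented.
sample = {"younger": (800, 240), "older": (200, 100)}

raw = raw_share(sample)
# Weights based on an out-of-date picture of the population (assumed).
stale = weighted_share(sample, {"younger": 0.6, "older": 0.4})
# Weights based on the current population (assumed).
current = weighted_share(sample, {"younger": 0.5, "older": 0.5})

print(f"raw: {raw:.2f}, stale weights: {stale:.2f}, current weights: {current:.2f}")
```

In this toy example the unweighted estimate, the estimate with out-of-date weights and the estimate with up-to-date weights all differ, which is the sense in which weights that lag behind a changing society leave a residual bias.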
Another example is clinical trials. Once again, we rely on randomized studies to verify the efficacy of a drug. But here, rather than the population becoming more stratified, it is the drugs we wish to test that are becoming more personalized. A targeted drug which has efficacy in a sub-population may be harder to test due to the difficulty of recruiting that sub-population. The benefit of the drug is also confined to a smaller sub-group, so the expense of drug trials increases.
There are other less clear cut manifestations of this phenomenon. We seem to rely increasingly on social media as a news source, or as an indicator of opinion on a particular subject. But social media is beholden to the whims of a vocal minority.
Just as we required more paper when we first developed the computer, the solution here is more classical statistics. We need to do more work to verify the tentative conclusions we produce so that we know that our new methodologies are effective.
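A rough piece of arithmetic illustrates the recruitment problem. The sketch below assumes, purely for illustration, that every eligible patient who is screened agrees to take part, so the number of people to screen scales as the required trial size divided by the prevalence of the sub-population. The function name and the figures are hypothetical.

```python
import math


def patients_to_screen(required_participants, subgroup_prevalence):
    """Approximate number of people to screen to find enough eligible
    participants, assuming every eligible person enrolls."""
    if not 0 < subgroup_prevalence <= 1:
        raise ValueError("subgroup_prevalence must be in (0, 1]")
    return math.ceil(required_participants / subgroup_prevalence)


# A drug aimed at the whole patient population (hypothetical trial size).
whole_population = patients_to_screen(400, 1.0)
# A targeted drug that only applies to a quarter of patients.
targeted = patients_to_screen(400, 0.25)

print(whole_population, targeted)
```

Under these simplifying assumptions, the screening effort grows in inverse proportion to the size of the sub-group, before any other costs of the trial are considered.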
Filter Bubbles and Echo Chambers
A related effect concerns our own ability to judge the wider society in our countries and across the world. It is now possible to be connected with friends and relatives across the globe, and one might hope that would lead to greater understanding between people. Paradoxically, it may be the case that the opposite is occurring, that we understand each other less well.
This argument, sometimes summarised as the ‘filter bubble’ or the ‘echo chamber’, is based on the idea that our information sources are now curated, either by ourselves or by algorithms working to maximise our interaction. Twitter feeds, for example, contain comments from only those people you follow. Facebook’s newsfeed is ordered to increase your interaction with the site.
In our diagram above, if humans have a limited bandwidth through which to consume their data, and that bandwidth is saturated with filtered content, e.g. ideas which they agree with, then it might be the case that we become more entrenched in our opinions than we were before. We don’t see ideas that challenge our opinions.
This is not an entirely new phenomenon. In the past, people’s perspectives were certainly influenced by the community in which they lived, but the scale on which this can now occur is much larger than it has been before.