Machine Learning Meets Humans – Insights from HUML 2016

Report from an important IEEE workshop on Human Use of Machine Learning, covering trust, responsibility, the value of explanation, safety of machine learning, discrimination in human vs. machine decision making, and more.

Fabio Roli – Safety of Machine Learning

On the topic of anthropomorphic ML, Fabio Roli began his discussion with an engaging foray into science fiction. Recalling Fred Hoyle’s Black Cloud, Roli recounted a story of humans who encounter a sentient cloud. They soon realize that it is intelligent but are unable to communicate with it because all attempts at communication stem from anthropomorphic assumptions.

Turning back towards machine learning, Roli asked whether it is good or responsible to anthropomorphize developments in AI. Already many companies build humanoid robots. We already engage in dialogue with anthropomorphic chatbots (Alexa, Siri, Cortana). What are the benefits of this tendency toward the anthropomorphic? What are its dangers? Do we present false or misleading expectations of capabilities?

Roli went on to describe problems with machine learning owing to adversarial examples. He addressed the susceptibility of convolutional neural networks to adversarially perturbed images and proposed approaches in which ML algorithms learn closed decision boundaries. In a model with closed decision boundaries, the algorithm might abstain from classifying points sufficiently unlike those previously seen. How precisely to draw closed boundaries on the space of images remains a challenging open question, and the method has not, to my knowledge, been reduced to practice in this domain.
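
To make the abstention idea concrete, here is a minimal sketch of a classifier with a reject option. It is not Roli's method: a nearest-centroid rule with a fixed radius stands in for a learned closed boundary, and the names fit_centroids, predict_or_abstain, and radius are hypothetical. A point is labeled only if it falls inside a ball around some class centroid; otherwise the model returns None.

```python
import math

def fit_centroids(points, labels):
    """Average the training points of each class into one centroid."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_or_abstain(centroids, x, radius):
    """Return the nearest class, or None when x lies outside every ball."""
    best_label, best_dist = None, float("inf")
    for y, c in centroids.items():
        d = math.dist(x, c)
        if d < best_dist:
            best_label, best_dist = y, d
    return best_label if best_dist <= radius else None

# Toy two-class training set (illustrative values only).
train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_y = ["a", "a", "b", "b"]
centroids = fit_centroids(train_x, train_y)

near = predict_or_abstain(centroids, (0.1, 0.0), radius=0.5)   # close to class "a"
far = predict_or_abstain(centroids, (5.0, -4.0), radius=0.5)   # unlike anything seen
```

In two dimensions a ball is an easy closed region to draw. The open question noted above is what the analogous region should look like in the space of images.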

Describing another line of research, Roli introduced work aimed at detecting malware on Android devices. This work presented an interesting spin on opacity in ML algorithms. In malware detection, even controlling for accuracy, opacity might be an asset and not a vice. A truly transparent algorithm might be easier for a malware coder to evade. Opaque algorithms might be harder to game.

Viola Schiaffonati – Preliminary steps for experimentally evaluating the impact of AI

Professor Schiaffonati’s talk focused on the safe deployment of machine learning algorithms, calling to mind the NIPS workshop on reliable ML in the wild. In the talk, she focused on the distinction between learning by experimentation (in vitro) and learning by doing (in vivo). Among the ideas presented was a proposal for special testing zones.

Already, many technologies are rolled out systematically with testing phases. Google, for example, conducts extensive betas internally, with employees receiving advance access to new technology. Nevertheless, for many players, especially smaller startups and research groups, deployment can be considerably more haphazard, and these problems are by no means solved.

Mireille Hildebrandt – The Issue of Bias

Mireille Hildebrandt’s talk was both one of the most entertaining and the toughest to summarize. The talk alternately addressed the law, bias, and the fundamental task of pattern recognition. It also featured frequent leaps to the classics, complete with interjections from Hume and a reference to the no free lunch theorem. To do the talk justice, I’d suggest watching the video, but I’ll quickly summarize the most important takeaways.

To me the most profound moment in the talk came early when Hildebrandt addressed why we need law in the first place.

“We need law to create a playing field such that actors can act ethically”

“What we don’t want is an incentive structure such that companies who want to act ethically will be pushed out of the market.”

While it might seem absurd that we should need to justify the existence of law (!!!), in today’s Silicon Valley climate these points hit home.

Consider the many startups now that outmaneuver their highly regulated predecessors precisely by skirting regulation. Uber out-competed taxis in part because it’s a slick app but also because it operated without commercial insurance, didn’t employ commercially licensed drivers, and didn’t have to pay for medallions in cities where these are the prohibitive cost of entering the taxi market. Given recent events, the point rings more profound. How can a politician compete effectively without abusing the truth? For ethical behavior to prevail, it must be encouraged by society.

Later in her talk, Hildebrandt addressed contestability. This point again goes to the heart of ML interpretability research. In many real-life situations, people need or want the ability to contest decisions. If a decision-maker denies you a loan due to insufficient income, you could contest this by showing evidence of income and demanding that this new information be taken into account. Unfortunately, few attempts at interpretable ML models possess the ability to handle a protest and revise predictions under new information.

Krishna Gummadi – Discrimination in human vs. machine decision making

In his talk, Krishna Gummadi took on discrimination in algorithmic decisions. He started with the idea of a “socially salient group”. In short, this would be something like race or gender: any group that we want to be careful not to discriminate against.

Gummadi then formalized several notions of discrimination. One, for example, would be disparate treatment. A predictive model is guilty of disparate treatment if P(y|x,z) != P(y|x), where z is a sensitive feature and x are the insensitive features.
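
As a rough illustration of this definition, the sketch below estimates both conditional probabilities by counting over a table of past decisions and reports the largest difference between them. The record layout (keys "x", "z", "y") and the function names are assumptions made for the example, not anything presented in the talk, and a real audit would need far more care with sample sizes and continuous features.

```python
from collections import defaultdict

def positive_rates(records, key):
    """Estimate P(y=1 | key(record)) by counting."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in records:
        k = key(r)
        totals[k] += 1
        positives[k] += r["y"]
    return {k: positives[k] / totals[k] for k in totals}

def disparate_treatment_gap(records):
    """Largest |P(y=1|x,z) - P(y=1|x)| over the observed (x, z) pairs."""
    by_x = positive_rates(records, lambda r: r["x"])
    by_xz = positive_rates(records, lambda r: (r["x"], r["z"]))
    return max(abs(rate - by_x[x]) for (x, _z), rate in by_xz.items())

# Hypothetical decisions: within each x, outcomes do not depend on z.
fair = [
    {"x": "low", "z": 0, "y": 0}, {"x": "low", "z": 1, "y": 0},
    {"x": "high", "z": 0, "y": 1}, {"x": "high", "z": 1, "y": 1},
]
# Hypothetical decisions: among "high" cases, the outcome follows z.
biased = [
    {"x": "high", "z": 0, "y": 1}, {"x": "high", "z": 1, "y": 0},
    {"x": "low", "z": 0, "y": 0}, {"x": "low", "z": 1, "y": 0},
]

fair_gap = disparate_treatment_gap(fair)
biased_gap = disparate_treatment_gap(biased)
```

A gap of zero means that, in this table, knowing z adds nothing once x is known. Any positive gap means two cases with identical x were treated differently depending on z.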

Gummadi explained why omitting sensitive features may not be sufficient to ensure that a model doesn’t discriminate. In particular, if (i) the available labels are biased, and (ii) the insensitive features are correlated with the sensitive features, then the model could learn to reproduce the discriminatory behavior of the labellers. I discussed this problem at length in a previous post, The Foundations of Algorithmic Bias.
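
A toy example may help show why dropping the sensitive feature is not enough. In the sketch below, all data is made up: zip code is correlated with the sensitive feature z, and the labels y simply follow z, standing in for biased labellers. The model, a per-zip-code majority vote, never sees z, yet its predictions differ sharply between the two groups.

```python
from collections import defaultdict

# Hypothetical rows of (zip_code, z, y). Group z=1 mostly lives in zip "A",
# group z=0 mostly in zip "B", and the biased labels y depend only on z.
data = ([("A", 1, 0)] * 9 + [("A", 0, 1)] * 1 +
        [("B", 0, 1)] * 9 + [("B", 1, 0)] * 1)

def fit_on_zip_only(rows):
    """Majority-vote model per zip code; z is deliberately ignored."""
    totals, positives = defaultdict(int), defaultdict(int)
    for zip_code, _z, y in rows:
        totals[zip_code] += 1
        positives[zip_code] += y
    return {k: int(positives[k] / totals[k] >= 0.5) for k in totals}

def positive_rate_by_group(model, rows):
    """Fraction of each z group that receives a positive prediction."""
    totals, positives = defaultdict(int), defaultdict(int)
    for zip_code, z, _y in rows:
        totals[z] += 1
        positives[z] += model[zip_code]
    return {z: positives[z] / totals[z] for z in totals}

model = fit_on_zip_only(data)
rates = positive_rate_by_group(model, data)
```

Here the model learns to predict positive for zip "B" and negative for zip "A", so 90% of group z=0 but only 10% of group z=1 receive a positive prediction, even though z was never an input.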

In his proposed solution, Gummadi proposed encoding fairness as a constraint. Take for example recidivism prediction. In this approach the model might be constrained to predict the same number of recidivism cases among white and black arrestees. Then subject to this fairness constraint, the model would maximize accuracy. In practice, setting exact equality could yield trivial predictions (give the same prediction to everyone), so the authors introduce some slack. In this approach, while the model wouldn’t be guilty of disparate treatment, it would still peek at the sensitive feature during training.
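
The trade-off between the constraint and accuracy can be seen in a deliberately simple stand-in for this idea. The sketch below is not the authors' formulation: it brute-forces a single score threshold that maximizes accuracy subject to the gap in positive-prediction rates between two groups staying within a slack. The data and the names evaluate and best_fair_threshold are hypothetical. Note that z is used only while choosing the threshold, never when predicting.

```python
def evaluate(rows, threshold):
    """Return (accuracy, gap in positive-prediction rate between groups)."""
    correct = 0
    totals, positives = {0: 0, 1: 0}, {0: 0, 1: 0}
    for score, z, y in rows:
        pred = int(score >= threshold)
        correct += int(pred == y)
        totals[z] += 1
        positives[z] += pred
    gap = abs(positives[0] / totals[0] - positives[1] / totals[1])
    return correct / len(rows), gap

def best_fair_threshold(rows, slack):
    """Most accurate threshold whose rate gap is at most `slack`."""
    # Candidate thresholds: every observed score, plus "predict nobody".
    candidates = sorted({s for s, _, _ in rows}) + [float("inf")]
    feasible = []
    for t in candidates:
        acc, gap = evaluate(rows, t)
        if gap <= slack:
            feasible.append((acc, t))
    return max(feasible, key=lambda pair: pair[0])

# Hypothetical rows of (score, z, y); a threshold of 0.6 separates y perfectly.
rows = [
    (0.9, 0, 1), (0.8, 0, 1), (0.7, 0, 1), (0.6, 1, 1),
    (0.5, 0, 0), (0.4, 1, 0), (0.3, 1, 0), (0.2, 1, 0),
]

strict = best_fair_threshold(rows, slack=0.0)   # exact equality of rates
loose = best_fair_threshold(rows, slack=0.5)    # generous slack
```

On this data, exact equality is only satisfied by giving everyone the same prediction, which leaves accuracy at 0.5. Loosening the slack to 0.5 admits the threshold at 0.6 and recovers perfect accuracy, at the cost of unequal rates.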

This research strikes me as interesting both for its attempts to formalize notions of discrimination and for its exploration of a technical solution. However, I’d share a couple of caveats about this particular technical solution that one ought to consider before actually using it in practice.

My two reservations are as follows. First, I think if a model peeks at a sensitive feature during training but not at inference time, this only removes disparate treatment in a narrow technical sense. It respects the mathematical definition but not the spirit of disparate treatment concerns. Say, for example, that the sensitive feature were race and that the insensitive features included correlated features like zip code. The model could then be expected to use zip code explicitly as a proxy for race. At this point, the model could not reasonably be said to be ignoring race. So if the goal is to learn an affirmative action strategy, why do it implicitly and not explicitly?

This brings us to my second reservation about the approach: the arbitrariness of the learned weights. Consider a dataset for predicting recidivism. Imagine that the dataset contains a throwaway feature Q that has no plausible connection to the prediction task but is correlated with the sensitive feature Z. If we train any reasonable ML classifier, this feature will get no weight: P(Y|Q) = P(Y). But under Gummadi’s proposed model, this nonsense feature now might shoulder considerable weight (to meet the fairness constraint). So for the sake of paying lip-service to disparate treatment, we now produce a seemingly nonsensical model.

To be clear, I think this work is valuable. The benefits and pitfalls of any algorithm can only be examined once the algorithm is proposed. Gummadi’s approach makes a bold attempt both at a problem formulation and a solution.

Arshak Navruzyan – Avoiding bad machine learning predictions in critical decision domains

The lone non-academic among the invited speakers, Arshak Navruzyan of Startup.ML offered an entrepreneur’s perspective on machine learning. Early on, Navruzyan pointed out the difficulty of doing machine learning owing to its interdisciplinary nature. Doing machine learning well requires mathematical skills, engineering talent, and some amount of problem domain expertise. It also requires the ability to identify an important problem, to conceive of its impact, and to communicate its importance effectively. Moreover, executing on all of the above while keeping an eye towards ethics requires yet another dimension of competence.

Navruzyan suggested that we should expect things to go wrong if the people doing this work are ill-equipped for it. He positioned his program, Startup.ML, which seeks to train new data scientists, as one solution. Over a several-month program, Navruzyan and his colleagues train a class of fellows for careers solving practical problems with machine learning. During this introduction, Navruzyan said that the Startup.ML fellowship boasted a 2% acceptance rate and made the eyebrow-raising claim that the bootcamp is therefore more selective than elite computer science institutions. The strongest aspect of the talk was his case that programs like Startup.ML could help to enable careers in data science beyond the handful of elite ML PhDs benefitting most now.

Navruzyan suggested that one way to address the multidisciplinary nature of machine learning is to build teams composed of individuals with complementary skill sets. He presented a chart which depicted the stereotypical skills of ML PhDs, general computer scientists, mathematicians, statisticians, natural scientists, and project managers. I’m not a fan of hard-coding stereotypes like this. But the point about interdisciplinary team-building is reasonable.

Later in the talk, Navruzyan discussed machine learning predictions in critical domains, ostensibly the purpose of the talk. He cited adversarial examples as one problem, motivating Startup.ML’s growing focus on reinforcement learning. This struck me as odd. There is no good reason to suspect that reinforcement learning is safer than supervised learning in critical domains or less susceptible to adversarial examples. In fact, even absent adversarial intervention, reinforcement learning can be subject to instability during training. Deep reinforcement learning in particular can diverge or oscillate, even on toy examples.

Bettina Berendt – What does it mean to ask about “the human use of machine learning?”

The final invited talk of the day was delivered by Bettina Berendt. In it she posed six questions about the social impacts of machine learning and the responsibilities of individuals and organizations deploying machine learning in the wild. Since her talk was designed more to ask questions and foster conversation than to propose definitive answers, I present them here, excerpted from her abstract.

  1. Which actors are involved in formulating the (e.g. privacy) problem?
  2. How does the researcher conceptualise the problem (e.g. privacy) in terms of the major legal and ethical positions currently being discussed?
  3. Is informing users of (e.g. privacy) dangers always a good thing?
  4. Do we want to influence users’ attitudes and behaviours?
  5. Who is the target audience?
  6. What can we do in our various roles – as academics, teachers, intellectuals, etc.?

In a noteworthy moment, Berendt criticized the tendency of theorists who take on social issues to ignore prior work in the social sciences. This is a thorny issue and I can see multiple angles.

On one hand, there is a question of intellectual opportunity: are today’s researchers missing out by ignoring previous research? On the other hand, there’s an issue of credit assignment. Are today’s machine learning researchers, armed with fame and funding, usurping the helm of long-researched fields? I had just come from NIPS 2016, so credit-assignment battles were fresh in my mind. Likely each side has some merit.

How best to engage prior literature on privacy, fairness, and discrimination, and to what extent this work is overlooked and under-cited, I leave as an exercise for a future post or an ambitious reader.

Final Thoughts

The summit in Venice was a bold effort to bring together various experts at the intersection of machine learning, ethics, and public policy. The talks were diverse and informative, leaving me to wonder: what venue will emerge as the permanent home for this work? At present, this community is scattered across several workshops, none of which publishes peer-reviewed proceedings.

As the field develops and as technological progress increases the importance of this work, will a full conference (or journal) emerge as a definitive publishing venue? Where can policy-makers and machine learning practitioners turn for authoritative research on social impacts of machine learning?

Regarding the future of this community, I pose the following challenges and questions:

  1. The necessary research requires legal, critical, and technical scholarship. It seems unlikely that we can count on building a community exclusively of individuals who excel in all disciplines. Likely many papers will emerge with important insights for policy but minimal technical contributions. Other papers might have technical contributions but contribute to policy less directly. What set of standards should we apply to evaluating research?
  2. From an audience perspective, research in this area should be accessible to machine learning researchers and practitioners, legal scholars and practitioners, policy researchers, and politicians. Can this be accomplished in one venue?
  3. Could a multi-track conference overcome some of these potential organizational problems?
  4. Could we look to Machine Learning for Healthcare (MLHC) as an exemplar? This new conference brings together work at the intersection of core machine learning and clinical applications. As a published author and peer reviewer there, I can reflect that while the process isn’t yet perfect, it’s off to a promising start.
  5. We need academic incentives for engaging in this research. What changes must we make to our disparate academic communities to encourage this work?
    • Philosophers and public policy academics get no credit for research presented at conferences
    • Machine learning researchers get little career advancement for writing philosophy papers
    • Machine learning venues rarely publish position papers

While these questions may remain open for some time, I’m hopeful after HUML2016 that a small but growing community of researchers is committed to making progress.

Original. Reposted by permission.