Anecdotes from 11 Role Models in Machine Learning
The skills needed to create good data are also the skills needed for good leadership.
By Robert Munro, Author, Human-in-the-Loop Machine Learning
I recently wrote the book that I wish existed when I was introduced to machine learning: Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-Centered AI. Most machine learning models are guided by human-annotated data, but most machine learning books and courses focus on algorithms. You can often get state-of-the-art results with good data and simple algorithms, but you rarely get state-of-the-art results from the best algorithm with bad data. So if you need to go deep in one area of machine learning first, you could argue that the data side is more important.
In addition to its technical focus, the book features anecdotes from 11 machine learning experts. Each shared an anecdote about data-related problems they encountered while building and evaluating machine learning models in real-world situations. Their stories tell us something important about machine learning leadership more broadly, with each anecdote tying into a lesson about running successful data science projects.
The 11 experts in Machine Learning featured in Human-in-the-Loop Machine Learning. (All images used with the permission of each expert and repeated below with their individual anecdotes)
The experts were selected according to two criteria: they have all founded successful machine learning companies, and they have all worked directly on the data side of machine learning. They are all good role models for people considering careers in machine learning: Ayanna Howard, Daniela Braga, Elena Grewal, Ines Montani, Jennifer Prendki, Jia Li, Kieran Snyder, Lisa Braden-Harder, Matthew Honnibal, Peter Skomoroch, and Radha Basu. If you are early in your career and struggling to create good data for your models, then I hope you can relate to many of the anecdotes in the book, which are shared here:
“Parents are the perfect subject matter experts”
Models about people are rarely accurate for people who are not represented in the data. Many demographic biases can lead to people being under-represented, including ability, age, ethnicity, and gender. There are often intersectional biases, too: when people are under-represented across multiple demographics, the bias at the intersection of those demographics is sometimes more than the sum of its parts. Even if you do have the data, it might be difficult to find annotators with the right experience to correctly annotate it.
When building robots for kids with special needs, I found that there was not sufficient data for detecting emotion in children, detecting emotion in people from under-represented ethnicities, and detecting emotion in people on the autism spectrum. People without immersive experience tend to be really poor at recognizing emotions in these children, which limits who can provide the training data that says when a child really is happy or upset. Even some trained child physicians have difficulties accurately annotating data when addressing the intersectionality of ability, age, and/or ethnicity. Fortunately, we found that a child’s own parents were the best judge of their emotions, so we created interfaces for parents to quickly accept/reject a model’s prediction of the child’s mood. This allowed us to get as much training data as possible while minimizing the time and technical expertise that parents needed to provide that feedback. Those children’s parents turned out to be the perfect subject matter experts to tune our systems to their child’s needs.
Bio: Ayanna Howard is the Dean of the College of Engineering at Ohio State University. She was previously the Chair of the School of Interactive Computing at Georgia Tech; co-founder of Zyrobotics, which makes therapy and educational products for children with special needs; worked at NASA; and has a PhD from the University of Southern California.
“Confessions about sourcing languages”
At our company we pride ourselves on going the extra mile to ensure we’re getting the best data, which sometimes leads to hilarious situations. For text and speech data, the hardest problem is often finding fluent speakers. Finding people with the right qualifications and who speak the right language is one of the most difficult and overlooked problems in machine learning.
Recently, we were doing a major project collection for a client with specific language requirements. After a few missed attempts to source the right people for a rare language, one of our people went to a church where he knew he’d find individuals who would meet the requirements. While he found the people he needed for our client, he accidentally turned up during confession time. The priest assumed he was there for this reason, so, true to form, he made his full confession, including about sourcing languages.
Bio: Daniela Braga is the Founder and CEO of DefinedCrowd, a company that creates training data for machine learning, including text and speech data in more than 60 languages.
“Synthetic controls: evaluating your model without evaluation data”
How can you measure your model’s success if you are deploying an application where you can’t run A/B tests? Synthetic control is a technique you can use in this case: you find existing data that is closest in features to where you are deploying the model and use that data as your control group.
I first learned about synthetic controls when studying education policy analysis. When a school tries some new method to improve their students’ learning environment, they can’t be expected to improve only half the students’ lives so that the other half can be a statistical control group. Instead, education researchers might create a “synthetic control group” of schools that are most similar in terms of the student demographics and performance. I took this strategy, and we applied it at Airbnb when I was leading data science there. For example, when Airbnb was rolling out a product or policy change in a new city/market and could not run an experiment, we would create a synthetic control group of the most similar cities/markets. We could then measure the impact of our models compared to the synthetic controls for metrics like engagement, revenue, user ratings, and search relevance. Synthetic controls allowed us to take a data-driven approach to measuring the impact of our models, even where we didn’t have evaluation data.
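The core of this approach can be sketched in a few lines. The markets, features, and metrics below are entirely made up for illustration, and this is not Airbnb’s actual system: the synthetic control is simply the set of untreated markets nearest to the treated one in feature space, and the estimated effect is the treated market’s metric minus the controls’ average.

```python
# Illustrative sketch of a synthetic control group (hypothetical data).

def euclidean(a, b):
    """Distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def synthetic_control(treated, candidates, k=2):
    """Return the k candidate markets most similar in features to the treated one."""
    ranked = sorted(candidates,
                    key=lambda c: euclidean(c["features"], treated["features"]))
    return ranked[:k]

def estimated_effect(treated, control_group):
    """Treated market's metric minus the mean metric of the synthetic controls."""
    baseline = sum(c["metric"] for c in control_group) / len(control_group)
    return treated["metric"] - baseline

# Hypothetical markets: features = (listings per capita, nightly-rate index)
markets = [
    {"name": "B", "features": (0.9, 1.1), "metric": 104.0},
    {"name": "C", "features": (1.0, 0.9), "metric": 98.0},
    {"name": "D", "features": (3.0, 2.5), "metric": 150.0},
]
treated = {"name": "A", "features": (1.0, 1.0), "metric": 110.0}

controls = synthetic_control(treated, markets, k=2)
print([c["name"] for c in controls])        # the two most similar markets
print(estimated_effect(treated, controls))  # lift vs. the synthetic control
```

Real synthetic control methods weight the control units rather than averaging them equally, but the idea is the same: the comparison group is constructed from similarity in observable features rather than from random assignment.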
Bio: Elena Grewal is founder and CEO of Data 2 the People, a consultancy that uses data science to support political candidates who aim to have a positive impact on the world. Elena previously led Airbnb’s 200+ person data science team and has a PhD in Education from Stanford University.
“Good interfaces give you quality, not just quantity”
When I talk to people about usable interfaces for annotation, the reaction is too often “Why bother? Annotations aren’t very expensive to collect, so even if your tool is twice as fast, that’s still not that valuable.” This viewpoint is problematic. First, many projects need buy-in from subject matter experts such as lawyers, doctors, or engineers who will be doing much of the annotation. More fundamentally, even if you’re not paying people much, you still care about their work and people can’t give you good work if you set them up to fail. Bad annotation processes often force workers to switch focus between the example, the annotation scheme, and the interface. This requires active concentration and is quickly exhausting.
I worked in web programming before I started working in AI, so annotation and visualisation tools were the first pieces of AI software that I started thinking about. I have been especially inspired by the “invisible” interfaces in games, which make you think about what to do, not how to do it. But it is not about gamification to make a task “fun” like a “game”: it is about making the interface as seamless and immersive as possible to give annotators the best chance to do the task well. That will create better data and be more respectful to the people creating it.
Bio: Ines Montani is the co-founder of Explosion. She’s a core developer of spaCy, and the lead developer of Prodigy.
“Not all data is equal”
If you care about your nutrition, you don’t go to the supermarket and randomly select items from the shelves. You might eventually get the nutrients you need by eating random items from the supermarket shelves, but you will eat a lot of junk food in the process. I think it is weird that in Machine Learning, people still think it’s better to “sample the supermarket randomly” than to figure out what they need and focus their efforts there.
The first Active Learning system I built was by necessity. I was building Machine Learning systems to help a large retail store make sure that when someone searched on the website, the right combination of products came up. Almost overnight, a company re-org meant that my human labeling budget was cut in half and we had a 10x increase in inventory that we had to label. So, my labeling team had only 5% of our previous budget per item.
I created my first Active Learning framework to discover which was the most important 5%. The results were better than random sampling with a bigger budget. I have used Active Learning in most of my projects since, because not all data is equal!
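One minimal version of this idea is uncertainty sampling: with a budget that covers only part of the pool, send annotators the items the current model is least sure about. The scores below are hypothetical, and real Active Learning frameworks are considerably more sophisticated, but the sketch shows the basic selection step.

```python
# Uncertainty-sampling sketch (hypothetical model scores).

def least_confident(pool, budget):
    """pool: list of (item_id, predicted probability of the positive class).

    Rank items by distance from the 0.5 decision boundary; the closest
    items are the ones the model is least sure about, so label those first.
    """
    ranked = sorted(pool, key=lambda item: abs(item[1] - 0.5))
    return [item_id for item_id, _ in ranked[:budget]]

pool = [("a", 0.98), ("b", 0.51), ("c", 0.03), ("d", 0.45), ("e", 0.70)]
to_label = least_confident(pool, budget=2)
print(to_label)  # the two items nearest the decision boundary
```

In practice you would retrain the model after each batch of new labels and re-score the remaining pool, so the “most important 5%” is chosen iteratively rather than in one pass.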
Bio: Jennifer Prendki is the CEO of Alectio, finding the right data for Machine Learning. She previously led data science teams at places including Atlassian, Figure Eight, and Walmart.
“The difference between academic and real-world data labeling”
It is much harder to deploy machine learning in the real world than in academic research, and the main difference is the data. Real-world data is messy and often hard to access due to institutional hurdles. It is fine to conduct research on clean, unchanging datasets, but when you take those models into the real world, it can be hard to predict how they will perform.
When I was helping to build ImageNet, we didn’t have to worry about every possible image class that we might encounter in the real world. We could limit the data to images that were a subset of concepts in the WordNet hierarchy. In the real world, we don’t have that luxury. For example, we can’t collect large amounts of medical images related to rare diseases, and labeling such images requires domain expertise, which poses even more challenges. Real-world systems need AI technologists and domain experts collaborating closely to inspire research, provide data and analysis, and develop algorithms to solve the problem.
Bio: Jia Li was CEO and co-founder of Dawnlight, a health-care company that uses machine learning. She previously led research divisions at Google, Snap, and Yahoo!, and has a PhD from Stanford.
“Your early data decisions continue to matter”
The decisions that you make early in a machine learning project can influence the products that you are building for many years to come. This is especially true for data decisions: your feature-encoding strategies, labeling ontologies, and source data will have long-term impacts.
In my first job out of graduate school, I was responsible for building the infrastructure that allowed Microsoft software to work in dozens of different languages around the world. This included making fundamental decisions like deciding on the alphabetical order of the characters in a language, something that didn’t exist for many languages at the time. When the 2004 tsunami devastated countries around the Indian Ocean, this was an immediate problem for Sinhalese-speaking people in Sri Lanka: there was no easy way to support searching for missing people, because Sinhalese didn’t yet have standardized encodings. Our timeline for Sinhalese support went from several months to several days so that we could help the missing-persons service, working with native speakers to build solutions as quickly as possible. The encodings that we decided on at that time were adopted by Unicode as the official encodings for the Sinhalese language and will now encode that language forever. You won’t always be working on such critical timelines, but you should always consider the long-term impact of your product decisions right from the start.
Bio: Kieran Snyder is the CEO and Co-Founder of Textio, a widely-used augmented writing platform. Kieran previously held product leadership roles at Microsoft and Amazon and has a PhD in linguistics from the University of Pennsylvania.
“Annotation bias is no joke”
Data scientists usually underestimate the effort needed to collect high-quality, highly subjective data. Reaching human agreement on relevance tasks is not easy when you are annotating data without solid ground truth, and engaging human annotators succeeds only with clearly communicated goals, guidelines, and quality-control measures. This is especially important when working across languages and cultures.
I once had a request for Korean knock-knock jokes from a US personal assistant company expanding into Korea. It wasn’t a quick conversation to explain to the product manager why that wouldn’t work and to find culturally appropriate content for their application: it unraveled a lot of assumed knowledge. Even among Korean speakers, the annotators creating and evaluating the jokes needed to be from the same demographics as the intended customers. It was one example of why the strategies to mitigate bias will touch every part of your data pipeline, from guidelines to compensation strategies that target the most appropriate annotation workforce: annotation bias is no joke!
Bio: Lisa Braden-Harder is a Mentor at the Global Social Benefit Institute at Santa Clara University. She was Founder and CEO of the Butler Hill Group, one of the largest and most successful annotation companies; and prior to that worked as a programmer for IBM and completed Computer Science degrees at Purdue and NYU.
“Consider the total cost of annotation projects”
It helps to communicate directly with people annotating your data, just like anyone else in your organization. Inevitably, some of your instructions won’t work in practice and you will need to work closely with your annotators to refine them. You’re also likely to keep refining the instructions and adding annotations long after you go into production. If you don’t take the time to factor in refining the instructions and discarding wrongly labeled items, then it is easy to end up with an outsourced solution that looked cheap on paper but was expensive in practice.
In 2009 I was part of a joint project between the University of Sydney and a major Australian news publisher that required named entity recognition, named entity linking, and event linking. While academics were increasingly using crowdsourced workers at that time, we instead built a small team of annotators that we contracted directly. This ended up being much cheaper in the long run, especially for the more complicated “entity linking” and “event linking” tasks where crowdsourced workers struggled and our annotators were helped by working and communicating with us directly.
Bio: Matthew Honnibal is the creator of the spaCy NLP library and the co-founder of Explosion. He has been working on NLP research since 2005.
“Sunlight is the best disinfectant”
You need to look at real data in depth to know exactly what models to build. In addition to high-level charts and aggregate statistics, I recommend that data scientists regularly review a large sample of randomly selected, granular data and let these examples “wash over you”. Just as executives look at company-level charts every week and network engineers look over stats from system logs, data scientists should have an intuition for their data and how it is changing.
When I was building LinkedIn’s Skill Recommendations feature, I built a simple web interface with a “random” button that would show individual recommendation examples alongside the corresponding model inputs so that I could quickly view the data and get an intuition for the kinds of algorithms and annotation strategies that might be the most successful. This is the best way to ensure that you have uncovered potential issues and obtain the high quality input data that is vital: you’re shining a light on your data, and sunlight is the best disinfectant.
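A “random button” workflow can be approximated in a few lines. The records below are hypothetical, not LinkedIn’s actual data or interface; the point is simply to surface random examples together with their model inputs and predictions for manual review.

```python
# Sketch of a "random button" review loop for building data intuition.
import random

def random_review(examples, n=3, seed=None):
    """Return n randomly chosen records for manual inspection."""
    rng = random.Random(seed)  # seedable so a review session is reproducible
    return rng.sample(examples, n)

# Hypothetical skill-recommendation records: model inputs plus the prediction.
examples = [
    {"inputs": {"title": "Data Engineer", "skills": ["sql"]}, "prediction": "spark"},
    {"inputs": {"title": "ML Engineer", "skills": ["python"]}, "prediction": "pytorch"},
    {"inputs": {"title": "Analyst", "skills": ["excel"]}, "prediction": "sql"},
    {"inputs": {"title": "Researcher", "skills": ["latex"]}, "prediction": "numpy"},
]

for record in random_review(examples, n=2, seed=0):
    print(record["inputs"], "->", record["prediction"])
```

Wiring a loop like this to a web page with a “random” button is a small amount of work, but it makes regular, low-friction inspection of granular data part of the daily routine rather than a one-off exercise.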
Bio: Peter Skomoroch is the former CEO of SkipFlag (acquired by Workday) and worked as a Principal Data Scientist at LinkedIn on the team that invented the title “data scientist”.
“Human insights and scalable machine learning equals production AI”
The outcome of AI is heavily dependent on the quality of the training data that goes into it. A small UI improvement like a magic wand can result in large efficiencies when applied across millions of data points in conjunction with well-defined processes for quality control. An advanced workforce is the key factor: training and specialisation increase quality, and insights from an expert workforce can inform model design in conjunction with domain experts. The very best models are created by a constructive and ongoing partnership between machine and human intelligence.
We recently took on a project that required pixel-level annotation of the various anatomic structures within a Robotic Coronary Artery Bypass Graft (CABG) video. Our annotation teams are not experts in anatomy or physiology, so we implemented teaching sessions in clinical knowledge, led by a solutions architect who is a trained surgeon, to augment the existing core skills in 3D spatial reasoning and precision annotation. The outcome for our customer was successful training and evaluation data. The outcome for us was seeing people from under-resourced backgrounds in animated discussion about some of the most advanced uses of AI as they quickly became experts in one of the most important steps in medical image analysis.
Bio: Radha Basu is founder and CEO of iMerit. iMerit leverages technology and an AI workforce consisting of 50% women and youth from underserved communities to create advanced technology workers for global clients. Radha previously worked at HP, took SupportSoft public as CEO, and founded the “Frugal Innovation Lab” at Santa Clara University.
Leadership skills for machine learning
Creating good data requires a broader skillset than creating good algorithms. Many of the skills required to create good training data are also good leadership skills and are exemplified by the experts featured in my book:
Radha is one of the most successful leaders in Silicon Valley in any industry, having already taken one company public and now the founder and CEO of a profitable AI company that employs thousands of people. I especially like how her anecdote shows that outsourced annotators can become domain experts, growing in their career potential as a result of their work.
Peter encourages data scientists to always look at the data, showing that even for leaders of a company it is important to understand the data that you are working with.
Matthew’s anecdote highlights that annotation itself is not the only cost that goes into creating good data, a point often lost on people who rely solely on anonymous crowdsourced workers, a practice common in academia but rare in industry.
Lisa emphasizes that looking at the data is important, but fully understanding it is not possible when you lack the right cultural context. This highlights how good leadership means bringing in people with greater knowledge than yourself for their tasks.
Kieran’s anecdote is another great example of understanding the cultural context of the people creating the data, where knowledge of a particular language was needed to support time-critical disaster response efforts.
Jia’s anecdote on the difference between academic and real-world data emphasizes how the narrow set of skills most people learn in academic machine learning programs tends not to apply to real-world situations.
Jennifer also highlights the practical reality of many real-world situations: you have limited time and budget, so how do you choose the right data when you still need to ship a product that people will use?
Ines started her career thinking about good user experiences from web interfaces, emphasizing how important good interface design is for good data annotation tools, no matter who is annotating the data.
Elena highlights yet another practical reality of real-world models: how to evaluate the success of model changes when you can’t even run A/B tests, let alone use held-out evaluation data?
Daniela’s story talks about meeting a community providing language data on their own terms, and provides some levity to remind us not to take ourselves too seriously.
Ayanna gives my favorite example of how important it is to decide who can label data: the parent/guardian of a special needs child is probably the only accurate and ethical annotator to understand and encode that child’s emotions.
Even in academia where the focus is on algorithms, researchers understand the importance of data. Christopher D. Manning, director of the Stanford Artificial Intelligence Laboratory, shares this in the book’s foreword:
“It is an open secret of machine learning practitioners in industry that obtaining the right data with the right annotations is many times more valuable than adopting a more advanced machine learning algorithm.”
There are many other people I know who qualify as experts — founders of companies who have worked on the data side of machine learning in their careers — but the timing of the book and limited chapters meant that only so many experts could be included. Given more time, additional role models might include Alyona Medelyan, Aman Naimat, Fang Cheng, Hilary Mason, Ivan Lee, John Akred, Mark Sears, and Monica Rogati. There are a dozen more people who also come to mind, including people who don’t meet the criteria I used for the book but are still role models. Thanks also to Emmanuel Ameisen for the inspiration to invite and feature experts on my book. I got the idea after he did this for his book, Building Machine Learning Powered Applications.
Following role models in machine learning
For someone new to machine learning, it can be difficult to identify what career paths are available. Just as most courses focus on algorithms, most lists of machine learning leaders focus on algorithm researchers. The diversity of the experts’ backgrounds in this article shows that there are many possible career paths to leadership in machine learning, with backgrounds in education, linguistics, UI development, physics, and many other areas outside of computer science. So if you are working on the data side of machine learning and you don’t have a computer science background, you shouldn’t feel like an outsider. Working on data-related problems in machine learning is necessary for a successful career and is a common path to leadership.
I’m sharing all the stories here so that you don’t have to buy the book to learn from these expert anecdotes. If you do buy the book, I am donating all author proceeds to initiatives for better datasets, especially for low-resource languages and for health and disaster response, so you will be contributing to good causes. Although it wasn’t part of the criteria for selection, all the experts have worked on applications with a clear positive impact on the world, so it was a delight to give these 11 role models for good leadership more recognition in my book!
Bio: Robert Munro (@WWRob) worked in refugee camps for the UN in West Africa before his PhD at Stanford that focused on machine learning in health and disaster response. He helped respond to the recent Ebola outbreak in West Africa, the MERS-coronavirus outbreak 10 years ago, and was CTO of a global epidemic tracking organization. Robert also ran AWS's first NLP service, Amazon Comprehend, and has worked as a leader in many Silicon Valley technology companies.
Original. Reposted with permission.