5 Ways Data Scientists Can Help Respond to COVID-19 and 5 Actions to Avoid

How can data scientists help with the COVID-19 response within their organization and more broadly? While there are many valuable and interesting opportunities to apply your skills, there can be unintended consequences even from your best attempt. So, consider this general advice for data scientists who want to help with this and any disaster response.

By Robert Munro, Author, Human-in-the-Loop Machine Learning.

Many data scientists are thinking about how they can help respond to the SARS-CoV-2 virus and the disease it causes, COVID-19. This article is written in response to the current disaster but is intended as general advice for data scientists who want to help with disaster response.

I worked in post-conflict development for the UN in West Africa before coming to Silicon Valley to complete a Ph.D. focused on adapting Machine Learning to low resource languages in health and disaster response contexts. I've helped respond to many disasters worldwide, including the recent Ebola outbreak in West Africa, the MERS-coronavirus outbreak 10 years ago, and as CTO of a global epidemic tracking organization.

However, I think I have made the biggest impact by helping large tech companies support more languages, not in actual disaster response. If you don't speak a privileged language like English, then you are more likely to be the victim of a disaster and to have less information available to make the right decisions about your own recovery. So, ensuring better language coverage is vital. When I led AWS's first Natural Language Processing (NLP) and Machine Translation solutions, and when I had the two largest phone manufacturers as customers for NLP and Speech Recognition data, I used my influence to ensure more diverse language support within those companies. While it is harder to quantify, I think that this might have ultimately done more to help people in disasters than all my time as a disaster responder.

So, if you are a data scientist working at a company that makes widely used technology, the best thing that you can do might be to ensure that there is more diverse language support for your language technologies. This will continue to help in future disasters.

Pathogens, like most organisms, cluster with linguistic diversity. Source: "Artificial Intelligence for Social Good," Robert Munro, Stanford lecture in From Languages to Information, 2015.

Content moderation is also very important in disasters. Criminals prey on disaster victims, especially elderly people in financial scams, and by targeting children for abuse. If your company has content moderation systems that track and report potential financial scams and abuse of minors, then this is very important work.

If you don't think that you can help with language diversity or by tracking scams/abuse at your company, and you still want to contribute to the response to SARS-CoV-2, then here are 5 ways that you can help:

  1. Help the people around you interpret information.
  2. Translate information from experts into more languages.
  3. Prepare data that might be directly related to the response.
  4. Analyze data that is not directly related to the response.
  5. Research using existing disaster response datasets.

Unfortunately, there are many actions where it is likely that you will do more harm than good. More than 90% of your ideas as a data scientist don't actually work out when put into practice, and you should not expect a better success rate in disaster response, especially when you have no experience in the field. So the remaining 5 ways that you can help are 5 things to avoid:

  1. Don't share your own conclusions about how to respond.
  2. Don't work with organizations that are not responding.
  3. Don't start any work that you can't support as long as needed.
  4. Don't amplify any fake media.
  5. Don't release people's personal information.


1. Help the people around you interpret information


As a data scientist you objectively evaluate information regularly and probably have a well-tuned sense of what true scientific reporting looks like in healthcare even though you might not have experience in the field. A lot of your family and other people around you probably have less experience than you. Now is the time to teach them how to interpret log scales on graphs and why they should be suspicious of any graph without a scale.

Misleading image that is being shared on social media at the moment.

This is a good example of the kind of misleading information that is being shared on social media at the moment. Beyond the need to interpret log scales, here are some additional things to note in this example graph:

  1. This graph's y-axis reports the overall number of cases per country. However, the two countries at the bottom, Singapore and Hong Kong, only have about 2% of the population of the USA. This will bias towards flatter curves for Singapore and Hong Kong.
  2. The graph's x-axis starts at 100 cases for every country. For smaller countries, this will tend to be later in the response and will bias towards flatter curves for those countries.
  3. The "Masks/No Masks" circles look like they were added later by a person who is an expert on self-driving cars but has no prior expertise in healthcare for coronaviruses. This may not be obvious to every person who sees this graphic even if the person who added the circles might have tried to make it graphically distinct. Therefore, this graphic could lead people to believe that it is endorsed by "Johns Hopkins" who are well-known in healthcare circles when it is not.
  4. There is a binary distinction between "masks" and "no masks" that is almost certainly false. There will be different levels of mask use in different countries that are all between 0% and 100%, but not at either extreme.
  5. You cannot infer a direct causal effect between the masks and the number of cases from a graphic like this alone. Perhaps the countries with lower cases introduced many protective measures all at once, masks included. Any one of the other protective measures could have made the difference, or perhaps none at all: it could have been some other factor or the extra caution in social distancing that accompanied these protective measures.
  6. Different countries will have different ways of reporting the number of cases. For example, some will only report people with symptoms, and some will have only tested people with symptoms in the first place. It is not always possible to account for this.
  7. Different countries will have populations more or less susceptible to the virus and, therefore, more or less likely to be tested. Age is one big factor in this case.
  8. There are more than 100 times as many cases in the USA, compared to Japan, Singapore, and Hong Kong (note the log scale). Ask people to evaluate what this implies. Some example questions: do you really think that any one homemade solution can prevent more than 99% of cases? How can there be a conspiracy that would involve millions of healthcare workers who are worried about their own loved ones?
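The first two points in the list above (raw counts versus population, and where each curve starts) can be made concrete with a small calculation. The following is a minimal sketch in Python, not a reconstruction of the graph discussed here: the country names, populations, and case counts are all made-up values chosen only to show how a per-capita view can reverse the impression given by raw counts, and the helper names are hypothetical.

```python
import math

# Hypothetical, made-up numbers for illustration only; NOT real case data.
populations = {"Country A": 330_000_000, "Country B": 6_000_000}
cumulative_cases = {
    "Country A": [50, 120, 400, 1500, 6000],
    "Country B": [80, 100, 130, 160, 200],
}


def per_million(cases, population):
    """Convert raw case counts to cases per million residents."""
    return [c * 1_000_000 / population for c in cases]


def align_from_threshold(series, threshold):
    """Drop leading values below `threshold` so curves start at a comparable point."""
    for i, value in enumerate(series):
        if value >= threshold:
            return series[i:]
    return []


def doubling_time(series):
    """Average number of time steps per doubling between the first and last value."""
    if len(series) < 2 or series[0] <= 0 or series[-1] <= series[0]:
        return math.inf
    doublings = math.log2(series[-1] / series[0])
    return (len(series) - 1) / doublings


for country, cases in cumulative_cases.items():
    rates = per_million(cases, populations[country])
    aligned = align_from_threshold(cases, threshold=100)
    print(country, "latest raw count:", cases[-1],
          "latest per million:", round(rates[-1], 1),
          "steps per doubling:", round(doubling_time(aligned), 2))
```

In this invented example the larger country has far more cases in absolute terms, yet the smaller country has more cases per million residents. A raw-count graph hides that kind of difference. The doubling time is the quantity that a straight line on a log-scale plot is really showing, which is one way to explain log scales to people around you.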

Combatting misinformation like this spreading on the internet might be the most important thing that you can do as a data scientist. If someone you are close to comes to you with information like this, point out all these issues to them and then ask them why someone who does know the truth might be trying to mislead them. An obviously fake graphic like this could lead someone to distrust masks. That would also be wrong. This should have no influence on someone's decision, and the only advice should be:

Only take the advice of your trusted healthcare providers.

You might also want to talk about why healthcare organizations aren't talking about issues like this, which is because they don't see data that tells them it is important at this moment. If people share this too much it can become political and force organizations like the CDC and WHO into a response that is politically driven instead of health-driven. So, you should caution people about sharing this type of information regardless of whether they agree.


2. Translate information from experts into more languages


Do you speak a language other than English? Especially a less widely spoken language? There's a good chance that a lot of valuable information is not being translated into those languages, or worse, that a lot of misinformation is spreading without correct information available to contradict it.

Any relevant data that is translated and/or transcribed in a way that can be used by Machine Translation and Speech Recognition models will be useful. For example, two years ago, I led a project to create 10+ hours of disaster and health-related recordings of informational messages from the Red Cross in Swahili, with transcriptions and English translations. This data was made open source, and every Machine Translation and Speech Recognition service that uses this data is now more accurate for communications related to COVID-19. If you can create similar datasets and open-source them, that will help COVID-19 and any future response in those languages.

If you don't have any existing datasets, then I recommend helping an organization like Translators Without Borders. They were one of the organizations that helped with the Swahili dataset above, and they work closely with organizations responding to disasters.

If you are not a professional translator, then don't translate advice about preventing or treating COVID-19. Instructional material and medical terminology are among the hardest types of translations to get right. I ran the largest use of crowdsourcing for translation in a disaster, so please take my advice on this one point.


3. Prepare data that might be directly related to the response


Epidemiologists are data scientists, and like the rest of us, they spend most of their time preparing data. If you are able to take data that might be directly related to the response and transform it into a more usable format, then you can help with the response directly.

One example of this might be taking a dataset of anonymized transportation routes that contains ambiguous or non-standard location names and transforming those locations into unambiguous geo-locations. Another example would be making past research papers about coronaviruses more easily searchable so that virologists can come up-to-speed on the past research as efficiently as possible.
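As a rough illustration of the first example, the sketch below resolves messy location strings against a small lookup table using only the Python standard library. Everything in it is an assumption made for illustration: the gazetteer entries, the alias table, the approximate coordinates, and the similarity cutoff are hypothetical, and a real project would rely on an established gazetteer and on review by people who know the region.

```python
import difflib
import unicodedata

# Hypothetical mini-gazetteer: canonical name -> (latitude, longitude).
# Coordinates are approximate and included for illustration only.
GAZETTEER = {
    "freetown": (8.48, -13.23),
    "monrovia": (6.30, -10.80),
    "port-au-prince": (18.54, -72.34),
}

# Hypothetical aliases for non-standard spellings seen in raw data.
ALIASES = {"pap": "port-au-prince", "free town": "freetown"}


def normalize(name):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def resolve_location(raw_name, cutoff=0.8):
    """Return (canonical_name, (lat, lon)), or None when there is no confident match."""
    key = normalize(raw_name)
    key = ALIASES.get(key, key)
    if key in GAZETTEER:
        return key, GAZETTEER[key]
    # Fall back to fuzzy matching for small misspellings.
    close = difflib.get_close_matches(key, list(GAZETTEER), n=1, cutoff=cutoff)
    if close:
        return close[0], GAZETTEER[close[0]]
    return None


for raw in ["  Free Town ", "Monrovai", "PAP", "Atlantis"]:
    print(repr(raw), "->", resolve_location(raw))
```

Returning `None` for names with no confident match is a deliberate choice in this sketch. In a disaster context, an unresolved location that a person reviews later is safer than a confident guess that sends the analysis to the wrong place.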

Epidemiologists typically come from the social sciences, so expect them to be more rigorous when it comes to the right statistical analysis of the data, compared to a machine learning-focused data scientist.


4. Analyze and share data that is not directly related to the response


If you are not an epidemiologist, virologist, or another scientist with a lot of experience responding to disasters, then you are not going to be able to get up-to-speed on an entire field in only a few months. Most of the interventions that you could do would end up hurting people instead of helping them (see below).

However, you can analyze data that tells us something important about the outbreak but doesn't have a direct relationship to the response itself. There are many ways that people's behaviors are changing as a result of COVID-19. Most disaster response professionals will focus on the direct response and might not get to other relevant data until later.

For example, when there was an outbreak of Ebola in West Africa a few years ago, I was advising many organizations because I had lived and worked in Sierra Leone and Liberia in addition to my more general disaster response experience. One thing I calculated that wasn't directly related to the outbreak was to estimate the number of people who died from causes other than Ebola because they were avoiding healthcare clinics. I calculated that for every person who died from Ebola, ten more died from treatable illnesses: The silent victims of Ebola

This helped the response indirectly because we used it to reduce the number of misleading news stories in the countries. Too many media outlets had decided to run information campaigns across the region without considering anything other than reducing Ebola deaths. So, I was able to provide this analysis to international health organizations who used it to help keep the media on-message as much as possible.

For COVID-19, what data can you find and analyze about human behavior that might help indirectly? For example, can you see how the reduction in driving and, therefore, car accidents might free up more hospital beds? The chances are that this might be an important number, but no one has looked into this on a national scale. Similarly, how many fewer deaths are there due to less pollution? Or what is the net benefit to the global carbon footprint now that we can actually measure the result of reduced pollution? Climate change will ultimately kill more people than COVID-19, and this might be one of our best chances to get accurate data about global changes in human behavior.

There is a lot that data scientists can teach us right now without the risk of contributing directly to the response, and they might ultimately have a greater impact on the world.


5. Research using existing disaster response datasets


If you really want to focus on disaster response, then many datasets are relevant to disaster response, and any insight into those past datasets will help us build models today for COVID-19 and other disasters in the future.

One NLP dataset that I helped create contains 30,000 messages drawn from events including an earthquake in Haiti in 2010, an earthquake in Chile in 2010, floods in Pakistan in 2010, and super-storm Sandy in the U.S.A. in 2012. These are all disasters that I helped respond to, and this dataset also includes news articles spanning a large number of years and 100s of other disasters: Access to data

Importantly, some of this data is in languages other than English. For example, the Haitian Kreyol data was used as a shared task in the 2011 Workshop on Machine Translation. This dataset is also used in classes run by AI4All, Udacity, and universities, including Stanford. The more people who have experience in disaster-related data, the more prepared we can be in the future.

If you work in computer vision, then I recommend researching systems that act to support healthcare professionals in their interpretations of images. Healthcare companies will get little or no value from a computer vision system that can only detect one type of infection and only provides a prediction, rather than an interface to help a healthcare professional with their own diagnosis.

Avoid research which is popular in academic circles only because the data is easy to collect or the problem is easy to model. These include English-only social media analysis in NLP and automated diagnosis for single conditions in medical images in Computer Vision. Results from these kinds of studies don't help us decide what approach will help us in actual disasters.


6. Don't share your own conclusions about how to respond


If you are not a healthcare worker or disaster response expert, then you should not give your medical opinion about how people should protect themselves. Despite working in disaster response for a decade, I am only directing people to more authoritative sources. You will not see me give you advice about how to protect or treat yourself in this article or on social media. Please do the same.

Furthermore, if you are quoting expert individuals or organizations, it is better to point people to those sources than to copy them to your website. Unless you are prepared to constantly monitor the healthcare experts for any changes in their advice and immediately update your material to reflect the latest advice, you will be printing misinformation at some point and creating confusion about who the authoritative source should be.

Resist any urge to take part in the discussion. There is absolutely no way that you can learn enough information to be useful within a short amount of time. For example, think about what would happen if someone read the most popular machine learning research papers of the last few years, but had no other experience. Would that prepare them to ship useful machine learning models for real applications? Absolutely not. Those papers have nothing about making machine learning work in the real world, and we know that for every paper, there were 100s or 1000s of experiments that showed negative results.

The same is true for any of the sciences directly responding to a disaster, whether it is about epidemiology, virology, or equipment like face-masks. Reading the 100 most relevant papers will not let you make a useful contribution. You will be biased by the particular problems that make it into papers about early research and the bias to only publish positive results. You will likely get people killed.


7. Don't work with organizations that are not responding


Most organizations that are reaching out to data scientists for help with COVID-19 are not directly helping with the response to COVID-19. To give a very high-level introduction to the aid industry, here's a graphic showing how a lot of aid organizations work in disaster response:

High-level overview of how aid organizations are structured. A small number of large organizations that do aid at the national or international level are known as "Operational Organizations," but most of them use local "Implementing Partners" for the actual disaster response work. Some local aid organizations might be wholly independent, or both independent and helping larger orgs. "Non-Operational Organizations" are the smallest but can erroneously look like they are big and operational.

If someone is asking you to help, then how do you know if they are actually responding? The best organization to help is one operating locally. Does your local hospital or food distribution center for refugees need help? Start with them. You can work with big organizations like the CDC and WHO, but this is the worst time to start trying to get the attention of the big organizations as any time spent bringing you up-to-speed is time they are not responding to the outbreak. In any case, most of these large organizations would be directing you to a local implementing partner.

The non-operational organizations are typically small and use disasters as funding and publicity opportunities. Look for them talking about "partnerships" with bigger organizations like the WHO, but nowhere saying that they are an "implementing partner." This is typically code for "not actually part of the response." If they reach out to you, then the chances are that you are the product, and they are telling potential funders "look, we have volunteer data scientists from company X, and we will beat COVID with innovation."

As a rule of thumb, if it's not a national organization that you already know about (like the CDC or equivalent in your country), and it's not one of the first 30 UN Agencies in their list of Funds, Programmes, Specialized Agencies and Others, then look for organizations that you know are operating in your local area.


8. Don't start any work that you can't support as long as needed


I've never had trouble recruiting people at the start of a disaster, but I've always had trouble recruiting people who can help for a meaningful length of time. If you're writing code, building models, or writing documentation now, can you ensure that you will be able to support that in 3 months or 6 months?

Keep in mind that you might get ill yourself or have to look after others. If you or someone you are a caregiver for is more likely to get a worse case of COVID-19, then you should not be putting yourself on the critical path for a response if you are not already an essential worker. Furthermore, you need to be highly empathetic, but dispassionate, to be an effective disaster responder. If you are caught up in worry about yourself and your loved ones, then you are probably coming from a place of personal passion and will have trouble acting with objective empathy. I can't trust anyone in that situation, and so I always put people like this onto non-critical tasks in disasters.


9. Don't amplify any fake media


There are a number of fake media narratives that appear in every disaster. The most destructive are ones that target the response organizations for not doing the right thing. Even popular media outlets do this: they find one small part of the response where one organization has not recently done any work or where there are policies that disagree with other organizations. No matter how small the problem might be, it is easy for a media organization to present this as "potentially endangering millions of lives" and to get people on both sides of the argument to comment. Essentially, they invent controversy when none should exist.

The favorite targets for quotes are politicians who are not in power in a country, because those politicians will blame the party in power, and technology mavericks in areas like data science, because that is where confidence often outpaces competence.

Journalists know that these kinds of articles are grossly unethical, and they avoid putting their names to them. So, look for news articles that don't have a named author or that are written by invited authors from data science or opposition political parties.

The worst part of this narrative is messages like "don't trust the WHO" or "don't trust the CDC," etc. Even if the criticism on the specific issue being argued about is correct, the broader story about distrusting these organizations will do more harm than addressing this one issue will do good.


10. Don't release people's personal information


Most of the world's governments will have at least some people in those governments now trying to implement measures to take away your civil liberties. Specific to coronaviruses, I spoke about this at KDD last year, sharing how the company I ran during the MERS Coronavirus outbreak decided not to help with social media analysis because of the implications for people's privacy:

Starting from 1:13 (https://www.youtube.com/watch?v=4ll77xuYszc&feature=youtu.be&t=4380), at KDD in 2019, I talked about how the Saudi Arabian Government used a coronavirus outbreak (MERS-CoV) as the pretext to identify dissidents on social media.

The same will be true for many criminals. While crime tends to go down during disasters, because people are overwhelmingly good, there are some people who thrive in the chaos to exploit people.

So, related to the very first point in this article, look out for your family. Elderly people are specifically targeted during disasters in scams to take their money by releasing their personal identities. Children are often targeted by sexual predators, and so be especially careful about any data source, even if it appears open. For example, National Geographic published the phone numbers of children in Haiti following the 2010 earthquake there.

In general, there should be no need to make any data public, and you should be careful about even reporting exactly how you are responding to the disaster. If validation for your contribution is important, then I recommend getting that privately or after the response is over.
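Even when data is only being passed privately to a responding organization, a quick automated check for obvious identifiers can be a useful first pass before a person reviews it. The sketch below is a minimal example under stated assumptions: the two regular expressions and the function names are hypothetical, the patterns are deliberately broad and will still miss many real phone and email formats, and a check like this is not a substitute for a proper privacy review.

```python
import re

# Deliberately broad, illustrative patterns; they will miss many real formats
# and are NOT sufficient to anonymize a dataset on their own.
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{6,}\d")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_personal_info(text):
    """Return a list of (kind, matched_text) pairs for suspected identifiers."""
    findings = [("phone", m.group()) for m in PHONE_PATTERN.finditer(text)]
    findings += [("email", m.group()) for m in EMAIL_PATTERN.finditer(text)]
    return findings


def redact(text, placeholder="[REDACTED]"):
    """Replace suspected phone numbers and email addresses with a placeholder."""
    text = PHONE_PATTERN.sub(placeholder, text)
    return EMAIL_PATTERN.sub(placeholder, text)


# Fictional example message; the number and address are not real.
message = "Need water, call +1 555 010 0199 or write help@example.org"
print(find_personal_info(message))
print(redact(message))
```

Names, addresses, and free-text descriptions that identify a person are much harder to catch than phone numbers, so anything flagged or unflagged by a script like this still needs human judgment before the data goes anywhere.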


What if I can't do anything to help?


If there is nothing that you can do to help right now, then I recommend longer-term actions:

  1. Prepare to support the people around you. In most countries, deaths are going to go up rapidly in the next few weeks. If you don't lose someone that you know, a friend of yours almost certainly will. Be there for them. They are not going to get the emotional support that they would typically get from healthcare workers because they are so overworked. You can fill this gap.
  2. Prepare yourself for fatigue. Disaster Fatigue (also known as Compassion Fatigue) is the mental and emotional exhaustion that most people will experience after several weeks or months of a changed lifestyle due to COVID-19. I am less worried about human behavior at the height of the deaths in the USA than in the following 2–3 weeks as people get fatigued. I've seen seasoned disaster response professionals snap more in this period than at any other time. This is when you will need to find those extra reserves to be strong for yourself and for those around you.
  3. Help with disaster response research in the future. There are always disasters happening, and most have little or no media attention. I built the first datasets for disaster response at Stanford at the same time that ImageNet was being built, and it is safe to say that my datasets have only reached 1% of the people that ImageNet has, which has been disappointing. Like any other technology or science, it takes months and sometimes years to advance our approaches, and so that work is best done when not also responding to a disaster.

Thank you for helping with the response!

Original. Reposted with permission.


Bio: Robert Munro worked in refugee camps for the UN in West Africa before his PhD at Stanford that focused on machine learning in health and disaster response. He helped respond to the recent Ebola outbreak in West Africa, the MERS-coronavirus outbreak 10 years ago, and was CTO of a global epidemic tracking organization. Robert also ran AWS's first NLP service, Amazon Comprehend, and has worked as a leader in many Silicon Valley technology companies.