Outbreak Analytics: Data Science Strategies for a Novel Problem
You walk down one aisle of the grocery store to get your favorite cereal. On the dairy aisle, someone sick from COVID19 coughs. Did your decision to grab your cereal before your milk possibly keep you healthy? How can these unpredictable, nearrandom choices be included in complex models?
By Susan Sivek, Alteryx.
You walk down one aisle of the grocery store to get your favorite cereal. On the dairy aisle, someone sick from COVID19 coughs.
Did your decision to grab your cereal before your milk possibly keep you healthy?
It's the kind of question we're all asking as we make simple daily decisions during this global pandemic. But imagine now that it's your job to create a model for the spread of COVID19 that accounts for humans' unpredictable, nearrandom choices. Your model also could include government mandates for social distancing, hospital care availability, preexisting conditions among a population, and more.
Sound complicated? It sure is, but researchers in a variety of fields are working to create the most accurate, useful models possible to predict and explain the spread of COVID19. I'm not an epidemiologist, but if you too have been impressed and intrigued by the data visualizations and modeling publicized so far, this post is for you.
I explored some recent research to learn a little bit about how these researchers go about what some call outbreak analytics. Here's an introductory look at just a few components of this unique analytics process. We'll find some lessons for other analytics applications, too.
Estimating Virus Transmission
"Once we know the R_{0}, we'll be able to get a handle on the scale of the epidemic."
 the movie Contagion
The reproduction number, or R_{0} (pronounced "R naught"), refers to the number of people someone who is ill is likely to infect. The R_{0} changes in every time and place that a disease occurs due to factors like demographics, climate, social structures, and social distancing measures. Researchers calculate the R_{0} by determining the time between a first infection and a second infection (the generation time, which can then be charted as a generation time distribution when this time is known for many pairs of sick people). Estimating the R_{0} is critical to predicting a disease's spread. For SARSCoV2, which causes COVID19, the R_{0} is currently thought to vary between 1.5 to 3.5 in different places.
An R package called R0 (available here) can calculate the current R_{0} using outbreak data. Offering five different methods for calculation, it also has an option for sensitivity analysis that shows which selection of time window or generation time provided the best fit to the data.
There's also a similar R package, EpiEstim, developed by a different group of researchers that has an Excel option. EpiEstim is based on a branching process model and estimates the R_{0} based on simple time series data. This model tries to capture the number of people each infected person will infect, but with an element of randomness (or stochasticity)  like a random encounter in the grocery store with an infected person. The chart below (part of a larger graphic) shows the estimates of R_{0} generated with this model for five past outbreaks.
Sequencing Pathogen Genomes
"It shows novel characteristics. … It's Godzilla, King Kong and Frankenstein all in one."
 Contagion
Genetic analysis of SARSCoV2 samples from patients who become ill in different places and at different times can help researchers track the spread and mutation of the virus. This analysis can also help the quick identification of possible treatments. Researchers recently demonstrated a new machine learning approach to identifying the type of an unknown virus and its various strains based on its genome (i.e., figuring out its taxonomic classification, like the broad category of "coronavirus").
This approach converted the SARSCoV2 genomic sequence into a numerical representation (see the full research paper for details). The sequences of nearly 15,000 other viruses were also used in the training data. The researchers trained six different machine learning models (Linear Discriminant, Linear SVM, Quadratic SVM, Fine KNN, Subspace Discriminant, and Subspace KNN). The trained models then output their predictions for labels at the highest taxonomic level for the genomes of the virus strains causing COVID19. The researchers then shifted the models to focus on the next, more specific taxonomic level and repeated the process again.
The graphic below shows the researchers' results from their last two tests, which classified 153 viral sequences into four subgenera and COVID19, and then classified 76 viral sequences either as other Sarbecovirus types or as COVID19.
This strategy helped identify not only that SARSCoV2 should be correctly classified with other Coronaviridae and Betacoronavirus pathogens, but also that it has important similarities to other viruses found among bats. The researchers argue that their approach is faster (under 10 minutes, including 10fold crossvalidation) and can compare more, and more diverse, samples than previous analytic processes. While there may be insights here for the current pandemic, this kind of approach could be helpful in future outbreaks.
Predicting Disease Spread Despite Uncertainty
"And from there, using our model, based on the R_{0} of 3.2 … here is where we expect to be in 48 hours."
 Contagion
One epidemiological modeling approach is to create a "SIR model," which divides the population as a whole into "compartments": those who are susceptible to the disease, infected, and removed (either recovered from the illness and presumably granted some degree of immunity, or no longer in the population because they died).
However, generating such a model is never easy. Outbreak data can be impossible to gather accurately for many reasons, including institutional barriers, lack of testing, unknown or asymptomatic cases, and so forth. And, as we've seen, governments have implemented varied social distancing and isolation measures at different times, which can have unpredictable results on the number of people in that "susceptible" compartment.
To deal with all the sources of uncertainty, one group of researchers developed what they call an "eSIR" model  an extended model including "a timevarying probability that a susceptible person meets an infected person or vice versa," as well as a new compartment to include people who are susceptible but choose selfquarantine. Both of these factors would vary in specific regions based on when isolation protocols were implemented.
To further incorporate uncertainty into the model, the researchers used a Markov chain Monte Carlo (MCMC) algorithm. (Here are two explanations for MCMC: one simpler, one more complex.) The MCMC approach allows for the approximation of a distribution that can't be directly known  like the true number of SARSCoV2 infections  or that is too expensive to figure out. The eSIR model's forecasts aim to reveal "turning points" in the outbreak. Turning points include when the daily number of new cases stops growing, and when the number of infected cases reaches its highest point. The model can also provide an estimate of the R_{0}.
The researchers are distributing an R package called eSIR that generates the models, ggplot2 objects, and summary statistics. What's useful about this approach is that it can help determine which quarantine strategies might be most useful and when. As the researchers say, "... too strict quarantine can cause backfire; people may lose their trust and patience in their committed system, and consequently may try to reduce compliance or even avoid quarantine." The risk of implementing strict quarantine systems for prolonged periods has to be weighed against the gains in disease prevention. This model provides one approach to that important calculation.
Takeaways for All Modeling
What lessons can we learn from these stillpreliminary models that extend beyond the pandemic? There are a few key items.
First, these models show rapid innovation to address an extremely complex situation. (The genomic and 'eSIR' studies described above are still "preprints," meaning they have not yet been peerreviewed and have been released as quickly as possible to contribute to the scientific community's effort to address the pandemic.) It's truly impressive to see the creativity that researchers are applying to this crisis with such speed despite great stress, and it's inspiring for everyone as we seek solutions to many new challenges.
Second, another challenge that may be familiar to data people: getting decisionmakers to take action based on insights from data. The great "flattening the curve" visualization and related forecasting seem to have had a strong impact on policymakers and the general public. Likewise, data experts need to be able to communicate about their analyses and models clearly  for example, with effective data visualizations  and organizations should build data literacy across all areas.
Finally, the research I reviewed frequently mentioned the challenge of getting quality COVID19 data upon which to build good models. Even in normal times, it can be hard to get the kind and quality of data we need. Models for aspects of the pandemic  or any other phenomenon  are only as good as the data upon which they are built. Data abounds everywhere, but not all are trustworthy, usable, or relevant. Every organization needs solid data gathering, management, and analytics structures in place (as these epidemiologists recommend for outbreaks). With that preparation, if a crisis of any kind emerges, your responses can be informed by accurate, relevant, uptodate data readily at hand.
Original. Reposted with permission.
Bio: Susan Currie Sivek, Ph.D., is a writer and data geek who enjoys figuring out how to explain complicated ideas in everyday language, sometimes in silly ways. She appreciates good food, science fiction, and dogs.
Related:
Top Stories Past 30 Days

