FluDemic – using AI and Machine Learning to get ahead of disease

We are amidst a healthcare data explosion. AI/ML will be more vital than ever in the prevention and handling of future pandemics. Here, we walk you through the different facets of modeling infectious diseases, focusing on influenza and COVID-19.

By DataDriven Health, AI Technology firm, transforming population health and syndromic surveillance.

2314 exabytes. This is the volume of healthcare data estimated to have been generated in 2020, according to the World Economic Forum. To put things in perspective, if one gigabyte is the size of Earth, then an exabyte is the size of the sun.


It is imperative that we leverage the power of Machine Learning to analyze the sea of data and gain meaningful insights to help improve public health. The COVID-19 pandemic has caused a global catastrophe, a devastating loss of human life, and unprecedented socioeconomic disruptions. But what if we could have gotten ahead of this spread and stopped the surge before it even happened?


Our Motivation


The coronavirus outbreak has put the modeling of infectious diseases in the spotlight. This is exactly where FluDemic comes in. Our goal is to help governments, health system administrators, community leaders, and the general public make proactive, data-driven decisions. Our experienced data scientists and analysts have worked closely together to study the patterns of spread, understand various socioeconomic impacts, and create predictive models using proprietary machine learning algorithms. FluDemic currently provides a platform to track and predict COVID-19 and Influenza-Like Illnesses (ILI).


How do we add value?


There are essentially three ways FluDemic is making a difference:

1. Disease Tracking:

For COVID-19, FluDemic tracks several key metrics such as cases, deaths, testing data, vaccination rates, and hospitalization rates. For influenza, it tracks metrics such as the percentage of outpatient visits attributable to ILI, pneumonia and influenza deaths, positive test rates, and ILI activity levels.

Metric bar for COVID-19.

Metric bar for Influenza.

In addition to the raw values, we also provide a standardized view of time- and population-adjusted metrics wherever pertinent, so our audience gets a true picture of what’s actually going on. The metrics are tracked across two dimensions:

  • Space: for a geographical view across counties, states, and countries. This visualization shows which areas are tackling the disease most effectively.
  • Time: for a temporal view of the outbreak. This helps assess the effectiveness of key policy decisions. It also captures the trend of the disease, whether it is on the rise or falling.
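The population adjustment mentioned above can be illustrated with a minimal sketch. The article does not say how FluDemic computes its adjusted metrics, so the per-100,000 scaling, the function name `per_100k`, and the region figures below are all assumptions chosen for illustration.

```python
def per_100k(count, population):
    """Scale a raw count to a rate per 100,000 residents (an assumed convention)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical regions: (name, new cases this week, population).
regions = [
    ("Region A", 500, 1_000_000),
    ("Region B", 100, 50_000),
]

rates = {name: per_100k(cases, pop) for name, cases, pop in regions}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} cases per 100k")
```

In this made-up example, Region A reports five times as many raw cases as Region B, yet Region B has the higher rate once population is taken into account, which is why an adjusted view can change which area looks worse.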

2. Hotspot detection and prediction:

Monitoring the spatial distribution of diseases and identifying where most cases are concentrated provide critical information to public health officials. Predicting these hotspots one to two weeks in advance enables officials to allocate scarce resources such as PPE, hospital beds, and vaccines efficiently and effectively to the areas with the greatest need. FluDemic targets multiple populations and predicts which geographies are most susceptible to future surges, outbreaks, and socioeconomic risk of spread.

7-day prediction of population-adjusted COVID-19 cases for the state of New York.

3. Community awareness and impact:

Since the spread of disease ultimately depends on the public and on whether people adhere to policies such as social distancing and mask mandates, it is essential that they understand the potential consequences of their actions. The mobility data indicates movement trends over time by geography across different categories of places, such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential. Additionally, the FluDemic socioeconomic risk metrics provide a qualitative view of the risk of infection and death based on location.

It would be naive to call the coronavirus pandemic just a health crisis. According to the World Health Organization (WHO), “Nearly half of the world’s 3.3 billion global workforce is at risk of losing their livelihoods.” FluDemic sheds some light on this by tracking unemployment rates. Numerous businesses have gone bankrupt, and millions more face an existential threat. These repercussions are captured by the Composite Leading Indicator (CLI), which shows fluctuations in economic activity, and Confidence Indexes, which represent national business outlooks and household spending.

Check out this short FluDemic tutorial if you are interested in learning more about the site:


Data and Modeling


There is a great deal of diversity in the data feeding FluDemic. For COVID-19, several sources provide key data on cases, deaths, testing, hospitalizations, mobility, unemployment, confidence indexes, social distancing measures, and vaccinations. For influenza, data is sourced to display measures such as Influenza-Like Illness (ILI), pneumonia and influenza deaths, vaccination rates, and testing. Please check the About section of FluDemic for details of the data sources used.

Despite the variety of available information, modeling epidemiological processes is challenging for several reasons. Privacy and confidentiality issues make it difficult to distribute individual-level data publicly, so much of the available information is at the population level. Essential information, such as a person’s pre-existing conditions, cannot be accounted for in public models. When dealing with large-scale data, we need to acknowledge that the collection process and the quality of data vary greatly from state to state. For instance, influenza surveillance is a voluntary process, and each state reports its information with a different degree of completeness and with different delays. Such variation in the reporting regime tends to obscure finer regional differences.

The modeling of a modern epidemic is also complicated by the realities of a highly interconnected world, where frequent long-range travel can elevate an epidemic to a global phenomenon in a matter of weeks. It is not enough to consider only local transmission; the models must incorporate the connectivity of populations. On the other hand, we now have more and better data than ever before to quantify such behavior.

We consider the progression of infectious diseases in four stages: Exposure, Infection, Hospitalization, and Fatality (e.g., estimating individual risks of COVID-19-associated hospitalization and death using publicly available data). The process begins with exposure, when one or more uninfected individuals are in proximity to an infected person. Although exposure can occur at home or at a grocery store, the risk is compounded when unprotected groups of people are in close proximity for a significant period of time. Dining indoors at bars and restaurants or participating in live events can expose a large number of people to the disease simultaneously, even if only a very small fraction of the participants are infected. Following exposure, each infectious disease is characterized by an incubation period during which the virus infects the body and multiplies. After this period, symptoms appear, and the individual is likely to be tested and a case recorded. The modeling challenge involves identifying the statistical distribution of the incubation delay and incorporating it into models that track exposure-inducing circumstances.

In a fraction of the infected population, health deteriorates severely, leading to hospitalization and occasionally death. However, both hospitalization and subsequent death occur after time delays that vary significantly from individual to individual and are often associated with the patient’s pre-existing conditions and with access to, and the quality of, therapeutic care. To incorporate this into our population-level models, we use time-delay distributions and their convolutions to track the progression from infection to death. We also note that for a novel disease like COVID-19, treatment regimens evolve over the life of the epidemic, and so does the case fatality rate. Access to quality care suffers when hospital beds, especially ICU beds and ventilators, become scarce during surges. Despite these complexities, knowing the current state of cases and hospitalizations gives us a window into the future of expected fatalities.
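To make the idea of time-delay distributions and their convolutions concrete, here is a minimal sketch in plain Python. It is not FluDemic's model: the delay distributions, the case series, and the fatality fraction are hypothetical values chosen only so the arithmetic is easy to follow.

```python
def convolve(a, b):
    """Discrete convolution of two sequences (pure Python, no dependencies)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


# Hypothetical delay distributions: element k is the probability of a k-day delay.
infection_to_hospital = [0.0, 0.5, 0.5]
hospital_to_death = [0.0, 1.0]

# The delay across two consecutive stages is the convolution of the stage delays.
infection_to_death = convolve(infection_to_hospital, hospital_to_death)

# Hypothetical daily case counts and an illustrative fatality fraction.
daily_cases = [100, 0, 0]
fatality_fraction = 0.02

# Spread each day's cases forward in time, then scale by the fatality fraction.
expected_deaths = [
    fatality_fraction * x for x in convolve(daily_cases, infection_to_death)
]
print(expected_deaths)
```

Convolving the case series with the combined delay distribution shifts and spreads today's cases into future days, which is what lets current case counts act as a leading indicator for later fatalities. A realistic model would estimate the delay distributions from data and let them change over time.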

Prediction of population-adjusted 7-day rolling average cases and deaths per day for the state of New York.

Hotspots are defined as counties with an abnormally high number of daily cases or deaths after accounting for expected variation. Sequence time-series models coupled with time-delayed regressors are used to predict the trend of daily cases and deaths for each county. The resulting predictions are then scaled by population and smoothed using a seven-day rolling average, letting us identify emerging hotspots.
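The post-processing steps described above (population scaling, seven-day smoothing, and flagging abnormally high counties) can be sketched as follows. The forecasting models themselves are proprietary and are not reproduced here. The z-score rule, the threshold of 2.0, and the county figures are assumptions made for illustration, not the article's actual definition of "expected variation".

```python
from statistics import mean, pstdev


def rolling_mean(values, window=7):
    """Trailing rolling average; the first window-1 days produce no value."""
    return [
        mean(values[i - window + 1 : i + 1]) for i in range(window - 1, len(values))
    ]


def flag_hotspots(latest_rates, z_threshold=2.0):
    """Flag counties whose rate lies more than z_threshold standard
    deviations above the cross-county mean (an assumed rule)."""
    mu = mean(latest_rates.values())
    sigma = pstdev(latest_rates.values())
    if sigma == 0:
        return []
    return [c for c, r in latest_rates.items() if (r - mu) / sigma > z_threshold]


# Hypothetical predicted daily cases per county: (population, 7 daily values).
predicted = {
    "County A": (100_000, [10] * 7),
    "County B": (100_000, [10] * 7),
    "County C": (100_000, [10] * 7),
    "County D": (100_000, [10] * 7),
    "County E": (100_000, [10] * 7),
    "County F": (100_000, [100] * 7),
}

latest = {}
for county, (population, daily) in predicted.items():
    per_capita = [d * 100_000 / population for d in daily]  # cases per 100k
    latest[county] = rolling_mean(per_capita)[-1]  # most recent smoothed value

hotspots = flag_hotspots(latest)
print(hotspots)
```

A simple cross-county z-score is sensitive to the number of counties and to skewed distributions, so a production system would likely use a more robust baseline, such as each county's own history.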

Variation of Mortality Risk across the United States.

To connect the cases and deaths in a particular county to the underlying demographic and socioeconomic factors, we compute the risk factors associated with the population-scaled measures of infections (morbidity) and deaths (mortality). The primary demographic and socioeconomic factors include the population and population density of a county, along with the age, income, and household-size distributions. During an epidemic, behavioral factors such as measures of mobility and mask usage also inform the risk model.

The many factors associated with socioeconomic risk are not independent of each other: each provides some independent information while also reflecting common patterns. We separate this information into independent combinations using a variety of techniques such as Principal Component Analysis, which also reduces the number of independent model parameters and makes the risk estimates more robust (e.g., socioeconomic status and cardiovascular health in the COVID-19 pandemic).
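As a rough illustration of how Principal Component Analysis separates correlated factors, the sketch below uses NumPy's singular value decomposition on synthetic data. The three columns (labeled density, income, and household size) are randomly generated stand-ins, not real county data, and the helper `pca` is a hypothetical name rather than part of FluDemic.

```python
import numpy as np


def pca(X, n_components):
    """Project the standardized columns of X onto the top principal components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)  # share of variance per component
    scores = Z @ Vt[:n_components].T
    return scores, explained[:n_components]


# Synthetic factors: "income" is built to correlate strongly with "density",
# while "household" is independent of both.
rng = np.random.default_rng(0)
n = 200
density = rng.normal(size=n)
income = 2.0 * density + 0.1 * rng.normal(size=n)
household = rng.normal(size=n)
X = np.column_stack([density, income, household])

scores, explained = pca(X, n_components=2)
print("explained variance share:", explained)
```

Because two of the three synthetic factors carry nearly the same information, two components capture almost all of the variance, and the resulting scores are uncorrelated with each other. This is the sense in which PCA reduces the number of independent model parameters.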

A second important aspect of modeling risk is the inherent non-linearity in the contribution of the terms. Consider the prevalence of mask usage, mobility, and population density. Independently, each of these terms contributes to the socioeconomic risk. In combination, however, their effect is much stronger. A county with a large population density and high mobility would see a far greater impact of mask usage than a sparsely populated county or a county where people have tended to remain at home. These nonlinear effects are accounted for in our models by using sequential polynomial regressions with all cross terms.
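The role of cross terms can be shown with a small least-squares sketch. It fits a single degree-2 polynomial with all cross terms on synthetic data. The "sequential" part of the approach is not described in enough detail to reproduce, so it is left out. The factor columns, the coefficients, and the helper `poly2_features` are all hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np


def poly2_features(X):
    """Build columns for the intercept, each factor, and every degree-2
    product (squares and cross terms)."""
    n, d = X.shape
    cols = [np.ones(n)] + [X[:, j] for j in range(d)]
    names = ["1"] + [f"x{j}" for j in range(d)]
    for i, j in combinations_with_replacement(range(d), 2):
        cols.append(X[:, i] * X[:, j])
        names.append(f"x{i}*x{j}")
    return np.column_stack(cols), names


# Synthetic factors in [0, 1]: x0 ~ density, x1 ~ mobility, x2 ~ mask usage.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))

# Made-up risk with one interaction: density and mobility reinforce each other.
y = 1.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.4 * X[:, 2] + 2.0 * X[:, 0] * X[:, 1]

A, names = poly2_features(X)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = dict(zip(names, coef))

for name in names:
    print(f"{name:>6}: {fitted[name]: .3f}")
```

On this noise-free data the fit recovers the density-by-mobility interaction, which a purely additive model would miss. With real, noisy data and more factors, the number of cross terms grows quickly, which is one reason to reduce the inputs first.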


What’s Next?


Publicly available data is often plagued by reporting errors, significant delays, and coarse geographic granularity. This limits the accuracy and actionability of the models. The Premium version of FluDemic focuses on utilizing the sea of clinical-level data provided by health systems. The data is anonymized, aggregated, and fed into the ML models, which in turn provide greater accuracy. Key improvements that make this solution more actionable:

  1. The data is real-time/near real-time.
  2. The data provides insight into the patient population: age, gender, and co-morbidities. The ML models provide more specificity for different cohorts, which is key for resource mobilization and targeted messaging.
  3. The geographic granularity is at the census tract and block group level.
  4. Hospitalization data is available at the institution level, whereas publicly available sources report it only at the state level. This is especially important for resource mobilization.
  5. Case counts are derived from “gold source” systems, such as confirmed lab test results or prescriptions for COVID-19 and Influenza.
  6. Genomic sequencing will accelerate the development of a viral genomic surveillance network, which will predict and alert stakeholders of future surges, assess threats presented by new variants, and prepare us for the inevitable pandemics in our future. This model will enhance the monitoring and forecasting of other current infectious diseases that plague us each year, such as influenza.

In addition to COVID-19 and influenza, the ML models will be implemented in different therapeutic areas such as diabetes, cancer, COPD, and CHF.

Visit FluDemic


Bio: By the team at Data Driven Health, an AI Technology Company Focused on Transforming Population Health and Syndromic Surveillance.

