Should Data Scientists Model COVID-19 and Other Biological Events?





The role of a data scientist has been expanding for years. It went from crunching numbers with statistics, to building scalable databases, to building production-ready machine learning and deep learning models. Biostatistics and epidemiology, by contrast, are highly specialized fields, and universities offer separate degrees for them.

Biostatisticians use statistical techniques that your everyday data scientist has probably never heard of. This is a great example of how a lack of domain knowledge exposes you as someone who does not know what they are doing and is merely hopping on a trend. It is common in the community to begin your data science journey by building a prediction model of who was more likely to survive the Titanic, or by classifying iris plants. More caution is called for with serious matters such as a global pandemic that is killing hundreds of thousands, and potentially millions, of people.

 

Epidemiologists and Biostatisticians

 

Epidemiology is the study of the frequency and distribution of diseases within human populations and environments. It is an important part of public health because it deals with understanding a disease in the population and assessing its risk. Epidemiologists typically have a strong science background in areas such as biology, medicine and virology. This is how an epidemiologist builds the domain knowledge needed to actually understand what they are modelling.

Biostatistics is the application of statistical techniques to scientific research in health-related fields, including medicine, epidemiology, and public health. Someone who has a degree in statistics could probably become an analyst with data from retail, demographics, real estate, economics, finance etc. The biological sciences are a whole different space and require a separate qualification altogether.

Now a data scientist could come from a non-statistical, non-mathematical background and suddenly start modelling disease data to show off their skills. This is not the right type of data with which to show your knowledge. It is up to each individual to know whether they have the ability to handle the data properly. So much false and misleading content has been published that it further stains the profession of data science. It shows that there are still people who are ignorant of the data and only care about running a random forest or XGBoost model in Python instead of R (because R is apparently not as cool to some people as it once was), then promoting it on LinkedIn in the hope that a recruiter or a senior data scientist will be impressed.

"COVID-19 predictions, Dunning-Kruger Effect and the Hippocratic Oath of a Data Scientist" by Raj Iqbal sums it up perfectly. In layman's terms, the Dunning-Kruger effect is when someone overestimates their abilities when in reality they have very little ability to accomplish the task.

 

Forecasting COVID-19

 

The following was taken from here.

Forecasting and time series expert Rob J Hyndman has said that how accurately something can be forecast depends on three main factors:

  1. how well we understand the factors that contribute to it;
  2. how much data is available;
  3. whether the forecasts can affect the thing we are trying to forecast.

For example, forecasts of tomorrow’s stock prices are much less accurate because factors 1 and 3 above are not satisfied. First, the factors that contribute to changes in stock prices are not particularly well understood and depend at least partly on human psychology. Second, well-publicised forecasts of the stock market can directly affect the behaviour of many investors.

Not all three factors map neatly onto diseases, but factor 2 is clearly a problem: the available data underestimate the actual number of cases. Factor 3 is a problem as well, because COVID-19 forecasts can affect the thing we are trying to forecast: governments are reacting, some better than others. A simple model using the available data will be misleading unless it can incorporate the various steps being taken to slow transmission.

He and other scientists use compartmental epidemiological models to model an infection process. The simplest models are based on classifying living individuals in the population as Susceptible, Infectious or Recovered – hence they are called SIR models.
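To make the idea of compartments concrete, here is a minimal sketch of the textbook SIR model, integrated with a simple fixed-step Euler scheme. It is not the model used by Hyndman or any other forecaster; the function name `simulate_sir` and the parameter values (`beta`, `gamma`, population size) are hypothetical and chosen only for illustration. It also assumes a constant transmission rate and perfectly observed cases, which are exactly the assumptions that break down with COVID-19.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the basic SIR equations with a fixed-step Euler scheme.

    beta  : transmission rate per day (assumed constant, i.e. no interventions)
    gamma : recovery rate per day (1 / average infectious period)
    Returns a list of (S, I, R) tuples, one per day, including day 0.
    """
    dt = 1.0 / steps_per_day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # People move from Susceptible to Infectious, then to Recovered.
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical values for illustration only (beta / gamma = 3).
    trajectory = simulate_sir(population=1_000_000, initial_infected=10,
                              beta=0.3, gamma=0.1, days=200)
    peak_day, (_, peak_infectious, _) = max(
        enumerate(trajectory), key=lambda item: item[1][1])
    print(f"Peak of about {peak_infectious:,.0f} infectious on day {peak_day}")
```

Writing these few lines is the easy part. Choosing realistic values for `beta` and `gamma`, accounting for under-reported cases, and letting the transmission rate change as governments react is where the epidemiological expertise comes in.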

 

Conclusion

 
While a data scientist is tasked with analysing data to bring us insights, it is our responsibility to realize where our talents are to be used and when to take a step back and let the actual experts lead the charge. Infectious disease modelling is just too much of a specialized and sensitive area to blindly give your two cents. We need to be aware of the situations when we are needed and when we are not needed.
