Gold BlogWhat’s wrong with the approach to Data Science?

The job ‘Data Scientist’ has been around for decades, it was just not called “Data Scientist”. Statisticians have used their knowledge and skills using machine learning techniques such as Logistic Regression and Random Forest for prediction and insights for longer than people actually realize.



Data science is the application of statistics, programming and domain knowledge to generate insights into a problem that needs to be solved. The Harvard Business Review said Data Scientist is the sexiest job of the 21st century. How often has that article been referenced to convince people?

The job ‘Data Scientist’ has been around for decades, it was just not called “Data Scientist”. Statisticians have used their knowledge and skills using machine learning techniques such as Logistic Regression and Random Forest for prediction and insights for decades. Those same statisticians were also likely very knowledgeable in Linear Algebra and Calculus. Statisticians were even responsible for one of the greatest gifts to Data Science — R. Ross Ihaka and Robert Gentleman are the two statisticians responsible for giving us the R language that gives us the ability to conduct complex analysis with a few lines of code. See this paper: R: A Language for Data Analysis and Graphics by Ross Ihaka and Robert Gentleman.

These days analysts infer and predict we are seeing a shortfall of millions of data scientists over the next few years, universities are introducing new programs and potential students are racing to become the next cohort of data scientists that are going to fill that gap and get that $100,000 pay packet with the №1 job in America (that is not reality especially for beginners although there can be outliers). While a university education would be adequate in most of the cases, a lot of people are opting for MOOC’s to support their career path into Data Science. Coursera, edX, Data Camp and Data Quest are some of the most famous MOOC’s that people turn to and trust. The problem here is that the people taking these courses might not be qualified in the basics of statistics to at least understand what is being taught.

Some people are forgetting that Data Science or whatever the name it gets is based on years of hard work, learning and passion to do the job. The statisticians of past were not glamoured like “Data Scientists” are today. The profession is at risk of being cheapened to a job that someone could get started with by paying less than $100 online. I am all for more access to education and online learning but the profession should not be cheapened.


What does a Data Scientist need to consider on a machine learning project?




This is where a person can use automated technology to build models but it still fails without the human intuition to guide it and really dig into what is wrong and what is right.

Data Preprocessing
Handling outliers, missing values, encoding categorical variables, binning, type conversions, incorrect spelling duplicate rows, class imbalances etc. cannot all be handled exclusively with automated technology. According to media and current data scientists, data preprocessing takes up to 80% of a data scientist's job. While automation can help to a certain extent, the inference from Exploratory Data Analysis should be made by a person with industry knowledge or someone with an adequate grasp of statistics.

Hyperparameter Tuning
Machine learning packages come with models that have default hyperparameters set for it to be trained and tested. One can simply change the arguments to tune the model. However, just changing arguments without exploring the result of the changes is a waste of time. Changing a hyperparameter could lead to a model being over-fitted, under-fitted, biased etc. Different models will have different ways of being explored as well. For example when tuning a decision tree, it would be best to plot the tree to see the results of tuning because by changing the complexity parameter, you may prune the tree to an extent that you underfit/overfit your model.

Imbalanced Data
This is a problem with all real world data. Class imbalances are not an abnormality but actually something that needs to be accommodated/accounted for in the modelling and evaluation stage. Either they need to resample the data or change the threshold probability to predict each class (i.e. decrease the threshold probability to predict the minor class).



While there are many people that do make the transition successfully, many do not see transitioning into a doctor, engineer or lawyer so willfully. The profession of data science should not be diluted to a point where we say anyone can be a data scientist. Everyone should have the opportunity to join any profession of their choice but to say anyone can do it just dilutes the profession into something that would not require the amount of work an undergraduate, post graduate or doctoral candidate would put into becoming a Data Scientist.