Automation in Data Science Workflows
Will data science, known for replacing innately iterative work with automation, become automated? Will data scientists’ jobs be automated too?
Machine learning solutions have already automated a large part of how the world used to operate and are now turning to their own inefficiencies. So yes, the data science world is not immune to automation and is seeing core machine learning engineering processes automated to enable smoother and faster development.
Think of the times when multiple steps – from data integration to model training, selection, and deployment – were done manually. Each step is very rigorous and requires extensive effort from data scientists. Inarguably, automation becomes highly valued in helping data scientists through end-to-end modeling and deployment processes.
Automated Machine Learning (AutoML) significantly boosts the developer’s productivity, allowing them to focus on the key modeling areas that require their time and attention.
Before we assess the pros and cons of AutoML, let us first look at how the data science world functioned before machine learning processes were automated, to better understand its value proposition.
Automation Over Manual Efforts – A Win-Win for Organizations and the Data Science Community
AutoML is often seen as replacing data scientists' work, but it is rather an enabler for building better models faster. Data scientists still do many things manually, and these tasks pose challenges to machine learning implementation. Ryohei Fujimaki, the CEO of dotData, explains:
It's critical for organizations not to view automation as a "replacement" for data scientists but instead as a tool of the trade. We've found that many enterprises now divide the feature engineering process out of the data science organization and into dedicated groups that focus on feature discovery. Regardless of the setup, providing automation tools and platforms to make the data scientist's job easier should be the focus.
– Ryohei Fujimaki, the CEO of dotData
One of the most crucial and time-consuming steps of a machine learning pipeline is analyzing the data and verifying that it is of good quality. Any failure or lapse in attention to detail at this step can be costly, which calls for a skilled data analyst to set the foundations right.
Besides data analysis, data cleaning and feature engineering give the model a significant lift, helping it learn the underlying phenomenon much faster. The caveat is that these skills are built over time. So, instead of waiting to build the right team and skills to sift through huge datasets for patterns and generate valuable insights, organizations can automate machine learning workflows to lower the barriers to building models.
Put simply, it helps enterprises quickly scale their machine learning initiatives by enabling non-technical experts to leverage such sophisticated algorithms. Automation can help improve model accuracy, and it also brings the industry’s best practices, so no one has to reinvent the wheel in areas that are repetitive and already solved.
Sparing data scientists the time spent on endless trivial tasks that can be easily automated frees up their brain power to bring innovation to life.
According to Microsoft, AutoML is the process of automating the time-consuming, iterative tasks of machine learning model development, so that ML models can be built with high scale, efficiency, and productivity, all while sustaining model quality.
It requires a mindset shift: improving processes and building systems by automating manual tasks such as feature engineering, feature discovery, model selection, and more.
The data science process is still a largely manual endeavor. Applied properly, automation can provide data scientists a great deal of aid without having to fear 'job losses.' When AutoML first became popular, the dialogue in the DS community was largely about the pros and cons of automating the entire life cycle of the data science process. At dotData, we've found that such an "all or nothing" approach underestimates the complexity of the data science process - especially in large organizations. As a result, we believe that companies should focus instead on providing automation, which makes the life of the data scientist simpler and their job more effective. One such area is feature engineering. Data scientists spend an inordinate amount of time working with data engineers and subject matter experts to discover, develop and optimize the best possible features for their models. By automating a large part of the feature discovery process, data scientists can focus on the task they are truly designed to perform: building the best possible ML models.
– Ryohei Fujimaki, the CEO of dotData
Besides boosting productivity and efficiency, automation also reduces the risk of human errors and biases, which adds to model reliability. But, as the saying goes, too much of anything is bad. Automation is best utilized with some degree of human oversight that factors in real-time information and domain expertise.
Focus Areas of Automation
Now that we understand the benefits of automation, let us zoom in on the specific steps and processes that demand the most time and effort. Automation in the areas listed below has the potential to noticeably increase efficiency as well as accuracy:
- Data Preparation: Data arrives from disparate sources, which makes it challenging for data scientists to get it into the right format for the model training stage. Preparation involves a multitude of steps such as data collection, cleaning, and preprocessing, to name a few.
- Feature Selection and Feature Engineering: Selecting and presenting the right features to the model is foundational to learning the right phenomenon. Not only does automation help in finding the right features, but it is also used to engineer new features that accelerate the learning process.
- Model Selection: This is the process of finding the best-performing model among a set of candidate models, and it governs the accuracy as well as the robustness of the model development pipeline. AutoML is very useful for iterating over candidates and identifying the right model for the given task.
- Hyperparameter Optimization: Selecting the right model is not sufficient. You also need to find the right hyperparameters for a given machine learning algorithm, such as the learning rate, number of layers, and number of epochs. Traditionally, a machine learning engineer tunes these settings by hand until they solve the machine learning problem well. Automated hyperparameter optimization is an indispensable tool that finds a well-performing configuration for your model by assessing various combinations.
- Model Monitoring: No machine learning model keeps giving accurate predictions indefinitely without periodic retraining. Automated tools monitor the deployed model and trigger the model pipeline to take corrective action if its performance deviates from what is expected.
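To make the model selection and hyperparameter optimization steps above a little more concrete, here is a minimal sketch of what an automated search does under the hood. It is not how any particular AutoML product works. The toy data, the two candidate models (a mean baseline and a k-nearest-neighbours regressor), the grid of `k` values, and all function names are assumptions made purely for illustration.

```python
import itertools
import random
import statistics


def make_data(n=200, seed=0):
    """Hypothetical toy data: y = 3x + 2 plus a little Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3 * x + 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_knn(xs, ys, k):
    """k-nearest-neighbours regressor; k is the hyperparameter to tune."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return statistics.fmean([y for _, y in nearest])

    return predict


def cv_mse(fit, params, xs, ys, folds=5):
    """Mean squared error over all held-out points in k-fold cross-validation."""
    errors = []
    for fold in range(folds):
        test_idx = set(range(fold, len(xs), folds))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        model = fit(train_x, train_y, **params)
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return statistics.fmean(errors)


def search(candidates, xs, ys):
    """Try every model family and every hyperparameter combination."""
    best = None
    for name, (fit, grid) in candidates.items():
        keys = list(grid)
        # An empty grid yields exactly one combination: no hyperparameters.
        for values in itertools.product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            score = cv_mse(fit, params, xs, ys)
            if best is None or score < best[0]:
                best = (score, name, params)
    return best


# Candidate model families and their (illustrative) hyperparameter grids.
CANDIDATES = {
    "mean_baseline": (fit_mean, {}),
    "knn": (fit_knn, {"k": [1, 5, 15, 50]}),
}

xs, ys = make_data()
score, name, params = search(CANDIDATES, xs, ys)
print(f"best model: {name}, params: {params}, CV MSE: {score:.4f}")
```

The loop is deliberately exhaustive, which is fine for a handful of combinations. Real tools typically replace it with smarter strategies such as random or Bayesian search, but the idea of scoring each candidate on held-out data and keeping the best one stays the same.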
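The model monitoring idea from the list above can also be sketched in a few lines. This is a simplified illustration, assuming that ground-truth labels eventually arrive for each prediction and that a drop in rolling accuracy is the only signal of interest. The class name `AccuracyMonitor`, the baseline, the tolerance, and the window size are all hypothetical choices, not part of any specific tool.

```python
from collections import deque


class AccuracyMonitor:
    """Flags a deployed model for retraining when rolling accuracy drops."""

    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline      # accuracy measured at deployment time
        self.tolerance = tolerance    # allowed drop before raising a flag
        self.outcomes = deque(maxlen=window)  # most recent hit/miss results

    def record(self, prediction, actual):
        """Store one labelled prediction and report whether to retrain."""
        self.outcomes.append(prediction == actual)
        return self.needs_retraining()

    def rolling_accuracy(self):
        """Share of correct predictions in the current window, or None."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Only decide once the window is full, to avoid noisy early alarms."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.9, tolerance=0.05, window=10)
for _ in range(10):
    monitor.record(prediction=1, actual=1)   # ten correct predictions
print(monitor.needs_retraining())            # window is full and accurate
monitor.record(prediction=1, actual=0)       # one miss: 9 of 10 correct
flag = monitor.record(prediction=1, actual=0)  # second miss: 8 of 10 correct
print(flag)
```

In a real pipeline the flag would kick off a retraining job rather than a print statement, and a human would still review the result, which is the kind of oversight the article argues for.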
Automation, in general, is dreaded as “technology taking away jobs”. In practice, however, it mostly streamlines repetitive and mundane tasks. Automation is a big enabler for data scientists: it cuts down on manual effort and allows for improved, more efficient modeling processes. To get the full benefit of automating the challenging parts of data science workflows, AutoML must be supplemented with a fair share of human expertise and oversight.
Vidhi Chugh is an AI strategist and a digital transformation leader working at the intersection of product, sciences, and engineering to build scalable machine learning systems. She is an award-winning innovation leader, an author, and an international speaker. She is on a mission to democratize machine learning and break the jargon for everyone to be a part of this transformation.