Contest 2nd Place: Automating Data Science

This post discusses some considerations, options, and opportunities for automating aspects of data science and machine learning. It is the second place recipient (tied) in the recent KDnuggets blog contest.

comments

By Ankit Sharma, DataRPM.

Editor's note: This blog post was an entrant in the recent KDnuggets Automated Data Science and Machine Learning blog contest, where it tied for second place.

Data scientist is the sexiest job of 21st century. But even Data Scientists have to get our hands dirty to get things done. What if some of the manual laborious tasks are automated and we bring the cool factor back to the job by only focusing on the logic and research.

Any generic data science pipeline will have the following components. To automate DS pipeline, we need to automate each individual component separately.

Automation pathway

Lot of research has been undergoing in automating Data Cleaning & Feature Generation & Selection aspect of DS pipeline because they consume almost 70% of the data science project time.

For this post we’ll focus on the not much talked about topics: Algorithm Selection and Parameter Tuning. Both of these components are considered manual in nature. Data scientists spend hours applying multiple ML algorithms and then tuning the parameters. For every new problem they follow the same process and over period they develop the intuition of which algorithm will work better for which type of dataset and how best they can tune the parameters. If you look closely this is in itself a learning problem. Lately lot of research & development has been started to automate these components to get best accuracy in least amount of time and efforts.

Automating Algorithm selection can be done using concept called Meta Learning while to automate Parameter tuning there are different techniques like Grid search, bayesian optimization, etc which we’ll be discussing later in the post.

Automating Algorithm Selection using Meta Learning

There are two essential parts of Meta-learning - Meta Store & Meta-model algorithm. Meta store is used to capture metadata of datasets and metadata of algorithm which gave best results after hyper-parameter tuning. As soon as dataset is loaded into the environment, its metadata is generated and stored in the centralized MetaStore. Before running any machine learning algorithm (specially for classification and regression model) Meta-model algorithm will analyze the metadata of this dataset and run algorithm ranking logic which will predict the machine learning model out of 10-20 different models that which model might perform better on this dataset.

Meta-model algorithm

Once the algorithm is selected, next step is to run hyper-parameter tuning algorithms to tune this model. Meta model algorithm will essentially use MetaStore and machine learning algorithm. Meta-model algorithm will also help in reducing the hyper parameter space which boosts convergence rapidly to provide best result.

To set up MetaStore, we need to capture Dataset metadata and Best Algorithm run metadata. Lot of features can be stored for both. Some of the useful feature list is as follows:

Dataset metadata

No of observations
Ratio of the number of observations by the number of attributes
Number of Categorical features
Number of Numerical features
Number of Classes
Class imbalance ratio
Percentage of missing values
Multivariate normality
Skewness of features
Percentage of the attributes after Feature Selection
Percentage of outliers
Decision Tree model characteristics like nodes per feature, maximum tree depth, shape, tree imbalance, etc
Max fisher’s discriminant ratio
Maximum (individual) feature efficiency
Collective feature efficiency
Fraction of points on the class boundary
Ratio of average intra/inter class nearest neighbor distance
Leave-one- out error rate of the one-nearest neighbor classifier
Non-linearity of the one-nearest neighbor classifier
Average number of points per dimension
Number of optimal clusters
WSSE of clusters
Domain of the dataset

Algorithm metadata

Dataset ID
Algorithm ID
Hyperparameters
Runtime

Meta-model algorithm

This is the brain of Meta-learning. This itself is a machine learning algorithm which basically learns from the various run of different algorithms on a lot of datasets. Algorithm metadata which we derived from best algorithm run on a particular dataset and stored it in MetaStore, and dataset metadata are combined together to form a new dataset which is fed to machine learning algorithm which ranks section of ML algorithms for a particular dataset.

Algorithm meta-learning

Metalearning to obtain metaknowledge for algorithm selection, source: Soares, C., Giraud-Carrier, C., Brazdil, P., Vilalta, R.: Metalearning: Applications to Data Mining. Springer (2009)

Hyper parameter tuning

Almost all machine learning algorithm have hyper-parameters. SVM has Regularization parameter( C ), k-means has Number of clusters(k), Neural network has Number of hidden layers, dropout, etc as hyperparameters.

Efficiently finding the best solution of these hyper-parameters is the key to any ML solution. There are basically three famous techniques to tune the hyper-parameters of any machine learning algorithms:

Manual Search: Data Scientists try different values and by Luck may end up getting the best solution. It is very common among DS community.
Grid Search: Machine does all the work by evaluating each parameters combinations and checking for best result. It works only for less feature; as the feature set increases so does the number of combination and thereby computation time.
Random Search: Works with large features

There has been some recent advancements in this field where data scientists are using Bayesian approach to optimize hyper-parameters. This approach works well with very large number of features but can be an overkill for less data and dataset with less features. The goal is to maximize some true unknown function f. Information about this function is gained by making observations, which are evaluations of the function at specific hyper-parameter values. These observations are used to infer a posterior distribution over the function values representing the distribution of possible functions. This approach takes less number of iterations to optimize the values of hyperparameters. It may not find the best solution always but will give relatively close to best solution in amazingly less number of iterations.

There has been lot of buzz in this field lately. There are some commercial and open source tools available which trying to make lives of Data Scientist easier.

Open source tools to look for:

Contest 2nd Place: Automating Data Science

More On This Topic

Latest Posts

Top Posts