20 Lessons From Building Machine Learning Systems
Data science is not only a scientific discipline; from time to time it also demands art and invention. Here we have compiled lessons that Xavier Amatriain learned over more than a decade of building data science products.
9. It pays off to be smart in choosing your hyperparameters
Automating hyperparameter optimization by choosing the right metric is a good thing. But is it enough to pick the value that maximizes the metric? E.g. is a regularization lambda of 0 really better than a lambda of 1000 that decreases your metric by only 1%? Also, consider Bayesian optimization (Gaussian Processes) instead of grid search.
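As a sketch of the idea (our own example, not from the talk), here is what Bayesian hyperparameter search can look like with the third-party scikit-optimize library; the dataset, search range, and call budget are all illustrative assumptions:

```python
# Minimal sketch: Bayesian optimization (Gaussian Processes) of a
# regularization strength, using scikit-optimize instead of a grid search.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real

X, y = make_classification(n_samples=500, random_state=0)

def objective(params):
    (log_c,) = params
    model = LogisticRegression(C=10.0 ** log_c, max_iter=1000)
    # gp_minimize minimizes, so return the negated CV accuracy.
    return -cross_val_score(model, X, y, cv=5).mean()

# The GP models the metric surface and proposes promising points,
# instead of exhaustively scanning a fixed grid.
result = gp_minimize(objective, [Real(-4.0, 4.0)], n_calls=25, random_state=0)
print("best log10(C):", result.x[0], "cv accuracy:", -result.fun)
```

With the metric surface in hand you can then apply the lesson above: if a much stronger regularization costs only a fraction of a percent of the metric, prefer it.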
10. There are things you can do offline and there are things you can’t… and there is “nearline” for everything in between
The system architecture is the blueprint for multiple personalization services such as ranking, row selection, and movie ratings, which requires multi-layered machine learning: expensive model training and precomputation run offline, request-time scoring runs online within a strict latency budget, and "nearline" computation handles everything in between, reacting to fresh events shortly after they happen without blocking the request.
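The three-tier split can be hard to picture, so here is a deliberately simplified, hypothetical sketch; the function names and data structures are illustrative, not part of any real system:

```python
# Hypothetical sketch of the offline / nearline / online split.
import collections

PRECOMPUTED = {}                          # results of offline batch jobs
NEARLINE_QUEUE = collections.deque()      # recent events, processed asynchronously

def offline_batch_job(interactions):
    """Heavy training/precomputation, run e.g. nightly, far from the request path."""
    for user, items in interactions.items():
        PRECOMPUTED[user] = list(items)[:10]   # stand-in for real model training

def nearline_worker():
    """Reacts to fresh events shortly after they happen, still off the request path."""
    while NEARLINE_QUEUE:
        user, item = NEARLINE_QUEUE.popleft()
        PRECOMPUTED.setdefault(user, []).insert(0, item)  # cheap incremental update

def online_serve(user):
    """Must answer within the request's latency budget: lookup plus light re-ranking."""
    return PRECOMPUTED.get(user, [])[:5]
```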
11. Implicit signals beat explicit ones (almost always)
Many data scientists acknowledge that implicit feedback is more useful. But is it really always more useful, and if so, why? Implicit data is (usually) denser and available for all users. It is a better representation of what users actually do than of what they say they do. It is more closely related to the final objective function and therefore correlates better with A/B test results, e.g. watching vs. rating. However, direct implicit feedback does not always correlate well with long-term retention, e.g. clickbait (items that attract clicks without delivering value). The solution: combine different forms of implicit and explicit signals to better represent the long-term goal.
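As a toy illustration of that combination (our own sketch; the signal names and the weight alpha are assumptions, not from the talk):

```python
# Hypothetical sketch: blending an implicit signal (watch fraction) with an
# explicit one (star rating) into a single training label.
def combined_label(watch_fraction, rating, alpha=0.7):
    """watch_fraction: implicit signal in [0, 1]; rating: explicit 1-5 stars or None."""
    if rating is None:
        return watch_fraction                 # fall back to implicit-only
    explicit = (rating - 1) / 4               # rescale stars to [0, 1]
    return alpha * watch_fraction + (1 - alpha) * explicit

print(combined_label(0.95, 5))   # watched fully and rated highly: strong positive
print(combined_label(0.90, 1))   # clickbait pattern: watched but rated poorly
```

The explicit term tempers the clickbait case that implicit data alone would score as a success.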
12. Your Model will learn what you teach it to learn
Machine learning is not an arbitrary process; it follows a step-by-step scientific methodology. Your model will learn according to: the training data (e.g. implicit and explicit signals), the target function (e.g. the probability of a user reading an answer), and the metric (e.g. precision vs. recall). Example 1: optimize the probability of a user going to the cinema to watch a movie and rating it "highly", using purchase history and previous ratings. Use NDCG (Normalized Discounted Cumulative Gain) of the ranking as the final metric, counting only movies rated 4 or higher as positives.
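A minimal sketch of that metric choice, using scikit-learn's ndcg_score and treating only ratings of 4 or higher as relevant (the ratings and scores below are made up):

```python
# NDCG with only movies rated 4+ counted as positives (binary relevance).
import numpy as np
from sklearn.metrics import ndcg_score

ratings = np.array([[5, 3, 4, 1, 2]])             # true ratings, one user, 5 movies
scores = np.array([[0.9, 0.7, 0.8, 0.2, 0.4]])    # the model's ranking scores

relevance = (ratings >= 4).astype(int)            # 4+ stars -> positive, else 0
print("NDCG@5:", ndcg_score(relevance, scores, k=5))
```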
13. Supervised + Unsupervised Learning
While developing models we can't simply choose supervised or unsupervised learning; we have to use them simultaneously and iteratively. Unsupervised learning is a good tool for dimensionality reduction and feature engineering, and you need to learn the "magic" of combining the two. E.g. 1: clustering + kNN, where clustering groups similar data points into feature vectors that can later be used to build kNN models. E.g. 2: matrix factorization (MF) can be interpreted as an unsupervised method for dimensionality reduction (akin to PCA) or clustering (e.g. NMF), and the resulting latent factors can later be used for supervised learning with labeled targets (~ regression). One of the "tricks" in deep learning is how it combines unsupervised and supervised learning, e.g. stacked autoencoders and the training of convolutional nets.
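A minimal sketch of the second pattern with scikit-learn, using NMF as the unsupervised stage feeding a supervised regressor (the synthetic data is purely illustrative):

```python
# Unsupervised -> supervised: NMF latent factors feed a ridge regression.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.random((200, 50))        # non-negative features, e.g. interaction counts
y = X @ rng.random(50)           # synthetic regression target

pipe = make_pipeline(NMF(n_components=10, max_iter=500), Ridge())
print("CV R^2:", cross_val_score(pipe, X, y, cv=5).mean())
```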
14. Everything is an ensemble
The Netflix Prize was won by an ensemble: initially BellKor used GBDTs; later BigChaos introduced an ANN-based ensemble. Most practical applications of ML run an ensemble. You can add completely different approaches (e.g. collaborative filtering and content-based), and you can use many different models at the ensemble layer: LR, GBDTs, RFs, ANNs. Ensembles are the way to turn any model into a feature! E.g. don't know whether the way to go is Factorization Machines, Tensor Factorization, or RNNs? Treat each model as a "feature" and feed them all into an ensemble.
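A minimal sketch of that "model as a feature" idea using scikit-learn's StackingClassifier, where the base models' predictions feed a logistic-regression ensemble layer (the dataset and model choices are illustrative):

```python
# Stacking: each base model becomes a "feature" for the ensemble layer.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gbdt", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),   # the ensemble layer (LR here)
)
print("accuracy:", stack.fit(X_tr, y_tr).score(X_te, y_te))
```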
15. The output of your model will be the input of another one (and other design problems)
Ensembles turn any model into a feature. That is a great way to get better results, but it can also be a mess! Make sure the output of your model is ready for data dependencies: e.g. can you easily change the distribution of its values without affecting all the other models that depend on it? Try to avoid feedback loops, as they create dependencies and bottlenecks in the pipeline. One of the best ways to do this is to treat your ML infrastructure as you would your software infrastructure and apply sound software engineering practices (e.g. encapsulation, abstraction, cohesion, low coupling…). However, design patterns for machine learning software are not yet well known or documented, so decide at the very beginning how you will approach this.
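As one hypothetical way to apply encapsulation here (the class and field names are our own invention): expose a stable contract, such as a calibrated probability, instead of the raw score, so the model behind it can be retrained or swapped without breaking downstream consumers.

```python
# Hypothetical sketch: hiding a model's raw score distribution behind a contract.
from dataclasses import dataclass

@dataclass
class RankerOutput:
    item_id: str
    probability: float      # contract: calibrated P(positive), always in [0, 1]

class RankerService:
    """Downstream models see only the contract, never the raw scores."""
    def __init__(self, model, calibrator):
        self._model = model              # can be retrained or swapped freely
        self._calibrator = calibrator    # maps raw scores -> probabilities

    def score(self, item_id, features):
        raw = self._model.predict(features)
        return RankerOutput(item_id, self._calibrator(raw))
```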