20 Lessons From Building Machine Learning Systems

Data science is not only a scientific field; from time to time it also calls for art and innovation. Here we have compiled lessons from Xavier Amatriain, learned over more than a decade of developing data science products.

16. The pains & gains of Feature Engineering

Important properties of well-behaved ML features are reusability, transformability, interpretability, and reliability. Reusability: you should be able to reuse features across different models, applications, and teams. Transformability: besides reusing a feature directly, it should be easy to use a transformation of it (e.g. log(f), max(f), ∑f_t over a time window…). Interpretability: to do any of the above, you need to be able to understand the meaning of features and interpret their values. Reliability: it should be easy to monitor features and detect bugs or issues in them. The better your features capture the properties you care about, the more accurate your results will be.

Take Quora answer ranking as an example: what is a good Quora answer? It is truthful, reusable, well formatted, and provides an explanation. To quantify these properties, we have to translate these dimensions into features. We could select interaction features (upvotes/downvotes, clicks, comments…) and user features (e.g. expertise in the topic), and then build models on them.
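As a rough illustration of transformability, here is a minimal sketch of the three transformations mentioned above (a log, a windowed max, and a windowed sum). The function names and the `daily_upvotes` series are hypothetical, and `log1p` is used instead of a plain log only so that zero counts are handled; none of this comes from the original talk.

```python
import math
from collections import deque


def log_feature(value):
    """Return log(1 + value), which compresses heavy-tailed counts and accepts zero."""
    return math.log1p(value)


def rolling_sum(values, window):
    """Sum of the most recent `window` values at each time step."""
    out = []
    buf = deque()
    total = 0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            # Drop the value that just fell out of the window.
            total -= buf.popleft()
        out.append(total)
    return out


def rolling_max(values, window):
    """Maximum of the most recent `window` values at each time step."""
    out = []
    buf = deque(maxlen=window)  # old values are evicted automatically
    for v in values:
        buf.append(v)
        out.append(max(buf))
    return out


# Hypothetical raw feature: upvotes an answer received on five consecutive days.
daily_upvotes = [3, 0, 5, 2, 8]

upvotes_3d_sum = rolling_sum(daily_upvotes, window=3)
upvotes_3d_max = rolling_max(daily_upvotes, window=3)
upvotes_log = [log_feature(v) for v in daily_upvotes]
```

Because each transformation is a small, named function over the raw series, the same raw feature can be reused by several models, and each derived value stays easy to interpret and monitor.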

17. The two faces of your ML infrastructure

Whenever you develop ML infrastructure, you need to target two different modes. Mode 1 is ML experimentation, which requires flexibility, ease of use, and reusability. Mode 2 is ML production, where you need all of the above plus performance and scalability. Ideally, you want the two modes to be as similar as possible. How do you combine them? One way is to provide good intermediate options: ML “researchers” experiment in IPython Notebooks using Python tools (scikit-learn, Theano…), and you use the same tools in production whenever possible, implementing optimized versions only when needed. Another way is to implement abstraction layers on top of optimized implementations so they can be accessed from regular, friendly experimentation tools.
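The abstraction-layer idea can be sketched as a shared interface that both an experimentation-friendly implementation and an optimized one satisfy. Everything named here (`Scorer`, `ReferenceLinearScorer`, `OptimizedLinearScorer`, `rank`) is hypothetical, and the "optimized" class is only a pure-Python stand-in for what might be a native extension in a real system.

```python
from typing import Dict, List, Protocol, Sequence


class Scorer(Protocol):
    """Interface shared by experimentation and production code."""

    def score(self, features: Sequence[float]) -> float: ...


class ReferenceLinearScorer:
    """Readable implementation intended for notebooks and quick experiments."""

    def __init__(self, weights: Sequence[float]) -> None:
        self.weights = list(weights)

    def score(self, features: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, features))


class OptimizedLinearScorer:
    """Stand-in for an optimized backend; a real one might wrap native code."""

    def __init__(self, weights: Sequence[float]) -> None:
        self._weights = tuple(weights)

    def score(self, features: Sequence[float]) -> float:
        # Assumption: this loop represents a call into the optimized implementation.
        total = 0.0
        for w, x in zip(self._weights, features):
            total += w * x
        return total


def rank(scorer: Scorer, items: Dict[str, Sequence[float]]) -> List[str]:
    """Order item ids by descending score; works with any Scorer."""
    return sorted(items, key=lambda item_id: scorer.score(items[item_id]), reverse=True)


# Hypothetical candidates with two features each.
items = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [1.0, 1.0]}
weights = [0.5, 1.0]

experiment_ranking = rank(ReferenceLinearScorer(weights), items)
production_ranking = rank(OptimizedLinearScorer(weights), items)
```

Since `rank` depends only on the interface, the code a researcher runs in a notebook is the same code that runs in production; only the scorer passed in changes.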

18. Why you should care about answering questions (about your model)

Model debuggability is the step, at the very end of model development, where you explain the model behavior to the product owners. It is important to establish the value of a model, meaning what it brings to the product. Product owners and stakeholders have expectations of the product, so it is important to be able to answer why something failed, e.g. why am I seeing or not seeing this on my homepage feed? Model debuggability bridges the gap between product design and ML algorithms, which is why it matters so much. It can determine which particular model to use, which features to rely on, and how to implement tools.
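One simple way to make a model answer such questions is to break a score into per-feature contributions. The sketch below assumes a linear scoring model, which is an assumption made for illustration and not necessarily what any real feed uses; the feature names, weights, and values are all hypothetical.

```python
def explain_score(weights, features):
    """Return the linear score and the per-feature contributions,
    ordered from most to least influential (by absolute value)."""
    contributions = {name: weights[name] * value for name, value in features.items()}
    score = sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


# Hypothetical model weights and one item's feature values.
weights = {"upvotes": 0.5, "author_expertise": 2.0, "downvotes": -1.0}
features = {"upvotes": 6, "author_expertise": 1, "downvotes": 4}

score, ranked = explain_score(weights, features)
for name, contribution in ranked:
    print(f"{name}: {contribution:+.1f}")
print(f"total score: {score:.1f}")
```

Here the largest contribution in absolute terms is the negative one from `downvotes`, which is the kind of answer a product owner can act on when asking why an item ranks low.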

19. You don’t need to distribute your ML algorithm

Most of what people do in practice can fit into a single multi-core machine by using smart data sampling, offline schemes, and efficient parallel code. “Easy” distributed approaches such as Hadoop/Spark carry dangers, such as costs and network latencies. For example, using Spark we trained a model in 6 hours on 15 machines. Total developer time was 4 days. The same model was trained using C++ in 10 minutes on 1 machine. So most practical applications of Big Data can fit into a (multi-core) implementation.
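The gap in the example above is easier to appreciate as arithmetic. The short calculation below uses only the figures quoted in the text (6 hours on 15 machines versus 10 minutes on 1 machine); the variable names are illustrative, and developer time is left out because the text does not break it down per approach.

```python
# Figures quoted in the example.
spark_hours, spark_machines = 6, 15
cpp_minutes, cpp_machines = 10, 1

# Total machine time consumed by each training run, in machine-minutes.
spark_machine_minutes = spark_hours * 60 * spark_machines  # 5400
cpp_machine_minutes = cpp_minutes * cpp_machines           # 10

# How much sooner the single-machine run finishes (wall-clock time).
wall_clock_speedup = (spark_hours * 60) / cpp_minutes      # 36.0

# How much less machine time the single-machine run consumes.
machine_time_ratio = spark_machine_minutes / cpp_machine_minutes  # 540.0

print(f"wall-clock speedup: {wall_clock_speedup:.0f}x")
print(f"machine-time ratio: {machine_time_ratio:.0f}x")
```

In other words, the single-machine run finished 36 times sooner and used 540 times less machine time than the distributed run.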

20. The untold story of Data Science vs. ML engineering


Many companies struggle with the question of where Data Scientists fit in an organization. It is valuable to have strong DS who can extract value from the data. But strong DS with solid engineering skills are unicorns, and finding them is not scalable. DS need engineers to bring things to production, yet engineers have too much on their plate to be willing to “productionize” cool DS projects. There are two solutions. The first is the data-driven ML innovation funnel, which has three parts: (1) data research and hypothesis building (Data Science), (2) ML solution building and implementation (ML Engineering), and (3) online experimentation and A/B testing analysis (Data Science). The second is to broaden the definition of ML Engineer so that it spans from coding experts with high-level ML knowledge to ML experts with good software skills.

In a nutshell: choose the right metric and be thoughtful about your data. Understand the dependencies between data and models, and optimize only what matters. Make sure you teach your model what you want it to learn. Ensembles and the combination of supervised and unsupervised techniques are key in many ML applications. It is important to focus on feature engineering. Be thoughtful about your ML infrastructure and tools, and about how you organize your teams.


Bio: Devendra Desale is a data science graduate student currently working on text mining and big data technologies. He is also interested in enterprise architectures and data-driven business. When away from the computer, he also enjoys attending meetups and venturing into the unknown.