Things I Have Learned About Data Science
Read this collection of 38 things the author has learned along his travels, and has opted to share for the benefit of the reader.
By Saeed Hajebi, Head of Big Data & AI at Vodafone
- Data is always messy. If you think your data is clean, perhaps you have not looked into it yet; if you think your data is messy, it's even messier.
- Nobody cares how you did it; just do it correctly.
- People do not care how much you know until they know how much you care (about them and their business).
- Data > Big Data. In 2-3 years, nobody will talk about Big Data anymore.
- Delivering value > algorithms, tools, and technologies.
- Delivering value > cool charts.
- Delivering value > buzzwords.
- SQL > any languages (Python, R, and even English).
- It always pays off to be damn good at numbers, Excel, and PowerPoint (and yes, presentation skills); Tableau is a big plus.
- Downloading some code and data and running them does not make you a data scientist. The same is true for doing data science courses.
- Participating in Kaggle competitions does not make you a data scientist, although it can help you learn from others.
- Winning Kaggle competitions does not necessarily make you a good data scientist.
- Real world experience > anything.
- ETL is always needed - be good at it and learn a good tool for it (Talend is a good one). Also, learn scripting languages for ETL.
- Deep learning is cool, but it's still cool if you don't use it when you don't need it, and in 99% of cases you don't need it.
- Algorithms are commodities, your data is not.
- Ideas are commodities, execution is not.
- Deep learning expertise will soon become a commodity; problem-solving skills won't.
- Proper statistical knowledge > learning a new tool or library.
- Learn and talk your business' language; otherwise, people see you as a nerd who doesn't understand anything.
- In a lot of cases, a simple business rule (if else or CASE WHEN in SQL) does the work.
- Linear models and decision trees are good, learn them in depth and use them often.
- If you need a good performance and you don't have enough time, use GBM, but use it properly.
- Learn and use model interpretability techniques - shap is a good one.
- Python and R are both good. Stop debating and learn and use both.
- In communicating your model results, don't use AUC, Kappa, R^2, etc.; business users don't understand them and they don't care about them. Use lift and gains instead.
- AUC != Accuracy. Don't confuse your clients with AUC. AUC is just 'the probability of giving a higher score to a randomly chosen positive than a randomly chosen negative'. Example: If your use case is churn prediction and your AUC is 0.8, then a randomly chosen churner would get a higher score than a randomly chosen non-churner 80% of times. Not very interesting for clients. Right? The clients may think if you give them a list of 1000 high churn risk customers, 800 of them would be real churners; but it cannot be further from reality, especially in highly imbalanced problems.
- In predictive modelling use cases where human performance is not very good (or Bayes error rate is high), e.g., human behaviour prediction like churn or click prediction, a very high accuracy or AUC (> 0.9) could be a sign of data leakage.
- Don't spend too much time to get a better model performance, if it does not matter to the business; 'good enough' is enough. Spend most of your time to better understand your domain, your business problem, your data, and how to add value with analytics. No matter how accurate your models are, chances are they won't be used the way they should be. Example: You spend a month to improve your model's AUC from 0.8 to 0.82; but your marketing team would always need more leads, so you would end up using areas of your results that you are not happy with. You send a list of 1000 high propensity customers, they need 5000. I have seen data scientists spent more than 3 months to improve their AUC by 0.0001, and their model was not used at all. The real world is not a Kaggle competition. Think business.
- Deep learning is good when you don't have good tabular data and the problem is easy for humans (an expert can tell you the correct answer in less than 2 seconds) but hard for computers; e.g., image classification, NLP, computer vision, etc. If you have structured (tabular) data, or you can create good meaningful tabular data out of your raw data by feature engineering, and the problem is not easy for humans (for example human behaviour prediction like churn prediction, next click prediction, etc.), and you need the best prediction power, a fine-tuned gradient boosting tree (GBM, XGBoost, CatBoost, LightGBM) is probably your best bet. In such cases don't waste your time and money on deep learning.
- If machine learning is really what you need for a use case (please see #21), then in the majority of cases (perhaps more than 90%) a binary classification is the best choice; i.e., the problem is either already a binary classification or can be reduced to a binary classification problem. In about 5% of cases, a regression would be the way to go. Therefore, in about 95% of cases, you would be dealing with supervised learning. For the remaining 5%, try to reframe the problem into a supervised learning problem if possible. Avoid unsupervised learning as much as you can, as the results are not always useful - there is randomness, arbitrary choices to be made, etc.
- For every use case request, 'Begin with the End in Mind'. In other words, think about the whole 'Data to Value' process, end-to-end. Before jumping to model training, ask these questions of your customer:
- What is the business problem to be addressed? What are the definitions, assumptions, metrics, etc?
- What is the decision to be influenced/improved?
- Who is the customer?
- Who is the user?
- Who is the business sponsor?
- What is the general opportunity to add value from analytics (e.g., cost saving, revenue improvement, etc.)?
- What is the annual monetary value of the use case?
- How do they do the process or make the decision today?
- What are the selection criteria (e.g., is this a problem which deals with all customers or a specific group of them)?
- What are the expected deliverables? A deep-dive analysis and insight, a data-driven decision, a predictive model, a data product, a dashboard, etc?
- What is the success criteria or KPIs (What would be the measure of success or improvement)?
- Where will the data come from?
- Do we have the right to use the data?
- How can you ensure data quality? Remember, garbage in, garbage out. This is specifically true about the target variable.
- What ETL processes should be developed?
- How the results will be used? Examples: better targeting and reducing the number of contacts, save money, extra revenue, identifying new leads, etc.
- What organisational units involved/impacted by the results? (Think in terms of RACI: Responsible, Accountable, Consulted, Informed.)
- Assuming the model is ready, is there any potential deployment/utilisation constraints?
- What capabilities are required for utilising the model/results?
- What processes and systems will be impacted?
- Is everything ready to utilise model/results? If not, what is missing?
- Are there any potential blockers for model/results utilisation? If yes, what are the plans for unblocking them?
- Who will make sure that underlying infrastructure will support the end-to-end customer experience?
- What is the frequency of running the solution (real-time, daily, weekly, monthly, etc.)?
- Who will maintain, support, and service the solution?
- Do not bombard your customer with a lot of information, charts, numbers, buzzwords, and mumbo jumbo. Keep it simple st.../sweetheart (KISS). If the required outcome is a deep dive analysis, try to limit yourself to 3 actionable insights. I highly recommend taking this to the next level and providing recommendations for insightful actions. Please note there is a big difference between the two; the former is just some insights/ideas; while the latter is about actions/execution (please see #17- Ideas are commodities, execution is not.). Nonetheless, always consult with operational and business people to understand the validity and implications of your action recommendations.
- Start simple, deliver a solution quickly, and then add features and enhance your solution later on. That's what is called a minimum viable product (MVP) in software engineering. There is no harm in delivering a working solution which does not have any machine learning in it but satisfies the customer's need to a good extent in the first day/week/month - depending on the request and complexity of the environment. You can add machine learning later on to make it smarter, if required at all - please see #21.
- Prediction is easy, creating business impact is hard. Let's use a concrete example to explain this; however, the idea is simply generalisable. I use customer retention as the use case. The business problem is we want to increase the lifetime of customers staying and spending money with a company. We can define a number of analytical problems out of this use case. To keep it simple, I just focus on churn. We need to first predict what customers are likely to churn, then define some retention campaigns to make them stay and continue spending. Although churn prediction is one of the challenging analytical problems, it's the easiest part of the whole business problem of customer retention. You build a great model with AUC of say 0.85, which is very good. Your model creates a 60% lift in the first decile, so you can identify 60% of real churners in the top 10% of identified high-risk customers. You advise your marketing team to target the first decile with a marketing campaign. One month down the road, you notice the churn has increased! Yes. Prediction is easy, retention is hard. You may have woken up 'sleeping dogs'. You may need to use more advanced techniques like response modelling and uplift modelling to identify persuadables (more on this in a future post). Additionally, you need to team up with your marketing team to deeply understand the business context, customer needs, etc. Remember: Prediction is easy, creating business impact is hard.
- Data science (and/or AI) could be a boring job most of the times - if you only enjoy machine learning part of the work. People may think we AI experts would sit in an empty room with a cool view and think about the future of humanity. The reality, however, is very different. We have to deal with unrealistic expectations, unclear requirements, dirty and disparate data, and customers who just keep changing their questions and requests and expect their answers yesterday. We all have heard over and over that 80% of a data scientist's time would be spent on data cleaning. This could be true for the technical part of the job only. In the full context (see #32 for some details), the technical aspect would not take more than 50% of the time of a project. Based on my experience, model training and tuning would not take more than 2-3% of the time of a project. So, if this is the only thing you enjoy, you may need to think about switching job. A better option is to try to enjoy every aspect of the job.
- When to use machine learning? #21 argues 'in a lot of cases, a simple business rule (if else or CASE WHEN in SQL) does the work'. So, the question is when to use machine learning, and when to use the business rules? The answer is in the difference between traditional programming and machine learning. In traditional programming, we have 'data' and 'the logic' (the domain expertise coded as logic) as the input of the system, and the output is 'labeled data' (or an action or a decision). The quality of your output is a function of input data and the quality and the sophistication of the coded logic, which is normally defined by domain experts. However, in machine learning, we have 'data' and 'labeled data' as the input, and the logic (patterns in the data identified by an algorithm) is the output of the system, which is called 'model'. Therefore, if you have good historical data and labeled output, you can benefit from machine learning. Otherwise, the business rules would be a better option to start with and collect 'data' and 'labeled data'. After collecting enough 'labeled data', you can use machine learning to identify more sophisticated patterns and achieve better predictive performance.
- Sacrifice your model's performance for better business performance. #28 and #29 already discussed data leakage and model performance. The point here is a bit different. In predictive modelling and machine learning, when we do feature engineering and feature selection, we have to think about a lot of stuff. One of the most tricky ones is this: while including a feature may improve model performance (even without causing data leakage) it may decrease the business impact of the model. As an example, let's assume that we are modelling XXX product cross-sell use case. Having a feature like 'number of XXX marketing campaigns received' will highly likely improve model performance, but it will prioritise the customers who have already received the campaign, in every run; which is not what business wants. It's better to sacrifice your model's performance (AUC, again!) for better business performance, not the other way around.
- Learn every day – be a Learning Machine.
Bio: Saeed Hajebi is Head of Big Data & AI at Vodafone.
Original. Reposted with permission.
- 10 Simple Hacks to Speed up Your Data Analysis in Python
- Checklist for Debugging Neural Networks
- Good Feature Building Techniques and Tricks for Kaggle
|Top Stories Past 30 Days|