A Doomed Marriage of Machine Learning and Agile
Sebastian Thrun, the founder of Udacity, ruined my machine learning project and wedding.
By Ian Xiao, Engagement Lead at Dessa
TLDR: Three mistakes from taking Agile and Sebastian’s words too far in an ML project, and what I think is (and isn’t) Agile-able in an ML project.
Disclaimer: This post is not endorsed or sponsored by any of the firms I work for or by any of the tools or services I mention. I use the terms AI, Data Science, and ML interchangeably.
Like What You Read? Follow me on Medium, LinkedIn, or Twitter. Also, do you want to learn business thinking and communication skills as a Data Scientist? Check out my “Influence with Machine Learning” guide.
It was 11:35 pm on November 3rd, 2016. I sat in a boardroom on Bay Street, the heart of Toronto, staring at the floors of empty offices down the street. The lights in my room dimmed. My phone lit up with three notifications.
// Executive presentation, in 10 hours.
// Call Jess. Send the final wedding logistics details, in 12 hours.
// To the airport. Flight to Hong Kong for our wedding, in 18 hours.
“F*ck… F*ck. F*ck!”
I tapped the keypad; the laptop turned on; a number popped up. It was still not working. How would I explain why the expected conversion had dropped by 35%?
I picked up the phone and called Simon, a colleague who would support me in the presentation: “Let’s plan for damage control.”
“Hey, are you thinking about our wedding?” Jess, my lovely wife, asks.
I snap back to the present. It’s November 5th, 2019. I confess: no, I was not thinking about our wedding. But happy anniversary! I will never forget this date.
Damn it, Sebastian
I love ML success stories. But let’s face the truth: most ML projects fail. Many fail because we try to solve the wrong problem with ML; some fail because we don’t pay enough attention to last-mile problems or to engineering and organizational issues. A few fail because we feel data science is boring and stop caring. The rest, like the one in my confession, fail because of poor project management.
If I were to pick one lesson from the experience, it would be this: don’t overdo Agile. And it all started with a diagram from Sebastian Thrun.
This is my favorite ML workflow diagram, partly because Sebastian has the best handwriting of anyone I’ve seen write on a computer writing pad (seriously). But there is an important nuance.
My Ignorance, The Nuance, and What Happened
First, I must admit my own ignorance. I was a naive advocate of the Agile methodology because Waterfall seemed so old-school and unsexy. Like me, many of you might have heard this about Agile: iteration allows us to learn, test, and fail fast. (Also, we kill fewer trees by not printing technical specs that no one reads and that are probably wrong early in a project.)
The Nuance. There shouldn’t be an arrow pointing back to the Question and Evaluation stages (the green and red text). The iteration principle of Agile should not apply to the stages where we define the core questions and the evaluation metric.
The core question must be clearly articulated as early as possible. Then, the sub-questions (see the hypothesis-driven problem-solving approach) must be thought out immediately. They help to 1) identify good feature ideas and drive exploratory analysis, 2) plan the timeline, and 3) decide what each Agile Sprint should focus on. The evaluation metric must be picked carefully according to the core question. I discuss this process in detail in my guide.
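To make the decomposition a little more concrete, here is a minimal Python sketch, assuming each sub-question gets a rough impact score agreed with the business up front. The `SubQuestion` class, the 1 to 5 scale, and the example questions and feature names are all hypothetical, not taken from the project described here. Ranking sub-questions by impact before filling sprints is one simple way to keep innovative but low-impact feature ideas from crowding out the important ones.

```python
from dataclasses import dataclass, field


@dataclass
class SubQuestion:
    text: str
    expected_impact: int  # hypothetical 1-5 score agreed with the business
    feature_ideas: list[str] = field(default_factory=list)


def plan_sprints(sub_questions: list[SubQuestion], per_sprint: int) -> list[list[SubQuestion]]:
    """Rank sub-questions by expected impact, then fill sprints in that order."""
    ranked = sorted(sub_questions, key=lambda q: q.expected_impact, reverse=True)
    return [ranked[i:i + per_sprint] for i in range(0, len(ranked), per_sprint)]


# Hypothetical core question and sub-questions for a conversion model.
core_question = "Which customers are most likely to convert on a targeted offer?"
sub_questions = [
    SubQuestion("Does channel preference affect conversion?", 2, ["preferred_channel"]),
    SubQuestion("Does recent engagement predict conversion?", 5,
                ["days_since_last_visit", "visits_last_30d"]),
    SubQuestion("Do existing product holdings predict conversion?", 4, ["num_products"]),
]

sprints = plan_sprints(sub_questions, per_sprint=2)
print(core_question)
for n, sprint in enumerate(sprints, start=1):
    # List the feature ideas each sprint should focus on.
    features = [f for q in sprint for f in q.feature_ideas]
    print(f"Sprint {n}: {features}")
```

A real plan would also weigh effort and dependencies; the point is only that the ranking is settled once, early, rather than rediscovered every sprint.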
So, what happened? Not being able to recognize my own ignorance and the nuance in the ML workflow led to 3 fatal mistakes.
Mistake #1: We changed the target variable midway because we had defined the core business question poorly. This led to data leakage that ruined our model performance. We had to rewrite ~35% of our codebase and spend ~2 weeks retrying different data transformations and model architectures. It was a nightmare.
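One way leakage can creep in when the target changes: the new target moves the prediction cutoff, and features that were safely in the past under the old definition now contain information from after the cutoff. The sketch below is a simplified point-in-time check, assuming each feature has a single timestamp at which its value becomes available (real pipelines track this per row). The function name, feature names, and dates are hypothetical, not data from the project.

```python
from datetime import datetime


def find_leaky_features(feature_times: dict[str, datetime], cutoff: datetime) -> list[str]:
    """Return features whose values only become available at or after the prediction cutoff."""
    return sorted(name for name, available_at in feature_times.items() if available_at >= cutoff)


# Hypothetical features and the time each one's value becomes available.
feature_times = {
    "num_products": datetime(2019, 2, 1),
    "days_since_last_visit": datetime(2019, 2, 10),
    "offer_page_visits": datetime(2019, 3, 5),  # value only known from March 5 onward
}

old_cutoff = datetime(2019, 3, 10)  # prediction point under the original target
new_cutoff = datetime(2019, 3, 1)   # earlier prediction point under the redefined target

print(find_leaky_features(feature_times, old_cutoff))  # []
print(find_leaky_features(feature_times, new_cutoff))  # ['offer_page_visits']
```

Running a check like this whenever the target definition changes turns a silent performance collapse into an explicit list of features to re-examine.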
Mistake #2: We prioritized low-impact features because our sub-questions weren’t well thought out. Some of these features were innovative but, in retrospect, not useful. This wasted valuable development time, which is extremely scarce in a time-boxed consulting setting.
Mistake #3: We changed our evaluation metrics. We had to redesign our model architecture and re-run the hyperparameter search. This required new test cases, and we had to run many regression tests manually.
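Automating the regression check is one way to avoid rerunning tests by hand. The sketch below keeps the evaluation metric and its baseline in one small immutable contract object, so changing either means editing code that goes through review rather than quietly swapping metrics. Precision at k is only an assumed example metric (a plausible choice when a campaign contacts the top-k scored customers); `MetricContract`, the baseline, the tolerance, and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContract:
    """The evaluation metric and the baseline it must not regress below (hypothetical values)."""
    name: str
    k: int
    baseline: float
    tolerance: float


def precision_at_k(y_true: list[int], scores: list[float], k: int) -> float:
    """Fraction of actual converters among the k highest-scored customers."""
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sum(y_true[i] for i in top_k) / k


def passes_regression(contract: MetricContract, value: float) -> bool:
    """True if the metric has not dropped more than `tolerance` below the baseline."""
    return value >= contract.baseline - contract.tolerance


CONTRACT = MetricContract(name="precision_at_k", k=3, baseline=0.60, tolerance=0.05)

# Toy labels and model scores; in CI these would come from a fixed holdout set.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

value = precision_at_k(y_true, scores, CONTRACT.k)
assert passes_regression(CONTRACT, value), f"{CONTRACT.name} regressed to {value:.3f}"
print(f"{CONTRACT.name} = {value:.3f}")
```

Wrapped in a test runner, the same assertion can run on every commit instead of being checked manually.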
All these mistakes reflected poorly on client experience, team morale, and the timeline. Eventually, they led to a very stressful night before my wedding.
Okay. Let’s be fair. I did take Sebastian’s advice out of context. He’s the best teacher of ML (Sorry, Andrew Ng).
Sebastian’s workflow is designed mostly for experimentation-focused ML projects (e.g. BI or insight discovery). For the engineering-focused ML projects (e.g. full-stack ML systems) like ours, the workflow should be refined based on the best practices of software engineering and product management.
Here is my take on what can be Agile and what cannot. It’s particularly important for time-boxed consulting or internal prototyping projects. There is no one-size-fits-all approach, so don’t take this too far; apply it with judgment.
- Development of data features
- Development of model architecture
- Design & development of user-facing UI (e.g. Dashboards, Web-UI, etc.)
- Design and development of CICD test cases
- User experience if the end consumers are people; integration pattern and protocol if the end consumers are systems
- The definition of core, sub-questions, and list of top feature ideas
- The definition of the target variable (avoid changes at all costs; but if business requirements do change, a locked-down CICD pipeline is very helpful for catching data or software bugs)
- The evaluation metrics
- Pipelines: Data-to-Model and CICD with DVC
- Infrastructure: integration pattern, serving layer, critical databases, and hardware platforms
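For the target-variable item above, one lightweight way to lock the definition into the CICD pipeline is to fingerprint it and fail the build when it changes. This is a plain-Python sketch; it does not use DVC or any particular CI tool, and the definition fields (`event`, `window_days`, `anchor`) and the function names are hypothetical.

```python
import hashlib
import json

# Hypothetical signed-off target definition for a conversion model.
TARGET_DEFINITION = {
    "name": "converted_30d",
    "event": "product_purchase",
    "window_days": 30,
    "anchor": "offer_sent_date",
}


def fingerprint(definition: dict) -> str:
    """SHA-256 of the definition; sort_keys makes it independent of key order."""
    payload = json.dumps(definition, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_target_locked(definition: dict, approved_fingerprint: str) -> None:
    """Raise if the target definition no longer matches the signed-off fingerprint."""
    if fingerprint(definition) != approved_fingerprint:
        raise ValueError("Target definition changed: re-check features for leakage "
                         "and re-run the locked evaluation before merging.")


# For this self-contained demo the approved fingerprint is computed here;
# in a real repo it would be committed as a literal string at sign-off.
APPROVED = fingerprint(TARGET_DEFINITION)

check_target_locked(dict(reversed(list(TARGET_DEFINITION.items()))), APPROVED)  # passes
try:
    check_target_locked({**TARGET_DEFINITION, "window_days": 60}, APPROVED)
except ValueError as err:
    print(err)
```

Storing the approved fingerprint as a literal means any edit to the definition, however small, trips the check and forces the change to be discussed rather than slipped in mid-project.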
It’s been working well so far. As always, I welcome any feedback.
Depending on the reaction to this article, I may write a follow-up on how to plan an ML project using both Waterfall and Agile, with sample work plan diagrams and requirements. Stay tuned by following me on Medium, LinkedIn, or Twitter.
As you can see, being able to articulate the core questions and derive the analysis, feature ideas, and model selection from them is critical to a project’s success. I consolidated the material from workshops I offered to students and practitioners into the Influence with ML mini-course. I hope you find it useful.
Like what you read? You may also like these popular articles of mine:
- Data Science is Boring: How I cope with the boring days of deploying Machine Learning
- The Last Defense against Another AI Winter: The numbers, five tactical solutions, and a quick survey
- The Last-Mile Problem of AI: One Thing Many Data Scientists Don’t Think Enough About
Bio: Ian Xiao is Engagement Lead at Dessa, deploying machine learning at enterprises. He leads business and technical teams to deploy Machine Learning solutions and improve Marketing & Sales for F100 enterprises.
Original. Reposted with permission.