Lessons from 2 Million Machine Learning Models on Kaggle
Lessons from Kaggle competitions, including why XGBoost is the top method for structured problems, why neural networks and deep learning dominate unstructured problems (visuals, text, sound), and the 2 types of problems for which Kaggle is suitable.
By Vasyl Harasymiv, Senior Data Scientist at GrubHub.
Here is a summary of Anthony Goldbloom's presentation at the Data Science Chicago Meetup on Nov 2, 2015.
It was nice to see that Anthony comes from financial statistics/econometrics (he mentioned his first job was with the Reserve Bank of Australia). Some interesting points he made:
- XGBoost is the engine of choice for structured problems (where feature manufacturing, i.e. feature engineering, is the key). It is now available as a Python package. Behind XGBoost are the usual suspects - Random Forest and Gradient Boosted Trees. However, hyperparameter tuning adds only a few percentage points of accuracy on top; the major breakthroughs in predictive power come from feature manufacturing;
- Feature manufacturing (in other words, trying combinations of features to find the most predictive one) is the key process for structured problems. It is done either by iteratively trying various approaches (as thousands of individual contributors to Kaggle.com competitions do) or automatically (as DataRobot does; by the way, DataRobot is based partly in Boston and partly in Ukraine). Some Amazon engineers who attended from Seattle commented that they, too, are building a platform that iteratively and randomly permutes features (in a "genetic algorithm" fashion) to find the best features for structured problems;
- For unstructured problems (visuals, text, sound), neural networks run the show (including deep learning, which extracts features automatically, and its variants). A great example was the application of neural networks to the Diabetic Retinopathy problem on Kaggle.com, which surpassed commercially available products in accuracy;
- Kaggle.com is really suitable for two types of problems:
  - A problem that is already solved but for which a more accurate solution is highly desirable, because any fraction of a percent of accuracy turns into millions of dollars (e.g. loan default rate prediction), or
  - Problems that have never been tackled by machine learning, to see if ML can help solve them (e.g. EEG readings to predict epilepsy);
- Don't expect data scientists to perform best in the office! Anthony mentioned his first successful 24-hour data science hackathon, when his senior spent each hour guiding him for 5 min, coding himself for 15 min and then playing basketball for 40 min. Personally, I find walking, gardening and running to be great creativity boosters. How will you work tomorrow? :)
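
For readers who want to see what the random feature search mentioned in the list above might look like, here is a minimal Python sketch. It is only an illustration under stated assumptions: the data is synthetic, the names `make_data`, `random_feature_search`, `pearson` and the columns `f0`..`f3` are made up for this example, and scoring a candidate by its correlation with the target is a cheap stand-in for actually training a model such as XGBoost. It is also plain random search rather than a true genetic algorithm, and it is not a description of what Kaggle competitors, DataRobot or the Amazon engineers actually run.

```python
import random

# Candidate ways of combining two raw columns into one new feature.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def make_data(n=500, seed=0):
    """Synthetic table (an assumption): the target secretly depends on f1 * f3."""
    rng = random.Random(seed)
    cols = {
        name: [rng.uniform(-1, 1) for _ in range(n)]
        for name in ("f0", "f1", "f2", "f3")
    }
    target = [a * b + rng.gauss(0, 0.05) for a, b in zip(cols["f1"], cols["f3"])]
    return cols, target


def random_feature_search(cols, target, n_trials=200, seed=1):
    """Try random (operation, column, column) combinations and keep the best one."""
    rng = random.Random(seed)
    names = sorted(cols)
    best_score, best_combo = 0.0, None
    for _ in range(n_trials):
        a, b = rng.sample(names, 2)      # two distinct raw columns
        op = rng.choice(sorted(OPS))     # one way of combining them
        values = [OPS[op](x, y) for x, y in zip(cols[a], cols[b])]
        score = abs(pearson(values, target))  # proxy for predictive power
        if score > best_score:
            best_score, best_combo = score, (op, a, b)
    return best_score, best_combo


if __name__ == "__main__":
    cols, target = make_data()
    score, combo = random_feature_search(cols, target)
    print("best engineered feature:", combo, "score:", round(score, 3))
```

In a real competition the scoring step would be a cross-validated model fit rather than a correlation, which is what makes this kind of search expensive and why automating it is attractive.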