The Story of the Women in Data Science (WiDS) Datathon
The author shares their experience of almost winning the competition and the things they have learned from the failures. Learn more about the WiDS Datathon and tips on winning the next challenge.
Image by author
How it started...
It started with a desire to do something different and fulfilling. I had been hearing a lot about data science but never fully understood what it meant. Then one day my friend sent me an article about GPT-3, and that was it. I was determined to learn this magical world of data science, so I decided to take the Data Scientist with Python Track (DataCamp Learn). The learning path taught me Python programming and statistical thinking.
To complete the data science track, we had to solve the Case Study: School Budgeting with Machine Learning and participate in a DrivenData competition. This was my first experience of building a classification model, and it completely changed my mindset. I worked day and night to climb the leaderboard and finished in 8th position.
Image by author
Getting into the top ten motivated me to participate in other competitions, and finally I found Kaggle. Kaggle hosts some of the most exciting and challenging machine learning competitions. I was particularly looking for beginner-level competitions and found WiDS Datathon 2021. This competition has a strict requirement that at least 50% of each team's members be women, and that’s how I found Penguin.
WiDS Datathon 2021
The annual WiDS Datathon encourages women to hone their data science skills through machine learning challenges focused on social impact. The 2021 competition allowed four thousand participants from different regions to work on building a state-of-the-art predictive model for diabetes mellitus.
The WiDS Datathon 2021 focused on patient health and provided MIT’s GOSSIS tabular dataset. Submissions were scored by AUC, the area under the Receiver Operating Characteristic (ROC) curve between the predicted and the observed target (diabetes_mellitus_diagnosis), so this was the metric we had to maximize to reach the top of the leaderboard.
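To make the metric concrete, here is a minimal sketch of how AUC can be computed by hand. It uses the pairwise interpretation of AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are made up for illustration and are not taken from the competition data.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This is a simple O(n_pos * n_neg) illustration, not an optimized routine.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = diabetes diagnosed, 0 = not diagnosed (illustrative values).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 3 of the 4 pairs are ordered correctly
```

Because AUC depends only on how the predictions are ordered, not on their absolute values, a model can improve its score by ranking patients better even if its probabilities are poorly calibrated. In practice, a library routine such as scikit-learn's `roc_auc_score` would normally be used instead of a hand-written loop.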
Cover of WiDS Datathon 2021
After the initial introduction, we started working on various machine learning models and spent most of our time figuring out how to build a baseline model. We each worked on separate tasks and discussed our progress weekly via a Discord call. In a couple of months, we were able to build a working machine learning pipeline, which got us to 10th place.
The flow chart below shows all the steps we took to build a top-performing model.
Image by author
- Data: data ingestion and cleaning.
- Analysis: exploratory data analysis.
- Feature Engineering: removing highly correlated features, adding new features, and running a SHAP analysis on an initial model.
- Scaling: filling missing values, label encoding, and standard scaling.
- AutoML: experimented with multiple open-source AutoML frameworks such as AutoMLJar, TabNet, AutoKeras, and H2O.
- Optuna: hyper-parameter optimization using Optuna and randomized search CV.
- Ensembling: weighted ensembling based on model performance.
- Rank data: using geometric averaging on the outputs of multiple experiments to achieve a better model performance metric.
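The last two steps can be sketched in a few lines of plain Python. The article does not spell out the exact procedure we used, so treat this as one plausible reading under stated assumptions: the weights are each model's validation AUC, predictions are converted to ranks scaled into (0, 1], and the ranks are combined with a geometric mean. All function names and numbers below are hypothetical.

```python
import math


def weighted_average(predictions, weights):
    """Blend several models' probability lists, weighting each model."""
    total = sum(weights)
    n_rows = len(predictions[0])
    return [
        sum(w * preds[i] for preds, w in zip(predictions, weights)) / total
        for i in range(n_rows)
    ]


def to_ranks(scores):
    """Convert scores to ranks scaled into (0, 1]; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every item tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / len(scores)
        i = j + 1
    return ranks


def geometric_mean_of_ranks(predictions):
    """Rank each model's output, then take the geometric mean per row."""
    ranked = [to_ranks(p) for p in predictions]
    n_models = len(ranked)
    n_rows = len(predictions[0])
    return [
        math.exp(sum(math.log(r[i]) for r in ranked) / n_models)
        for i in range(n_rows)
    ]


# Hypothetical outputs of two experiments on the same four test rows.
model_a = [0.2, 0.9, 0.6, 0.1]
model_b = [0.3, 0.7, 0.8, 0.2]

# Assumed weights: each model's validation AUC (illustrative numbers).
blended = weighted_average([model_a, model_b], weights=[0.87, 0.85])
rank_blend = geometric_mean_of_ranks([model_a, model_b])
print(blended)
print(rank_blend)
```

Working on ranks rather than raw probabilities fits an AUC-scored competition, since AUC depends only on the ordering of predictions. It also puts models with differently scaled outputs on a common footing before they are combined.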
It took us 229 experiments to achieve an AUC score of 0.8746.
“Hard work & focus is the key to success.”
Kaggle Leaderboard — WiDS Datathon 2021
In this section, we will look at what worked for us and how you can use these tips to win the next datathon.
Image by author
On Kaggle, most teams share their code, so you can simply fork the top-scoring notebook and submit the .csv file. This technique won’t get you into the top ten or even the top 100. The worst part is that you will likely drop down the leaderboard once the scores on the private test set are revealed.
“Copy the notebook, learn how they are solving the problem, and make changes to achieve a better score.”
To win the competition, you need to come up with your own strategy and model pipeline. Our strategy was to focus on feature engineering and AutoML. We read multiple notebooks, blogs, and research papers on similar topics to understand what works in most cases.
“Keep looking for new models, tools, and frameworks to improve your score.”
Trial & Error
Experimenting with multiple models and techniques was the main ingredient of our success. We never dismissed using a neural network on a simple dataset as a silly approach. We tested a wide range of automatic machine learning frameworks, deep learning models, gradient boosting, ensembling techniques, hyperparameter optimization, and feature engineering techniques.
“Use your time in experimenting and learning new ways to solve the same problem.”
Don’t limit yourself to Kaggle notebooks. Search GitHub projects, technical blogs, research papers, and other code-sharing platforms. It’s OK to copy code if you are using it for learning purposes. If you want to share your code with the public, just add author attribution.
“Don’t be afraid to think wild and look for a solution at unusual places.”
At the start, we did not understand the point of sharing ideas or code. But with time, the Kaggle community taught us data augmentation, feature engineering, and model ensembling techniques. In short, the community was the real reason we climbed so high on the leaderboard.
“Go to the Discussion tab and read what other participants are talking about or ask questions.”
Collaboration is not limited to the Discussion tab on Kaggle. You also need to collaborate with your teammates, ask people at your workplace, or ask on other forums, for example by creating Reddit threads. Communicating your issues and ideas will help you find the solutions that work for you.
The Outcome of the Competition
The competition focuses on promoting women by equipping them with the tools they need to start a career in data science. In a typical Kaggle competition, female participation is limited due to a lack of encouragement and support, whereas WiDS is the complete opposite: the competition is dominated by female participants. The Datathon also gives beginners a chance to learn by working on real-life projects. Learn more at Results and Impact.
WiDS Datathon 2021 Results and Impact
I had an amazing interview with Karen Ebert Matthys, the Co-Founder and Co-Director of Women in Data Science. She is also the Executive Director, External Partners at the Institute for Computational & Mathematical Engineering (ICME), Stanford University. After finishing her engineering degree, she worked in a male-dominated field and was sometimes the only woman at tech-related events. This motivated her to launch the WiDS conference to provide equal opportunities for women to get involved and to bring diversity to the workplace.
Apart from organizing datathons, WiDS also runs an annual technical conference and regional WiDS events around the world. WiDS also provides career support through online professional workshops. Recently, they took the initiative to educate secondary school students about data science. In short, they are working at all levels to support women all over the world.
“The WiDS Datathon welcomes everyone from beginners to experienced Kagglers. It’s a great way to boost data science skills and meet others in the WiDS community worldwide. We've been delighted with the diversity of WiDS Datathon participants, including students from a wide range of degrees (business, sciences, agriculture, engineering, cs, etc.) as well as professionals in the industry, academia, government, and NGOs.”
Learn more about WiDS Datathon 2021 results and impact here.
WiDS Next Challenge
The WiDS Datathon 2022 will run from early January to late February 2022 on Kaggle. Registration is open here, and to participate you also need to create a Kaggle account. I highly recommend participating as a team rather than as an individual and joining all the workshops. The next Datathon focuses on climate change, so you will be contributing to a global cause.
Photo by Li-An Lim on Unsplash
The participants will analyze regional differences in building energy efficiency and build models to predict building energy consumption. The competition will help governments or city management to maximize energy efficiency. Learn more at WiDS Datathon 2022 Challenge: Using Data Science to Mitigate Climate Change.
A message from Karen Ebert Matthys:
“Please register here and join us for the kick-off webinar on January 7th! Also, check here throughout January for datathon workshops that are offered worldwide where you can get coaching and look for team members.”
We are thankful to the helpful communities of Kaggle and Women in Data Science. Our special thanks will always go to the people who have shared their solutions and ideas.
Some of the names we would like to mention from Kaggle:
- How to build energy consumption prediction model | Viridis
- Machine learning for energy consumption prediction and scheduling in smart buildings | SpringerLink
- Power-Laws-Forecasting: Winners of the Power Laws forecasting competition (github.com)
- Introduction to Machine Learning for Beginners | by Ayush Pant | Towards Data Science
- Machine Learning Fundamentals with Python Track — DataCamp Learn
- Learn from WiDS Datathon 2021 Winners, including a Kaggle Grandmaster (widsconference.org)
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.