To Kaggle Or Not
Kaggle is the most well known competition platform for predictive modeling and analytics. This article looks into the different aspects of Kaggle and the benefits it can bring to data scientists.
Kaggle was founded in 2010 in Melbourne, Australia, and moved to San Francisco a year later after receiving funding from Silicon Valley. In 2017, it was acquired by Google. Read more about its history and future in Interview with Anthony Goldbloom, CEO of Kaggle.
The term “data science” has steadily worked its way into the English lexicon over the past decade. Along the way, “data science” and “Kaggle” have become inextricably linked, and many in the data science community debate the utility of the platform:
Is Kaggle… useful?
My Initial Thoughts on Kaggle
Like many people, I had some preconceived notions about Kaggle competitions. I had heard about them for several years, and these were my own thoughts, along with opinions I had picked up from others in the field:
- I had heard the legend that retired PhDs with decades of experience were the ones winning the Kaggle competitions. (I had often wondered if these geniuses were on a beach with clear turquoise water and flawless wifi access or in a dark, dusty, cluttered office…)
- I had close to zero chance of winning
- Would I really learn something of value?
- What is the point of investing time to improve accuracy by 0.01 points?
- Is it really the best use of my time? Ought I not invest the time learning another, more valuable, data science skill?
- The winners have to use complex ensemble methods
- The data is artificially clean, and that is unrealistic
- Doing one Kaggle competition will not make me a qualified data scientist, so why bother?
- I am not sure where to begin…
My First Kaggle Competition
Kaggle Competitions and the NYC Marathon
What I discovered is that Kaggle competitions are a lot like the NYC marathon. Most people participate for the journey, not for winning first place.
Verdict: Yes to Kaggle
I would say “yes”: there is value in doing a Kaggle competition, for both the beginner and the seasoned data scientist. Here are the many reasons why.
While there are learning benefits to acquiring your own datasets or scraping the web, the downside is that there is no benchmark and no way to compare your findings. Significant errors can go unnoticed because no validation is being performed. Kaggle competitions provide a platform for “checking your work.”
For All Levels, There is Learning
For the beginner, there is lots to learn:
- Becoming familiar with the Kaggle platform
- Downloading data using Kaggle CLI or API
- The structured ecosystem lets people with less advanced statistical skills focus on building those skills
- Understanding the evaluation metrics
- Using DevOps skills: Git, cloud computing
- Kaggle offers some free interactive tutorials
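To make the evaluation-metric item concrete, here is a minimal Python sketch of three metrics of the kind competitions commonly score on: accuracy, binary log loss, and RMSLE. The function names and sample values are illustrative assumptions and are not taken from any particular competition; the exact definition always lives on the competition's own evaluation page.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative targets."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2
               for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.4]
hard = [int(p >= 0.5) for p in probs]  # threshold probabilities at 0.5

print("accuracy:", accuracy(labels, hard))
print("log loss:", round(log_loss(labels, probs), 4))
```

Note how the two metrics reward different things: accuracy only looks at the thresholded labels, while log loss also penalizes a correct but under-confident probability.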
For the experienced practitioner, there is always more to learn:
- The structured ecosystem lets people with more advanced statistical skills focus on applying them
- Explore hyperparameters more deeply
- Focus on state of the art and emerging methodologies
- Post-competition analysis of winner entries
- Managing very large datasets (1 million records or more)
- Setting up a GPU-enabled machine for deep learning
- Use deep learning and compare results to traditionally used algorithms
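As one way to picture what exploring hyperparameters more deeply can look like, the sketch below runs an exhaustive grid search against a held-out validation set. Everything here is an assumption made for illustration: the toy one-dimensional nearest-neighbor classifier, the two hyperparameters (`k` and `weighted`), and the invented data. In practice you would more likely lean on a library's search utilities.

```python
from itertools import product


def knn_predict(train, x, k, weighted):
    """Classify x by a vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = {}
    for feature, label in nearest:
        # Optionally weight each vote by inverse distance to x.
        weight = 1.0 / (abs(feature - x) + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


def validation_accuracy(train, valid, k, weighted):
    """Accuracy on the held-out validation pairs."""
    hits = sum(knn_predict(train, x, k, weighted) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, grid):
    """Try every combination in the grid; return (best_score, best_params)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = validation_accuracy(train, valid, **params)
        if best is None or score > best[0]:
            best = (score, params)
    return best


# Toy (feature, label) pairs, invented for illustration.
train = [(0.1, 0), (0.4, 0), (0.5, 1), (0.9, 0), (1.1, 1), (1.5, 1), (1.8, 1)]
valid = [(0.2, 0), (0.6, 0), (1.0, 1), (1.6, 1)]

best_score, best_params = grid_search(
    train, valid, {"k": [1, 3, 5], "weighted": [False, True]}
)
print(best_score, best_params)
```

Scoring on a separate validation set rather than on the training data matters here: tuning against the training data, or repeatedly against a public leaderboard, tends to reward overfitting.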
Throughout the data science community, you will hear references to well-known datasets. Competing makes you familiar with the popular datasets that other learning platforms and conference speakers refer to.
Even though the dataset is provided, you still need to understand the data and the evaluation metrics. Contrary to popular belief, there is still “dirty data” that requires further investigation. Digging deeper into misclassified items leads to adjustments to the algorithm.
It is true that doing one Kaggle competition does not qualify someone to be a data scientist. Neither does taking one class, attending one conference tutorial, analyzing one dataset, or reading one book in data science. Working on competitions adds to your experience and augments your portfolio. It is a complement to your other projects, not the sole litmus test of your data science skill set.
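Digging into misclassified items can start with something as simple as the Python sketch below, which lists the wrong predictions and counts which (true, predicted) pairs occur most often. The record IDs, labels, and helper names are hypothetical, invented only to illustrate the idea.

```python
from collections import Counter


def misclassified(records, y_true, y_pred):
    """Return (record, true, predicted) for every wrong prediction."""
    return [(r, t, p) for r, t, p in zip(records, y_true, y_pred) if t != p]


def confusion_pairs(y_true, y_pred):
    """Count each (true, predicted) pair among the errors only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


# Hypothetical record IDs, labels, and predictions.
ids = ["img_01", "img_02", "img_03", "img_04", "img_05", "img_06"]
truth = ["cat", "dog", "cat", "bird", "dog", "cat"]
preds = ["cat", "cat", "dog", "bird", "cat", "cat"]

errors = misclassified(ids, truth, preds)
pairs = confusion_pairs(truth, preds)

# Most frequent confusions first: these are the cases worth inspecting by hand.
for (true_label, predicted), count in pairs.most_common():
    print(f"{true_label} predicted as {predicted}: {count}")
```

Looking at the actual records behind the most frequent confusion often reveals mislabeled or dirty data, or a feature the model is missing.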
Often, people are unsure whether to pursue a career in data science. Participating in a competition is one informative way to gauge your abilities and excitement. If you truly enjoy the process of Kaggle, it will point you more clearly in the right direction. If you prefer to spend your time doing something else, that is all right too; it is one way to find out.
Getting Started with Kaggle
This article provides extensive information on Kaggle as well as tips on getting started: The Beginner’s Guide to Kaggle
Kernels are code in Jupyter Notebooks that others have shared. You are free to copy and use them to get started on a competition. Code is available in both R and Python.
Each competition has a discussion board for asking questions and upvoting kernels and topics.
Kaggle has a Slack team, KaggleNoobs. There are almost 4,000 members, and there is a channel for AMA (Ask Me Anything), where they regularly interview Kaggle participants and winners.
- You can participate in competitions that have closed. Keep in mind, it is about the learning, not the end result.
- There are a variety of topics (random forests, multi-class, neural networks, NLP) and types of datasets (images, structured data, text, big data)
Partner with Someone
- Whether you are a beginner or experienced in data science, work with someone
- Note that it is best to start as separate teams on Kaggle, so that each of you can make the maximum number of daily submissions, and then merge teams towards the end
I think it is worthwhile to participate in at least one competition. An opinion about something you have tried carries more weight than an opinion about something you have not. Kaggle is evolving, like everything, especially since its acquisition by Google. Check back periodically and see what is new.
It Doesn’t Have to be Kaggle
While Kaggle is the most well-known platform, there are many other opportunities to participate in competitions:
- many university analytics departments have an annual competition
- conferences often have competitions or what are called “tasks”
- private companies sponsor their own competitions
Here is a sample list of other data science competitions. Spending some time with a Google search will turn up more recent and active opportunities.
My Kaggle Experience & Spot-Chasing Retirement, Marios Michailidis, 2016
Machine Learning Isn’t Kaggle Competitions, Julia Evans, 2014
Bio: Reshama Shaikh is a freelance data scientist/statistician and MBA with skills in Python, R and SAS. She worked for over 10 years as a biostatistician in the pharmaceutical industry. She is also an organizer of the meetup groups NYC Women in Machine Learning & Data Science and PyLadies. She earned an MS in statistics from Rutgers and an MBA from NYU Stern School of Business.
Original. Reposted with permission.