Getting Started With Kaggle Competitions

Kaggle is best known for its competitions—prizes up to $100,000 draw some of the brightest machine learning minds to the site. But there’s an archive of challenges for participants of all levels. Where should beginners get started and what do they need to know before making their first entry?

Kaggle is a machine learning and data science community site created in 2010 by founder and CEO Anthony Goldbloom. The site offers a variety of data science tools, including open datasets, full courses, notebook capabilities and discussion boards. By 2017, Kaggle had reached 1 million registered users and been acquired by Google, according to VentureBeat.

Goldbloom said the goal for Kaggle was to create a robust set of tools for data scientists. “We want you to be able to access great code/analysis that you can fork, data that you can analyze and join to and discussion that you can learn from,” he explained in a 2017 interview.

What Are Kaggle Competitions?

Kaggle competitions are online machine learning challenges for data science enthusiasts to learn new skills, practice old ones and sometimes win prizes. Every competition includes a dataset, evaluation metrics and rules for all participants.

Every competitor is part of a “team,” which can range from a single person up to a maximum size set by each competition’s rules.

There are four primary types of Kaggle competitions:

  • Getting Started: recommended for machine learning beginners or first-time Kaggle users.
  • Playground: centered on fun; a slight step up in difficulty from Getting Started.
  • Featured: tend to use commercially relevant problems and have large prizes.
  • Research: experimental problems that don’t typically include a prize offering.

What are you playing for? Kaggle competitions offer a few different prizes and outcomes at the end, depending on the competition type. The Getting Started competitions and some of the Playground or Research competitions offer knowledge or kudos. Simply put, there is no prize for completing these challenges other than building your skillset.

More tangible outcomes include prizes or swag, which vary by the competition and the sponsor. For example, the DonorsChoose.org Application Screening competition on Kaggle offered prizes that included Google Pixelbook laptops, Google Pixel 2 mobile phones and gift cards to the authors of the most upvoted kernels.

Challenges can also offer money. The Passenger Screening Algorithm Challenge, sponsored by the Department of Homeland Security, offered a top prize of $500,000 ($1.5 million total across eight cash prizes). Competitions with large prizes are some of the most competitive, drawing thousands of team entries.

Some Kaggle competitions function as recruiting tools: the prize is a job interview with the sponsoring company. Allstate, Facebook and Walmart have all used Kaggle to recruit for data science positions in the past.

To get started, create a free Kaggle account. You can then pick a competition, agree to its rules and begin cleaning the dataset. For first-timers feeling overwhelmed, Kaggle provides a library of resources and forums to make it easier. You can also check out interviews with past competition winners on their strategies and best practices.

5 Pieces of Advice from Kaggle Competition Pros

Pick a competition that excites you, even if you aren’t an expert in the applied industry.

Dmitry Gordeev and Philipp Singer (The Zoo) won the NFL Big Data Bowl (2020), even though they admit they knew almost nothing about American football before entering. “Don’t worry about having domain knowledge to attempt a specific problem,” they advised. “The main thing we learned in this competition is that you don’t necessarily need domain knowledge or industry [knowledge] to successfully tackle the data science challenge.”

Shubin Dai (bestfitting), No. 1 on the Kaggle leaderboard in May 2018, keeps all his initial findings in one space. “Within the first week of a competition launch, I create a solution document, which I follow and update as the competition continues on,” he said. “I must first try to get an understanding of the data and the challenge at hand, then research similar Kaggle competitions and all related papers.”

Craft a unique approach using what you’ve learned from your collection of sources.

Nicole Finnie, silver medalist in the 2018 Data Science Bowl, said it can help you stand out in the competition. “Once I had a good feel for the theory, then it just took lots of time and work to implement,” she explained. “When you use a popular kernel, make sure to try to implement ideas and concepts from different research papers; that will be more likely to set you apart from other Kagglers.”

Divide the work into manageable pieces.

Data Science for Good: City of Los Angeles, explained how it helps you learn and apply the theory. “Always try to break the data science problem into smaller chunks and try to solve it in an iterative process,” he said. “The most important data science skills are applied and practical data science skills.”

Practice writing robust kernels and exploratory data analysis (EDA) to get a better understanding of the data.

Martin Henze, the first Kaggle Kernels Grandmaster, considers EDA and data visualization to be a pillar of his success. “Plotting a data set from many different angles, and with many different styles and tools, helps me immensely in discovering patterns and correlations,” he said. “More importantly: A Kernel is a perfect lab book in which to document your approach and results — and therefore a great foundation for a successful competition contribution. In my view, learning how to plan, execute, and document your work is one of the most fundamental building blocks for the success of any data-related project.”
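
As a rough illustration of the first-pass checks that usually come before any plotting, the sketch below counts missing values and compares group averages using only Python’s standard library. The column names (`age`, `fare`, `survived`) and every value are invented for the example and are not taken from a Kaggle dataset.

```python
import csv
import io
from statistics import mean

# Invented sample rows standing in for a competition's training file.
SAMPLE_CSV = """age,fare,survived
22,7.25,0
38,71.28,1
,8.05,0
35,53.10,1
54,51.86,0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def missing_counts(rows):
    """Count empty cells per column, a common first EDA check."""
    return {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}

def mean_by_group(rows, value_col, group_col):
    """Average a numeric column within each group, skipping empty cells."""
    groups = {}
    for r in rows:
        if r[value_col] != "":
            groups.setdefault(r[group_col], []).append(float(r[value_col]))
    return {g: mean(vals) for g, vals in groups.items()}

rows = load_rows(SAMPLE_CSV)
print(missing_counts(rows))
print(mean_by_group(rows, "fare", "survived"))
```

Summaries like these tend to suggest which plots are worth drawing next, for example a histogram of a column with many missing values.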

7 Kaggle Competitions to Get You Started

For the first competition: Titanic: Machine Learning from Disaster.

Hailed as “the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works,” the Titanic competition asks participants to predict which passengers survived the shipwreck.
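
A first entry does not need a sophisticated model. The sketch below, offered as one possible starting point rather than a recommended solution, applies a single hand-written rule and formats the predictions as a two-column file. The column names `PassengerId`, `Sex` and `Survived` are assumed to match the competition’s files, and every row is invented; check the competition’s data page for the real layout.

```python
import csv
import io

# Invented rows; the column names are assumed to mirror the competition's files.
train = [
    {"PassengerId": 1, "Sex": "male", "Survived": 0},
    {"PassengerId": 2, "Sex": "female", "Survived": 1},
    {"PassengerId": 3, "Sex": "female", "Survived": 1},
    {"PassengerId": 4, "Sex": "male", "Survived": 1},
    {"PassengerId": 5, "Sex": "male", "Survived": 0},
]
test = [
    {"PassengerId": 6, "Sex": "female"},
    {"PassengerId": 7, "Sex": "male"},
]

def predict(passenger):
    """Rule-of-thumb baseline: predict survival for women only."""
    return 1 if passenger["Sex"] == "female" else 0

def accuracy(rows):
    """Share of labeled rows where the rule matches the recorded outcome."""
    hits = sum(1 for r in rows if predict(r) == r["Survived"])
    return hits / len(rows)

def submission_csv(rows):
    """Render predictions as CSV text with one row per test passenger."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for r in rows:
        writer.writerow([r["PassengerId"], predict(r)])
    return out.getvalue()

print(f"training accuracy: {accuracy(train):.2f}")
print(submission_csv(test))
```

A baseline like this gives you a score to beat, so you can tell whether each later modeling change actually helps.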

For someone looking to expand on a statistics background: House Prices: Advanced Regression Techniques.

Predict the final price of homes in Ames, Iowa, based on 79 different variables, such as the roof style or land slope. Kaggle recommends this competition for students with some machine learning background who want to practice their skills before entering a featured competition.
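
Price-prediction tasks are often scored on a log scale, so that a 10 percent miss on an inexpensive home counts the same as a 10 percent miss on an expensive one. The sketch below assumes that kind of metric (root mean squared error between the logs of predicted and actual prices); the prices are invented, and the competition’s evaluation page remains the authority on the exact formula.

```python
import math

def log_rmse(actual, predicted):
    """Root mean squared error between log(actual) and log(predicted)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Invented sale prices and predictions, in dollars.
actual = [100_000, 200_000, 400_000]
predicted = [110_000, 180_000, 400_000]
print(round(log_rmse(actual, predicted), 4))
```

Because the error depends only on the ratio of predicted to actual price, models for this kind of metric are frequently trained on the log of the target instead of the raw dollar amount.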

For someone interested in natural language processing (NLP): Real or Not? NLP with Disaster Tweets.

Practice NLP in this competition that asks participants to predict whether tweets are describing actual disasters or not. Start out with 10,000 real tweets to build your model.
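
To give a feel for the task, the sketch below trains a tiny bag-of-words Naive Bayes classifier with add-one smoothing, using only the standard library. The example tweets and labels are invented, and this technique is just one simple baseline chosen for illustration, not the approach the competition prescribes.

```python
import math
import re
from collections import Counter

# Invented examples; label 1 = describes a real disaster, 0 = does not.
TRAIN = [
    ("forest fire spreading near the highway evacuate now", 1),
    ("earthquake damage reported downtown several injured", 1),
    ("flood waters rising homes evacuated", 1),
    ("this new album is fire", 0),
    ("my exam was a total disaster lol", 0),
    ("traffic downtown is killing me", 0),
]

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the higher log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict(model, "homes evacuated after flood"))
print(predict(model, "that concert was fire lol"))
```

The word “fire” appears in both classes here, which is the crux of the problem: the model has to rely on surrounding words to tell literal from figurative use.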

For someone interested in image recognition: Facial Keypoints Detection.

Detect and locate facial keypoints like the eyes, nose and mouth on a variety of images to be used for analyzing facial expressions and biometrics.

Predict whether women of Pima Indian heritage have diabetes based on variables such as blood pressure and skin thickness.

Predict which bank customers will make specific transactions in the future based on anonymized variables.

For urban planning data: Bike Sharing Demand.

Predict bike rental demand in Washington, D.C., based on the duration, departure and arrival locations, and time elapsed for past rentals. The dataset includes hourly data for two years’ worth of bike rentals.
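
With hourly data, a natural first step is to turn each timestamp into features and compute a simple per-hour average as a baseline. The sketch below shows one way to do that with the standard library; the records, the timestamp format and the feature names (`hour`, `is_weekend`) are assumptions made for illustration, not the competition’s actual schema.

```python
from collections import defaultdict
from datetime import datetime

# Invented hourly records: (timestamp, number of rentals).
RECORDS = [
    ("2011-01-03 08:00:00", 120),
    ("2011-01-03 14:00:00", 60),
    ("2011-01-04 08:00:00", 140),
    ("2011-01-04 14:00:00", 80),
    ("2011-01-08 08:00:00", 30),
]

def time_features(stamp):
    """Split a timestamp into hour of day and a weekend flag."""
    dt = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {"hour": dt.hour, "is_weekend": dt.weekday() >= 5}

def hourly_means(records):
    """Average rentals per hour of day, a simple baseline to beat."""
    by_hour = defaultdict(list)
    for stamp, rentals in records:
        by_hour[time_features(stamp)["hour"]].append(rentals)
    return {hour: sum(vals) / len(vals) for hour, vals in by_hour.items()}

print(time_features("2011-01-08 08:00:00"))
print(hourly_means(RECORDS))
```

In this toy data the weekend morning has far fewer rentals than the weekday mornings, which hints at why a weekend flag can be a useful feature alongside the hour.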
