3 Best Sites to Find Datasets for your Data Science Projects

When first learning data science, you will inevitably find yourself looking for more datasets to practice with. Here, we recommend the 3 best sites to find datasets to spark your next data science project.

By Angelia Toh, Co-Founder of Self Learn Data Science.

You will inevitably find yourself looking for a dataset somewhere along your data science learning journey. We advocate working on data science projects in ‘How to Become a Data Scientist in 2020’, so you should always be on the lookout for interesting datasets to experiment with. Here we list the 3 best sites where we find datasets for our own data science projects.

1. Kaggle


You should be very familiar with Kaggle by now. Companies release their data on Kaggle to harness the strength of the community and solve their real-life problems. This makes Kaggle the perfect place to find datasets with real problem statements to solve. If you want to practice building machine learning models without the hassle of generating or labeling data, Kaggle is the best place for you. Furthermore, the notebooks section of Kaggle lets users share their code and models, which serve as a great learning resource. I highly recommend that beginners find their first data science project on Kaggle.

2. Google Dataset Search


Just out of beta early this year (2020), Google Dataset Search is the most comprehensive dataset search engine available. It claims to index more than 25 million datasets online and has helped scientists and researchers locate datasets since its launch in Sep 2018. With filters for data type, date updated, and more, Google Dataset Search has become the favorite for most of us.

If a dataset is available online, there is a good chance you will find it with this search engine.

3. Data.gov


When looking for data science datasets, you might want to look at what your government has made publicly available. These data, when put to good use, might result in solutions that benefit your community as a whole. Data.gov is the U.S. Government's open data site, where the government’s data are released to promote research and development within the scientific community. At Data.gov, data are categorized into topics such as health, energy, or education, making it easy to navigate and find the data you need.
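If you prefer to search the catalog from code rather than the website, here is a minimal sketch. It assumes that the Data.gov catalog exposes the standard CKAN `package_search` action at `catalog.data.gov`. The endpoint URL and the helper names `build_search_url`, `dataset_titles`, and `search_datasets` are our own assumptions for illustration, not something documented in this article, so check the current Data.gov API documentation before relying on them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Data.gov's catalog is served by CKAN, whose Action API
# provides a "package_search" endpoint at this address.
CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a search URL for a free-text query (hypothetical helper)."""
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def dataset_titles(response):
    """Extract dataset titles from a CKAN-style search response dict."""
    if not response.get("success"):
        return []
    results = response.get("result", {}).get("results", [])
    return [dataset.get("title", "") for dataset in results]


def search_datasets(query, rows=5):
    """Query the catalog over the network and return matching titles."""
    with urlopen(build_search_url(query, rows), timeout=30) as resp:
        return dataset_titles(json.load(resp))
```

With a working network connection, `search_datasets("energy")` should return a short list of dataset titles that you can then look up on the site.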

What if you are not a resident of the U.S.? Try searching for “data” followed by the name of your country with your favorite search engine. More often than not, you will find sites where your local government publishes its data. For example, India and the UK each have their own open data sites.

Using these sites, you should be able to find datasets that interest you. Remember, practicing data science is the best way to learn. So keep these sites handy, as you will definitely need them.

Bio: Angelia Toh. ‘Impossible’ is just a reminder that ‘I’m possible’. Never stop learning | Self-Taught Data Scientist, Co-Founder of Self Learn Data Science.