8 Places for Data Professionals to Find Datasets
Here is a curated list of sites and resources where data professionals can find practice datasets.
Practice makes perfect — it’s the best way to master any topic or industry. When it comes to datasets, expanding your horizons is necessary because the field is so vast.
For professionals working with any form of data, from machine learning to visualization, the following sites and resources are invaluable for practice.
1. Kaggle
Kaggle is a reliable resource for practice data. Many of its datasets come with code that users have written and shared, which gives you working examples to learn from as you master the field. You’ll work with data topics like natural language processing (NLP) and image classification.
For instance, for text mining, you can dive into the “Star Trek” scripts project and start practicing with the data Kaggle provides. This data repository is unique because you can see feedback from other users who have worked on the same projects as you. The possibilities are endless.
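As a first text-mining exercise on script data like this, counting word frequencies is a reasonable place to start. The sketch below uses only the Python standard library. The sample lines and the tiny stop-word list are made up for illustration; they stand in for whatever file you download, and a real project would read that file instead.

```python
import re
from collections import Counter

# Hypothetical stand-in for a few lines of script text; a real project
# would read the dataset file downloaded from Kaggle instead.
SAMPLE_SCRIPT = """
PICARD: Engage.
RIKER: Shields up. Red alert.
PICARD: Make it so. Engage.
"""

# A tiny, illustrative stop-word list (not a standard one).
STOP_WORDS = {"it", "so", "up"}


def word_counts(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


counts = word_counts(SAMPLE_SCRIPT)
print(counts.most_common(3))
```

From here you could swap in a fuller stop-word list or count words per speaker, depending on what the dataset actually contains.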
2. Google
Google is best known as a search engine, but it has branched out into many other fields. You can check the Dataset Search page for topics that interest you. If you’d like to explore animal and human bonds, all you have to do is type the topic in and the search will pull up datasets for you to choose from.
You can also use Google Trends and Google Finance to find data on any topics you’re looking for. Trends will show you the usage and searches for terms over the years, while Finance brings up stock information you can process.
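Search-interest numbers tend to be spiky, so smoothing them is a common first step once you have them in hand. The sketch below computes a simple moving average with the standard library. The weekly values are invented for illustration (Trends reports interest on a relative 0 to 100 scale), and the `moving_average` helper is a name chosen here, not part of any Google tool.

```python
from collections import deque

# Hypothetical weekly interest scores (0-100) for one search term.
weekly_interest = [40, 42, 55, 90, 60, 48, 45]


def moving_average(values, window=3):
    """Return the mean of each full window of consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages


smoothed = moving_average(weekly_interest)
print(smoothed)
```

The same helper works on a list of daily closing prices if you are practicing with stock data instead.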
3. Reddit
If you’d like a community-based approach to finding practice data, you can turn to Reddit. For many people, Reddit has become a search tool in its own right. The subreddit r/datasets is a prime example of Reddit’s resourcefulness.
You’ll find plenty of contributions from like-minded people about where to find practice data. People will share sites and projects they’ve found helpful and guide you on the right path. You can then share your own projects that you’ve worked with to keep the momentum going.
4. U.S. Government
With the countless datasets the United States government works with, it becomes an opportune resource for practice. Specifically, health data is some of the most abundant information you can find. With over 218,000 datasets available, you’ll be able to find projects in any area.
The current COVID-19 pandemic has brought countless datasets into public view. For instance, you can work with a dataset about using public health data to combat the pandemic. Since the U.S. government makes so many datasets available, this repository is ideal for any form of practice.
5. Election Data
While not a repository, election data is everywhere. This past presidential election has garnered attention like never before in history. Due to the vast number of mail-in ballots and tech-based engagement, this election has generated datasets in new ways.
To practice a range of subtopics — like exploratory data analysis, machine learning, statistical modeling and visualization — you can find election-based datasets on any repository. Google, Kaggle and the U.S. government will be overwhelmingly helpful.
This election will have a lasting impact, and this relevance will make its datasets good practice for years to come.
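A typical exploratory step with election data is to roll precinct or county rows up to a larger unit and compare vote shares. Here is a minimal sketch using only the standard library. The column names and every number in the sample are made up for illustration and are not real results; a downloaded file will likely use different headers.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; these are not real election results.
SAMPLE_CSV = """state,county,candidate,votes
A,North,Candidate X,1200
A,North,Candidate Y,900
A,South,Candidate X,400
A,South,Candidate Y,700
B,East,Candidate X,300
B,East,Candidate Y,500
"""


def vote_share_by_state(csv_text):
    """Sum votes per state and candidate, then convert them to shares."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["state"]][row["candidate"]] += int(row["votes"])

    shares = {}
    for state, by_candidate in totals.items():
        state_total = sum(by_candidate.values())
        shares[state] = {
            name: votes / state_total for name, votes in by_candidate.items()
        }
    return shares


shares = vote_share_by_state(SAMPLE_CSV)
for state, by_candidate in shares.items():
    for name, share in by_candidate.items():
        print(f"{state}  {name}: {share:.1%}")
```

Grouping by county instead of state only requires changing the key used in `totals`, which makes this a handy pattern for zooming in and out of the same file.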
6. Census Data
Similar to election data, census datasets are always changing. The U.S. and world populations fluctuate over the course of a year, especially with something as deadly as the COVID-19 pandemic.
GitHub is a prominent resource for datasets of all kinds. For census data, you can download specific projects that will help you with exploratory data analysis, modeling, visualization and statistics. Working with census information brings you something fresh each time, and you can narrow down the numbers or zoom out as much as possible.
7. Awesome Public Datasets
Just as GitHub has resources for census datasets, it also hosts one of the best resources for data practice on the internet. Awesome Public Datasets has a wide variety of information you can work with.
The list draws on public sources such as blogs and user contributions. You’ll find datasets on topics from agriculture to museums to software. Browsing the list lets you match your interests with a dataset worth practicing on.
8. UCI Machine Learning Repository
Now that machine learning is an essential part of the tech world, it’s critical to understand how it all works. Datasets that focus on machine learning are the best way to master the field. UCI Machine Learning Repository is one of the best resources for this practice.
This site collects various databases, data generators and domain theories, all of which are useful for machine learning practice. You’ll test algorithms against them and come to understand the intricacies of what makes machine learning so valuable today. UCI should be a primary resource for this study.
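Many classic machine learning files are plain comma-separated text with no header row and the class label in the last column. The sketch below, using only the standard library, parses that layout and holds out a test set. It assumes the file you pick follows this layout; the sample rows mimic the well-known Iris file, and the `load_rows` and `train_test_split` helpers are names chosen here for illustration.

```python
import csv
import io
import random

# A few rows in the headerless, comma-separated layout used by files such as
# iris.data; a real project would read the downloaded file instead.
SAMPLE_DATA = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""


def load_rows(text):
    """Parse each non-empty line into (list of float features, label)."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # skip blank lines
            continue
        *features, label = fields
        rows.append(([float(value) for value in features], label))
    return rows


def train_test_split(rows, test_fraction=0.33, seed=0):
    """Shuffle a copy of the rows and split off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = load_rows(SAMPLE_DATA)
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible, which helps when you compare different algorithms on the same data.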
The Best Data Practice
These resources and repositories are some of the best places on the internet to get thorough practice with datasets. Ultimately, you’ll want to work with the sites that pique your interests. If you want to investigate machine learning, UCI will be an ideal resource. If you want to work with population information, the U.S. government will be invaluable.
Then, since these resources provide free practice with projects, you can expand your horizons even further and put all kinds of dataset analyses on your resume.
Bio: Devin Partida is a big data and technology writer, as well as the Editor-in-Chief of ReHack.com.