The Complete Collection Of Data Repositories – Part 2
Check out the collection of the best data repositories on healthcare, natural language, neuroscience, physics, social network, sports, time series, transportation, miscellaneous, and super data repositories.
Image by Author
Editor's note: For the full scope of repositories included in this 2 part series, please see The Complete Collection Of Data Repositories – Part 1.
In an Artificial Intelligence project, data curation takes up the majority of resources and time. It is the most critical part of the process and can determine the success of a service or a product. To minimize the effort of finding the right datasets for your work, I have created a collection of data repositories. A single data repository consists of multiple datasets for a particular field of study, and I am presenting you with a gold mine of high-quality datasets.
Most of the data sources listed below are free and open to the public. However, some are not. The collection of data repositories is divided into 2 parts, which consist of 20 categories based on various fields of science. The first part consists of agriculture, audio, biology, climate, computer vision, economics, education, energy, finance, and government.
In the second part we will be covering:
- Healthcare
- Natural Language
- Neuroscience
- Physics
- Social Network
- Sports
- Time Series
- Transportation
- Miscellaneous
- Super Data Repositories
The Healthcare category consists of patient and hospital records. You can find data on air quality, viruses, diseases, mortality statistics, and vaccination progress. Governments and the pharmaceutical industry can use this data to plan a strategy for dealing with virus outbreaks or to develop standard operating procedures.
Natural Language consists of text data for machine translation, sentiment analysis, and named entity recognition. You can use a large multi-language NLP corpus to train transformers and use them for text generation, summarization, question answering, and building conversational applications.
- Core Data from En.Wikipedia.Org
- Bad Word Lists
- Open Multilingual Wordnet
- Multi-Domain Sentiment Dataset
- Data Repository For Pretrained NLP Models and NLP Corpora
Neuroscience consists of brain imaging data, ranging from fMRI to various CT scans. Brain activity data can be used to detect neurological disorders and to explore ways of rewiring the brain to treat various diseases. fMRI data can also be used for detecting tumors and anomalies in the brain.
The Physics category includes particle data from CERN and space exploration data from Exoplanet. This category is limited, but you can use the data to study particles, look for solutions to energy problems, and explore questions about the origin of the universe.
Image from rawpixel.com
Social network data is generally scraped from GitHub, Reddit, and other social media sites. You can use it for sentiment analysis or for clustering the various groups on a given platform. Some data scientists are using Reddit and Twitter data to train chatbots that analyze trends and interact with other users.
Get stats of your favorite sport and predict the winning team. The sports category is more than just a scorecard. You can get metadata on team history, play-by-play, video data for sports analytics, and live game data.
- Sports Statistics & Sports Data
- Soccer Data and APIs
- NFL Play By Play
- WTA Tennis Rankings, Results, and Stats
- USA Soccer Teams - Location And Metadata
The Time Series category gives you access to all kinds of time series data, such as energy consumption, weather, geographical anomalies, transport, sports, and stock prices. In short, every category mentioned in this collection has some sort of time series data. You can use it to forecast events and detect anomalies.
- Cross-National Time-Series Data Archive
- Heart Rate Time Series
- Time Series Data Library
- The Turing Change Point
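As a rough illustration of the anomaly detection mentioned above, here is a minimal sketch that flags points deviating strongly from a trailing window. The function name `find_anomalies`, the window size, the threshold, and the sample readings are all made up for this example; they do not come from any of the listed repositories, and real data would usually call for a more robust method.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that lie more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat window gives no scale to compare against, so skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical heart-rate-like readings with one obvious spike.
readings = [72, 74, 73, 75, 74, 73, 120, 74, 75, 73]
print(find_anomalies(readings))
```

A trailing window keeps the check causal: each point is compared only with values that came before it, which is how the check would behave on a live feed.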
Transportation data ranges from bike-sharing trips with GPS coordinates to live air traffic monitoring. You have access to all kinds of transportation data, some of it dating back 20 years. Companies are using this data to create marketing campaigns, optimize fuel consumption, create new routes, and observe consumer behavior.
- TLC Trip Record Data
- Bike Share Data Best Practices Wiki
- Openflights: Airport And Airline Data
- City Of Toronto Open Data Portal
- Bureau Of Transportation Statistics
Some data repositories were hard to categorize, so I have created a Miscellaneous category for them. These repositories consist of data dumps, restaurant ratings, and cybersecurity datasets.
- Data Dump (March 2011 To March 2016)
- Yelp Dataset
- Canadian Institute For Cybersecurity
- 9/11 Pager Data
- Open Library Data Dumps
- Post Databases Washington Post
Image by macrovector
Super Data Repository
Super Data Repositories are data banks that cover all fields of science. You can discover many related datasets by running simple queries on any of the platforms mentioned below. I usually start with Kaggle and then move to HuggingFace to find a dataset that works for me. Sometimes I am looking for a specific dataset, and that is where other platforms such as Zenodo and data.hub come in handy.
Tip: If you are looking for a state-of-the-art machine learning dataset, check out Papers With Code.
- Hugging Face
- Papers With Code
- UCI Machine Learning Repository
- Open Data On AWS
- Data Planet
- Data Portal
- Open Access Directory by Simmons
Data science without data is a bunch of mathematical equations and nothing else. Procuring a dataset or finding open-source data online can be a daunting task, so we need some kind of list or cheat sheet for discovering various types of data. These data repositories are helping researchers, companies, and nonprofits come up with solutions to the world's problems.
In this blog, we have learned about healthcare, natural language, neuroscience, physics, social network, sports, time series, transportation, miscellaneous, and super data repositories. Just like data science cheat sheets, you can bookmark the collection, and whenever you start working on a new data project, just skim through the list and discover the data repository that works for you.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.