The Complete Collection Of Data Repositories – Part 1
Check out the collection of the best data repositories on agriculture, audio, biology, climate, computer vision, economics, education, energy, finance, and government.
Image by Author
Editor's note: For the full scope of repositories included in this two-part series, please see The Complete Collection Of Data Repositories – Part 2.
Finding the data that works for your business can take up a lot of time. Several data-sharing platforms offer a wide variety of datasets, but they often can't provide you with a dataset for a specific field of study. That's why I have created a list of data repositories, which will help you find datasets without hours of searching on the internet. A single data repository consists of multiple datasets for a particular field of study.
The collection of data repositories is divided into two parts that together cover 20 categories based on various fields of science. Most of the data sources listed below are free, but some are not. It took me more than 2 days to collect the repositories, which are high quality and easy to download. I used duckduckgo.com to search for most resources, but the majority of repositories are from Awesome Public Datasets and KDnuggets.
In the first part, we will be covering:
- Agriculture
- Audio
- Biology
- Climate
- Computer Vision
- Economics
- Education
- Energy
- Finance
- Government
In the agriculture category, the datasets are mostly related to crop monitoring, remote sensing indices, grain size, geochemistry, soil, and sediment analysis. The data is mostly in tabular form, but you can also find visual data for monitoring crops and detecting weeds in the crop field.
The audio repositories are rich and can be used for automatic speech recognition, text to speech, song classification, emotion detection, translation, and hate speech detection. This is a gold mine for any beginner or mid-size company looking to develop state-of-the-art solutions.
- Million Song Dataset
- A Collection Of Speech Corpus For ASR And TTS
- Hate Speech Corpus
- Korean Read Speech Corpus
- Voice_datasets By Jim-Schwoebel
- Common Voice
The biology category mostly consists of images of cells, cancer cells, types of genomes, genes, and protein structures. You can use them to generate new strains of viruses or come up with life-saving drugs. Most of the datasets are for research purposes and can be downloaded directly.
- IGSR | Data Collections
- Cancer Cell Line Encyclopedia
- NCBI - NIH
- Cell Image Library
- HMS Lincs Project
- RCSB PDB
The climate repositories contain satellite imagery, time-series data of winds and temperature, global weather, and climate spatial data. You can use them to forecast weather, monitor the effects of global warming, and detect natural disasters.
- AWC - ADDS Text Data Server
- European Climate Assessment & Dataset
- Developer Information – World Bank Data Help Desk
- Global Climate and Weather Data
- Global Wind Atlas
Image by Freepik
Computer vision is in high demand. Companies are developing all kinds of solutions to improve current processes or create new services, such as warehouse management, self-driving cars, face detection, generative art, and robots.
- Find Open Datasets for AI Projects
- Computer Vision Online
- VisualData Discovery
- Wilma Bainbridge
- YouTube Faces Database
- SUN Database
- Computer Vision Datasets
- Image Datasets For Machine Learning
The world economics data consists of trade statistics, the Human Development Index, geospatial data on food supplies, and macroeconomic data. You can use it to analyze current trade deficits and forecast countries' development.
- American Economic Association
- Our World In Data
- Atlas Of Economic Complexity Dataverse
- UN Comtrade: International Trade Statistics
- Human Development Data Center
- Joint External Debt Hub
In the education category, you can find data on student assessments, report cards, college performance, graduation rates, and surveys filled out by individual students, school principals, and parents.
The energy category is filled with global power consumption data, smart meter data from various buildings, and power station energy production rates. We can use it to plan the rollout of renewable energy, save on electricity costs, and meet the high global demand for energy.
- Global Power Plant Database
- Smart Meter Data Analytics
- DOE Global Energy Storage Database
- Energy Data Resources
Image by rawpixel.com
In the finance category, you can find data on debts, banking statistics, GDP, exchange rates, consumer prices, and much more. Finance is the backbone of the modern economy, and to help keep the economy stable, we can use this data to predict the next financial crisis, detect crimes, and forecast stock prices.
You can find government data for many countries, states, and even counties. Many government officials promote fairness and inclusiveness by sharing data with the public. The most prominent datasets are from the US, India, Canada, New Zealand, and the UN. They cover all kinds of information, from crime to food security.
- City Of Austin Texas
- Government Of Canada
- Stats NZ
- World Development Report
- UN Data
In this blog, we have covered 10 categories of data repositories. We have also looked at the types of datasets they contain and their use cases. These datasets are a goldmine, and many of them are hard to find on Kaggle or other general sites. Most data scientists search on either Kaggle or Google to get a dataset, and sometimes they are happy with what they get. They spend most of their time cleaning and augmenting data instead of looking for better data resources. A curated collection changes that: I am going to use my collection of repositories to find what I am looking for.
In the second part, we will be looking at healthcare, natural language, neuroscience, physics, social networks, sports, time series, transportation, miscellaneous, and super data repositories.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.