Where Does Data Come From?

In this article, we will go over the top five ways to collect or receive data, whether to help optimize an AI-driven machine or simply forecast future consumer demand.

By Nahla Davies, KDnuggets on August 5, 2022 in Data Science

Photo by Christina Morillo

Data is driving the world forward at an increasingly fast pace. It is used to aid machine learning, optimize AI-driven computers, and predict future outcomes with incredible accuracy. Our modern age remains defined by continuous technological breakthroughs fueled by data. Raw data is the guidepost for new technology, keeping new developments in line with reality and everyday functionality.

Data gives us better control over our lives. Whether it’s informing public policy, fine-tuning autonomous vehicles, predicting when we’ll need to order a refill of hand soap, or providing us with relevant content suggestions on social media, data can help answer our questions about life, often before we’ve even realized we had them!

Because of its power as a form of business intelligence, data on consumers is invaluable to nearly every company. Data is especially valuable to a tech company that uses machine learning in its products. Software powered by machine learning improves by applying what it “learns” about real life from the raw data fed into it.

Unlike humans, machine learning tools don’t need to take study breaks, so it seems inevitable that artificially intelligent computers will be the source of many future scientific discoveries. How can an ambitious tech startup best gain access to large amounts of data and maintain control of it?

In this article, we will go over the top five ways to collect or receive data, whether to help optimize an AI-driven machine or simply forecast future consumer demand.

Where Does Raw Data Come From?

Data exists all around us, but collecting and organizing data for a specific project can sometimes be overwhelming. Here are the five most common sources of raw data.

1. Publicly available data

We’ll start with the most obvious source of data: public data, which can be found in government records or on publicly accessible platforms such as Facebook, LinkedIn, or Google. Public data is any information out in the open, such as newspaper stories, city census information, or voter registration lists. As our society continues to incorporate more technology into everyday life, the amount of data gathered about people will only continue to grow.

For example, a recent study showed that demographic changes in a neighborhood can be accurately predicted by information gathered through the U.S. Census Bureau, potentially removing the need for labor-intensive door-to-door census surveys. While this is an innocuous example, other technical improvements in gathering public data, such as facial recognition technology, remain controversial modes of collecting data and are thus rarely used.

Whether you are raking through Twitter as part of sentiment analysis or using local demographic statistics to build a preliminary data model, public data can be a helpful foundation to build upon. While it is a good starting point for your research or project, it also makes your data models easier to replicate. Statistics show that 81% of retailers collect data in large quantities to help with their marketing and development.

Using public data can make your models more generic, but it also brings a level of transparency that can strengthen your project. For example, cryptocurrencies such as Bitcoin are traded on a public blockchain that is permissionless and accessible to everyone, yet transactions remain very secure.

2. Data from the use of your software

Now that you have a model based on publicly available data, it is time to fine-tune it with more specific data.

The best data to use for machine learning or to develop an artificially intelligent program is data that is specific to your program or type of user. For example, autonomous cars continuously gather data from their drivers to enhance their ability to drive autonomously. Conversational AI chatbots rely on data inputs and user behavior to enhance their ability to reply to requests and answer questions accurately.

This is an extremely relevant way of gathering data because it is highly specific. For example, if you were developing an AI-powered search database for a company that works in finance, you could use publicly available financial data to begin the foundational construction of your database. However, to truly hone the database so that it is custom-made for the types of questions and inquiries that arise in a finance department, the software would need to learn from the interactions it has with its users. That’s why AI-powered software may start off clunky or irrelevant and grow much more accurate and efficient with frequent use.
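To make the idea of learning from user interactions concrete, here is a minimal sketch of how a search tool might record what users ask and which result they pick. Everything in it is an assumption made for illustration: the function names `log_interaction` and `click_counts`, the record fields, and the sample queries are hypothetical, and an in-memory buffer stands in for a real log file or database.

```python
import io
import json
from datetime import datetime, timezone


def log_interaction(sink, user_query, shown_results, clicked_result):
    """Append one user interaction to `sink` as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": user_query,
        "shown": shown_results,
        "clicked": clicked_result,  # None means the user clicked nothing
    }
    sink.write(json.dumps(record) + "\n")
    return record


def click_counts(lines):
    """Count how often each result was clicked for each query."""
    counts = {}
    for line in lines:
        record = json.loads(line)
        if record["clicked"] is None:
            continue  # no click, so no preference signal to learn from
        key = (record["query"], record["clicked"])
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical usage: an in-memory buffer stands in for a log file.
log = io.StringIO()
log_interaction(log, "q3 revenue", ["report_a", "report_b"], "report_b")
log_interaction(log, "q3 revenue", ["report_a", "report_b"], "report_b")
log_interaction(log, "q3 revenue", ["report_a", "report_b"], None)

counts = click_counts(log.getvalue().splitlines())
print(counts)
```

Counts like these are only one possible signal. A real system would likely also need user consent, privacy safeguards, and a way to feed the logged data back into model training.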

3. Human entry

Another method of collecting data comes from human entry. In this method, trained operators or engineers work on the design or application of a program while simultaneously collecting data. By supervising and controlling the system manually as it operates, developers can work on the prototype for their new model while also collecting real-world data. A system may start off being 70% controlled by an operator and 30% autonomous, but once enough data is gathered and the artificial intelligence is bolstered, the system may progress to being 95% autonomous as it “learns” how to behave.
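One way to picture this hand-off between operator and system is a confidence threshold: the system acts on its own when it is confident and defers to the operator otherwise, saving each operator decision as a training example. The sketch below is a simplified illustration under that assumption, not a description of how any particular product works. The names `decide` and `run_session`, the 0.9 threshold, and the simulated steps are all hypothetical.

```python
def decide(model_action, model_confidence, operator_action, threshold=0.9):
    """Return (action, source), deferring to the operator below the threshold."""
    if model_confidence >= threshold:
        return model_action, "model"
    return operator_action, "operator"


def run_session(steps, threshold=0.9):
    """Run simulated steps; return collected examples and the autonomous share."""
    training_examples = []
    autonomous_steps = 0
    for observation, model_action, confidence, operator_action in steps:
        _, source = decide(model_action, confidence, operator_action, threshold)
        if source == "model":
            autonomous_steps += 1
        else:
            # Each operator intervention becomes a labeled example for retraining.
            training_examples.append((observation, operator_action))
    share = autonomous_steps / len(steps) if steps else 0.0
    return training_examples, share


# Simulated data: (observation, model's proposal, model confidence, operator's choice).
steps = [
    ("clear road", "cruise", 0.97, "cruise"),
    ("car ahead", "cruise", 0.55, "brake"),
    ("lane drift", "steer_left", 0.92, "steer_left"),
    ("roadblock", "cruise", 0.40, "stop"),
]

examples, share = run_session(steps)
print(examples)  # operator decisions collected as training data
print(share)     # fraction of steps handled without the operator
```

As the model is retrained on the collected examples and its confidence improves, more steps would clear the threshold, which is one way the autonomous share could climb over time.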

Self-driving cars, for example, go through five stages before they can become fully autonomous. The cars start with minimal self-driving features, such as the ability to detect a car ahead of them and brake, drive straight to stay within a lane, or maintain a certain speed. These features are powered by cameras and sensors, which also play an important role in gathering data about driving behaviors, neighborhoods, and common roadblocks.

4. Data collection

A more old-fashioned form of data collection, “brute force” data acquisition is still an effective method. This is when data is collected purposefully rather than picked up from publicly available data or as part of the testing or development of your product. For example, a city census taker might go door-to-door verifying information about the residents who live there. Similarly, a surveying vehicle can be tasked with driving around a neighborhood to gather images for the purpose of creating an HD map.

In both of these scenarios, the main goal is data collection. Finding patterns and using the data comes later; at the collection stage, neither humans nor artificial intelligence have yet interpreted the data to make it meaningful. While this method is time-consuming and labor-intensive, this type of hard-won data can be difficult for competitors to replicate.

5. Purchase data sets

An increasingly popular way for companies to gain access to high-quality data is to simply purchase data sets from a reputable company. When purchasing data for use in your model, you don’t have control over the type or quality of data you receive, and there is always the possibility that it will be outdated or irrelevant to your project.

However, it is a quick and easy way to get the data you need to begin training your program. Companies that obtain data this way should research the reputation of the seller, the source of the data, and how it was collected to confirm that it is relevant for their purposes before purchasing.
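Once a purchased data set is in hand, a quick audit can flag the problems mentioned above, such as incomplete or outdated records. The following is a rough sketch under stated assumptions: the function name `audit_rows`, the column names, the one-year freshness cutoff, and the sample rows are all made up for illustration, and dates are assumed to be in ISO format (YYYY-MM-DD).

```python
import csv
import io
from datetime import date


def audit_rows(rows, required_fields, date_field, today, max_age_days):
    """Summarize incomplete and stale records in an iterable of dict rows."""
    report = {"total": 0, "missing": 0, "stale": 0}
    fields_to_check = list(required_fields) + [date_field]
    for row in rows:
        report["total"] += 1
        # A row counts as "missing" if any checked field is absent or blank.
        if any(not (row.get(field) or "").strip() for field in fields_to_check):
            report["missing"] += 1
            continue
        try:
            collected = date.fromisoformat(row[date_field].strip())
        except ValueError:
            # An unparseable date is treated like a missing value.
            report["missing"] += 1
            continue
        if (today - collected).days > max_age_days:
            report["stale"] += 1
    return report


# Hypothetical sample standing in for a purchased CSV file.
SAMPLE = (
    "customer_id,region,collected_on\n"
    "1,north,2022-06-01\n"
    "2,,2022-07-15\n"
    "3,south,2019-01-10\n"
    "4,east,2022-07-30\n"
)

rows = csv.DictReader(io.StringIO(SAMPLE))
report = audit_rows(
    rows,
    required_fields=["customer_id", "region"],
    date_field="collected_on",
    today=date(2022, 8, 5),
    max_age_days=365,
)
print(report)
```

A check like this only covers completeness and freshness. Whether the data is actually relevant to your project still requires looking at where it came from and how it was collected.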

Conclusion

Data is all around us and will continue to fuel technological growth in our society. As artificial intelligence and machine learning, in particular, propel us into an exciting new age, we will see an increasing demand for high-quality and real-time data from tech companies.

If you are looking for data for your own projects, the recently revamped KDnuggets curated collection of Datasets for Data Science, Machine Learning, AI & Analytics is a great place to start.

Nahla Davies is a software developer and tech writer. Before devoting her work full time to technical writing, she managed, among other intriguing things, to serve as a lead programmer at an Inc. 5000 experiential branding organization whose clients include Samsung, Time Warner, Netflix, and Sony.