Data access is severely lacking in most companies, and 71% believe synthetic data can help

MOSTLY AI has conducted the first-ever synthetic data survey in the data science AI/ML community. Check out the results here.

By KDnuggets on July 17, 2023 in Partners

Data access and the state of synthetic data in 2023

TL;DR: On average, only 15% of AI/ML models are in production. Regarding the reason behind the failure of AI/ML projects, 35% cited a lack of AI/ML talent, while 28% blamed a lack of data access. Sixty-one percent of respondents noted it takes months to access quality data, with 71% agreeing that synthetic data is the missing piece of the puzzle required for AI/ML projects to succeed.

The state of synthetic data in 2023 is heavily influenced by the hype around generative AI and the omnipresent boom of AI-powered technologies, thanks to the recent LLM breakthroughs. Here at MOSTLY AI, we have experienced a spike in inbound requests and general inquiries since ChatGPT went mainstream.

People are excited to leverage AI in their day-to-day work and are seeking structured data alternatives via generative AI superpowers. While LLMs are a different beast altogether, with pre-trained models and supervised learning, AI-powered synthetic data generators can provide data access to representative synthetic data that can be readily used as a replacement for original data. Synthetic data offers a privacy-safe way to democratize data access and augment datasets to fit specific purposes. The result is shorter time-to-data, easier data access, and data science task automation.

Synthetic data generators are already helping people who work with structured data, from data scientists to AI/ML engineers. But how well is the category understood, and how far along are we to full-scale adoption?

Tobi Hann, the CEO of MOSTLY AI, says:

Synthetic data platforms are changing how we work with data and also how we develop data-centric AI/ML across all industries. We see the greatest rates of adoption today in areas where a large amount of sensitive and business-critical data is being handled, such as banking, insurance, and healthcare. This year so far has further expansion of interest in the synthetic data domain, and I suspect that, at least in part, this is due to all the attention ChatGPT has brought to the generative AI scene."

However, data access remains an issue for most organizations, and privacy concerns are more pressing than ever. Although the urgency to adopt and scale AI is tangible across industries, data privacy issues and a lack of awareness of privacy-enhancing technologies, such as synthetic data, prevent most companies from capitalizing on the shift toward AI-supported work and services.

Why AI/ML projects fail to materialize

While more and more people embrace AI-powered tools in their tech stack, large-scale deployment of AI/ML models is still a limited privilege. Progress is visible, but moving AI/ML into production is still hard. Yet, companies are scrambling more than ever to make this happen. While projects developing and scaling AI or sophisticated ML were scarce years ago, everyone is now trying to materialize these projects with a new-found sense of urgency. Despite the ambitions, happy endings are still hard to come by.

We asked survey respondents the reason for AI/ML projects' failure to materialize. Of the respondents, 35% cited a lack of AI/ML talent, while 28% blamed a lack of data access. Solving these issues is no easy task, and we wholeheartedly believe AI-generated synthetic data can help on both fronts.

Data access: The greatest bottleneck

The most shocking data gathered during the survey was this: Only 18% of respondents said that access to quality data is not a problem for them. For 20%, it takes weeks, while for 61% of people asked, it takes months to get data access. No wonder data-centric projects don't take off.

It's easy for OpenAI to train LLMs on publicly available corpora (copyright issues pending, of course), but for the average data team, even their in-house data assets are locked away by internal policies, destroyed by data masking, and only available for specific use cases. If companies are to keep up in the AI race, this needs to change fast. AI/ML talent also needs data access to be able to grow and develop expertise as well as domain knowledge.

Toy datasets only get you so far, especially when you are beginning your data science journey and want to test your assumptions. The development of in-house talent and the rise of citizen data scientists cannot take off without meaningful data democratization efforts, which is also a data access issue.

The missing piece of the AI/ML puzzle

Synthetic data versions are the easiest assets to help accelerate data access and unlimited data consumption. Among respondents, 71% agreed that synthetic data is the missing puzzle piece for AI/ML projects to succeed. We are well on track to reach Gartner's estimate that by 2030, synthetic data will completely overshadow real data in AI models. It looks like synthetic data is indeed the future of AI.

Seventy-two percent of the 332 survey respondents plan to use an AI-powered synthetic data generator within the next few years, and almost 40% plan to use one in the next three months, with most people citing data augmentation as their main use case (46%).
Although excitement is high, the survey also highlighted a heightened need for educating the data community about the benefits, limitations, and use cases of synthetic data.

Misconceptions are widespread, even among AI/ML experts

There is still a lot of confusion around the term "synthetic data"; 59% of respondents didn't know the difference between rule-based and AI-generated synthetic data. This suggests that synthetic data companies have a huge responsibility to educate data consumers and learn firsthand what it's like to work with synthetic versions of real datasets and how to do it well. Free, robust synthetic data generators with easy-to-use UIs coupled with API options, like MOSTLY AI's synthetic data platform, are the most likely to succeed in educating the public.

"We have to educate people big time. Since we work with synthetic data day in and day out, we take a lot of related knowledge for granted, and only when conversations get to a deeper level do we realize that sometimes even engineers have fundamental misunderstandings about the way synthetic data generation works and the use cases it is capable of solving. Our number one priority is to get people hands-on with synthetic data technology, so they really learn the capabilities in their day-to-day tasks and might even discover new ways of working with synthetic data that we didn't think about," added Tobi Hann.

Synthetic data potential

When asked about the most frequently used data anonymization tools and techniques, 49% of respondents said that they use data masking to anonymize data. Twenty percent said they simply remove PII from datasets – an approach that is not only unsafe from a privacy perspective but can also destroy data utility needed for high-quality training data. Privacy-enhancing technologies, like homomorphic encryption, AI-generated synthetic data, and others, account for 31%.

There is certainly room to grow and change habits around data anonymization and data prep for the better. MOSTLY AI's team will continue to keep an eye on synthetic data trends, and we'll repeat the survey next year. If you want to stay in the loop on the latest news around synthetic data – be it the latest research results, regulations, or the business side of things – sign up for the monthly Synthetic Data Newsletter!

If you are ready to accelerate data access in your company or would like to try our state-of-the-art data augmentation features, sign up for your free-forever account to get hands-on with MOSTLY AI's easy-to-use and secure synthetic data platform. Our team is available directly from the app to support you to help you make the most of synthetic data generation.