It was only a matter of time before the rise of artificial intelligence led to the creation of artificial data to power AI models. Although the idea of synthetic data is not new, its use and potential have grown rapidly over the past year. A number of tech startups and universities now offer synthetic data services for a variety of uses, including insurance and finance.
While synthetic data can be collected through sensors and through images, video or audio, just as real data is collected, Ali Jahanian, a former researcher at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), states that “the idea is to have an algorithm – like a simulator or a generative model – to generate such data modalities with the aim of having these synthetic data as realistic as the real ones.”
Synthetic data provides cost savings and eliminates privacy concerns
The advantages of synthetic data over real-world data signal huge potential. At its core, synthetic data is simply less expensive to collect and maintain than real data; real-world datasets can cost millions of dollars.
Another consideration is that training AI models on real data has raised issues of data privacy, bias, and fairness – issues that are essentially eliminated with the use of synthetic data. And often the type of real data required for a project is simply unavailable or of poor quality.
Jahanian says to think of “a generative model that generates synthetic data as an interface to your real data. That means you can get your real data but transform it in ways you couldn’t with your real data.” At CSAIL, Jahanian and his team were able to transform daytime scenes into nighttime scenes and turn a dormant volcano into an active volcano.
“These are examples of transformations that you can get for free from a generative model that aren’t available in the actual data you’ve collected,” he says.
Research by Jahanian and his colleagues has already shown that some results with synthetic data are comparable to those obtained with real-world data, and others are even better. Synthetic data also allows the AI to train, which Jahanian says can be “cool and scary at the same time.”
The use of synthetic data will continue to grow
In higher education, one application of synthetic data could be to “provide different narratives about a concept by being able to generate rich content. Imagine if everyone could generate the content they need by customizing it. This can help each individual learn in their own learning style. Maybe a person needs more knowledge to understand a concept,” says Jahanian.
60%: The percentage of data used for the development of AI and analytics projects that will be synthetically generated by 2024
Source: blogs.gartner.com, “By 2024, 60% of data used for AI and analytics project development will be synthetically generated,” July 24, 2021
It is possible that synthetic data will entirely eliminate the need to use real-world data in the near future. Research firm Gartner predicts that synthetic data will completely eclipse real data in AI models by 2030. Jahanian says he agrees.
“I believe it will create parallel worlds and it will have its uses,” he says. “Depending on how wealthy we want this world to be, it could take a few years. However, we are currently seeing examples of synthetic languages or image builds – like OpenAI GPT-3 and DALL-E – that are very close to human capabilities or even beyond in some specific cases.”