Generative artificial intelligence is attracting a great deal of attention for its ability to generate text and images. But these media represent only a fraction of the data produced in our society today. Data is generated every time a patient passes through a medical system, a storm affects a flight, or a person interacts with a software application.
Using generative AI to create realistic synthetic data around these scenarios can help organizations more effectively treat patients, reroute aircraft, or improve software platforms — especially in scenarios where real-world data is limited or sensitive.
For the past three years, MIT spinout DataCebo has offered a generative software system called the Synthetic Data Vault to help organizations create synthetic data for tasks like testing software applications and training machine learning models.
The Synthetic Data Vault, or SDV, has been downloaded more than a million times, and more than 10,000 data scientists use the open source library to generate synthetic tabular data. The founders — principal investigator Kalyan Veeramachaneni and alumna Neha Patki ’15, SM ’16 — believe the company’s success is due to SDV’s ability to revolutionize software testing.
SDV goes viral
In 2016, Veeramachaneni’s group at the Data to AI Lab introduced a suite of open source generative AI tools to help organizations create synthetic data that match the statistical properties of real data.
Companies can use synthetic data instead of sensitive information in programs while preserving statistical relationships between data points. Companies can also use synthetic data to run new software through simulations to see how it works before releasing it to the public.
Veeramachaneni’s group ran into a recurring problem as it worked with companies that wanted to share their data for research: much of that data was too sensitive to hand over directly.
“MIT helps you see all these different use cases,” explains Patki. “You work with financial companies and healthcare companies, and all of these projects are useful for formulating solutions in different industries.”
In 2020, the researchers founded DataCebo to build out more SDV features for larger organizations. Since then, the use cases have been as impressive as they have been varied.
With DataCebo’s new flight simulator, for example, airlines can plan for rare weather events in a way that would be impossible using only historical data. In another application, SDV users synthesized medical records to predict health outcomes for cystic fibrosis patients. A team from Norway recently used SDV to create synthetic student data to assess whether different admissions policies are meritocratic and bias-free.
In 2021, the data science platform Kaggle hosted a competition in which data scientists used SDV to create synthetic datasets as a substitute for proprietary data. Approximately 30,000 data scientists participated, building solutions and making predictions based on the realistic synthetic data.
As DataCebo has grown, it has stayed true to its MIT roots: all of the company’s current employees are MIT alumni.
Supercharging software testing
Although its open source tools are used for a variety of purposes, the company is focused on growing its traction in software testing.
“You need data to test these software applications,” says Veeramachaneni. “Traditionally, programmers manually write scripts to create synthetic data. With generative models created using SDV, you can learn from a sample of collected data and then sample a large amount of synthetic data that has the same properties as the real data, or create specific scenarios and edge cases, and use the data to test your application.”
For example, if a bank wanted to test a program designed to reject transfers from accounts with no money in them, it would have to simulate many accounts making transactions at the same time. Doing so with manually created data would take a lot of time. With DataCebo’s generative models, customers can create any edge case they want to test.
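As a rough illustration of that workflow, the sketch below uses SDV's open-source single-table Python API (as of its 1.x releases; class names may differ in other versions) to learn from a small sample of transaction-like data, generate a larger synthetic table with the same statistical shape, and then request a zero-balance edge case through conditional sampling. The column names and toy values are hypothetical, not drawn from DataCebo or any bank.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition

# Hypothetical sample of real transaction data (illustrative values only)
real_data = pd.DataFrame({
    "balance": [1500.0, 0.0, 320.5, 78.0, 2400.0],
    "transfer_amount": [200.0, 50.0, 300.0, 20.0, 900.0],
    "channel": ["web", "mobile", "web", "atm", "mobile"],
})

# Learn the statistical properties of the real sample
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Sample a large amount of synthetic data with the same properties as the real data
synthetic_data = synthesizer.sample(num_rows=10_000)

# Create a specific edge case: many accounts with no money attempting transfers
zero_balance = Condition(num_rows=500, column_values={"balance": 0.0})
edge_cases = synthesizer.sample_from_conditions(conditions=[zero_balance])
```

The conditional sampling step is what makes edge cases cheap to produce: instead of hand-writing hundreds of zero-balance records, a tester asks the fitted model for them on demand.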
“It’s common for industries to have data that is sensitive in some capacity,” says Patki. “Often when you’re in a domain with sensitive data you’re dealing with regulations, and even if there are no legal regulations, it is in the best interest of companies to be aware of who accesses what at what time. So synthetic data is always better from a privacy perspective.”
Scaling synthetic data
Veeramachaneni believes DataCebo is advancing the field of what he calls synthetic business data, or data generated from user behavior on large enterprise software applications.
“This type of business data is complex, and unlike language data, it is not universally available,” says Veeramachaneni. “When people use our publicly available software and report whether it works on a certain pattern, we learn a lot of these unique patterns, and that allows us to improve our algorithms. From one perspective, we are building a corpus of these complex patterns, something that already exists for language and images.”
DataCebo has also recently released features to improve SDV’s usefulness, including the SDMetrics library, for assessing the “realism” of generated data, and SDGym, a way to compare the performance of different models.
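As a hedged sketch of how those two tools fit into that workflow, the snippet below continues the hypothetical example above: SDMetrics’ QualityReport scores how closely the synthetic table tracks the real one, while the SDGym call is left commented out because its arguments here are assumptions rather than confirmed usage.

```python
from sdmetrics.reports.single_table import QualityReport

# Reuses the hypothetical real_data, synthetic_data, and metadata from the earlier sketch
report = QualityReport()
report.generate(real_data, synthetic_data, metadata.to_dict())

print(report.get_score())  # overall score between 0 and 1
print(report.get_details(property_name="Column Shapes"))  # per-column similarity breakdown

# SDGym benchmarks multiple synthesizers against reference datasets; shown as an
# assumed usage only, to indicate where it sits in the workflow.
# import sdgym
# results = sdgym.benchmark_single_table(
#     synthesizers=["GaussianCopulaSynthesizer", "CTGANSynthesizer"]
# )
```

A score close to 1 indicates that the generated table reproduces the column distributions and pairwise relationships of the real data, which is the kind of evidence organizations look for before trusting synthetic data.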
“It’s about organizations trusting that new data,” says Veeramachaneni. “[Our tools offer] programmable synthetic data, which means we allow businesses to bring their specific insights and intuition to build more transparent models.”
As companies in every industry rush to adopt AI and other data science tools, DataCebo ultimately helps them do so in a more transparent and accountable way.
“In the next few years, synthetic data from generative models will transform all data work,” says Veeramachaneni. “We believe that 90 percent of business operations can be done with synthetic data.”