Is fake data good news?

Synthetic data promises to remove bias from AI, bolster data privacy and improve our lives, but can something fake ever create real impact?

Fake news. Fake reviews. Deep-fake videos. Technology has made it easier than ever to generate content with the aim of obfuscating. ‘Fake’ has always had inherently negative connotations, yet there is a rising genre of fake content that is altogether more positive. It’s still algorithmically generated, it’s still created with the purpose of disguising the truth, but it has the potential to make the world fairer, more open, and safer. At least, that’s what advocates of synthetic data are on a mission to prove as the technology gets set to enter the mainstream.

Technically, the concept of synthetic data has been around for decades. Using a computer to generate datasets that would not otherwise exist formed the basis of early algorithms. Music synthesisers use a form of synthetic data, for instance. So do flight simulators. Any program in which the output is true-to-life, but is generated by machines, uses synthetic data.

Yet the rise of deep neural networks has elevated synthetic data beyond the hobbyist. Highly advanced algorithms can now build entire datasets that look like the real thing, act like the real thing, yet are fake. And the benefits are as multifaceted as the datasets themselves.

The amount of data consumed globally rises exponentially by around 30 per cent year-on-year. From two zettabytes of data in 2010, estimates suggest we’ll exceed 74 zettabytes by the end of this year – a figure that is set to more than double again by 2025. Within this, Gartner predicts that by 2024, 60 per cent of the data used for training AI and analytics will be synthetically generated.

Synthetic data was largely born from a need to make better use of this bulging data: to take incomplete datasets and plug the gaps, or to make data sharing easier.

“We have sophisticated ways of collecting data, but most organisations don’t have advanced data strategies,” Fernando Lucini, the lead for data science and machine learning engineering at Accenture, tells E&T. “Data is often cut off in silos and it’s difficult to build a complete picture. Regulations such as GDPR [also] make it difficult to use certain data sets, as there are strict privacy measures.”

In a landmark paper from 2014, researchers at the Université de Montréal, led by Ian Goodfellow, introduced a then-revolutionary AI tool known as the generative adversarial network (GAN). This network could generate entirely artificial images of faces to a degree never before seen, albeit slightly blurry and in black and white. In 2018, researchers from Nvidia built on the idea with a vastly superior tool, StyleGAN, which powers the website This Person Does Not Exist. Each time you refresh the site, a hyper-realistic, artificially generated image of a person appears. They’re so lifelike, they look like photos.

GANs consist of a Generator and a Discriminator. These play an adversarial ‘game’ with each other. The Generator generates fake data, similar to the examples in the training set, with the aim of tricking the Discriminator into thinking that data is real. It’s the Discriminator’s job to tell the fake data from the real data. By competing in this way, the Generator gets better at creating realistic data, and the Discriminator gets better at distinguishing between fake and real.
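To make the ‘game’ concrete, here is a deliberately tiny sketch rather than anything resembling a production GAN. It assumes the ‘real’ data is a single number drawn from a bell curve centred on 4, that the Generator is a straight line applied to random noise, and that the Discriminator is a one-variable logistic classifier. All names and settings (`train_toy_gan`, the learning rate, the batch size) are illustrative choices, not taken from any paper or product mentioned here.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, batch=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # Generator: fake = a * noise + b
    w, c = 0.0, 0.0  # Discriminator: P(real) = sigmoid(w * x + c)
    for _ in range(steps):
        # Toy assumption: "real" data is Gaussian with mean 4 and spread 1.
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator turn: raise log D(real) + log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (1.0 - d) * x
            grad_c += 1.0 - d
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w -= d * x
            grad_c -= d
        w += lr * grad_w / batch
        c += lr * grad_c / batch

        # Generator turn: raise log D(fake), i.e. try to fool the
        # Discriminator that has just been updated.
        grad_a = grad_b = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            grad_a += (1.0 - d) * w * z
            grad_b += (1.0 - d) * w
        a += lr * grad_a / batch
        b += lr * grad_b / batch
    return a, b, w, c
```

Calling `train_toy_gan()` returns the four learned numbers. In this toy setting the Generator's offset `b` starts at 0 and is pushed towards the real data's centre as the two sides take turns. Real GANs replace both straight-line models with deep neural networks, which is where the heavy processing cost comes from.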

“Adversarial approaches combine and merge content from unconnected examples to generate new ones,” Tim Ensor, director of artificial intelligence at Cambridge Consultants, tells E&T. “This might include images of faces with different hair and features, or cars and pedestrians in a variety of locations. The approach is processor- and memory-intensive, as it requires a neural network to be trained many thousands of times, but the advantage is that the final accuracy of a [GAN] model can be increased.”

Such results may not seem that ground-breaking today, particularly with the growing ubiquity of deep-fake videos, but these GAN breakthroughs paved the way for a multitude of uses for the technology, one being the creation of synthetic ‘structured’ data.

Whereas photos, videos and voice data are classed as unstructured, structured data is the type most companies collect from customers. “If you were to create synthetic data 10 years ago about financial data you would have had to rely on everything you already knew,” says Alexandra Ebert, chief trust officer at synthetic data start-up Mostly AI. “This was useful for testing, but not to get insights from. AI changed this. With AI-generated synthetic data, it’s possible to learn all the patterns or correlations of an existing data set to allow you to analyse it and figure out insights about customer behaviour that you couldn’t get beforehand.”

Yet it’s far from easy. Not every organisation has a GAN, deep neural network, or AI program to hand. To unlock the benefits of the data they hold, these organisations need to partner with third parties, whether that’s start-ups specialising in analytics, freelancers brought in to develop new software, or fellow experts who join forces to tackle a global problem – as we’ve seen in the wake of Covid. This then brings with it a host of privacy and regulatory headaches.

Before synthetic data, privacy regulations were navigated through onboarding, or classic data anonymisation techniques. The former involves third parties going through security checks and validations to access the original dataset. The latter involves redacting cells so the crux of the data remains but customers can’t be identified.

Both have pitfalls. Onboarding can take six months at a time and can prove costly, while classic anonymisation is fraught with issues and trade-offs. The more data that’s redacted, the more privacy is protected, but this means the data becomes less and less useful with each redaction. That’s not to mention the fact that data which companies think is unidentifiable can be, and has been, traced back to individual customers.

A study from 2015 found that researchers were able to correctly identify 80 per cent of banking customers from just three random transactions. A follow-up project put this at 70 per cent from two transactions.
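The idea behind such re-identification results can be sketched in a few lines. The example below is not the method of the study cited; it assumes a hypothetical ‘anonymised’ table in which each customer is reduced to a set of (shop, day) pairs, then asks how often a handful of a person's own transactions match nobody else. The function names and the toy data generator are invented for illustration.

```python
import random


def matching_customers(history, known_points):
    """Return the ids of customers whose history contains every known point."""
    return [cid for cid, points in history.items() if known_points <= points]


def unicity(history, k, trials=500, seed=0):
    """Estimate the share of customers pinned down uniquely by k of their own transactions."""
    rng = random.Random(seed)
    ids = sorted(history)
    unique = 0
    for _ in range(trials):
        target = rng.choice(ids)
        # An attacker is assumed to know k random transactions of the target.
        known = set(rng.sample(sorted(history[target]), k))
        if matching_customers(history, known) == [target]:
            unique += 1
    return unique / trials


def toy_history(customers=200, shops=15, days=30, per_customer=20, seed=1):
    """Hypothetical data with names removed: customer id -> set of (shop, day) pairs."""
    rng = random.Random(seed)
    return {
        cid: {(rng.randrange(shops), rng.randrange(days)) for _ in range(per_customer)}
        for cid in range(customers)
    }
```

Running `unicity(toy_history(), k)` for `k` of 1, 2 and 3 shows how the share of uniquely identifiable customers changes as the attacker learns more transactions. The figures depend entirely on the made-up data, so they illustrate the mechanism rather than reproduce any published percentage.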

Synthetic data promises to solve these issues in one fell swoop. It takes the original dataset, creates a fake yet highly accurate replica without identifying information, and makes it possible for that synthetic set to be shared with any and all partners. “Synthetic data maintains all the valuable information and predictive power but doesn’t have any personally identifiable information,” continues Lucini. “This means companies are able to unlock the power of personal data and expand and augment these datasets, enabling them to train machine learning models where previously there was not enough data to do so.”
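As a rough illustration of what ‘a fake yet highly accurate replica’ means, the sketch below learns only summary statistics from a small table and then draws entirely new rows from them. It is a simple statistical stand-in for the neural generators described in this article, and on its own it offers no formal privacy guarantee. The two columns (monthly income and card spend) and every function name are assumptions made for the example.

```python
import math
import random


def fit(rows):
    """Learn means, standard deviations and correlation of two numeric columns."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in rows) / n)
    sy = math.sqrt(sum((y - my) ** 2 for _, y in rows) / n)
    r = sum((x - mx) * (y - my) for x, y in rows) / (n * sx * sy)
    return mx, my, sx, sy, r


def sample(model, n, seed=0):
    """Draw new rows from the fitted bivariate Gaussian; no original row is copied."""
    mx, my, sx, sy, r = model
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * u
        # Mixing u and v in this proportion reproduces the learned correlation r.
        y = my + sy * (r * u + math.sqrt(1 - r * r) * v)
        out.append((x, y))
    return out


def toy_customers(n=1000, seed=42):
    """Hypothetical source table: (monthly income, monthly card spend)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        income = rng.gauss(3000, 800)
        spend = 0.3 * income + rng.gauss(0, 150)
        rows.append((income, spend))
    return rows
```

Fitting the toy table with `fit(toy_customers())` and passing the result to `sample` yields rows whose income-to-spend relationship closely tracks the original, even though none of them corresponds to a real customer.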

Synthetic data generation techniques also mean such sets can be used at scale. The rules uncovered can be applied to wider, or more specific use cases. “If you’re trying to build a fraud detection algorithm, often examples of fraud make up 0.1 per cent of the total transactions,” explains Harry Keen, CEO and co-founder of synthetic data start-up Hazy. “It’s really difficult to build a detection algorithm on so few examples, but you can synthetically recreate fraud to the point where the total data includes, say, 20 per cent fraudulent transactions. Suddenly the detection algorithm is more capable of understanding what fraud looks like.”
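The arithmetic in Keen's example is easy to check. The sketch below assumes a hypothetical set of 100,000 transactions, of which 100 (0.1 per cent) are fraudulent, and works out how many synthetic fraud records are needed to reach 20 per cent. The `interpolate` helper is a simplified way of inventing new records by blending pairs of real ones, loosely in the spirit of oversampling techniques such as SMOTE; it is not a description of Hazy's method.

```python
import random


def synthetic_needed(total, minority, target_share):
    """Rows to add so the minority makes up target_share of the enlarged set."""
    # Solve (minority + s) / (total + s) = target_share for s.
    return round((target_share * total - minority) / (1 - target_share))


def interpolate(minority_rows, n, seed=0):
    """Create n new rows, each a random point on the line between two real minority rows."""
    rng = random.Random(seed)
    new_rows = []
    for _ in range(n):
        p, q = rng.sample(minority_rows, 2)
        t = rng.random()
        new_rows.append(tuple(pi + t * (qi - pi) for pi, qi in zip(p, q)))
    return new_rows
```

With these assumed figures, `synthetic_needed(100_000, 100, 0.20)` comes to 24,875 extra records: 24,975 fraudulent transactions out of 124,875 in total is exactly 20 per cent.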

Beyond financial transactions, synthetic data is vital for healthcare. When the coronavirus took hold, global taskforces were set up to share knowledge. At the heart of these discussions was highly sensitive patient data. Data that was evolving rapidly. Data that couldn’t be shared without redactions.

One of the leading bodies using synthetic data as part of the Covid effort is the Clinical Practice Research Datalink (CPRD), a UK Department of Health research service, funded by the Medicines and Healthcare products Regulatory Agency (MHRA). Prior to Covid, the CPRD’s synthetic data sets studied diabetes, and validated existing algorithms. Since Covid, these sets have served as a sample that can provide insights into what information may be available in electronic healthcare records. Yet the potential is more far-reaching. The data could be used to train and validate clinical decision-making algorithms. It can be adapted, refreshed and scaled as more data becomes available, or the data evolves rapidly. It can be used for modelling Covid-19 risk factors, disease progression and outcomes.

For all the benefits of AI, the programs are only ever as good and fair as the data they are fed. In recent years, AI has been criticised for reinforcing biases. An Amazon program, for instance, that used AI to filter job applicants discriminated against women because it was trained on CVs from a predominantly male workforce. An algorithm used in US court systems to predict the likelihood a person would reoffend was found to produce twice as many false positives for black defendants as for white ones.

The nature of synthetic data generation means that if bias is present in the original data, it will be reproduced in the synthetic set. However, because of the way different synthetic data elements can be scaled up, or down, the technology can alternatively be used to remove or alter this bias. If a data set contains 80 per cent men and 20 per cent women, the generation network could be adapted to increase the examples of female records. “We could also over-engineer this. A hiring algorithm, for example, could suggest more female candidates to ensure we end up with 50 per cent female and 50 per cent male,” continues Ebert.
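Both halves of that argument can be shown with a toy generator. The sketch below assumes records of the form (group, salary), learns each group's share and salary distribution, and then lets the caller override the mix when generating. The model, the field names and the `shares` parameter are all hypothetical simplifications of what a real generation network would do.

```python
import random
import statistics


def fit_by_group(records):
    """Learn each group's share, plus the mean and stdev of the value within it."""
    groups = {}
    for group, value in records:
        groups.setdefault(group, []).append(value)
    return {
        g: (len(v) / len(records), statistics.mean(v), statistics.stdev(v))
        for g, v in groups.items()
    }


def generate(model, n, shares=None, seed=0):
    """Sample n records; `shares` overrides the learned group mix (e.g. 50/50)."""
    rng = random.Random(seed)
    shares = shares or {g: share for g, (share, _, _) in model.items()}
    out = []
    for g, share in shares.items():
        _, mean, stdev = model[g]
        out += [(g, rng.gauss(mean, stdev)) for _ in range(round(n * share))]
    return out
```

Left alone, `generate(model, 1000)` reproduces an 80/20 split if that is what the source data contained. Passing `shares={"M": 0.5, "F": 0.5}` balances representation, but any gap in the values within each group is carried over untouched, which is why rebalancing is not the same as removing bias.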

Using synthetic data to balance bias is particularly useful in unstructured data, as Dr Maya Dillon, the VP of growth and innovation at Corsight AI, explains: “The bias within facial recognition, one that more accurately recognises men than women, or certain ethnicities and ages, is a direct consequence of bias in the training data. With synthetic data, it is possible to generate equally distributed datasets and significantly reduce, or perhaps even wholly mitigate, such biases.” Corsight AI has been trained on synthetic data to recognise faces even when people are wearing masks.

Yet it isn’t as black and white – literally and figuratively – as it seems. This only works when the bias is spotted. Humans are largely the cause of adding bias to datasets, so they’re rarely best placed to detect it. “Bias can be retrieved anywhere, in text or in speech,” says Dr Lea El Samarji, the Europe AI & IOT director at Avanade. “It’s related to humans and not to technology. We are developing the AI models, so if our thinking is biased and the dataset is biased, then the model will be.”

Secondly, as Keen continues: “If you made a completely unbiased data set it may not have anything interesting to tell you.” You’d effectively be removing the nuances that make that data unique, useful or valuable. You could also add in new biases while trying to correct the historic ones.

Thirdly, and crucially, there may be times when the bias needs to remain for the output to be representative. A recent Uber study uncovered that female drivers were paid 7 per cent less than male drivers, despite the algorithm being ‘blind’ to gender.

“If you were to use this information to build a predictive model, one that could be used to let a driver know how much they’d earn, it would be important to keep the bias in. Otherwise, a female would get too optimistic a prediction,” adds Ebert. “In most cases, it is desirable to get the bias out of data, but this is a really specific example where it might not be helpful.”

It’s such a bone of contention that Maciej Zasada, technical director at UNIT9, believes removing bias from synthetic data is an “impossible task”. “A more realistic expectation would be more and new methods of generating synthetic data that result in higher quality or greater anonymity.”

Regardless of where you stand, synthetic data is approaching the mainstream. Hazy has partnered with Halifax and Accenture. Mostly AI has helped save a Fortune 100 bank $10m (£7.2m) a year on validation costs. CPRD is exploring ways to further use synthetic data to help fight Covid and other diseases. Ensor believes if we can use synthetic data to create consistent general time series data it could help us better predict the future.

“Synthetic data offers an exciting opportunity to get unlimited data, under any required distribution of features,” concludes Dillon. “[It] has allowed us to reduce bias within algorithms, improve models and increase customer confidence. If a person understands that algorithms have been trained to eradicate bias, and their data will be analysed fairly, the trust in this technology will increase.” And with that, its use cases will continue to grow, to the benefit of everyone.


Synthetic cities for self-driving cars

Beyond structured and unstructured datasets, a third use for synthetic data is in what’s known as recorded simulated processes.

Take self-driving cars. It’s difficult, time-consuming, and costly to train driving algorithms on all possible scenarios on real roads. A car could drive for a year and not encounter a child kicking a ball into the road, for example. Google’s self-driving car programme, Waymo, has developed a test environment where it models a dense urban environment – a synthetic city, if you will.

Similarly, it’s not possible for governments to observe all the ways coronavirus spreads through a population, so they can generate synthetic models based on real-world data to plot movement. “The data can be recorded traditionally, but the events taking place in [the model] environment are simulated to resemble an actual city,” Maciej Zasada, technical director at UNIT9, tells E&T. “Synthetic data is usually cheaper to produce than real-life observations, so can provide similar value quicker, cheaper, in bigger volumes and with anonymity.”

It’s so useful that analysts predict 10 per cent of governments will use a synthetic population with realistic behaviour patterns by 2025 to train AI.
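A synthetic population can be surprisingly simple to sketch. The toy simulation below invents a town of 1,000 people, seeds one infection and records a daily log of who is susceptible, infectious or recovered. Every parameter (contacts per day, transmission chance, days infectious) is an assumption chosen for illustration, not an epidemiological estimate, and real government models are far richer.

```python
import random


def simulate_outbreak(people=1000, days=60, contacts=8, p_transmit=0.04,
                      days_infectious=7, seed=0):
    """Return a daily log of (susceptible, infectious, recovered) counts."""
    rng = random.Random(seed)
    # State per person: 0 = susceptible, -1 = recovered,
    # positive = number of infectious days remaining.
    remaining = [0] * people
    remaining[0] = days_infectious  # one seeded case
    log = []
    for _ in range(days):
        newly = set()
        for r in remaining:
            if r > 0:
                # Each infectious person meets a few random people today.
                for j in rng.sample(range(people), contacts):
                    if remaining[j] == 0 and rng.random() < p_transmit:
                        newly.add(j)
        # Existing cases move one day closer to recovery.
        for i, r in enumerate(remaining):
            if r > 0:
                remaining[i] = r - 1 if r > 1 else -1
        for j in newly:
            remaining[j] = days_infectious
        s = sum(1 for r in remaining if r == 0)
        inf = sum(1 for r in remaining if r > 0)
        log.append((s, inf, people - s - inf))
    return log
```

The log returned by `simulate_outbreak()` is itself a synthetic dataset: the recording is ordinary, but the events it records never happened to anyone.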

