
What is synthetic data? Here’s what it means, and how companies can use it for cybersecurity

Thomas George, senior systems engineer at Vidoori, breaks down what the technology is, how it works and how it can be used to defend against data breaches.

Amid this year’s huge push for cybersecurity, everyone thinks they have the next big thing to protect companies, governments and the people they serve.

One of the new tools in the toolkit is synthetic data, a way to help with testing and citizen privacy. Loosely, synthetic data, which has many definitions and uses, is artificial data that artificial intelligence and other technologies generate to closely resemble real data, so that citizens' actual information stays private. But we’ll get into all that in a bit.

So, what exactly is synthetic data? How can it be used to keep our information safe? Thomas George, a senior systems engineer at Silver Spring, Maryland integration and testing firm Vidoori, broke it down for us:

What is synthetic data?

In a nutshell, synthetic data is data that’s been created by a person or computer, rather than gathered by monitoring and collecting real-world records. In practice, it most often means finding the patterns in existing data and recreating them with an algorithm or artificial intelligence.

This can take many different forms, depending on the dataset, but there are a few ways to think of it. George gave the example of addresses. A real database for a town might list the residents at each address, and the synthetic data will recreate similar records with similar features of age, race or other characteristics. If a neighborhood has a lot of families, for instance, the synthetic data will recreate that with similar household sizes but different names. In this case, the data closely matches the patterns of the real thing, but the people involved are protected.

“The idea is that you create the synthetic data to mimic the patterns and attributes of real data, but it’s not the real data itself,” George said.

In practice, synthetic data can be used for different things in many different sectors, but George thinks it works best for data from elections, finances or anything else that requires private data exchanges.

How it works

With the patterns in hand, all it would take to create the data would be code that reproduces how often each pattern occurs. Companies could even program their technology to use synthetic data automatically when creating other tools. At scale, though, smaller companies might struggle to find the bandwidth or funds for this, but there are other options. George noted that adding a synthetic data option to popular cloud-based tools like Amazon Web Services could offer an easy way to grow the method.
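As a rough illustration of code that "accounts for those frequencies," here is a minimal Python sketch. It is not George's or Vidoori's method. The records, the placeholder surnames and the function names are all invented for this example, and a production system would need far more careful modeling and privacy review.

```python
import random
from collections import Counter

# Hypothetical "real" records: (household_size, age_band). Invented for illustration.
REAL_RECORDS = [
    (4, "30-44"), (4, "30-44"), (5, "30-44"), (2, "65+"),
    (1, "18-29"), (4, "45-64"), (3, "30-44"), (2, "65+"),
]

# Placeholder name pool, so no real resident's name appears in the output.
FAKE_SURNAMES = ["Alder", "Birch", "Cedar", "Dogwood", "Elm"]


def learn_frequencies(records):
    """Count how often each (household_size, age_band) combination occurs."""
    return Counter(records)


def generate_synthetic(frequencies, n, seed=None):
    """Draw n synthetic households in proportion to the observed frequencies."""
    rng = random.Random(seed)
    patterns = list(frequencies.keys())
    weights = list(frequencies.values())
    drawn = rng.choices(patterns, weights=weights, k=n)
    return [
        {"surname": rng.choice(FAKE_SURNAMES), "household_size": size, "age_band": band}
        for size, band in drawn
    ]


synthetic = generate_synthetic(learn_frequencies(REAL_RECORDS), n=1000, seed=42)
```

The output keeps the town's overall mix of household sizes and age bands, while every name comes from the placeholder pool. Note that sampling whole combinations like this can still reveal rare real households, which is one reason the modeling step matters.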

It could even be as simple as plugging the patterns or demographic breakdown into an app, instead of creating the code, to render the data. This means it has the potential to be faster than traditional data collection, but George cautioned that there are shortcomings. Namely, the data and algorithms being used need to be airtight.

“It’s only as good as the model and the person,” George said. “If you haven’t explored your data thoroughly and you program it with the patterns that you know, then it’s only as good as you.”
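One way to act on that caution is to check how closely a synthetic dataset tracks the real one before relying on it. The sketch below is an assumption of ours, not something George described: it compares two lists of values using total variation distance, where 0 means the proportions match exactly and 1 means they do not overlap at all. The sample values are invented.

```python
from collections import Counter


def proportions(values):
    """Return the share of the dataset that each distinct value accounts for."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the summed absolute difference between the two sets of proportions."""
    p = proportions(real_values)
    q = proportions(synthetic_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical household sizes, for illustration only.
real_sizes = [4, 4, 5, 2, 1, 4, 3, 2]
faithful_sizes = real_sizes * 10   # same proportions as the real data
careless_sizes = [1] * 80          # a model that only "knows" one pattern

faithful_gap = total_variation_distance(real_sizes, faithful_sizes)
careless_gap = total_variation_distance(real_sizes, careless_sizes)
```

A large gap suggests the generator was programmed with only the patterns someone already knew about, which is the failure George warns of.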

So, where does cybersecurity come in?

There are a few ways synthetic data can be useful in cybersecurity. For one, if a system that uses synthetic data is breached, people's information stays protected because the exposed data isn’t real. The real data can be kept separate in a more protected space, and hackers would only have access to the synthetic data.

“You’ve protected your clients’ privacy by not ever exposing them in the first place,” George said.

George also thinks synthetic data is a great tool for testing in cybersecurity. Early testing in software and other technology might not have much historical data to train algorithms on. But using synthetic data means algorithms can be tested repeatedly on different types of data, including tests for cyber vulnerabilities to make sure technology is protected.

The other advantage in testing, George said, is that system components aren’t so reliant on one another when real data doesn’t need to be collected. If one part of a system goes down or is running late, you can simulate the results and continue with testing. This removes some of the dependencies on other systems in testing and allows for fast testing, according to George. In cybersecurity, better and faster tests (hopefully) mean more protection.
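To make the idea of simulating a missing component concrete, here is a small Python sketch under stated assumptions. The "risk score" service, the exception and every function name are hypothetical stand-ins, not part of any real system George mentioned. The point is only that a test run can continue on simulated results when an upstream piece is unavailable.

```python
import random


class ServiceUnavailable(Exception):
    """Raised when the (hypothetical) upstream component cannot be reached."""


def fetch_risk_score_live(account_id):
    """Stand-in for a call to an upstream component that is currently down."""
    raise ServiceUnavailable("upstream scoring service is offline")


def fetch_risk_score_simulated(account_id, rng):
    """Produce a plausible-looking result so downstream tests can proceed."""
    return {
        "account_id": account_id,
        "risk_score": round(rng.uniform(0.0, 1.0), 3),
        "simulated": True,  # label the record so it is never mistaken for real data
    }


def fetch_risk_score(account_id, rng):
    """Try the live component first and fall back to simulated output."""
    try:
        return fetch_risk_score_live(account_id)
    except ServiceUnavailable:
        return fetch_risk_score_simulated(account_id, rng)


def flag_high_risk(result, threshold=0.8):
    """The downstream logic under test: flag accounts at or above the threshold."""
    return result["risk_score"] >= threshold


rng = random.Random(7)
results = [fetch_risk_score(f"acct-{i}", rng) for i in range(5)]
flags = [flag_high_risk(r) for r in results]
```

Marking each simulated record explicitly is a design choice worth keeping, since it stops test output from leaking into reports as if it were real.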

“It’s an outstanding tool for testing in general,” George said.


Technically Media