From healthcare advances to improvements in customer service, anonymised data is vital to improving the world around us. But according to research published today, it is all too easy to reverse engineer such data to identify specific individuals.
In the study, published in the journal Nature Communications, data scientists from Imperial College London and UCLouvain found that machine learning could overcome standard anonymisation techniques to re-expose the sensitive personal data of almost all individuals, even when the datasets are incomplete.
This means that personal information that is anonymised before being sold for use in artificial intelligence (AI) projects, market research and beyond can be reverse engineered by the companies that purchase it, without them ever seeing the original data.
This also means that businesses can use anonymised data to build increasingly detailed personal profiles of individuals without their knowledge.
How anonymised data can be easily reverse engineered
The research is notable because it demonstrates for the first time just how easy it is to reverse engineer anonymised data.
The machine learning model developed by the data scientists allowed them to correctly re-identify 99.98% of Americans in any anonymised dataset using just 15 characteristics, such as age, gender and marital status.
The researchers also created a tool that allows users to check how easily their own data can be exposed using this method.
“While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog,” explained study first author Dr Luc Rocher, from UCLouvain.
This means that companies that already hold some data on an individual can use anonymised data to build an increasingly complex – and disturbingly detailed – profile of them, leveraging information they already hold about a person to unlock further details from datasets that are meant to be completely anonymous.
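The effect Rocher describes can be illustrated with a few lines of Python. This is only a toy sketch, not the researchers' machine learning model: the six records, the attribute order and the `unique_fraction` helper are all invented for illustration. It counts how many records become one of a kind as more characteristics are taken into account.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only:
# (age band, gender, city, birthday MM-DD, car, pets)
RECORDS = [
    ("30s", "M", "New York", "01-05", "red sports car", "one dog"),
    ("30s", "M", "New York", "03-12", "blue sedan", "none"),
    ("30s", "M", "New York", "07-21", "red sports car", "one cat"),
    ("30s", "F", "New York", "01-05", "blue sedan", "one dog"),
    ("40s", "M", "Boston", "11-30", "grey SUV", "none"),
    ("30s", "M", "New York", "03-12", "blue sedan", "two cats"),
]


def unique_fraction(records, n_attributes):
    """Share of records that match no other record on their first n_attributes values."""
    keys = [record[:n_attributes] for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


if __name__ == "__main__":
    for n in range(1, 7):
        share = unique_fraction(RECORDS, n)
        print(f"{n} attribute(s): {share:.0%} of records are unique")
```

In this invented sample, age band alone singles out one person in six, while all six attributes together single out everyone. Anyone who already knows those attributes about a person could pick their record out of the "anonymous" data.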
Greater anonymisation standards needed?
For the researchers, the findings show that current approaches to anonymised data are not fit for purpose – adding weight to growing concerns about the practice that have largely been rebuffed.
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete,” said senior author Dr Yves-Alexandre de Montjoye, from Imperial’s Department of Computing, and Data Science Institute.
“Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”
“We’re often assured that anonymisation will keep our personal information safe,” added co-author Dr Julien Hendrickx, from UCLouvain.
“Our paper shows that de-identification is nowhere near enough to protect the privacy of people’s data.”
The research, then, underscores the need for more rigorous legislation on the handling and anonymisation of personal data.
“It is essential for anonymisation standards to be robust and account for new threats like the one demonstrated in this paper,” said Hendrickx.
“The goal of anonymisation is so we can use data to benefit society,” added de Montjoye.
“This is extremely important but should not and does not have to happen at the expense of people’s privacy.”