Appendix 7
De-identification techniques and privacy evaluation in synthetic data
Updated guidance for reducing linkability, understanding disclosure risks, and evaluating synthetic data privacy using a portfolio of evidence rather than a single score.
De-identification is not binary
The appendix treats de-identification as a risk-management exercise. Stronger privacy controls generally reduce utility, so the trade-off must be documented and justified.
Privacy risk is multi-dimensional
Identity, membership, and attribute disclosure can each arise in different ways. Evaluating only one of them is not enough.
Built for governance teams
Use this appendix when custodians, requestors, scientists, and reviewers are documenting Step 4 decisions and supporting a Re-identification Risk Assessment.
De-identification techniques
De-identification refers to technical and organisational approaches that reduce the likelihood that data can be associated with an identifiable person. Even without direct identifiers, data may still be personal information if it remains reasonably linkable.
Aggregation and suppression
Suppress identifiers and overtly identifying fields, or aggregate values to reduce granularity, so the data is less easily linkable to a person.
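As a minimal sketch, the pandas snippet below suppresses a direct identifier column and aggregates postcode to a coarser region. The column names and the level of aggregation are illustrative assumptions, not recommended settings.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],      # hypothetical direct identifier
    "postcode": ["2000", "2641"],          # hypothetical quasi-identifier
    "diagnosis": ["X", "Y"],
})

# Suppression: drop the direct identifier entirely.
df = df.drop(columns=["name"])

# Aggregation: coarsen postcode to a broader region (first digit only here).
df["region"] = df["postcode"].str[:1]
df = df.drop(columns=["postcode"])
```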
Generalisation
Replace precise values with broader categories, such as substituting exact dates of birth with age bands.
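A brief illustration of generalisation with pandas, replacing exact ages with bands. The bin edges and labels are illustrative choices only and should follow the granularity justified in the risk assessment.

```python
import pandas as pd

ages = pd.Series([23, 37, 41, 68])

# Generalisation: exact ages become age bands.
age_bands = pd.cut(
    ages,
    bins=[0, 30, 40, 50, 60, 120],
    labels=["<30", "30-39", "40-49", "50-59", "60+"],
)
```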
Pseudonymisation
Use cryptographically protected transformations such as keyed hashing with appropriate key management rather than plain hashing alone.
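The sketch below shows keyed hashing with HMAC-SHA256 from the Python standard library. In practice the key would be generated and held in a secure key-management service with rotation and access controls; that infrastructure is assumed rather than shown here.

```python
import hashlib
import hmac
import secrets

# Illustration only: a real key lives in a secrets manager or HSM,
# never inline in code.
key = secrets.token_bytes(32)

def pseudonymise(identifier: str) -> str:
    """Keyed hash (HMAC-SHA256). Unlike a plain hash, values cannot be
    recovered by a dictionary attack without access to the key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
```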
Perturbation
Introduce controlled change through noise addition, micro-aggregation, or data swapping to reduce disclosure risk.
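A minimal noise-addition sketch using NumPy. The Laplace scale parameter is an illustrative assumption and would need to be calibrated against the documented risk/utility trade-off.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
incomes = np.array([52_000.0, 61_500.0, 48_200.0])

# Perturbation by additive noise: larger scale means more protection
# and less utility. The scale of 500.0 is an illustrative choice.
noisy_incomes = incomes + rng.laplace(loc=0.0, scale=500.0, size=incomes.shape)
```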
Types of privacy risk
The appendix distinguishes among multiple disclosure risks. Real privacy assessment needs to account for all of them, not just direct re-identification.
Identity disclosure
A synthetic record can be confidently linked to a specific person. Direct identifiers should already be removed, but residual linkability can still matter.
Membership disclosure
An adversary can infer whether a specific person was included in the training dataset, which can itself be highly sensitive.
Attribute disclosure
An adversary can infer new sensitive information about an individual using synthetic data plus auxiliary knowledge they already hold.
Landscape of evaluation metrics
No single measure defines privacy safety. These methods should be read as lenses on risk, not as bounded guarantees of privacy loss.
Categories of privacy metrics
Use multiple methods aligned to a realistic threat model.
| Type | Category | Method | What it tells you |
|---|---|---|---|
| Non-adversarial | Re-identifiability | k-Anonymity | Checks whether each individual is indistinguishable from at least k - 1 other individuals with respect to a set of quasi-identifiers. |
| Non-adversarial | Re-identifiability | l-Diversity | Extends k-anonymity by ensuring sensitive attributes within each anonymised group have at least l distinct values. |
| Non-adversarial | Re-identifiability | t-Closeness | Requires the distribution of a sensitive attribute in a group to remain close to the overall dataset distribution. |
| Non-adversarial | Memorisation and similarity | Hitting Rate (Common Row Proportion) | Measures the percentage of exact matching rows between the synthetic and source data. |
| Non-adversarial | Memorisation and similarity | Close Value Ratio | Assesses the probability of near matches using a distance threshold. |
| Non-adversarial | Memorisation and similarity | Similarity Ratio (epsilon-identifiability) | Tests whether fewer than an epsilon proportion of synthetic observations are too similar to records in the original dataset. |
| Non-adversarial | Memorisation and similarity | Nearest Neighbour Accuracy | Evaluates proximity between source and synthetic distributions, but should be interpreted cautiously because similarity-based metrics can miss serious leakage. |
| Non-adversarial | Distinguishability | Data Likelihood | Measures the likelihood of synthetic data belonging to the source data distribution. |
| Non-adversarial | Distinguishability | Detection Rate | Measures how easily models can distinguish source data from synthetic data. |
| Adversarial | Singling out attacks | Singling Out Attack (Univariate) | Observes the uniqueness of a single attribute in the synthetic data. |
| Adversarial | Singling out attacks | Singling Out Attack (Multivariate) | Examines uniqueness across combinations of attributes. |
| Adversarial | Record linkage attacks | Public-Public Linkage | Uses the synthetic dataset to establish links between records found in two external datasets. |
| Adversarial | Record linkage attacks | Public-Synthetic Linkage | Links synthetic rows to an external dataset using matching criteria, creating a basis for inference attacks. |
| Adversarial | Attribute inference attacks | Exact Match AIA | Determines a missing target attribute by matching overlapping quasi-identifiers. |
| Adversarial | Attribute inference attacks | Closest Distance AIA | Infers a sensitive value using the nearest synthetic neighbour where k = 1. |
| Adversarial | Attribute inference attacks | Nearest Neighbours AIA | Uses the k nearest synthetic neighbours where k is greater than 1. |
| Adversarial | Attribute inference attacks | ML Inference AIA | Trains a predictive model on synthetic data to infer target attributes. |
| Adversarial | Membership inference attacks | Closest Distance MIA | Infers membership if a target record is more similar to synthetic data than to unrelated data. |
| Adversarial | Membership inference attacks | Nearest Neighbours MIA | Extends the closest-distance approach to proximity against multiple neighbours, but still inherits the limits of similarity-based methods. |
| Adversarial | Membership inference attacks | Probability Estimation MIA | Uses hypothesis testing to assess whether a target record belongs to the synthetic data distribution. |
| Adversarial | Membership inference attacks | MIA Shadow Model | Uses shadow models trained with and without the target record to classify membership. |
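To make two of the non-adversarial checks above concrete, the sketch below computes a Hitting Rate (Common Row Proportion) and a Close Value Ratio with pandas and NumPy. Exact definitions vary across tools, so the distance metric and the threshold here are assumptions that must be calibrated to the data.

```python
import numpy as np
import pandas as pd

def hitting_rate(source: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Common Row Proportion: share of synthetic rows that exactly
    match some source row (assumes aligned columns)."""
    matches = synthetic.merge(source.drop_duplicates(), how="inner")
    return len(matches) / len(synthetic)

def close_value_ratio(source: np.ndarray, synthetic: np.ndarray,
                      threshold: float) -> float:
    """Share of synthetic records whose nearest source record lies within
    `threshold` (Euclidean distance on 2-D numeric arrays). The threshold
    is an assumption to calibrate, not a universal setting."""
    dists = np.linalg.norm(synthetic[:, None, :] - source[None, :, :], axis=-1)
    return float((dists.min(axis=1) <= threshold).mean())
```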
Limitations of common metrics
Similarity-based metrics and average-case scores are useful for finding some problems, but they are poor proof of safety. Privacy is a worst-case question focused on whether any individual is exposed.
Similarity-based metrics
Measures such as nearest-neighbour similarity are intuitive, but research has shown they can miss serious privacy leakage and do not provide bounded privacy guarantees.
Average-case metrics like F1
Aggregate scores can hide a small group of highly vulnerable people. A high attack F1 score clearly signals a privacy failure, but a low score does not prove safety.
Differential Privacy and auditing
Differential Privacy
Differential Privacy is a property of the generation process, not the output dataset. Pure epsilon-DP is the strictest form, while approximate (epsilon, delta)-DP allows a small failure probability. Smaller epsilon values give stronger protection.
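A minimal sketch of the Laplace mechanism for a single counting query, which has sensitivity 1. This only illustrates how epsilon scales the noise for one release; it is not how a full DP synthetic data generator works, and repeated releases consume additional privacy budget.

```python
import numpy as np

def dp_count(values, predicate, epsilon: float, rng=None) -> float:
    """Release a count via the Laplace mechanism. A counting query has
    sensitivity 1, so noise with scale 1/epsilon satisfies pure
    epsilon-DP for this single release (illustrative, unaudited)."""
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon => larger noise => stronger protection, lower utility.
noisy = dp_count(range(100), lambda v: v >= 65, epsilon=0.5)
```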
Audit claims empirically
Real-world implementations can fail because of design flaws, incorrect assumptions, or bugs. Empirical auditing is still required even when a generator claims formal privacy guarantees.
Canary-based auditing
Inject carefully constructed artificial records into training data, train the generator, then test whether those canaries are detectable or reconstructable in the output. Detectable canaries are concrete evidence of memorisation or leakage.
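A simplified sketch of the canary workflow, assuming tabular data: construct implausible records, add them to the training data before running the generator, then measure how many reappear in the output. The column names and sentinel values are hypothetical.

```python
import pandas as pd

def make_canaries(n: int) -> pd.DataFrame:
    """Artificial records with deliberately implausible value combinations,
    so any reappearance in the output is very unlikely to be coincidence."""
    return pd.DataFrame({
        "age": [117] * n,  # out-of-range sentinel value
        "postcode": [f"CANARY{i:04d}" for i in range(n)],
    })

def canary_exposure(synthetic: pd.DataFrame, canaries: pd.DataFrame) -> float:
    """Fraction of canaries reproduced verbatim in the synthetic output;
    any non-zero value is concrete evidence of memorisation."""
    candidates = synthetic[canaries.columns].drop_duplicates()
    hits = canaries.merge(candidates, how="inner")
    return len(hits) / len(canaries)
```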
Practical considerations for privacy evaluation
The framework text emphasises context-aware evaluation and transparent reporting rather than a single mechanical checklist.
- Base evaluations on realistic quasi-identifiers that reflect likely adversary knowledge.
- Evaluate the entire dataset rather than only a pre-selected subset of records.
- Assess both membership disclosure and attribute disclosure, not just one attack surface.
- Empirically validate Differential Privacy claims, especially when the privacy budget is not close to zero.
- Report results across multiple synthetic data generation runs and keep worst-case outcomes visible, as in the sketch after this list.
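A minimal sketch of run-level reporting that keeps the worst case visible alongside the average. The per-run scores could be any attack success metric; the function name and shape of the summary are illustrative.

```python
import numpy as np

def summarise_attack_runs(per_run_success: list[float]) -> dict:
    """Report the worst observed outcome alongside the mean, so a single
    bad generation run is not hidden by averaging across runs."""
    scores = np.asarray(per_run_success)
    return {
        "n_runs": int(scores.size),
        "mean": float(scores.mean()),
        "worst_case": float(scores.max()),
    }
```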
Future directions and open challenges
- Better empirical privacy metrics that capture worst-case rather than average-case risk.
- More practical, automated, and reproducible privacy auditing tools.
- Clearer interpretation of epsilon and delta in operational settings.
- Better handling of cumulative privacy loss across repeated synthetic data releases.
- Stronger methods for time-series, longitudinal data, free text, and other complex data types.
The appendix concludes that privacy evaluation is not optional. Responsible practice depends on realistic threat modelling, transparent assumptions, empirical auditing, and a portfolio of complementary evidence to understand and manage residual risk.