Synthetic Health Data Governance Framework (SHDGF)

Appendix 7

De-identification techniques and privacy evaluation in synthetic data

Updated guidance for reducing linkability, understanding disclosure risks, and evaluating synthetic data privacy using a portfolio of evidence rather than a single score.


De-identification is not binary

The appendix treats de-identification as a risk-management exercise. Stronger privacy controls generally reduce utility, so the trade-off must be documented and justified.

Privacy risk is multi-dimensional

Identity, membership, and attribute disclosure can each arise in different ways. Evaluating only one of them is not enough.

Built for governance teams

Use this appendix when custodians, requestors, scientists, and reviewers are documenting Step 4 decisions and supporting a Re-identification Risk Assessment.

De-identification techniques

De-identification refers to technical and organisational approaches that reduce the likelihood that data can be associated with an identifiable person. Even without direct identifiers, data may still be personal information if it remains reasonably linkable.

Aggregation and suppression

Remove identifiers or overtly identifying fields, or reduce granularity so the data is less easily linkable to a person.
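A minimal sketch of suppression plus aggregation. The field names (`name`, `medicare_number`, `postcode`) and the two-digit postcode rule are illustrative assumptions, not part of the framework.

```python
# Suppress direct identifiers and coarsen a quasi-identifier.
# Field names and the truncation rule are illustrative only.

DIRECT_IDENTIFIERS = {"name", "medicare_number"}

def suppress_and_aggregate(record: dict) -> dict:
    # Drop fields that identify a person directly.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Reduce postcode granularity: keep only the first two digits.
    if "postcode" in out:
        out["postcode"] = out["postcode"][:2] + "**"
    return out

row = {"name": "A. Smith", "medicare_number": "1234", "postcode": "3056", "dx": "I10"}
print(suppress_and_aggregate(row))  # {'postcode': '30**', 'dx': 'I10'}
```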

Generalisation

Replace precise values with broader categories, such as substituting exact dates of birth with age bands.
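The age-band substitution mentioned above can be sketched in a few lines; the band width of 10 years is an arbitrary choice for illustration.

```python
def age_band(age: int, width: int = 10) -> str:
    """Generalise an exact age into a band, e.g. 37 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

print(age_band(37))            # 30-39
print(age_band(85, width=5))   # 85-89
```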

Pseudonymisation

Use cryptographically protected transformations such as keyed hashing with appropriate key management rather than plain hashing alone.
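A keyed-hashing sketch using HMAC-SHA256. Generating the key inline is for demonstration only; in practice the key would come from a key management service, since anyone holding it can recompute (and thus link or dictionary-attack) the pseudonyms.

```python
import hashlib
import hmac
import secrets

# Demo only: a real deployment loads this from a key management service
# and rotates it per policy.
KEY = secrets.token_bytes(32)

def pseudonymise(identifier: str, key: bytes = KEY) -> str:
    """Keyed hash (HMAC-SHA256) rather than a plain, unkeyed hash,
    which would be vulnerable to dictionary attacks."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Same input and key -> same pseudonym, so records stay linkable within a release.
assert pseudonymise("patient-0042") == pseudonymise("patient-0042")
# A different key yields an unlinkable pseudonym.
assert pseudonymise("patient-0042", secrets.token_bytes(32)) != pseudonymise("patient-0042")
```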

Perturbation

Introduce controlled change through noise addition, micro-aggregation, or data swapping to reduce disclosure risk.
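Noise addition, one of the perturbation techniques named above, can be sketched as follows; the Gaussian noise model and the scale value are illustrative assumptions, and the scale parameter is where the risk-utility trade-off is set.

```python
import random

def perturb(values, scale=1.0, seed=None):
    """Additive Gaussian noise: a larger scale lowers disclosure risk
    but also lowers utility, so the choice should be documented."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]

ages = [34.0, 58.0, 41.0, 29.0]
noisy = perturb(ages, scale=2.0, seed=0)
assert len(noisy) == len(ages) and noisy != ages
```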

Types of privacy risk

The appendix distinguishes among multiple disclosure risks. Real privacy assessment needs to account for all of them, not just direct re-identification.

Identity disclosure

A synthetic record can be confidently linked to a specific person. Direct identifiers should already be removed, but residual linkability can still matter.

Membership disclosure

An adversary can infer whether a specific person was included in the training dataset, which can itself be highly sensitive.

Attribute disclosure

An adversary can infer new sensitive information about an individual using synthetic data plus auxiliary knowledge they already hold.

Landscape of evaluation metrics

No single measure defines privacy safety. These methods should be read as lenses on risk, not bounded guarantees of privacy loss.

Categories of privacy metrics

Use multiple methods aligned to a realistic threat model.

Type | Category | Method | What it tells you
Non-adversarial | Re-identifiability | k-Anonymity | Checks whether each individual is indistinguishable from at least k - 1 other individuals against a set of quasi-identifiers.
Non-adversarial | Re-identifiability | l-Diversity | Extends k-anonymity by ensuring sensitive attributes within each anonymised group have at least l distinct values.
Non-adversarial | Re-identifiability | t-Closeness | Requires the distribution of a sensitive attribute in a group to remain close to the overall dataset distribution.
Non-adversarial | Memorisation and similarity | Hitting Rate (Common Row Proportion) | Measures the percentage of exact matching rows between the synthetic and source data.
Non-adversarial | Memorisation and similarity | Close Value Ratio | Assesses the probability of near matches using a distance threshold.
Non-adversarial | Memorisation and similarity | Similarity Ratio (epsilon-identifiability) | Tests whether fewer than an epsilon ratio of synthetic observations are similar enough to those in the original dataset.
Non-adversarial | Memorisation and similarity | Nearest Neighbour Accuracy | Evaluates proximity between source and synthetic distributions, but should be interpreted cautiously because similarity-based metrics can miss serious leakage.
Non-adversarial | Distinguishability | Data Likelihood | Measures the likelihood of synthetic data belonging to the source data distribution.
Non-adversarial | Distinguishability | Detection Rate | Measures how easily models can distinguish source data from synthetic data.
Adversarial | Singling out attacks | Singling Out Attack (Univariate) | Observes the uniqueness of a single attribute in the synthetic data.
Adversarial | Singling out attacks | Singling Out Attack (Multivariate) | Examines uniqueness across combinations of attributes.
Adversarial | Record linkage attacks | Public-Public Linkage | Uses the synthetic dataset to establish links between records found in two external datasets.
Adversarial | Record linkage attacks | Public-Synthetic Linkage | Links synthetic rows to an external dataset using matching criteria, creating a basis for inference attacks.
Adversarial | Attribute inference attacks | Exact Match AIA | Determines a missing target attribute by matching overlapping quasi-identifiers.
Adversarial | Attribute inference attacks | Closest Distance AIA | Infers a sensitive value using the nearest synthetic neighbour where k = 1.
Adversarial | Attribute inference attacks | Nearest Neighbours AIA | Uses the k nearest synthetic neighbours where k is greater than 1.
Adversarial | Attribute inference attacks | ML Inference AIA | Trains a predictive model on synthetic data to infer target attributes.
Adversarial | Membership inference attacks | Closest Distance MIA | Infers membership if a target record is more similar to synthetic data than to unrelated data.
Adversarial | Membership inference attacks | Nearest Neighbours MIA | Extends the closest-distance approach to proximity against multiple neighbours, but still inherits the limits of similarity-based methods.
Adversarial | Membership inference attacks | Probability Estimation MIA | Uses hypothesis testing to assess whether a target record belongs to the synthetic data distribution.
Adversarial | Membership inference attacks | MIA Shadow Model | Uses shadow models trained with and without the target record to classify membership.
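The k-anonymity check from the table above is straightforward to sketch: group records by their quasi-identifier values and take the smallest group size. Records as dicts and the field names are illustrative assumptions.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the chosen quasi-identifiers.
    The dataset is k-anonymous iff this value is >= k."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(classes.values())

data = [
    {"age_band": "30-39", "sex": "F", "dx": "I10"},
    {"age_band": "30-39", "sex": "F", "dx": "E11"},
    {"age_band": "40-49", "sex": "M", "dx": "I10"},
]
print(k_anonymity(data, ["age_band", "sex"]))  # 1: the 40-49/M record is unique
```

Note that the result depends entirely on which quasi-identifiers are chosen, which is why the threat model matters.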

Limitations of common metrics

Similarity-based metrics and average-case scores are useful for finding some problems, but they are poor proof of safety. Privacy is a worst-case question focused on whether any individual is exposed.

Similarity-based metrics

Measures such as nearest-neighbour similarity are intuitive, but research has shown they can miss serious privacy leakage and do not provide bounded privacy guarantees.

Average-case metrics like F1

Aggregate scores can hide a small group of highly vulnerable people. A high F1 score clearly signals a privacy failure, but a low score does not prove safety.
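The gap between average-case and worst-case readings can be made concrete with toy numbers; the attack-success values below are invented purely for illustration.

```python
# Hypothetical per-record attack success rates: 98 well-protected
# individuals and 2 highly exposed ones.
per_record_success = [0.02] * 98 + [0.99, 0.97]

mean_risk = sum(per_record_success) / len(per_record_success)
worst_case = max(per_record_success)

print(f"mean={mean_risk:.3f} worst={worst_case:.2f}")  # mean=0.039 worst=0.99
# The average looks safe; the maximum shows two people are badly exposed.
```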

Differential Privacy and auditing

Differential Privacy

Differential Privacy is a property of the generation process, not the output dataset. Pure epsilon-DP is the strictest form, while approximate (epsilon, delta)-DP allows a small failure probability. Smaller epsilon values give stronger protection.
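For a counting query (sensitivity 1), the classic pure epsilon-DP mechanism adds Laplace noise with scale 1/epsilon, which makes the "smaller epsilon, stronger protection" point concrete. This is a generic textbook sketch, not part of the framework, and samples the Laplace distribution as a difference of two exponentials.

```python
import random

def laplace_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Pure epsilon-DP release of a counting query (sensitivity 1):
    add Laplace(0, 1/epsilon) noise. If X, Y ~ Exp(epsilon) are independent,
    X - Y is Laplace-distributed with scale 1/epsilon."""
    return true_count + rng.expovariate(epsilon) - rng.expovariate(epsilon)

rng = random.Random(0)
released = laplace_count(120, epsilon=1.0, rng=rng)
print(round(released, 2))  # a noisy count near 120; smaller epsilon means more noise
```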

Audit claims empirically

Real-world implementations can fail because of design flaws, incorrect assumptions, or bugs. Empirical auditing is still required even when a generator claims formal privacy guarantees.

Canary-based auditing

Inject carefully constructed artificial records into training data, train the generator, then test whether those canaries are detectable or reconstructable in the output. Detectable canaries are concrete evidence of memorisation or leakage.
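The detection step of a canary audit can be sketched as below. The generator itself is out of scope here; the canary values and tuple layout are illustrative assumptions, and a real audit would also test near matches, not just exact reproduction.

```python
# An implausible marker record injected into the training data before generation.
CANARY = ("1900-01-01", "ZZ-CANARY-DX", "999999")

def canaries_leaked(synthetic_rows, canary=CANARY):
    """True if any synthetic row reproduces the injected canary exactly."""
    return any(tuple(row) == canary for row in synthetic_rows)

safe_output = [("1985-03-02", "I10", "3056"), ("1990-07-11", "E11", "2000")]
leaky_output = safe_output + [CANARY]

assert not canaries_leaked(safe_output)
assert canaries_leaked(leaky_output)  # a detectable canary is evidence of memorisation
```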

Practical considerations for privacy evaluation

The framework text emphasises context-aware evaluation and transparent reporting rather than a single mechanical checklist.

1. Base evaluations on realistic quasi-identifiers that reflect likely adversary knowledge.
2. Evaluate the entire dataset rather than only a pre-selected subset of records.
3. Assess both membership disclosure and attribute disclosure, not just one attack surface.
4. Empirically validate Differential Privacy claims, especially when the privacy budget is not close to zero.
5. Report results across multiple synthetic data generation runs and keep worst-case outcomes visible.
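Reporting across runs while keeping the worst case visible can be as simple as the sketch below; the per-run attack success rates are invented for illustration.

```python
# Hypothetical per-run attack success rates from five generation runs.
runs = [0.03, 0.04, 0.02, 0.31, 0.03]

summary = {
    "mean": sum(runs) / len(runs),
    "worst_case": max(runs),  # report this alongside the average, never instead of it
}
print(summary)  # mean ~0.086, worst_case 0.31: one run leaks far more than the rest
```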

Future directions and open challenges

  • Better empirical privacy metrics that capture worst-case rather than average-case risk.
  • More practical, automated, and reproducible privacy auditing tools.
  • Clearer interpretation of epsilon and delta in operational settings.
  • Better handling of cumulative privacy loss across repeated synthetic data releases.
  • Stronger methods for time-series, longitudinal data, free text, and other complex data types.

The appendix concludes that privacy evaluation is not optional. Responsible practice depends on realistic threat modelling, transparent assumptions, empirical auditing, and a portfolio of complementary evidence to understand and manage residual risk.
