Appendix 1
About synthetic health data
Synthetic data is generated to resemble real health data in structure and statistical properties, without reproducing individual records. It can enable safer analysis, development, and testing when used within strong governance.
What it is
Data generated from a model trained on real records to preserve useful patterns while reducing direct privacy risk.
What it’s for
Education, software testing, interoperability checks, prototyping, and analytics workflows where record-perfect truth isn’t required.
What it isn’t
A drop-in replacement for clinical truth datasets—overfitting and leakage can still occur, and quality/privacy testing remains essential.
How synthetic health data is made
At a high level, three pieces come together: a source dataset, a generative model, and the resulting synthetic dataset. The goal is to emulate statistical properties without recreating real people.
1) Source dataset
Real records used to learn distributions and relationships. Even without direct identifiers, quasi-identifiers and rich event data can make people identifiable in context.
2) Generative model
Learns patterns from the source data. Model choice and training approach affect both utility (how realistic) and privacy (how much leakage risk).
3) Synthetic dataset
New records describing “synthetic individuals”. If the model simplifies correlations, privacy improves but some use cases become unsuitable.
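The three pieces above can be sketched with a deliberately tiny toy model. This is illustrative only: the field names, values, and the "count whole records, then resample" approach are hypothetical stand-ins for a real generative model, not a recommended method.

```python
import random

# Hypothetical toy source dataset (illustrative values, not real records).
source = [
    {"age_band": "20-29", "delivery": "vaginal"},
    {"age_band": "20-29", "delivery": "caesarean"},
    {"age_band": "30-39", "delivery": "vaginal"},
    {"age_band": "30-39", "delivery": "vaginal"},
]

def fit_joint(records):
    """Piece 2, the 'generative model': here, just count each
    whole-record combination (the empirical joint distribution)."""
    counts = {}
    for r in records:
        key = tuple(sorted(r.items()))
        counts[key] = counts.get(key, 0) + 1
    return counts

def sample(model, n, rng=random):
    """Piece 3: draw n synthetic records in proportion to the counts."""
    keys = list(model)
    weights = [model[k] for k in keys]
    return [dict(rng.choices(keys, weights=weights)[0]) for _ in range(n)]

model = fit_joint(source)
synthetic = sample(model, 10)
```

Note what this toy model gets wrong on purpose: memorising the full joint distribution of a small table simply re-emits source rows, which is exactly the leakage risk the appendix describes. A real generator must generalise beyond the records it saw.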
Privacy protection vs data utility
Preserving more detailed correlations can improve realism and usefulness, but may increase re‑identification risk—especially for rare combinations or unique trajectories. Trade-offs must be explicit and justified by the use case.
Example
If a model preserves the overall caesarean rate but not its correlation with sex, you may see implausible records (e.g., caesareans for male patients). That can be acceptable for interface training, but unsuitable for clinical hypothesis testing.
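The caesarean example can be made concrete with a short simulation. The rates below are invented for illustration; the point is only that sampling each attribute independently preserves both marginals while discarding their correlation, so implausible combinations appear.

```python
import random

rng = random.Random(0)

# Hypothetical marginal rates (illustrative numbers, not real statistics).
p_caesarean = 0.3   # overall caesarean rate the model preserves
p_male = 0.5        # sex distribution

# Sampling each attribute independently keeps both marginals
# but drops the sex/delivery correlation entirely.
synthetic = [
    {
        "sex": "male" if rng.random() < p_male else "female",
        "delivery": "caesarean" if rng.random() < p_caesarean else "other",
    }
    for _ in range(10_000)
]

implausible = sum(
    1 for r in synthetic
    if r["sex"] == "male" and r["delivery"] == "caesarean"
)
# Around p_male * p_caesarean of all records are male caesareans:
# harmless for interface training, wrong for clinical analysis.
```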
Governance is still required
Synthetic data can reduce risk, but it doesn’t remove legal obligations by default. Test re‑identification risk and, if it’s more than very low, treat outputs as personal information and apply a lawful pathway before use or sharing.
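One very crude screening check along these lines is to count synthetic records that are exact copies of source records. This sketch is hypothetical and deliberately minimal: a low copy rate does not mean low re-identification risk, since quasi-identifier matching and inference attacks remain, so a full risk assessment is still required.

```python
def exact_copy_rate(source, synthetic):
    """Fraction of synthetic records identical to some source record.
    A crude screening check only: a low rate does NOT establish low
    re-identification risk (quasi-identifier and inference attacks
    are not covered by this test)."""
    source_keys = {tuple(sorted(r.items())) for r in source}
    hits = sum(
        1 for r in synthetic
        if tuple(sorted(r.items())) in source_keys
    )
    return hits / len(synthetic) if synthetic else 0.0
```

For example, if half of a synthetic file matches source rows exactly, the outputs should clearly be treated as personal information pending a proper assessment.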
Where this fits in the Framework
This appendix provides plain-language context that underpins Steps 1–5, and complements the specialised guidance in Appendices 3, 7, 8, and 9.