Appendix 1
About synthetic health data
Synthetic data is generated to resemble real health data in structure and statistical properties, without reproducing individual records. It can enable safer analysis, development, and testing when used within strong governance.
What it is
Data generated from a model trained on real records to preserve useful patterns while reducing direct privacy risk.
What it’s for
Education, software testing, interoperability checks, prototyping, and analytics workflows where record-perfect truth isn’t required.
What it isn’t
A drop-in replacement for clinical truth datasets—overfitting and leakage can still occur, and quality/privacy testing remains essential.
How synthetic health data is made
At a high level, three pieces come together: a source dataset, a generative model, and the resulting synthetic dataset. The goal is to emulate statistical properties without recreating real people.
1) Source dataset
Real records used to learn distributions and relationships. Even without direct identifiers, quasi-identifiers and rich event data can make people identifiable in context.
2) Generative model
Learns patterns from the source data. Model choice and training approach affect both utility (how realistic) and privacy (how much leakage risk).
3) Synthetic dataset
New records describing “synthetic individuals”. If the model simplifies correlations, privacy improves but some use cases become unsuitable.
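The three pieces above can be sketched with a deliberately tiny toy model. This is illustrative only: the field names, values, and the "count whole records, then resample" approach are hypothetical stand-ins for a real generative model, not a recommended method.

```python
import random

# Hypothetical toy source dataset (illustrative values, not real records).
source = [
    {"age_band": "20-29", "delivery": "vaginal"},
    {"age_band": "20-29", "delivery": "caesarean"},
    {"age_band": "30-39", "delivery": "vaginal"},
    {"age_band": "30-39", "delivery": "vaginal"},
]

def fit_joint(records):
    """Piece 2, the 'generative model': here, just count each
    whole-record combination (the empirical joint distribution)."""
    counts = {}
    for r in records:
        key = tuple(sorted(r.items()))
        counts[key] = counts.get(key, 0) + 1
    return counts

def sample(model, n, rng=random):
    """Piece 3: draw n synthetic records in proportion to the counts."""
    keys = list(model)
    weights = [model[k] for k in keys]
    return [dict(rng.choices(keys, weights=weights)[0]) for _ in range(n)]

model = fit_joint(source)
synthetic = sample(model, 10)
```

Note what this toy model gets wrong on purpose: memorising the full joint distribution of a small table simply re-emits source rows, which is exactly the leakage risk the appendix describes. A real generator must generalise beyond the records it saw.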
Privacy protection vs data utility
Preserving more detailed correlations can improve realism and usefulness, but may increase re‑identification risk—especially for rare combinations or unique trajectories. Trade-offs must be explicit and justified by the use case.
Example
If a model preserves the overall caesarean rate but not its correlation with sex, you may see implausible records (e.g., caesareans for male patients). That can be acceptable for interface training, but unsuitable for clinical hypothesis testing.
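The caesarean example can be made concrete with a short simulation. The rates below are invented for illustration; the point is only that sampling each attribute independently preserves both marginals while discarding their correlation, so implausible combinations appear.

```python
import random

rng = random.Random(0)

# Hypothetical marginal rates (illustrative numbers, not real statistics).
p_caesarean = 0.3   # overall caesarean rate the model preserves
p_male = 0.5        # sex distribution

# Sampling each attribute independently keeps both marginals
# but drops the sex/delivery correlation entirely.
synthetic = [
    {
        "sex": "male" if rng.random() < p_male else "female",
        "delivery": "caesarean" if rng.random() < p_caesarean else "other",
    }
    for _ in range(10_000)
]

implausible = sum(
    1 for r in synthetic
    if r["sex"] == "male" and r["delivery"] == "caesarean"
)
# Around p_male * p_caesarean of all records are male caesareans:
# harmless for interface training, wrong for clinical analysis.
```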
Governance is still required
Synthetic data can reduce risk, but it doesn’t remove legal obligations by default. Test re‑identification risk and, if it’s more than very low, treat outputs as personal information and apply a lawful pathway before use or sharing.
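One very crude screening check along these lines is to count synthetic records that are exact copies of source records. This sketch is hypothetical and deliberately minimal: a low copy rate does not mean low re-identification risk, since quasi-identifier matching and inference attacks remain, so a full risk assessment is still required.

```python
def exact_copy_rate(source, synthetic):
    """Fraction of synthetic records identical to some source record.
    A crude screening check only: a low rate does NOT establish low
    re-identification risk (quasi-identifier and inference attacks
    are not covered by this test)."""
    source_keys = {tuple(sorted(r.items())) for r in source}
    hits = sum(
        1 for r in synthetic
        if tuple(sorted(r.items())) in source_keys
    )
    return hits / len(synthetic) if synthetic else 0.0
```

For example, if half of a synthetic file matches source rows exactly, the outputs should clearly be treated as personal information pending a proper assessment.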
Where this fits in the Framework
This appendix provides plain-language context that underpins Steps 1–5, and complements the specialised guidance in Appendices 3, 7, 8, and 9.