
Evaluating Synthetic Data Quality for Regional Microsimulation: Comparing Model-Generated and Commercial Data Sources for Population Modelling
Background and Motivation: Regional microsimulation models face critical data challenges: survey data lack local representativeness, administrative data access is often restricted, and commercial datasets are costly. Increasingly, researchers turn to synthetic data generated through statistical models, but questions remain about their validity for policy analysis compared to established commercial alternatives like Experian.
This study evaluates synthetic population data against Experian commercial data for Essex microsimulation applications. The two datasets represent different forms of modelled data: Experian combines administrative records, commercial sources, and modelled estimates, while synthetic data is generated through statistical algorithms. Although both are typically validated against official statistics (ONS Census, household size, income benchmarks, age structure), validation against marginal distributions does not guarantee equivalence for policy analysis, where joint distributions and correlation structures are critical. Our existing work developed UKMOD-aligned weights for Essex by reweighting FRS survey data to match Experian's joint distribution of household characteristics. This provides a validated benchmark against which to assess synthetic data alternatives.
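The distinction between marginal and joint validation can be illustrated with a minimal sketch. The two toy populations, the category labels, and the helper names below are hypothetical and are not drawn from the Experian or synthetic datasets; they only show that two populations can agree on every marginal distribution while differing in their joint distribution.

```python
from collections import Counter

# Hypothetical toy populations: one (household_type, income_band) pair per household.
population_a = [("single", "low"), ("single", "high"),
                ("couple", "low"), ("couple", "high")]
population_b = [("single", "low"), ("single", "low"),
                ("couple", "high"), ("couple", "high")]


def marginal(population, index):
    """Share of households in each category of one characteristic."""
    counts = Counter(record[index] for record in population)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def joint(population):
    """Share of households in each combination of characteristics."""
    counts = Counter(population)
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def total_variation(dist_p, dist_q):
    """Total variation distance between two discrete distributions."""
    cells = set(dist_p) | set(dist_q)
    return 0.5 * sum(abs(dist_p.get(c, 0.0) - dist_q.get(c, 0.0)) for c in cells)


# Both marginals agree exactly across the two populations...
same_type_marginal = marginal(population_a, 0) == marginal(population_b, 0)
same_income_marginal = marginal(population_a, 1) == marginal(population_b, 1)

# ...yet the joint distributions are far apart.
joint_distance = total_variation(joint(population_a), joint(population_b))
```

In population_b every single household is in the low income band, whereas in population_a type and income are unrelated. A benefit reform targeted at low-income single households would therefore reach different numbers of households in each population, even though both pass a marginal check.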
Methodology: The comparison evaluates synthetic data against our established reweighted UKMOD variant by applying identical tax-benefit policy scenarios through UKMOD using each dataset. Policy simulation outputs, including income distribution changes, gains and losses by household type, and demographic impacts, are compared to assess whether synthetic data replicates the distributional patterns produced by our machine learning-based reweighted data. Because both datasets pass through the same simulation model and the same policy scenarios, differences in outputs can be attributed to the input data rather than to the model.
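One of the output comparisons described above, gains and losses by household type, could be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the record layout (weight, household type, baseline and reform income), the function names, and all figures are hypothetical, and in practice the records would come from the simulation outputs for each dataset.

```python
from collections import defaultdict


def mean_gain_by_type(households):
    """Weighted mean income change (reform minus baseline) per household type."""
    gain_sum = defaultdict(float)
    weight_sum = defaultdict(float)
    for hh in households:
        gain_sum[hh["type"]] += hh["weight"] * (hh["reform"] - hh["baseline"])
        weight_sum[hh["type"]] += hh["weight"]
    return {t: gain_sum[t] / weight_sum[t] for t in gain_sum}


def largest_gap(results_a, results_b):
    """Largest absolute difference in mean gain over types present in both results."""
    shared = results_a.keys() & results_b.keys()
    return max(abs(results_a[t] - results_b[t]) for t in shared)


# Hypothetical simulated outputs for the same reform run on each input dataset.
reweighted_output = [
    {"type": "single", "weight": 2.0, "baseline": 100.0, "reform": 110.0},
    {"type": "single", "weight": 1.0, "baseline": 200.0, "reform": 204.0},
    {"type": "couple", "weight": 1.0, "baseline": 300.0, "reform": 290.0},
]
synthetic_output = [
    {"type": "single", "weight": 1.0, "baseline": 100.0, "reform": 106.0},
    {"type": "couple", "weight": 2.0, "baseline": 300.0, "reform": 295.0},
    {"type": "couple", "weight": 2.0, "baseline": 400.0, "reform": 385.0},
]

reweighted_gains = mean_gain_by_type(reweighted_output)
synthetic_gains = mean_gain_by_type(synthetic_output)
gap = largest_gap(reweighted_gains, synthetic_gains)
```

A single summary such as the largest gap is only one possible criterion; the same pattern extends to income deciles or demographic groups by changing the grouping key.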
Expected Contribution: This research provides practical evidence on whether synthetic data produces policy analysis results comparable to those from established reweighting methodologies for regional microsimulation. The study addresses a fundamental question: can synthetic data adequately substitute for more resource-intensive data alignment approaches while maintaining analytical reliability? Findings will inform data strategy decisions for researchers and local authorities conducting sub-national distributional analysis, particularly where resource constraints, data access limitations, or timeliness considerations favour synthetic data solutions.