Data Synthesis

BEAMM project: How do we deal with data? Statistical matching and WGAN generation.

Within the framework of the BEAMM project (BElgian Arithmetic Micro-simulation Model), we propose several methods to address data issues. The core of the project is to develop an online tax-benefit microsimulation model for Belgium, which requires intensive data handling. Our challenges are twofold: creating a unified data set that combines variables from different surveys, and developing a fully synthetic database to support the online development of the BEAMM platform.
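
Combining variables from different surveys is often implemented as nearest-neighbour hot-deck statistical matching on the covariates common to both sources. The sketch below is a generic illustration of that standard technique, with illustrative variable names, not the BEAMM implementation:

```python
import numpy as np

def nearest_neighbour_match(recipient, donor, common_cols, donate_col):
    """Hot-deck statistical matching: for each recipient record, copy the
    target variable from the closest donor on the shared covariates."""
    R = np.array([[r[c] for c in common_cols] for r in recipient], dtype=float)
    D = np.array([[d[c] for c in common_cols] for d in donor], dtype=float)
    # Standardise columns so no single covariate dominates the distance.
    mu, sd = D.mean(axis=0), D.std(axis=0) + 1e-9
    R, D = (R - mu) / sd, (D - mu) / sd
    matched = []
    for row in R:
        j = int(np.argmin(((D - row) ** 2).sum(axis=1)))  # Euclidean nearest donor
        matched.append(donor[j][donate_col])
    return matched

# Toy example: impute 'spending' from a donor survey onto a recipient survey.
donor = [{"age": 30, "income": 2000, "spending": 900},
         {"age": 60, "income": 3500, "spending": 1400}]
recipient = [{"age": 31, "income": 2100}, {"age": 58, "income": 3300}]
print(nearest_neighbour_match(recipient, donor, ["age", "income"], "spending"))
# → [900, 1400]
```

In practice the distance metric, matching classes, and treatment of weights matter a great deal; this only shows the core donor-selection step.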

Building a cross-border synthetic population for Luxembourg and neighbouring regions

Synthetic populations are essential for modeling complex systems that require individual-level data. However, they are typically limited to a single country. Creating a synthetic population across multiple countries is challenging because the data available from national statistical institutes are inconsistent: the variables available differ, and for shared variables, the categories may not align. Fortunately, Eurostat provides access to a large amount of aggregated socio-economic data that is consistent across EU countries. Although these data are less detailed than what can be obtained from national statistical institutes, they provide a solid basis for generating synthetic populations.
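
Fitting a synthetic population to aggregated marginal totals is commonly done with iterative proportional fitting (IPF). The sketch below illustrates that standard technique on a toy table; it is a generic illustration under assumed margins, not the method of this particular work:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Iterative proportional fitting: rescale a seed contingency table so its
    row and column sums match known aggregate marginals (e.g. from Eurostat)."""
    table = seed.astype(float).copy()
    for _ in range(iters):
        table *= (row_targets / table.sum(axis=1))[:, None]  # fit row sums
        table *= (col_targets / table.sum(axis=0))[None, :]  # fit column sums
        if np.allclose(table.sum(axis=1), row_targets, atol=tol):
            break
    return table

# Toy: 2 age groups x 2 employment statuses, known margins only.
seed = np.ones((2, 2))                      # uninformative starting table
fitted = ipf(seed, np.array([60., 40.]), np.array([70., 30.]))
print(fitted.round(1))   # rows sum to [60, 40], columns to [70, 30]
```

The fitted cell counts can then be sampled to draw individual-level records consistent with the published aggregates.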

Firm microsimulation and VAT policy analysis

I am a Research Associate at PolicyEngine, a nonprofit that provides free, open-source software to compute the impact of public policy in the US and UK. Previously, I served as a researcher at the London School of Economics. My work focuses on microsimulation, economic modelling, and public policy analysis, particularly the UK tax and benefit system.

Imputing lifetime incomes: Baseline projections for the UK

Most studies that report distributional comparisons of income focus on income evaluated over periods that vary between one week and one year. Distributional studies of weekly income recognise the importance of short-term constraints, particularly in relation to material deprivation and poverty. Distributional studies of annual income recognise the capacity of many people to save the proceeds of temporary income peaks to carry them through temporary income troughs. Income measured over longer periods is rarely analysed due to the relative (in)availability of survey data, rather than any more fundamental motivation. Unfortunately, analysis of lifetime incomes for contemporary population cross-sections is complicated in part by the limited historical context captured by existing panel studies, and in part because future incomes are unobservable. Microsimulation is one method to fill gaps in the available statistical record. This study describes how microsimulation methods were used to project lifetime incomes for a contemporary population cross-section of the UK.
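
Projecting lifetime income in a microsimulation means simulating each person's annual income forward, year by year, and aggregating over the life course. A highly stylised sketch, where the age-earnings growth rate and the transitory shock process are illustrative assumptions rather than the study's model:

```python
import random

def simulate_lifetime_income(start_age=25, retire_age=65, base=25_000,
                             growth=0.02, shock_sd=0.1, seed=0):
    """Project annual incomes from start_age to retirement and return their
    undiscounted sum: deterministic age-earnings growth plus a transitory shock."""
    rng = random.Random(seed)
    total, income = 0.0, float(base)
    for _ in range(start_age, retire_age):
        shock = rng.gauss(0.0, shock_sd)   # transitory income shock this year
        total += income * (1.0 + shock)
        income *= 1.0 + growth             # deterministic age-earnings growth
    return total

print(round(simulate_lifetime_income()))   # one simulated lifetime income
```

Running this over a cross-section of simulated individuals, with earnings equations estimated from panel data, yields the kind of lifetime income distribution the study analyses.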

Machine Learning Approaches to Predicting Consumption Expenditure: A Comparative Analysis for SILC–HBS Statistical Matching

This study examines whether supervised machine learning can improve the prediction of household expenditure shares within the standard statistical matching pipeline that fuses EU-SILC-type microdata with Household Budget Survey (HBS) expenditures. The conventional approach uses a transparent two-part econometric design: a probit model for participation (the extensive margin) and an OLS regression for conditional spending (the intensive margin). While robust, this framework is known to struggle in categories with pronounced zero inflation, nonlinear participation boundaries, heterogeneous spending patterns, or timing noise. We assess whether replacing the parametric steps with Gradient Boosted Trees (GBT) for participation and Gradient Boosted Regression (GBR) for conditional expenditure yields systematically better predictions without altering the downstream imputation workflow.

We combine the Swiss SILC 2020 as the recipient dataset and the Swiss HBS 2015–2017 as the donor survey. Because these samples share no identifiers, we harmonize variables following established Eurostat/JRC practices. Seventeen covariates present in both sources are aligned through recoding and aggregation, and we uprate nominal incomes and expenditures using the Harmonised Index of Consumer Prices (HICP) to ensure comparability with the SILC reference year. We apply EUROMOD-style categorical aggregation to mitigate incidental zeros, remove extreme expenditure-to-income ratios, and enforce a common structure for the predictors used in both stages of the model. This creates a coherent evaluation environment in which alternative prediction models can be compared fairly.

The imputation pipeline remains unchanged to ensure comparability with policy applications. First, we estimate participation for each aggregated COICOP category using the selected model (probit baseline or GBT alternative). Second, we model conditional expenditure given participation using OLS (baseline) or GBR (alternative). Third, we compute fitted shares and apply a pseudo-R² screen to restrict attention to categories where covariates meaningfully explain variation. All diagnostics and matching steps are identical across methods, so any downstream differences are attributable solely to the prediction component.

The design yields (i) cross-validated probability and error metrics for the extensive and intensive margins; (ii) threshold-sweep summaries that document operating-point sensitivity under class imbalance; and (iii) downstream compatibility with the standard donor selection step used in EUROMOD/SWISSMOD-type applications. Because the imputation workflow and diagnostics are held constant, the study isolates the contribution of flexible predictors relative to the classical probit–OLS baseline in a way that is transparent for policy use.
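
With the workflow held fixed, the GBT/GBR alternative amounts to swapping the estimators inside a standard two-part model. A minimal sketch using synthetic data and scikit-learn's gradient boosting estimators as stand-ins; the data-generating process and model settings are illustrative assumptions, not the paper's specification:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for harmonised SILC/HBS covariates and one COICOP category.
n = 2000
X = rng.normal(size=(n, 5))                                   # harmonised covariates
participates = (X[:, 0] + rng.normal(scale=0.5, size=n)) > 0  # extensive margin
spend = np.where(participates,
                 np.exp(1.0 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)),
                 0.0)                                          # zero if not participating

# Stage 1: participation model (GBT alternative to the probit baseline).
clf = GradientBoostingClassifier().fit(X, participates)

# Stage 2: conditional expenditure model fitted on participants only
# (GBR alternative to the OLS baseline; log outcome as in two-part models).
mask = participates
reg = GradientBoostingRegressor().fit(X[mask], np.log(spend[mask]))

# Combined two-part prediction: P(participate) * predicted spend given participation.
pred = clf.predict_proba(X)[:, 1] * np.exp(reg.predict(X))
print(pred.shape, float(pred.mean()))
```

Note that exponentiating a log-scale prediction understates the conditional mean; applied work typically adds a smearing or retransformation correction, omitted here for brevity.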
