
Building a cross-border synthetic population for Luxembourg and neighbouring regions
Synthetic populations are essential for modelling complex systems that require individual-level data. However, they are typically limited to a single country. Creating a synthetic population across multiple countries is challenging because the data available from national statistical institutes are inconsistent: the available variables differ, and for shared variables the categories may not align. Fortunately, Eurostat provides access to a large amount of aggregated socio-economic data that are consistent across EU countries. Although these data are less detailed than what can be obtained from national statistical institutes, they provide a solid basis for generating synthetic populations.
This work is part of MMUST+, an Interreg project developing a multimodal mobility model for the Luxembourg cross-border area that takes the synthetic population as input. We propose a multi-stage framework that combines iterative proportional fitting (IPF) (Deming and Stephan 1940) and stochastic synthetic reconstruction (Lenormand and Deffuant 2013) to generate synthetic populations that are statistically and structurally realistic.
The first stage consists of generating several entities: individuals, family nuclei, households, and dwellings. For each entity, we generate attributes covering socio-demographics, employment, education, household structure, dwelling information and spatial location. To generate these entities, we chose the IPF method for its efficiency and low algorithmic complexity. We ran a separate IPF instance for each entity using as many Eurostat marginals as possible. Since no survey covers all variables, we used a uniform seed with structural zeros for impossible attribute combinations; the multi-variable marginals nonetheless preserve most of the dependency information. In the future, if survey data or microdata from statistical institutes become available, these could be used as the IPF seed to improve accuracy. Finally, we applied truncate-replicate-sample (TRS) integerisation (Lovelace and Ballas 2013) to obtain integer counts.
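The first stage can be illustrated with a minimal sketch, assuming a two-dimensional table and invented marginals; the function names `ipf` and `trs`, the age and employment categories, and all totals are hypothetical and are not taken from the project. The seed is uniform except for one structural zero, and the fitted weights are then integerised by truncating and sampling the remaining units in proportion to the decimal remainders.

```python
import random


def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Fit a 2-D table to row and column totals; zeros in the seed stay zero."""
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Row step: rescale each row to its target total.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: rescale each column to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match up to rounding, so convergence is judged on rows.
        if max(abs(sum(r) - t) for r, t in zip(table, row_targets)) < tol:
            break
    return table


def trs(weights, rng):
    """Truncate-replicate-sample on a flat list of non-negative weights."""
    counts = [int(w) for w in weights]  # truncate to the integer part
    remainders = [w - c for w, c in zip(weights, counts)]
    missing = round(sum(weights)) - sum(counts)
    for _ in range(missing):
        # Sample a cell in proportion to its remainder, without replacement.
        k = rng.choices(range(len(weights)), weights=remainders, k=1)[0]
        counts[k] += 1
        remainders[k] = 0.0
    return counts


# Hypothetical marginals: rows are age groups (0-14, 15-64, 65+),
# columns are (employed, not employed).
seed = [[0.0, 1.0],   # structural zero: nobody aged 0-14 is employed
        [1.0, 1.0],
        [1.0, 1.0]]
fitted = ipf(seed, row_targets=[20, 60, 20], col_targets=[45, 55])
flat = [w for row in fitted for w in row]
counts = trs(flat, random.Random(0))
```

The integer counts sum to the same total as the fitted weights, and the cell excluded by the structural zero remains empty. A real run would use tables with more dimensions and the actual Eurostat marginals.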
In the second stage, dwelling attributes are assigned to households by probabilistically drawing dwellings for each household, with probabilities derived from IPF weights. If a dwelling and household are incompatible (different locations or mismatched number of occupants), the probability is set to zero.
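A minimal sketch of this draw is given below. It assumes that each dwelling record carries a location, a number of occupants and an IPF weight, and that dwellings are drawn with replacement; the field names, the function `draw_dwelling` and the example values are hypothetical.

```python
import random


def draw_dwelling(household, dwellings, rng):
    """Draw one dwelling for a household, or None if none is compatible."""
    probs = []
    for d in dwellings:
        compatible = (d["location"] == household["location"]
                      and d["occupants"] == household["size"])
        # Incompatible pairs get probability zero; others keep their weight.
        probs.append(d["weight"] if compatible else 0.0)
    if sum(probs) == 0:
        return None
    return rng.choices(dwellings, weights=probs, k=1)[0]


# Hypothetical dwelling records with illustrative IPF weights.
dwellings = [
    {"type": "flat", "location": "LU", "occupants": 2, "weight": 3.2},
    {"type": "house", "location": "LU", "occupants": 2, "weight": 1.1},
    {"type": "house", "location": "LU", "occupants": 4, "weight": 2.5},
    {"type": "flat", "location": "BE", "occupants": 2, "weight": 4.0},
]
rng = random.Random(1)
household = {"location": "LU", "size": 2}
chosen = draw_dwelling(household, dwellings, rng)
```

Only the first two records can be drawn for this household, with relative odds given by their weights.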
The third stage addresses the assignment of individuals to family nuclei and the grouping of isolated individuals and family nuclei into households. We implemented the stochastic sample-free synthetic reconstruction algorithm described by Lenormand and Deffuant (2013). The probabilities required by this method were computed from age-gap distributions derived from Eurostat and Human Fertility Database data, so that relationships between partners and between parents and children are realistic. Hard constraints (e.g. at most two parents per nucleus) were imposed by setting the corresponding probabilities to zero.
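The role of the age-gap probabilities and hard constraints can be sketched as follows. This is a simplified illustration of a single step, not the full algorithm of Lenormand and Deffuant (2013); the function `pick_nucleus`, the data layout and the age-gap probabilities are hypothetical.

```python
import random


def pick_nucleus(adult_age, nuclei, gap_pmf, rng):
    """Pick the index of a nucleus for a second parent, or None if impossible."""
    weights = []
    for nucleus in nuclei:
        parents = nucleus["parent_ages"]
        if len(parents) != 1:
            # Hard constraint: a nucleus that already has two parents gets
            # probability zero. Nuclei without a first parent are skipped too.
            weights.append(0.0)
        else:
            gap = adult_age - parents[0]
            # Age gaps outside the distribution also get probability zero.
            weights.append(gap_pmf.get(gap, 0.0))
    if sum(weights) == 0:
        return None
    return rng.choices(range(len(nuclei)), weights=weights, k=1)[0]


# Hypothetical probabilities of the age gap (in years) between partners.
gap_pmf = {-2: 0.10, -1: 0.15, 0: 0.20, 1: 0.20, 2: 0.20, 3: 0.15}
nuclei = [
    {"parent_ages": [34, 36]},  # full: cannot take another parent
    {"parent_ages": [33]},      # gap of 2 years
    {"parent_ages": [60]},      # gap of -25 years, outside the distribution
    {"parent_ages": [36]},      # gap of -1 year
]
rng = random.Random(2)
idx = pick_nucleus(35, nuclei, gap_pmf, rng)
```

Only the second and fourth nuclei can be selected here; the same pattern extends to parent-child age gaps.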
Preliminary validation demonstrates that the population reproduces key aggregate statistics, household structures, and family compositions across the cross-border region. While the current approach relies on aggregated data, future integration of survey microdata from national institutes could further improve accuracy.
Overall, this method offers a practical and flexible approach for generating synthetic populations. By combining IPF-based synthesis, TRS integerisation and stochastic synthetic reconstruction, it produces populations that are consistent with aggregate statistics and with household- and individual-level structures.