
Beyond IPF: Generative Modeling of Synthetic Populations with Variational Autoencoders
Synthetic populations are a key component of many transportation and urban analysis frameworks: they provide disaggregated representations of individuals or households that feed traffic simulators and exposure models. Beyond mobility studies, they are increasingly used to assess how sensitive a territory is to external factors such as environmental nuisances or construction noise. Traditionally, synthetic populations are generated with calibration-based methods such as Iterative Proportional Fitting (IPF), which reweight a micro-sample to match aggregated census constraints. While robust and interpretable, these approaches scale poorly to high-dimensional settings and can only reproduce individuals that are already present in the initial sample.
Recent advances in machine learning open new avenues for synthetic population generation. In particular, Variational Autoencoders (VAEs) have demonstrated strong capabilities in learning complex joint distributions and generating realistic synthetic data in domains such as image and text generation. Applied to population synthesis, VAEs can model rich, multidimensional dependency structures between socio-demographic attributes and generate more diverse populations, including attribute combinations absent from the training sample. However, VAEs alone do not enforce consistency with known marginal distributions derived from official statistics, which remains a key requirement in applied territorial studies.
This contribution presents a hybrid methodology that combines the generative power of VAEs with the statistical guarantees of IPF. First, a VAE is trained on a microdata sample to learn a low-dimensional latent representation of individuals and to generate synthetic agents that capture complex correlations between attributes. Second, IPF is applied as a post-processing step to adjust the generated population so that selected marginal distributions strictly match external constraints. This decoupled strategy leverages the strengths of both approaches: the flexibility and scalability of VAEs in high-dimensional spaces, and the ability of IPF to enforce consistency with official statistics.
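To make the calibration step concrete, the following is a minimal sketch of IPF used as a post-processing layer. It is not the implementation presented in this contribution: the function names `ipf_reweight` and `weighted_totals`, the attribute names, the toy agents (standing in for decoded VAE samples), and the target counts are all hypothetical, and the sketch assumes that calibration is done by reweighting agents against one-dimensional marginals.

```python
from collections import defaultdict


def weighted_totals(agents, weights, attr):
    """Sum the agent weights per category of one attribute."""
    totals = defaultdict(float)
    for agent, w in zip(agents, weights):
        totals[agent[attr]] += w
    return totals


def ipf_reweight(agents, targets, max_iter=100, tol=1e-9):
    """Reweight agents so that weighted category totals match `targets`.

    agents  : list of dicts, one per synthetic agent
    targets : {attribute: {category: target_total}}
    Returns a list with one weight per agent.
    """
    weights = [1.0] * len(agents)
    for _ in range(max_iter):
        max_gap = 0.0
        for attr, target in targets.items():
            totals = weighted_totals(agents, weights, attr)
            for cat, goal in target.items():
                if goal > 0 and totals[cat] == 0.0:
                    raise ValueError(f"no agent available for {attr}={cat}")
                # Gap measured before this attribute is rescaled.
                max_gap = max(max_gap, abs(totals[cat] - goal))
            # One proportional-fitting step: scale each category to its target.
            factors = {
                cat: (target.get(cat, 0.0) / tot if tot > 0 else 0.0)
                for cat, tot in totals.items()
            }
            weights = [w * factors[a[attr]] for a, w in zip(agents, weights)]
        if max_gap < tol:
            break
    return weights


# Hypothetical synthetic agents, as if decoded from a trained VAE.
agents = [
    {"age": "young", "sex": "F"},
    {"age": "young", "sex": "M"},
    {"age": "old", "sex": "F"},
    {"age": "old", "sex": "M"},
    {"age": "old", "sex": "M"},
]

# Hypothetical census marginals; both attributes sum to the same total.
targets = {
    "age": {"young": 60.0, "old": 40.0},
    "sex": {"F": 55.0, "M": 45.0},
}

weights = ipf_reweight(agents, targets)
```

In this sketch the correlations between attributes come entirely from the generated agents, while IPF only rescales their weights. A category with a positive target but no generated agent cannot be fitted, which is why the function raises an error in that case.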
The presentation will detail the architecture of the VAE and the integration of IPF as a calibration layer. Results will illustrate how this approach improves the diversity, realism, and statistical coherence of synthetic populations compared to traditional methods.