Workshop

Creating and Utilising Synthetic Population Data: Examples, Innovations and Pitfalls

Through their ability to fill important data gaps, synthetic populations have become well-established resources in research spanning a wide range of disciplines aligned with population geography. By providing readily available data on key life domains and for entire populations, the role of synthetic populations is ever growing – for example within the context of modelling policy questions around public budgets, urban planning, climate mitigation, or health inequalities. Nevertheless, many approaches to creating synthetic populations have important limitations that affect their robustness and utility for policy and research. In this talk I will outline the creation and utility of synthetic population data, covering important innovations such as the nesting of household and individual level structures, validation, approaches to sharing datasets, and undertaking applied research based on these datasets.

Data Without Barriers: Synthetic Data as a Catalyst for Responsible Innovation

The ability to access and use high-quality data is becoming a key enabler, and bottleneck, for innovation across AI and digital systems. Yet privacy constraints, regulation, and data scarcity continue to limit what organizations and researchers can do. Synthetic data generation is increasingly emerging as a powerful ingredient for enabling responsible, inclusive, and scalable data-driven innovation.

Digital twins: challenges, pitfalls, and opportunities

Li and O’Donoghue (2013) emphasized that microsimulation covers two areas: the microsimulations per se, which address what-if questions, and synthetic data generation, which provides an important basis for performing them. Increasingly, methods such as data fusion of different surveys, prediction methods, and modern ML approaches are applied. However, modelling strategies need to be adjusted accordingly, in particular depending on whether the application is cross-sectional or longitudinal. Further, increasing attention is paid to the granularity of the modelling. All in all, little attention is paid to the accuracy of the generated data or to the assumptions and implicit decisions of the developers of microsimulation models. The presentation focuses on different aspects of synthetic data generation and so-called digital twins. Special attention will be paid to temporal and regional granularity as well as to unobserved heterogeneity in the simulations, including the uncertainties of the entire modelling process. Additionally, specific data situations and disclosure limitations will be addressed.

Enhanced data fusion and anonymization for microsimulation systems

The fusion and anonymization of multiple heterogeneous data sources remain major challenges in applied statistics. In this work, we consider the joint use of demographic and fiscal census data together with several sample surveys. The objective is to integrate these sources in order to obtain a coherent representation of the overall population and to enable the evaluation of policy changes, such as reforms of the fiscal system, while ensuring that the resulting data are fully synthetic and thus completely anonymized.

Generating Synthetic Populations for Transportation: A Variational Autoencoder Approach

Synthetic populations are commonly used in transportation analysis to feed traffic simulators. Recently, they have also been used to assess the sensitivity of a territory to factors such as construction noise. However, traditional methods such as Iterative Proportional Fitting (IPF) for generating synthetic populations, which are based on sampling and calibration to aggregated data, have limitations. Indeed, they can only generate individuals similar to those in the initial sample. Machine Learning and Statistical Learning methods, such as Variational Autoencoders (VAEs), offer a promising alternative. VAEs have already demonstrated their effectiveness in generating realistic images. We present here how to use VAEs to generate synthetic populations, allowing for more varied representations of a territory.
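
To make the limitation of IPF mentioned in this abstract concrete, the following is a minimal sketch of two-dimensional IPF using NumPy. It is not taken from the talk: the function name `ipf`, the seed table, and the target margins are all hypothetical and chosen only for illustration. The sketch alternately rescales the rows and columns of a seed cross-tabulation until its margins match the aggregate targets.

```python
import numpy as np


def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its margins match the given targets.

    Hypothetical helper for illustration; not the speakers' implementation.
    """
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)

    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target.
        row_sums = table.sum(axis=1, keepdims=True)
        table *= np.divide(row_targets[:, None], row_sums,
                           out=np.ones_like(row_sums), where=row_sums > 0)

        # Column step: scale each column so that it sums to its target.
        col_sums = table.sum(axis=0, keepdims=True)
        table *= np.divide(col_targets[None, :], col_sums,
                           out=np.ones_like(col_sums), where=col_sums > 0)

        # Columns now match exactly; stop once the rows match as well.
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table


# Illustrative seed sample (rows: two age groups, columns: two car-ownership
# classes). The combination (row 0, column 1) was never observed in the sample.
seed = [[10, 0],
        [5, 5]]
row_targets = [60, 40]   # assumed aggregate counts per age group
col_targets = [70, 30]   # assumed aggregate counts per ownership class

fitted = ipf(seed, row_targets, col_targets)
print(fitted.round(3))
```

Because every step only multiplies existing cells, a cell that is zero in the seed stays zero in the fitted table, whatever the targets are. This is the sense in which IPF only reproduces combinations already present in the initial sample.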