Simulating the Simulation: Evaluating Simulation Strategies in Causal Inference

Carlos Rodriguez Ameal (Ghent University) — “Simulating the Simulation: Evaluating Simulation Strategies in Causal Inference”
July 1, 2026, time TBC
Conference presentation

Because causal estimands are unobservable in reality, benchmarking causal effect estimators is inherently challenging. To address this, simulation studies are increasingly used to evaluate causal inference methods under realistic conditions such as small sample sizes, limited overlap, and complex confounding. Yet it remains unclear to what extent conclusions drawn from simulated settings extrapolate to real-world performance. Existing benchmarks rely on a wide range of outcome-generation strategies, from parametric structural models to flexible machine learning, without clear guidance on their reliability. This paper formalizes simulation-based benchmarking and introduces a framework for characterizing the types of bias that may arise from different generative strategies, focusing in particular on the Average Treatment Effect (ATE). We classify and compare several approaches proposed in the causal inference literature, including parametric structural models [1], machine-learning conditional mean–variance models [2], and adversarial generative models such as Wasserstein GANs [3]. We apply these modeling strategies to commonly used benchmark datasets in which we synthetically define an outcome-generating function, so that the true ATE is known and systematic evaluation is possible. Across Monte Carlo experiments, we compare estimator performance in terms of bias, variance, and mean squared error. We examine how closely these performance metrics, computed under the different simulation strategies, align with those obtained under the structural data-generating process, and whether commonly used generator diagnostics are predictive of estimator behavior. Our preliminary results indicate that the choice of simulation strategy can substantially alter conclusions about estimator reliability relative to the underlying data distribution. These findings underscore the importance of principled design and validation of simulation frameworks when benchmarking causal methods.
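
As a rough, hypothetical sketch of the kind of Monte Carlo benchmark described above (not the implementation used in this work), the snippet below defines a toy structural outcome-generating function with a known ATE, draws repeated samples, applies two simple ATE estimators (a naive difference in means and a linear regression adjustment), and summarizes bias, variance, and mean squared error across replications; all settings, names, and estimator choices are illustrative assumptions.

```python
# Hypothetical sketch of a Monte Carlo benchmark with a known ATE.
# The data-generating process, estimators, and settings are illustrative
# assumptions, not the procedure used in the paper.
import numpy as np

rng = np.random.default_rng(0)
TRUE_ATE = 1.0          # known by construction of the outcome function
N, N_REPS = 500, 1000   # sample size and number of Monte Carlo replications


def simulate(n):
    """Draw one dataset from a simple structural model with confounding."""
    x = rng.normal(size=n)                            # confounder
    p = 1 / (1 + np.exp(-x))                          # treatment depends on x
    t = rng.binomial(1, p)
    y = 0.5 * x + TRUE_ATE * t + rng.normal(size=n)   # outcome with known ATE
    return x, t, y


def diff_in_means(x, t, y):
    """Naive estimator that ignores confounding."""
    return y[t == 1].mean() - y[t == 0].mean()


def regression_adjustment(x, t, y):
    """Plug-in estimator: coefficient on t in the linear model y ~ 1 + x + t."""
    design = np.column_stack([np.ones_like(x), x, t])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[2]


estimators = {"diff_in_means": diff_in_means,
              "regression_adjustment": regression_adjustment}
estimates = {name: [] for name in estimators}

for _ in range(N_REPS):
    data = simulate(N)
    for name, est in estimators.items():
        estimates[name].append(est(*data))

for name, vals in estimates.items():
    vals = np.asarray(vals)
    bias = vals.mean() - TRUE_ATE
    var = vals.var(ddof=1)
    mse = np.mean((vals - TRUE_ATE) ** 2)
    print(f"{name:>22}: bias={bias:+.3f}  var={var:.3f}  mse={mse:.3f}")
```

Under this toy process, the difference in means should show noticeable confounding bias while the regression adjustment should be approximately unbiased; detecting this kind of contrast, and checking whether it persists across different generative strategies, is what the benchmarking framework is meant to probe.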

[1] T. Wendling, K. Jung, A. Callahan, A. Schuler, N. H. Shah, and B. Gallego, “Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases,” Stat. Med., vol. 37, no. 23, pp. 3309–3324, Oct. 2018, doi: 10.1002/sim.7820.
[2] A. Schuler, K. Jung, R. Tibshirani, T. Hastie, and N. Shah, “Synth-Validation: Selecting the Best Causal Inference Method for a Given Dataset,” arXiv preprint arXiv:1711.00083, Oct. 2017, doi: 10.48550/arXiv.1711.00083.
[3] S. Athey, G. W. Imbens, J. Metzger, and E. Munro, “Using Wasserstein Generative Adversarial Networks for the design of Monte Carlo simulations,” J. Econom., vol. 240, no. 2, p. 105076, Mar. 2024, doi: 10.1016/j.jeconom.2020.09.013.