
Generalizing Trimming Bounds for Endogenously Missing Outcome Data Using Random Forests

  • Writer: Ye Wang
  • Jul 10, 2021

Updated: Dec 2

Political Analysis, 2025, with Cyrus Samii and Junlong Aaron Zhou.


Why this problem matters

Many experiments and quasi-experiments suffer from endogenously missing outcomes: researchers observe results only for people who complete a key task, such as a survey, conversation, or interview, while dropout or nonresponse can itself be affected by the treatment. This creates what Slough (2023) calls the “phantom counterfactual” issue, where some units have undefined potential outcomes under some treatment conditions. In Santoro and Broockman’s (2022) study, which we revisit in our paper, subjects were asked to have a brief conversation with another participant. Some subjects who learned that their partner was from the opposing political party found the task unappealing and dropped out before producing any outcome data. As a result, the treated and control subjects who do provide outcomes come from fundamentally different populations, and the difference in their outcomes no longer identifies any meaningful causal quantity.


Existing approaches struggle in such settings. They typically rely on strong parametric assumptions to impute the missing outcomes. Such assumptions often fail in real-world applications and may generate misleading conclusions.


What our method contributes

We extend the trimming bounds (TB) approach proposed in Lee (2009), which bounds the treatment effect on “always-responders” (individuals who would remain in the study regardless of treatment assignment) under a monotonicity assumption: treatment can move dropout in only one direction, either never increasing it or never decreasing it. We first generalize this framework by allowing for conditional monotonicity, following Semenova (2023), so that the monotonicity assumption needs to hold only within covariate strata rather than globally.
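
To make the classical trimming idea concrete, here is a minimal Python sketch of Lee-style bounds without covariates. It is an illustration under stated assumptions, not the implementation used in the paper: the function name `lee_bounds` and the argument names `y`, `d`, `s` (outcome, treatment indicator, response indicator) are hypothetical, monotonicity is assumed to hold, and no standard errors are computed. The group with the higher response rate is trimmed from the top or from the bottom until its response rate matches the other group's.

```python
import numpy as np

def lee_bounds(y, d, s):
    """Trimming bounds for the effect on always-responders (illustrative).

    y: outcomes (values where s == 0 are ignored)
    d: 0/1 treatment indicator
    s: 0/1 indicator that the outcome is observed
    Assumes treatment moves response in one direction only.
    """
    y, d, s = np.asarray(y, float), np.asarray(d), np.asarray(s)
    y1 = np.sort(y[(d == 1) & (s == 1)])  # observed treated outcomes
    y0 = np.sort(y[(d == 0) & (s == 1)])  # observed control outcomes
    q1 = s[d == 1].mean()  # response rate under treatment
    q0 = s[d == 0].mean()  # response rate under control

    if q1 >= q0:
        # Excess responders are in the treated group: trim share p there.
        p = (q1 - q0) / q1
        k = int(np.floor(p * len(y1)))
        lower = y1[:len(y1) - k].mean() - y0.mean()  # drop the k largest
        upper = y1[k:].mean() - y0.mean()            # drop the k smallest
    else:
        # Excess responders are in the control group: trim share p there.
        p = (q0 - q1) / q0
        k = int(np.floor(p * len(y0)))
        lower = y1.mean() - y0[k:].mean()            # drop the k smallest
        upper = y1.mean() - y0[:len(y0) - k].mean()  # drop the k largest
    return float(lower), float(upper)

# Toy data: all 4 treated units respond, only 2 of 4 controls do.
y = [1, 2, 3, 4, 2, 3, 0, 0]
d = [1, 1, 1, 1, 0, 0, 0, 0]
s = [1, 1, 1, 1, 1, 1, 0, 0]
print(lee_bounds(y, d, s))  # (-1.0, 1.0)
```

In the toy data half of the treated responders must be trimmed, so the lower bound keeps the two smallest treated outcomes and the upper bound keeps the two largest.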


Under this assumption, we introduce the covariate-tightened trimming bounds (CTTB), using generalized random forests to incorporate high-dimensional pretreatment covariates when estimating the proportion of always-responders and the corresponding quantiles. This approach guarantees bounds that are no wider—and often dramatically narrower—than classical Lee bounds. Beyond population-level effects, our method also delivers conditional bounds, allowing researchers to study how treatment effects vary across covariate profiles. Simulations show that our method reduces mean squared error and maintains confidence interval coverage across sample sizes. We demonstrate its usefulness by reanalyzing several published studies, showing that researchers can extract substantially more information from data with endogenous missingness without relying on exclusion restrictions or parametric selection models.
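
The role of covariates can be illustrated with a deliberately simplified Python sketch. It replaces the generalized random forests described above with exact strata of a single discrete covariate, so it is a stand-in for the idea rather than the CTTB estimator itself. The names `stratum_bounds` and `stratified_bounds` and the argument `x` (stratum label) are hypothetical. Within each stratum the direction of trimming is chosen separately, which is what conditional monotonicity permits, and the stratum-level bounds are averaged with weights proportional to the estimated share of always-responders in each stratum.

```python
import numpy as np

def stratum_bounds(y, d, s):
    """Trimming bounds within one stratum, plus its always-responder share."""
    y1 = np.sort(y[(d == 1) & (s == 1)])  # observed treated outcomes
    y0 = np.sort(y[(d == 0) & (s == 1)])  # observed control outcomes
    q1, q0 = s[d == 1].mean(), s[d == 0].mean()
    if q1 >= q0:  # treatment raises response here: trim treated outcomes
        k = int(np.floor((q1 - q0) / q1 * len(y1)))
        lo = y1[:len(y1) - k].mean() - y0.mean()
        hi = y1[k:].mean() - y0.mean()
    else:         # treatment lowers response here: trim control outcomes
        k = int(np.floor((q0 - q1) / q0 * len(y0)))
        lo = y1.mean() - y0[k:].mean()
        hi = y1.mean() - y0[:len(y0) - k].mean()
    # Under monotonicity, the always-responder share is the smaller rate.
    return lo, hi, min(q0, q1)

def stratified_bounds(y, d, s, x):
    """Aggregate stratum-level bounds over always-responders (illustrative)."""
    y, d, s, x = (np.asarray(a) for a in (y, d, s, x))
    y = y.astype(float)
    lows, highs, weights = [], [], []
    for value in np.unique(x):
        m = x == value
        lo, hi, share = stratum_bounds(y[m], d[m], s[m])
        lows.append(lo)
        highs.append(hi)
        weights.append(m.mean() * share)  # P(X = x) * P(always-responder | x)
    w = np.array(weights) / np.sum(weights)
    return float(np.dot(w, lows)), float(np.dot(w, highs))

# Toy data: treatment raises response in stratum "a" and lowers it in "b".
y = [1, 2, 3, 4, 2, 3, 0, 0,   5, 7, 0, 0, 1, 2, 3, 4]
d = [1, 1, 1, 1, 0, 0, 0, 0,   1, 1, 1, 1, 0, 0, 0, 0]
s = [1, 1, 1, 1, 1, 1, 0, 0,   1, 1, 0, 0, 1, 1, 1, 1]
x = ["a"] * 8 + ["b"] * 8
print(stratified_bounds(y, d, s, x))
```

The sketch needs every stratum to contain responders in both arms, and it says nothing about inference. Handling many or continuous covariates, where exact strata are not feasible, is where a forest-based estimator comes in.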



Figure 2: Replication results from Santoro and Broockman (2022)
