Developing and Testing New Methods for Estimating Treatment Effectiveness in Observational Studies Using High-Dimension Data

Zhiqiang Tan; Tobias Gerhard; Baoluo Sun

doi:10.25302/11.2021.ME.151132740

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Developing and Testing New Methods for Estimating Treatment Effectiveness in Observational Studies Using High-Dimension Data

Zhiqiang Tan, PhD, Tobias Gerhard, PhD, and Baoluo Sun, PhD.

Author Information and Affiliations

Washington (DC): Patient-Centered Outcomes Research Institute (PCORI); 2021 Nov.

Structured Abstract

Background:

Appropriate causal inference methods are required for comparative effectiveness research to produce valid and relevant findings from observational data. Two prominent classes of such methods are based on unconfoundedness or instrumental variable (IV) assumptions. Although extensive research has been done, it remains highly challenging to estimate propensity scores (PSs) and regression functions and to perform subsequent inference about average treatment effects (ATEs). The conventional approach employs an iterative process of model building and fitting, depending on ad hoc modeling choices, where statistical uncertainty is difficult to quantify. Recently, various methods have been proposed that apply off-the-shelf machine learning algorithms but either ignore statistical inference or invoke strong smoothness assumptions to justify consistent estimation of regression functions and subsequent statistical inference about treatment effects.

Objectives:

The objective of our research is to develop and evaluate a new set of statistically rigorous, numerically tractable, and pragmatic methods for drawing inferences about ATEs using PSs or IVs, while fitting PS and regression models with a large number of regressors.

Methods:

We propose regularized calibrated estimation and model-assisted inferences about ATEs under the assumption of no unmeasured confounding or local ATEs under IV assumptions in high-dimensional settings. We derived numerical algorithms to implement the methods, conducted simulation studies to evaluate the methods, and investigated empirical applications of the proposed methods, compared with existing methods.

Results:

The proposed methods are shown to yield valid statistical inference (ie, CIs and hypothesis testing) about treatment parameters under weak technical conditions in high-dimensional settings. Simulation studies and empirical applications demonstrate the advantages of the proposed methods compared with related methods.

Conclusions:

We developed new statistical methods and theory using PSs or IVs for causal inference. Using the proposed methods, PS and regression models can be fitted with a possibly large number of regressors, including main effects and interactions of the covariates, and CIs, and hypothesis tests can be obtained about treatment effects in a numerically tractable and statistically principled manner. Our methods are implemented in the publicly released R package RCAL.

Limitations:

Currently used methods handle cross-sectional studies and assume that the treatment and instrument are binary. It is desirable to extend our methods to handle multivalued treatments and instruments and to analyze longitudinal and survival data.

Background

Causal Inference

PCORI was established with the mission to help patients, clinicians, and other health care stakeholders make better-informed health care decisions and improve health care delivery and outcomes. One of the most frequently sought types of information is whether a specific treatment will cause deterioration or improvement in the outcomes of interest compared with other treatments. To provide such information, it is important to conduct comparative effectiveness research (CER). However, a fundamental challenge in nonrandomized, observational studies is that it can be enormously difficult to disentangle the effects of the treatment from other risk factors, known as confounding variables (or covariates), which might influence the health outcomes and, at the same time, vary between patients who had the intervention treatment and those who had the comparator treatment. The subject of how to address confounding and draw inferences about treatment effects is broadly called causal inference, spanning multiple disciplines, including epidemiology, econometrics, and statistics.

As discussed in the PCORI Methodology Report (PCORI Methodology Committee, 2021), Section III (8), appropriate causal inference methods are required for CER to produce valid and relevant findings from observational data, where treatment choices are self-selected rather than randomly assigned. There is extensive literature on various analytical methods for causal inference (Morgan & Winship 2014; van der Laan & Robins 2003). The PCORI Methodology Report explicitly mentions 2 methods, propensity scores (PSs) (Rosenbaum & Rubin 1983) and instrumental variables (IVs) (Angrist et al. 1996; Hernan & Robins 2006), as being “relatively well-developed and increasingly used in CER.” The Methodology Report provides the following standards for the use of these 2 methods:

CI-5. Report the assumptions underlying the construction of propensity scores and the comparability of the resulting groups in terms of the balance of covariates and overlap.
CI-6. Assess the validity of the instrumental variable (ie, how the assumptions are met) and report the balance of covariates in the groups created by the instrumental variable.

The PS in CI-5 is defined as the conditional probability of receiving the treatment given covariates (Rosenbaum & Rubin 1983). For IV analysis (CI-6), the conditional probabilities of the instrument given covariates (called IV PSs) can also be used to achieve covariate balance between instrument groups (Tan 2006b, 2010b).

Underlying the preceding standards, however, are at least 2 nontrivial statistical tasks: first, estimating PSs (or IV PSs), and then using the estimated PSs to conduct inferences about treatment effects. In fact, the exact PSs in an observational study are unknown and need to be estimated from empirical data, depending on a PS model, often specified in the form of logistic regression, where the response variable is treatment status and the regressors may include main effects, nonlinear terms, or interactions of the covariates. How such PS models are built and fitted can have a substantial impact on subsequent estimation of treatment effects. In particular, inverse probability weighted (IPW) estimation using PSs may perform poorly even when the PS model appears to be nearly correct (Kang & Schafer 2007). Moreover, statistical uncertainty from model building and fitting for estimating PSs must be properly taken into account when conducting inferences (including CIs and hypothesis tests) about treatment effects.

To mitigate possible misspecification of PS models, researchers can use doubly robust estimation by combining both a PS model and an outcome regression (OR) model, where the response variable is the outcome of interest, and the regressors may include main effects, nonlinear terms, or interactions of the covariates, similar to the PS model. For example, various doubly robust estimators for the average treatment effect (ATE) have been derived in the form of an augmented IPW estimator (Robins et al. 1994; Tan 2006a, 2010a). In theory, a doubly robust estimator remains consistent (ie, unbiased in large samples) if either the PS model or the OR model is correctly specified. In practice, all models used are only approximations to some extent. Hence, the performance of doubly robust methods still depends on how PS and OR models are built and fitted.

Gaps in Existing Methods

Per the preceding discussion, applications of PS and IV methods for estimating treatment effects involve statistical modeling and estimation of PSs and OR functions from empirical data, in addition to the fact that certain structural assumptions are required for the identification of treatment effects at the population level. Moreover, statistical uncertainty from such modeling and estimation must be properly incorporated to obtain valid CIs and hypothesis tests about treatment effects. Below, we outline various limitations in existing methods.

From the perspective of model building for PSs and OR, existing methods suffer broadly from 3 types of limitations:

Some methods resort to simple models, for example, on OR PSs with main effects of the covariates only, and hence are susceptible to model misspecification.
Some methods employ an iterative process of model building and fitting (McCullagh & Nelder 1989; Rosenbaum & Rubin 1984), depending on ad hoc modeling choices about nonlinear and interaction terms of the covariates, where statistical uncertainty is difficult to quantify in subsequent inferences about treatment effects.
Some methods apply off-the-shelf machine learning algorithms, such as random forests or boosted machines (Hastie et al. 2009; McCaffrey et al. 2004), but either ignore statistical inference or invoke strong smoothness assumptions to justify consistent estimation and statistical inference (Chernozhukov et al. 2018).

These limitations are aggravated in high-dimensional settings, with either a large number of covariates (even with main effects only) or a large number of possible nonlinear and interaction terms for even a moderate number of covariates.

To perform model fitting for PSs and outcome regression, existing methods mainly use maximum likelihood or variants as in conventional predictive modeling, regardless of the roles of the fitted functions in the stage of estimating treatment effects. In particular, PS models are often fitted as logistic regression by maximum likelihood (Rosenbaum & Rubin 1984). However, as increasingly realized among practitioners, the purpose of estimating PSs is not primarily to predict treatment status, but to balance risk factors for the outcome across treatment groups to control for measured confounders (Weitzen et al. 2004; Westreich et al. 2011; Wyss et al. 2014). Hence, there is a disconnection between model fitting (ie, parameter estimation) and model evaluation: the parameters of logistic regression are estimated by the predictive criterion of maximum likelihood, but the adequacy of the fitted model is evaluated in terms of covariate balance achieved.

The preceding viewpoint about model fitting is in consonance with recent methodological developments on PSs, including a logistic calibration weighting method (Folsom 1991) and related methods (Kim & Haziza 2014; Vermeulen & Vansteelandt 2015), a calibrated likelihood method (Tan 2010a), and covariate-balancing PSs (Imai & Ratkovic 2014). All of these methods can be seen to address the issue of model fitting in classical, low-dimensional settings, in providing a different way from maximum likelihood to estimate PSs or, equivalently, the IPWs. However, the fundamental issue of model building (ie, how to build a PS model) remains open and challenging, especially in cases with a limited number of observations but a large number of regressors (including nonlinear and interaction terms of the covariates).

Specific Aims

The objective of our research is to develop and evaluate a new set of statistically rigorous, numerically tractable, and practically useful methods using PSs and IVs for estimating treatment effects. We build on a strong track record of working on theory, methods, and applications of PSs and IVs (eg, Tan 2006a b, 2010a b c; Winterstein et al. 2012; Huybrechts et al. 2013; Gerhard et al. 2014), and exploit related ideas in high-dimensional statistical theory and methods, including least absolutely shrinkage and selection operator (Lasso) regularized estimation (Tibshirani 1996) and debiased inferences (Zhang & Zhang 2014; van de Geer et al. 2014; Javanmard & Montanari 2014).

The specific aims of our research are as follows:

Aim 1. Develop new statistical theory and methods for estimating PSs and for drawing inference about treatment effects from observational data, with possibly a large number of covariates or regressors.
Aim 2. Develop new statistical theory and methods using IVs for drawing inferences about treatment effects from observational data, with possibly a large number of covariates or regressors.
Aim 3. Develop and disseminate user-friendly software, including accessible and transparent documentation, for implementation of the new methods.

Significance

The methodological significance of our research is to develop new methods that will substantially improve on existing methods in removing or alleviating the methodological limitations mentioned in the “Gaps in Existing Methods” section above.

Our approach allows flexible models, prespecified with a possibly large number of regressors, such as main effects and interactions of the covariates, and employs sparsity-inducing regularized estimation to facilitate model selection in a numerically tractable and statistically principled manner.
Our approach employs calibrated estimation, where the loss functions used in fitting PS and OR models are carefully chosen such that valid inferences can be obtained about treatment parameters while accommodating possible model misspecification. For fitting PS models, calibrated estimation directly induces covariate balancing.
Our approach is potentially applicable to a variety of causal inference tasks, including inferences about treatment effects using PSs under unconfoundedness or using IVs under the corresponding assumptions.

Regarding the applied significance, our research is expected to have substantial and wide-ranging impact on CER by facilitating practical implementation of PS and IV methods for causal inference, as mentioned in the PCORI Methodology Report. The methods and software developed can be widely applied in various CER studies to increase the transparency, validity, and accuracy of the estimates of treatment effects, in reducing the bias and variation associated with ad hoc model building or over-optimistic machine learning. As such, the proposed methods are critical in providing patients, clinicians, and other health care stakeholders with the information needed to make better-informed health care decisions and improve health care delivery and outcomes.

Changes in the Research Strategy

There were 2 main changes in the research strategy. First, the statistical approach described in the original research proposal involves combining boosting or additive learning (Schapire & Freund 2012; Buhlmann & Hothorn 2007; Friedman et al. 2009) and calibrated estimation (Folsom 1991; Tan 2010a). This approach deals with mainly estimation of PSs and IV PSs, but the issue of conducting valid inferences after PS estimation is not addressed.

During the research, we decided to pursue the approach of combining Lasso regularized estimation and calibrated estimation, which not only facilitates model selection in a numerically tractable manner when building and fitting PS and regression models but also makes it possible to derive model-assisted inferences (including CIs and hypothesis tests) about treatment effects in high-dimensional settings. In other words, we believe that the current methods developed in our research are statistically and numerically more satisfactory than those originally suggested in the application for funding.

Second, the original research proposal included empirical applications, using PSs or IVs, to the then-ongoing PCORI project, “Comparative Effectiveness of Adaptive Pharmacotherapy Strategies for Schizophrenia.” That study, which is now completed (Stroup et al. 2019), compared the effectiveness of 4 alternative medication strategies beyond antipsychotic monotherapy (addition of a second antipsychotic, an antidepressant, a mood stabilizer, or a benzodiazepine) for people diagnosed with schizophrenia. Health outcomes included times to psychiatric hospitalization, cardiovascular events, and death. However, the current methods developed in our research assume that the treatment is dichotomous and the outcomes of interest are fully observed (hence excluding possibly censored survival data). Although effort has been made to investigate survival analysis in our new approach (Tan 2019) as part of the modified contract, a fully developed method is beyond the scope of the current project. Hence, we conducted empirical applications using 2 existing observational studies, with Connors et al (1996) using PSs and Card (1995) using IVs; these are among the example studies often used in the related literature to evaluate new methods for causal inference.

Patient and Stakeholder Engagement

Our stakeholder committee includes Chacku Mathai, director of the Support, Technical Assistance, and Resources (STAR) Center of the National Alliance on Mental Illness; Mark Olfson, professor in the department of psychiatry, College of Physicians and Surgeons, Columbia University; Elizabeth Stuart, professor in the departments of mental health, biostatistics, and health policy and management, Johns Hopkins Bloomberg School of Public Health; and Almut Winterstein, professor in the department of pharmaceutical outcomes and policy, College of Pharmacy, University of Florida. Hence, the stakeholder team is multidisciplinary, with members ranging from biostatisticians to clinical researchers.

During the project period, we met with the stakeholder committee via teleconferences every 9 months and discussed methods and results via emails, with additional interactions on an as-needed basis. We received constructive comments and suggestions from the stakeholder consultants, which led to improvements in various aspects of the research, including comparison of the new and existing methods, presentation and interpretation of the results, applications of the methods, and the utility and user interface of the software.

Our research is primarily centered on statistical methods and theory. Through our engagement with the broader stakeholder communities, we understand better the needs in CER and are learning to focus our methods on addressing meaningful issues from practical perspectives. We also improve communication of the technical aspects of the methodology in nontechnical, substantive terms.

Propensity Scores and Unconfounded Estimation

Setup

Suppose that a simple random sample of n patients is available from a population under study. The observed data consist of independent and identically distributed observations {(T_i, Y_i, X_i):I = 1,…,n}, where T_i is a dichotomous treatment (T_i = 1 if treated or T_i = 0 otherwise), Y_i is an outcome variable, and X_i = (X_1i, …,X_pi) is a collection of measured covariates. In the potential outcomes framework for causal inference (Neyman 1923; Rubin 1974), 2 potential outcomes ( $Y_{i}^{0}$ , $Y_{i}^{1}$ ) are defined on each patient to indicate what the response would be under treatment 0 or 1, respectively. By consistency, the observed outcome Y_i is assumed to be either $Y_{i}^{0}$ or $Y_{i}^{1}$ , depending on whether T_i = 0 or T_i = 1. Two causal parameters commonly studied in CER are the ATE, defined as E(Y¹ − Y⁰) = μ¹ − μ⁰, with μ^t = E(Y^t), and the ATE on the treated (ATT), defined as E(Y¹ − Y⁰ |T = 1). For concreteness, we mainly discuss estimation of ATE.

For causal inference, a fundamental difficulty in estimating ATEs is that for each patient, only 1 potential outcome, Y⁰ or Y¹, is revealed as the observed outcome, and the other is missing (Holland 1986). Nevertheless, the ATE can be identified from the observed data under the 2 assumptions:

Unconfoundedness: T and (Y⁰, Y¹) are conditionally independent given X (Rubin 1976)
Overlap (or positivity): 0 < π^*(X) < 1, where π^*(X) = P(T = 1|X) is called the PS (Rosenbaum & Rubin 1983)

Under these assumptions, there are broadly 2 modeling approaches for estimating ATE.

One approach is to build a regression model for the OR function m^t* (X) = E (Y|T = t, X):

E (Y | T = t, X) = m^{t} (X; α^{t}) = ψ {α^{t T} g^{t} (X)},

(1)

where ψ(∙) is an inverse link function; g⁰ (X) and g¹ (X) are vectors of known functions of X, allowed to include nonlinear and interaction terms; and (α⁰, α¹) are vectors of unknown parameters. Let ( $({\hat{α}}_{M L}^{0}, {\hat{α}}_{M L}^{1})$ ) be the least-squares or maximum quasi-likelihood estimators of (α⁰, α¹), and let ${\hat{m}}_{M L}^{t} (X) = m^{t} (X; {\hat{α}}_{M L}^{t})$ . If OR model (1) is correctly specified, then μ^t = E(Y^t) can be consistently estimated by a substitution estimator, ${\hat{μ}}_{O R}^{t} = n^{- 1} \sum_{i = 1}^{n} {\hat{m}}_{M L}^{t} (X_{i})$ for t = 0, 1.

An alternative approach is to build a regression model for the PS, π^*(X) = P(T = 1|X) (Rosenbaum & Rubin 1983):

P (T = 1 | X) = π (X; γ) = Π {γ^{T} f (X)},

(2)

where Π(∙) is an inverse link function; f(X) is a vector of known functions of X, allowed to include nonlinear and interaction terms; and γ is a vector of unknown parameters. Let ${\hat{γ}}_{M L}$ be the maximum likelihood estimator of γ. Various methods have been proposed to estimate ATEs by matching, stratification, or weighting on the fitted PS ${\hat{π}}_{M L} (X) = π (X; {\hat{γ}}_{M L})$ . We focus on IPW, which is central to rigorous theory of statistical estimation for causal inference and missing-data problems (eg, Tsiatis 2006). Two standard IPW estimators for μ¹ are

{\hat{μ}}_{I P W}^{1} ({\hat{π}}_{M L}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{T_{i} Y_{i}}{{\hat{π}}_{M L} (X_{i})}, {\hat{μ}}_{r I P W}^{1} ({\hat{π}}_{M L}) = {\hat{μ}}_{I P W}^{1} ({\hat{π}}_{M L}) / {\frac{1}{n} \sum_{i = 1}^{n} \frac{T_{i}}{{\hat{π}}_{M L} (X_{i})}} .

(3)

Similarly, 2 standard IPW estimators for μ⁰ are

{\hat{μ}}_{I P W}^{0} ({\hat{π}}_{M L}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{(1 - T_{i}) Y_{i}}{1 - {\hat{π}}_{M L} (X_{i})}, {\hat{μ}}_{r I P W}^{0} ({\hat{π}}_{M L}) = {\hat{μ}}_{I P W}^{0} ({\hat{π}}_{M L}) / {\frac{1}{n} \sum_{i = 1}^{n} \frac{1 - T_{i}}{1 - {\hat{π}}_{M L} (X_{i})}} .

(4)

If PS model (2) is correctly specified, then these IPW estimators are asymptotically valid as the sample size n increases to ∞. However, if PS model (2) is misspecified, the IPW estimators can perform poorly. Even when the PS model is correct or mildly misspecified, inverse weighting can be instable due to estimated PSs close to 0 or 1 (eg, Kang & Schafer 2007).

To attain consistency for the mean μ^t, the OR estimator ${\hat{μ}}_{O R}^{t}$ or the IPW estimator ${\hat{μ}}_{r I P W}^{t} ({\hat{π}}_{M L})$ relies on correct specification of OR model (1) or PS model (2), respectively. In contrast, there are doubly robust estimators depending on both OR and PS models in the augmented IPW form, which for μ¹ reads as follows:

{\hat{μ}}^{1} ({\hat{m}}^{1}, \hat{π}) = \frac{1}{n} \sum_{i = 1}^{n} φ (Y_{i}, T_{i}, X_{i}; {\hat{m}}^{1}, \hat{π}),

(5)

where ${\hat{m}}^{1} (X)$ and $\hat{π} (X)$ are fitted values of m^1*(X) and π^*(X), respectively, and

φ (Y, T, X; {\hat{m}}^{1}, \hat{π}) = \frac{T Y}{\hat{π} (X)} - {\frac{T}{\hat{π} (X)} - 1} {\hat{m}}^{1} (X) .

(6)

For example, the augmented IPW estimator used by Robins et al (1994) is ${\hat{μ}}^{1} ({\hat{m}}_{M L}^{1}, {\hat{π}}_{M L})$ , using the fitted values ${\hat{m}}_{M L}^{1} (X)$ and ${\hat{π}}_{M L} (X)$ from maximum (quasi-)likelihood estimation. See Kang & Schafer (2007) and Tan (2007, 2010a) for reviews in low-dimensional settings.

In high-dimensional settings, Belloni et al (2014) and Farrell (2015) studied the estimator ${\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{π}}_{R M L})$ using the fitted values ${\hat{m}}_{R M L}^{1} (X)$ and ${\hat{π}}_{R M L} (X)$ from Lasso regularized maximum likelihood (RML) estimation or similar methods. Their results are mainly of 2 types, each under suitable conditions. The first type shows double robustness: ${\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{π}}_{R M L})$ remains consistent if either OR model (1) or PS model (2) is correctly specified. The second type establishes valid CIs: ${\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{μ}}_{R M L})$ admits the asymptotic expansion

{\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{π}}_{R M L}) = \frac{1}{n} \sum_{i = 1}^{n} φ (Y_{i}, T_{i}, X_{i}; m^{1 *}, π^{*}) + o_{p} (n^{- \frac{1}{2}})

(7)

if both OR model (1) and PS model (2) are correctly specified. To address the gap between these results, it is desirable to develop new methods that lead to valid CIs without requiring both OR and PS models to be correctly specified.

There has been a large, growing literature on causal inference related to our work, in addition to work from Belloni et al (2014) and Farrell (2015). Examples include Robins et al (2017), Chernozhukov et al (2018), Smucler et al (2019), van der Laan et al (2019), Ning et al (2020), Dukes and Vansteelandt (2021), Hirschberg and Wager (2021), and Bradic et al (2021). See Tan (2020a, 2020b) and Ghosh and Tan (2020) for further discussion.

PSs and IPW Estimation

We developed regularized calibrated estimation for fitting PS models in high-dimensional settings. For concreteness, assume that PS model (2) is a logistic regression model:

P (T = 1 | X) = π (X; γ) = {[1 + \exp {- γ^{T} f (X)}]}^{- 1},

(8)

where f(x) = {1, f₁ (x), …, f_p (x)}^T is a vector of known functions including the constant, and γ = (γ₀, γ₁, …, γ_p)^T is a vector of unknown coefficients. The maximum likelihood estimator ${\hat{γ}}_{M L}$ is defined as a minimizer of the likelihood loss

ℓ_{M L} (γ) = \frac{1}{n} \sum_{i = 1}^{n} \log [1 + \exp {γ^{T} f (X_{i})}] - T_{i} γ^{T} f (X_{i}),

or, equivalently, as a solution to the score equation

\frac{1}{n} \sum_{i = 1}^{n} {T_{i} - π (X_{i}; γ)} f (X_{i}) = 0.

Alternatively, calibrated estimation uses a system of estimating equations such that the weighted averages of the covariates in the treated subsample are equal to the simple averages in the overall sample. The calibrated estimator ${\hat{γ}}_{C A L}^{1}$ is defined as a solution to

\frac{1}{n} \sum_{i = 1}^{n} {1 - \frac{T_{i}}{π (X_{i}; γ)}} f (X_{i}) = 0.

(9)

Interestingly, the estimator ${\hat{γ}}_{C A L}^{1}$ can be equivalently defined as a minimizer of the following loss function, called the calibration loss:

ℓ_{C A L} (γ) = \frac{1}{n} \sum_{i = 1}^{n} T_{i} \exp {- γ^{T} f (X_{i})} + (1 - T_{i}) γ^{T} f (X_{i}) .

(10)

Moreover, ℓ_CAL (γ) is convex in γ and is strictly convex and bounded from below under a certain nonseparation condition (Tan 2020a). The preceding idea of calibrated estimation and similar methods have been studied, and sometimes independently (re)derived, in various contexts of causal inference, missing-data problems, and survey sampling (eg, Folsom 1991; Tan 2010a; Graham et al. 2012; Hainmueller 2012; Imai & Ratkovic 2014; Kim & Haziza 2014; Vermeulen & Vansteelandt 2015; Chan et al. 2016).

To compare calibrated and likelihood estimation, we establish a new relationship between the loss functions ℓ_CAL (γ) and ℓ_ML (γ). To allow for misspecification of PS model (8), we write ℓ_ML (γ) = κ_ML (γ^Tf) and ℓ_CAL (γ) = κ_CAL (γ^Tf), where for a function g(x),

κ_{M L} (g) = \frac{1}{n} \sum_{i = 1}^{n} \log [1 + \exp {g (X_{i})}] - T_{i} g (X_{i}), and

κ_{C A L} (g) = \frac{1}{n} \sum_{i = 1}^{n} T_{i} \exp {- g (X_{i})} + (1 - T_{i}) g (X_{i}) .

Then, κ_ML (g^*) and κ_CAL (g^*) are well defined for the true log odds ratio g^*(x) = log[π^*(x)/{1-π* (x)}], even when PS model (8) is misspecified, that is, when g^*(x) is not of the form γ^T f(x). For 2 probabilities, ρ ∈ (0, 1) and ρ′∈ (0,1), the Kullback-Leibler divergence is

L (ρ, ρ') = ρ' \log (ρ' / ρ) + (1 - ρ') \log {(1 - ρ') / (1 - ρ)} \geq 0.

In addition, let K(ρ,ρ′) = ρ′/ρ – 1 − log(ρ′/ρ), which is strictly convex in ρ′/ρ, with a minimum of 0 at ρ′/ρ = 1. Then Tan (2020a, Proposition 1) shows that

E {ℓ_{M L} (γ) - κ_{M L} (g^{*})} = E [L {π (X; γ), π^{*} (X)}], and

E {ℓ_{C A L} (γ) - κ_{C A L} (g^{*})} = E [K {π (X; γ), π^{*} (X)}] + E [L {π (X; γ), π^{*} (X)}] .

Hence, minimizing the expected calibration loss in γ involves reducing both the expected likelihood loss E [L{π(X;γ), π^*(X)}] and an additional term E [K{π(X; γ), π^*(X)}], which can be related to the mean squared relative error (MRSE) between π(X; γ) and π^*(X):

MRSE (γ) = E [{\frac{π^{*} (X)}{π (X; γ)} - 1}^{2}] .

This measure of relative errors in PSs governs the mean squared error (MSE) of the IPW estimator, $MSE (γ) = E [{{\hat{μ}}_{I P W}^{1} (γ) - μ^{1}}^{2}]$ , where ${\hat{μ}}_{I P W}^{1} (γ) = n^{- 1} \sum_{i = 1}^{n} T_{i} Y_{i} / π (X_{i;} γ)$ . The following result was obtained by Tan (2020a, Proposition 2):

Suppose that E[(Y¹)²] ≤ c and π^*(X) ≥ δ for some constants c > 0 and δ ∈ (0,1).

If π(X; γ) ≥ a π^*(X) for some constant a ∈ (0,1/2), then

$MRSE (γ) \leq \frac{5}{3 a} E [K {π (X; γ), π^{*} (X)}] \leq \frac{5}{3 a} E {ℓ_{C A L} (γ) - κ_{C A L} (g^{*})} .$
(11)

The factor 5/(3a) in general cannot be improved up to a constant, independent of a.
If π(X; γ) ≥ b for some constant b ∈ (0,1), then

$MRSE (γ) \leq \frac{1}{2 b^{2}} E [L {π (X; γ), π^{*} (X)}] \leq \frac{1}{2 b^{2}} E {ℓ_{M L} (γ) - κ_{M L} (g^{*})} .$
(12)

The factor 1/(2b²) in general cannot be improved up to a divisor of order log(b⁻¹).

The expected calibration loss is inflated in (11) by a leading factor 5/(3a), which remains bounded from above as long as a is bounded away from 0, even when some PSs π(X; γ) are close to 0. In contrast, the expected likelihood loss is inflated in (12) by a leading factor 1/(2b²), which may be arbitrarily large when some PSs π(X; γ) are close to 0. This result demonstrates that the minimization of the expected calibration loss controls the MRSE of PSs more strongly than does the expected likelihood loss. See Tan (2020a), Section 3.2, for further discussion.

The preceding discussion compares calibrated and likelihood estimation through their loss functions. Next, we propose a regularized calibrated estimator of γ in PS model (8). There are 2 motivations. First, regularization is needed in the situation where ${\hat{γ}}_{C A L}^{1}$ does not exist because the calibration loss (10) is unbounded from below. In our numerical study (Tan 2020a), nonconvergence was found, for example, in 100% of repeated simulations with n = 200 and p = 50. Second, regularization is also needed in high-dimensional settings where the dimension of the regressor vector f (X) is close to or greater than the sample size.

The regularized calibrated estimator, denoted by ${\hat{γ}}_{R C A L}^{1}$ , is defined by minimizing the calibration loss ℓ_CAL (γ) with a Lasso penalty (Tibshirani 1996):

ℓ_{R C A L} (γ) = ℓ_{C A L} (γ) + λ ∥ γ_{1 : p} ∥_{1},

(13)

where γ_1:p = (γ₁, …, γ_p)^T excluding the intercept γ₀, ∥⋅∥₁ denotes the L₁ norm such that $∥ γ_{1 : p} ∥_{1} = \sum_{j = 1}^{p} | γ_{j} |$ , and λ ≥ 0 is a tuning parameter. By the Karush-Kuhn-Tucker (KKT) condition for minimization of (13), the fitted PSs, ${\hat{π}}_{R C A L}^{1} (X) = π (X; {\hat{γ}}_{R C A L}^{1})$ , satisfies

\frac{1}{n} \sum_{i = 1}^{n} \frac{T_{i}}{{\hat{π}}_{R C A L}^{1} (X_{i})} = 1,

(14)

\frac{1}{n} | \sum_{i = 1}^{n} \frac{T_{i} f_{j} (X_{i})}{{\hat{π}}_{R C A L}^{1} (X_{i})} - \sum_{i = 1}^{n} f_{j} (X_{i}) | \leq λ, j = 1, \dots, p .

(15)

The equality holds in (15) for any j such that the jth estimate ${({\hat{γ}}_{R C A L}^{1})}_{j}$ is nonzero. The weighted average of each covariate f_j(X_i) over the treated group may differ from the overall average of f_j(X_i) by no more than λ. In other words, introducing the Lasso penalty to calibrated estimation yields a relaxation of the equalities (9) to box constraints (15). Moreover, by (14), the IPWs, $1 / {\hat{π}}_{R C A L}^{1} (X_{i})$ with T_i = 1, sum to the sample size n. Then, the 2 resulting IPW estimators, ${\hat{μ}}_{I P W}^{1} ({\hat{π}}_{R C A L}^{1})$ and ${\hat{μ}}_{r I P W}^{1} ({\hat{π}}_{R C A L}^{1})$ , obtained from (3) with ${\hat{π}}_{M L} (X)$ replaced by ${\hat{π}}_{R C A L}^{1} (X)$ , are identical to each other.

We present a novel algorithm for computing the proposed estimator ${\hat{γ}}_{R C A L}^{1}$ , that is, minimizing ℓ_RCAL (γ) in (13) for any fixed λ. The basic idea of the algorithm is to iteratively form a quadratic approximation to the calibration loss ℓ_CAL (γ) in (10) and solve a Lasso-penalized weighted least-squares problem, similar to existing algorithms for Lasso-penalized maximum likelihood–based logistic regression (eg, Friedman et al. 2010). However, we construct a suitable quadratic approximation after an additional step of replacing certain sample quantities with model expectations. This idea is known as Fisher scoring and was previously used to derive the iterative reweighted least-squares method for fitting generalized linear models with noncanonical links, such as probit regression (McCullagh & Nelder 1989). Moreover, to reduce computational cost, we exploit the majorization-minimization technique (Wu & Lange 2010; Bohning & Lindsay 1988) in solving the Lasso-penalized weighted least-squares problem. See Tan (2020a), Section 4.2, for a detailed discussion.

We provide a theoretical analysis of the regularized calibrated estimator ${\hat{γ}}_{R C A L}^{1}$ and the resulting IPW estimator of μ¹, allowing for misspecification of PS model (8), in high-dimensional settings where the number of regressors in f(X) is close to or greater than the sample size. Our analysis requires a sparsity assumption that only a small but unknown subset (relative to the sample size) of “significant” regressors is associated with nonzero coefficients. Such assumptions are commonly made in high-dimensional regression, where Lasso-type-penalized estimation can theoretically be shown to achieve nearly as small errors of estimation as in oracle estimation, where those significant regressors were known and only those were included in the regression model (eg, Buhlmann & van de Geer 2011). Heuristically, sparsity-based methods and theory, including ours, can be motivated according to the “bet on sparsity” principle: “Use a procedure that does well in sparse problems, since no procedure does well in dense problems” (Hastie et al. 2015).

Our analysis of Lasso-penalized M-estimators deals with the convergence of the estimators, ${\hat{γ}}_{R C A L}^{1}$ , to their target values ${\bar{γ}}_{C A L}^{1}$ under model misspecification, which is related to but distinct from previous results on excess prediction errors (eg, Buhlmann & van de Geer 2011). Moreover, our analysis of ${\hat{μ}}_{I P W}^{1} ({\hat{π}}_{R C A L}^{1})$ , the IPW estimator of μ¹ based on ${\hat{γ}}_{R C A L}^{1}$ , carefully exploits the results described earlier on the calibration loss ℓ_CAL (γ) to obtain convergence under weaker conditions than previously realized. Our results show that the squared difference $| {\hat{μ}}_{I P W}^{1} ({\hat{π}}_{R C A L}^{1}) - {\hat{μ}}_{I P W}^{1} ({\bar{π}}_{C A L}^{1}) |^{2}$ is of the order

s \frac{\log p}{n},

provided that $s \sqrt{\frac{\log p}{n}}$ is sufficiently small, for example, tends to 0 as n and p increase, where s denotes the sparsity size of ${\bar{γ}}_{C A L}^{1}$ , that is, the number of nonzero elements in ${\bar{γ}}_{C A L}^{1}$ . See Tan (2020a, Section 4.3) for a detailed discussion.

So far, our theory and methods have focused mainly on the estimation of μ¹, but they can be directly extended to an estimation of μ⁰, and hence the ATE, μ¹ − μ⁰. For estimation of μ⁰ with PS model (8), the calibrated estimator of γ, denoted by ${\hat{γ}}_{C A L}^{0}$ , is defined as a solution to

\frac{1}{n} \sum_{i = 1}^{n} {1 - \frac{1 - T_{i}}{1 - π (X_{i}; γ)}} f (X_{i}) = 0.

By exchanging T with 1 − T and exchanging γ with − γ in (10), the corresponding loss function minimized by ${\hat{γ}}_{C A L}^{0}$ is

ℓ_{C A L}^{0} (γ) = \frac{1}{n} \sum_{i = 1}^{n} (1 - T_{i}) \exp {γ^{T} f (X_{i})} - T_{i} γ^{T} f (X_{i}) .

For fixed λ ≥ 0, the regularized calibrated estimator ${\hat{γ}}_{R C A L}^{0}$ is defined as a minimizer of

ℓ_{R C A L}^{0} (γ) = ℓ_{C A L}^{0} (γ) + λ ∥ γ_{1 : p} ∥_{1} .

The fitted PS, ${\hat{π}}_{R C A L}^{0} (X) = π (X; {\hat{γ}}_{R C A L}^{0})$ , then satisfies (14) and (15) with T_i replaced by 1 − T_i and ${\hat{π}}_{R C A L}^{1} (X_{i})$ replaced by $1 - {\hat{π}}_{R C A L}^{0} (X_{i})$ , and similar interpretations apply as discussed previously. The resulting IPW estimator of μ⁰ is ${\hat{μ}}_{I P W}^{0} ({\hat{π}}_{R C A L}^{0})$ , obtained from (4) with ${\hat{π}}_{M L} (X)$ replaced by ${\hat{π}}_{R C A L}^{0} (X)$ , and that of μ¹ − μ⁰ is ${\hat{μ}}_{I P W}^{1} ({\hat{π}}_{R C A L}^{1}) - {\hat{μ}}_{I P W}^{0} ({\hat{π}}_{R C A L}^{0})$ .

An interesting feature of our approach is that 2 different estimators of the PS are used when estimating μ¹ and μ⁰. The estimators ${\hat{γ}}_{R C A L}^{1}$ and ${\hat{γ}}_{R C A L}^{0}$ may in general have different asymptotic limits when PS model (8) is misspecified, even though their asymptotic limits coincide when model (8) is correctly specified. Such possible differences should not be of concern: the 2 estimators ${\hat{μ}}_{I P W}^{1} (\hat{π})$ and ${\hat{μ}}_{I P W}^{0} (\hat{π})$ are decoupled, involving 2 disjoint subsets of fitted PSs on the treated {i: T_i = 1} and the untreated {i: T_i = 0}, respectively. In fact, separate estimation of PSs and inverse probability weights for the treated and untreated samples can lead to more flexible approximations, and hence potentially less bias in the presence of model misspecification. Furthermore, the discovery of whether substantial differences exist between these separately fitted PSs can be used for diagnosis of the validity of model (8). See Chan et al (2016, Section 2.3) for a related discussion.

The proposed method for estimating PSs is implemented as part of the publicly released R package RCAL (Tan & Sun 2020). The implementation is based on the Fisher scoring descent algorithm in Tan (2020a). At the low level, the least-squares Lasso problem is solved using a variation of the active set algorithm, which enjoys a finite termination property (Osborne et al. 2000; Yang & Tan 2018).

Model-Assisted and Doubly Robust Inferences

Note: The material in this section is adapted from Tan Z. Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Ann Statist. 2020;48:811-837.

We develop new methods and theory in high-dimensional settings to obtain not only doubly robust point estimators for ATEs, which remain consistent if either a PS model or an OR model is correctly specified, but also model-assisted CIs, which are valid when the PS model is correctly specified but the OR model may be misspecified.

The development in the “PSs and IPW Estimation” section only involves estimating PSs and applying the IPW estimators. To mitigate possible misspecification of PS models and facilitate constructing CIs, we use a doubly robust estimator of μ¹ in the augmented IPW form (5), depending on both a PS model and an OR model. There are 2 motivations, in theory. First, the double-robustness property ensures consistency of point estimation even when 1 of the OR and PS models is misspecified. Second, more importantly in high-dimensional settings, augmented IPW estimation makes it possible to establish a simple asymptotic expansion of the resulting estimator such that valid and numerically tractable CIs can be obtained for μ¹. In contrast, obtaining valid CIs is difficult based on IPW estimation only, due to penalized estimation in high-dimensional settings, even when the PS model is correctly specified.

To focus on main ideas, consider a logistic PS model (8) as in the “PSs and IPW Estimation” section and a linear OR model,

E (Y | T = 1, X) = m^{1} (X; α^{1}) = α^{1 T} f (X),

(16)

where $α^{1} = {(α_{0}^{1}, α_{1}^{1}, \dots, α_{p}^{1})}^{T}$ , of the same dimension as γ. That is, OR model (1) is specified with the identity link and the regressor vector g¹ (X) is the same as f (X) in PS model (2). This condition can be satisfied possibly after enlarging model (1) or (2) to reach the same dimension. Our point estimator of μ¹ is

{\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) = \frac{1}{n} \sum_{i = 1}^{n} φ (Y_{i}, T_{i}, X_{i}; {\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}),

(17)

where φ () is defined in (6), ${\hat{π}}_{R C A L}^{1} (X) = π (X; {\hat{γ}}_{R C A L}^{1})$ with the regularized calibrated estimator ${\hat{γ}}_{R C A L}^{1}$ as in the “PSs and IPW Estimation” section, and ${\hat{m}}_{R W L}^{1} (X) = m^{1} (X; {\hat{α}}_{R W L}^{1})$ with ${\hat{α}}_{R W L}^{1}$ defined as follows. See Tan (2020b), Section 3.2, for a discussion on how the construction of these estimators is linked to the desired asymptotic expansion (22) for ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ .

The estimator ${\hat{α}}_{R W L}^{1}$ is a regularized weighted least-squares estimator of α¹, defined as a minimizer to the penalized loss function

ℓ_{R W L} (α^{1}; {\hat{γ}}_{R C A L}^{1}) = ℓ_{W L}^{1} (α^{1}; {\hat{γ}}_{R C A L}^{1}) + λ ∥ α_{1 : p}^{1} ∥_{1},

(18)

where $α_{1 : p}^{1} = {(α_{1}^{1}, \dots, α_{p}^{1})}^{T}$ excluding the intercept $α_{0}^{1}$ , λ ≥ 0 is a tuning parameter, and $ℓ_{W L} (α^{1}; {\hat{γ}}_{R C A L}^{1})$ is the weighted least-squares loss,

ℓ_{W L} (α^{1}; {\hat{γ}}_{R C A L}^{1}) = \frac{1}{2 n} \sum_{i = 1}^{n} T_{i} \frac{1 - {\hat{π}}_{R C A L}^{1} (X_{i})}{{\hat{π}}_{R C A L}^{1} (X_{i})} {Y_{i} - α^{1 T} f (X_{i})}^{2} .

That is, the observations in the treated group are weighted by ${1 - {\hat{π}}_{R C A L}^{1} (X_{i})} / {\hat{π}}_{R C A L}^{1} (X_{i})$ , which differs slightly from the commonly used inverse probability weight $1 / {\hat{π}}_{R C A L}^{1} (X_{i})$ . Larger weights are associated with observations with lower-fitted PSs, thereby requiring model (16) to be better fitted in covariate regions with more missing data.

Similarly as discussed regarding ${\hat{γ}}_{R C A L}^{1}$ , there are simple and interesting implications of the estimator ${\hat{α}}_{R W L}^{1}$ . By the KKT condition for minimizing (18), the fitted OR function ${\hat{m}}_{R W L}^{1} (X)$ satisfies

\frac{1}{n} \sum_{i = 1}^{n} T_{i} \frac{1 - {\hat{π}}_{R C A L}^{1} (X_{i})}{{\hat{π}}_{R C A L}^{1} (X_{i})} {Y_{i} - {\hat{m}}_{R W L}^{1} (X_{i})} = 0,

(19)

\frac{1}{n} | \sum_{i = 1}^{n} T_{i} \frac{1 - {\hat{π}}_{R C A L}^{1} (X_{i})}{{\hat{π}}_{R C A L}^{1} (X_{i})} {Y_{i} - {\hat{m}}_{R W L}^{1} (X_{i})} f_{j} (X_{i}) | \leq λ, j = 1, \dots, p .

(20)

In (20), the inequality reduces to equality for any j such that the jth estimate ${({\hat{α}}_{R W L}^{1})}_{j}$ is nonzero. Equation (19) implies that the estimator ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ can be recast as

\begin{array}{l} {\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) = \frac{1}{n} \sum_{i = 1}^{n} {\hat{m}}_{R W L}^{1} (X_{i}) + \frac{T_{i}}{{\hat{π}}_{R C A L}^{1} (X_{i})} {Y_{i} - {\hat{m}}_{R W L}^{1} (X_{i})} \\ = \frac{1}{n} \sum_{i = 1}^{n} T_{i} Y_{i} + (1 - T_{i}) {\hat{m}}_{R W L}^{1} (X_{i}), \end{array}

(21)

which takes the form of linear prediction estimators known in the survey literature (eg, Sarndal et al. 1992): $n^{- 1} \sum_{i = 1}^{n} T_{i} Y_{i} + (1 - T_{i}) {\hat{m}}^{1} (X_{i})$ , for some fitted OR function ${\hat{m}}^{1} (X_{i})$ . As a consequence, ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ always falls within the range of the observed outcomes {Y_i: T_i = 1, i = 1,…,n} and the predicted values ${{\hat{m}}_{R W L}^{1} (X_{i}) : T_{i} = 1, i = 1, \dots, n}$ . This boundedness property is not satisfied by the estimator ${\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{π}}_{R M L})$ , where ${\hat{m}}_{R M L}^{1} (X)$ and ${\hat{π}}_{R M L} (X)$ are the fitted values similar to ${\hat{m}}_{M L}^{1} (X)$ and ${\hat{π}}_{M L} (X)$ in the “Setup” section but based on regularized likelihood estimation in OR and PS models.

We provide a high-dimensional analysis of the proposed estimator ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ , allowing for possible model misspecification. See Tan (2020b, Section 3.3) for a detailed discussion. In contrast, with asymptotic expansion (7) for ${\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{π}}_{R M L})$ with correctly specified OR and PS models, our main result shows that under suitable conditions, the estimator ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ admits the asymptotic expansion

{\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) = \frac{1}{n} \sum_{i = 1}^{n} φ (Y_{i}, T_{i}, X_{i}; {\bar{m}}_{W L}^{1}, {\bar{π}}_{C A L}^{1}) + o_{p} (n^{- \frac{1}{2}}),

(22)

where ${\bar{π}}_{C A L}^{1} (X) = π (X; {\bar{γ}}_{C A L}^{1})$ , ${\bar{m}}_{W L}^{1} (X) = m^{1} (X; {\bar{α}}_{W L}^{1})$ , and ${\bar{γ}}_{C A L}^{1}$ and ${\bar{α}}_{W L}^{1}$ are limit or target values defined as follows. With possible model misspecification, the target value ${\bar{γ}}_{C A L}^{1}$ is defined as a minimizer of the expected calibration loss

E {ℓ_{C A L} (γ)} = E [T \exp {- γ^{T} f (X)} + (1 - T) γ^{T} f (X)] .

If PS model (8) is correctly specified, then ${\bar{π}}_{C A L}^{1} (X) = π^{*} (X)$ . Otherwise, ${\bar{π}}_{C A L}^{1} (X)$ may differ from π^*(X). The target value ${\bar{α}}_{W L}^{1}$ is defined as a minimizer of the expected loss

E {ℓ_{W L} (α^{1}; {\bar{γ}}_{C A L}^{1})} = \frac{1}{2} E [T \frac{1 - {\bar{π}}_{C A L}^{1} (X)}{{\bar{π}}_{C A L}^{1} (X)} {Y - α^{1 T} f (X)}^{2}] .

If OR model (16) is correctly specified, then ${\bar{m}}_{W L}^{1} (X) = m^{1 *} (X)$ . But ${\bar{m}}_{W L}^{1} (X)$ may in general differ from m^1*(X). Then, Tan (2020b), Proposition 1, shows that if either logistic PS model (8) or linear OR model (16) is correctly specified, the following results hold under suitable conditions related to the sparsity of the target values ${\bar{γ}}_{C A L}^{1}$ and ${\bar{α}}_{W L}^{1}$ .

${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ is consistent for μ¹ and asymptotically normally distributed:

$\sqrt{n} {{\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) - μ^{1}} \to_{D} N (0, V),$
(23)

where $V = v a r {φ (Y, T, X; {\bar{m}}_{W L}^{1}, {\bar{π}}_{C A L}^{1})}$ .
A consistent estimator of V is

$\hat{V} = \frac{1}{n} \sum_{i = 1}^{n} {φ (Y_{i}, T_{i}, X_{i}; {\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) - {\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})}^{2} .$
(24)
An asymptotic (1 − c) CI for μ¹ is

${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) \pm z_{\frac{c}{2}} \sqrt{\frac{\hat{V}}{n}},$
(25)

where z_c_/2 is the (1 − c/2) quantile of N(0,1). Hence, a doubly robust CI for μ¹ is obtained.

Next, we present model-assisted CIs for μ¹, in the setting where a generalized linear OR model is used together with a logistic PS model. Consider a generalized linear OR model with a canonical link,

E (Y | T = 1, X) = m^{1} (X; α^{1}) = ψ {α^{1 T} f (X)},

(26)

that is, model (1) with the vector of covariate functions g¹ (X) taken to be the same as f (X) in model (8). Our point estimator of μ¹ is ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ as defined in (17), where ${\hat{π}}_{R C A L}^{1} (X) = π (X; {\hat{γ}}_{R C A L}^{1})$ and ${\hat{m}}_{R W L}^{1} (X) = m^{1} (X; {\hat{α}}_{R W L}^{1})$ . The estimator ${\hat{γ}}_{R C A L}^{1}$ is the regularized calibrated estimator of γ as before. However, ${\hat{α}}_{R W L}^{1}$ is a regularized weighted likelihood estimator of α¹, defined as a minimizer of the penalized loss function

ℓ_{R W L} (α^{1}; {\hat{γ}}_{R C A L}^{1}) = ℓ_{W L}^{1} (α^{1}; {\hat{γ}}_{R C A L}^{1}) + λ ∥ α_{1 : p}^{1} ∥_{1},

(27)

where $α_{1 : p}^{1} = {(α_{1}^{1}, \dots, α_{p}^{1})}^{T}$ , excluding the intercept $α_{0}^{1}$ , λ ≥ 0 is a tuning parameter, and $ℓ_{W L} (α^{1}; {\hat{γ}}_{R C A L}^{1})$ is the weighted likelihood loss,

ℓ_{W L} (α^{1}; {\hat{γ}}_{R C A L}^{1}) = \frac{1}{n} \sum_{i = 1}^{n} T_{i} \frac{1 - {\hat{π}}_{R C A L}^{1} (X_{i})}{{\hat{π}}_{R C A L}^{1} (X_{i})} [- Y_{i} α^{1 T} f (X_{i}) + Ψ {α^{1 T} f (X_{i})}],

with $Ψ (u) = \int_{0}^{u} ψ (t) d t$ . The regularized weighted least-squares estimator ${\hat{α}}_{R W L}^{1}$ for a linear OR model is recovered in the special case of the identity link, ψ(u) = u and Ψ(u) = u²/2. In addition, the KKT condition for minimizing the penalized loss function (27) remains the same as in (19) and (20); hence, the estimator ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ can be put in the prediction form (21), which ensures that the boundedness property that ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ always falls within the range of the observed outcomes {Y_i: T_i = 1, i = 1,…, n} and the predicted values ${{\hat{m}}_{R W L}^{1} (X_{i}) : T_{i} = 1, i = 1, \dots, n}$ .

Tan (2020b, Proposition 3) shows that if logistic PS model (8) is correctly specified but OR model (26) may be misspecified, then ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ admits asymptotic expansion (22) with ${\bar{π}}_{C A L}^{1} (X) = π^{*} (X)$ , and results (23) to (25) hold under suitable conditions, where ${\bar{m}}_{W L}^{1} (X) = m^{1} (X; {\bar{α}}_{W L}^{1})$ and the target value ${\bar{α}}_{W L}^{1}$ is defined as a minimizer of the expected loss $E {ℓ_{W L} (α^{1}; {\bar{γ}}_{C A L}^{1})}$ , similar to before. The CIs obtained for μ¹ are said to be PS based, OR assisted. We make the following remarks:

As mentioned in Tan (2020b, Section 3.4), it is also possible to derive OR-based, PS-assisted CIs for μ¹ with a nonlinear OR model. In that case, an estimator of γ would be defined depending on an estimator of α¹, similar to how ${\hat{α}}_{R W L}^{1}$ is defined depending on ${\hat{γ}}_{R C A L}^{1}$ .
The preceding PS-based, OR-assisted CIs are convenient when ATEs associated with several outcomes need to be estimated using the same set of fitted PSs. Moreover, regularized calibration estimation of PSs enjoys attractive properties, as discussed in the “PSs and IPW Estimation” section.
With additional complexity, a 2-step procedure was developed by Ghosh and Tan (2020) to obtain doubly robust CIs for μ¹.

Our theory and methods are presented mainly on estimation of μ¹, but they can be directly extended for estimating μ⁰ and hence ATE, that is, μ¹ − μ⁰. Consider a logistic PS model (8) and a generalized linear OR model,

E (Y | T = 0, X) = m^{0} (X; α^{0}) = ψ {α^{0 T} f (X)},

(28)

where f(X) is the same vector of covariate functions as in PS model (8) and α⁰ is a vector of unknown parameters. Our point estimator of ATE is ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1}) - {\hat{μ}}^{0} ({\hat{m}}_{R W L}^{0}, {\hat{π}}_{R C A L}^{0})$ , and that of μ⁰ is

{\hat{μ}}^{0} ({\hat{m}}_{R W L}^{0}, {\hat{π}}_{R C A L}^{0}) = \frac{1}{n} \sum_{i = 1}^{n} φ (Y_{i}, 1 - T_{i}, X_{i}; {\hat{m}}_{R W L}^{0}, 1 - {\hat{π}}_{R C A L}^{0}),

where φ() is defined in (6), ${\hat{π}}_{R C A L}^{0} (X) = π (X; {\hat{γ}}_{R C A L}^{0})$ , and ${\hat{m}}_{R W L}^{0} (X) = m^{0} (X; {\hat{α}}_{R W L}^{0})$ . The estimator ${\hat{γ}}_{R C A L}^{0}$ is the same regularized calibrated estimator as in the “PSs and IPW Estimation” section. The estimator ${\hat{α}}_{R W L}^{0}$ is defined similarly as ${\hat{α}}_{R W L}^{1}$ , but with the loss function $ℓ_{W L} (α^{1}; {\hat{γ}}_{R C A L}^{1})$ in (27) replaced by

ℓ_{W L}^{0} (α^{0}; {\hat{γ}}_{R C A L}^{0}) = \frac{1}{n} \sum_{i = 1}^{n} (1 - T_{i}) \frac{{\hat{π}}_{R C A L}^{0} (X_{i})}{1 - {\hat{π}}_{R C A L}^{0} (X_{i})} [- Y_{i} α^{0 T} f (X_{i}) + Ψ {α^{0 T} f (X_{i})}] .

Under similar conditions as in the estimation of μ¹, the estimator ${\hat{μ}}^{0} ({\hat{m}}_{R W L}^{0}, {\hat{π}}_{R C A L}^{0})$ admits the asymptotic expansion

{\hat{μ}}^{0} ({\hat{m}}_{R W L}^{0}, {\hat{π}}_{R C A L}^{0}) = \frac{1}{n} \sum_{i = 1}^{n} φ (Y_{i}, 1 - T_{i}, X_{i}; {\bar{m}}_{W L}^{0}, 1 - {\bar{π}}_{C A L}^{0}) + o_{p} (n^{- \frac{1}{2}}),

where ${\bar{π}}_{C A L}^{0} (X) = π (X; {\bar{γ}}_{C A L}^{0})$ , ${\bar{m}}_{W L}^{0} (X) = m^{0} (X; {\bar{α}}_{W L}^{0})$ , and ${\bar{γ}}_{C A L}^{0}$ and ${\bar{α}}_{W L}^{0}$ are the target values defined similarly as ${\bar{γ}}_{C A L}^{1}$ and ${\bar{α}}_{W L}^{1}$ . Then, Wald CIs for μ⁰ and ATE can be derived similarly as in the estimation of μ⁰ and shown to be either doubly robust in the case of linear OR models, or valid if PS model (8) is correctly specified but OR models (26) and (28) may be misspecified for nonlinear outcome models.

The proposed method for model-assisted inference about ATEs is implemented as part of the publicly released R package RCAL (Tan & Sun 2020). The package provides both functions for regularized calibrated estimation of PSs and OR functions and for subsequent estimation of ATEs. In addition to a full reference manual, a vignette is also included in the package to give a direct and accessible introduction to the method with simple examples.

To facilitate applications, we focus on the parametric setting as described in the “Setup” section, where OR and PS models are generalized linear regressions in the usual form but prespecified with possibly a large number of regressors. Each regressor is a known function of the covariates, for example, the main effect or a spline term of a single covariate or an interaction of 2 covariates. Our method employs Lasso-type-penalized estimation to perform variable selection (or, more precisely, regressor selection). For technical justification, our theory is developed in Tan (2020a b) with suitable assumptions regarding the exact sparsity (ie, the number of nonzero coefficients) on the limit values of the estimators in OR and PS models, which may be misspecified. Nevertheless, various extensions can be investigated. For example, approximate or weak sparsity can be accommodated, along similar lines as in Negahban et al (2012), Smucler et al (2019), and Bradic et al (2021). Moreover, regularized calibrated estimation can be combined with related methods, such as nonlinear regression models (Hastie et al. 2009), targeted maximum likelihood estimation (van der Laan & Rubin 2006; van der Laan & Rose 2017), and ensemble learning (van der Laan et al. 2007).

Simulation Studies

We conduct extensive simulation studies to evaluate the performances of the proposed methods and compare them with existing methods in both low-dimensional and high-dimensional settings. These studies are reported, in full detail, in Tan (2020a) for PS estimation and in Tan (2020b) for model-assisted inference about ATEs. Below, we describe the simulation study in Tan (2020b, Section 4).

Let X = (X₁, …, X_p) be multivariate normal with means 0 and covariances cov (X_j, X_k) = 2^−|j−k| for 1 ≤ j, k ≤ p. In addition, let $X_{j}^{†} = X_{j} + {m a x (0, X_{j} + 1)}^{2}$ for j = 1,…, 4. Consider the following data-generating configurations:

C1.

Generate T given X from a Bernoulli distribution with

P (T = 1 | X) = {1 + \exp (- 1 - X_{1} - 0.5 X_{2} - 0.25 X_{3} - 0.125 X_{4})}^{- 1},

and, independently, generate Y¹ given X from a normal distribution with variance 1 and mean

E (Y^{1} |X) = X_{1} + 0.5 X_{2} + 0.25 X_{3} + 0.125 X_{4} .

C2.

Generate T given X as in (C1), but, independently, generate Y¹ given X from a normal distribution with variance 1 and mean

E (Y^{1} |X) = X_{1}^{†} + 0.5 X_{2}^{†} + 0.25 X_{3}^{†} + 0.125 X_{4}^{†} .

C3.

Generate Y¹ given X as in (C1), but, independently, generate T given X from a Bernoulli distribution with

P (T = 1 | X) = {1 + \exp (- 1 - X_{1}^{†} - 0.5 X_{2}^{†} - 0.25 X_{3}^{†} - 0.125 X_{4}^{†})}^{- 1} .

To study the estimation of μ¹, note that the observed data consist of independent and identically distributed observations {(T_iY_i, Y_i, X_i): i = 1,…, n}. Consider logistic PS model (8) and linear OR model (16), both with f_j(X) = X_j for j = 1,…, p. Then, the 2 models can be classified as follows, depending on the data configuration above:

C1.: PS and OR models both correctly specified
C2.: PS model correctly specified, but OR model misspecified
C3.: PS model misspecified, but OR model correctly specified

The true OR model in (C2) and PS model in (C3) are nonlinear (in the linear or logistic scale) in the regressors X₁,…, X_p used. See the supplemental material in Tan (2020b) for boxplots of X_j within {T = 1} and {T = 0} and scatterplots of Y against X_j within {T = 1} for j = 1,…, 4. Partly because $X_{j}^{†}$ is a monotone function of X_j, the misspecified OR model in (C1) or PS model in (C2) appears to be difficult to detect by standard model diagnosis.

For n = 800 and p = 200 or 1000, Table 1 summarizes the results for estimation of μ¹, based on 1000 repeated simulations. The methods RML and RCAL perform similarly to each other in terms of bias, variance, and coverage in the cases (C1) and (C3); however, RCAL leads to noticeably smaller absolute biases and better coverage than RML in (C2), when there are correct PS and misspecified OR models. The post-Lasso refitting method, RML2, yields coverage proportions closer to the nominal probabilities than does RCAL, but consistently and, in (C2), substantially higher variances and wider CIs. These properties can also be confirmed from the QQ plots of the estimates and t-statistics in the supplemental material of Tan (2020b).

Table 1

Summary of Results With Linear Outcome Models.

Empirical Application

Note: The material in this section is adapted from Tan Z. Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika. 2020;107;137-158; and Tan Z. Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Ann Statist. 2020;48:811-837.

We provide an empirical application of the proposed methods to a medical study in Connors et al (1996) on the effects of right heart catheterization.

The observational study of Connors et al (1996) was of interest at the time when many physicians believed that the procedure led to better patient outcomes, but the benefit had not been demonstrated in any randomized clinical trials. The study included n = 5735 critically ill patients who were admitted to 5 medical centers. For each patient, the data consist of treatment status T, defined as 1 if the procedure was used within 24 hours of admission and 0 otherwise; health outcome Y, defined as survival time up to 30 days; and a list of 75 covariates X specified by medical specialists in critical care. For previous analyses using PSs, logistic regression was employed either with main effects only (Hirano & Imbens 2002; Vermeulen & Vansteelandt 2015) or with interaction terms manually added (Tan 2006a) in the approach of Rosenbaum and Rubin (1984).

To explore dependency beyond the main effects of the covariates, we consider a logistic PS model (8) and a logistic OR model (26) for 30-day survival status 1{Y > 30} as a binary outcome, with the regressor vector f (X) including all main effects and 2-way interactions of X except those with the fractions of nonzero values less than 46 (ie, 0.8% of the sample size 5735). The dimension of f (X) is p = 1855, excluding the constant. All variables in f (X) are standardized with sample means 0 and variances 1.

For estimating PSs, we apply regularized calibrated estimation and regularized maximum likelihood. For each method, the Lasso tuning parameter λ is determined using 5-fold cross-validation based on the corresponding loss function. Possible values of λ are searched in the set ${λ^{*} / 2^{\frac{j}{4}} : j = 0, 1, \dots, 24}$ , where λ^* is the value leading to a 0 solution.

To measure the effect of calibration in the treated sample for a function h(X) using a PS estimate $\hat{π}$ , we use the standardized calibration difference $C A L^{1} (\hat{π}; h) = [{\hat{μ}}_{r I P W}^{1} (\hat{π}; h) - \tilde{E} {h (X)}] / {\tilde{v a r}}^{\frac{1}{2}} {h (X)}$ , where $\tilde{E} ()$ and $\tilde{var}$ denote the sample mean and variance in the overall sample, respectively (including treated and untreated subjects), and ${\hat{μ}}_{r I P W}^{1} (\hat{π}; h)$ is defined as ${\hat{μ}}_{r I P W}^{1} (\hat{π})$ with Y replaced by h(X). For f_j(X) standardized with sample mean 0 and sample variance 1, ${CAL}^{1} (\hat{π}; h)$ reduces to ${\hat{μ}}_{r I P W}^{1} (\hat{π}; f_{j})$ . See, for example, Austin and Stuart (2015) for a related statistic for balance checking.

Figure 1 presents the standardized calibration differences for all the variables f_j(X) and the fitted PSs in the treated sample. From Table 1, the maximum absolute standardized differences are reduced from 35% to about 10% based on the estimators ${\hat{π}}_{R M L}$ and ${\hat{π}}_{R C A L}^{1}$ . However, the proposed estimator ${\hat{π}}_{R C A L}^{1}$ is obtained with a much smaller number, 32 vs 188, of nonzero estimates of coefficients γ_j in the PS model. The corresponding standardized differences for these 32 nonzero coefficients precisely attain the maximum absolute value, 0.102. The fitted PSs ${\hat{π}}_{R C A L}^{1} (X_{i})$ in the treated are consistently larger or smaller than ${\hat{π}}_{R M L} (X_{i})$ when close to 0 or, respectively, 1. As a result, the inverse probability weights $1 / {\hat{π}}_{R C A L}^{1} (X_{i})$ tend to be less variable than $1 / {\hat{π}}_{R M L} (X_{i})$ , which is also confirmed by Tan (2020a, Figure 5).

Figure 1

Standardized Calibration Differences CAL1(π^;fj) Plotted Against Index j for the Estimators (a) π^=E(T), (b) π^RML, and (c) π^RCAL1 With λ Selected by Cross-Validation.

Next, we apply the estimators ${\hat{μ}}^{1} ({\hat{m}}_{R W L}^{1}, {\hat{π}}_{R C A L}^{1})$ and ${\hat{μ}}^{0} ({\hat{m}}_{R W L}^{0}, {\hat{π}}_{R C A L}^{0})$ using regularized calibrated estimation and the corresponding estimators such as ${\hat{μ}}^{1} ({\hat{m}}_{R M L}^{1}, {\hat{π}}_{R M L})$ using regularized maximum likelihood estimation. A 5-fold cross-validation is also used to select the Lasso tuning parameter for the regularized estimators ${\hat{α}}_{R W L}^{1}$ and ${\hat{α}}_{R M L}^{1}$ . We also compute the (ratio) IPW estimators, such as ${\hat{μ}}_{r I P W}^{1}$ , along with nominal SEs obtained by ignoring the data dependency of the fitted PSs.

Table 2, reproduced from Tan (2020b), shows various estimates of survival probabilities and ATE. The IPW estimates from RCAL estimation of PSs have noticeably smaller nominal SEs than does RML estimation, for example, with the relative efficiency (0.026/0.023)² = 1.28 for estimation of μ¹. This improvement can also be seen from Figure S7 in the supplemental material of Tan (2020b), where the RCAL inverse probability weights are much less variable than are RML weights.

Table 2

Estimates of 30-Day Survival Probabilities and ATE.

The augmented IPW estimates and CIs are similar to each other from RCAL and RML estimation. The estimate of μ¹ from RML with post-Lasso refitting appears problematic, with a large SE. However, the validity of RML CIs depends on both PS and OR models being correctly specified, whereas that of RCAL CIs holds even when the OR model is misspecified. While an assessment of this difference is difficult with real data, Figure S7 in the supplemental material of Tan (2020b) shows that the sample influence functions for ATE using RCAL estimation appear to be more normally distributed, especially in the tails, than with RML estimation.

Finally, the augmented IPW estimates of the ATE are smaller in absolute values, and also with smaller SEs, than previous estimates based on main-effect models: about −0.060 ± 2 × 0.015 (Vermeulen & Vansteelandt 2015). The reduction in SEs might be explained by the well-known property that an augmented IPW estimator has a smaller asymptotic variance when obtained using a larger (correct) PS model.

Instrumental Variables

Framework

As mentioned in the “Setup” section, one of the assumptions needed to achieve point identification of ATEs is unconfoundedness, that is, all confounding variables related to both treatment status and potential outcomes should be measured and included as covariates in the analysis. IV methods are useful for CER in the presence of unmeasured confounding where some covariates underlying selection bias are unmeasured.

The conventional IV method (eg, Wooldridge 2002) has been widely used in econometrics since the publication of Wright (1928). This method deals with estimation of ATEs for continuous outcomes but implicitly assumes that individual treatment effects are homogeneous, independent of the treatment and instrument given the covariates (Heckman 1997). Recently, while handling continuous or discrete outcomes and allowing heterogeneous treatment effects, a rigorous IV framework was formulated in terms of potential treatment status and potential outcomes (Robins 1994; Angrist et al. 1996).

We describe the IV framework at the population level. Let Z be an instrument, D a treatment (in place of T in the “Propensity Scores and Unconfounded Estimation” section), Y an outcome of interest, and X a vector of preinstrument, pretreatment covariates. For simplicity, assume that both Z and D are dichotomous, taking value 0 or 1. For z ∈ {0, 1} and d ∈ {0, 1}, let D^z be the potential treatment status that would be observed if Z were set to z, and let Y^zd be the potential outcome that would be observed if Z were set to z and D were set to d. By consistency, we assume that D = D^z if Z = z, and Y = Y^zd if Z = z and D = d. The basic assumptions for a valid IV are as follows:

IV.1.: Instrumentation: the instrument Z is associated with the treatment D;
IV.2.: Exclusion restriction: Y^zd = Y^z′d, denoted as Y^d, for z ≠ z′ and d = 0, 1;
IV.3.: IV unconfoundedness: Z and (D^z, Y^zd) are conditionally independent given X for z ∈ {0, 1} and d ∈ {0, 1};
IV.4.: IV overlap: 0 < π^*(X) < 1, where π^*(X) = P(Z = 1│X) is called the instrument PS (Tan 2006b).

By the first 2 assumptions, an IV affects treatment status but has no direct effect on potential outcomes. By the last 2 assumptions, an IV serves as an experimental handle similarly as in the unconfoundedness and overlaps assumptions in the “Propensity Scores and Unconfounded Estimation” section. In particular, the instrument is allowed to be associated with the covariates, so that the groups with different instrumental levels (ie, different instrument groups) may differ in their distributions of the covariates. There has been increasing research on how to evaluate and conduct IV analysis in such situations (eg, Brookhart et al. 2007; Baiocchi et al. 2012).

The basic IV assumptions stated above, in general, do not yield point identification of ATEs or ATTs, although bounds can be obtained (Manski 1990; Balke & Pearl 1997). Additional assumptions can be imposed to identify specific treatment parameters. In the case of dichotomous D and Z, each subject can be classified into 1 of the 4 groups, depending on the values of (D⁰, D¹): compliers (D⁰ = 0 and D¹ = 1), always-takers (D⁰ = D¹ = 1), never-takers (D⁰ = D¹ = 0), and defiers (D⁰ = 1 and D¹ = 0). We focus on the approach based on the monotonicity assumption (Angrist et al. 1996):

IV.5.: Monotonicity: there exists no defier; that is, D⁰ ≤ D¹, in the population.

Under this assumption (in addition to the basic IV assumptions), the local ATE (LATE), also called the complier ATE, is defined as LATE (x) = E(Y¹ − Y⁰ |D⁰ < D¹, X = x) and can be identified by

\frac{E (Y | Z = 1, X = x) - E (Y | Z = 0, X = x)}{E (D | Z = 1, X = x) - E (D | Z = 0, X = x)} .

For high-dimensional covariates X, LATE(x) is difficult to interpret, depending on all covariates. Moreover, estimation of LATE(x) can be sensitive to modeling assumptions on the conditional expectations above. Hence, it is of interest to consider the population LATE (or in short LATE), defined as LATE = E(Y¹ − Y⁰ |D⁰ < D¹). As shown by Tan (2006b) and Frolich (2007), the LATE can be identified in 2 distinct ways:

L A T E = \frac{E {E (Y | Z = 1, X) - E (Y | Z = 0, X)}}{E {E (D | Z = 1, X) - E (D | Z = 0, X)}},

(29)

depending on the regression functions E(Y│Z = z, X) and E(D│Z = z, X) for z ∈ {0, 1}, or

LATE = \frac{E {\frac{Z}{π^{*} (X)} Y} - E {\frac{1 - Z}{1 - π^{*} (X)} Y}}{E {\frac{Z}{π^{*} (X)} D} - E {\frac{1 - Z}{1 - π^{*} (X)} D}},

(30)

depending on the instrument PS π^*(X) = P(Z = 1|X). Both (29) and (30) are in the form of a ratio of the difference in outcome Y over that in treatment D.

A further identification to be exploited in our approach is that the individual expectations θ_d = E(Y^d |D⁰ < D¹) for d ∈ {0, 1}, not just the difference LATE = θ₁ − θ₀, can also be identified. In fact, θ₁ is identified as

θ_{1} = \frac{E {E (D Y | Z = 1, X) - E (D Y | Z = 0, X)}}{E {E (D | Z = 1, X) - E (D | Z = 0, X)}},

(31)

or equivalently as

θ_{1} = \frac{E {\frac{Z}{π^{*} (X)} D Y} - E {\frac{1 - Z}{1 - π^{*} (X)} D Y}}{E {\frac{Z}{π^{*} (X)} D} - E {\frac{1 - Z}{1 - π^{*} (X)} D}} .

(32)

Similarly, θ₀ is identified as (31) or (32) with D replaced by 1 − D. The difference of the corresponding identification equations for θ₁ and θ₀ leads back to (29) or (30). As shown in Tan (2006b), both (31) and (32) can be derived from the following expression of θ₁:

θ_{1} = \frac{E (D^{1} Y^{1}) - E (D^{0} Y^{1})}{E (D^{1}) - E (D^{0})},

(33)

which is a ratio of 2 differences, depending on potential outcomes and treatments. Because Z is an experimental handle with (D, Y) as “outcomes” under Assumption (IV.3) (instrument unconfoundedness), each expectation in the numerator and denominator of (33) can be identified through OR averaging or inverse probability weighting, so that (31) or (32) is obtained. These results are parallel to related identification results under the assumption of treatment unconfoundedness (Tan 2007).

Existing Estimators

Suppose that {O_i = (Y_i, D_i, Z_i, X_i): i = 1, …, n} are independent and identically distributed observations of O = (Y, D, Z, X), where Y is an outcome, D is a binary treatment, Z is a binary instrument, and X is a vector of measured covariates. For estimating (θ₁, θ₀) and LATE from sample data, additional modeling assumptions are required to estimate unknown functions in the identification equations (29) and (30) or (31) and (32). There are at least 2 distinct approaches, depending on models for the instrument PS π^*(x) = P(Z = 1│X = x) or treatment and OR functions $m_{z}^{*} (x) = P (D = 1 | Z = z, X = x)$ and $m_{d z}^{*} (x) = E (Y | D = d, Z = z, X = x)$ for d, z ∈ {0, 1} (Tan 2006b). For simplicity, estimation of θ₁ is discussed, whereas that of θ₀ can be similarly handled. Throughout, $\tilde{E} ()$ denotes a sample average such that $\tilde{E} {b (O)} = n^{- 1} \sum_{i = 1}^{n} b (O_{i})$ for a function b(O).

First, consider an instrument PS model,

P (Z = 1 | X = x) = π (x; γ) = Π {γ^{T} f (x)},

(34)

where Π(∙) is an inverse link function, f(x) = {1, f₁(x),…,f_p(x)}^T is a vector of known functions, and γ = (γ₀, γ₁,…, γ_p)^T is a vector of unknown parameters. For concreteness, assume that logistic regression is used such that π(x; y) = [1 + exp{−γ^Tf(x)}]⁻¹. By (32), the IPW estimator of θ₁ is

{\hat{θ}}_{1, I P W} (\hat{π}) = \frac{\tilde{E} {\frac{Z}{\hat{π} (X)} D Y} - \tilde{E} {\frac{1 - Z}{1 - \hat{π} (X)} D Y}}{\tilde{E} {\frac{Z}{\hat{π} (X)} D} - \tilde{E} {\frac{1 - Z}{1 - \hat{π} (X)} D}} .

where $\hat{π} (X) = π (X; \hat{γ})$ is a fitted instrument PS. For low-dimensional X, $\hat{γ}$ is customarily the maximum likelihood estimator of γ. In high-dimensional settings, $\hat{γ}$ can be a Lasso-penalized maximum likelihood estimator ${\hat{γ}}_{R M L}$ .

Alternatively, for z ∈ {0, 1}, consider treatment and OR models, which can both be called “outcome regression” with (D, Y) as “outcomes”:

P (D = 1 | Z = z, X = x) = m_{z} (x; α_{z}) = ψ_{D} {α_{z}^{T} g (x)},

(35)

E (Y | D = 1, Z = z, X = x) = m_{1 z} (x; α_{1 z}) = ψ_{Y} {α_{1 z}^{T} h (x)},

(36)

where ψ_D(∙) and ψ_Y(∙) are inverse link functions, $g (x) = {1, g_{1} (x), \dots, g_{q_{1}} (x)}^{T}$ and $h (x) = {1, h_{1} (x), \dots, h_{q_{2}} (x)}^{T}$ are 2 vectors of known functions, and α_z and α_1z are 2 vectors of unknown parameters of dimensions 1 + q₁ and 1 + q₂ respectively. By (31), the OR-based estimator of θ₁ is

{\hat{θ}}_{1, O R} ({\hat{m}}_{#}, {\hat{m}}_{1 #}) = \frac{\tilde{E} {{\hat{m}}_{11} (X) {\hat{m}}_{1} (X)} - \tilde{E} {{\hat{m}}_{10} (X) {\hat{m}}_{0} (X)}}{\tilde{E} {{\hat{m}}_{1} (X)} - \tilde{E} {{\hat{m}}_{0} (X)}},

where ${\hat{m}}_{#} = ({\hat{m}}_{1}, {\hat{m}}_{0})$ , ${\hat{m}}_{1 #} = ({\hat{m}}_{11}, {\hat{m}}_{10})$ , and, for z ∊ {0,1}, ${\hat{m}}_{z} (X) = m_{z} (X; {\hat{α}}_{z})$ is a fitted treatment regression function and ${\hat{m}}_{1 z} (X) = m_{1 z} (X; {\hat{α}}_{1 z})$ is a fitted OR function. For low-dimensional X, ${\hat{α}}_{z}$ and ${\hat{α}}_{1 z}$ are customarily maximum quasi-likelihood estimators of α_z and α_1z or their variants. In high-dimensional settings, ${\hat{α}}_{z}$ and ${\hat{α}}_{1 z}$ can be Lasso-penalized quasi-likelihood estimators, ${\hat{α}}_{z, R M L}$ and ${\hat{α}}_{1 z, R M L}$ .

The consistency of the estimator ${\hat{θ}}_{1, I P W} (\hat{π})$ relies on correct specification of model (34), whereas the consistency of ${\hat{θ}}_{1, O R} ({\hat{m}}_{#}, {\hat{m}}_{1 #})$ relies on correct specification of models (35) and (36). The weighting and regression approaches can be combined to obtain doubly robust estimators through augmented IPW estimation (Tan 2006b), in a similar manner as in the setting of treatment unconfoundedness (Robins et al. 1994; Tan 2007). The expectations E(D¹) and E(D⁰) in (33) can be estimated by $\tilde{E} {φ_{D_{1}} (O; \hat{π}, {\hat{m}}_{1})}$ and $\tilde{E} {φ_{D_{0}} (O; \hat{π}, {\hat{m}}_{0})}$ , respectively, where

φ_{D_{1}} (O; \hat{π}, {\hat{m}}_{1}) = \frac{Z}{\hat{π} (X)} D - {\frac{Z}{\hat{π} (X)} - 1} {\hat{m}}_{1} (X),

(37)

φ_{D_{0}} (O; \hat{π}, {\hat{m}}_{0}) = \frac{1 - Z}{1 - \hat{π} (X)} D - {\frac{1 - Z}{1 - \hat{π} (X)} - 1} {\hat{m}}_{0} (X) .

(38)

Similarly, the expectations E(D¹ Y¹) and E(D⁰ Y¹) in (33) can be estimated by $\tilde{E} {φ_{D_{1} Y_{11}} (O; \hat{π}, {\hat{m}}_{1}, {\hat{m}}_{11})}$ and $\tilde{E} {φ_{D_{0} Y_{10}} (O; \hat{π}, {\hat{m}}_{0}, {\hat{m}}_{10})}$ respectively, where

φ_{D_{1} Y_{11}} (O; \hat{π}, {\hat{m}}_{1}, {\hat{m}}_{11}) = \frac{Z}{\hat{π} (X)} D Y - {\frac{Z}{\hat{π} (X)} - 1} {\hat{m}}_{1} (X) {\hat{m}}_{11} (X),

(39)

φ_{D_{0} Y_{10}} (O; \hat{π}, {\hat{m}}_{0}, {\hat{m}}_{10}) = \frac{1 - Z}{1 - \hat{π} (X)} D Y - {\frac{1 - Z}{1 - \hat{π} (X)} - 1} {\hat{m}}_{0} (X) {\hat{m}}_{10} (X) .

(40)

By (33), the resulting doubly robust estimator of θ₁ is

{\hat{θ}}_{1} (\hat{π}, {\hat{m}}_{#}, {\hat{m}}_{1 #}) = \frac{\tilde{E} {φ_{D_{1} Y_{11}} (O; \hat{π}, {\hat{m}}_{1}, {\hat{m}}_{11})} - \tilde{E} {φ_{D_{0} Y_{10}} (O; \hat{π}, {\hat{m}}_{0}, {\hat{m}}_{10})}}{\tilde{E} {φ_{D_{1}} (O; \hat{π}, {\hat{m}}_{1})} - \tilde{E} {φ_{D_{0}} (O; \hat{π}, {\hat{m}}_{0})}},

(41)

where ${\hat{m}}_{#} = ({\hat{m}}_{1}, {\hat{m}}_{0})$ and ${\hat{m}}_{1 #} = ({\hat{m}}_{11}, {\hat{m}}_{10})$ . Consistency of ${\hat{θ}}_{1} (\hat{π}, {\hat{m}}_{#}, {\hat{m}}_{1 #})$ can be achieved if either model (34) or models (35) and (36) are correctly specified.

There is potentially another advantage of doubly robust estimators in high-dimensional settings. In this case, the estimator ${\hat{θ}}_{1, I P W} (\hat{π})$ or ${\hat{θ}}_{1, O R} ({\hat{m}}_{#}, {\hat{m}}_{1 #})$ in general converges at a lower rate than does $O_{p} (n^{- \frac{1}{2}})$ to the true value θ₁ under correctly specified model (34) or models (35) and (36), respectively. Denote ${\hat{π}}_{R M L} (X) = π (X; {\hat{γ}}_{R M L})$ , ${\hat{m}}_{z, R M L} (X) = m_{z} (X; {\hat{α}}_{z, R M L})$ , and ${\hat{m}}_{1 z, R M L} (X) = m_{1 z} (X; {\hat{α}}_{1 z, R M L})$ , obtained from Lasso-penalized likelihood estimation. By related results in Chernozhukov et al (2018, Section 5.2), it can be shown that if all models (34) and (35) and (36) are correctly specified, under suitable sparsity conditions, the estimator ${\hat{θ}}_{1, R M L} = {\hat{θ}}_{1} ({\hat{π}}_{R M L}, {\hat{m}}_{#, R M L}, {\hat{m}}_{1 #, R M L})$ converges to θ₁ at rate $O_{p} (n^{- \frac{1}{2}})$ and admits the asymptotic expansion

{\hat{θ}}_{1, R M L} = \frac{\tilde{E} {φ_{D_{1} Y_{11}} (O; π^{*}, m_{1}^{*}, m_{11}^{*})} - \tilde{E} {φ_{D_{0} Y_{10}} (O; π^{*}, m_{0}^{*}, m_{10}^{*})}}{\tilde{E} {φ_{D_{1}} (O; π^{*}, m_{1}^{*})} - \tilde{E} {φ_{D_{0}} (O; π^{*}, m_{0}^{*})}} + o_{p} (n^{- \frac{1}{2}}),

(42)

where π^*(X) = π(X; γ^*), $m_{z}^{*} (X) = m_{z} (X; α_{z}^{*})$ , and $m_{1 z}^{*} (X) = m_{1 z} (X; α_{1 z}^{*})$ , with $(γ^{*}, α_{z}^{*}, α_{1 z}^{*})$ the true values in models (34), (35), and (36). From this expansion, valid Wald CIs based on ${\hat{θ}}_{1, R M L}$ can be obtained for θ₁.

Model-Assisted Inference

We develop new methods and theory using IVs in high-dimensional settings to obtain model-assisted CIs for LATEs, which are valid if the instrument PS model is correctly specified, but the treatment and OR models may be misspecified. The material in this section is based on Sun and Tan (2020).

To focus on main ideas, we describe our new method for estimating θ₁. Estimation of θ₀ and LATE is discussed later in this section. Similarly as in the “Existing Estimators” section, consider logistic regression model (34) for estimating the instrument PS π^*(x) = P(Z = 1|X = x), and models (35) and (36) for estimating treatment and OR functions $m_{z}^{*} (x) = P (D = 1 | Z = z, X = x)$ and $m_{1 z}^{*} (x) = E (Y | D = 1, Z = z, X = x)$ , respectively, for z ∈ {0, 1}. For technical reasons, we require that the “regressor” vector f(x) in model (34) is a subvector of g(x) and h(x) in models (35) and (36). This condition can be satisfied possibly after enlarging models (35) and (36) to accommodate f(x).

A class of doubly robust estimators of θ₁ slightly more flexible than (41) is

{\hat{θ}}_{1} ({\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #}) = \frac{\tilde{E} {τ_{D Y_{1}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #})}}{\tilde{E} {τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})}},

(43)

where ${\hat{π}}_{#} = ({\hat{π}}_{1}, {\hat{π}}_{0})$ , with ${\hat{π}}_{1}$ and ${\hat{π}}_{0}$ 2 possibly different versions of fitted values for π^*, ${\hat{m}}_{#} = ({\hat{m}}_{1}, {\hat{m}}_{0})$ and ${\hat{m}}_{1 #} = ({\hat{m}}_{11}, {\hat{m}}_{10})$ with $({\hat{m}}_{z}, {\hat{m}}_{1 z})$ fitted values for $(m_{z}^{*}, m_{1 z}^{*})$ respectively, for z ∈ {0, 1}, and with $φ_{D_{z}}$ and $φ_{D_{z} Y_{1 z}}$ defined as (37) to (40),

τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#}) = φ_{D_{1}} (O; {\hat{π}}_{1}, {\hat{m}}_{1}) - φ_{D_{0}} (O; {\hat{π}}_{0}, {\hat{m}}_{0}),

τ_{D Y_{1}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #}) = φ_{D_{1} Y_{11}} (O; {\hat{π}}_{1}, {\hat{m}}_{1}, {\hat{m}}_{11}) - φ_{D_{0} Y_{10}} (O; {\hat{π}}_{0}, {\hat{m}}_{0}, {\hat{m}}_{10}) .

Our point estimator of θ₁ is ${\hat{θ}}_{1, R C A L} = {\hat{θ}}_{1} ({\hat{π}}_{#, R C A L}, {\hat{m}}_{#, R W L}, {\hat{m}}_{1 #, R W L})$ , where, for z ∈ {0, 1}, ${\hat{π}}_{z, R C A L} (X) = π (X; {\hat{γ}}_{z, R C A L})$ , ${\hat{m}}_{z, R W L} (X) = m_{z} (X; {\hat{α}}_{z, R W L})$ , and ${\hat{m}}_{1 z, R W L} (X) = m_{1 z} (X; {\hat{α}}_{1 z, R W L})$ are fitted values, and $({\hat{γ}}_{z, R C A L}, {\hat{α}}_{z, R W L}, {\hat{α}}_{1 z, R W L})$ are estimators of (γ, α_z, α_1z) defined as follows.

For logistic regression model (34), the estimator ${\hat{γ}}_{z, R C A L}$ is a regularized calibrated estimator of γ (Tan 2020a), defined as a minimizer of the Lasso-penalized objective function

ℓ_{z, R C A L} (γ) = ℓ_{z, C A L} (γ) + λ ∥ γ_{1 : p} ∥_{1},

(44)

with the calibration loss functions

\begin{array}{l} ℓ_{0, C A L} (γ) = \tilde{E} {(1 - Z) \exp {γ^{T} f (X)} - Z γ^{T} f (X)} \\ ℓ_{1, C A L} (γ) = \tilde{E} {Z \exp {- γ^{T} f (X)} + (1 - Z) γ^{T} f (X)} . \end{array}

For treatment regression model (35), ${\hat{α}}_{z, R W L}$ is a regularized weighted likelihood estimator of α_z, defined as a minimizer of the Lasso-penalized objective function

ℓ_{z, R W L} (α_{z}; {\hat{γ}}_{z, R C A L}) = ℓ_{z, W L} (α_{z}; {\hat{γ}}_{z, R C A L}) + λ ∥ {(α_{z})}_{1 : q_{1}} ∥_{1},

(45)

with the weighted (quasi-)likelihood loss function

ℓ_{z, W L} (α_{z}; {\hat{γ}}_{z}) = \hat{E} {1 {Z = z} w_{z} (X; {\hat{γ}}_{z}) [- D α_{z}^{T} g (X) + Ψ_{D} {α_{z}^{T} g (X)}]},

where the weight function is w_z(X; γ) = π(X; γ)/{1 − σ(X; γ)} or {1 − π(X; γ)}/π(X; γ) for z = 0 or 1, respectively, and $Ψ_{D} (u) = \int_{0}^{u} ψ_{D} (t) d t$ . For OR model (36), ${\hat{α}}_{1 z, R C A L}$ is a regularized calibrated estimator of α_1z, defined as a minimizer of the Lasso-penalized objective function

ℓ_{1 z, R W L} (α_{1 z}; {\hat{γ}}_{z, R C A L}, {\hat{α}}_{z, R W L}) = ℓ_{1 z, W L} (α_{1 z}; {\hat{γ}}_{z, R C A L}, {\hat{α}}_{z, R W L}) + λ ∥ {(α_{1 z})}_{1 : q_{2}} ∥_{1},

(46)

with the loss function

ℓ_{1 z, W L} (α_{1 z}; {\hat{γ}}_{z}, {\hat{α}}_{z}) = \hat{E} (1 {Z = z} w_{z} (X; {\hat{γ}}_{z}) [- D Y α_{1 z}^{T} h (X) + m_{z} (X; {\hat{α}}_{z}) Ψ_{Y} {α_{1 z}^{T} h (X)}]),

where the weight function w_z (X; γ) is the same as above, and $Ψ_{Y} (u) = \int_{0}^{u} ψ_{Y} (t) d t$ .

Compared with regularized likelihood estimation in the “Existing Estimators” section, our method involves using a different set of estimators $({\hat{γ}}_{z, R C A L}, {\hat{α}}_{z, R W L}, {\hat{α}}_{1 z, R W L})$ , which are called regularized calibrated estimators. Similarly as in Tan (2020b), these estimators are derived to allow model-assisted, asymptotic CIs for θ₁ based on ${\hat{θ}}_{1, R C A L}$ . Before discussing inference properties, we point out several interesting properties algebraically associated with our estimators.

First, by the KKT condition for minimizing (44), the fitted instrument PS ${\hat{π}}_{1, R C A L} (X)$ satisfies, similarly as in (14) and (15),

\frac{1}{n} \sum_{i = 1}^{n} \frac{Z}{{\hat{π}}_{1, R C A L} (X_{i})} = 1,

(47)

\frac{1}{n} | \sum_{i = 1}^{n} \frac{Z_{i} f_{j} (X_{i})}{{\hat{π}}_{1, R C A L} (X_{i})} - \sum_{i = 1}^{n} f_{j} (X_{i}) | \leq λ, j = 1, \dots, p .

(48)

These equations also hold with Z_i replaced by 1 − Z_i and ${\hat{π}}_{1, R C A L}$ replaced by $1 - {\hat{π}}_{0, R C A L}$ . Equation (47) shows that the IPWs, $1 / {\hat{π}}_{1, R C A L} (X_{i})$ with Z_i = 1, sum to the sample size n, whereas equation (48) implies that the weighted average of each covariate f_j(X_i) over the instrument group {Z_i = 1} may differ from the overall average of f_j(X_i) by no more than λ. Such differences are of interest in showing how a weighted instrument group resembles the overall sample. In contrast, similar results are not available when using the regularized likelihood estimator ${\hat{γ}}_{R M L}$ .

By the KKT condition associated with the intercept in α₁ for minimizing (45), the fitted treatment regression function ${\hat{m}}_{1, R W L} (X)$ satisfies, similarly as in (19),

\frac{1}{n} \sum_{i = 1}^{n} Z_{i} \frac{1 - {\hat{π}}_{1, R C A L} (X_{i})}{{\hat{π}}_{1, R C A L} (X_{i})} {D_{i} - {\hat{m}}_{1, R W L} (X_{i})} = 0.

(49)

A similar equation holds with Z_i replaced by 1 − Z_i and ${\hat{π}}_{1, R C A L}$ replaced by $1 - {\hat{π}}_{0, R C A L}$ , and ${\hat{m}}_{1, R W L}$ by ${\hat{m}}_{0, R W L}$ . As a result of (49), our augmented IPW estimator for E(D¹), defined as ${\hat{E}}_{R C A L} (D^{1}) = \tilde{E} {φ_{D_{1}} (O; {\hat{π}}_{1, R C A L}, {\hat{m}}_{1, R W L})}$ , can be simplified to

\tilde{E} [{\hat{m}}_{1, R W L} (X) + \frac{Z}{{\hat{π}}_{1, R C A L} (X)} {D - {\hat{m}}_{1, R W L} (X)}] = \tilde{E} {Z D + (1 - Z) {\hat{m}}_{1, R W L} (X)} .

Hence, ${\hat{E}}_{R C A L} (D^{1})$ always falls within the range of the binary treatment values {D_i: Z_i = 1, i = 1,…,n} and the predicted values ${{\hat{m}}_{1, R W L} (X_{i}) : T_{i} = 1, i = 1, \dots, n}$ , which are by definition in the interval [0,1]. This boundedness property is not satisfied by the usual estimator ${\hat{E}}_{R M L} (D^{1}) = \tilde{E} {φ_{D_{1}} (O; {\hat{π}}_{R M L}, {\hat{m}}_{1, R M L})}$ but is desirable for stabilizing the behavior of augmented IPW estimators, especially when used in the denominator of (43).

By the KKT condition associated with the intercept in α₁ for minimizing (46), the fitted functions ${\hat{m}}_{1, R W L} (X)$ and ${\hat{m}}_{11, R W L} (X)$ jointly satisfy

\frac{1}{n} \sum_{i = 1}^{n} Z_{i} \frac{1 - {\hat{π}}_{1, R C A L} (X_{i})}{{\hat{π}}_{1, R C A L} (X_{i})} {D_{i} Y_{i} - {\hat{m}}_{1, R W L} (X_{i}) {\hat{m}}_{11, R W L} (X_{i})} = 0.

(50)

A similar equation holds with Z_i replaced by 1 − Z_i, ${\hat{π}}_{1, R C A L}$ replaced by $1 - {\hat{π}}_{0, R C A L}$ , and $({\hat{m}}_{1, R W L}, {\hat{m}}_{11, R W L})$ replaced by $({\hat{m}}_{0, R W L}, {\hat{m}}_{10, R W L})$ . By (49), our augmented IPW estimator for E(D¹ Y¹), defined as ${\hat{E}}_{R C A L} (D^{1} Y^{1}) = \tilde{E} {φ_{D_{1} Y_{11}} (O; {\hat{π}}_{1, R C A L}, {\hat{m}}_{1, R W L}, {\hat{m}}_{11, R W L})}$ , can be simplified to

\tilde{E} [{({\hat{m}}_{1} {\hat{m}}_{11})}_{R W L} + \frac{Z}{{\hat{π}}_{1, R C A L} (X)} {D Y - {({\hat{m}}_{1} {\hat{m}}_{11})}_{R W L}}] = \tilde{E} {Z D Y + (1 - Z) {({\hat{m}}_{1} {\hat{m}}_{11})}_{R W L}},

where ${({\hat{m}}_{1} {\hat{m}}_{11})}_{R W L} = {\hat{m}}_{1, R W L} (X) {\hat{m}}_{11, R W L} (X)$ . As a consequence, ${\hat{E}}_{R C A L} (D^{1} Y^{1})$ always falls within the range of the observed values {D_iY_i: Z_i = 1, i = 1,…,n} and the predicted values ${{\hat{m}}_{1, R W L} (X_{i}) {\hat{m}}_{11, R W L} (X_{i}) : T_{i} = 1, i = 1, \dots, n}$ .

We provide a high-dimensional analysis of the proposed estimator ${\hat{θ}}_{1, R C A L}$ , provided that instrument PS model (34) is correctly specified but treatment and outcome models (35) and (36) may be misspecified. See Sun and Tan (2020, Section 3) for a detailed discussion. In contrast with the asymptotic expansion (42) for ${\hat{θ}}_{1, R M L}$ with all correctly specified models (34), (35), and (36), our main result shows that under suitable conditions, ${\hat{θ}}_{1, R C A L}$ is consistent for θ₁ and admits the asymptotic expansion

{\hat{θ}}_{1, R C A L} = \frac{\tilde{E} {τ_{D Y_{1}} (O; π_{#}^{*}, {\bar{m}}_{#}, {\bar{m}}_{1 #})}}{\tilde{E} {τ_{D} (O; π_{#}^{*}, {\bar{m}}_{#})}} + o_{p} (n^{- \frac{1}{2}}),

(51)

where $π_{#}^{*} = (π^{*}, π^{*})$ , ${\bar{m}}_{#} = ({\bar{m}}_{1}, {\bar{m}}_{0})$ , ${\bar{m}}_{1 #} = ({\bar{m}}_{11}, {\bar{m}}_{10})$ , and, for z ∈ {0,1}, ${\bar{m}}_{z} (X) = m_{z} (X; {\bar{α}}_{z})$ and ${\bar{m}}_{1 z} (X) = m_{z} (X; {\bar{α}}_{1 z})$ with $({\bar{α}}_{z}, {\bar{α}}_{1 z})$ the target values defined as the minimizers of the expected loss functions for $({\hat{α}}_{z, R W L}, {\hat{α}}_{1 z, R W L})$ . Then, Sun and Tan (2020, Proposition 1) show that if instrument PS model (34) is correctly specified but treatment and outcome models (35) and (36) may be misspecified, the following results hold under suitable sparsity conditions.

${\hat{θ}}_{1, R C A L}$ is consistent for θ₁ and asymptotically normally distributed:
$\sqrt{n} ({\hat{θ}}_{1, R C A L} - θ_{1}) \to_{D} N (0, V_{1})$
where $V_{1} = v a r {τ_{D Y_{1}} (O; π_{#}^{*}, {\bar{m}}_{#}, {\bar{m}}_{1 #}) - θ_{1} τ_{D} (O; π_{#}^{*}, {\bar{m}}_{#})} / E^{2} {τ_{D} (O; π_{#}^{*}, {\bar{m}}_{#})}$ .
A consistent estimator of V₁ is

${\hat{V}}_{1} = \frac{\tilde{E} [{τ_{D Y_{1}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #}) - θ_{1} τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})}^{2}]}{{\tilde{E}}^{2} {τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})} .},$

where $({\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #}) = ({\hat{π}}_{#, R C A L}, {\hat{m}}_{#, R W L}, {\hat{m}}_{1 #, R W L})$ .
An asymptotic (1 − c) CI for θ₁ is

${\hat{θ}}_{1, R C A L} \pm z_{\frac{c}{2}} \sqrt{\frac{{\hat{V}}_{1}}{n}},$

where z_(c/2) is the (1 − c/2) quantile of N(0,1).

Next, we describe how our method can be applied to estimate θ₀ and LATE, denoted as θ = θ₁ − θ₀. In addition to models (34), (35), and (36), consider the following OR model in the untreated population for z ∈ {0,1},

E (Y | D = 0, Z = z, X = x) = m_{0 z} (x; α_{0 z}) = ψ_{Y} {α_{0 z}^{T} h (x)},

where $h (x) = {1, h_{1} (x), \dots, h_{q_{2}} (x)}^{T}$ is a vector of known functions, and α_0z is a vector of unknown parameters of dimension 1 + q₂. For augmented IPW estimation of E{(1 − D⁰)Y⁰} and E{(1 − D¹)Y¹}, define

φ_{D_{0} Y_{00}} (O; \hat{π}, {\hat{m}}_{0}, {\hat{m}}_{00}) = \frac{1 - Z}{1 - \hat{π} (X)} (1 - D) Y - {\frac{1 - Z}{1 - \hat{π} (X)} - 1} {1 - {\hat{m}}_{0} (X)} {\hat{m}}_{00} (X),

φ_{D_{1} Y_{01}} (O; \hat{π}, {\hat{m}}_{0}, {\hat{m}}_{01}) = \frac{Z}{\hat{π} (X)} (1 - D) Y - {\frac{Z}{\hat{π} (X)} - 1} {1 - {\hat{m}}_{1} (X)} {\hat{m}}_{01} (X),

where ${\hat{m}}_{0 z} = m_{0 z} (X; {\hat{α}}_{0 z})$ is a fitted regression function. Then, a doubly robust estimator of θ₀, similar to that of θ₁ in (43) is

{\hat{θ}}_{0} ({\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{0 #}) = \frac{\tilde{E} {τ_{D Y_{0}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{0 #})}}{\tilde{E} {τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})}},

where ${\hat{m}}_{0 #} = ({\hat{m}}_{01}, {\hat{m}}_{00})$ , $τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})$ is as in (43), and

τ_{D Y_{0}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{0 #}) = φ_{D_{0} Y_{00}} (O; {\hat{π}}_{0}, {\hat{m}}_{0}, {\hat{m}}_{00}) - φ_{D_{0} Y_{01}} (O; {\hat{π}}_{1}, {\hat{m}}_{1}, {\hat{m}}_{01}) .

Our point estimator of θ₀ is ${\hat{θ}}_{0, R C A L} = {\hat{θ}}_{0} ({\hat{π}}_{#, R C A L}, {\hat{m}}_{#, R W L}, {\hat{m}}_{1 #, R W L})$ and that of LATE, θ = θ₁ − θ₀, is ${\hat{θ}}_{R C A L} = {\hat{θ}}_{1, R C A L} - {\hat{θ}}_{0, R C A L}$ , where ${\hat{π}}_{z, R C A L}$ and ${\hat{m}}_{z, R W L}$ remain the same as before, and ${\hat{m}}_{0 z, R W L} (X) = m_{0 z} (X; {\hat{α}}_{0 z, R W L}),$ , with ${\hat{α}}_{0 z, R W L}$ defined as follows: for z ∈ {0,1}, let ${\hat{α}}_{0 z, R C A L}$ be a minimizer of the Lasso-penalized objective function

ℓ_{0 z, R W L} (α_{0 z}; {\hat{γ}}_{z, R C A L}, {\hat{α}}_{z, R W L}) = ℓ_{0 z, W L} (α_{0 z}; {\hat{γ}}_{z, R C A L}, {\hat{α}}_{z, R W L}) + λ ∥ {(α_{0 z})}_{1 : q_{2}} ∥_{1},

with the loss function

ℓ_{0 z, W L} (α_{0 z}; {\hat{γ}}_{z}, {\hat{α}}_{z}) = \hat{E} (1 {Z = z} w_{z} (X; {\hat{γ}}_{z}) [- (1 - D) Y α_{0 z}^{T} h (X) + {1 - m_{z} (X; {\hat{α}}_{z})} Ψ_{Y} {α_{0 z}^{T} h (X)}]),

where the weight function w_z(X; γ) is the same as above. Under similar conditions as in the estimation of θ₁, the estimator ${\hat{θ}}_{0, R C A L}$ admits an asymptotic expansion in the form of (51) and Wald CIs for θ₀, and LATE can be derived accordingly. In particular, an asymptotic (1 − c) CI for LATE is ${\hat{θ}}_{R C A L} \pm z_{c / 2} \sqrt{\hat{V} / n}$ , where

\hat{V} = \frac{\tilde{E} [{τ_{D Y_{1}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #}) - τ_{D Y_{0}} (O; {\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{0 #}) - {\hat{θ}}_{R C A L} τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})}^{2}]}{{\tilde{E}}^{2} {τ_{D} (O; {\hat{π}}_{#}, {\hat{m}}_{#})}},

where $({\hat{π}}_{#}, {\hat{m}}_{#}, {\hat{m}}_{1 #}, {\hat{m}}_{0 #}) = ({\hat{π}}_{#, R C A L}, {\hat{m}}_{#, R W L}, {\hat{m}}_{1 #, R W L}, {\hat{m}}_{0 #, R W L})$ .

The proposed method for model-assisted inference using regularized calibrated estimation about LATEs is implemented as part of the publicly released R package RCAL (Tan & Sun 2020). In addition to a full reference manual, a vignette is also included in the package to give a direct and accessible introduction to the method.

Our work focuses on statistical methods and theory for estimation of LATEs in the modern IV framework as described in the “Framework” section. In practice, IV analysis needs to be conducted with care. First, it is often challenging to find valid IVs which can be justified to satisfy the IV assumptions IV.1 to IV.5. Moreover, the IV estimates of treatment effects may suffer large biases or SEs because of possible violations of the IV assumptions or weakness of the IVs in the association with the treatment. There has been extensive research which investigates the impact of weak IVs and related issues in the conventional IV method (Bound et al. 1995; Staiger & Stock 1997; Crown et al. 2011; Andrews et al. 2019), where 2 structural equations are postulated in relating the outcome to the treatment and covariates and then the treatment to the IVs and covariates (Wooldridge 2002). Further work is desired to study how the proposed methods are affected by weak IVs and possible violations of IV assumptions.

Simulation Studies

We present simulation studies to compare pointwise properties of ${\hat{θ}}_{1, R M L}$ based on regularized likelihood estimation without or with post-Lasso refitting and ${\hat{θ}}_{1, R C A L}$ based on regularized calibrated estimation and coverage properties of the associated CIs. The material in this section is based on Sun and Tan (2020, Section 4).

Let X = (X₁,…,X_p) be independent variables where each X_j is N(0,1) truncated to the interval (−2.5,2.5) and then standardized to have mean 0 and variance 1. Consider the transformed variables W₁ = exp(0.5 X₁), W₂ = 10 + {1 + exp(X₁)}⁻¹ X₂, W₃ = (0.04 X₁ X₃ + 0.6)³, and W₄ = (X₂ + X₄ + 20)². Let $X^{†} = (X_{1}^{†}, \dots, X_{p}^{†})$ , where $X_{j}^{†}$ is the standardized version of W_j to have mean 0 and variance 1 for j = 1,…,4, and $X_{j}^{†} = X_{j}$ for 5 ≤ j ≤ p. This setup follows that in the preprint Tan (2018) and ensures strict one-to-one mapping between X and X^†. See the supplemental material in Sun and Tan (2020) for scatterplots from a simulated data sample of the variables $(X_{1}^{†}, \dots, X_{4}^{†})$ , which are correlated with each other as would be found in real data. Consider the following data-generating configurations:

C1.

Generate Z given X from a Bernoulli distribution with

P (Z = 1 | X) = {1 + \exp (- X_{1}^{†} + 0.5 X_{2}^{†} - 0.25 X_{3}^{†} - 0.1 X_{4}^{†})}^{- 1} .

Then, independently, generate U from a standard logistic distribution,

D = 1 {1 - 2.5 Z + 0.25 X_{1}^{†} + X_{2}^{†} + 0.5 X_{3}^{†} - 1.5 X_{4}^{†} \geq U},

and Y¹ given X from a normal distribution with variance 1 and mean

E (Y^{1} |Z, X, U) = 0.5 X_{1}^{†} + X_{2}^{†} + X_{3}^{†} + X_{4}^{†} + 2 U .

C2.

Generate (Z, U) as in (C1), but generate

D = 1 {1 - 2.5 Z + 0.25 X_{1} + X_{2} + 0.5 X_{3} - 1.5 X_{4} \geq U},

and generate Y¹ given X from a normal distribution with variance 1 and mean

E (Y^{1} |Z, X, U) = 0.5 X_{1} + X_{2} + X_{3} + X_{4} + 2 U .

C3.

Generate Z given X from a Bernoulli distribution with

P (Z = 1 | X) = {1 + \exp (- X_{1} + 0.5 X_{2} - 0.25 X_{3} - 0.1 X_{4})}^{- 1},

and then generate (U, D, Y¹) as in (C1).

The observed data consist of independent and identically distributed copies {(Y_iD_i, D_i, Z_i, X_i): i = 1,…,n}. Consider the following model specifications:

M1): logistic instrument PS model (34), logistic treatment model (35), and linear outcome model (36) with $f_{j} (X) = g_{j} (X) = h_{j} (X) = X_{j}^{†}$ for j = 1,…,p.
M2): Logistic instrument PS model (34) and logistic treatment model (35) with $f_{j} (X) = g_{j} (X) = X_{j}^{†}$ for j = 1,…,p, and linear outcome model (36) with $h_{j} (X) = X_{j}^{†}$ for j = 1,…,p, 8 additional functions $(h_{p + 1} (X), \dots, h_{p + 8} (X))$ , 4 of which are linear spline basis functions of the fitted values ${\hat{m}}_{1} (X)$ , and 4 are linear spline basis functions of ${\hat{m}}_{0} (X)$ , where the knots are quantiles of the fitted values.

The instrument PS model is correct in configurations (C1) and (C2) but misspecified in configuration (C3). The treatment regression model is correct in configurations (C1) and (C3) but misspecified in (C2), whereas the OR model in either (M1) or (M2) is misspecified in all configurations (C1) to (C3), but it can be regarded as being “closer” to the truth in (C1) and (C3) than in (C2) due to using X^† instead of X as regressors. Therefore, the models in both (M1) and (M2) can be classified as follows in configurations (C1) to (C3):

C1): Instrument PS model was correctly specified, treatment and OR models were “more correctly” specified;
C2): Instrument PS model was correctly specified, treatment and OR models were “less correctly” specified; and
C3): Instrument PS model was misspecified, treatment and outcome models were “more correctly” specified.

Similarly as in Kang and Schafer (2007), for p = 4, the treatment and OR models in case (C2) and the instrument PS model in (C3) appear to be adequate by standard diagnosis techniques. See the supplemental material in Sun and Tan (2020) for scatterplots of Y against $X_{j}^{†}$ within {D = 1}, boxplots of $X_{j}^{†}$ within {D = 0} and {D = 1} as well as boxplots of $X_{j}^{†}$ within {Z = 0} and {Z = 1} for j = 1,…,4.

For n = 800 and p = 400 or 1000, Table 3, reproduced from Sun and Tan (2020), summarizes the results based on 1000 repeated simulations. The methods RCAL and RML perform similarly to each other in terms of absolute bias, variance, and coverage in (C1) and (C3), but RCAL yields noticeably smaller absolute biases and better coverage than do RML and RML2 in (C2). The post-Lasso refitting method RML2 appears to achieve coverages closer to the nominal probabilities in (C1) but yields substantially higher variances in all 3 cases (C1) to (C3). These properties can also be seen from the QQ plots of the estimates and t-statistics the supplemental material of Sun and Tan (2020). The performances of each of the 3 methods are similar with models (M1) or (M2) specified. Hence, in the settings studied, there is little benefit in adding the spline terms in the OR model.

Table 3

Summary of Results for Estimation of θ₁.

Empirical Application

The causal relationship between education and earnings has been of considerable interest in economics. Card (1995) proposed proximity to college as an instrument for completed education. The argument is that proximity to college could be taken as being randomized conditionally on observed covariates, and its influence on earnings could be only through that on the schooling decision. Consider the analytic sample in Card (1995) from the National Longitudinal Survey of Young men, which comprises 3010 men with valid education and wage responses in the 1976 interview. Similarly, as in Tan (2006b), we define the treatment as education after high school, that is, D = 1{years of schooling > 12}; the instrument Z is a binary indicator for proximity to a 4-year college; and the outcome Y is a surrogate outcome constructed for the log of hourly earnings at age 30 years. The raw vector of covariates X include a race indicator; indicators for 9 regions of residence and for residence in SMSA in 1966; mother's and father's years of schooling (momed and daded, respectively); indicators for missing values; indicators for living with both natural parents, with 1 natural parent and 1 stepparent, and with mother only at age 14 years; and the Knowledge of World of Work (kww) score in 1966 and a missing indicator. We use mean imputation for the missing values and standardize all continuous variables with sample mean 0 and variance 1.

We reanalyze the National Longitudinal Survey data to estimate the LATE of education beyond high school on log hourly earnings, using more-flexible, higher-dimensional models than previously allowed. We apply $({\hat{θ}}_{0, R C A L}, {\hat{θ}}_{1, R C A L})$ based on regularized calibrated (RCAL) estimation and $({\hat{θ}}_{0, R M L}, {\hat{θ}}_{1, R M L})$ based on regularized likelihood estimation (RML), as well as the post-Lasso variant (RML2). The specification for f(X) = g(X) consists of all the indicator variables mentioned above, momed, daded, linear spline bases in kww, as well as interactions between the spline terms with all the indicator variables. The vector h(X) augments f(X) and g(X) by adding linear spline terms for each fitted treatment regression ${\hat{m}}_{z} (X)$ , z ∈ {0,1}. We vary the model complexity by considering the number of knots in the set k ∈ {3,9,15}, with knots at the i/(k+1)-th quantiles for i = 1,…,k. The tuning parameter λ is determined using 5-fold cross-validation based on the corresponding penalized loss functions. As an anchor specification, we also consider main-effect models with f(X) = g(X) = (1,X^T)^T and $h (X) = {(1, X^{T}, {\hat{m}}_{0} (X), {\hat{m}}_{1} (X))}^{T}$ , where the nuisance parameters are estimated using nonpenalized likelihood or calibration estimation.

Table 4 reproduced from Sun and Tan (2020), shows the estimates of (θ₀, θ₁) and LATE of education beyond high school on log hourly earnings. Regularized estimation from RCAL, RML, and RML2 yields similar point estimates; the differences are small compared with the SEs. The RCAL and RML estimates have noticeably smaller SEs than do RML2 estimates. Interestingly, for splines with 15 knots, the LATE is estimated from RCAL with a 95% CI of 0.504 ± 0.498, which excludes 0, whereas those from RML and RML2 include 0.

Table 4

Estimates of the Effect of Education Beyond High School on Log of Earnings.

While the validity of CIs is difficult to assess using real data, Figure S9 in the supplemental material of Sun and Tan (2020) shows that the standardized sample influence functions for estimation of LATE. The curves from RCAL appear to be more normally distributed than are those from RML or RML2, especially in the tails. In addition, Figures S10 to S12 in the supplemental material of Sun and Tan (2020) present the standardized calibration differences for all the variables f_j(X), j = 1,…,p, similarly to Tan (2020a). Compared with RML and RML2, our method RCAL consistently yields smaller maximum absolute standardized differences and involves fewer nonzero estimates of γ_j in the instrument PS models.

Conclusions

We developed new statistical methods and theory for tackling 2 broad problems in CER. One is to estimate ATEs using PSs and doubly robust estimators under unconfoundedness. The other is to estimate LATEs using IVs in the presence of possible unmeasured confounding. By these methods, PS and regression models can be fitted with a possibly large number of regressors including main effects and interactions of the covariates, and CIs and hypothesis tests can be obtained about treatment effects in a numerically tractable and statistically principled manner. Our methods are implemented in the publicly released R package RCAL.

There are various interesting topics which warrant further investigation. For the current development, both the instrument and treatment are assumed to be binary. It is desirable to extend our methods to handle multivalued treatments and instruments. Moreover, while the current methods are concerned with causal inference in cross-sectional studies, medical studies frequently involve longitudinal and survival outcomes subject to censoring or dropout. It is also important to extend our methods to handle longitudinal and survival data.

References

Andrews, I., Stock, J.H., and Sun, L. (2019) “Weak instruments in instrumental variables regression: Theory and Practice,” Annual Review of Economics, 11, 727-53.
Angrist, J.D., Imbens, G.W., and Rubin, D.B. (1996) “Identification of causal effects using instrumental variables” (with discussion) , Journal of the American Statistical Association, 91, 444-472.
Austin, P.C. and Stuart, E.A. (2015) “Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies,” Statistics in Medicine, 34, 3661-3679. [PMC free article: PMC4626409] [PubMed: 26238958]
Balke, A., and Pearl, J. (1997) “Bounds on treatment effects from studies with imperfect compliance,” Journal of the American Statistical Association, 92, 1171-1176.
Baiocchi, M., Small, D.S., Yang, L., Polsky, D., and Groeneveld, P.W. (2012) “Near/far matching: a study design approach to instrumental variables,” Health Services and Outcomes Research Methodology, 12, 237-253. [PMC free article: PMC4831129] [PubMed: 27087781]
Belloni, A., Chernozhukov, V., and Hansen, C. (2014) “Inference on treatment effects after selection among high-dimensional controls,” Review of Economic Studies, 81, 608--650.
Bohning, D. and Lindsay, B.G. (1988) “Monotonicity of quadratic approximation algorithms,” Annals of the Institute of Statistical Mathematics, 40, 641-663.
Bound, J., Jaeger, D.A., and Bakar, R.M. (1995) “Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak,” Journal of the American Statistical Association, 90, 443-50.
Bradic, J., Chernozhukov, V., Newey, W. K., and Zhu, Y. (2021) “Minimax semiparametric learning with approximate sparsity,” arXiv:1912.12213v3.
Brookhart, M.A., Rassen, J.A., Wang, P.S., Dormuth, C., Mogun, H., and Schneeweiss, S. (2007) “Evaluating the validity of an instrumental variable study of neuroleptics,” Medical Care, 45, S116-S122. [PubMed: 17909369]
Buhlmann, P. and Hothorn, T. (2007) “Boosting algorithms: regularization, prediction and model fitting” (with discussion), Statistical Science, 22, 477-505.
Buhlmann, P. and van de Geer, S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer.
Card, D. (1995) “Using Geographic Variation in College Proximity to Estimate the Return to Schooling,” in Aspects of Labor Market Behavior: Essays in Honor of John Vanderkamp, eds. L. N. Christophides, E. K. Grant, and R. Swidinsky. University of Toronto Press, pp. 201-222.
Chan, K.C.G., Yam, S.C.P., and Zhang, Z. (2016) “Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting,” Journal of the Royal Statistical Society, Ser. B, 78, 673-700. [PMC free article: PMC4915747] [PubMed: 27346982]
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J.M. (2018) “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21, C1-C68.
Connors, A. F., Speroff, T., Dawson, N. V., et al. (1996) “The Effectiveness of Right Heart Catheterization in the Initial Care of Critically Ill Patients,” Journal of the American Medical Association, 276, 889-897. [PubMed: 8782638]
Crown, W.H., Henk, H.J., and Vanness, D.J. (2011) “Some cautions on the use of instrumental variables estimators in outcomes research: How bias in instrumental variables estimators is affected by instrument strength, instrument contamination, and sample size,” Value in Health, 14, 1078-1084. [PubMed: 22152177]
Dukes, O. and Vansteelandt, S. (2021) “Inference for treatment effect parameters in potentially misspecified high-dimensional models,” Biometrika, 2, 321-334.
Farrell, M.H. (2015) “Robust inference on average treatment effects with possibly more covariates than observations,” Journal of Econometrics, 189, 1-23.
Folsom, R.E. (1991) “Exponential and logistic weight adjustments for sampling and nonresponse error reduction,” Proceedings of the American Statistical Association, Social Statistics Section, 197-202.
Friedman, J., Hastie, T., and Tibshirani, R. (2000) “Additive logistic regression: A statistical view of boosting” (with discussion), Annals of Statistics, 28, 337-407.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) “Regularization paths for generalized linear models via coordinate descent,” Journal of Statistical Software, 33, 1-22. [PMC free article: PMC2929880] [PubMed: 20808728]
Frolich, M. (2007) “Nonparametric IV estimation of local average treatment effects with covariates,” Journal of Econometrics, 139, 35-75.
Gerhard, T., Huybrechts, K., Olfson, M., et al. (2014) “Comparative mortality risks of antipsychotic medications in community dwelling older adults,” British Journal of Psychiatry, 205, 44-51. [PubMed: 23929443]
Ghosh, S. and Tan, Z. (2020) “Doubly robust semiparametric inference using regularized calibrated estimation with high-dimensional data,” arXiv:2009.12033.
Graham, B.S., de Xavier Pinto, C.C., and Egel, D. (2012) “Inverse probability tilting for moment condition models with missing data,” Review of Economic Studies, 79, 1053-1079.
Hainmueller, J. (2012) “Entropy balancing for causal effects: Multivariate reweighting method to produce balanced samples in observational studies, Political Analysis, 20, 25-46.
Hastie, T., Tibshirani, R., and Friedman, J. (2009) The Elements of Statistical Learning (2nd edition), Springer.
Hastie, T., Tibshirani, R., and Wainwright, M. (2015) Statistical Learning with Sparsity: The Lasso and Generalizations, Chapman & Hall.
Heckman, J.J. (1997) “Instrumental variables: A study of implicit behavioral assumptions used in making program evaluations,” Journal of Human Resources, 32, 441-462.
Hernan, M.A. and Robins, J. M. (2006) “Instruments for causal inference: An epidemiologist's dream?” Epidemiology, 360-372. [PubMed: 16755261]
Hirano, K., and Imbens, G.W. (2002) “Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization,” Health Services and Outcomes Research Methodology, 2, 259-278.
Hirshberg, D.A. and S. Wager (2021) “Augmented Minimax Linear Estimation,” Annals of Statistics, to appear.
Holland, P.W. (1986) “Statistics and causal inference” (with discussion), Journal of the American Statistical Association, 81, 945-970.
Huybrechts, K., Gerhard, T., Franklin, J.M., Levin, R., Crystal, S., and Schneeweiss, S. (2014) “Instrumental variable applications using nursing home prescribing preferences in comparative effectiveness research,” Pharmacoepidemiology and Drug Safety, 23, 830-838. [PMC free article: PMC4116440] [PubMed: 24664805]
Imai, K. and Ratkovic, M. (2014) “Covariate balancing propensity score,” Journal of the Royal Statistical Society, Ser. B, 76, 243-263.
Javanmard, A. and Montanari, A. (2014) “Confidence intervals and hypothesis testing for high-dimensional regression,” Journal of Machine Learning Research, 15, 2869-2909.
Kang, J.D.Y. and Schafer, J.L. (2007) “Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data” (with discussion), Statistical Science, 523-539. [PMC free article: PMC2397555] [PubMed: 18516239]
Kim, J.K. and Haziza, D. (2014) “Doubly robust inference with missing data in survey sampling,” Statistica Sinica, 24, 375-394.
Lee, B.K., Lessler, J., Stuart, E.A. (2010) “Improving propensity score weighting using machine learning,” Statistics in Medicine, 29, 337-346. [PMC free article: PMC2807890] [PubMed: 19960510]
Manski, C.F. (1990) “Nonparametric bounds on treatment effects,” American Economic Review, 80, 319-323.
McCaffrey, D.F., Ridgeway, G., and Morral, A.R. (2004) “Propensity score estimation with boosted regression for evaluating causal effects in observational studies,” Psychological Methods, 9, 403-425. [PubMed: 15598095]
McCullagh, P. and Nelder, J. (1989) Generalized Linear Models (2nd edition), Chapman & Hall.
Morgan, S.L. and Winship, C. (2014) Counterfactuals and Causal Inference: Methods and Principles for Social Research (second ed.), Cambridge University Press.
Neyman, J. (1923) “On the application of probability theory to agricultural experiments: Essay on principles, Section 9,” translated in Statistical Science, 1990, 5, 465-480.
Ning, Y., Peng, S., and Imai, K. (2020) “Robust estimation of causal effects via a high-dimensional covariate balancing propensity score,” Biometrika, 107, 533-554.
Osborne, M., Presnell, B., and Turlach, B. (2000) “A new approach to variable selection in least squares problems,” IMA Journal of Numerical Analysis, 20, 389-404.
Patient-Centered Outcomes Research Institute (PCORI) Methodology Committee (2021) The PCORI Methodology Report. PCORI.
Robins, J.M. (1994) “Correcting for non-compliance in randomized trials using structural nested mean models,” Communications in Statistics, 23, 2379-2412.
Robins, J.M., Rotnitzky, A., and Zhao, L.P. (1994) “Estimation of regression coefficients when some regressors are not always observed,” Journal of the American Statistical Association, 89, 846-866.
Robins, J.M., Li, L., Mukherjee, R., Tchetgen, E.T., and van der Vaart, A. (2017) “Minimax estimation of a functional on a structured high-dimensional model,” Annals of Statistics, 45, 1951-1987. [PMC free article: PMC6453538] [PubMed: 30971851]
Rosenbaum, P.R. and Rubin, D.B. (1983) “The central role of the propensity score in observational studies for causal effects,” Biometrika, 70, 41-55.
Rosenbaum, P.R. and Rubin, D.B. (1984) “Reducing bias in observational studies using subclassification on the propensity score,” Journal of the American Statistical Association, 79, 516-524.
Rubin, D.B. (1974) “Estimating causal effects of treatments in randomized and nonrandomized Studies,” Journal of Educational Psychology, 66, 688-701.
Rubin, D.B. (1976) Inference and missing data, Biometrika, 63, 581-590.
Sarndal, C.E., Swensson, B. and Wretman, J.H. (1992) Model Assisted Survey Sampling, Springer.
Schapire, R.E. and Freund, Y. (2012) Boosting: Foundations and Algorithms, MIT Press.
Setoguchi, S., Schneeweiss, S., Brookhart, M.A., Glynn, R.J., and Cook, E.F. (2008) “Evaluating uses of data mining techniques in propensity score estimation: a simulation study,” Pharmacoepidemiology and Drug Safety, 17, 546-555. [PMC free article: PMC2905676] [PubMed: 18311848]
Smucler, E., Rotnitzky, A., and Robins, J. M. (2019) “A unifying approach for doubly robust ℓ1 regularized estimation of causal contrasts,” arXiv:1904.03737.
Staiger D. and Stock J.H. (1997) “Instrumental variables regression with weak instruments,” Econometrica, 65, 557-86.
Stroup, T.S., Gerhard, T., Crystal, S., Huang, C., Tan, Z., Wall, M.M., Mathai, C., Olfson, M. (2019) Comparative effectiveness of adjunctive psychotropic medications in patients with schizophrenia, JAMA Psychiatry, 76, 508-515. [PMC free article: PMC6495353] [PubMed: 30785609]
Sun, B. and Tan, Z. (2020) “High-dimensional model-assisted inference for local average treatment effects with instrumental variables,” arXiv:2009.09286.
Tan, Z. (2006a) “A distributional approach for causal inference using propensity scores,” Journal of the American Statistical Association, 101, 1619-1637.
Tan, Z. (2006b) “Regression and weighting methods for causal inference using instrumental variables,” Journal of the American Statistical Association, 101, 1607-1618.
Tan, Z. (2007) “Comment: Understanding OR, PS and DR,” Statistical Science, 22, 560-568.
Tan, Z. (2010a) “Bounded, efficient, and doubly robust estimation with inverse weighting,” Biometrika, 97, 661-682.
Tan, Z. (2010b) “Marginal and nested structural models using instrumental variables,” Journal of the American Statistical Association, 105, 157-169.
Tan, Z. (2010c) “Nonparametric likelihood and doubly robust estimating equations for marginal and nested structural models,” Canadian Journal of Statistics, 38, 609-632.
Tan, Z. (2018) “Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data,” arXiv:1801.09817.
Tan, Z. (2020a) “Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data,” Biometrika, 107, 137-158.
Tan, Z. (2020b) “Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data,” Annals of Statistics, 48, 811-837.
Tan, Z. and Sun, B. (2020) RCAL: Regularized calibrated estimation, R package version 2.0, available at https://cran.r-project.org/web/packages/RCAL/index.html
Tibshirani, R. (1996) “Regression shrinkage and selection via the Lasso” Journal of the Royal Statistical Society, Ser. B, 58, 267-288.
Tsiatis, A.A. (2006) Semiparametric Theory and Missing Data, Springer.
van der Laan, M.J., Benkeser, D., and Cai, W. (2019) “Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive Lasso,” arXiv:1908.05607. [PMC free article: PMC10238856] [PubMed: 35851449]
van der Laan, M.J., Polley, E.C., and Hubbard, A.E. (2007) “Super learning,” Statistical Applications in Genetics and Molecular Biology, 6, Article 25. [PMC free article: PMC2473869] [PubMed: 17402922]
van der Laan, M.J. and Robins, J.M. (2003) Unified Methods for Censored Longitudinal Data and Causality, Springer.
van der Laan, M.J. and Rose, S. (2017) Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies, Springer.
van der Laan, M.J. and Rubin, D.B. (2006) “Targeted maximum likelihood learning,” International Journal of Biostatistics, 2, Article 11. van de Geer, S., Buhlmann, P., Ritov, Y., Dezeure, R. (2014) “On asymptotically optimal confidence regions and tests for high-dimensional models,” Annals of Statistics, 42, 1166-1202.
Vermeulen K. and Vansteelandt, S. (2015) “Bias-reduced doubly robust estimation,” Journal of the American Statistical Association, 110, 1024-1036.
Weitzen, S., Lapane, K.L., Toledano, A.Y., Hume, A.L., Mor, V. (2004) “Principles for modeling propensity scores in medical research: a systematic literature review,” Pharmacoepidemiology and Drug Safety, 13, 841-853. [PubMed: 15386709]
Westreich, D., Cole, S.R., Funk, M.J., Brookhart, M.A., and Stürmer, T. (2011) “The role of the c-statistic in variable selection for propensity score models,” Pharmacoepidemiology and Drug Safety, 20, 317-320. [PMC free article: PMC3081361] [PubMed: 21351315]
Winterstein, A.G., Gerhard, T., Kubilis, P., Linden, S., Shuster, J., Zito, J., Crystal, S., and Olfson, M. (2012) “Cardiovascular safety of central nervous system stimulants in children and adolescents,” BMJ, 345, e4627. [PMC free article: PMC3399772] [PubMed: 22809800]
Wooldridge, J.M. (2002) Econometric Analysis of Cross Section and Panel Data, MIT Press.
Wright, S. (1928) The Tariff on Animal and Vegetable Oils, MacMillan, the Appendix.
Wu, T.T. and Lange, K. (2010) “The MM alternative to EM,” Statistical Science, 25, 492-505.
Wyss, R., Ellis, A.R., Brookhart, M.A., Girman, C.J., Funk, M.J., LoCasale, R., and Stürmer, T. (2014) “The role of prediction modeling in propensity score estimation: An evaluation of logistic regression, bCART, and the covariate-balancing propensity score,” American Journal of Epidemiology, 645-655. [PMC free article: PMC4157700] [PubMed: 25143475]
Yang, T. and Tan, Z. (2018) “Backfitting algorithms for total-variation and empirical-norm penalized additive modeling with high-dimensional data,” Stat, 7, e198.
Zhang, C.-H. and Zhang, S.S. (2014) “Confidence intervals for low-dimensional parameters with high-dimensional data,” Journal of the Royal Statistical Society, Ser. B, 76, 217-242.

Related Publications

Ghosh S, Tan Z. Doubly robust semiparametric inference using regularized calibrated estimation with high-dimensional data. 2020. Accessed November 30, 2021. arXiv:2009.12033
Sun B, Tan Z. High-dimensional model-assisted inference for local average treatment effects with instrumental variables. 2020. Accessed November 30, 2021. arXiv:2009.09286
Tan Z. Analysis of odds, probability, and hazard ratios: From 2 by 2 tables to two-sample survival data. 2019. Accessed November 30, 2021. arXiv:1911.10682
Tan Z, Zhang C.-H. Doubly penalized estimation in additive regression with high-dimensional data. Ann Statist. 2019;47:2567-2600.
Tan Z. Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika. 2020;107:137-158.
Tan Z. Model-assisted inference for treatment effects using regularized calibrated estimation with high-dimensional data. Ann Statist. 2020;48:811-837.

Acknowledgment

Research reported in this report was funded through a Patient-Centered Outcomes Research Institute® (PCORI®) Award (ME-1511-32740). Further information available at: https://www.pcori.org/research-results/2016/developing-new-methods-causal-inference-observational-studies

Institution Receiving Award: Rutgers, The State University of New Jersey

Original Project Title: Improving Causal Inference Methods via Statistical Learning with High-dimensional Data

PCORI Award Number: ME-1511-32740

Suggested citation:

Tan Z, Gerhard T, Sun B. (2022). Developing and Testing New Methods for Estimating Treatment Effectiveness in Observational Studies Using High-Dimensional Data (PCORI). http://doi.org/10.25302/11.2021.ME.151132740

Disclaimer

The [views, statements, opinions] presented in this report are solely the responsibility of the author(s) and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute® (PCORI®), its Board of Governors or Methodology Committee.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License which permits noncommercial use and distribution provided the original author(s) and source are credited. (See https://creativecommons.org/licenses/by-nc-nd/4.0/

Bookshelf ID: NBK609505PMID: 39585945DOI: 10.25302/11.2021.ME.151132740

PubReader
Print View
Cite this Page
Tan Z, Gerhard T, Sun B. Developing and Testing New Methods for Estimating Treatment Effectiveness in Observational Studies Using High-Dimension Data [Internet]. Washington (DC): Patient-Centered Outcomes Research Institute (PCORI); 2021 Nov. doi: 10.25302/11.2021.ME.151132740
PDF version of this title (985K)

In this Page

Background
Patient and Stakeholder Engagement
Propensity Scores and Unconfounded Estimation
Instrumental Variables
Conclusions
References
Related Publications
Acknowledgment

Other titles in this collection

PCORI Final Research Reports

Related information

PMC
PubMed Central citations
PubMed
Links to PubMed

Recent Activity

Clear Turn Off Turn On

Developing and Testing New Methods for Estimating Treatment Effectiveness in Obs...
Developing and Testing New Methods for Estimating Treatment Effectiveness in Observational Studies Using High-Dimension Data

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

Bookshelf

Developing and Testing New Methods for Estimating Treatment Effectiveness in Observational Studies Using High-Dimension Data

Authors

Affiliations

Structured Abstract

Background:

Objectives:

Methods:

Results:

Conclusions:

Limitations:

Background

Causal Inference

Gaps in Existing Methods

Specific Aims

Significance

Changes in the Research Strategy

Patient and Stakeholder Engagement

Propensity Scores and Unconfounded Estimation

Setup

PSs and IPW Estimation

Model-Assisted and Doubly Robust Inferences

Simulation Studies

Table 1

Empirical Application

Figure 1

Table 2

Instrumental Variables

Framework

Existing Estimators

Model-Assisted Inference

Simulation Studies

Table 3

Empirical Application

Table 4

Conclusions

References

Related Publications

Acknowledgment

Suggested citation:

Disclaimer

Views

In this Page

Other titles in this collection

Related information

Recent Activity