### Learning A New Statistical Method: Bayesian Additive Regression Trees

I may do some work for Jennifer Hill, an applied statistics professor at NYU's Steinhardt School. Having a career like hers is something I'm very interested in doing if I go the PhD route, which is get her doctorate in Statistics, focus on applications to social science, and work on interesting causal inference problems.

This last weekend I read a paper she sent me on Bayesian Additive Regression Trees (BART), which is quite interesting. The article, Bayesian Nonparametric Modeling for Causal Inference is coming out this January in Journal of Computational and Graphical Statistics. The abstract:

Researchers have long struggled to identify causal effects in nonexperimental settings. Many recently proposed strategies assume ignorability of the treatment assignment mechanism and require fitting two models—one for the assignment mechanism and one for the response surface. This article proposes a strategy that instead focuses on very flexibly modeling just the response surface using a Bayesian nonparametric modeling procedure, Bayesian Additive Regression Trees (BART). BART has several advantages: it is far simpler to use than many recent competitors, requires less guesswork in model fitting, handles a large number of predictors, yields coherent uncertainty intervals, and fluidly handles continuous treatment variables and missing data for the outcome variable. BART also naturally identifies heterogeneous treatment effects. BART produces more accurate estimates of average treatment effects compared to propensity score matching, propensity-weighted estimators, and regression adjustment in the nonlinear simulation situations examined. Further, it is highly competitive in linear settings with the “correct” model, linear regression. Supplemental materials including code and data to replicate simulations and examples from the article as well as methods for population inference are available online.

(This is perhaps more for me, than any reader) Basically, when using some methods to improve causal inference, such as matching, you're often fitting two models: one on whether or not a unit was treated, and than the more easily (or commonly) understood "response surface," which is the model for the outcome conditional on treatment and all the confounders. BART is a method to estimate the response surface non-parametrically, while being (it appears) as or more robust than other methods.

When trying to figure out how effective a treatment of some kind is, you cannot observe the outcomes for when an individual both receives the treatment and does not receive the treatment . A fancy way of saying that is , where is an indicator of whether you have or have not gotten the treatment. So that equation is saying that if you got the treatment the second term on the right side of the equal sign is zero, and in the alternative case, the first term is zero.

When doing casual inference, you want to compare two groups, one that received the treatment and one that did not, that are as similar as possible. That is, the only difference in the comparison groups is that one got the treatment and the other didn't. In this way, you can be sure that any observed difference in the groups is due to the treatment. This idea is formalized through the term ignorability. That is, if the two groups cannot be distinguished on all the observable characteristics (they have "balance"), the assignment to the treatment group is ignorable. (More formally, the potential outcomes are independent of treatment assignment, given the covariates or , where X are confounders and means conditionally independence). Ignorability also requires overlap or common support in the covariates across the two groups.

So, in the end with ignorability, we're left to estimate the and . Unfortunately, this estimation can be very difficult if the treatment outcomes are not linearly related to the covariates, the distribution of the covariates are different across the two groups, or, as is often the case in a world with increasing data, there are tons of confounding covariates or (and this happens all the the time) you really don't know which of them are needed to satisfy ignorability. A bunch of methods have been proposed to address this estimation problem (see the paper for a ton of citations), but the BART method, as I mentioned earlier is different because it "focuses solely on precise estimation of the response surface." Also, part of BART's advantage is that it doesn't require as many researchers choices:

Nonparametric and semiparametric versions of these [other cited] methods are more robust but require a higher level of researcher sophistication to understand and implement (e.g., to specify smoothing parameters such as number of terms in a series estimator or bandwidth for a kernel estimator). This article proposes that the beneﬁts of the BART strategy in terms of simplicity, precision, robustness, and lack of required researcher interference outweigh the potential beneﬁt of having an estimator that is strictly consistent under certain sets of conditions.

I think I'll save a careful description of the trees themselves for a later post, even thought that is most of the paper. Basically, though, BART is a sum-of-trees model that uses a set of binary trees to split up the observations on the confounders. What's most fascinating, though, is that the parameters for BART are defined as a statistical model, with a prior put on the parameters, which is quite different than the other learning/mining models I've learned about. For those happy few who might be interested, BART is described in even greater detail in "BART: Bayesian additive regression trees." Abstract:

We develop a Bayesian “sum-of-trees” model where each tree is constrained by a regularization prior to be a weak learner, and fitting and inference are accomplished via an iterative Bayesian backfitting MCMC algorithm that generates samples from a posterior. Effectively, BART is a nonparametric Bayesian regression approach which uses dimensionally adaptive random basis elements. Motivated by ensemble methods in general, and boosting algorithms in particular, BART is defined by a statistical model: a prior and a likelihood. This approach enables full posterior inference including point and interval estimates of the unknown regression function as well as the marginal effects of potential predictors. By keeping track of predictor inclusion frequencies, BART can also be used for model-free variable selection. BART’s many features are illustrated with a bake-off against competing methods on 42 different data sets, with a simulation experiment and on a drug discovery classification problem.