A/B tests and multivariate experiments provide a principled way of analyzing whether a product change improves business metrics. However, such experiments often yield flat results where we see no statistically significant difference between the treatment and the control. Such an outcome is typically disappointing, as it often leaves us with no clear path towards improving the efficiency of the business.

However, experiment analysis need not come to an end after flat aggregate results. While traditional analysis methods only allow us to discover whether the change is better or worse for the entire population of users, causal modeling techniques enable us to extract valuable information about the variation in responses across different subpopulations.

Even in an experiment with flat results, a substantial subset of users may have a strong positive reaction. If we can correctly identify these users, we may be able to drive value for the business through improved personalization—i.e. by rolling out the new variant selectively to the subpopulations that had a positive reaction, as in Figure 1, below. In this way, such causal modeling approaches—also known as Heterogeneous Treatment Effect (HTE) models—can help to substantially increase the value of flat experimental results.

At DoorDash we were able to utilize these HTE models to improve our promotion personalization efforts by identifying which subpopulations had positive reactions to new treatments and sending the relevant promotions to those populations only. In order to explain how we successfully deployed this approach, we will first provide some technical background on how we built our HTE models with meta-learners, a concept described in the HTE paper referenced above, and then how we applied them to our consumer promotion model. We will then review the results of this approach and how it improved our campaign’s performance.

## How uplift modeling works

In order to determine which subset of users we should target with a specific treatment based on our A/B test results, we need to move beyond the Average Treatment Effect (ATE), which describes the effect of the treatment versus the control across the entire population. Instead, we need to look at the Conditional Average Treatment Effect (CATE), which measures the average treatment effect among users who share a given set of attributes.
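To make the distinction concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: a single binary attribute splits users into two segments whose true effects are +1 and -1. The aggregate result looks flat even though each segment responds strongly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical setup: one binary attribute splits users into two segments.
segment = rng.integers(0, 2, size=n)              # user attribute X
treated = rng.integers(0, 2, size=n)              # randomized assignment T
true_effect = np.where(segment == 1, 1.0, -1.0)   # +1 in segment 1, -1 in segment 0
outcome = 5.0 + true_effect * treated + rng.normal(0.0, 1.0, size=n)


def diff_in_means(y, t):
    """Mean treated outcome minus mean control outcome."""
    return y[t == 1].mean() - y[t == 0].mean()


# ATE: one number for the whole population.
ate = diff_in_means(outcome, treated)

# CATE: the same comparison, conditional on the attribute.
cate = {s: diff_in_means(outcome[segment == s], treated[segment == s]) for s in (0, 1)}

print(f"ATE: {ate:+.2f}")
print(f"CATE, segment 0: {cate[0]:+.2f}")
print(f"CATE, segment 1: {cate[1]:+.2f}")
```

With many continuous attributes, conditioning by simple slicing is no longer practical, which is where model-based estimators of the CATE come in.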

For HTE modeling, we estimate the CATE using *meta-learners*. Meta-learners are algorithms that combine existing supervised learning algorithms (called base-learners) in various ways to estimate causal effects. We can use any supervised learning algorithm as part of the overall approach for predicting the CATE.

For example, we can use one random forest to predict whether the user will click an email given the treatment condition, and another random forest to predict whether the user will click an email given the control condition. We can then estimate the conditional treatment effect by taking the difference between these two predictions. This particular approach is called a *T-learner* and is described in greater depth below.

## Simple meta-learners for HTE modeling

As in supervised learning tasks, the best meta-learner for a particular causal effects modeling task depends on the structure of the problem at hand. While in supervised learning we can presume that we have access to the ground truth for the quantity we’re trying to predict, this assumption does not hold for causal effects modeling.

As the true causal effect is almost always unknown, we cannot simply try out all possible meta-learners and pick the one that produces the most accurate estimate of the causal effect. Since the best meta-learner cannot be identified quantitatively, we need to make an informed selection based on the characteristics of each candidate. To provide a sense of the considerations involved when making such a selection, we will briefly discuss the structure of three fairly simple meta-learners (the S-, T-, and X-learners) and the strengths and weaknesses of each.

### S-learner

**Naming:** The S-learner gets its name because it uses only a single supervised learning predictor to generate its estimates of the CATE.

**Structure:** The S-learner directly estimates the dependent variable Y using a single model where the binary treatment indicator T is appended to the set of attributes X during the model training process. This yields a machine learning model that predicts Y as a function of both X and T: S(T, X). Given context X, we then estimate the causal effect as S(1, X) - S(0, X).

**Training:** The hyperparameters of an S-learner are typically tuned to maximize the cross-validated accuracy of its predictions of Y.
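As a minimal sketch of this structure, the example below uses ordinary least squares as the base-learner. That choice, the synthetic data, and all function names are assumptions made for illustration. Because a purely additive linear model would make S(1, X) - S(0, X) the same constant for every X, the design matrix includes interactions between T and X so that the estimated effect can vary with the attributes.

```python
import numpy as np


def design(x, t):
    """Design matrix with columns: intercept, X, T, and T * X interactions."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones((len(x), 1)), x, t, t * x])


def fit_s_learner(x, t, y):
    """Fit a single model S(T, X) on treatment and control rows pooled together."""
    coef, *_ = np.linalg.lstsq(design(x, t), y, rcond=None)
    return coef


def predict_cate(coef, x):
    """Estimate the CATE as S(1, X) - S(0, X)."""
    ones, zeros = np.ones(len(x)), np.zeros(len(x))
    return design(x, ones) @ coef - design(x, zeros) @ coef


# Synthetic data (hypothetical): the true effect is 2 * x0, which averages to roughly zero.
rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 2))
t = rng.integers(0, 2, size=5000)
y = x[:, 1] + 2.0 * x[:, 0] * t + rng.normal(0.0, 0.5, size=5000)

coef = fit_s_learner(x, t, y)
cate = predict_cate(coef, x)
print("mean absolute error vs. true effect:", np.mean(np.abs(cate - 2.0 * x[:, 0])))
```

A tree-based base-learner can be swapped in without the explicit interaction columns, since trees can learn interactions on their own.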

**Strengths:**

- Since the S-learner needs to train only a single machine learning model, it can be used in a smaller data environment than other meta-learners that require separate models to be built independently on treatment and control data. The S-learner is particularly useful when either treatment or control data is limited but a larger volume of data for the other variant is available, as the learner can use data from the larger-volume variant to learn common trends.
- For the same reason, we can also build the S-learner to be easier to interpret than more complex meta-learners. That is, the exact mechanism by which the learner generates its causal effect estimates can be made more transparent and easier to explain to non-technical stakeholders. In particular, S-learners based on elastic net regression and similar linear machine learning models allow for straightforward modeling of heterogeneous treatment effects, i.e., through the introduction of interactions between the treatment indicator T and the attributes X. Note that an S-learner used in this manner is very similar to traditional statistical or econometric specifications of causal models.

**Weaknesses:**

- For some base-learners, the S-learner’s estimate of the treatment effect can be biased toward zero when the influence of the treatment T on the outcome Y is small relative to the influence of the attributes X. This is particularly pronounced in the case of tree-based models, as in this circumstance the learner may rarely choose to split on T.

The S-learner is generally a good choice when the average treatment effect is large, the data set is small and the interpretability of the result is important.

Additionally, when building an S-learner on a tree-based base-learner, it's important to first validate that the ATE is on the same order of magnitude as key relationships between X and Y.

### T-learner

**Naming:** The T-learner derives its name from the fact that it uses two separate machine learning models to create its estimate of the CATE, one trained on treatment data and another trained on control data.

**Structure:** The T-learner builds separate models to predict the dependent variable Y given attributes X on the treatment and control datasets: T(X) and C(X). It then estimates the causal effect as T(X) - C(X).

**Training:** The hyperparameters of each base-learner in the T-learner are typically tuned to maximize the cross-validated accuracy of its predictions of Y.
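The two-model structure can be sketched in a few lines. Ordinary least squares stands in for the base-learner here purely to keep the example dependency-free; any supervised regressor, such as a random forest, could take its place. The synthetic data and function names are assumptions for illustration.

```python
import numpy as np


def fit_ols(x, y):
    """Least-squares base-learner; a stand-in for any supervised regressor."""
    xb = np.hstack([np.ones((len(x), 1)), x])
    coef, *_ = np.linalg.lstsq(xb, y, rcond=None)
    return coef


def predict_ols(coef, x):
    """Predict with a model returned by fit_ols."""
    return np.hstack([np.ones((len(x), 1)), x]) @ coef


def t_learner_cate(x, t, y, x_new):
    """Fit T(X) on treated rows and C(X) on control rows, then return T(X) - C(X)."""
    treat_model = fit_ols(x[t == 1], y[t == 1])
    control_model = fit_ols(x[t == 0], y[t == 0])
    return predict_ols(treat_model, x_new) - predict_ols(control_model, x_new)


# Synthetic data (hypothetical): the true effect is 0.5 + 1.5 * x0.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
true_effect = 0.5 + 1.5 * x[:, 0]
y = x[:, 1] + true_effect * t + rng.normal(0.0, 0.5, size=n)

cate_hat = t_learner_cate(x, t, y, x)
print("mean absolute error vs. true effect:", np.mean(np.abs(cate_hat - true_effect)))
```

Note that each model only ever sees half of the data, which is the root of the data-volume weakness discussed for this learner.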

**Strengths:**

- The T-learner can be used to estimate heterogeneous treatment effects using tree-based models even when the average treatment effect is small.
- As the difference between two supervised learning models, the T-learner is still interpretable when compared to more complex approaches.

**Weaknesses:**

- Since the T-learner requires separate models to be built using treatment and control data, it may not be advisable to use when either the available treatment or control data is relatively limited.
- The T-learner also doesn’t respond well to local sparsity—when there are regions of the attribute space where data points are almost exclusively treatment or control—even when treatment and control are well-balanced in aggregate. In such regions, the supervised learning algorithm corresponding to the missing experimental variant is unlikely to make accurate predictions. As a result, the estimated causal effects in this region may not be reliable.

The T-learner is a good choice when using tree-based base-learners and the treatment effect is small enough that an S-learner may not estimate it well.

Before using a T-learner, make sure that both the treatment and control datasets are sufficiently large and that there aren’t any significant local sparsity issues. One good way to check for local sparsity is to try to predict the treatment indicator T given X. Significant deviations of the predicted probability from the overall treatment share (0.5 for an evenly split experiment) merit further investigation.
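The local sparsity check can be sketched as follows. This is an illustration under stated assumptions: the classifier is a small logistic regression fitted by Newton's method (any probabilistic classifier would do), the data is synthetic, the 0.25 flagging threshold is arbitrary, and predictions are made in-sample for brevity where held-out predictions would be preferable.

```python
import numpy as np


def fit_logistic(x, t, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton's method; stand-in for any probabilistic classifier."""
    xb = np.hstack([np.ones((len(x), 1)), x])
    w = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        grad = xb.T @ (p - t)
        hess = (xb * (p * (1.0 - p))[:, None]).T @ xb + ridge * np.eye(xb.shape[1])
        w -= np.linalg.solve(hess, grad)
    return w


def predict_propensity(w, x):
    """Predicted probability of being in treatment given X."""
    xb = np.hstack([np.ones((len(x), 1)), x])
    return 1.0 / (1.0 + np.exp(-xb @ w))


rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))

# Case 1: fully randomized assignment, independent of X.
t_random = rng.integers(0, 2, size=n).astype(float)
# Case 2 (hypothetical): balanced in aggregate, but assignment depends strongly on x0.
t_sparse = (rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x[:, 0]))).astype(float)

flagged = {}
for name, t in [("randomized", t_random), ("locally sparse", t_sparse)]:
    p = predict_propensity(fit_logistic(x, t), x)
    # Share of rows whose predicted propensity is far from an even split.
    flagged[name] = float(np.mean(np.abs(p - 0.5) > 0.25))
    print(f"{name}: treatment share {t.mean():.2f}, rows flagged {flagged[name]:.2f}")
```

In the second case the aggregate treatment share still looks balanced, so the problem is only visible once the prediction is conditioned on X.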

### X-learner

**Naming:** The X-learner gets its name from the fact that the outputs of its first-stage models are “crossed” to generate its causal effect predictions. That is, the estimator built on the control data set is later applied to the treatment data set and vice versa.

**Structure:** Like the T-learner, the X-learner builds separate models to predict the dependent variable Y given attributes X on the treatment and control datasets: T(X) and C(X). It also builds a predictor, P(X), for the likelihood of a particular data point being in treatment.

The control predictor C(X) is then applied to the treatment dataset and subtracted from the treatment actuals to generate the treatment effect estimates Y - C(X). The treatment predictor is similarly applied to the control dataset to generate the estimates T(X) - Y. Two new supervised learning models are then built to predict these differences: DT(X) is built on the treatment set differences Y - C(X), and DC(X) is built on the control set differences T(X) - Y.

The final treatment effects predicted by the X-learner are given by weighting these estimators by the likelihood of receiving treatment: P(X) * DC(X) + (1 - P(X)) * DT(X).
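Putting the three steps together, a minimal sketch of the X-learner might look like the following. The base-learners are assumptions chosen to keep the example self-contained: ordinary least squares for the outcome and second-stage models, and a small Newton's-method logistic regression for P(X). The synthetic data and function names are hypothetical as well.

```python
import numpy as np


def add_intercept(x):
    return np.hstack([np.ones((len(x), 1)), x])


def fit_ols(x, y):
    """Least-squares regressor; a stand-in for any supervised base-learner."""
    coef, *_ = np.linalg.lstsq(add_intercept(x), y, rcond=None)
    return coef


def predict_ols(coef, x):
    return add_intercept(x) @ coef


def fit_propensity(x, t, n_iter=25):
    """Logistic regression by Newton's method, used here as P(X)."""
    xb = add_intercept(x)
    w = np.zeros(xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        hess = (xb * (p * (1.0 - p))[:, None]).T @ xb + 1e-6 * np.eye(xb.shape[1])
        w -= np.linalg.solve(hess, xb.T @ (p - t))
    return w


def x_learner_cate(x, t, y, x_new):
    xt, yt = x[t == 1], y[t == 1]
    xc, yc = x[t == 0], y[t == 0]

    # Stage 1: outcome models T(X) and C(X), one per arm.
    treat_model, control_model = fit_ols(xt, yt), fit_ols(xc, yc)

    # Cross the models to impute an effect for every row.
    d_treat = yt - predict_ols(control_model, xt)     # Y - C(X) on treated rows
    d_control = predict_ols(treat_model, xc) - yc     # T(X) - Y on control rows

    # Stage 2: DT(X) and DC(X) predict the imputed effects from X.
    dt_model, dc_model = fit_ols(xt, d_treat), fit_ols(xc, d_control)

    # Blend by propensity: P(X) * DC(X) + (1 - P(X)) * DT(X).
    p = 1.0 / (1.0 + np.exp(-add_intercept(x_new) @ fit_propensity(x, t)))
    return p * predict_ols(dc_model, x_new) + (1.0 - p) * predict_ols(dt_model, x_new)


# Synthetic data (hypothetical): the true effect is 1.0 + x0.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
true_effect = 1.0 + x[:, 0]
y = x[:, 1] + true_effect * t + rng.normal(0.0, 0.5, size=n)

cate_hat = x_learner_cate(x, t, y, x)
print("mean absolute error vs. true effect:", np.mean(np.abs(cate_hat - true_effect)))
```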

**Training:** The hyperparameters of each first-stage regressor in the X-learner are typically tuned to maximize the cross-validated accuracy of its predictions of Y. The propensity model P(X) is generally tuned to minimize classification log loss since we care about the accuracy of the model’s probabilistic predictions and not just the ultimate accuracy of the classifier. The second-stage regressors are again tuned to minimize cross-validated regression loss.

**Strengths:**

- Like the T-learner, the X-learner can be used to estimate heterogeneous treatment effects using tree-based models even when the average treatment effect is relatively small.
- The X-learner is by design more robust to local sparsity than the T-learner. When P(X) is close to one, i.e. almost all data near X is expected to be in treatment, the base-learner built on control data C(X) is expected to be poorly estimated. In this circumstance, the algorithm gives almost all weight to DC(X) since it does not depend on C(X). The opposite pattern holds when P(X) is close to zero.

**Weaknesses:**

- Like the T-learner, the X-learner requires separate models to be built using treatment and control data. It may not be advisable to use when either the available treatment or control data is relatively limited.
- Since the X-learner is a multi-stage model, there is an inherent risk that errors from the first stage may propagate to the second stage. The X-learner should be used with caution when first-stage models have low predictive power.
- As a multi-stage model, X-learner predictions aren’t easily interpretable.

The X-learner is useful when a T-learner would otherwise be appropriate but local sparsity is present in the data. However, the multi-stage structure it uses to reduce the impact of local sparsity assumes that the first-stage predictions are fairly accurate, so its outputs should be treated skeptically when these first-stage predictions are imprecise.

When P(X) is always near 0.5, it's possible to dramatically simplify the X-learner by training a single second-stage learning model using constructed counterfactuals. The SHAP plots and/or feature importances from this second stage model can then be used to visualize the drivers of your estimate of the CATE.
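One plausible reading of this simplification, offered as an assumption rather than a definitive recipe, is sketched below. The first-stage models fill in each row's unobserved outcome (the constructed counterfactual), the per-row effect is the difference between the treated and control outcomes, and a single second-stage model is fitted on all rows at once. Ordinary least squares is used throughout, so the fitted coefficients play the role that SHAP values or feature importances would play for a tree-based model.

```python
import numpy as np


def fit_ols(x, y):
    """Least-squares regressor; a stand-in for any supervised base-learner."""
    xb = np.hstack([np.ones((len(x), 1)), x])
    coef, *_ = np.linalg.lstsq(xb, y, rcond=None)
    return coef


def predict_ols(coef, x):
    return np.hstack([np.ones((len(x), 1)), x]) @ coef


def simplified_x_learner(x, t, y):
    """Fit one second-stage model on effects built from constructed counterfactuals."""
    treat_model = fit_ols(x[t == 1], y[t == 1])
    control_model = fit_ols(x[t == 0], y[t == 0])

    # Keep the observed outcome where it exists; impute the missing one.
    y_if_treated = np.where(t == 1, y, predict_ols(treat_model, x))
    y_if_control = np.where(t == 0, y, predict_ols(control_model, x))

    # Treated rows give Y - C(X); control rows give T(X) - Y.
    effect = y_if_treated - y_if_control
    return fit_ols(x, effect)


# Synthetic data (hypothetical): the true effect is 1.0 + 1.5 * x0; x1 does not drive it.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 2))
t = rng.integers(0, 2, size=n)
y = x[:, 1] + (1.0 + 1.5 * x[:, 0]) * t + rng.normal(0.0, 0.5, size=n)

effect_coef = simplified_x_learner(x, t, y)
print("intercept, x0, x1 coefficients of the effect model:", np.round(effect_coef, 2))
```

Dropping the propensity weights is only reasonable here because P(X) is assumed to stay near 0.5 everywhere, so both arms' imputed effects deserve roughly equal weight.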

## Using uplift modeling for consumer personalization and targeting

At DoorDash, we leverage HTE models to ensure that our churned customers (people who have stopped ordering from DoorDash) get a more personalized promotional experience. People are bombarded by so many messages these days that we want to use these models to make sure we only send promotions to individuals who we think will react positively. This also makes our promotions more cost-effective, because we send fewer promotions that the model predicts will not increase the likelihood of a conversion.

HTE models also let us send different incentives to different consumers based on each consumer’s predicted future ordering behavior. Specifically, we want to predict how each consumer’s order volume over the next month would change if they received a promotion.

To make these predictions, we used previous A/B test data to train a model that predicts each consumer’s order volume over the next 28 days. We chose an S-learner given the small set of data available and the size of the aggregate treatment effect relative to other effects.

The base model was a gradient-boosted machine learning model using LightGBM that included many features related to the consumer. Informative features include historical aggregate features for the number of deliveries in previous time periods, recent app visits, how long they have been a customer, and features related to the quality of delivery experiences such as low ratings, or the consumer experience, such as number of merchants available.

We generated the training data as follows:

- We first took the set of all orders where a promo was redeemed; for these training examples, we set promo_order = True. We then generated features (X) based on the information available before the order was placed.
- To generate examples where promo_order = False, we used noise contrastive estimation: we took positive examples and replaced them with other consumers in the same region who did not order. We took this approach because the promos were not originally sent as part of A/B tests, so we needed some way to add negative examples.
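The sketch below shows one possible reading of these two steps, not the production pipeline. All record fields, identifiers, and the one-negative-per-positive ratio are hypothetical. For every positive example it draws a consumer from the same region who did not order and labels that row promo_order = False.

```python
import random


def build_training_rows(consumers, seed=0):
    """Pair each positive example with a sampled same-region consumer who did not order."""
    rng = random.Random(seed)

    # Index the candidate negatives by region.
    non_orderers = {}
    for c in consumers:
        if not c["redeemed_promo"]:
            non_orderers.setdefault(c["region"], []).append(c)

    rows = []
    for c in consumers:
        if not c["redeemed_promo"]:
            continue
        rows.append({"consumer_id": c["id"], "region": c["region"], "promo_order": True})
        pool = non_orderers.get(c["region"], [])
        if pool:
            # Features for the negative would be computed as of the positive's order time.
            neg = rng.choice(pool)
            rows.append({"consumer_id": neg["id"], "region": neg["region"], "promo_order": False})
    return rows


# Hypothetical records; redeemed_promo = False means the consumer did not order.
consumers = [
    {"id": 1, "region": "sf", "redeemed_promo": True},
    {"id": 2, "region": "sf", "redeemed_promo": False},
    {"id": 3, "region": "sf", "redeemed_promo": False},
    {"id": 4, "region": "nyc", "redeemed_promo": True},
    {"id": 5, "region": "nyc", "redeemed_promo": False},
]

rows = build_training_rows(consumers)
for row in rows:
    print(row)
```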

### Results

When we implemented targeting from the HTE models, we found a subpopulation that would react to our targeted promotions far more cost-effectively than the population at large, reducing our promotional costs by 33% and avoiding sending unwanted promotions to the larger population. Because we also want to achieve high volume at low cost, this targeting can be tuned as needed by marketers to select a wider audience and generate incrementally more orders, or identify a smaller audience and get fewer, but more cost-effective, conversions.

The curve in Figure 2, below, shows the effect of the targeting models. When we limit our reach, we achieve a lower promotion cost per incremental delivery, because the targeting models select the consumers predicted to respond best. Based on our goals, we can choose a particular threshold in order to achieve a specific cost per incremental delivery, while ensuring we obtain a specific number of incremental deliveries. At 5% audience reach, we see a 33% decrease in promotion cost per incremental delivery. This result shows that, when we apply targeting, we can be more efficient with promotion cost, allowing us to fulfill our business objectives and give users more relevant messaging.
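The trade-off between reach and cost can be sketched as follows. The uplift values, the fixed cost per promotion, and the function name are all made up for illustration, and the numbers it prints are unrelated to the results reported above. Consumers are ranked by predicted uplift, and the cost per incremental delivery is computed for the top share of the audience.

```python
import numpy as np


def cost_per_incremental_delivery(predicted_uplift, cost_per_promo, reach):
    """Cost per incremental delivery when targeting the top `reach` share by predicted uplift.

    Assumes the selected consumers have positive total predicted uplift.
    """
    ranked = np.sort(predicted_uplift)[::-1]        # best predicted responders first
    k = max(1, int(round(reach * len(ranked))))     # number of consumers targeted
    return cost_per_promo * k / ranked[:k].sum()


rng = np.random.default_rng(0)
# Hypothetical predicted incremental deliveries per consumer if sent a promotion.
predicted_uplift = rng.exponential(scale=0.05, size=10_000)
cost_per_promo = 1.0  # hypothetical fixed cost of each promotion sent

curve = {
    reach: cost_per_incremental_delivery(predicted_uplift, cost_per_promo, reach)
    for reach in (0.05, 0.25, 0.50, 1.00)
}
for reach, cost in curve.items():
    print(f"reach {reach:4.0%}: cost per incremental delivery {cost:6.2f}")
```

Because this sketch scores the model with its own predictions, a real evaluation would instead measure incremental deliveries on held-out experimental data.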

## Future HTE model use cases

Consumer promotions represent just one area where we are exploring the use of causal modeling at DoorDash. In general, in any area where we can run an A/B test, we can consider using Heterogeneous Treatment Effects to improve personalization. On the consumer side, our work on promotions is just the start; beyond promotions, we can also use this method to deliver more effective messaging.

Within our logistics team, we are currently exploring several use cases for HTEs. First, we are looking into improving the efficiency of our operations by using causal models to identify subpopulations or markets where changes to our infrastructure that may not be globally beneficial can drive significant improvements. Second, we are looking into using HTEs to develop personalized onboarding experiences for new Dashers, drivers on our platform. By customizing the first-week experience for Dashers based on leading indicators of pain points, we believe we can make the onboarding process significantly easier.

## Conclusion

Causal modeling using Heterogeneous Treatment Effects and meta-learners allows us to take existing data from A/B tests and train predictive machine learning models. In the case of promotions, we can target the best subset of users to optimize for our business objectives. This can turn good results into better ones and a flat experiment into something that can allow us to achieve our goals.

Because causal modeling can be used whenever a company has an A/B test or multivariate testing data, it is broadly applicable, and a great way to explore personalization.

With regard to the X-learner: in a field experiment, P(X) should by definition be the same for every X.

Can we use meta-learners on observational data?

Here P(x) is a sample estimator rather than the true distribution function for the treatment assignment, so that’s not necessarily the case. If a bunch of data around the neighborhood of point z all happen to be in treatment because of random variation we may have P(z) close to one even if the treatment assignment process was independent of x.

Before applying meta-learners to observational data (i.e. where treatment and control were not assigned randomly), I would suggest first applying a technique like propensity score matching to prune your data and reduce the bias introduced by the non-random assignment of treatment. If your available covariates can implicitly capture the dynamics of the non-random assignment, you may be able to build a data set that is “as if” randomized.

Given such a de-biased data set, you could then apply the causal modeling techniques described here. However, such a cleaning process will never work perfectly; (stratified) randomization is still the gold standard where possible.

If the treatment assignment process is independent of X, then P(X=z) being close to one is more a matter of chance (I feel this is like modeling random noise).

In the case of a well-designed random study, that is correct. However, such modeling can still add value when your sample size is small as random drift across important covariates can prove meaningful.

Another potential use case would be in the “synthetic experiment” created from observational data I described above where such gaps may prove pronounced even if you’ve done your best to clean up the data before proceeding to causal modeling.

Regarding the “synthetic experiment”: after propensity score matching we will have an equal number of control and test units, so by design we will end up with a constant propensity for every unit.

Is PSM a must before applying the X-learner to observational data, given that we are incorporating propensity scores as weights in the X-learner?

Regarding your first comment about random drift across important covariates: by definition, modeling random drift is a risky thing, right?

“When P(X) is always near 0.5, it’s possible to dramatically simplify the X-learner by training a single second-stage learning model using constructed counterfactuals”

How are we getting the constructed counterfactuals?

How should we calculate the confidence interval (other than using bootstrap) and the required sample size before the experiment?

The “consumer targeting” example is not a causal inference problem but a machine learning problem classifying which customer will respond or not to the promotion.

Did you estimate HTE or make use of HTE estimates when you target potential customers?

I do not think so.

The training sample should include two types of data

(1) response to promotion

(2) nonresponse to promotion

Did you include (2)?

Please see the section on training data generation. Because of a lack of A/B test data for promo_order = False, we generated this data using noise contrastive estimation.

Could you elaborate on the noise contrastive estimation where “we took positive examples and replaced them with other consumers in the same region who did not order”?

What exactly are positive examples?

Hi Will, in this case, they are the examples where promo_order = True. Hope that helps.