Random assignment (randomly allocating subjects to ‘treatment’ and ‘control’ groups) in an experiment allows one to calculate an unbiased estimate of a causal treatment effect.

The *average treatment effect* is the most common summary measure of a treatment effect: it is estimated as the difference in (sample) mean response between the treated group and untreated (‘control’) group:

\[\overline{Y}_{treatment}-\overline{Y}_{control}\]

which is an unbiased estimator of

\[E\Big[Y_i^{\text{if received treatment}}-Y_i^{\text{if NOT received treatment}}\Big]\]

…if assignment to ‘treatment’ and ‘control’ groups was random.

However, there are other possible estimators of \(E\Big[Y_i^{\text{if received treatment}}-Y_i^{\text{if NOT received treatment}}\Big]\) which have lower variance over repeated experiments than this simple difference between sample means, while remaining unbiased.

Reducing estimator variance allows one to measure effects with greater precision, to detect smaller effects, or to achieve comparable precision with a smaller sample size. In this post, I assess 15 different treatment effect estimators via simulation, comparing their performance over 1000 simulated experiments.
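To make the baseline concrete, here is a minimal sketch (in Python rather than the R used for the simulations in this post) of the simple difference-in-means estimator over many simulated experiments; the normal data-generating process and sample sizes here are illustrative assumptions, not those used in the post:

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 2.0
estimates = []
for _ in range(2000):
    # one simulated experiment: 100 subjects, randomly split 50/50
    y0 = rng.normal(5.0, 1.0, size=100)         # responses without treatment
    t = rng.permutation(np.repeat([0, 1], 50))  # random assignment to groups
    y = y0 + true_effect * t                    # constant +2 treatment effect
    estimates.append(y[t == 1].mean() - y[t == 0].mean())

estimates = np.asarray(estimates)
print(estimates.mean())  # averages out close to the true effect of 2
print(estimates.std())   # the estimator variance that the other methods try to beat
```

The spread of `estimates` across simulated experiments is exactly the quantity the methods below try to shrink.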


More specifically, the situation which I am considering is this:

We have an ‘experimental’ dataset - \(n\) independent subjects \((y_i, \mathbf{x}_i, T_i),\ i=1,\dots,n\) - where the response \(y_i\) is some unknown function (possibly with a random component) of the feature vector \(\mathbf{x}_i\) and the treatment \(T_i\), i.e. \(Y\) comes from some random data-generating process

\[Y_i=f(\mathbf{x}_i, T_i) + \epsilon_{i}\]

\[T_i=\begin{cases}1 \quad\text{if subject }i\text{ received treatment}\\0 \quad\text{if subject }i\text{ did NOT receive treatment}\end{cases}\]

where \(\epsilon_i\) is random error/noise.

We have control over the assignment of the observations \(i\) in the ‘experimental’ dataset to the *treatment* (\(T_i=1\)) or *control* (\(T_i=0\)) group.

We also have a ‘training’ dataset - a set of \(m\) observations \((y_j, \mathbf{x}_j),\ j=1,\dots,m\) - which are samples from the same data-generating process \(Y_j=f(\mathbf{x}_j, T_j=0)+\epsilon_{j}\) that generated our ‘experimental’ dataset. We consider this data to have been collected prior to, or to be otherwise independent of, the ‘experimental’ data. We can use this dataset to make inferences about (or model) the relationship between \(Y\) and \(\mathbf{x}\) without it affecting the inferences from our experiment (for example, it could be used to inform assignment to experimental *treatment*/*control* groups). No *treatment* is given to subjects in the ‘training’ dataset.

For simplicity, in my experiments I have assumed that the treatment effect is constant: the treatment simply increases the \(Y_i\) value of any individual exposed to it by \(+2\).

The methods for estimator variance reduction that I’ve assessed in this simulation are:


**simple_difference**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. The treatment effect is then estimated as the difference between the sample group means (\(\overline{y}_{treated}-\overline{y}_{control}\)). This is the baseline estimator that I’m trying to improve upon. The ‘training’ dataset is not used.

**cupac_ranger**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. A Random Forest model is trained on the full ‘training’ dataset. The Control Variates method is then used to estimate the treatment effect on the ‘experimental’ dataset, using the model-predicted response \(\hat y\) on the ‘experimental’ dataset as the control variate. DoorDash engineering call this technique CUPAC. I provide a short description of the Control Variates method at the end of this post.
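For readers who want the mechanics before the appendix: the Control Variates adjustment subtracts the centred predictions, scaled by \(\hat\theta=\widehat{cov}(y,\hat y)/\widehat{var}(\hat y)\), from the responses before differencing the group means. A minimal sketch in Python, using a noise-free linear predictor as a stand-in for the trained Random Forest (the data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
t = rng.permutation(np.repeat([0, 1], n // 2))
y = 5 + 2 * x + rng.normal(size=n) + 2.0 * t   # constant +2 treatment effect
yhat = 5 + 2 * x                               # stand-in for model predictions

# control-variates adjustment: subtract the centred predictions, scaled by theta
theta = np.cov(y, yhat)[0, 1] / np.var(yhat, ddof=1)
y_cv = y - theta * (yhat - yhat.mean())

naive = y[t == 1].mean() - y[t == 0].mean()
estimate = y_cv[t == 1].mean() - y_cv[t == 0].mean()
print(naive, estimate)  # both near 2; the adjusted one varies less across experiments
```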

**cupac_lm**
This method is the same as cupac_ranger, except that a linear model is used as the predictive model instead of a Random Forest model.

**matched_control**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. A matched control group is created by pairing each treatment subject with its nearest single control subject in feature space (matching with replacement). The treatment effect is then estimated as the difference between the mean response \(Y\) in the original treatment group and the mean response \(Y\) in this matched control group. The ‘training’ dataset is not used.

**y_stratified**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. The subjects in the ‘experimental’ dataset are then grouped into strata based on rough binning of the features \(\mathbf{x}\). The treatment effect is estimated within each stratum (\(\overline{y}_{treated\text{ in stratum } k}-\overline{y}_{control \text{ in stratum } k}\)), and these estimates are combined via a weighted sum into a single estimate, with weights proportional to the relative sizes of the strata. The ‘training’ dataset is not used for this method.
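The stratified estimator can be sketched as follows (Python, with a single feature binned into four strata; the data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1200
x = rng.normal(size=n)
t = rng.permutation(np.repeat([0, 1], n // 2))
y = 5 + 2 * x + 2.0 * t + rng.normal(size=n)

strata = np.digitize(x, bins=[-1.0, 0.0, 1.0])   # rough binning of the feature
total = weight_sum = 0.0
for k in np.unique(strata):
    in_k = strata == k
    diff_k = y[in_k & (t == 1)].mean() - y[in_k & (t == 0)].mean()
    total += in_k.sum() * diff_k                  # weight each stratum by its size
    weight_sum += in_k.sum()

estimate = total / weight_sum
print(estimate)  # near the true effect of 2
```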

**y_matched_propensity_score**
Reduction of estimator variance in a randomised experiment is not the usual use of propensity matching, but I tried it anyway (spoiler: it doesn’t work very well). Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. Propensity \(e(\mathbf{x})=Pr[T=1|\mathbf{x}]\) (the conditional probability of treatment) is estimated using a logistic regression model also fit to the ‘experimental’ dataset. Each treated subject in the ‘experimental’ dataset is then paired with the control subject whose predicted propensity is as close as possible to their own (matching with replacement). The average treatment effect is then estimated as the difference between the mean response \(Y\) in the treatment group and the mean response \(Y\) in the propensity-matched control group. The ‘training’ dataset is not used for this method.

**matchit_propensity_score**
In case I made a mistake in my manual propensity-score matching, I performed the same procedure using the R library MatchIt, which is a wonderful library with phenomenal documentation (read it for a thorough introduction to propensity methods). The result is identical to the y_matched_propensity_score method.

**propensity_score_weighting**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. Propensity \(e(\mathbf{x})=Pr[T=1|\mathbf{x}]\) is then estimated using a logistic regression model, also trained on the ‘experimental’ dataset. The treatment effect is then estimated as the difference between the weighted average response \(Y\) in the treatment group and the weighted average response \(Y\) in the control group, where each treated subject is weighted by the factor \(\frac{1}{\hat e(\mathbf{x})}\) and each control group subject is weighted by \(\frac{1}{1-\hat e(\mathbf{x})}\). The ‘training’ dataset is not used for this method.
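A sketch of the weighting step in Python; a hand-rolled gradient-ascent logistic fit stands in for R’s glm, and the data-generating process is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
t = rng.permutation(np.repeat([0, 1], n // 2))
y = 5 + 2 * x + 2.0 * t + rng.normal(size=n)

# estimate propensity e(x) = Pr[T=1|x] by logistic regression (plain gradient ascent)
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.05 * X.T @ (t - p) / n
e_hat = 1 / (1 + np.exp(-X @ beta))

# inverse-propensity-weighted group means:
treated = np.average(y[t == 1], weights=1 / e_hat[t == 1])
control = np.average(y[t == 0], weights=1 / (1 - e_hat[t == 0]))
estimate = treated - control
print(estimate)  # near 2; under random assignment e_hat hovers around 0.5
```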

**reduce_variance_by_lm**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. The effects of the features \(\mathbf{x}\) and the treatment \(T\) on the response \(Y\) are then estimated using a linear regression model \(Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+\beta_3X_{i3}+...+\beta_pX_{ip}+\beta_{p+1}T_i\) fit to the ‘experimental’ dataset, where \(T_i=\begin{cases} 1 \quad\text{ if subject } i \text{ received treatment}\\0 \quad\text{ if subject } i \text{ did not receive treatment}\end{cases}\). Model coefficient \(\beta_{p+1}\) is then the estimate of the treatment effect. This method reduces the variance of the treatment effect estimator by attributing some of the previously unexplained variance in the response to the effects of the features \(\mathbf{x}\). The ‘training’ dataset is not used for this method.
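In matrix form this is just an OLS fit with the treatment indicator as one of the regressors; a minimal Python sketch (the synthetic data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
t = rng.permutation(np.repeat([0, 1], n // 2))
y = 5 + 2 * x + 2.0 * t + rng.normal(size=n)

# OLS of y on an intercept, the feature and the treatment indicator;
# the coefficient on t is the treatment effect estimate
X = np.column_stack([np.ones(n), x, t])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
estimate = coef[2]
print(estimate)  # near 2, with lower variance than the raw difference in means
```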

**match_on_yhat_lm**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. A linear model is trained on the ‘training’ dataset to predict response \(Y\) from the features \(\mathbf{x}\). This model is then used to generate a predicted response \(\hat{y}\) for every subject in the ‘experimental’ dataset. Each treated subject is then matched (with replacement) with the control subject whose predicted response \(\hat{y}\) is the same as, or as close as possible to, their own. The treatment effect is then estimated as the difference between the mean response \(Y\) in the treated group and the mean response \(Y\) in the matched control group.
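A sketch of the matching step in Python, again using a noise-free linear predictor as a stand-in for the model fit on the ‘training’ dataset (the data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.normal(size=n)
t = rng.permutation(np.repeat([0, 1], n // 2))
y = 5 + 2 * x + 2.0 * t + rng.normal(size=n)
yhat = 5 + 2 * x                 # stand-in for predictions from the 'training' model

# match each treated subject to the control subject with the closest yhat (with replacement):
yhat_control = yhat[t == 0]
y_control = y[t == 0]
matched = np.array([y_control[np.argmin(np.abs(yhat_control - v))] for v in yhat[t == 1]])

estimate = y[t == 1].mean() - matched.mean()
print(estimate)  # near 2
```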

**match_on_yhat_ranger**
This method is the same as match_on_yhat_lm, except that a Random Forest model is used instead of a linear model.

**group_assign_on_yhat_ranger**
A Random Forest model is trained on the ‘training’ dataset to predict response \(Y\) from features \(\mathbf{x}\). This model is then used to generate predicted response values \(\hat y\) for the subjects in the ‘experimental’ dataset. These predicted values \(\hat y\) are then used to assign subjects in the ‘experimental’ dataset to treatment and control groups (ensuring that the distribution of \(\hat y\) is the same in both treatment and control) prior to administering the treatment.
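The balanced-assignment step can be sketched by sorting on \(\hat y\) and alternating group membership, which forces the \(\hat y\) distributions of the two groups to match (Python; a noise-free linear predictor stands in for the Random Forest, and the data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
x = rng.normal(size=n)
yhat = 5 + 2 * x                 # predictions available BEFORE the experiment

# sort by yhat and alternate assignment, so both groups see the same yhat distribution:
order = np.argsort(yhat)
t = np.zeros(n, dtype=int)
t[order[::2]] = 1                # every other subject (in yhat order) is treated

# the treatment is administered only after assignment:
y = 5 + 2 * x + rng.normal(size=n) + 2.0 * t
estimate = y[t == 1].mean() - y[t == 0].mean()
print(estimate)  # near 2
```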

**group_assign_on_yhat_knn**
This method is the same as group_assign_on_yhat_ranger, except that 1-nearest-neighbour regression is used instead of a Random Forest model.

**cca_control_var**
Subjects in the ‘experimental’ dataset are assigned at random to treatment and control groups. A linear combination of the features \(\mathbf{x}\) is used to form a new univariate feature \(z\). The weights in this linear combination are chosen using canonical correlation analysis, which ensures that \(z\) is maximally correlated (Pearson correlation) with the response \(Y\). These weights are learned on the ‘training’ dataset, and then applied to the subjects in the ‘experimental’ dataset to form the new feature \(z\) there. The Control Variates method is then used to estimate the treatment effect on the ‘experimental’ dataset, using this new feature \(z\) as the control variate.
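With a single response variable, the canonical weights maximising the correlation between \(X\mathbf{w}\) and \(Y\) are proportional to the OLS regression coefficients, so the method can be sketched in Python as follows (the linear feature/response model is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
coefs = np.array([2.0, -1.0, 0.5])

# 'training' data: learn the combination weights here, not on the experiment
m = 2000
X_tr = rng.normal(size=(m, 3))
y_tr = 5 + X_tr @ coefs + rng.normal(size=m)
# for univariate Y, the CCA weights solve cov(X) w = cov(X, y):
w = np.linalg.solve(np.cov(X_tr, rowvar=False),
                    np.cov(X_tr, y_tr[:, None], rowvar=False)[:3, 3])

# 'experimental' data: apply the learned weights to form the control variate z
n = 1000
X_ex = rng.normal(size=(n, 3))
t = rng.permutation(np.repeat([0, 1], n // 2))
y_ex = 5 + X_ex @ coefs + rng.normal(size=n) + 2.0 * t
z = X_ex @ w
theta = np.cov(y_ex, z)[0, 1] / np.var(z, ddof=1)
y_cv = y_ex - theta * (z - z.mean())
estimate = y_cv[t == 1].mean() - y_cv[t == 0].mean()
print(estimate)  # near 2
```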

**cca_naughty**
This method is the same as cca_control_var, except that the weights used in the linear combination that creates the new variable \(z\) are learned on the ‘experimental’ data itself, rather than on the ‘training’ dataset. This method appears empirically to perform well (refer to the results later on), but I’m unsure whether it can be trusted in all situations (it feels too much like data leakage to me). The ‘training’ dataset is not used for this method.

**multivar_PCA_control_var**
This is my own extension of the Control Variates method, extended to incorporate multiple control variates. Since the optimisation becomes very messy with correlated features, I apply Principal Component Analysis to the sample features (to decorrelate them) prior to implementing the method, assuming that the effect of this on inference is negligible. I’ve put a longer explanation of this method at the end of this post.

The showcase in this post is not my R code (which was written very quickly, so is long and unoptimised). I have folded the code sections for readability, but left a button available so that you can unfold the code blocks if you wish to interrogate my methods.

```
run_1_experiment <- function(
n_control_group,
n_target_group,
n_training_group,
population_sim_method = c(
"multivariate_normal",
"linear_model_with_interactions",
"feature_partition",
"knn_reps"
),
effect_size
){
results_this_experiment <- c( true_difference = effect_size )
# simulate populations --------------------------------------------------------------------------------
if( population_sim_method == "multivariate_normal" ){
control_mu <- c(
y = 5,
x1 = 2,
x2 = -1,
x3 = 0
)
target_mu <- c(
y = control_mu[["y"]] + effect_size, # [[ ]] extracts the value without its name, so no names() repair is needed
x1 = 2,
x2 = -1,
x3 = 0
)
shared_sigma <- matrix( c( # y x1 x2 x3
1 , 0.2, 0.5, 0.3, # y
0.2, 1 , 0.7, 0.1, # x1
0.5, 0.7, 1 , 0.6, # x2
0.3, 0.1, 0.6, 1 # x3
),
byrow = TRUE,
nrow = 4,
ncol = 4
)
colnames(shared_sigma) <- c("y","x1","x2","x3")
rownames(shared_sigma) <- colnames(shared_sigma)
control_group <-
MASS::mvrnorm(
n = n_control_group,
mu = control_mu,
Sigma = shared_sigma
) %>%
as_tibble()
target_group <-
MASS::mvrnorm(
n = n_target_group,
mu = target_mu,
Sigma = shared_sigma
) %>%
as_tibble()
training_data <-
MASS::mvrnorm( n = n_training_group,
mu = control_mu,
Sigma = shared_sigma
) %>%
as_tibble()
# assign to target/control using matching on ranger model score:
potential_target_and_control_group <-
MASS::mvrnorm(
n = n_target_group + n_control_group,
mu = control_mu,
Sigma = shared_sigma
) %>%
as_tibble()
fit_ranger <- ranger::ranger( formula = y ~ x1 + x2 + x3,
data = training_data
)
potential_target_and_control_group$yhat <-
predict( object = fit_ranger,
data = potential_target_and_control_group
)$predictions
target_control_ratio <- round( n_target_group / n_control_group )
assign_groups_ranger_score <-
potential_target_and_control_group %>%
arrange( yhat ) %>%
mutate( rownumber = row_number(),
target_control = if_else( rownumber%%target_control_ratio == 0,
"control",
"target"
)
) %>%
mutate( y = if_else( target_control == "target",
y + effect_size,
y
)
)
# assign to test/holdout by nearest-neighbour regression -------------------------------------------------------
potential_target_and_control_group <-
MASS::mvrnorm(
n = n_target_group + n_control_group,
mu = control_mu,
Sigma = shared_sigma
) %>%
as_tibble()
target_control_ratio <- round( n_target_group / n_control_group )
potential_target_and_control_group$yhat_nearest_neighbour <- as.numeric(NA) # initialize column
# calculate estimate of y as response value (y) of nearest neighbour in training data:
for( customer_i in 1:nrow(potential_target_and_control_group) ){
customer_i_X <- potential_target_and_control_group[ customer_i, c("x1","x2","x3") ] %>% unlist()
nearest_training_neighbour_index <-
which.min(
# euclidean distance of customer_i to each customer in the training data:
sapply(
1:nrow(training_data),
function(i){ sum(
( customer_i_X -
unlist(training_data[i,c("x1","x2","x3")])
)^2
)
}
)
)
potential_target_and_control_group$yhat_nearest_neighbour[customer_i] <- training_data$y[nearest_training_neighbour_index]
}
assign_groups_nearest_neighbour <-
potential_target_and_control_group %>%
arrange( yhat_nearest_neighbour ) %>%
mutate( rownumber = row_number(),
target_control = if_else( rownumber%%target_control_ratio == 0,
"control",
"target"
)
) %>%
mutate( y = if_else( target_control == "target",
y + effect_size,
y
)
)
# # assign to test/holdout by nearest-neighbour matching -------------------------------------------------------
# potential_target_and_control_group <-
# MASS::mvrnorm(
# n = n_target_group + n_control_group,
# mu = control_mu,
# Sigma = shared_sigma
# ) %>%
# as_tibble() %>%
# arrange( runif( n() ) ) %>% # put rows in random order
# mutate( subject_id = row_number() ) # give each row (subject) an id number
#
# target_control_ratio <- round( n_target_group / n_control_group ) # I'm assuming that n_target_group >= n_control_group
#
# final_treatment_group <- potential_target_and_control_group %>% slice(0) # create empty table with correct column structure
# final_control_group <- potential_target_and_control_group %>% slice(0) # create empty table with correct column structure
#
# while( nrow(potential_target_and_control_group) > (target_control_ratio+1) ){
#
# # sample a random subject (subject_i):
# subject_i <- potential_target_and_control_group %>% sample_n(1)
# # remove subject_i from the pool of subjects:
# potential_target_and_control_group <-
# potential_target_and_control_group %>%
# filter( subject_id != subject_i$subject_id )
# # get closest [target_control_ratio] neighbours of subject_i:
# distances_to_subject_i <-
# sapply(
# 1:nrow(potential_target_and_control_group),
# function(row_k){ sum(
# (
# unlist( potential_target_and_control_group[ row_k, c("x1","x2","x3") ] ) -
# c(subject_i$x1, subject_i$x2, subject_i$x3)
# )^2
# )
# }
# )
# closest_neighbour_locations <- rank(distances_to_subject_i) < target_control_ratio
# closest_neighbour_data <- potential_target_and_control_group %>% slice( which(closest_neighbour_locations) )
#
# # add subject_i to the control group:
# final_control_group <- bind_rows(
# final_control_group,
# subject_i %>% select( y, x1, x2, x3 )
# )
# # add subject_i's closest [target_control_ratio] neighbours to the treatment group:
} else if( population_sim_method == "linear_model_with_interactions" ){
x1_effect <- runif( n=1, min=-0.5, max=0.5)
x2_effect <- runif( n=1, min=-0.5, max=0.5)
x3_effect <- runif( n=1, min=-0.5, max=0.5)
x1_x2_effect <- runif( n=1, min=-0.5, max=0.5)
x1_x3_effect <- runif( n=1, min=-0.5, max=0.5)
x2_x3_effect <- runif( n=1, min=-0.5, max=0.5)
x1_x2_x3_effect <- runif( n=1, min=-0.5, max=0.5)
control_group <- tibble( x1 = runif( n=n_control_group, min=-5, max=5 ),
x2 = runif( n=n_control_group, min=-5, max=5 ),
x3 = runif( n=n_control_group, min=-5, max=5 )
)
control_group$y <-
x1_effect*control_group$x1 +
x2_effect*control_group$x2 +
x3_effect*control_group$x3 +
x1_x2_effect*control_group$x1*control_group$x2 +
x1_x3_effect*control_group$x1*control_group$x3 +
x2_x3_effect*control_group$x2*control_group$x3 +
x1_x2_x3_effect*control_group$x1*control_group$x2*control_group$x3 +
rnorm( nrow(control_group), mean=0, sd=1 )
control_group <- control_group %>% dplyr::select(y, x1, x2, x3)
target_group <- tibble( x1 = runif( n=n_target_group, min=-5, max=5 ),
x2 = runif( n=n_target_group, min=-5, max=5 ),
x3 = runif( n=n_target_group, min=-5, max=5 )
)
target_group$y <-
effect_size +
x1_effect*target_group$x1 +
x2_effect*target_group$x2 +
x3_effect*target_group$x3 +
x1_x2_effect*target_group$x1*target_group$x2 +
x1_x3_effect*target_group$x1*target_group$x3 +
x2_x3_effect*target_group$x2*target_group$x3 +
x1_x2_x3_effect*target_group$x1*target_group$x2*target_group$x3 +
rnorm( nrow(target_group), mean=0, sd=1 )
target_group <- target_group %>% dplyr::select(y, x1, x2, x3)
training_data <- tibble( x1 = runif( n=n_training_group, min=-5, max=5 ),
x2 = runif( n=n_training_group, min=-5, max=5 ),
x3 = runif( n=n_training_group, min=-5, max=5 )
)
training_data$y <-
x1_effect*training_data$x1 +
x2_effect*training_data$x2 +
x3_effect*training_data$x3 +
x1_x2_effect*training_data$x1*training_data$x2 +
x1_x3_effect*training_data$x1*training_data$x3 +
x2_x3_effect*training_data$x2*training_data$x3 +
x1_x2_x3_effect*training_data$x1*training_data$x2*training_data$x3 +
rnorm( nrow(training_data), mean=0, sd=1 )
training_data <- training_data %>% dplyr::select(y, x1, x2, x3)
# choose target/control using matching on ranger model score -----------------------------------------------------------
potential_target_and_control_group <-
tibble( x1 = runif( n=n_control_group+n_target_group, min=-5, max=5 ),
x2 = runif( n=n_control_group+n_target_group, min=-5, max=5 ),
x3 = runif( n=n_control_group+n_target_group, min=-5, max=5 )
)
potential_target_and_control_group$y <-
x1_effect*potential_target_and_control_group$x1 +
x2_effect*potential_target_and_control_group$x2 +
x3_effect*potential_target_and_control_group$x3 +
x1_x2_effect*potential_target_and_control_group$x1*potential_target_and_control_group$x2 +
x1_x3_effect*potential_target_and_control_group$x1*potential_target_and_control_group$x3 +
x2_x3_effect*potential_target_and_control_group$x2*potential_target_and_control_group$x3 +
x1_x2_x3_effect*potential_target_and_control_group$x1*potential_target_and_control_group$x2*potential_target_and_control_group$x3 +
rnorm( nrow(potential_target_and_control_group), mean=0, sd=1 )
fit_ranger <- ranger::ranger( formula = y ~ x1 + x2 + x3,
data = training_data
)
potential_target_and_control_group$yhat <-
predict( object = fit_ranger,
data = potential_target_and_control_group
)$predictions
target_control_ratio <- round( n_target_group / n_control_group )
assign_groups_ranger_score <-
potential_target_and_control_group %>%
arrange( yhat ) %>%
mutate( rownumber = row_number(),
target_control = if_else( rownumber%%target_control_ratio == 0,
"control",
"target"
)
) %>%
mutate( y = if_else( target_control == "target",
y + effect_size,
y
)
)
# assign to test/holdout by nearest-neighbour matching -------------------------------------------------------
potential_target_and_control_group <-
tibble( x1 = runif( n=n_control_group+n_target_group, min=-5, max=5 ),
x2 = runif( n=n_control_group+n_target_group, min=-5, max=5 ),
x3 = runif( n=n_control_group+n_target_group, min=-5, max=5 )
)
potential_target_and_control_group$y <-
x1_effect*potential_target_and_control_group$x1 +
x2_effect*potential_target_and_control_group$x2 +
x3_effect*potential_target_and_control_group$x3 +
x1_x2_effect*potential_target_and_control_group$x1*potential_target_and_control_group$x2 +
x1_x3_effect*potential_target_and_control_group$x1*potential_target_and_control_group$x3 +
x2_x3_effect*potential_target_and_control_group$x2*potential_target_and_control_group$x3 +
x1_x2_x3_effect*potential_target_and_control_group$x1*potential_target_and_control_group$x2*potential_target_and_control_group$x3 +
rnorm( nrow(potential_target_and_control_group), mean=0, sd=1 )
target_control_ratio <- round( n_target_group / n_control_group )
potential_target_and_control_group$yhat_nearest_neighbour <- as.numeric(NA) # initialize column
# calculate estimate of y as response value (y) of nearest neighbour in training data:
for( customer_i in 1:nrow(potential_target_and_control_group) ){
customer_i_X <- potential_target_and_control_group[ customer_i, c("x1","x2","x3") ] %>% unlist()
nearest_training_neighbour_index <-
which.min(
# euclidean distance of customer_i to each customer in the training data:
sapply(
1:nrow(training_data),
function(i){ sum(
( customer_i_X -
unlist(training_data[i,c("x1","x2","x3")])
)^2
)
}
)
)
potential_target_and_control_group$yhat_nearest_neighbour[customer_i] <- training_data$y[nearest_training_neighbour_index]
}
assign_groups_nearest_neighbour <-
potential_target_and_control_group %>%
arrange( yhat_nearest_neighbour ) %>%
mutate( rownumber = row_number(),
target_control = if_else( rownumber%%target_control_ratio == 0,
"control",
"target"
)
) %>%
mutate( y = if_else( target_control == "target",
y + effect_size,
y
)
)
} else if( population_sim_method == "feature_partition" ){
x1_split_points <- sort( runif(n=2, min=-5, max=5) )
x2_split_points <- sort( runif(n=2, min=-5, max=5) )
x3_split_points <- sort( runif(n=2, min=-5, max=5) )
partition_means <-
# 1=below both split points, 2=between the two split points, 3=above both split points
expand_grid(
x1_partition = 1:3,
x2_partition = 1:3,
x3_partiti
```