User Guide

The goal of this package is to make hypothesis testing with variance reduction methods as easy as calling scipy.stats.ttest_ind() and scipy.stats.ttest_ind_from_stats(). Much of the API is designed to mirror that simplicity.

The publication in [1] was implemented using this package. The variance reduction methods here build on the CUPED ideas of [2] and [3].

The package currently supports three kinds of tests:

  • basic \(z\)-test: the standard test from introductory statistics textbooks.

  • held out: a held-out control variate method (the predictor is trained on a held-out set).

  • cross val: a \(k\)-fold cross-validation setup for training the predictor.

The distinction between basic, held out (aka cv), and cross val (aka stacked) is discussed in [4].

Each method has a few different ways to call it:

  • basic: Call the method using the raw data and the control variate predictions.

  • from stats: Call the method using sufficient statistics of the data and predictions only.

  • train: Pass in a predictor object; the routine trains and evaluates the predictor internally (see the sketch after this list).

    • For lack of a better choice, I assume the model has a sklearn-style fit() and predict() API.
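
A minimal sketch of the train calling style is below. The synthetic data, the LinearRegression predictor, and the import style are illustrative assumptions rather than package requirements; any regressor with fit() and predict() should work.

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from twiser import twiser

  rng = np.random.RandomState(0)

  # Synthetic experiment: a pre-experiment covariate that predicts the outcome
  n, m, d = 1000, 1000, 3
  x_covariates = rng.randn(n, d)
  y_covariates = rng.randn(m, d)
  x = x_covariates @ np.ones(d) + 0.2 + rng.randn(n)  # treatment outcomes (true lift 0.2)
  y = y_covariates @ np.ones(d) + rng.randn(m)  # control outcomes

  # Train style: the routine fits the predictor on a held out split internally
  estimate, ci, pval = twiser.ztest_held_out_train(
      x, x_covariates, y, y_covariates, alpha=0.05, predictor=LinearRegression(), random=rng
  )

Passing random makes the held out training split reproducible.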

Every statistical test in this package returns the same set of variables:

  • A best estimate (of the difference of means)

  • A confidence interval (on the difference of means)

  • A p-value under the H0 that the two means are equal

    • The p-value and confidence interval are tested to be consistent with each other under inversion.

References

[1] R. Turner, U. Pavalanathan, S. Webb, N. Hammerla, B. Cohn, and A. Fu. Isotonic regression adjustment for variance reduction. In CODE@MIT, 2021.

[2] A. Deng, Y. Xu, R. Kohavi, and T. Walker. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pages 123–132, 2013.

[3] A. Poyarkov, A. Drutsa, A. Khalyavin, G. Gusev, and P. Serdyukov. Boosted decision tree regression adjustment for variance reduction in online controlled experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 235–244, 2016.

[4] I. Barr. Reducing the variance of A/B tests using prior information. Degenerate State, June 2018.

twiser.twiser.ztest_from_stats(mean1, std1, nobs1, mean2, std2, nobs2, *, alpha=0.05)[source]

Version of ztest() that works off the sufficient statistics of the data.

Parameters
  • mean1 (float) – The sample mean of the treatment group outcome \(x\).

  • std1 (float) – The sample standard deviation of the treatment group outcome.

  • nobs1 (int) – The number of samples in the treatment group.

  • mean2 (float) – The sample mean of the control group outcome \(y\).

  • std2 (float) – The sample standard deviation of the control group outcome.

  • nobs2 (int) – The number of samples in the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
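
A minimal usage sketch with illustrative numbers, assuming the module is imported as from twiser import twiser:

  from twiser import twiser

  # Only summary statistics are needed, no raw outcome data
  estimate, ci, pval = twiser.ztest_from_stats(
      mean1=10.2, std1=2.1, nobs1=5000, mean2=10.0, std2=2.0, nobs2=5000, alpha=0.05
  )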

twiser.twiser.ztest(x, y, *, alpha=0.05, ddof=1)[source]

Standard two-sample unpaired \(z\)-test. It does not assume equal sample sizes or variances.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • ddof (int) – The “Delta Degrees of Freedom” argument for computing sample variances.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
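
A minimal usage sketch with synthetic outcomes (illustrative only):

  import numpy as np
  from twiser import twiser

  rng = np.random.RandomState(0)
  x = 1.0 + rng.randn(2000)  # treatment outcomes
  y = rng.randn(2000)  # control outcomes

  estimate, ci, pval = twiser.ztest(x, y, alpha=0.05, ddof=1)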

twiser.twiser.ztest_held_out_from_stats(mean1, cov1, nobs1, mean2, cov2, nobs2, *, alpha=0.05)[source]

Version of ztest_held_out() that works off the sufficient statistics of the data.

Parameters
  • mean1 (numpy.ndarray of shape (2,)) – The sample mean of the treatment group outcome and its prediction: [mean(x), mean(xp)].

  • cov1 (numpy.ndarray of shape (2, 2)) – The sample covariance matrix of the treatment group outcome and its prediction: cov([x, xp]).

  • nobs1 (int) – The number of samples in the treatment group.

  • mean2 (numpy.ndarray of shape (2,)) – The sample mean of the control group outcome and its prediction: [mean(y), mean(yp)].

  • cov2 (numpy.ndarray of shape (2, 2)) – The sample covariance matrix of the control group outcome and its prediction: cov([y, yp]).

  • nobs2 (int) – The number of samples in the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
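
A minimal usage sketch; the numbers are illustrative, and each covariance matrix pairs the outcome with its prediction:

  import numpy as np
  from twiser import twiser

  mean1 = np.array([10.2, 10.1])  # [mean(x), mean(xp)]
  cov1 = np.array([[4.0, 3.0], [3.0, 3.5]])  # cov([x, xp])
  mean2 = np.array([10.0, 10.1])  # [mean(y), mean(yp)]
  cov2 = np.array([[4.1, 3.1], [3.1, 3.6]])  # cov([y, yp])

  estimate, ci, pval = twiser.ztest_held_out_from_stats(
      mean1, cov1, 5000, mean2, cov2, 5000, alpha=0.05
  )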

twiser.twiser.ztest_held_out(x, xp, y, yp, *, alpha=0.05, health_check_output=True, ddof=1)[source]

Two-sample unpaired \(z\)-test with variance reduction using control variates. It does not assume equal sample sizes or variances.

The predictions (control variates) must be derived from features that are independent of assignment to treatment or control. If the predictions in treatment and control have different distributions, the test may be invalid.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • xp (numpy.ndarray of shape (n,)) – Predicted outcomes for the treatment group.

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • yp (numpy.ndarray of shape (m,)) – Predicted outcomes for the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • health_check_output (bool) – If True perform a health check that ensures the predictions have the same distribution in treatment and control. If not, issue a warning.

  • ddof (int) – The “Delta Degrees of Freedom” argument for computing sample variances.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
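
A minimal usage sketch. The predictions xp and yp would normally come from a model trained on held out data; here they are simulated so that they correlate with the outcomes:

  import numpy as np
  from twiser import twiser

  rng = np.random.RandomState(0)
  n, m = 1000, 1000
  xp = rng.randn(n)  # predictions for the treatment group
  yp = rng.randn(m)  # predictions for the control group
  x = xp + 0.2 + 0.5 * rng.randn(n)  # treatment outcomes correlated with xp
  y = yp + 0.5 * rng.randn(m)  # control outcomes correlated with yp

  estimate, ci, pval = twiser.ztest_held_out(x, xp, y, yp, alpha=0.05)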

twiser.twiser.ztest_held_out_train(x, x_covariates, y, y_covariates, *, alpha=0.05, train_frac=0.2, health_check_input=False, health_check_output=True, predictor=None, random=None, ddof=1)[source]

Version of ztest_held_out() that also trains the control variate predictor.

The covariates/features must be independent of assignment to treatment or control. If the features in treatment and control have different distributions, the test may be invalid.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • x_covariates (numpy.ndarray of shape (n, d)) – Covariates/features for the treatment group.

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • y_covariates (numpy.ndarray of shape (m, d)) – Covariates/features for the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • train_frac (float) – The fraction of data to hold out for training the predictors. To ensure test validity, we do not use the same data for training the predictors and performing the test. This must be inside the interval \([0, 1]\).

  • health_check_input (bool) – If True perform a health check that ensures the features have the same distribution in treatment and control. If not, issue a warning. It works by training a classifier to predict if a data point is in treatment or control. This can be slow for a large data set since it requires training a classifier.

  • health_check_output (bool) – If True perform a health check that ensures the predictions have the same distribution in treatment and control. If not, issue a warning.

  • predictor (sklearn-like regression object) – An object that has a fit and predict routine to make predictions. The object does not need to be fit yet. It will be fit in this method.

  • random (numpy.random.RandomState) – An optional numpy random stream can be passed in for reproducibility.

  • ddof (int) – The “Delta Degrees of Freedom” argument for computing sample variances.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
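
A minimal usage sketch showing some non-default options; the synthetic data and the choice of RandomForestRegressor are illustrative:

  import numpy as np
  from sklearn.ensemble import RandomForestRegressor
  from twiser import twiser

  rng = np.random.RandomState(123)
  n, m, d = 2000, 2000, 5
  x_covariates = rng.randn(n, d)
  y_covariates = rng.randn(m, d)
  x = x_covariates.sum(axis=1) + 0.1 + rng.randn(n)  # treatment outcomes
  y = y_covariates.sum(axis=1) + rng.randn(m)  # control outcomes

  # Hold out 30% of the data for fitting the predictor and also check the input features
  estimate, ci, pval = twiser.ztest_held_out_train(
      x, x_covariates, y, y_covariates,
      alpha=0.05, train_frac=0.3, health_check_input=True,
      predictor=RandomForestRegressor(n_estimators=50, random_state=0), random=rng,
  )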

twiser.twiser.ztest_in_sample_train(x, x_covariates, y, y_covariates, *, alpha=0.05, health_check_input=False, health_check_output=False, predictor=None, random=None, ddof=1)[source]

Version of ztest_held_out() that also trains the control variate predictor, using the same (in-sample) data for training the predictor and performing the test.

The covariates/features must be independent of assignment to treatment or control. If the features in treatment and control have different distributions, the test may be invalid.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • x_covariates (numpy.ndarray of shape (n, d)) – Covariates/features for the treatment group.

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • y_covariates (numpy.ndarray of shape (m, d)) – Covariates/features for the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • health_check_input (bool) – If True perform a health check that ensures the features have the same distribution in treatment and control. If not, issue a warning. It works by training a classifier to predict if a data point is in treatment or control. This can be slow for a large data set since it requires training a classifier.

  • health_check_output (bool) – If True perform a health check that ensures the predictions have the same distribution in treatment and control. If not, issue a warning.

  • predictor (sklearn-like regression object) – An object that has a fit and predict routine to make predictions. The object does not need to be fit yet. It will be fit in this method.

  • random (numpy.random.RandomState) – An optional numpy random stream can be passed in for reproducibility.

  • ddof (int) – The “Delta Degrees of Freedom” argument for computing sample variances.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).

twiser.twiser.ztest_cross_val_from_stats(mean1, cov1, nobs1, mean2, cov2, nobs2, *, alpha=0.05)[source]

Version of ztest_cross_val() that works off the sufficient statistics of the data.

Parameters
  • mean1 (numpy.ndarray of shape (k, 2)) – The sample mean of the treatment group outcome and its prediction: [mean(x), mean(xp)], for each fold in the \(k\)-fold cross validation.

  • cov1 (numpy.ndarray of shape (k, 2, 2)) – The sample covariance matrix of the treatment group outcome and its prediction: cov([x, xp]), for each fold in the \(k\)-fold cross validation.

  • nobs1 (numpy.ndarray of shape (k,)) – The number of samples in the treatment group, for each fold in the \(k\)-fold cross validation.

  • mean2 (numpy.ndarray of shape (k, 2)) – The sample mean of the control group outcome and its prediction: [mean(y), mean(yp)], for each fold in the \(k\)-fold cross validation.

  • cov2 (numpy.ndarray of shape (k, 2, 2)) – The sample covariance matrix of the control group outcome and its prediction: cov([y, yp]), for each fold in the \(k\)-fold cross validation.

  • nobs2 (numpy.ndarray of shape (k,)) – The number of samples in the control group, for each fold in the \(k\)-fold cross validation.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
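
A minimal usage sketch with illustrative per-fold statistics (here the same statistics are simply repeated for each of \(k = 3\) folds):

  import numpy as np
  from twiser import twiser

  k = 3
  mean1 = np.tile([10.2, 10.1], (k, 1))  # shape (k, 2): [mean(x), mean(xp)] per fold
  cov1 = np.tile([[4.0, 3.0], [3.0, 3.5]], (k, 1, 1))  # shape (k, 2, 2)
  nobs1 = np.full(k, 1000)
  mean2 = np.tile([10.0, 10.1], (k, 1))
  cov2 = np.tile([[4.1, 3.1], [3.1, 3.6]], (k, 1, 1))
  nobs2 = np.full(k, 1000)

  estimate, ci, pval = twiser.ztest_cross_val_from_stats(
      mean1, cov1, nobs1, mean2, cov2, nobs2, alpha=0.05
  )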

twiser.twiser.ztest_cross_val(x, xp, x_fold, y, yp, y_fold, *, alpha=0.05, health_check_output=True)[source]

Two-sample unpaired \(z\)-test with variance reduction using the cross validated control variates method. It does not assume equal sample sizes or variances.

The predictions (control variates) must be derived from features that are independent of assignment to treatment or control. If the predictions in treatment and control have different distributions, the test may be invalid.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • xp (numpy.ndarray of shape (n,)) – Predicted outcomes for the treatment group derived from a cross-validation routine.

  • x_fold (numpy.ndarray of shape (n,)) – The cross validation fold assignment for each data point in treatment (of dtype int).

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • yp (numpy.ndarray of shape (m,)) – Predicted outcomes for the control group derived from a cross-validation routine.

  • y_fold (numpy.ndarray of shape (m,)) – The cross validation fold assignment for each data point in control (of dtype int).

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • health_check_output (bool) – If True perform a health check that ensures the predictions have the same distribution in treatment and control. If not, issue a warning.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
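
A minimal usage sketch. In practice xp and yp would be out-of-fold predictions from a cross validation routine; here they are simulated, and the fold labels are assigned at random:

  import numpy as np
  from twiser import twiser

  rng = np.random.RandomState(0)
  n, m, k = 1200, 1200, 3
  xp = rng.randn(n)  # out-of-fold predictions for treatment
  yp = rng.randn(m)  # out-of-fold predictions for control
  x = xp + 0.2 + 0.5 * rng.randn(n)  # treatment outcomes
  y = yp + 0.5 * rng.randn(m)  # control outcomes

  x_fold = rng.randint(0, k, size=n)  # fold assignment per treatment point (int)
  y_fold = rng.randint(0, k, size=m)  # fold assignment per control point (int)

  estimate, ci, pval = twiser.ztest_cross_val(x, xp, x_fold, y, yp, y_fold, alpha=0.05)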

twiser.twiser.ztest_cross_val_train(x, x_covariates, y, y_covariates, *, alpha=0.05, k_fold=5, health_check_input=False, health_check_output=True, predictor=None, random=None)[source]

Version of ztest_cross_val() that also trains the control variate predictor.

The covariates/features must be independent of assignment to treatment or control. If the features in treatment and control have different distributions, the test may be invalid.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • x_covariates (numpy.ndarray of shape (n, d)) – Covariates/features for the treatment group.

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • y_covariates (numpy.ndarray of shape (m, d)) – Covariates/features for the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • k_fold (int) – The number of folds in the cross validation: \(k\).

  • health_check_input (bool) – If True perform a health check that ensures the features have the same distribution in treatment and control. If not, issue a warning. It works by training a classifier to predict if a data point is in treatment or control. This can be slow for a large data set since it requires training a classifier.

  • health_check_output (bool) – If True perform a health check that ensures the predictions have the same distribution in treatment and control. If not, issue a warning.

  • predictor (sklearn-like regression object) – An object that has a fit and predict routine to make predictions. The object does not need to be fit yet. It will be fit in this method.

  • random (numpy.random.RandomState) – An optional numpy random stream can be passed in for reproducibility.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
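
A minimal usage sketch; the synthetic data and the Ridge predictor are illustrative:

  import numpy as np
  from sklearn.linear_model import Ridge
  from twiser import twiser

  rng = np.random.RandomState(0)
  n, m, d = 1000, 1000, 3
  x_covariates = rng.randn(n, d)
  y_covariates = rng.randn(m, d)
  x = x_covariates.sum(axis=1) + 0.2 + rng.randn(n)  # treatment outcomes
  y = y_covariates.sum(axis=1) + rng.randn(m)  # control outcomes

  # The predictor is re-fit once per fold inside the routine
  estimate, ci, pval = twiser.ztest_cross_val_train(
      x, x_covariates, y, y_covariates,
      alpha=0.05, k_fold=5, predictor=Ridge(alpha=1.0), random=rng,
  )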

twiser.twiser.ztest_cross_val_train_blockwise(x, x_covariates, y, y_covariates, *, alpha=0.05, k_fold=5, health_check_input=False, health_check_output=True, predictor=None, random=None)[source]

Version of ztest_cross_val_train() that is more efficient when the predictor's fit routine scales worse than \(O(N)\); otherwise it offers no speedup.

Parameters
  • x (numpy.ndarray of shape (n,)) – Outcomes for the treatment group.

  • x_covariates (numpy.ndarray of shape (n, d)) – Covariates/features for the treatment group.

  • y (numpy.ndarray of shape (m,)) – Outcomes for the control group.

  • y_covariates (numpy.ndarray of shape (m, d)) – Covariates/features for the control group.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • k_fold (int) – The number of folds in the cross validation: \(k\).

  • health_check_input (bool) – If True perform a health check that ensures the features have the same distribution in treatment and control. If not, issue a warning. It works by training a classifier to predict if a data point is in treatment or control. This can be slow for a large data set since it requires training a classifier.

  • health_check_output (bool) – If True perform a health check that ensures the predictions have the same distribution in treatment and control. If not, issue a warning.

  • predictor (sklearn-like regression object) – An object that has a fit and predict routine to make predictions. The object does not need to be fit yet. It will be fit in this method.

  • random (numpy.random.RandomState) – An optional numpy random stream can be passed in for reproducibility.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).

twiser.twiser.ztest_cross_val_train_load_blockwise(data_iter, *, alpha=0.05, predictor=None, callback=None)[source]

Version of ztest_cross_val_train_blockwise() that loads the data in blocks to avoid exhausting memory. If all the data fits in memory, ztest_cross_val_train_blockwise() is faster.

Parameters
  • data_iter (Sequence[Callable[[], Tuple[ndarray, ndarray, ndarray, ndarray]]]) – An iterable of functions, where each function returns a different cross validation fold. The functions should return data in the format of a tuple: (x, x_covariates, y, y_covariates). See the parameters of ztest_cross_val_train_blockwise() for details on the shapes of these variables.

  • alpha (float) – Significance level of the test; typically this should be 0.05. Must be inside the interval \([0, 1)\).

  • predictor (sklearn-like regression object) – An object that has a fit and predict routine to make predictions. The object does not need to be fit yet. It will be fit in this method.

  • callback (Optional[Callable[[Any], None]]) – An optional callback that gets called for each cross validation fold in the format callback(predictor). This is sometimes useful for logging.

Return type

Tuple[float, Tuple[float, float], float]

Returns

  • estimate – Estimate of the difference in means: \(\mathbb{E}[x] - \mathbb{E}[y]\).

  • ci – Confidence interval (with coverage \(1 - \alpha\)) for the estimate.

  • pval – The p-value under the null hypothesis H0 that \(\mathbb{E}[x] = \mathbb{E}[y]\).
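
A minimal usage sketch. Each entry of data_iter is a zero-argument function that loads one fold; here the loaders just generate synthetic data, but in practice they would read one block from disk or a database:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from twiser import twiser

  def make_loader(seed):
      """Return a zero-argument function that loads one cross validation fold."""
      def load_fold():
          rng = np.random.RandomState(seed)
          n, m, d = 500, 500, 3
          x_covariates = rng.randn(n, d)
          y_covariates = rng.randn(m, d)
          x = x_covariates.sum(axis=1) + 0.2 + rng.randn(n)  # treatment outcomes
          y = y_covariates.sum(axis=1) + rng.randn(m)  # control outcomes
          return x, x_covariates, y, y_covariates
      return load_fold

  # One loader per fold; each loader is only called when its fold is processed
  data_iter = [make_loader(seed) for seed in range(5)]

  estimate, ci, pval = twiser.ztest_cross_val_train_load_blockwise(
      data_iter, alpha=0.05, predictor=LinearRegression()
  )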