31. A/B Testing

Here we will look at how to collect and analyze data to determine the difference between two groups. The idea here is that if we randomly assign individuals to two groups we end up with comparable groups. If we then measure how these two groups respond to a treatment (e.g., being given game version A vs. game version B) we can better determine the effect of that treatment.
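To make the randomization argument concrete, here is a minimal simulation sketch. The trait `prior_activity` and its distribution are invented for illustration and are not part of the game data; the point is only that random assignment tends to balance a pre-existing trait across the two groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-treatment trait for 10,000 simulated users (made-up data)
prior_activity = rng.exponential(scale=50, size=10_000)

# Randomly assign each user to version A (0) or version B (1)
assignment = rng.integers(0, 2, size=prior_activity.size)

# Compare the average trait in each group before any treatment is applied
mean_a = prior_activity[assignment == 0].mean()
mean_b = prior_activity[assignment == 1].mean()
print(f"mean prior activity, A: {mean_a:.1f}  B: {mean_b:.1f}")
```

Because the groups start out similar, a later difference in the outcome is more plausibly due to the version than to who happened to be in each group.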

We’ll take a look at data collected to test how effective different versions of a game are at retaining users.

#load packages
import pandas as pd
import sklearn as sk
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

Load the data

#load data
df_cats = pd.read_csv("/content/cookie_cats.csv")

#take a look
df_cats.head()

31.1. Describe the data

How many in each group?

df_cats.version.?

How many users returned after 7 days?

#gate placed at level 30
df_cats.groupby("?")["?"].mean()

31.2. Visualize the data

#plot the differences between the versions
sns.barplot(?)

31.3. Wrangle the data

Convert the binary target and the binary input variable to 0/1.

from sklearn.preprocessing import LabelEncoder

#build the encoder
le_retention7 = LabelEncoder()
le_version = LabelEncoder()

#fit and transform the retention_7 and version columns
df_cats['retention_7'] = le_retention7.fit_transform(df_cats['retention_7'])
df_cats['version'] = le_version.fit_transform(df_cats['version'])

#take a look
df_cats

#the position of each label in classes_ is its code (index 0 -> 0, index 1 -> 1)
le_retention7.classes_

le_version.classes_
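It helps to know which original label ends up as 0 and which as 1, because later steps refer to the versions by code. The sketch below uses a small made-up data frame; the labels `gate_30` and `gate_40` are assumed here for illustration. `LabelEncoder` sorts the unique labels, and each label's position in `classes_` is its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the version labels are assumptions for illustration
toy = pd.DataFrame({
    "version": ["gate_40", "gate_30", "gate_30", "gate_40"],
    "retention_7": [False, True, False, False],
})

le_toy_version = LabelEncoder()
version_codes = le_toy_version.fit_transform(toy["version"])

le_toy_retention = LabelEncoder()
retention_codes = le_toy_retention.fit_transform(toy["retention_7"])

# Build an explicit label -> code lookup from the sorted classes_
version_mapping = {str(label): code for code, label in enumerate(le_toy_version.classes_)}
print(version_mapping)
print(list(version_codes), list(retention_codes))
```

`inverse_transform` on the same encoder converts codes back to the original labels when you need to report results by name.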

Split your data into training and testing

#split data into predictors (X) and target (y)
X = df_cats.drop(['retention_7','retention_1','userid','sum_gamerounds'], axis=1)
y = df_cats['retention_7']

#split these data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)

31.4. Build a model

Can we predict whether a user is retained at 7 days from the game version they played?

  • Note: given the large class imbalance let’s make the model more sensitive to errors in the minority class. So, as class 1 (i.e., making it to 7 days) is rare, errors in predicting class 1 will “hurt” the model more. It does this by changing the loss function (something we have not covered in this class but you will be seeing more of in later classes!).
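To see what the "balanced" setting does to the weights, here is a small sketch on made-up labels (80 users not retained, 20 retained). It uses scikit-learn's `compute_class_weight` helper and checks it against the formula n_samples / (n_classes * class count).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 80 users not retained (0), 20 retained (1)
y_toy = np.array([0] * 80 + [1] * 20)

# Weights scikit-learn derives for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_toy)

# The same weights computed by hand
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights)
print(manual)
```

The rarer class gets the larger weight, so each error on a retained user counts for more during training.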

from sklearn.ensemble import RandomForestClassifier

#1. build the algorithm
classifier = RandomForestClassifier(class_weight="balanced")

#2. fit the algorithm to the data
classifier.fit(X_train, y_train)

Check how your model does on the test data

from sklearn.metrics import confusion_matrix

#predict on testing data
y_pred = classifier.predict(X_test)

#create a confusion matrix
cm = confusion_matrix(y_test, y_pred)

#visualize the confusion matrix (fmt="d" shows the counts as whole numbers)
sns.heatmap(cm, annot=True, fmt="d")
plt.xlabel('Predicted label')
plt.ylabel('True label')

from sklearn.metrics import accuracy_score, precision_score, recall_score

model_acc = accuracy_score(y_test, y_pred)
model_prec = precision_score(y_test, y_pred)
model_rec = recall_score(y_test, y_pred)

print(f"accuracy: {model_acc:.2f}" )
print(f"precision: {model_prec:.2f}" )
print(f"recall: {model_rec:.2f}" )

Not great! This suggests that the version a user is playing is not likely to have a big impact on whether they play for more than 7 days.
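When reading these metrics, keep in mind that accuracy alone can look fine under class imbalance. The sketch below uses made-up labels (80 not retained, 20 retained) and a hypothetical "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 80 not retained (0), 20 retained (1)
y_true = np.array([0] * 80 + [1] * 20)

# A baseline that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true, y_majority, zero_division=0)
rec = recall_score(y_true, y_majority)

print(f"accuracy: {acc:.2f}, precision: {prec:.2f}, recall: {rec:.2f}")
```

The baseline is right most of the time yet never finds a retained user, which is why precision and recall are worth checking alongside accuracy.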

31.5. Estimate the effect of version

Let’s use the model to estimate how the version changes the predicted probability of a user being retained at 7 days.

#1. Create a dataframe
df_question = pd.DataFrame({'version':[0,1]})

#2. Use the model to make predictions
question_pred =  classifier.predict_proba(df_question)

#3. Take a look at the answer
question_pred

Now we can calculate the difference in predicted probabilities, version 1 minus version 0.

question_pred[1,1] - question_pred[0,1]

This is known as the average treatment effect (ATE), i.e., how much the treatment (the game version) changes the outcome on average.
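As a sanity check, the same quantity can be computed without a model as a difference in group retention rates. This is a sketch on a tiny made-up data frame, not the real data. With a single binary predictor, an unweighted model's estimate should land close to this raw difference, while class weights are expected to shift the predicted probabilities.

```python
import pandas as pd

# Toy data: 5 users per version, outcomes invented for illustration
toy = pd.DataFrame({
    "version":     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "retention_7": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Retention rate within each version
rate_0 = toy.loc[toy["version"] == 0, "retention_7"].mean()
rate_1 = toy.loc[toy["version"] == 1, "retention_7"].mean()

# Average treatment effect: version 1 minus version 0
ate_raw = rate_1 - rate_0
print(f"rate_0={rate_0:.2f}, rate_1={rate_1:.2f}, ATE={ate_raw:.2f}")
```

A negative value here means version 1 retained a smaller share of users than version 0 in the toy data.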

As a final step, if presenting this information to someone, it’s a good idea to quantify the uncertainty around the effect. Let’s do that below with bootstrapping.

import numpy as np
from sklearn.utils import resample

n_bootstraps = 30  # kept small for speed; more resamples give a more stable interval
ates = []

for _ in range(n_bootstraps):
    # resample data
    X_b, y_b = resample(X_train, y_train, replace=True)
    # use the same settings as the classifier fitted earlier
    clf = RandomForestClassifier(class_weight="balanced").fit(X_b, y_b)

    # predict prob for version=0 and version=1
    df_question = pd.DataFrame({'version':[0,1]})
    probs = clf.predict_proba(df_question)
    ate = probs[1,1] - probs[0,1]
    ates.append(ate)

sns.histplot(ates)
plt.xlabel("ATE")
plt.show()

The histogram shows us the distribution of the bootstrapped average treatment effects. To make this easier to communicate, let’s summarize the distribution with its mean and a 95% confidence interval.

print(f"ATE: {np.mean(ates):.3f}")

ci_lower, ci_upper = np.percentile(ates, [2.5, 97.5])
print(f"ATE 95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
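The same percentile bootstrap can be sketched without fitting a model by resampling each group and recomputing the difference in retention rates. The data below are simulated and the retention rates (0.20 and 0.18) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated 0/1 retention outcomes for two groups (made-up rates)
group_0 = rng.binomial(1, 0.20, size=2_000)
group_1 = rng.binomial(1, 0.18, size=2_000)

n_resamples = 1_000
diffs = np.empty(n_resamples)

for i in range(n_resamples):
    # resample each group with replacement, keeping its original size
    sample_0 = rng.choice(group_0, size=group_0.size, replace=True)
    sample_1 = rng.choice(group_1, size=group_1.size, replace=True)
    diffs[i] = sample_1.mean() - sample_0.mean()

# Percentile interval: the middle 95% of the bootstrapped differences
boot_lower, boot_upper = np.percentile(diffs, [2.5, 97.5])
observed = group_1.mean() - group_0.mean()

print(f"observed difference: {observed:.3f}")
print(f"95% CI: ({boot_lower:.3f}, {boot_upper:.3f})")
```

If the interval includes 0, the data are compatible with there being no difference between the versions.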

Try redoing the exercise above without the increased penalty for the minority class, i.e., remove class_weight="balanced". How do the results change?

If time permits, try redoing the exercise with sum_gamerounds instead of retention_7 as the outcome variable. Do you come to the same conclusion about which version is better?

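31.6. Bonus

Try redoing the exercise above with a regression approach. You can use a scikit-learn model such as LinearRegression, or you can use the statsmodels formula API (smf), which gives more statistical output. The starter code below uses smf.logit, a logistic regression, because the outcome is binary.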

How do the results differ or remain the same?

import statsmodels.api as sm #for running regression!
import statsmodels.formula.api as smf

#split these data into training and testing datasets
# for smf we need to have retention_7 and version in a data frame
df_cats_train, df_cats_test = train_test_split(df_cats, test_size=0.20, stratify=df_cats['retention_7'])

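#1. Build the model (smf.logit is a logistic regression, suited to the binary outcome)
logit_model = smf.logit(formula='retention_7 ~ version', data=df_cats_train)

#2. Use the data to fit the model (i.e., find the best intercept and slope parameters)
# fit() returns a results object; the summary lives on that object, not on the model
logit_results = logit_model.fit()

#3. take a look at the summary
logit_results.summary()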
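Logit coefficients are on the log-odds scale, so they are not directly comparable to a difference in probabilities. The sketch below shows the conversion using invented coefficient values; in practice you would read the intercept and the version coefficient from the fitted model's summary.

```python
import numpy as np

def inv_logit(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, NOT results from the real data
intercept = -1.45      # log-odds of retention for version 0
version_coef = -0.05   # change in log-odds for version 1

p_version_0 = inv_logit(intercept)
p_version_1 = inv_logit(intercept + version_coef)

# Difference in predicted probabilities, version 1 minus version 0
effect = p_version_1 - p_version_0
print(f"p0={p_version_0:.3f}, p1={p_version_1:.3f}, difference={effect:.3f}")
```

This puts the regression result on the same scale as the average treatment effect estimated earlier, which makes the two approaches easier to compare.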

31.7. Further reading

If you would like the notebook without missing code check out the full code version.