{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "view-in-github"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/tbonne/peds/blob/main/docs/introComm/A_B_Testing.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "H_cCZ26oC09s"
      },
      "source": [
        "<img src='http://drive.google.com/uc?export=view&id=1JqHLLc4CGhWdmUAb0xCn1anYe6Qkl_U_' width=500>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "***"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BQpnBRMDe_mB"
      },
      "source": [
        "## <font color='darkorange'>A/B Testing</font>\n",
        "\n",
        "Here we will look at how to collect and analyze data to determine the difference between two groups. The idea here is that if we randomly assign individuals to two groups we end up with comparable groups. If we then measure how these two groups respond to a treatment (e.g., being given game version A vs. game version B) we can better determine the effect of that treatment. \n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yFmN1aLjfVej"
      },
      "source": [
        "We'll take a look at data collected to test how effective different versions of a game are at retaining users. \n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "R6oL7k7uftq4"
      },
      "outputs": [],
      "source": [
        "#load packages\n",
        "import pandas as pd\n",
        "import sklearn as sk\n",
        "import seaborn as sns\n",
        "from matplotlib import pyplot as plt\n",
        "from sklearn.model_selection import train_test_split"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GZ7MxOMHflXV"
      },
      "source": [
        "Load the data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "5NHlsf8GeMWW",
        "outputId": "42d4b2c8-2098-4ed6-d2e4-931154d98930"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>userid</th>\n",
              "      <th>version</th>\n",
              "      <th>sum_gamerounds</th>\n",
              "      <th>retention_1</th>\n",
              "      <th>retention_7</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>116</td>\n",
              "      <td>gate_30</td>\n",
              "      <td>3</td>\n",
              "      <td>False</td>\n",
              "      <td>False</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>337</td>\n",
              "      <td>gate_30</td>\n",
              "      <td>38</td>\n",
              "      <td>True</td>\n",
              "      <td>False</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>377</td>\n",
              "      <td>gate_40</td>\n",
              "      <td>165</td>\n",
              "      <td>True</td>\n",
              "      <td>False</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>483</td>\n",
              "      <td>gate_40</td>\n",
              "      <td>1</td>\n",
              "      <td>False</td>\n",
              "      <td>False</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>488</td>\n",
              "      <td>gate_40</td>\n",
              "      <td>179</td>\n",
              "      <td>True</td>\n",
              "      <td>True</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "   userid  version  sum_gamerounds  retention_1  retention_7\n",
              "0     116  gate_30               3        False        False\n",
              "1     337  gate_30              38         True        False\n",
              "2     377  gate_40             165         True        False\n",
              "3     483  gate_40               1        False        False\n",
              "4     488  gate_40             179         True         True"
            ]
          },
          "execution_count": 6,
          "metadata": {
            "tags": []
          },
          "output_type": "execute_result"
        }
      ],
      "source": [
        "#load data\n",
        "df_cats = pd.read_csv(\"/content/cookie_cats.csv\")\n",
        "\n",
        "#take a look\n",
        "df_cats.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mfKCZZ6mL0Fj"
      },
      "source": [
        "### <font color='darkorange'>Describe the data</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2XSKxhqkFsDY"
      },
      "source": [
        "How many in each group?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PB_UE20MFuEK"
      },
      "outputs": [],
      "source": [
        "?.value_counts()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DqhZStXAFzfZ"
      },
      "source": [
        "How many users returned after 7 days?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fcRRUsKgF4XF"
      },
      "outputs": [],
      "source": [
        "#gate placed at level 30\n",
        "df_cats[?=='gate_30'].retention_7.sum() / len(df_cats[df_cats['version']=='gate_30'])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bJPkzEdsGa0H"
      },
      "outputs": [],
      "source": [
        "#gate placed at level 40\n",
        "?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FovWxCFQkiTn"
      },
      "source": [
        "### <font color='darkorange'>Visualize the data</font>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Lm36a2uZmIW5"
      },
      "outputs": [],
      "source": [
        "#plot the differences between the versions\n",
        "?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lKdW7X6B0epK"
      },
      "source": [
        "### <font color='darkorange'>Wrangle the data</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jOHxEi5y0xLo"
      },
      "source": [
        "Convert the binary traget and binary input variable to 0/1"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wANllSTG0wjp"
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import OrdinalEncoder\n",
        "\n",
        "#get the columns names of features you'd like to turn into 0/1\n",
        "bin_names = ['retention_7','version']\n",
        "\n",
        "#create the OrdinalEncoder\n",
        "my_ordinal = ?()\n",
        "\n",
        "#fit and transform the data\n",
        "df_cats[bin_names] = my_ordinal.?(?)\n",
        "\n",
        "#take a look\n",
        "df_cats"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qkJ0wPp46_dK"
      },
      "source": [
        "Check which version is assigned to which value"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9R_97F8d67t4"
      },
      "outputs": [],
      "source": [
        "my_ordinal.categories_"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9oZGMz1Y7G79"
      },
      "source": [
        "Split your data into training and testing"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "id": "7uuaVICoRlMs"
      },
      "outputs": [],
      "source": [
        "#split these data into training and testing datasets\n",
        "df_train, df_test = train_test_split(df_cats, test_size=0.20, stratify=df_cats[['retention_7']])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OAk0s-hKmH7i"
      },
      "source": [
        "### <font color='darkorange'>Build a model</font>\n",
        "\n",
        "Can we predict which game version does better?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eHrG5Z1lRJ5q"
      },
      "outputs": [],
      "source": [
        "import statsmodels.api as sm #for running regression!\n",
        "import statsmodels.formula.api as smf\n",
        "\n",
        "#1. Build the model\n",
        "linear_reg_model = ?(formula='retention_7 ~ version ', data=?)\n",
        "\n",
        "#2. Use the data to fit the model (i.e., find the best intercept and slope parameters)\n",
        "linear_reg_results = linear_reg_model.?\n",
        "\n",
        "#3. take a look at the summary\n",
        "?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wGPMT5A6HvaA"
      },
      "source": [
        "Make predictions to get the probability (i.e., in the table these are values on the logit scale!)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lWdIDwxXH2ZB"
      },
      "outputs": [],
      "source": [
        "y_pred_prob = linear_reg_results.predict(df_train)\n",
        "y_pred_prob.hist()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GuaEe9OhJayk"
      },
      "source": [
        "Run the model again but this time add in the sum of the times they played the game in the first 2 weeks."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OYa4-d0ZJaf7"
      },
      "outputs": [],
      "source": [
        "import statsmodels.api as sm #for running regression!\n",
        "import statsmodels.formula.api as smf\n",
        "\n",
        "#1. Build the model\n",
        "linear_reg_model = ?(formula='retention_7 ~ version + sum_gamerounds  ', data=?)\n",
        "\n",
        "#2. Use the data to fit the model (i.e., find the best intercept and slope parameters)\n",
        "linear_reg_results = linear_reg_model.?()\n",
        "\n",
        "#3. take a look at the summary\n",
        "?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GoJsIwCmIKdI"
      },
      "source": [
        "Calculate the difference in predicted probability"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Yw_3YM9HIJ6M"
      },
      "outputs": [],
      "source": [
        "y_pred_prob = linear_reg_results.predict(df_train)\n",
        "y_pred_prob.hist()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "u1adWFDz7Luc"
      },
      "source": [
        "Check to make sure the pattern you found generalizes to the whitheld dataset. (i.e., are you overfitting)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CG0HiBf_7SR_"
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import confusion_matrix\n",
        "\n",
        "#predict on testing data\n",
        "y_pred_prob = ?\n",
        "\n",
        "#convert probs to 0/1\n",
        "y_pred = (y_pred_prob > 0.5).astype(int)\n",
        "\n",
        "#create a confusion matrix\n",
        "cm_logit = confusion_matrix(df_test.retention_7, ?)\n",
        "\n",
        "#visualize the confusion matrix\n",
        "sns.heatmap(cm_logit, annot=True)\n",
        "plt.xlabel('Predicted label')\n",
        "plt.ylabel('True label')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ia-v8NPYO1S8"
      },
      "source": [
        "Measure the accuracy, precision, and recall of the model on the test dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 201
        },
        "id": "KXDKdR2_O3FH",
        "outputId": "aa213a8f-e3bb-4cf7-8172-b8873f75b23d"
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import accuracy_score, precision_score, recall_score\n",
        "\n",
        "model_acc = accuracy_score(df_test.retention_7, ?)\n",
        "model_prec = precision_score(?)\n",
        "model_rec = recall_score(?)\n",
        "\n",
        "print(f\"accuracy: {model_acc}\" )\n",
        "print(f\"?: {model_prec}\" )\n",
        "print(f\"?: {?}\" )\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VNOUHRuV45oP"
      },
      "source": [
        "### <font color='darkorange'>Bonus</font>\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I54NX306uLG2"
      },
      "source": [
        "What does the model think retetion will change when we vary versions and sum_gamerounds?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HiCvLf23uG-B"
      },
      "outputs": [],
      "source": [
        "#1. Create a dataframe\n",
        "df_question = pd.DataFrame({'version':[?,?],\n",
        "                            'sum_gamerounds':?})\n",
        "                            \n",
        "#2. Use the model to make predictions\n",
        "question_pred =  ?(df_question)\n",
        "\n",
        "#3. add a column to the df_question\n",
        "df_question['predicted_retention'] = question_pred\n",
        "\n",
        "#4. plot the predictions\n",
        "?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DGoQRF1C5ZOP"
      },
      "source": [
        "Try to match those predictions based on your knowledge of the linear formula (y=a+bx)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hzVj-nAL5F6g"
      },
      "outputs": [],
      "source": [
        "import scipy\n",
        "\n",
        "#the following function can be used to convert numbers on the logit scale back into the probability scale\n",
        "scipy.special.expit(0)\n",
        "\n",
        "#i.e., on the logit scale 0 is equivalent to 0.5 probability"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zh4cX-XM6He4"
      },
      "outputs": [],
      "source": [
        "#what was the intercept and slope of the line your model estimated? \n",
        "intercept = ?\n",
        "b_version = ?\n",
        "b_sumGame = ?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FzGMKz9Q50HO"
      },
      "outputs": [],
      "source": [
        "#probability of retention for version 0\n",
        "scipy.special.expit(intercept + b_version * 0 + b_sumGame*100)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "uS3l9wxf5z-e"
      },
      "outputs": [],
      "source": [
        "#probability of retention for version 1\n",
        "scipy.special.expit(intercept + b_version * 1 + b_sumGame*100)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BJqFrYyyYzMx"
      },
      "source": [
        "### <font color='darkorange'>Further reading</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_A_B_Testing.ipynb) version."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "authorship_tag": "ABX9TyPmO6yEZIxgXnEr8J0sF8Bc",
      "collapsed_sections": [],
      "include_colab_link": true,
      "mount_file_id": "1ZSejLEaSqtPw_BnKWwNvlKloiaTPp-zf",
      "name": "A_B_Testing.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}