

24. Class Data Challenge II

Welcome to Data Challenge II. In this challenge we will break into groups to work on a large dataset. The goal is to achieve the best predictive performance while still being able to explain how the model makes its predictions.

Context

Congratulations! You’ve just been hired by the local school board to help them make decisions about how to reduce alcohol consumption among high schoolers. They have provided you with a dataset collected over the last year from anonymous students. They would like to know the following:

  1. Can you build a model to help identify individuals that might be high drinkers?

  2. What are some of the most important factors leading to these predictions?

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
from sklearn.model_selection import cross_val_score, train_test_split

Load in the dataset

df = pd.read_csv('/content/students.csv')

df

24.1. Data wrangling

Get your data ready for visualization and modeling!
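As a starting point, here is a minimal wrangling sketch. The real column names in students.csv are not shown here, so it runs on a small made-up table: `sex`, `studytime`, `absences`, and `alcohol` are hypothetical stand-ins, and the cutoff used to define a "high drinker" is an arbitrary assumption that your group should choose for itself.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the students table; every column name here is hypothetical.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=200),
    "studytime": rng.integers(1, 5, size=200),
    "absences": rng.integers(0, 30, size=200).astype(float),
    "alcohol": rng.integers(1, 6, size=200),  # assumed 1 (low) to 5 (high) rating
})
toy.loc[::25, "absences"] = np.nan  # inject a few missing values to clean up

# 1. Fill missing numeric values with the column median.
toy["absences"] = toy["absences"].fillna(toy["absences"].median())

# 2. Turn the rating into a binary target; the cutoff of 3 is an arbitrary choice.
toy["high_drinker"] = (toy["alcohol"] >= 3).astype(int)

# 3. One-hot encode categorical columns and separate features from the target.
#    The raw rating is dropped from X so the target does not leak into the features.
X = pd.get_dummies(toy.drop(columns=["alcohol", "high_drinker"]), drop_first=True)
y = toy["high_drinker"]

print(X.head())
print(y.value_counts())
```

The same three steps (handle missing values, define the target, encode categoricals) should carry over to the real table once you swap in its actual columns.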

24.2. Data Visualization and Exploration

Let’s look at the distribution of the variable we are trying to predict.
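One way to do this is a simple bar chart of how often each value of the target occurs. The sketch below uses a made-up `alcohol` rating on an assumed 1 to 5 scale; replace it with whichever column your group is predicting.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical target: an alcohol-consumption rating from 1 (low) to 5 (high).
rng = np.random.default_rng(0)
alcohol = pd.Series(rng.integers(1, 6, size=200), name="alcohol")

# Count how many students fall in each category, in rating order.
counts = alcohol.value_counts().sort_index()
shares = counts / counts.sum()

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("alcohol rating (hypothetical column)")
ax.set_ylabel("number of students")
ax.set_title("Distribution of the target")
fig.tight_layout()

print(shares.round(2))
```

If one class is much rarer than the others, plain accuracy can be misleading, so it is worth checking the class shares before choosing a metric.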

24.3. Model building and fitting

Build your model then fit it to the data!
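The sketch below shows one possible setup, not a recommended answer: a scaled logistic regression fitted on a training split. The data are synthetic (generated with `make_classification`) and stand in for your wrangled feature matrix and binary target; the model choice is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wrangled features X and binary target y.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)

# Hold out 25% of the rows; stratify keeps the class balance equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is learned from the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(model.predict(X_test[:5]))
```

Keeping the test rows untouched until the end gives a fairer picture of how the model behaves on data it has not seen.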

How well does it perform?
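A hedged sketch of one way to answer this: compare cross-validated accuracy against a trivial baseline that always predicts the most frequent class, then look at the confusion matrix on held-out data. The data are again synthetic, and accuracy is only one of several reasonable metrics.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wrangled features X and binary target y.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline = DummyClassifier(strategy="most_frequent")

# 5-fold cross-validation on the training split only.
cv_model = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
cv_base = cross_val_score(baseline, X_train, y_train, cv=5, scoring="accuracy")
print(f"model CV accuracy:    {cv_model.mean():.3f} +/- {cv_model.std():.3f}")
print(f"baseline CV accuracy: {cv_base.mean():.3f}")

# Final check on rows the model has never seen.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
cm = confusion_matrix(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.3f}")
print(cm)  # rows are true classes, columns are predicted classes
```

A model is only interesting if it clearly beats the baseline; the confusion matrix shows whether its errors are mostly missed high drinkers or false alarms.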

24.4. Model inference

Let’s see which features the model relied on most when making its predictions.
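Permutation importance is one model-agnostic way to do this: shuffle one feature at a time and measure how much the held-out score drops. The sketch below assumes a fitted scikit-learn model and uses synthetic data with placeholder feature names; other approaches (model coefficients, tree-based importances) are equally valid choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wrangled data; the feature names are placeholders.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Shuffle each feature 20 times on the held-out rows and record the score drop.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
importances = pd.Series(result.importances_mean, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances)
```

Features with importance near zero can be shuffled without hurting the score, which suggests the model barely uses them. Keep in mind that importance describes what the model relies on, not what causes drinking.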

24.5. New data

Let me know when your group is happy with your model and I’ll release some new data! We can use this new data to compare groups and see which one has the most useful model, meaning the one that predicts best on data it was not built with (i.e., it generalizes well).
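When the new data arrives, it has to go through the same wrangling as the training data and end up with the same columns in the same order. The sketch below illustrates one way to handle that with made-up data; `school`, `absences`, and `high_drinker` are hypothetical column names, and `make_toy` is a helper invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_toy(n, schools):
    """Hypothetical stand-in for a batch of student records."""
    absences = rng.integers(0, 30, size=n)
    noisy = absences + rng.normal(0, 5, size=n)
    return pd.DataFrame({
        "school": rng.choice(schools, size=n),
        "absences": absences,
        "high_drinker": (noisy > 15).astype(int),
    })

train = make_toy(300, ["A", "B", "C"])
new = make_toy(100, ["A", "B"])  # the new batch happens to contain no school "C"

# Fit on the original data.
X_train = pd.get_dummies(train.drop(columns="high_drinker"), dtype=float)
model = LogisticRegression(max_iter=1000).fit(X_train, train["high_drinker"])

# Encode the new data, then force the training columns in the training order.
# Categories missing from the new batch become all-zero columns.
X_new = pd.get_dummies(new.drop(columns="high_drinker"), dtype=float)
X_new = X_new.reindex(columns=X_train.columns, fill_value=0.0)

new_accuracy = model.score(X_new, new["high_drinker"])
print(f"accuracy on new data: {new_accuracy:.3f}")
```

Without the `reindex` step, a category that is absent from the new batch would leave the feature matrix with fewer columns than the model was trained on, and prediction would fail.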

24.6. Bonus

Let’s try that again with another dataset!

df = pd.read_csv('/content/customer_churn.csv')

df