24. Class Data Challenge II

Welcome to Data Challenge II. In this challenge we will break into groups to work on a large dataset. The goal is to achieve the highest prediction accuracy while still being able to explain how the model makes its predictions.

Context

Congratulations, you’ve just been hired by the local school board to help them make decisions about how to reduce alcohol consumption among high schoolers. They have provided you with a dataset collected over the last year from anonymous students. They would like to know the following:

Can you build a model to help identify individuals that might be high drinkers?
What are some of the most important factors leading to these predictions?

import pandas as pd
import numpy as np

import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

Load in the dataset

df = pd.read_csv('/content/students.csv')

df

24.1. Data wrangling

Get your data ready for visualization and modeling!
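As a starting point, here is a minimal wrangling sketch on a toy table. The real columns in students.csv are not shown in this notebook, so every column name below (`age`, `sex`, `absences`, `weekend_alcohol`) and the cutoff used to define a "high drinker" are assumptions for illustration only. Swap in the actual names after inspecting `df.columns`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the student data; every column name here is hypothetical.
toy = pd.DataFrame({
    "age": [15, 16, 17, 18, 16, np.nan],
    "sex": ["F", "M", "M", "F", "F", "M"],
    "absences": [0, 4, 10, 2, 6, 3],
    "weekend_alcohol": [1, 2, 5, 4, 1, 3],  # assumed scale: 1 (low) to 5 (high)
})

# 1. Drop rows with missing values (imputing is the other common choice).
clean = toy.dropna().copy()

# 2. Turn the assumed 1-5 scale into a binary target: 1 = high drinker.
#    The cutoff of 4 is an arbitrary choice for this sketch.
clean["high_drinker"] = (clean["weekend_alcohol"] >= 4).astype(int)

# 3. One-hot encode categorical columns so scikit-learn can use them.
X = pd.get_dummies(
    clean.drop(columns=["weekend_alcohol", "high_drinker"]),
    drop_first=True,
)
y = clean["high_drinker"]

print(X.head())
print(y.value_counts())
```

Note that the column used to build the target is dropped from the features. Leaving it in would let the model "cheat" by reading the answer directly.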

24.2. Data Visualization and Exploration

Let’s look at the distribution of the variable we are trying to predict.
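One way to look at the target distribution is to count the classes and draw a bar chart. The sketch below uses a made-up binary target named `high_drinker` (a hypothetical name, not taken from the dataset). With the real data you would pass your own target column instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical binary target: 1 = high drinker, 0 = not.
y = pd.Series([0, 0, 1, 0, 1, 0, 0, 0, 1, 0], name="high_drinker")

# Absolute counts and class shares, ordered by class label.
counts = y.value_counts().sort_index()
shares = y.value_counts(normalize=True).sort_index()
print(counts)
print(shares)

# Bar chart of the class counts (shown automatically in a notebook cell).
fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("high_drinker")
ax.set_ylabel("number of students")
ax.set_title("Distribution of the target")
```

If one class is much rarer than the other, plain accuracy can look good even for a model that always predicts the majority class, so it is worth checking the shares before choosing a metric.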

24.3. Model building and fitting

Build your model then fit it to the data!

How well does it perform?
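A possible workflow is sketched below: hold out a test set, estimate performance with cross-validation on the training part, then fit and score once on the held-out part. The data is synthetic, and the feature names (`goes_out`, `absences`, `study_time`) and the rule that generates the target are invented for this example. The random forest is only one reasonable choice; a logistic regression would be easier to explain to the school board.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; feature names and the target rule are made up.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "goes_out": rng.integers(1, 6, n),
    "absences": rng.integers(0, 30, n),
    "study_time": rng.integers(1, 5, n),
})
noise = rng.normal(0, 1, n)
y = ((X["goes_out"] - X["study_time"] + noise) > 1).astype(int)

# Hold out 25% of the rows; stratify to keep the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation on the training data only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Fit on all training rows, then score once on the held-out test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy: %.3f" % test_accuracy)
print(classification_report(y_test, y_pred))
```

Keeping the test set untouched until the end gives a more honest estimate of how the model might do on the new data released later.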

24.4. Model inference

Let’s see what the model found important for making predictions.
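Two common ways to ask a tree-based model what mattered are its built-in impurity importances and permutation importance on held-out data. The sketch below compares them on synthetic data in which, by construction, only the hypothetical `goes_out` feature drives the target and `random_noise` is irrelevant. Both feature names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends only on "goes_out" (names are made up).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "goes_out": rng.integers(1, 6, n),
    "random_noise": rng.normal(0, 1, n),
})
y = (X["goes_out"] >= 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances come for free with the fitted forest.
impurity = pd.Series(model.feature_importances_, index=X.columns)

# Permutation importance: drop in test score when one column is shuffled.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
permutation = pd.Series(result.importances_mean, index=X.columns)

print(impurity.sort_values(ascending=False))
print(permutation.sort_values(ascending=False))
```

Permutation importance is computed on the test split here, so it reflects what the model relies on for unseen students rather than what it memorized during training.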

24.5. New data

Let me know when your group is happy with your model and I’ll release some new data! We can use this new data to compare groups and see which one has the most useful model, that is, the one that makes accurate predictions on data it was not built with (i.e., it generalizes well).

24.6. Bonus

Let’s try that again with another dataset!

df = pd.read_csv('/content/customer_churn.csv')

df

Practical exercises in data science - PEDS
