29. Class Data Challenge IIb
Welcome to data challenge II (second round)! In this challenge we will build a predictive model. Our goal will be to achieve the highest prediction accuracy, explain how the model is making those predictions, and use the model to suggest an intervention in a system!
Context
Congratulations, you’ve just been hired by a mysterious company X to help them increase the chance that Y will occur. They don’t want to provide you with too much detail due to privacy concerns. They did provide you with a dataset; however, the labels have been removed.
They would like to know the following:
Can you build a model to predict the variable “c”?
What are some of the most important factors leading to these predictions?
What can they do to increase the chance that “c” occurs (i.e., that c is a 1 and not a 0)?
import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
Load in the dataset
df = pd.read_csv('/content/company_X_unlabeled.csv')
df
29.1. Data wrangling
Get your data ready for visualization and modeling!
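As a starting point, here is a minimal wrangling sketch. It runs on a small synthetic stand-in rather than the company X file, so the feature names `f0` and `f1`, the helper `wrangle`, and the median-fill choice are all assumptions made for illustration. Only the target name `c` comes from the challenge.

```python
import numpy as np
import pandas as pd


def wrangle(df, target="c"):
    """Drop duplicate rows, fill numeric gaps with the column median,
    and split the features from the target (hypothetical helper)."""
    df = df.drop_duplicates().copy()
    # Rows without a target value cannot be used for supervised learning.
    df = df.dropna(subset=[target])
    numeric_cols = [
        col for col in df.select_dtypes(include="number").columns if col != target
    ]
    # Median fill is one simple choice; it is an assumption, not a requirement.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
    X = df.drop(columns=[target])
    y = df[target].astype(int)
    return X, y


# Synthetic stand-in for the real dataset (made-up values).
demo = pd.DataFrame(
    {
        "f0": [1.0, 2.0, np.nan, 4.0, 4.0],
        "f1": [0.5, 0.1, 0.3, 0.9, 0.9],
        "c": [0, 1, 0, 1, 1],
    }
)

X, y = wrangle(demo)
print(X)
print(y.value_counts())
```

With the real data you would call `wrangle(df)` after loading the file. Check `df.dtypes` and `df.isna().sum()` first to decide whether these steps are appropriate.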
29.2. Data Visualization and Exploration
Let’s look at the distribution of the variable we are trying to predict.
import seaborn as sns
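One way to look at that distribution is a count plot together with the class shares. The sketch below uses a synthetic `c` column as a stand-in (an assumption, since the real file is not reproduced here). With the real data you would pass `df` instead of `demo`.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the target column (made-up values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"c": rng.integers(0, 2, size=200)})

# Share of rows in each class; a strong imbalance would affect how we score models.
class_share = demo["c"].value_counts(normalize=True).sort_index()
print(class_share)

# Bar chart of the raw counts per class.
ax = sns.countplot(data=demo, x="c")
ax.set_xlabel("c")
ax.set_ylabel("Number of rows")
ax.set_title("Distribution of the target c")
```

If one class is much rarer than the other, plain accuracy can look good for a model that always predicts the majority class, so it is worth knowing the split before modeling.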
29.3. Model building and fitting
Build your model then fit it to the data!
How well does it perform?
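Here is one possible way to build, fit, and score a model, using the `train_test_split` and `cross_val_score` helpers imported earlier. The random forest, the synthetic features `f0` to `f2`, and the 75/25 split are assumptions chosen for illustration, not the required approach.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: c depends on f0 and f1, while f2 is pure noise.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {
        "f0": rng.normal(size=n),
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
    }
)
y = (X["f0"] + 0.5 * X["f1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Hold out a test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("CV accuracy per fold:", cv_scores.round(3))

# Fit on all training rows, then score once on the held-out test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Held-out accuracy:", round(test_accuracy, 3))
```

Comparing the cross-validation scores with the held-out score gives a rough sense of whether the model is overfitting.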
29.4. Model inference
Let’s see which features the model relied on most when making predictions.
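Permutation importance is one common way to answer this: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below again uses synthetic data and a random forest, both of which are assumptions. The names `f0`, `f1`, and `f2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: f0 matters most, f1 matters less, f2 is pure noise.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {
        "f0": rng.normal(size=n),
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
    }
)
y = (X["f0"] + 0.5 * X["f1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the mean drop in accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
ranked = pd.Series(result.importances_mean, index=X.columns).sort_values(
    ascending=False
)
print(ranked)
```

Importance tells you which features the model uses, not the direction of their effect or whether changing them would cause c to change.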
29.5. New data
Let me know when your group is happy with your model and I’ll release some new data! We can use this new data to compare groups and see which has the most useful model: one that predicts well on data it was not built with (i.e., it generalizes well).
Write a statement to company X explaining how they can increase the chance that c will occur.
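To back up such a statement, one option is a simple "what if" check: shift one feature and see how the model's average predicted probability of c = 1 changes. This is only a sketch under assumptions. The data are synthetic, `what_if` is a hypothetical helper, and logistic regression is swapped in because its predicted probabilities are easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: higher f0 makes c = 1 more likely.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame(
    {
        "f0": rng.normal(size=n),
        "f1": rng.normal(size=n),
        "f2": rng.normal(size=n),
    }
)
y = (X["f0"] + 0.5 * X["f1"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression().fit(X, y)


def what_if(model, X, feature, shift):
    """Mean predicted P(c = 1) before and after adding `shift` to one feature."""
    baseline = model.predict_proba(X)[:, 1].mean()
    X_shifted = X.copy()
    X_shifted[feature] = X_shifted[feature] + shift
    shifted = model.predict_proba(X_shifted)[:, 1].mean()
    return baseline, shifted


baseline, shifted = what_if(model, X, "f0", 1.0)
print(f"Mean P(c = 1) before: {baseline:.3f}, after raising f0 by 1: {shifted:.3f}")
```

A change in predicted probability shows what the model expects, not what would happen in the real system. The model captures associations, so any recommended intervention should be framed as a hypothesis for the company to test.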
29.6. Bonus
Let’s try that again with the labeled dataset and see if our domain expertise can help us build a better model and write a better statement about how they might increase the chance that c occurs!
df = pd.read_csv('/content/company_X_labeled.csv')
df