<a href="https://colab.research.google.com/github/tbonne/peds/blob/main/docs/introViz/IntroViz3_scatterplots.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='http://drive.google.com/uc?export=view&id=1C2o3BW9_N9LkeIi2lM4viUwvXK75G6Nc'>

***

## <font color='darkorange'>Scatter plots</font>

In this exercise we will learn how to plot data in scatter plots. Unlike the previous examples with histograms and density plots, scatter plots will let us look at two variables at once (i.e., bivariate relationships).



Import the libraries


In [None]:
#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


Bring in the NYC flights data

In [None]:
df_flights = ?

In our previous exercises we looked at departure and arrival delays separately, but are they related to each other? Let's use a scatter plot to see whether higher departure delays go along with higher arrival delays.


In [None]:
#plot a scatterplot
sns.scatterplot(data=df_flights, x='dep_delay',y='arr_delay')

Let's add some nicer labels to the axes of the plot.

In [None]:
#scatterplot with labels
sns.scatterplot(data=df_flights, x='dep_delay',y='arr_delay').set(xlabel='Departure delay (minutes)', ylabel='Arrival delay (minutes)')

#save file
plt.savefig("delay_scatterplot.png")

This plot shows that when departure delays are high, arrival delays tend to be high as well. In other words, flights that leave late are generally not able to make up the lost time and end up arriving late at their destination.

### <font color='darkorange'>Estimate correlations</font>

> We will use the `corr` function built into pandas to estimate the correlation between departure delay and arrival delay.

In [None]:
#estimate correlation
df_flights.arr_delay.corr(df_flights.dep_delay)
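To make the `corr` call a little more concrete, here is a minimal sketch on a tiny made-up table (the numbers are invented for illustration and are not from the flights data). It assumes the default Pearson method and shows two things that are easy to miss: the result is the same whichever column you call `corr` on, and rows with a missing value in either column are left out of the calculation.

```python
import pandas as pd

# Toy data, made up for illustration (not taken from the flights file).
# The last row has a missing departure delay.
toy = pd.DataFrame({
    "dep_delay": [0.0, 10.0, 20.0, 30.0, float("nan")],
    "arr_delay": [5.0, 12.0, 26.0, 31.0, 40.0],
})

# Pearson correlation (the pandas default) between the two columns
r = toy["arr_delay"].corr(toy["dep_delay"])

# Swapping the two columns gives the same value
r_swapped = toy["dep_delay"].corr(toy["arr_delay"])

# Dropping incomplete rows first gives the same value too,
# because corr already ignores pairs with a missing entry
complete = toy.dropna()
r_complete = complete["arr_delay"].corr(complete["dep_delay"])

print(round(r, 3), round(r_swapped, 3), round(r_complete, 3))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.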

<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width="100">  

Try and estimate correlations between a few other variables

In [None]:
#take a look at potential variables to compare (i.e., what columns/variables do we have)
?

In [None]:
#estimate the correlation between two variables
df_flights.?.corr(df_flights.?)

What is the largest correlation you can find? Can you also plot this relationship as a scatter plot?

In [None]:
#Scatterplot
sns.scatterplot(data=df_flights, x='?',y='?')

### <font color='darkorange'>Compare many variables using pair plots</font>

In [None]:
#let's choose some variables to look at 
df_flights_pairs = df_flights[["arr_delay","dep_delay","distance","carrier"]] #notice it did not use carrier... why?

#use the pairplot method to look at all combinations of these variables
sns.pairplot(df_flights_pairs)

<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width="100">  

Try and visualize the relationships between a few variables.

In [None]:
#Choose some variables to look at 
df_flights_pairs = df_flights[["?","?","?"]] #notice it did not use carrier... why?

#use the pairplot method to look at all combinations of these variables
sns.pairplot(?)

### <font color='darkorange'>Heat Maps</font>

We will use our newfound correlation skills to search for patterns in our data more effectively using heat maps! These maps can quickly help us identify high and low correlations between our variables.


In [None]:
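#run a correlation on all combinations of numeric variables in df_flights
#(numeric_only=True skips text columns, which recent pandas versions would otherwise raise an error on)
corrmat = df_flights.corr(numeric_only=True)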

#plot the results as a heat map
sns.heatmap(corrmat, square=False)

In [None]:
#plot the results as a heat map (this time let's make the figure bigger)
plt.subplots(figsize=(12,9))
sns.heatmap(corrmat, square=False)

From the colour bar on the right we can see that lighter colours mark pairs of variables with high positive correlations, while darker colours mark pairs with larger negative correlations.
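If the default colours are hard to read, one option is a diverging colour map centred on zero, so that positive and negative correlations get visibly different colours. The sketch below is only an illustration: the small table is made up (the `speed` column is hypothetical and not part of the flights data), and the colour map choice is a matter of taste. The `vmin`, `vmax`, `center`, `cmap`, `annot` and `fmt` arguments are standard options of `sns.heatmap`.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy numeric table, made up for illustration (stands in for df_flights)
toy = pd.DataFrame({
    "dep_delay": [0, 10, 20, 30, 45],
    "arr_delay": [5, 12, 26, 31, 50],
    "speed": [480, 470, 455, 450, 430],  # hypothetical column
})

# Correlation between every pair of columns
toy_corr = toy.corr()

# Fix the colour scale to the full -1..1 range and centre it on 0,
# then print each correlation value inside its cell
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(toy_corr, vmin=-1, vmax=1, center=0,
            cmap="coolwarm", annot=True, fmt=".2f", ax=ax)
```

Fixing the scale to the range -1 to 1 also makes heat maps from different datasets directly comparable, since the same colour always means the same correlation.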

Things to think about:
- do all these comparisons make sense? e.g., flight# and distance?
- what variable types are there?
- why are year and month not showing any values?

### <font color='darkorange'>Further reading</font>

Check out seaborn's very nice page on [plotting relationships using scatterplots](https://seaborn.pydata.org/tutorial/relational.html).

> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroViz3_scatterplots.ipynb) version.

### <font color='darkorange'>Bonus material</font>

Visualizing your data is a very important step in any data science workflow. Let's take a look at the case below, where four separate datasets have the same mean and standard deviation but differ wildly in how their data are distributed.





<img src='http://drive.google.com/uc?export=view&id=1o1sS3SVNg7SFivTay0damWoN9hq9-0ap'>

Let's load in the data.

In [None]:
df_anscombe = pd.read_json("/content/sample_data/anscombe.json")

df_anscombe.head()

First let's show that each has the same summary statistics.

In [None]:
df_anscombe.groupby('Series').mean()

In [None]:
df_anscombe.groupby('Series').std()
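The two summaries can also be put side by side, together with the correlation between X and Y within each series. The sketch below is a self-contained illustration, not the notebook's own method: it types in series I and II of Anscombe's quartet by hand so that it runs without the Colab sample file, and it assumes the same column names as `df_anscombe` (`Series`, `X`, `Y`).

```python
import pandas as pd

# Series I and II of Anscombe's quartet, typed in by hand so the sketch
# runs without the Colab sample file. Column names mirror df_anscombe.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = pd.DataFrame({
    "Series": ["I"] * 11 + ["II"] * 11,
    "X": x + x,
    "Y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
       + [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
})

# Mean and standard deviation of X and Y for each series, in one table
summary = quartet.groupby("Series")[["X", "Y"]].agg(["mean", "std"])

# Correlation between X and Y within each series
xy_corr = quartet.groupby("Series")[["X", "Y"]].apply(
    lambda g: g["X"].corr(g["Y"])
)

print(summary.round(2))
print(xy_corr.round(2))
```

Both series come out with nearly identical means, standard deviations and correlations, which is exactly why the summary numbers alone cannot tell them apart.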

Now let's take a look at each series using scatter plots.

In [None]:
sns.scatterplot(data=df_anscombe[(df_anscombe.Series=="I")],x="X",y="Y")

In [None]:
sns.scatterplot(data=df_anscombe[(df_anscombe.Series=="II")],x="X",y="Y")

In [None]:
sns.scatterplot(data=df_anscombe[(df_anscombe.Series=="III")],x="X",y="Y")

In [None]:
sns.scatterplot(data=df_anscombe[(df_anscombe.Series=="IV")],x="X",y="Y")

Even though each of these series of points has the same descriptive statistics (mean and standard deviation), the points are distributed very differently. This is why it is important to visualize your data!