
18. Project 1 - Exploratory Data Analysis

In this project you will perform an exploratory data analysis (EDA) using visualizations and correlations. You may choose one of the datasets in the class's shared data folder, or search for a dataset that interests you the most! Kaggle is a good place to start, as it often has relatively clean and easy-to-use datasets, but feel free to explore other places. There is a lot of data out there!

In this project you will:

  1. Choose and download a dataset

  2. Get summary statistics for key variables

  3. Create visuals to help understand your data

  4. Use correlation to measure relationships between key variables

  5. Summarise how EDA helped (or not!) in understanding your dataset

Import Python libraries


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

18.1. Data

Action: Import your data into Colaboratory.
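As a rough sketch of this step, assuming your dataset is a CSV file: once the file is available in the Colab session (for example, uploaded through the Files sidebar or read from a mounted Google Drive), `pd.read_csv` can load it into a DataFrame. The column names and values below are made up for illustration, and an in-memory string stands in for the file so the example runs anywhere.

```python
import io

import pandas as pd

# Stand-in for a downloaded file: a tiny made-up CSV held in memory.
# With a real file you would pass its path instead, for example
# pd.read_csv("/content/my_dataset.csv")  (hypothetical path).
csv_text = """species,height_cm,mass_kg
oak,1200,950.5
birch,800,310.0
pine,1500,1100.2
birch,760,295.8
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows, as a quick sanity check
```

Checking the shape and the first rows straight after loading is a cheap way to confirm that the delimiter and header row were read as expected.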

Action: Determine the types of data you are dealing with. Marks (0.5)
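One possible way to approach this, sketched on a small made-up DataFrame (the columns are hypothetical): `df.dtypes` reports how pandas has stored each column, and `select_dtypes` splits the columns into numeric and non-numeric groups. The storage type is only a starting point; a column of integers may still be a category code or an ID rather than a true quantity.

```python
import pandas as pd

# Made-up example data; replace with your own DataFrame.
df = pd.DataFrame({
    "species": ["oak", "birch", "pine", "birch"],
    "height_cm": [1200, 800, 1500, 760],
    "mass_kg": [950.5, 310.0, 1100.2, 295.8],
})

# Storage type of every column as pandas sees it.
print(df.dtypes)

# Split the column names into numeric and non-numeric groups.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Non-numeric:", categorical_cols)
```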

18.2. Summary statistics

Action: Estimate the summary statistics of some of the key variables, and describe what you find. Marks (1)
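A minimal sketch of this step, again on made-up data: `describe()` returns the count, mean, standard deviation, minimum, quartiles and maximum for numeric columns, while `value_counts()` is often more informative for a categorical column. Which variables count as "key" is for you to decide and justify.

```python
import pandas as pd

# Made-up example data; replace with your own DataFrame.
df = pd.DataFrame({
    "species": ["oak", "birch", "pine", "birch"],
    "height_cm": [1200, 800, 1500, 760],
    "mass_kg": [950.5, 310.0, 1100.2, 295.8],
})

# Summary statistics for the numeric columns only.
summary = df[["height_cm", "mass_kg"]].describe()
print(summary)

# Frequency of each category in a non-numeric column.
species_counts = df["species"].value_counts()
print(species_counts)
```

Comparing the mean with the median (the `50%` row) gives a first hint of skew, which the plots in the next section can confirm or contradict.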

18.3. Visualize the data

Action: Visualize the distribution of values for some key variables. Marks (2)
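The sketch below shows one way to look at distributions, assuming one numeric and one categorical variable: a histogram for the former and a bar chart of counts for the latter. The data and column names are invented, and the bin count is an arbitrary choice you would normally tune. In a notebook the figure is displayed at the end of the cell; in a script you would call `plt.show()`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up example data; replace with your own DataFrame.
df = pd.DataFrame({
    "species": ["oak", "birch", "pine", "birch", "oak", "pine", "birch", "oak"],
    "height_cm": [1200, 800, 1500, 760, 1150, 1420, 830, 1310],
})

fig, (ax_num, ax_cat) = plt.subplots(1, 2, figsize=(10, 4))

# Numeric variable: histogram of values.
ax_num.hist(df["height_cm"], bins=5, edgecolor="black")
ax_num.set_xlabel("Height (cm)")
ax_num.set_ylabel("Number of trees")
ax_num.set_title("Distribution of tree height")

# Categorical variable: bar chart of category counts.
counts = df["species"].value_counts()
ax_cat.bar(counts.index.tolist(), counts.tolist())
ax_cat.set_xlabel("Species")
ax_cat.set_ylabel("Number of trees")
ax_cat.set_title("Trees per species")

fig.tight_layout()
```

Seaborn offers higher-level equivalents such as `sns.histplot` and `sns.countplot` if you prefer them.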

Q1: Explain your choice of plots using the five visualization components: Marks (2.5)

  1. Data component – what kinds of data are you dealing with?

  2. Graphical component – what kinds of plot can you use?

  3. Label component – what should be on the plot axes?

  4. Esthetic component – what should your plot say, and how best to do this?

  5. Ethical component – is the graph misleading? What is left out?

18.4. Correlations

Action: Use correlation to estimate the relationship between some of the key variables. Try exploring for interesting relationships using heatmaps. Marks (1)
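As an illustrative sketch, assuming the key variables are numeric: `DataFrame.corr()` computes pairwise Pearson correlations by default, and `sns.heatmap` can display the resulting matrix. The data below is invented. Fixing the colour scale to the range -1 to 1 keeps the colours comparable across different heatmaps.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up example data; replace with your own DataFrame.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 83],
    "hours_gaming": [9, 7, 8, 4, 3, 1],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))

# Heatmap with the coefficient written in each cell.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Pairwise Pearson correlation")
fig.tight_layout()
```

Passing `method="spearman"` to `corr()` is an option when a relationship looks monotonic but not linear.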

Q2: Choose one or two correlations and describe what the magnitude and direction of the correlation suggest about the relationship between the two variables. Marks (2)
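To make "magnitude and direction" concrete, here is a small sketch on invented numbers. The sign of Pearson's r gives the direction (positive: the variables tend to rise together; negative: one tends to fall as the other rises), and the absolute value, between 0 and 1, indicates how closely the points follow a straight line. Note that r only measures linear association and says nothing about cause.

```python
import pandas as pd

# Made-up data chosen to show two contrasting cases.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y_down": [10, 8, 6, 4, 2],   # falls by exactly 2 for each unit of x
    "y_noisy": [2, 1, 4, 3, 5],   # tends to rise with x, but not on a line
})

# Pearson correlation between two individual columns.
r_down = df["x"].corr(df["y_down"])
r_noisy = df["x"].corr(df["y_noisy"])

for name, r in [("x vs y_down", r_down), ("x vs y_noisy", r_noisy)]:
    direction = "negative" if r < 0 else "positive"
    print(f"{name}: r = {r:.2f} ({direction}, magnitude {abs(r):.2f})")
```

The first pair lies exactly on a downward line, so its magnitude is at the maximum; the second pair has the same general upward trend throughout but with scatter, so its magnitude is lower.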

18.5. Discussion

Q3: Did this exploratory data analysis help you better understand your chosen dataset? If so, how? Are there still parts that don’t make sense? Marks (1)

The aim of this question is not to test whether you know everything about the dataset, but to see how EDA might have helped (or not!).