Open In Colab


7. Using existing data

In this exercise we will start working with some real data. We will first learn how to load in data, then test out what you’ve learnt about DataFrames.

Outline:

  • Load in data

  • Explore the dataframe

  • Manipulate the dataframe

#Load in pandas
import pandas as pd

7.1. Loading data

There are many ways to load in datasets, here we’ll look at two methods.

But first let’s get the data:

Go to the shared data folder for the class and download nyc_flight_data.csv

Let’s then load in the flights data using a function from pandas called read_csv.

Option 1 - load from temporary folder (easier but not permanent).

  1. Drag and drop the data into the files tab on the left.

  2. Then right click and copy the path.

  3. Then paste the path into the pd.read_csv(PATH_HERE)

#load in the nyc flight data
df_flights = pd.read_csv('/content/nyc_flight_data.csv') 

Option 2 - load from your google drive

  1. mount your google drive

#mount your google drive
from google.colab import drive
drive.mount('/content/gdrive')
  1. Use the files tab on the left to find the file you like to load

  2. Right click on the file and copy the path

  3. Paste the path into the pd.read_csv()

##load in the nyc flight data 
df_flights = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DataScience/dataDSI/nyc_flight_data.csv') 

7.2. Exploring the dataframe

Now that we have the dataset loaded, let’s take a look at it. Let’s first look at it’s shape.

df_flights.shape

Here we can get a quick look the dataframe using the head method.

df_flights.head(5)

Or if you like a little more options you can use the following to help visualize the table better.

%load_ext google.colab.data_table

Now try looking at your dataframe again.

df_flights.head(5)

We can also see what kinds of variables are in the dataframe by using the dtype method:

#find the data types
df_flights.dtypes

We can get a quick description of the numeric data using the describe method:

df_flights.describe()

We can also take a look at the dataframe by double clicking on the nycflights.csv file in the files tab on the left. This gives us a more dynamic view as we can use filter to explore the data a little more. Try filtering for carrier: e.g., AA

7.3. Manipulate the dataframe

Let’s now use your new found skills to answer some questions!

Make a dataframe with only carrier, arr_delay, dep_delay, and air_time as columns.

#keep only the a subset of columns
df_flights_sub = df_flights.loc[? ,?  ]  #Warning: the column names will be case sensitive!

#take a look at the new dataframe
?

Make a dataframe of all flights that have an arrival delay of greater than 60min

#keep only the rows with arr_delay of greater than 60 minutes
df_flights_delayed60 = df_flights[?]

#What is the shape of this new dataframe
?

Can you also calculate how many flights arrived earlier than expected (i.e., where the plane arrived before the scheduled arrival time).

#keep only the rows with arr_delay less than 0 minutes (i.e., negative delay!)

#what is the shape of this new dataframe

Finally can you express these numbers as percentages?

#percentage of flights delayed more than 60mins, i.e., number delayed rows / number total rows
#percentage of flights arriving before scheduled time, i.e., number early rows / number total rows

Post to slack once you’ve got an answer!

7.4. Further reading

The many ways to work with data in colab.

If you would like the notebook without missing code check out the full code version.