Using existing data
Contents
7. Using existing data¶
In this exercise we will start working with some real data. We will first learn how to load in data, then test out what you’ve learnt about DataFrames.
Outline:
Load in data
Explore the dataframe
Manipulate the dataframe
#Load in pandas
import pandas as pd
7.1. Loading data¶
There are many ways to load in datasets, here we’ll look at two methods.
But first let’s get the data:
Go to the shared data folder for the class and download nyc_flight_data.csv
Let’s then load in the flights data using a function from pandas called read_csv.
Option 1 - load from temporary folder (easier but not permanent).
Drag and drop the data into the files tab on the left.
Then right click and copy the path.
Then paste the path into the pd.read_csv(PATH_HERE)
#load in the nyc flight data
df_flights = pd.read_csv('/content/nyc_flight_data.csv')
Option 2 - load from your google drive
mount your google drive
#mount your google drive
from google.colab import drive
drive.mount('/content/gdrive')
Use the files tab on the left to find the file you like to load
Right click on the file and copy the path
Paste the path into the pd.read_csv()
##load in the nyc flight data
df_flights = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/DataScience/dataDSI/nyc_flight_data.csv')
7.2. Exploring the dataframe¶
Now that we have the dataset loaded, let’s take a look at it. Let’s first look at it’s shape.
df_flights.shape
Here we can get a quick look the dataframe using the head method.
df_flights.head(5)
Or if you like a little more options you can use the following to help visualize the table better.
%load_ext google.colab.data_table
Now try looking at your dataframe again.
df_flights.head(5)
We can also see what kinds of variables are in the dataframe by using the dtype method:
#find the data types
df_flights.dtypes
We can get a quick description of the numeric data using the describe method:
df_flights.describe()
We can also take a look at the dataframe by double clicking on the nycflights.csv file in the files tab on the left. This gives us a more dynamic view as we can use filter to explore the data a little more. Try filtering for carrier: e.g., AA
7.3. Manipulate the dataframe¶
Let’s now use your new found skills to answer some questions!
Make a dataframe with only carrier, arr_delay, dep_delay, and air_time as columns.
#keep only the a subset of columns
df_flights_sub = df_flights.loc[? ,? ] #Warning: the column names will be case sensitive!
#take a look at the new dataframe
?
Make a dataframe of all flights that have an arrival delay of greater than 60min
#keep only the rows with arr_delay of greater than 60 minutes
df_flights_delayed60 = df_flights[?]
#What is the shape of this new dataframe
?
Can you also calculate how many flights arrived earlier than expected (i.e., where the plane arrived before the scheduled arrival time).
#keep only the rows with arr_delay less than 0 minutes (i.e., negative delay!)
#what is the shape of this new dataframe
Finally can you express these numbers as percentages?
#percentage of flights delayed more than 60mins, i.e., number delayed rows / number total rows
#percentage of flights arriving before scheduled time, i.e., number early rows / number total rows
Post to slack once you’ve got an answer!
7.4. Further reading¶
The many ways to work with data in colab.
If you would like the notebook without missing code check out the full code version.